Backup upload speed

gpatel-fr · August 19, 2023, 4:30pm

Warning, very long post ahead !

In the following, b == BIT, B == BYTE (8 times more) so MB = Megabyte, Gb = Gigabit and so on.

This how-to is the result of first reading the Duplicati code and understanding (parts of) it, and then doing experiments.
Most of these experiments were done on Windows servers.
The case considered is the initial backup, when you are faced with uploading a lot of existing data.
This is quite frequently mentioned in posts on this forum, which talk about days if not weeks of backup even with ‘fast’ links. (If your network is slow, this post is not relevant: you will have trouble finding a system slow enough to make an ADSL link, with 1 Mb/s upload, wait for Duplicati.)

To understand how to get decent performance out of Duplicati, it is necessary to have a good high-level idea of how it works and what the design constraints are.

A naive idea is to compare the disk read performance of the computer with the network speed of the backend and expect Duplicati to keep up with both.

Duplicati is NOT reading files and sending them to the backend.

To back up data, Duplicati reads files, then WRITES temporary compressed files, encrypts these files, and sends the resulting files to the backend.
For each block of data there are quite a few database operations, all of them needed to manage deduplication. This can’t be worked around, or it would not be Duplicati but another product.

This is quite some work and many problems to solve - there is no silver bullet.

Basic compression speed

First of all, compressing is slow. The library used (SharpCompress) is not multithreaded, so a single core does all the processing for a block of data.
The default compression used by Duplicati is ‘best compression’ (those are the words used in the user interface).
What is meant by ‘best compression’ is MAXIMUM compression, that is, 9 on a scale of 0 (no compression) to 9.
It is necessary here to recall a peculiarity of Zip compression: maximum compression is VERY expensive in terms of Cpu.
Basically, to get 5% more compression compared to level 1 (minimal compression), you have to throw a lot of Cpu power at it.
This can seriously harm performance.
How much? All other things being equal, I have seen the following results:

For a source backup size of 28 GB:
compression 0: 24.3 GB ‘compressed’ in typically 11’20"
compression 1: 12.4 GB compressed in typically 11’45"
compression 9: 11.97 GB compressed in typically 38’

Remember that 9 is the default level (“bestcompression”) when the compression level is NOT set.
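
To make the trade-off easier to read, here is a small sketch that turns the three measurements above into output ratios and source throughput. It only restates the numbers quoted in this post; decimal units (1 GB = 1000 MB) are an assumption.

```python
# Figures quoted above for a 28 GB source (decimal GB assumed).
SOURCE_GB = 28.0
RUNS = {
    0: {"output_gb": 24.3, "seconds": 11 * 60 + 20},
    1: {"output_gb": 12.4, "seconds": 11 * 60 + 45},
    9: {"output_gb": 11.97, "seconds": 38 * 60},
}


def summarize(level):
    """Return (output size / source size, source MB processed per second)."""
    run = RUNS[level]
    ratio = run["output_gb"] / SOURCE_GB
    source_mb_per_s = SOURCE_GB * 1000 / run["seconds"]
    return ratio, source_mb_per_s


for level in sorted(RUNS):
    ratio, rate = summarize(level)
    print(f"level {level}: output is {ratio:.1%} of source, {rate:.1f} MB/s of source data")

# Extra space saved by level 9 over level 1, as a fraction of the source size.
extra_saving = (RUNS[1]["output_gb"] - RUNS[9]["output_gb"]) / SOURCE_GB
# How many times longer level 9 took than level 1.
slowdown = RUNS[9]["seconds"] / RUNS[1]["seconds"]
print(f"level 9 saves {extra_saving:.1%} more of the source size and takes {slowdown:.1f}x as long")
```

On these figures, level 9 takes a bit more than 3 times as long as level 1 for about 1.5% of the source size in extra savings.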

I have written a small utility to test the compression of one file with the SharpCompress library.
Here are the typical results compressing the ‘busybox’ Linux utility, a file of about 2 MB.

compression 0: 150ms; result: same size
compression 1: 350ms; result: 52% size
compression 9: 800ms (Linux) / 1000ms (Windows); result: 49% size

This utility let me easily isolate SharpCompress performance with any file.
Results depend on how easily the file compresses.
With a file that compresses badly (verified with other tools such as the zip utility and the Windows File Explorer), the speed difference between levels 1 and 9 is very small (10% instead of 100%).
So the better the files compress, the higher the speed advantage of compressing them less.
Maybe this explains why this has not been the subject of a lot of posts here, along with the fact that there are other blockers (see below).
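
For readers who want to try this kind of measurement themselves, here is a minimal sketch. It is not my utility and it does not use SharpCompress: it assumes Python's zlib module (also Deflate) as a stand-in, with synthetic test data, so absolute timings will differ. It only illustrates how level and compressibility interact.

```python
import os
import time
import zlib


def measure(data, level):
    """Compress data at the given Deflate level; return (compressed size, seconds)."""
    start = time.perf_counter()
    packed = zlib.compress(data, level)
    elapsed = time.perf_counter() - start
    return len(packed), elapsed


# About 2 MB of very repetitive text (compresses well) and the same amount
# of random bytes (does not compress at all). Both are synthetic stand-ins.
compressible = b"Duplicati backs up blocks of data. " * 60_000
incompressible = os.urandom(len(compressible))

for name, data in (("compressible", compressible), ("incompressible", incompressible)):
    for level in (0, 1, 9):
        size, elapsed = measure(data, level)
        print(f"{name:>14} level {level}: {size / len(data):6.1%} of original "
              f"in {elapsed * 1000:7.1f} ms")
```

Replacing the synthetic data with the content of a real file (for example `open(path, "rb").read()`) gives numbers closer to what matters for a given backup.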

I am seriously considering changing the default compression level to 1. I can’t fathom why this ‘Best’ compression is the default.
If you disagree and consider that winning 3% in compressed size is worth a 100% penalty in performance, you can argue here.

Anyway, if you want to upload data in the Terabyte range, you had better set the compression level to 1 and review your files against the default list of extensions that Duplicati does not compress. If you have a bunch of files in the GB range that you know will compress very badly and whose extensions are not in the default Duplicati exclusion list, you can get a big speed advantage by adding them to the list. No need to burn Cpu and wait for hours for very small space wins. Unfortunately, the extension list can’t handle fine-tuned rules such as not compressing one particular backup.bak: a .bak file can be a backup made by a text editor (compresses fine) or a backup file from SQL Server (does not compress well at all).
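
The limitation of an extension-based rule can be sketched as follows. This is an illustration only, not Duplicati code: the extension set and the function name are hypothetical, and the real default list ships with Duplicati.

```python
from pathlib import PurePath

# Hypothetical skip list for illustration; not the list shipped with Duplicati.
NO_COMPRESS_EXTENSIONS = {".zip", ".7z", ".jpg", ".mp4", ".bak"}


def compression_level_for(path, default_level=1):
    """Pick level 0 (store only) for listed extensions, else the default level."""
    extension = PurePath(path).suffix.lower()
    return 0 if extension in NO_COMPRESS_EXTENSIONS else default_level


print(compression_level_for("docs/notes.txt"))       # compressed at the default level
print(compression_level_for("sql/backup.bak"))       # stored without compression
print(compression_level_for("editor/readme.bak"))    # also stored: same extension
```

Because the decision only looks at the extension, the text editor's .bak and the SQL Server .bak necessarily get the same treatment.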

Remember that current Duplicati has quite a few blockers, and removing the 100% performance penalty of excessive compression will not magically make the overall speed faster. It’s true that compression speed is the basis for everything, and if your Cpu is slow at compression you will never get good performance, but overall performance is hindered by many other things.

To wring decent performance out of Duplicati there are other brakes to remove.
Read on.

Encryption

A simple idea is to remove the encryption. If you don’t encrypt files, this will save some time!
Basically correct, but it will not necessarily give great results.
Always remember the basic Duplicati backup scheme: read files, compress Zip files, send to the network.
We can’t compress any given Zip file in parallel, it’s sequential by nature (it’s not possible to compress the end of the file before the beginning).
We can’t compress a block in parallel, because the Zip library used can’t do that.
To get decent compression performance, we NEED to create several Zip files in parallel. This is slow because Zip files are big (50 MB by default).
When the encryption step is removed, no data at all can be sent to the network link by the backend while the Zip file(s) are being created, because it does not yet have a file to send.
While a Zip file is being sent to the network (which is what happens if encryption is disabled), no data can be written to the corresponding Zip buffer, and we have a limited number of them (by default half the number of cores).
The time saved by not encrypting files is offset by the parallelism lost by not writing to buffers while sending them.
How much depends on the parallelism that can be used. The more cores available, the more removing encryption will be a penalty.
If there are only a few cores, removing encryption could be a win, but not so much when you have a lot of cores.
Also, on most recent processors (say less than 10 years old…) AES encryption is hardware assisted.
Zip compression is not, so compression very much dominates Duplicati performance.
Removing encryption is not something that I have found very useful. And encryption is also a big argument in favour of Duplicati.

Disk contention

The next step depends on the hardware. Does the computer suffer from write contention, that is, does it use classic hard disks rather than SSDs?
With rotating hard disks, writing several Zip files concurrently will be BAD. These are big files, so it’s usually beyond the small capacity of hardware Raid controllers.
Fortunately, on servers there can be a way of hacking better performance: Ram disks. A Ram disk of say 1 GB will be enough to hold a few buffers of 50 MB while they are written, encrypted and sent to the network.
If you have say 10 vcpu with 8 buffers of 50 MB, you will not fill the Ram disk even with encryption (which can double the space used), and it will solve all contention problems.
Remember that the big problem is the initial dump of all the data to the backend; a Ram disk will not be necessary for regular backups.
If a Windows server is the platform, creating the Ram disk could be automated in a before-job script, and removing it in a script run after the backup.
Then it’s possible to set TMP=V:\ (if V: is the Ram disk) for the server process (in the service environment), or to use tempdir in the advanced job options.
Beware: if it’s an environment variable, unset it once the Ram disk no longer exists. If you don’t, things will seem to work but cause some bad problems later (notably, importing a job will fail without any error message).
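
The sizing argument for the Ram disk is simple arithmetic, sketched here. The function name is made up for illustration, and the factor of 2 for encryption is the worst case mentioned above (the Zip file and its encrypted copy both present at the same time).

```python
def ramdisk_need_mb(buffers, volume_mb=50, encrypted=True):
    """Worst-case temporary space if every buffer is full at the same time."""
    copies_per_volume = 2 if encrypted else 1
    return buffers * volume_mb * copies_per_volume


RAMDISK_MB = 1000  # a 1 GB Ram disk, in decimal MB

need = ramdisk_need_mb(8)
print(f"8 buffers of 50 MB with encryption: {need} MB needed, {RAMDISK_MB} MB available")
```

With 8 buffers this gives 800 MB, which fits in the 1 GB Ram disk. If the database is also placed on the Ram disk, its size has to be added on top.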

Database

The next step is database performance. As before, classic hard disks dominate the picture.
If rotating hard disks are used for the database, having several database inserts and checks for each block can be a real issue.
For a one-time initial backup, putting the database on a Ram disk could be a way out, if there are several gigabytes of Ram available of course.
Change the database location before starting the initial backup, and when finished, replace the path with the normal location and click the Change database path button.

As writes to the database can’t be parallelized (this time it’s Sqlite’s fault), better database performance is a must if maximum performance is wanted.
Insertion speed slowing down as the database grows is a well-known Sqlite problem. What I have found is that it is a much bigger problem with classic hard disks: the slowdown sets in much sooner.
The symptom is a speed that drops quite rapidly to a crawl (3 MB/s…).

So for the last 2 steps (disk contention and database), using a Ram disk is very useful with classic hard disks, but less so with SSDs, especially high-performance ones.
If you have rotating disks but not enough Ram to use a Ram disk for your initial backup, you are in a sad situation.
Remember that the database size is linked to the block size (bigger block size, smaller database).
Maybe adding a USB 3.1 SSD for the database could be better than trying to manage the initial backup all on a classic hard disk (I don’t know, I did not try it).
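
To give an idea of why the block size matters for the database, here is a rough count of the blocks that have to be tracked. It is a sketch under assumptions: a 100 KB block size is assumed for comparison with the 400 KB I ended up using, 1 KB is taken as 1024 bytes, and deduplication (which lowers the number of distinct blocks) is ignored.

```python
def block_count(source_bytes, block_size_bytes):
    """Number of blocks needed to cover the source (ceiling division)."""
    return -(-source_bytes // block_size_bytes)


SOURCE_BYTES = 1_700_000_000_000  # 1.7 TB, decimal

for block_kb in (100, 400):
    blocks = block_count(SOURCE_BYTES, block_kb * 1024)
    print(f"block size {block_kb} KB: about {blocks / 1e6:.1f} million blocks")
```

Four times bigger blocks means about four times fewer blocks, and so about four times fewer inserts and lookups in the database.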

Parallel compression

Next step involves a code change in Duplicati.
Currently everything necessary is done to allow high parallelism in writing Zip temp files, encrypting them and sending the result to the backend and the network.
The only thing that is currently NOT parallelized is … actually reading files.
More specifically current Duplicati is only reading ONE file at a time.
Even for rotating hard drives this is limiting performance.
The basic process is that Duplicati reads a file, one block at a time, then sends each block to the compress stage.
At the compress stage several cores are stomping their feet, waiting impatiently to compress blocks.
The thing is, the whole bunch of them only receives ONE block at a time.
So these compressors are twiddling their thumbs doing nothing; only ONE of them will be allowed to handle a block.
When a block has been generated by the file reader, it is sent to the first available compressor.
This lucky compressor does its thing and writes its block to the first available Zip file buffer.
Then the signal is given to the file reader to read the next block.
In short, all this superb parallelism down the line (compressing, encrypting, sending to the network) is basically useless.
Only ONE core is used at a time to upload backups, and Duplicati is basically stuttering along in a stop-and-go, single-threaded process.
Adding more buffers to the rest of the pipeline just spreads the writing of the current file over more block files.
This is not visible in the system monitor, because it’s not necessarily the same core that is used for each block, so you won’t see one core burning at 100% while the others are waiting.
All that can be seen is a lot of cores doing very, very little work.
Impact on performance:
all other things being equal, for the same backup I gave compression numbers for (see above for details), restricting Duplicati to reading one file at a time can inflate backup time from 11’45" to 24’ on an 8-core system. So while it’s not quite as bad as excessive compression, it is nonetheless a serious limitation.
Once file reading is parallelized, adding more cores to a VM can also boost performance (it does not do much otherwise).
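
The shape of the pipeline can be sketched with a toy model. This is not Duplicati code (Duplicati is written in C#): it assumes Python threads, zlib as the compressor, in-memory byte strings standing in for files, and made-up names throughout. The point is only to show where the number of file readers sits relative to the pool of compressors.

```python
import queue
import threading
import zlib


def run_pipeline(files, file_processors=1, compressors=4, block_size=100 * 1024):
    """Toy model: reader threads cut files into blocks, compressor threads pack them."""
    file_queue = queue.Queue()
    for data in files:
        file_queue.put(data)
    # Bounded queue: readers wait when the compressors cannot keep up.
    block_queue = queue.Queue(maxsize=compressors * 2)
    lock = threading.Lock()
    stats = {"blocks": 0, "compressed_bytes": 0}

    def read_files():
        while True:
            try:
                data = file_queue.get_nowait()
            except queue.Empty:
                return  # no file left for this reader
            for offset in range(0, len(data), block_size):
                block_queue.put(data[offset:offset + block_size])

    def compress_blocks():
        while True:
            block = block_queue.get()
            if block is None:
                return  # sentinel: no more blocks will arrive
            packed = zlib.compress(block, 1)
            with lock:
                stats["blocks"] += 1
                stats["compressed_bytes"] += len(packed)

    readers = [threading.Thread(target=read_files) for _ in range(file_processors)]
    workers = [threading.Thread(target=compress_blocks) for _ in range(compressors)]
    for thread in readers + workers:
        thread.start()
    for thread in readers:
        thread.join()
    for _ in workers:
        block_queue.put(None)  # one sentinel per compressor
    for thread in workers:
        thread.join()
    return stats


sample_files = [bytes(300 * 1024)] * 3  # three dummy 300 KB "files"
print(run_pipeline(sample_files, file_processors=1))
print(run_pipeline(sample_files, file_processors=3))
```

Both runs produce the same blocks; what changes with more file processors is how many blocks can be in flight at once. The toy model does not reproduce Duplicati's actual hand-off between reader and compressors, so it should not be used to predict timings.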

So I have added a concurrency-fileprocessors option to Duplicati.
See PR #5010, “add simultaneous file processors” by gpatel-fr, in duplicati/duplicati on GitHub.

The price is that backup activity is not exactly reflected in the Web UI. The Web UI displays the file currently being backed up.
If several files are backed up concurrently, this is not a meaningful display. It’s a matter of taste, but when uploading TB of data, bad taste becomes relative IMO.
I have not looked at the code for this in Duplicati. What seems to happen is that one file is displayed, and if other files are backed up in the meantime, this happens behind Duplicati’s back: nothing is displayed, but the file count at the top of the screen changes.
This is a bit of a mess, but if one prefers bad performance it’s always possible to disable this by setting concurrency-fileprocessors to 1.

Network

The final step is… the backend.
Usually a Duplicati with default tuning generates data at a snail’s pace. Even a better-configured Duplicati will struggle to saturate a 1 Gb link.
Normally, when looking at the Ram disk where the buffers (temporary files) are stored, if say 9 buffers are set up in Duplicati, one sees 9 files with a length of 0 under Windows (this is not the case under Linux, where the size is updated in real time).
That doesn’t mean that they are empty; in fact they are being written to, but as Duplicati is filling them all simultaneously, it can take some time before one is full.
Then, as soon as one is finished, it is copied to an AES file and sent to the network, it disappears, and one comes back to the usual display of N files with 0 length.
If Duplicati has been finely tuned, the system has an 80% load and buffers are filled at a furious pace (furious compared to the default one, obviously).

However, the reverse problem can happen: all buffers are filled up. Permanently or it seems so.
There is no change in the Ram disk directory for seconds or even tens of seconds.
This is the blocked backend case.

It happened to me for the biggest test I have done so far.
10 processors VM with 16 GB Ram, classic hard disks (not especially fast), a Gigabit link to a Qnap Nas.
I included the patch mentioned in the parallel compression section and set up a 6 GB Ram disk for the database and temporary files (set to 9 simultaneous).
Encryption enabled, Tls enabled.

Backends: I used the alternate FTP backend (with the latest patches from Taz-il and even some of my own added on top).
I solved the blocking (in a hackish way) by forcing the aftp-connection-type to EPSV. Then I was able to get reliable transfers. It was a stroke of luck finding this out.
Ftp is fast but has quite a few compatibility issues.
I did not have any such problem with the Webdav backend.

End result

On this system (the 10 processors one), with default Duplicati and a Ram disk for only the Zip buffer files, I was getting about 6 MB/s transfer speed, slowing to 3 MB/s (and slowing fast: after uploading about 30 GB).
This resulted in about 90 GB uploaded after 3h20. I stopped it then.
I never seriously attempted the initial backup of 800 GB I was targeting at first, because it was so slow.
After solving the file parallelization and changing the compression level I was beginning to get more serious speed.
I ‘solved’ the Ftp backend problem, put the database on the Ram disk, used a Wal file and raised the cache_size for Sqlite, and raised the block size to 400 KB to limit the space used on the Ram disk.
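
For reference, the two Sqlite settings named above look like this when applied to an ordinary Sqlite database. The sketch uses Python's sqlite3 module on a scratch file; it only shows what the pragmas are, not how they were applied to the Duplicati job database. The cache value is an arbitrary example.

```python
import os
import sqlite3
import tempfile


def open_tuned(path, cache_kib=200_000):
    """Open a Sqlite database with a Wal journal and a larger page cache."""
    conn = sqlite3.connect(path)
    # Switch to write-ahead logging; Sqlite answers with the mode now in use.
    mode = conn.execute("PRAGMA journal_mode=WAL").fetchone()[0]
    # A negative cache_size is a size in KiB rather than a number of pages.
    conn.execute(f"PRAGMA cache_size=-{cache_kib}")
    cache = conn.execute("PRAGMA cache_size").fetchone()[0]
    return conn, mode, cache


with tempfile.TemporaryDirectory() as folder:
    conn, mode, cache = open_tuned(os.path.join(folder, "demo.sqlite"))
    print("journal mode:", mode, "| cache_size:", cache)
    conn.close()
```

Note that cache_size applies to the connection that sets it, so it has to be set each time the database is opened.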

I then did a new backup of 1.7 TB (compressed to 1 TB) in 6h30 / 7h depending on the backend (6h30 with Webdav, 7h with alternate FTP).

So Duplicati with naive optimization => 30GB/hour (and slowing), final optimization => 250GB/hour
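
The hourly rates above come from the following arithmetic, sketched here with decimal units (1 TB = 1000 GB). The 800 GB estimate assumes a constant rate, which is optimistic for the untuned case since it was still slowing down.

```python
def gb_per_hour(gigabytes, hours):
    """Average throughput in GB per hour."""
    return gigabytes / hours


untuned = gb_per_hour(90, 3 + 20 / 60)    # 90 GB in 3h20
tuned_webdav = gb_per_hour(1700, 6.5)     # 1.7 TB in 6h30
tuned_ftp = gb_per_hour(1700, 7.0)        # 1.7 TB in 7h

for label, rate in (("untuned", untuned), ("tuned, Webdav", tuned_webdav),
                    ("tuned, alternate FTP", tuned_ftp)):
    hours_for_800 = 800 / rate
    print(f"{label}: {rate:.0f} GB/hour, 800 GB would take about {hours_for_800:.1f} h")
```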

Possible (or not) further enhancements:

Note about an idea that only looks good: setting the advanced option use-block-cache. This option has had no effect for years (although I have found it advised in a much later post on this forum).
I have found no explanation for this removal. A possibility could be that it was using too much memory for big backups.

Another idea that only looks good: using 7-zip to get better compression. Well, it is said somewhere that this module has ‘issues’.
It has more than that: it does not work at all for me (it crashes immediately). I found that it was broken around 2019 and nobody has complained; if it were a good way to fix performance problems, someone would have raised hell, I think.
I have tried to update the managed-lzma driver and failed miserably so far. Not sure it’s worth the bother since it has not been touched since 2017. The SevenZipSharp project now has an incompatible license, so getting an updated solution seems very complicated.

It is also said in the SharpCompress doc (or maybe in an issue) that using the thing that used to be named ‘netcore’ could lead to better compression speed because it could use hardware assistance. That could be a net win from migrating to .NET 7 (or 8). However, that is not an immediate possibility, and I have no idea how real the performance win would be in this case.

It’s true that better single-thread compression speed could be significant when backing up a single huge file: in this case there is no way to use multithreading with SharpCompress, so a library that uses several cores for a single compress operation, or that has better raw performance, could be really important.
Another possibility would be to call a native library when possible. The ‘zip’ utility under Linux can be 2 times faster than SharpCompress. Using more modern algorithms could get better results.
Unfortunately the SharpCompress library is not moving toward better performance.

There is currently no indication of the backup speed (MB/s) in Duplicati reporting. However, I think that it could be calculated and reported, and it would be interesting information to have.
I have absolutely no insight into how the Duplicati reporting works, so it’s wishful thinking…

Final note: looking at the Duplicati speed indicator can be very misleading

The worst of it is that it gives only the raw network speed. Compression is not taken into account. So if you have good compression, Duplicati will struggle to feed the network and you will get a rather low speed displayed, while Duplicati is really handling data much faster than it shows: you will see, say, a disappointing 20 MB/s while the effective transfer rate is 60 MB/s.
Beside the speed, Duplicati shows you the remaining data to back up: this is the data size before compression.
So the 2 pieces of information given in the Duplicati status don’t correlate tightly.
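
To relate the two figures, the displayed network speed has to be divided by the compression ratio. A small sketch, with made-up function names, and assuming the compressed output is one third of the source size (which is what the 20 MB/s versus 60 MB/s example implies):

```python
def effective_source_rate(network_mb_s, compressed_fraction):
    """Source data handled per second, given upload speed and output/source size ratio."""
    return network_mb_s / compressed_fraction


def remaining_seconds(remaining_source_mb, network_mb_s, compressed_fraction):
    """Rough time left, using the uncompressed 'remaining' figure shown by Duplicati."""
    return remaining_source_mb / effective_source_rate(network_mb_s, compressed_fraction)


rate = effective_source_rate(20, 1 / 3)
print(f"displayed 20 MB/s, effective about {rate:.0f} MB/s of source data")
print(f"600 GB remaining: about {remaining_seconds(600_000, 20, 1 / 3) / 3600:.1f} h")
```

Dividing the remaining size directly by the displayed speed would overestimate the time left by the same factor of three.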

Jojo-1000 · August 19, 2023, 6:01pm

About the ramdisk for temp files:
I played around a bit with replacing temp files for encryption/compression/upload with in-memory buffers. In most cases this is possible, except when the backend only supports file operations, or streaming is disabled. In my tests, I didn’t see the speedup I expected (compared to SSD), but maybe that would help on systems with a hard disk and enough ram.
The only struggle is deciding how big these buffers can be, to prevent running out of memory on low-ram systems.

If you think this is worth pursuing, I can do more in that direction. It would obviously require lots of testing that nothing breaks.

It was also suggested on GitHub to use Zstd as a more efficient compression algorithm, but that would require integrating a new compression library.

gpatel-fr · August 19, 2023, 6:33pm

Yes, exactly, I think that SSD begins to be a problem when backups are huge in modern terms (say, 200 TB), for more limited loads it’s not worth the bother IMO.

Not by me, certainly; I’d easily find a lot of things more worth doing. If speed is good enough for 5/10 TB it’s certainly good enough for this year at least. If the simplistic change for parallel file reading really works correctly, IMO it’s quite enough for a limited performance enhancement. That said, speed problems can sometimes become a reliability problem when they are especially bad (db rebuilding comes to mind). But for the initial upload, if weeks could be changed to days, I would be very happy with that in 2023.

The compression library is just that; it does not manage archives. There is a quite hideous amount of work here, absolutely insane for a generic backup project.
And well, SharpCompress seems to be stuck here: it can only read zstd zip files, and the PR enabling write is not pretty to watch. I have no hope it will be solved any time soon.
I have given thought to using 7-zip, since there is a project that enables zstd or even lz4 with this format; however SevenZipSharp has switched to GPL 3.0 so we can’t use it, and the original project is dead. Too bad, I compared performance with my C# test compressor and sigh… It would be a game changer (400% with lz4). The Microsoft library (System.IO.Compression) has Brotli support, which is supposed to be vastly faster than deflate even if it does not match more modern libraries, but I have not looked at whether it’s worth it (I am not even sure that the archive support is comprehensive enough).
Also, using Brotli in the zip format is not ‘standard’ (as defined by PKWARE).

By and large I don’t see a compelling reason to make an effort enhancing compression, unless a new contributor were already an expert (I am not holding my breath).