Troubleshooting Slow Restoration?

Hi @darkan !

I’m the author of the reworked restore flow. I agree that 2 TB over 3 days (~7 MB/s), 1 TB over half a day (~23 MB/s), and 1 TB over 14 hours (~19 MB/s) all seem quite slow for a local restore to SSDs.

Which version of Duplicati are you using? The current stable release (2.2.0.3) has an issue with cache management that leads to volumes being downloaded multiple times (see this discussion). The default cache size is 100 volumes (dblocks), which I think is why you see better performance with larger volume sizes: with the default 50 MB volumes the cache holds 5 GB, while 1 GB volumes give a 100 GB cache. You can tune this with --restore-volume-cache-hint (e.g. --restore-volume-cache-hint=200GB sets the cache hint to 200 GB). When the cap is reached, volumes are evicted from the cache on a least-recently-used basis. However, since evicted volumes may still be in use, the actual disk space used by the cache can temporarily exceed the hint until those volumes are released and can be deleted. The problem arises when files are scattered across many volumes and the restore order is unlucky: volumes get evicted prematurely and have to be downloaded again.

So my first suggestion would be to increase the --restore-volume-cache-hint value to something larger (e.g. 10 TB if you have the disk space for it).
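To put rough numbers on the cache sizing above, here's a back-of-envelope sketch in plain Python (the function name is mine, not anything from Duplicati; the figures are the defaults mentioned above):

```python
def default_cache_bytes(volume_size_bytes, default_volumes=100):
    """Approximate cache footprint with the default 100-volume cache."""
    return default_volumes * volume_size_bytes

MB, GB = 10**6, 10**9

# 50 MB volumes (the default) -> 5 GB cache
print(default_cache_bytes(50 * MB) / GB)   # -> 5.0
# 1 GB volumes -> 100 GB cache
print(default_cache_bytes(1 * GB) / GB)    # -> 100.0
```

So a larger volume size implicitly inflates the cache hint, which is consistent with larger volumes masking the eviction problem.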

Secondly, I’ll touch on the comments from @RianKellyIT :

This applies to old versions of Duplicati; the current default is 1 MB. Increasing the block size can help reduce database pressure (as rightly pointed out by @RianKellyIT: a larger block size means fewer blocks, which means fewer database operations) and may also increase hashing performance, since hash work is batched into larger blocks. The downsides of larger block sizes are that they hurt deduplication (the probability of two blocks being identical shrinks as block size grows) and add overhead for files smaller than the block size (e.g. a 500 KB file will still take up 1 MB of space in the backup if the block size is 1 MB). So there is a tradeoff to consider here.
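A toy illustration of that tradeoff (a minimal sketch; the helper is my own, and the per-file overhead assumes a whole block is consumed, as in the 500 KB example above):

```python
import math

def block_count(file_size, block_size):
    """Number of blocks needed to store a file at a given block size."""
    return math.ceil(file_size / block_size)

KB, MB, TB = 10**3, 10**6, 10**12

# Fewer blocks with a larger block size -> fewer database operations:
print(block_count(2 * TB, 1 * MB))    # 2 TB at 1 MB blocks  -> 2000000
print(block_count(2 * TB, 10 * MB))   # 2 TB at 10 MB blocks -> 200000

# But a small file still consumes a whole block's worth of space:
print(block_count(500 * KB, 1 * MB))  # 500 KB file -> 1 block (up to 1 MB)
```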

While those statements hold true on their own, I’d argue they aren’t the cause of the slow restore performance. The restore operation mostly reads from the database and uses parallel connections for its queries. In this blog post I did some benchmarks on SQLite performance, where even the machine with the slowest single-core performance (a 1950X) could reach >250 KOps for select queries on a 10-million-row table. If we assume 10x worse performance, that’s still 25 KOps, which for 2 TB at a 1 MB block size (2 million blocks) would mean about 80 seconds of database time, negligible compared to a 3-day restore.
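The arithmetic behind that estimate, as a quick sketch (my own helper, assuming one select per block as a rough lower bound):

```python
def db_time_seconds(data_bytes, block_size_bytes, ops_per_second):
    """Rough lower bound on database time: one select query per block."""
    blocks = data_bytes // block_size_bytes
    return blocks / ops_per_second

TB, MB = 10**12, 10**6

# 2 TB at 1 MB blocks = 2 million blocks; 25 KOps is the 10x-pessimistic rate.
print(db_time_seconds(2 * TB, 1 * MB, 25_000))   # -> 80.0 seconds
```

Even at the pessimistic rate, the database accounts for under two minutes of a multi-day restore.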

I doubt that encryption is the bottleneck here since most CPUs carry hardware acceleration for AES. However, I may be wrong and it’s worth testing!

As rightly pointed out, the new restore flow uses parallelism for the files being restored, the blocks being decompressed, the volumes being downloaded, and the volumes being decrypted. The degree of parallelism for each step can be tuned, with the default being number_of_processor_cores / 2 for each. You can try the old restore flow using --restore-legacy=true. The legacy flow has the advantage that it only ever downloads and keeps one volume at a time, which is beneficial for low-disk-space or low-memory environments.

I’d argue that this is no longer a good recommendation for the new restore flow. For one, larger volumes reduce the parallelism that the volume downloaders, decryptors, and decompressors can leverage: there are fewer volumes to work with, and a single volume (as it stands) cannot be processed in parallel. Secondly, larger volumes require a larger cache to be effective. Thirdly, if a volume is prematurely evicted from the cache, downloading it again costs more than with smaller volumes. The only real benefits that remain for larger volumes are that they may compress better (since compression is only applied within a single volume) and that the metadata overhead (e.g. zip headers) is reduced.
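To make the parallelism point concrete, a small sketch of how many volumes the downloaders have to work with at each volume size (the helper is mine, for illustration only):

```python
import math

def volume_count(backup_bytes, volume_size_bytes):
    """How many remote volumes a backup of a given size produces."""
    return math.ceil(backup_bytes / volume_size_bytes)

TB, GB, MB = 10**12, 10**9, 10**6

# More, smaller volumes give the parallel downloaders/decryptors more units
# of work, and re-downloading a single evicted volume costs less.
print(volume_count(2 * TB, 50 * MB))   # 50 MB default -> 40000 volumes
print(volume_count(2 * TB, 1 * GB))    # 1 GB volumes  -> 2000 volumes
```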

This is quite concerning, and I understand why you would want to avoid it. I think it points back to the cache management issue. Even in the good case, a restore incurs at least 2 writes of each volume (one for the initial download and one for the decrypted volume), but with the cache management issue some volumes may be downloaded and written multiple times, which can lead to excessive wear on the SSD. So again, I would suggest increasing the --restore-volume-cache-hint value until the cache management issue is resolved in a future release. It has been fixed and merged in this pull request, but it hasn’t made it into a release yet.
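As a rough sketch of the write amplification involved (my own estimate function; the factor of 2 is the download + decrypted copy per volume described above, and the re-download factor is a hypothetical knob for eviction-induced repeats):

```python
def min_ssd_writes_bytes(restore_bytes, writes_per_volume=2, redownload_factor=1.0):
    """Lower bound on SSD bytes written during a restore: each volume is
    written twice (downloaded + decrypted), inflated by any re-downloads."""
    return restore_bytes * writes_per_volume * redownload_factor

TB = 10**12

# Best case for a 2 TB restore: 4 TB written to the SSD.
print(min_ssd_writes_bytes(2 * TB) / TB)                          # -> 4.0
# If cache thrashing forces 50% of the volumes to be fetched again: 6 TB.
print(min_ssd_writes_bytes(2 * TB, redownload_factor=1.5) / TB)   # -> 6.0
```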

As a final note, if you want more insight into what’s taking time during the restore operation, you can run with --internal-profiling=true and --log-level=profiling; you’ll then see the internal timers of each of the processes within the flow.