How can I minimize time spent verifying large amounts of files in a backup?

keboose · December 17, 2017, 5:31am

I started using Duplicati not too long ago, and I’ve been loving the granularity of the settings so far. My backup scheme is as follows: I send all my backups to a USB hard drive connected to my PC, and in the background WinSCP mirrors the files to a remote computer via SFTP.

One of the backups I have set up is of all my installed video games (mostly Steam library.) Game folders usually contain a mixture of very large (>1GB) texture files and numerous small files (some games use hundreds of files less than 1MB in size), so I kept a small block size of 50KB so that the large number of small files wouldn’t take up too much extra space. The backup runs once a week.

My issue is that the backup files now total nearly 2TB, and every time the backup runs, it takes more than a day to complete the step “Verifying backend data.” The last time the backup ran, it clocked just shy of 19 hours, which is ridiculous considering that just a few weeks ago the backup would finish in 15 minutes (I haven’t made any significant changes in the data source since I started using Duplicati. Each version after the initial backup is likely less than 20GB.)

I don’t understand why the backup would suddenly take so long to verify. The control panel says I have 16 versions of the backup, and up until the 14th or so version, the average completion time was about 15 minutes. Now it is significantly longer, and I’m not sure what happened to make that the case. What can I do to reduce the time spent verifying a backup? I’m thinking I should split the backup into smaller chunks, one for big files, one for small files. Would that speed up verification in the future?

drakar2007 · December 17, 2017, 7:01am

I’m not sure I understand the justification here. Small files get packed into common zip files with all the other files (or pieces of files too big to fit in one zip file) - so really by having a 50KB volume size, you’ve caused it to take up loads and loads more space since every single block will be listed individually in the database by my understanding. I recommend not going under the 50MB block size unless for a very specific reason, and I tend to recommend (for very fast / local storage destinations) a much bigger block size, i.e. between 200 MB and 1 GB. If I were you I’d consider implementing a bigger block size and doing some cleanup - that might pare down the database to something more reasonable.

Also - do you anticipate that it’ll be directly useful to back up your installed steam games? To me these take lowest priority in backing up my system, since in a worst-case scenario I’d usually be able to reinstall from steam. I used to have my savegame files backed up, but I’m reasonably content with the cloud save feature these days.

keboose · December 17, 2017, 7:29am

I set up my backups with the assumption that block size is the division of a file into discrete chunks, like the cluster size used by Windows. Am I incorrect to use that comparison? I made the block size 50KB because I assumed all files less than the block size would take a minimum of one block of space, so having a 200MB block size would mean thousands of 200MB blocks with 5KB of data in them (this is what Windows does with cluster size, but the default is 4KB.) If that’s not the case, then I will change my settings accordingly. My volume size is 200MB, so that will probably have to be increased, as well.

I do find it useful to back up my games, especially my Steam library. You do not OWN any game purchased through Steam, you merely purchase a license to use it, which can be revoked at any time. There are already several games that I paid for that have been removed from Steam, usually due to publisher licenses expiring (e.g., in-game music licenses), so the publishers can no longer sell the games in any form. They are still downloaded and available in my library, but if I ever lose the files, I cannot download them again, and I cannot buy them anywhere else, so my only option to play those games again would be to resort to piracy.

keboose · December 17, 2017, 9:43am

I see re-reading the documentation has answered my question for me:

If a file is smaller than the chunk size, or the size is not evenly divisible by the block size, it will generate a block that is smaller than the chunk size.

It is then beneficial to have a larger block size, as the block will be limited to the actual size of the file. I will have to start my backup from scratch, but I will increase my block size to 100MB, and my volume size to 500MB. I would go larger, but the end result is still going to be sent via FTPS, and the destination PC has a limited speed cap, so I prefer smaller, more numerous files.
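To make the quoted rule concrete, here is a minimal sketch of how a file would be cut into blocks under that description. This is not Duplicati's actual code; `block_sizes` is a hypothetical helper, and the file sizes are just examples. It only illustrates that the last (or only) block is as large as the remaining data, not padded up to the block size the way a disk cluster would be.

```python
KB = 1024
MB = 1024 * KB


def block_sizes(file_size: int, block_size: int) -> list[int]:
    """Sizes of the blocks a file would be split into (hypothetical helper).

    Full blocks are exactly block_size bytes; a trailing block holds only
    the leftover bytes. An empty file yields no blocks in this sketch.
    """
    if file_size < 0 or block_size <= 0:
        raise ValueError("file_size must be >= 0 and block_size must be > 0")
    full_blocks, remainder = divmod(file_size, block_size)
    sizes = [block_size] * full_blocks
    if remainder:
        sizes.append(remainder)  # smaller than block_size, no padding
    return sizes


# A 5KB file with a 100MB block size is stored as one 5KB block.
small = block_sizes(5 * KB, 100 * MB)

# A 250MB file becomes two full 100MB blocks plus one 50MB block.
large = block_sizes(250 * MB, 100 * MB)
```

So with this model, a larger block size never inflates small files; it only changes how big the pieces of large files are.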

JonMikelV · December 18, 2017, 4:44am

Thanks for finding that description of the block size!

I think the default is 100KB, so jumping to 100MB is pretty aggressive. Remember that the block (chunk) size is the smallest change that can be stored; this means that if you have a 1 byte change in a 200MB file it will still cause a 100MB upload.

The plus side of that is Duplicati should be a bit faster as, at present, some of the database operations are a bit on the slow side - and fewer blocks mean a smaller database. Of course the drawback of that 100MB block is that if you have 1 byte changes in that 200MB file for 5 days in a row you’ll be storing 500MB of changed blocks on your destination (depending on your specific retention settings, of course).
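The arithmetic above can be sketched roughly as follows. This is a simplified model, not Duplicati's implementation: `bytes_restored` is a hypothetical helper, it ignores compression, and it assumes every changed block is new data that has to be stored again and that all five versions are retained.

```python
KB = 1024
MB = 1024 * KB


def bytes_restored(change_offsets, file_size, block_size):
    """Bytes that must be stored again when the given byte offsets change.

    Any block containing at least one changed byte counts in full.
    Hypothetical helper for illustration only.
    """
    if block_size <= 0:
        raise ValueError("block_size must be > 0")
    touched = set()
    for offset in change_offsets:
        if not 0 <= offset < file_size:
            raise ValueError("offset outside the file")
        touched.add(offset // block_size)
    # The last block of a file may be shorter than block_size.
    return sum(min(block_size, file_size - i * block_size) for i in touched)


# One changed byte in a 200MB file, 100MB blocks: a whole 100MB block.
per_day_large = bytes_restored([12345], 200 * MB, 100 * MB)

# The same change with 100KB blocks costs only 100KB.
per_day_small = bytes_restored([12345], 200 * MB, 100 * KB)

# Five days in a row, keeping every version.
print(5 * per_day_large // MB, "MB vs", 5 * per_day_small // KB, "KB")
# prints: 500 MB vs 500 KB
```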

Regarding the sudden change in backup speed, my GUESS is that it’s related to the database calls. Many databases, I believe including sqlite, adjust how they run commands based on what’s in the database - including numbers of records, amount of memory, etc. So it’s possible that when you reached a certain number of records sqlite just decided internally to stop running the code one way and switched to another, slower, way.
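For a rough feel of how many block records might be involved, here is a back-of-the-envelope sketch. It assumes about 2TB of data and simply divides by the block size, so it ignores deduplication, compression, and the fact that every small file needs at least one block of its own. `estimated_block_count` is a hypothetical helper, and the real number of rows in Duplicati's database will differ.

```python
KB = 1024
MB = 1024 * KB
TB = 1024 * 1024 * MB


def estimated_block_count(total_bytes: int, block_size: int) -> int:
    """Ceiling of total_bytes / block_size (hypothetical rough estimate)."""
    if total_bytes < 0 or block_size <= 0:
        raise ValueError("total_bytes must be >= 0 and block_size must be > 0")
    return -(-total_bytes // block_size)  # integer ceiling division


for label, size in [("50KB", 50 * KB), ("100KB", 100 * KB), ("100MB", 100 * MB)]:
    print(label, estimated_block_count(2 * TB, size))
```

Under these assumptions a 50KB block size works out to roughly 43 million blocks for 2TB, versus about 21 thousand at 100MB, which is the kind of difference that could matter for slow database operations.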

drakar2007 · December 19, 2017, 4:09am

Alas, you may have caught me once again confusing block size with volume size, which is what I was actually thinking of. Volume size is the only setting I’ve messed with; I’ve left block size at the default thus far. Sorry if I’ve contributed to the confusion some.

keboose · December 19, 2017, 5:53am

No worries, I made my own mistake by equating block size with chunk size. They are similar, but with the critical difference of chunks being a literal minimum file size, as opposed to an upper limit on each piece of a file (the block.)

That is true, but this backup I am re-doing is relatively static. It is set to remove versions older than 3 months, so at most I expect 1-2 old versions of maybe a dozen archives at any point (games don’t update often.) I expect most blocks to be sub 10MB, because of the large number of small files.

That does put the situation into perspective, though. I will re-do my other backups as well, with block sizes at 4MB (documents) and 25MB (music and video files).

JonMikelV · December 19, 2017, 6:11am

That sounds like a reasonable design - and kudos to you for having your files so well laid out / named that you can split them up like that!