Block Size for Cloud Backup

(Originally posted to a thread on local backups, but started as a new thread for cloud backups)

I have read the block sizing article and tried using the storage calculator and am still confused about what block sizes to use for cloud storage (B2 in my case). (No offense intended, kenkendk and kees-z. The article and tool are very well written. This is clearly a PEBCAK problem.) I did an analysis of the files I need to back up and here’s what I came up with:

[image: file-size distribution of the files to back up]

3/4 of the files are under 1 MB in size, with almost all the rest between 1 MB and 500 MB. I am willing to submit to the wisdom of Duplicati’s default settings for block and dblock sizes, but I am hoping one of you “pros” might see something in the file-size distribution which would steer me in another direction.
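In case anyone wants to run the same kind of tally on their own data, a rough Python sketch along these lines will do it (the source path and the bucket boundaries are just examples, not anything Duplicati-specific):

```python
import os
from pathlib import Path

# Example source folder -- replace with your own backup source
SOURCE = Path(r"D:\Data")

MB = 1024 * 1024
# Size buckets roughly matching the breakdown above
buckets = {"< 1 MB": 0, "1 MB - 500 MB": 0, "> 500 MB": 0}

for root, _dirs, files in os.walk(SOURCE):
    for name in files:
        try:
            size = (Path(root) / name).stat().st_size
        except OSError:
            continue  # skip files we can't stat (locks, permissions)
        if size < 1 * MB:
            buckets["< 1 MB"] += 1
        elif size <= 500 * MB:
            buckets["1 MB - 500 MB"] += 1
        else:
            buckets["> 500 MB"] += 1

total = sum(buckets.values()) or 1
for label, count in buckets.items():
    print(f"{label}: {count} files ({100 * count / total:.1f}%)")
```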

I am also considering using the backup-test-samples option to verify backups, but my back-end provider (B2) charges for download bandwidth, so I obviously want to minimize the amount of data downloaded to verify backups. Does using that option change how I would set the block and dblock sizes?

I’m about to “fully commit” to Duplicati and begin a cloud backup of almost 1.5 TB of data, so I want to be sure I get it right the first time. Any advice would be greatly appreciated.

FWIW, I’ve thought carefully about block size too, since I’m also on B2 - one account backing up about 5 different machines at this point, most of them set to daily backups.

B2 only starts charging you once you exceed 1 GB of download bandwidth in a day. After doing an initial backup with Duplicati’s default 50 MB volumes, I decided that’s much too small: it creates tons of small files in the backup set, and I’m pretty sure it bogs things down, since the upload has to pause every 5 seconds or so to create a new backup volume. I briefly toyed with 500 MB volumes, but then realized that even 2 verifications would consume the full day’s free download allowance, and I’ve been running lots of backup cycles while getting started, adding chunks to my main backup set a few GB at a time.
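To put some rough numbers on that trade-off, here’s a quick back-of-envelope sketch (it assumes roughly the 1.5 TB from the original post, ignores compression and deduplication, and treats one verification sample as one full dblock volume downloaded):

```python
# Rough numbers behind the volume-size choice: how many remote files a
# backup produces, and how many verification samples fit under B2's
# free daily download allowance, at a few volume sizes.

BACKUP_SIZE_GB = 1536          # ~1.5 TB of source data (assumed)
FREE_DOWNLOAD_GB_PER_DAY = 1   # B2's free daily download allowance

for volume_mb in (50, 200, 500):
    volume_gb = volume_mb / 1024
    volume_count = BACKUP_SIZE_GB / volume_gb
    samples_per_day = int(FREE_DOWNLOAD_GB_PER_DAY // volume_gb)
    print(f"{volume_mb:>3} MB volumes: ~{volume_count:,.0f} remote dblock files, "
          f"~{samples_per_day} free verification sample(s) per day")
```

With those assumptions, 50 MB volumes mean tens of thousands of remote files but ~20 free samples a day, while 500 MB volumes mean only a few thousand files but just 2 free samples a day - which is the squeeze described above.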

I eventually decided on 200 MB volumes for my main PC, which has the largest data set, and for the other computers backing up to the same account I’ve used anything from 150 MB to slightly over 200 MB. I’ve only exceeded the free download allotment on a few days, and we’re talking about an extra $0.01 or so on those days, so nothing that’s going to have any real impact (and nothing that’ll be routine, either). The 200 MB size seems to work well and gives sufficient speed, though of course I’d be curious to see someone run benchmark tests across a range of sizes to get official numbers.

I also have a second backup set on my main PC which goes to a 3 TB external drive, and since there’s no download cost for a local drive, I set the volume size there to 2 GB; that’s working pretty well too. Obviously for B2 there would be lots of downsides to doing that (any sort of verify or restore operation would start incurring fees almost immediately), but for a local drive there are no such concerns.

I’ve noticed B2 stores the SHA1 hash of every individual file, and Duplicati supposedly has some capability to use these for verification (instead of, or in addition to, physically downloading the backup volumes) - but from what I’ve read it doesn’t actually do this yet. I’m hoping this feature gets implemented sometime soon, as it could get us a lot more verification of our remote backup sets for minimal download cost. I said in another thread that I’d like the option to have it download 1 set to verify and do hash checks on ~4 other sets every time a backup runs, with both of those numbers adjustable by the end user. Hopefully it won’t be that hard to do once the main functionality is implemented.
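The comparison itself would be trivial once both sets of hashes are available - conceptually something like this sketch, where both inputs are made-up placeholders (the local side would have to come from hashes recorded at upload time, and the remote side from a B2 file listing, which does report a SHA1 per stored file):

```python
# Conceptual sketch of hash-only verification: compare SHA1s recorded at
# upload time against the SHA1s B2 reports for the stored files, without
# downloading any volumes. File names and hashes are hypothetical.

local_hashes = {   # what would have been recorded when each volume was uploaded
    "duplicati-b1234.dblock.zip.aes": "3f786850e387550fdab836ed7e6dc881de23001b",
    "duplicati-i5678.dindex.zip.aes": "89e6c98d92887913cadf06b2adb97f26cde4849b",
}
remote_hashes = {  # from a B2 file listing
    "duplicati-b1234.dblock.zip.aes": "3f786850e387550fdab836ed7e6dc881de23001b",
    "duplicati-i5678.dindex.zip.aes": "0000000000000000000000000000000000000000",
}

for name, expected in sorted(local_hashes.items()):
    actual = remote_hashes.get(name)
    if actual is None:
        print(f"MISSING  {name}")
    elif actual != expected:
        print(f"MISMATCH {name}")
    else:
        print(f"OK       {name}")
```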


If you want to restore a 1 MB file, you will have to download the entire 150 or 200 MB dblock volume it lives in. Was that part of your consideration?

Also this (just replace 100kb with 200MB):

Yes. Download bandwidth from B2 is super cheap until you start talking about multiple (dozens+) gigabytes, and it’s pretty fast. It’s a trade-off of course, but one that I don’t expect to ever have an issue with.

Thanks, drakar, this is great information! It also mirrors my setup pretty well, so I appreciate your sharing your analysis.

Do you use the backup-test-samples option for your B2 backups? I’ve been pondering how many files to sample for each backup. Is just 1 file check OK? Given the block size issue, I hate to waste all that downloading on just one file. 2? 5? 10? I can’t seem to come up with a number that sounds logical and practical. Did you factor this into your own backup sets?
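For context, this is the back-of-envelope math I keep going around in circles on (just a sketch, assuming 200 MB volumes, one backup run per day, and one full volume downloaded per sample):

```python
# Rough download cost of different sample counts, assuming 200 MB dblock
# volumes, one backup run per day, and one full volume per sample.

VOLUME_MB = 200
FREE_MB_PER_DAY = 1024  # B2's 1 GB/day free download allowance

for samples in (1, 2, 5, 10):
    download_mb = samples * VOLUME_MB
    over = max(0, download_mb - FREE_MB_PER_DAY)
    note = "within the free allowance" if over == 0 else f"{over} MB over the free allowance"
    print(f"{samples:>2} sample(s): {download_mb} MB per run ({note})")
```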

Also, did you change the dblock size in your configurations, or just the local block size?


I’ve left it at the default of 1 for now. I don’t expect too much trouble with B2 corrupting the backup sets, and I read somewhere that Duplicati is designed to pick the least-often-downloaded backup volumes for verification.

I’m not too sure which setting you’re referring to - I changed the volume size in the main configuration to 200 MB, which dictates the size of the remote dblock volumes, but I haven’t added any advanced options related to this (though I’d be open to suggestions if there are good ones I’ve missed; to be honest, I haven’t looked through them all very carefully). Are you thinking of any in particular?

The only advanced option I’ve selected so far is auto-cleanup, only because it seemed like a good idea and mostly harmless.