Does Duplicati download any data?

I am considering which cloud storage to use, and as noted already, services like Amazon S3 and Backblaze B2 charge not only for the amount of data stored but also for retrieval (as far as I can see, uploads to the cloud are not charged).

I can make a fair estimate of how much data I would need to store in the cloud, but how do I go about estimating how much download bandwidth I would use (assuming a perfect scenario where I never need to restore data from the cloud)? Does Duplicati perform any downloads during its normal backup operations?

I gather B2 has the ability to report checksums of files in the cloud. Does this mean that Duplicati can check the integrity of the cloud data without needing to download it?

Yes, it does, and I think it downloads one randomly selected volume every time a backup runs. The default volume size is 50 MB.

Many thanks for your response. May I ask why Duplicati would do that?

(I don’t want to appear rude in asking this, I merely want to alleviate my ignorance!)

Because Duplicati has almost no requirements for the backend (the only required operations are get, put, list, and delete), the only way to test the integrity of backed-up volumes is to re-download some random files and check that their contents can be read and that everything looks as expected.
You can set the number of files to download after a backup with --backup-test-samples=n (replace n with the number of files you want to download for verification). Use the option --no-backend-verification=true to disable downloading test samples completely.
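For illustration, a rough command-line sketch (untested; the b2:// URL and source path are placeholders, and credential/encryption options are left out):

```
# Download and verify 3 randomly chosen volumes after each backup run:
duplicati-cli backup "b2://my-bucket/backups" /home/user/Documents --backup-test-samples=3

# Or skip the post-backup verification download entirely (no integrity check):
duplicati-cli backup "b2://my-bucket/backups" /home/user/Documents --no-backend-verification=true
```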

More on this here:

Also researching this topic: I'm looking to back up approx. 200 GB, and Backblaze does look cheapest in that respect for storage.

From the above it seems that Duplicati will check one volume (50 MB by default) each time. So if it is set to run daily, it should be well under the free allowance from Backblaze (1 GB per day).

So just have to worry about storage cost. Did I get that right?

Oh, thanks for pointing that out:

The first 1 GB of data downloaded each day is free.
Source: Cloud Storage Pricing Comparison: Calculate Your Costs

I believe you did.

It’s worth pointing out that you can do a lot more than one backup per day, but not hourly if the machine runs 24 hours a day: 24 runs × 50 MB is about 1.2 GB of verification downloads, which exceeds the free 1 GB. So for a NAS, it would probably be a good idea to reduce the volume size to 40 or 35 MB in order to be able to do hourly backups.
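As a rough sketch (the URL and source path are placeholders; --dblock-size is the remote volume size setting), an hourly job along these lines should stay under the free tier:

```
# 24 runs/day x 35 MB verification download = ~840 MB/day, under B2's free 1 GB.
duplicati-cli backup "b2://my-bucket/backups" /volume1/shares \
  --dblock-size=35MB \
  --backup-test-samples=1
```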

It’d be worth noting here that B2 stores the SHA1 hash of each individual file, so if Duplicati implements a handler for that at some point (hopefully soon), there would be potential to reduce downloads during backups to almost nothing while still checking a bunch of files on every backup job. Or, preferably, a hybrid approach (with configurability), where Duplicati could download-and-check one file but hash-check five files, etc.
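For what it’s worth, those hashes are already exposed by the B2 native API, so nothing exotic would be needed; a rough curl sketch (key ID, application key, bucket ID, token, and API URL are placeholders):

```
# 1) Authorize; the response includes an authorizationToken and an apiUrl.
curl -u "KEY_ID:APPLICATION_KEY" \
  "https://api.backblazeb2.com/b2api/v2/b2_authorize_account"

# 2) List files; each entry carries a "contentSha1" field that could be compared
#    against locally recorded hashes without downloading the volumes themselves.
curl -H "Authorization: AUTH_TOKEN" \
  -d '{"bucketId": "BUCKET_ID"}' \
  "API_URL/b2api/v2/b2_list_file_names"
```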

@dave, I’m new to this as well, but I think you also need to consider your versioning/retention settings, since deleting “old” versions (eventually) requires downloading multiple existing backup files (dblocks), which are then decompressed locally, re-compressed into fewer files (containing only the versions being kept), and re-uploaded.

So the total bandwidth requirements could get a fair bit higher with maintenance (unless you’re going for unlimited retention, in which case you shouldn’t see anything beyond the validation transfers already discussed).

You can disable this with --no-auto-compact.

Alternatively, you can set --threshold=100 to never reclaim partially unused data and only delete volumes that are entirely unused. This prevents downloading dblock files, except for small files.

The options --small-file-size and --small-file-max-count can be used to further control when Duplicati will download small files and merge them into a large volume.
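Roughly, combining them might look like this (placeholder URL and paths; the size/count values are purely illustrative, not recommendations):

```
# Never download dblock files for compacting:
duplicati-cli backup "b2://my-bucket/backups" /data --no-auto-compact=true

# Or: only reclaim volumes that are entirely unused, and only merge "small"
# volumes (here, anything under 5MB) once more than 50 of them have piled up:
duplicati-cli backup "b2://my-bucket/backups" /data \
  --threshold=100 \
  --small-file-size=5MB \
  --small-file-max-count=50
```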

As an aside, and mainly out of curiosity, why does “small file max count” have a Byte/KByte/MByte/GByte/TByte selector? I assumed this option was more of a tally (a count of files), not necessarily related to file size. And if it is a file size, I can’t think of what that would actually mean.
[screenshot: the --small-file-max-count field in the UI showing a Byte/KByte/MByte/GByte/TByte unit selector]

That looks like a bug. It should be a number, as in “number of files”.

Edit: fixed with

@kenkendk, --small-file-size and --small-file-max-count look like awesome options, thanks!

Just out of curiosity (as this doesn’t apply to my needs): is there any aggregated bandwidth-usage reporting over a given timeframe for those working with bandwidth-limited connections (such as cellular hotspots) or with destination costs (such as the B2 usage past 1 GB mentioned above)?

Thanks, that makes a lot more sense (and is what I was expecting). Since I’m new to your system (and pretty new to GitHub), when should I expect this and other incremental fixes to show up in a release version? Would I need to jump over to the ‘canary’ update track to see it anytime soon?

Yes, new features are deployed in the canary channel as often as I can find time for it (usually weekly, but it varies).

There are the --upload-limit and --download-limit options, but nothing with reporting.
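They are throttles rather than meters, e.g. (untested sketch with illustrative values and a placeholder URL/path):

```
# Cap transfer rates; this limits bandwidth but does not report usage over time.
duplicati-cli backup "b2://my-bucket/backups" /data \
  --upload-limit=500KB \
  --download-limit=500KB
```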