So far pretty impressed with Duplicati. I was wondering if anyone knew how often it does the test of the backup archive?
The reason I ask is that I set up a test backup (about 3 GB) to an SFTP server, and it worked beautifully. It finished its backup and then verified it. While the verification ran I watched my network performance monitor and noticed that it seemed to be re-downloading the backup archive to verify it (which makes sense; I don’t know how else you could verify it).
If I back up everything, it will be considerably larger, and my ISP caps the amount of data I can transfer in a month. If it verifies often, I could see it hitting my cap pretty quickly.
Duplicati uses two levels of verification. Before doing anything, it lists all the remote files and checks that all files are found and have the correct size. It also checks that there are no new files. There is a feature request to also check the remote files against an MD5 or SHA-1 sum if the storage provider reports it.
The second step is actually downloading a file and testing that it can be decrypted and decompressed, and that its contents are as expected. This is done with a random sample on each backup. The advanced option --backup-test-samples=1 controls how many samples (from 0 upwards) are tested. You can set it to zero if you are not concerned about sudden changes and want the backup to finish faster.
Internally, Duplicati keeps a verification count for each file to make sure it spreads out the testing as much as possible (rather than always choosing purely at random, it chooses randomly among the least-verified files).
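To illustrate, the selection strategy could be sketched roughly like this (a hypothetical Python sketch, not Duplicati’s actual code; the function and variable names are made up):

```python
import random

def pick_test_samples(verify_counts, sample_count):
    """Pick files to verify, preferring the least-verified ones.

    verify_counts: dict mapping remote file name -> times verified so far.
    sample_count:  number of samples to test (cf. --backup-test-samples).
    Illustrative sketch only, not Duplicati's real implementation.
    """
    picked = []
    remaining = dict(verify_counts)
    for _ in range(min(sample_count, len(remaining))):
        least = min(remaining.values())
        # Choose randomly, but only among the files verified least often.
        candidates = [name for name, count in remaining.items() if count == least]
        choice = random.choice(candidates)
        picked.append(choice)
        del remaining[choice]
    return picked
```

Over many runs this evens out the verification counts, so every remote file eventually gets tested.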
BTW: A good way of showing your appreciation for a post is to like it: just press the button under the post. This also helps the forum software distinguish interesting from less interesting posts when compiling summary emails.
It looks like Backblaze (B2) provides the SHA-1 hash for individual files within the backup folder (looking at it now). I’d be curious for verification whether Duplicati utilizes this, as that seems like a pretty good feature.
Edit: upon further reading I eventually realized that Duplicati does not do this, but that the potential is there.
S3 compatible storage systems should provide the hash via the etag response. In my job, we develop software against object storage systems, such as Dell EMC Atmos, ECS, and other S3 targets. In practice, validating the hashes is very important, as data corruption does occur, whether in transit or in the backend storage itself.
Just to make sure I’m understanding this, what’s being discussed around hashes is that rather than Duplicati having to download the whole archive and test it (or does it just do a hash check?) we can download only the hash as provided by the cloud service to compare against the local database store of expected hashes?
If hash-only checking is used, does that mean more archives can be hash-checked each run (since the overhead is so much lower than downloading an entire archive)?
On top of that, if one is using a hash-providing cloud, should that affect the “recommended” archive sizes (in the larger direction), since testing wouldn’t normally require all the downloads? (Obviously restores would still be expensive.)
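The hash-only check being discussed could look something like the following (a hypothetical sketch; Duplicati does not currently do this, and all names here are invented for illustration):

```python
def hash_check(expected, reported):
    """Compare hashes the provider reports (e.g. via HEAD or list calls)
    against the hashes recorded in the local database at upload time.

    expected: dict of remote file name -> hash stored locally.
    reported: dict of remote file name -> hash the provider reports now.
    Returns the names of files whose hash is missing or mismatched;
    only those would then need a full download-and-verify.
    """
    suspect = []
    for name, want in expected.items():
        got = reported.get(name)
        if got is None or got.lower() != want.lower():
            suspect.append(name)
    return suspect
```

Since no archive data is transferred, far more (in principle all) archives could be checked this way on every run.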
Anyone have any idea how these types of verifications compare to what CrashPlan does? I have a 600GB Duplicati backup and I set the volume size to 100MB, but now that I understand that it downloads some of those 100MB files to verify them, that seems like it’s slowing down the verification process substantially (internet and PC speed are not that great on this machine). I’ve always wondered if CrashPlan has been doing the same thing.
I believe CrashPlan does these types of verifications (hash checks) on the “server” (what Duplicati calls a Destination).
If CrashPlan is backing up to a local disk (such as USB or mapped network/NAS drive) then all the overhead associated with the hash checks (CPU, memory, disk IO) would be happening on your computer since it’s the source AND the destination.
If CrashPlan is backing up to another user, then THEIR computer gets all the hash check overhead. This means that if you’re letting somebody else do a CrashPlan backup to YOUR computer, then at some point your computer is “taking the hit”.
If CrashPlan is backing up to their cloud storage, then it’s the CrashPlan cloud servers that get the hash check overhead.
Since Duplicati doesn’t (currently) have any destination-side code, all the overhead of hash checks has to happen on your machine. And to do a hash check, Duplicati has to process the actual archive file, thus the need for a download.
Note that while some cloud providers DO provide hashes, support for using them is not currently in Duplicati; it’s being discussed, but no definitive decision has been made.
If you’re worried about the overhead involved in verification, check out the 2nd post in this topic on how to minimize it.
Even if standard S3 compatible tools didn’t provide the hash Duplicati wanted, it would be pretty awesome if something like Minio could “partner” with Duplicati and provide a build with services optimized for use as a Duplicati destination.
The Amazon S3 API varies the ETag value based on the operation:
The entity tag is a hash of the object. The ETag reflects changes only to the contents of an object, not its metadata. The ETag may or may not be an MD5 digest of the object data. Whether or not it is depends on how the object was created and how it is encrypted as described below:
Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-S3 or plaintext, have ETags that are an MD5 digest of their object data.
Objects created by the PUT Object, POST Object, or Copy operation, or through the AWS Management Console, and are encrypted by SSE-C or SSE-KMS, have ETags that are not an MD5 digest of their object data.
If an object is created by either the Multipart Upload or Part Copy operation, the ETag is not an MD5 digest, regardless of the method of encryption.
If the Duplicati driver for S3 uses multipart upload, then no, the ETag will not be useful. However, if it is a standard PUT and the encryption is performed on the client side, so a plaintext (no server-side encryption) object is uploaded (as I believe is the case), then the ETag will be the MD5 digest.
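As a quick way to apply the rules quoted above: multipart ETags carry a `-partcount` suffix, so a client can at least tell when the ETag cannot be a plain MD5. A small illustrative Python check (a heuristic sketch, not an official API rule beyond what the S3 docs state):

```python
def etag_is_plain_md5(etag):
    """Heuristic: an S3 ETag is a plain MD5 digest only for non-multipart,
    SSE-S3/plaintext objects. Multipart ETags have a '-partcount' suffix
    and are not an MD5 of the data. The surrounding double quotes are
    stripped because S3 returns the ETag as a quoted string."""
    etag = etag.strip('"')
    return "-" not in etag and len(etag) == 32  # MD5 hex digest is 32 chars
```

Note the converse does not hold: an SSE-C or SSE-KMS object can have a 32-character ETag that still is not the MD5, so this only rules cases out.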
For validation, the ETag is returned with object HEAD operations and follows the same rules as for the PUT operation. For a standard PUT of client-side-encrypted data, it should be the MD5.
For uploads, you should also send the Content-MD5 header with the PUT, so S3 will validate the content on receipt.
From the S3 API docs:
To ensure that data is not corrupted traversing the network, use the Content-MD5 header. When you use this header, Amazon S3 checks the object against the provided MD5 value and, if they do not match, returns an error. Additionally, you can calculate the MD5 while putting an object to Amazon S3 and compare the returned ETag to the calculated MD5 value.
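Putting those two pieces together, the client-side bookkeeping is just two encodings of the same MD5 digest. A minimal sketch (assuming a plain, non-multipart PUT of client-side-encrypted data, as discussed above; function names are invented):

```python
import base64
import hashlib

def md5_for_upload(data):
    """Compute both MD5 forms used with S3-style APIs:
    - the hex digest, comparable to the ETag of a simple PUT object
    - the base64 digest, suitable for the Content-MD5 request header
    """
    digest = hashlib.md5(data).digest()
    return digest.hex(), base64.b64encode(digest).decode("ascii")

def verify_head_etag(local_md5_hex, head_etag):
    """Validate a remote object via the ETag from a HEAD response,
    with no download; S3 wraps the ETag value in double quotes."""
    return head_etag.strip('"').lower() == local_md5_hex.lower()
```

The hex form is what you would compare against the ETag returned by a later HEAD request, and the base64 form is what Content-MD5 expects on upload.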
It appears the Backblaze B2 API has equivalent support in the b2_upload_file and b2_get_file_info operations.
Like Duplicati, we perform long-term data durability tests by choosing sample data objects to download and validate against the original checksum from object creation. We also use the results from the PUT operations to validate data creation, and from HEAD operations to validate much larger samples without downloading the actual data.
This page from Google also discusses this topic, as Google Cloud Storage also supports these conventions:
The idea of using a HEAD request is great, that will certainly speed things up.
Duplicati does not currently use chunked uploads, so we should be able to rely on MD5 checksums.
I do not like that the hash is not exposed in a dedicated header (say, X-MD5-Content) but rather through the ETag (which is correct ETag usage). A provider could change its ETag implementation to return something else (say, SHA-1 + a salt) and still honor the ETag rules.
But … there is a big speed bonus to be won here, so I guess we will just rely on the ETag if it is present.