Backup failing when compact fails; unable to resolve

aureliandevel · March 7, 2021, 9:36am

Hi,

This is a long one.

TL;DR Backups are failing due to compact operation failing, presumably due to extra encryption blocks, and all attempts to repair fail.

I have a large backup I run daily to capture all my active VMs just for belt-and-suspenders safety in case something goes really wrong, which is to say, if this backup got corrupted, it wouldn’t be the end of the world. But, I’ve found issues that raise concerns for the safety of other backups I run where corruption could potentially be a very big issue. I haven’t yet run the restore (and/or recovery) process against all my backups (in progress) but I’m seeing similar problems in terms of hash errors and also some decryption errors. I will be confirming that the compact operation will run successfully against each of my backups.

So, to the details.

The flag was raised when my VM backups started failing, not because of the backup itself but because the compact operation is encountering remote files with AES errors: many files with numerous, sometimes hundreds, of ‘Extra’ blocks. Being around 450GB, it’s been a while since the automatic compact process was run (it runs at 25% wasted space, right?) and that would cover at least a month or 2 of backups and multiple Duplicati releases. On the plus side, this has forced me to review Duplicati’s test, restore and recover processes and I’ve learned a lot, established a repeatable process and, despite the following, gained confidence in the safety of my data.

Running a ‘compact’ operation manually fails with “SharpCompress.Compressors.Deflate.ZlibException: Bad state (invalid bit length repeat)” errors. This does not occur within a Started/Completed log block which logs the file being operated on so it’s not clear which file is causing the issue. I guess I could run the operation with ‘Profiling’ output instead of my default ‘Verbose’.

Running the ‘test’ operation with ‘all’ and ‘–full-remote-verification’ parameters finds numerous files with ‘extra’ errors, eg
duplicati-if156b20ed6294ca489683ef9c797c9b7.dindex.zip.aes: 2 errors
Extra: 8TH/3t7nLKah0ip0k+0fpssGiFrtTvjKKF148VL5gps=
Extra: LuanbLNdlITPNV3MFGaVxyzvyPjun8P/umJH9tFEEuI=

that I assume to be the root cause of the ‘compact’ operation failing. I’ve read elsewhere that these can be ignored, despite being described as a ‘should never happen’ condition, the data is still present and not corrupted and the successful completion of the following restore would confirm this to be correct. Note that the 2 extra errors listed here is only for illustration; some files have hundreds of these errors. I’ve run the ‘list-broken-files’ operation which states “Found no broken filesets, but 1 missing remote files. Run purge-broken-files.”; the ‘purge-broken-files’ operation states “Marked 1 remote files for deletion” and “Found no broken filesets, but 1 missing remote files. Purging from database.” but that process doesn’t appear to actually do anything as repeating the list/purge cycle yields the same result. So, how do I clear these errors so the compact operation can complete?

I’ve tested the restore process and, although it took 18hrs to restore 455GB (moving the temp folder from an SSD slowed things down a bit too), it did apparently restore successfully. I think the key here is the “Found 1 missing volumes; attempting to replace blocks from existing volumes” action which uses blocks from other versions (I’m assuming that’s the case).

I also tested the full recover process (download/index/list/restore) on my VM backup via Duplicati.CommandLine.RecoveryTool but that didn’t go well. The download phase recorded multiple “Failed to decrypt data (invalid passphrase?): Message has been altered, do not trust content” errors and the restore phase recorded monstrous numbers of “Failed to read hash/Unable to locate block with hash” errors when rebuilding the files: many didn’t survive this process, being absent from the restore location. Other backups I ran the recover process on didn’t record errors per se but did have “Restored file hash mismatch - restoring file in damaged condition” messages and again leaving the file absent from the restore location.

The recovery tool also raises “Unable to locate block with hash: …” errors when the affected file is zero bytes, so that’s an easy ignore especially considering the hash is the same for each error, but still a pain to verify/validate when you’re looking at a wall of text.

Having read the online docs, I’m aware that the recovery tool has a recompress action with --reupload and --reencrypt options however I haven’t yet tried that because:
a) given the issues listed above, low confidence of it actually succeeding,
b) I really want to avoid deleting and re-uploading around 400GB to my rcloned offsite copy of duplicati data,
c) In the end, I’d be better off just blowing away the current backend data and running a new backup anyway

So, I’d like to aim to have say a monthly scheduled process that runs the ‘test all’ and/or ‘restore’ operations against all my backups to non-original location (and, for extra points, maybe even a less often ‘recovery’) but I’d really like to know how to work around or resolve these hash and encryption issues when they arise.

I run Duplicati Canary, but don’t necessarily update to a new release straight away, so from the timestamps on the %LOCALAPPDATA%\Duplicati\updates folders:
* v2.0.5.112 since 13/Feb
* v2.0.5.111 since Nov/20
* v2.0.5.108 since Jul/20

All my backups run via command line scripts (I just find the management easier) with ‘Verbose’ console logging output captured.

My local backup data lives on a FreeNAS box running mirrored ZFS pools, scrubbed weekly, with ECC RAM so bitrot is not a thing as far as storage goes. My main PC does not run ECC RAM so it’s not outside the realm of possibility that something local could be causing corrupted encrypted files. Running a full MemTest pass is on my TODO list but I do run parchive (par2) against large static files and archives and I’ve yet to see one require repair.

Thanks for reading

ts678 · March 8, 2021, 2:42am

Network-attached? What Storage Type is this? If there’s a network protocol moving files, what is used?

Will FreeNAS let you run a Python script over Duplicati backup? If so, set upload-verification-file, backup, then run DuplicatiVerify.py which you can get from the utility-scripts folder of a Duplicati installation.

is probably a good start even without the extra noise of the Extra blocks. Your Verbose log should show some retries and error details for bad files, so this is sort of the same check-every-file described above.

You might also get size differences. I think those get picked up first before the decryption attempt begins. Surveying the damaged files to see what kind of situation you’re dealing with would be a good initial step.

After the survey, this is probably going to need a detailed exam of some files, looking at size and content.
Content is a bit hard to study after encryption, but at least the expected format has some documentation.

aureliandevel · March 8, 2021, 8:57am

Yeah, I put all that detail in and still forgot something obviously important.

Using S3 protocol over minio. I know there’s now a minio backend but never bothered changing over since everything seemed fine.

I like the idea of the upload-verification file; I’ll look into that.

Regarding the retries and error details in the verbose log, I did notice a number of retries that failed 5 times (so, pretty much had the sword) but there were also a few that failed once or twice and then came good. I’d forgotten that (I’m awash with logs) so I’ll try and find out if there’s something to that.

Thanks for the suggestions.

ts678 · March 8, 2021, 1:05pm

number-of-retries defaults to 5. If you have a genuinely damaged file at the destination, retries won’t help, however if retries help then maybe there’s some sort of communication or transfer issue that cleared up.

I’m not that familiar with S3, but I “think” it’s unlikely to leave partial files. Duplicati before and after backup examines the remote directory list (unless you disable that), and should detect presence and size errors. Content errors can’t be easily examined except by random sampling (which you can crank up if you like).

When you can see a damaged file name, one thing to look at is the time stamp, and that can be checked against when the file was originally uploaded, which is in your logs and also can be gotten from database. Examining old database (saving a copy might be good) can also show previous file size directory listings.

DB Browser for SQLite to read-only open Database will have old list data in the RemoteOperation table. Times can be approximated by looking at nearby dlist file names, or convert UNIX timestamp with tools.
For very old information, you may find that log-retention has aged the history out of DB (but not your logs).
Remotevolume table has some status information on files. Should be similar to the data in verification file.