Why is that strange? With robust systems, all possible failure models and outcomes should be already defined in the unit tests. If there’s something you can break, then something is done wrong. See: Chaos engineering - Wikipedia
There should be two departments, another creating the software and another trying to break it. This will reveal the issues which many programs are so full of. What if something goes wrong, is the system designed and implemented correctly. What should happen etc. Robust systems deal with it, and most of systems break down. There are things you can’t recover from. But all recoverable issues should be dealt with automatically.
Of course this includes the classic question, which kind of errors should be automatically managed and which ones should just say this fails is now so serious that somebody has to check this out and fix it manually. And if possible, automatically recover to last possible consistent state, where data integrity can be verified.
Tests continued
I continued by testing the situation I had about year or two ago. Where the backup-system file system got corrupted due to RAID & NTFS failure. In that case, I just restored the backup of the backups. Yet due to backup schedules, I did lose some of the latest files. Back then it forced me to restart all backups, because recovery was totally borken, even if the data storage was simply rolled back “one version”.
I just deleted latest files from the backup data storage.
Backup started at 25.9.2019 8.12.09
Checking remote backup ...
Listing remote folder ...
Missing file: duplicati-20190924T091416Z.dlist.zip.aes
Missing file: duplicati-bea66de64699443fca42451aad590ce71.dblock.zip.aes
...
Found 43 files that are missing from the remote storage, please run repair
Backend verification failed, attempting automatic cleanup => Found 43 files that are missing from the remote storage, please run repair
repairing with --rebuild-missing-dblock-files ← Looks good, testing it out.
Ok, as expected rebuild wasn’t possible in this specific case, because I did also delete lots of blocks created by compaction.
Let’s see what steps are needed to fix the state. (I’ll automate this in my own scripts, when necessary, that’s why I need to know). I do deeply hate manual fixes.
list-broken-files, purge-broken-files, repair, verify, restore
Yet I could have done one more test which is the most complex I think. Where I would have normal valid backup, then I would intentionally compact it and then restore old database version. Now the database and data storage are out of sync. But there are some extra files, which contain data, which belongs to the missing files. Does the recovery process check the extra file content, if the files contain needed blocks when repairing / rebuilding blocks?
While running verify after repair, this looks bit strange:
Failed to process file duplicati-20190918T153015Z.dlist.zip.aes => Value cannot be null.
Parameter name: stream
Downloading file (541 tavua) ...
...
Failed to process file duplicati-20190921T153015Z.dlist.zip.aes => Value cannot be null.
Something is slightly off, why it checks the same file several times? Or is it assumed it’s a data transfer issue and that’s why it’s retried?
Interesting, the verify is showing many more errors especially file size / hash mismatches, which I really weren’t expecting. The changes I did, didn’t touch those files or sizes at all. - This also seems to be slightly off? It’s also strange that the files are downloaded multiple times, probably once again, because system thinks it’s data transfer issue causing the file mismatch. I’ve also seen similar issues earlier, where there’s just a slight size mismatch. Afaik, Duplicati won’t rewrite or append to files, so what’s causing the size / hash issue? Shouldn’t all files / storage objects are static. What’s causing this?
After all of these steps verify says, that there are files with incorrect size / hashes, repair says:
remote file duplicati-ia64c5ccb08d842119f94f49905c9187d.dindex.zip.aes is listed as Verified with size 82973 but should be 103165, please verify the sha256 hash “v5G9my6/LS7EIXQpMUVhtBg9brZIiETRJhKQBb+pLZw=”
…
remote file duplicati-if5ee7b555c664c749c2e7f72e7337d39.dindex.zip.aes is listed as Verified with size 1773933 but should be 180893, please verify the sha256 hash “bXRXEEyHYOawL7FafVvbySMm5Z9dUhJ2TNUPzEyzhWs=”
Failed to perform verification for file: duplicati-ia64c5ccb08d842119f94f49905c9187d.dindex.zip.aes, please run verify; message: Remote verification failure: [Extra, +0bP8nxrab2YH1spbtqSTNSq01AOuIKZQ7JBP+35Das=] => Remote verification failure: [Extra, +0bP8nxrab2YH1spbtqSTNSq01AOuIKZQ7JBP+35Das=]
...
Failed to perform verification for file: duplicati-if5ee7b555c664c749c2e7f72e7337d39.dindex.zip.aes, please run verify; message: Remote verification failure: [Extra, ++JiEgbuo4dt+cJ58dQHGVB4v8P6Or+4BOxhDPqI5do=] => Remote verification failure: [Extra, ++JiEgbuo4dt+cJ58dQHGVB4v8P6Or+4BOxhDPqI5do=]
And the state can’t be restored. The repair / recovery still doesn’t work as well as it could work. Because just deleting files for one day, should be quite simple to recover from. Yet, in this case compaction was also run between. So sure, it might lead to situation where there aren’t any fully restorable backup versions left. Because I did delete new files. If I would have rolled back the data storage, then there should be old blocks (extra files) which contain the data require to recover the old versions.
But now we’re again in the situation where the backup won’t run without full backup reset.
Just to play with it, after those steps, I did restore the files I had removed to test the recovery. Now repair deleted the files as extra files. Interestingly it detected new index file. But rejected it.
After all tests these errors remain:
remote file duplicati-idff8102b359c48118ecc150105e40256.dindex.zip.aes is listed as Verified with size 1028509 but should be 1450173, please verify the sha256 hash “0f6My7NXgw2NYcC+E7ovwvROs9WmDRiLMNLafXHkmPU=”
…
remote file duplicati-iae3eae383b3a40a4b591091050c6772d.dindex.zip.aes is listed as Verified with size 142109 but should be 297501, please verify the sha256 hash “ammoBgBhtDeP5tNqQcC1N/d2XKdZJXsHFJ8znFn1ndw=”
I’ll try one more thing. I just manually delete those files. Let’s see what happens. Then I’ll run repair, verify and restore test.
After repair, those files were re-uploaded. But still the state isn’t properly recovered. Repair says that everything is ok. But verify fails:
Failed to process file duplicati-20190920T153015Z.dlist.zip.aes => Value cannot be null.
Parameter name: stream
Downloading file (31,91 MT) ...
duplicati-20190920T153015Z.dlist.zip.aes: 1 errors
Error: Value cannot be null.
Parameter name: stream
These are quite simple failure models, and shouldn’t cause any trouble.
Anyway, I’ll just restored the backup database and the data storage state, so everything resumes as normal. But this was still a good test of the recovery not being nearly perfect yet. And tests for these kind of situations are missing.
Afaik rolling back the data storage one or a few versions should be in the basic recovery tests. Sure it sucks, but that’s once again something that can happen in real world and it should be recoverable.
So both cases where database is rolled back or data storage is rolled back should be recoverable. Extra bonus, if the compaction between those events is correctly handled by reading the available data from the new / old files which are out of sync.
Anyway, personally I feel that reverse engineering / testing these cases is bit strange. Best way to approach these things is to make sure that the logic handling the situations already covert these issues. - Yet this is long term problem with software development. Do we write some code, and see what happens. Or do we first design what we want to do, specify failure models and properly design, implement and test those. - Sure it’s tedious. But I think the later model is the only way to make really solid software.