Backup valid, but still unrestorable?

Again, this is Duplicati at its worst.
list-broken-files -> ok
purge-broken-files -> ok
repair -> ok

Running full restore -> FAIL

Running a full restore reports that it is registering a missing remote file, then it tries to recover using blocks from existing volumes, and after that it fails.

Error id on restore: DatabaseIsBrokenConsiderPurge. I just did that (again); nothing changed.

There’s something inherently and logically broken here. How can the backup be broken and valid at the same time?

Please fix issues like this, because these are absolute trust and reliability killers. It’s very bad that everything seems to be OK until you really need it, and then boom, you’re let down. This is the worst kind of trap you can build into a backup system.

Version 2.0.4.15


If you’re talking about the names of the commands used, I think the short names can lead one to think they’re more universal than they are. They’re in some cases rather specific (but I haven’t seen a good writeup of capabilities and limitations – almost certainly one could not fit all of the details in the names).

At least the long repair description says it’s “Tries to repair” (and a rewriting of recreate has begun that possibly will also help repair, however I haven’t heard any news recently, so can’t offer you any details).

--list-broken-files sounds like it’s mainly a preview of --purge-broken-files, both of which I think mostly work from easily obtained data such as the database and the remote file listing. Testing the remote files completely would require downloading everything, and rarely will anyone be willing to tolerate the time and transfer.

The TEST command on all files with --full-remote-verification is quite thorough – and maybe very slow. Longer-term, the hash capabilities that some storage provides may be a shortcut to better (if limited) tests.
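To illustrate the idea, here is a minimal sketch (names and data shapes are hypothetical, not Duplicati’s API) of a hash-based spot check that compares locally recorded volume hashes against hashes reported by the storage backend, without downloading any file content:

```python
def find_suspect_files(recorded, remote_reported):
    """Compare locally recorded volume hashes against hashes the
    storage backend reports in its listing, with no downloads.

    recorded:        dict of filename -> expected hash string
    remote_reported: dict of filename -> hash string from the backend
    Returns filenames whose remote hash is missing or mismatched.
    """
    suspects = []
    for name, expected in recorded.items():
        # A missing entry or a differing hash both mark the file suspect.
        if remote_reported.get(name) != expected:
            suspects.append(name)
    return sorted(suspects)
```

A file flagged here would then be a candidate for a full download-and-verify pass.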

How often does duplicati test your backup archives? gets into that future idea and other sorts of testing, with different tradeoffs. Unfortunately, I don’t think there’s any test that actually tries a source file restore. This could help users who don’t try their own test restore, but could also fall victim to a huge source file.

It sounds like you might have lost a dblock file (thus the scramble for blocks), but do you have the actual error message? Sometimes exact messages can help find in the source the code that raised that issue. Restore is sometimes harder to get logs out of than backup though, but you can try pasting screenshots.

What’s the current system status? You were trying a full restore. Is there system damage in the picture?

Sorry things went badly, but with help from you, we can see if we can get the restore going better for you. How much of a hurry is this? Are you aiming for a recent version, or trying to get some historical version? Duplicati has lots of different ways to do restores (difficulty varies), and your needs can guide the choice.

One more warning: testing is still done wrong. It’s possible that full tests pass but a full restore still fails. This is the ultimate trap I was talking about, which undermines confidence and creates very dangerous false trust for users until the very end. I’ve seen this situation too; that’s why I always do a full restore, not a full test. Test is unfortunately dangerously misleading.

But back to the specific case I’m now talking about. I guess the problem might be that, for some reason, a left-over i-file (dindex) remains. When the database is available, the file is ignored, but while rebuilding the database it probably causes the badly written logic to fail.

Remote file referenced as duplicati-b745030a03fa640d29f2daa1849cf0f2e.dblock.zip.aes by duplicati-i59a736b5d13d47d0b91cdfdfa1ddf8cb.dindex.zip.aes, but not found in list, registering a missing remote file.

This file should get deleted, because it’s probably not even necessary; the restore process should ignore it, and it shouldn’t cause the restore to fail.
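The consistency check implied by that log line could be sketched roughly like this – assuming the dindex-to-dblock references have already been extracted somehow (in reality they live inside the index volumes, so this is purely illustrative):

```python
def find_missing_dblocks(remote_files, index_refs):
    """Cross-check dindex -> dblock references against the remote listing.

    remote_files: set of filenames actually present on the backend
    index_refs:   dict of dindex filename -> dblock filename it references
    Returns (dindex, missing dblock) pairs, mirroring the
    "referenced ... but not found in list" situation from the log.
    """
    missing = []
    for dindex, dblock in sorted(index_refs.items()):
        if dindex in remote_files and dblock not in remote_files:
            missing.append((dindex, dblock))
    return missing
```

A repair pass could then either delete the orphan dindex or flag the dblock for block-level recovery, instead of failing the restore outright.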

Also, about the testing process: testing the backup while the local database is present is also pointless, because the database isn’t available when you’re doing disaster recovery. These are very bad, inherent flaws in the current “solution”. The bad reliability is absolutely the worst part, but let’s get the problems fixed. These are obvious, logical problems which should get fixed without reports. If the backup works transactionally and correctly, these problems should never occur, and if they do, automatic recovery should occur. - I think I wrote exactly the same around a year ago or so.

Luckily this backup set isn’t in the terabytes, so it’s relatively viable to run all the tests over a 10 gigabit network.

Found 1 missing volumes; attempting to replace blocks from existing volumes

Here’s fresh retested and confirmed results:
Full Restore: FAIL - Code 100, unable to recover
Full Test: OK - 0
List-Broken-Files: OK - 0
Purge-Broken-Files: OK - 0
Repair: OK - 0
Full Restore (again after all checks and tests): FAIL - Code 100, unable to recover

This is a perfect example of a “trap” product, which wastes your time and resources and doesn’t provide any value when actually needed. That’s why critical flaws like this shouldn’t exist by design.

About the version I’m restoring in this case: always the latest version, which makes sense for DR restoration. That’s also the reason why, if such a problem is detected, the potentially missing blocks should be automatically and instantly replaced (when running the next backup), because the required data is still available on the source system.
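A minimal sketch of that self-heal idea, assuming the backup engine knows the hashes of the missing blocks and can re-read candidate blocks from the source during the next backup (all names here are hypothetical):

```python
import hashlib

def self_heal(missing_block_hashes, source_chunks, upload):
    """Re-create destination blocks flagged as missing, using data
    that is still present on the source system.

    missing_block_hashes: set of hex SHA-256 hashes known to be missing
    source_chunks:        iterable of raw data blocks read from sources
    upload:               callback taking (hash, data) per repaired block
    Returns the set of hashes that could be healed from current source data.
    """
    healed = set()
    for chunk in source_chunks:
        h = hashlib.sha256(chunk).hexdigest()
        if h in missing_block_hashes and h not in healed:
            upload(h, chunk)   # re-upload the block to the destination
            healed.add(h)
    return healed
```

Blocks whose source data has since changed would stay missing and still need purging, but unchanged data could be repaired silently on the next run.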

As mentioned, I’m just TESTING; I’ve got all the data I need and no actual need to restore. So this isn’t actually a major problem for me. But as a programmer, data administrator, IT manager and a guy who greatly cares about data integrity and system reliability, I find these problems absolutely devastating.

  • I remember a compression program which gave great compression results in all tests. It worked really well, it was fast, and gave incredible compression ratios. Under the covers it just referenced the source files. As soon as the user deleted the source files, extracting the archive failed, leading to data loss. - Funny stuff.

Edit and continued: the test was run with “all” and --full-remote-verification=true; even adding --no-local-db=true doesn’t change anything, it still passes. --no-local-blocks=true: still passes.

Which option removes excess files from the destination folder? I think I’ve seen discussion about this, but I couldn’t find the parameter quickly. I’m also strongly opposed to software which goes and “poops” around, leaving junk everywhere. If something isn’t necessary, it should get deleted.


You might be thinking of --auto-cleanup, but I’m not totally convinced you have excess files when it says you have a missing file. You could try feeding both the missing dblock name and the index file referencing it into the affected command, for example by opening the job’s Commandline. Change the command, and put the remote file names where the source file list was shown for the default backup command.

Inventory of files that are going to be corrupted gives an example where a damage forecast is done before intentionally making a missing dblock file then recovering. Your --list-broken-files having no complaint does suggest maybe you have an extra dindex. You could also test moving it elsewhere to see if it helps or hurts.

If you really like, you could make a database bug report, put it somewhere, then link. I can attempt to study.

Your failure, if I follow correctly, was on direct restore, so the following might be relevant to getting your log:

View logs for direct restore from backup files

Whether or not even --log-file-log-level=profiling might help me (there’s SQL, but it’s hard to read) isn’t clear.

It would be wonderful if Duplicati someday got better at its repair capabilities, but writing code takes time…

That’s what I thought before; nice that you thought the same way. Now I had time to recheck it and confirm. The restore completed in 30 seconds; I just had to manually remove the “offending file”. Which clearly proves three points:

  1. File deletion is broken, it doesn’t work transactionally
  2. File cleanup is broken, after step one failed
  3. Restore / recovery logic is broken, after steps 1 and 2 failed.

That’s three overlapping bugs causing one problem. A correctly working program would have had three chances to fix the situation before I posted to this forum.

It’s confirmed and there are three bugs to fix.

Today, when running full DR restore tests, I again found backup sets which are claimed to be good but whose restore still completely fails. - Nothing has changed since.

Just cross linking: Is Duplicati 2 ready for production?

Edit, more than one such case already. Let’s see what the final statistics will be. I’ll be updating this post later. Tests will take several days, but I’ll get back.

Finally some results. Only a 1 gigabit connection this time, so restores take a long time. But here are the results, rounded to hide the actual numbers, from a bunch of backup sets:

81% Good (Code 0)
04% Restore completely failed (Code 100)
15% Repair required (Code 2)

Of course I would very much hope the good cases to be 100%, and any other result should be really rare. The reasons for the complete failures are:

  • DatabaseIsBrokenConsiderPurge
  • ExtraRemoteFiles

Other common warnings:

  • MissingVolumes (Scary, can lead to failure at any time later)

This is just an obvious reminder: regularly run a full restore. Otherwise you’ll end up in a very bad situation sooner or later. And do not use Duplicati as your only backup solution.

I’m going to purge and repair the problem cases, and retest those to see if integrity gets fully restored.

I’ve still got one huge (several terabytes) backup to restore. It has been running for several days.

And again, two more backup sets which are good according to all of the tests Duplicati provides, but still impossible to restore because the database is broken.

The program and its logic are inherently broken.

I guess that’s all about this topic so far, until the issue gets fixed. But in general, such software gives me zero trust.

On top of this, I don’t know what the status of one backup is, because restoring it would have taken more than two weeks; I aborted at two weeks. I decided to reset that backup set as well, just to be sure.

Out of curiosity what back end are you using?

Currently FTPS, but it’s not about that. Similar issues also occur with the local file system, etc. As all developers know, all it takes is one rarely occurring edge case, and that’s it. Then it will happen sooner or later.

As an example, with my current testing cycle there will be around 20k+ backups, and separate compacts (which of course mostly do not run) between full restore test runs. I log everything and have collected extensive statistical data. All tests will be logged and collected in a single table for further analysis.

After all this, it comes down to a statistical question: what is an acceptable failure level? If after 20k+ backup runs 4% of backups are broken, is that acceptable or not? I think it’s not, but honestly the situation could be much worse in a system as complex as Duplicati is by design.

My personal opinion is that the current flaw is systemic, and it could be remedied in several stages. Of course, if all of the stages were able to handle a situation like this, that would be the best outcome. Now the issue is that none of the stages handles the case.

Classic design / reliability problem. Trust me, I’ve seen similar issues in countless programs, and in many of my own programs as well. My motto is that if there’s a clear malfunction which requires manual intervention, that’s something that needs to be fixed. But I’ve seen so many IT guys out there who “fix it manually” and claim that now it’s fixed. Which is of course BS, because the root cause isn’t fixed; it will happen again sooner or later. So it isn’t fixed.

A classic example is writing a file to disk without syncing it, and then updating some other state based on that. If the system is hard reset or loses power at just the right moment, the file and (or maybe not) the data will be lost. You can deal with this in two ways: sync the data so the file is confirmed by the OS to be written to disk, not stored only in the write cache; and/or make the code check afterwards that the file is OK, and if it isn’t, re-do the task which created the file. In critical production systems I’ve often implemented both options. As an example, someone might have manually done something crazy, like deleting the temp file. If the code is written badly, it will look for the temp file and crash because it’s missing. One bad option would be ignoring the lack of the file and data. But the best thing is to re-do the task correctly.
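The write-then-sync pattern described above could look like this in Python (a generic sketch, not Duplicati code):

```python
import os

def write_durably(path, data: bytes):
    """Write a file so that a crash or power loss cannot leave
    dependent state pointing at missing or half-written data."""
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())      # force data out of the OS write cache
    os.replace(tmp, path)         # atomic rename: old or new, never half
    # Only now is it safe to record "file written" in any other state.

def ensure_file(path, rebuild):
    """Second line of defense: verify afterwards that the file exists,
    and re-do the producing task if someone deleted it."""
    if not os.path.exists(path):
        write_durably(path, rebuild())
```

The combination covers both failure paths: the fsync + atomic rename protects against power loss mid-write, and the afterwards-check protects against external interference like a manually deleted temp file.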

The more systems, transactions and data you’ve got, the more often you’re going to see the problem.

Due to that fact, and the many key production systems involved, I’m very careful with tasks like that. Fixing the mess that bad code can make is an enormous and extremely painful task, which - especially if done manually - can cause an even worse, inconsistent state. Sometimes special ad-hoc, case-specific repair programs are required. But that’s really time consuming, and even that won’t guarantee perfect results if the primary process has failed badly. - That’s why doing it once, and correctly, is so important.

I intentionally ran a backup job in a way which would have totally messed up older Duplicati versions, and it finally worked as expected. In this case, the remote storage was updated from another client a few times (from the old server) before I started using an old copy of it on a new server. Which of course creates a logical version / integrity conflict.

But the recovery worked as expected. I wonder if the situation would have been worse if I had compacted the remote storage in between. But if recovery works correctly, even that shouldn’t have made a major difference.

Version: 2.0.4.30 canary 2019-09-20

Backup
Checking remote backup …
Listing remote folder …
Extra unknown file: duplicati-20190923T153013Z.dlist.zip.aes
Extra unknown file: duplicati-b9a9cc69c87d84c4aa32a3c2e1cb3d38b.dblock.zip.aes
Extra unknown file: duplicati-baa03eaed547e43449d315084131277e3.dblock.zip.aes
Extra unknown file: duplicati-ia18977bcbb524e41a096df73f2c090ad.dindex.zip.aes
Extra unknown file: duplicati-iaa9629029c614d03a2107ff52defd276.dindex.zip.aes
Missing file: duplicati-20190821T153014Z.dlist.zip.aes
Found 5 remote files that are not recorded in local storage, please run repair
Backend verification failed, attempting automatic cleanup => Found 5 remote files that are not recorded in local storage, please run repair
This is something which shouldn’t happen - there’s still some kind of logic error in the code base.
Failed to read local db XXX.sqlite, error: database is locked
database is locked => database is locked
database is locked
Fatal error => The process cannot access the file because it is being used by another process.

Repair
Listing remote folder …
Deleting file duplicati-20190923T153013Z.dlist.zip.aes (39.20 KB) …
Deleting file duplicati-b9a9cc69c87d84c4aa32a3c2e1cb3d38b.dblock.zip.aes (31.91 MB) …
Deleting file duplicati-baa03eaed547e43449d315084131277e3.dblock.zip.aes (139.42 KB) …
Downloading file (unknown) …
Failed to accept new index file: duplicati-ia18977bcbb524e41a096df73f2c090ad.dindex.zip.aes, message: Volume duplicati-b9a9cc69c87d84c4aa32a3c2e1cb3d38b.dblock.zip.aes has local state Deleting => Volume duplicati-b9a9cc69c87d84c4aa32a3c2e1cb3d38b.dblock.zip.aes has local state Deleting
Deleting file duplicati-ia18977bcbb524e41a096df73f2c090ad.dindex.zip.aes (1.35 MB) …
Downloading file (unknown) …
Failed to accept new index file: duplicati-iaa9629029c614d03a2107ff52defd276.dindex.zip.aes, message: Volume duplicati-baa03eaed547e43449d315084131277e3.dblock.zip.aes has local state Deleting => Volume duplicati-baa03eaed547e43449d315084131277e3.dblock.zip.aes has local state Deleting
Deleting file duplicati-iaa9629029c614d03a2107ff52defd276.dindex.zip.aes (52.37 KB) …
Uploading file (541 bytes) …

But after running the repair manually, the backup worked. Let’s restore the backup and see if that works as well. Usually in these kinds of situations the backup got messed up and became unrestorable.

Test
Listing remote folder …
Examined 87 files and found no errors

So far so good… Let’s restore the backup and see the final verdict; this used to fail often in this kind of situation.

Restore
Restored 785 (4.54 GB) files to XXX
Duration of restore: 00:07:09

Phew… Honestly, this is finally starting to look really good. When the next beta comes out, I’ll need to push the update to all clients. Hoping this will also fix the unrestorable database cases in the future.

TempDir - Once again
Anyway, --tempdir is once again totally broken and doesn’t work, at least when restoring - neither via the TMPDIR environment variable nor as a command-line parameter. Kind of legendary. I just seriously dislike inconsistent code. But that’s a really minor thing; it would just vastly improve restore performance in this case. Yet, at the very same time, it’s such a simple thing that I wonder how it can be broken all the time. Must tell something about bad development processes and habits?

I do find it puzzling that you are intentionally trying to do things like that… but I guess it’s good it was able to recover!

Can you open an issue for this on github? If as you say it’s a simple fix, it should be easy to resolve.

Thanks for the update!

Why is that strange? With robust systems, all possible failure modes and outcomes should already be defined in the unit tests. If there’s something you can break, then something has been done wrong. See: Chaos engineering - Wikipedia

There should be two departments: one creating the software and another trying to break it. This reveals the issues which so many programs are full of. What if something goes wrong - is the system designed and implemented correctly? What should happen, etc.? Robust systems deal with it, while most systems break down. There are things you can’t recover from, but all recoverable issues should be dealt with automatically.

Of course this includes the classic question of which kinds of errors should be automatically managed, and which ones should just say: this failure is now so serious that somebody has to check it out and fix it manually. And, if possible, automatically recover to the last possible consistent state where data integrity can be verified.

Tests continued

I continued by testing the situation I had about a year or two ago, where the backup system’s file system got corrupted due to a RAID & NTFS failure. In that case I just restored the backup of the backups, yet due to backup schedules I lost some of the latest files. Back then it forced me to restart all backups, because recovery was totally broken, even though the data storage was simply rolled back “one version”.

I just deleted the latest files from the backup data storage.

Backup started at 25.9.2019 8.12.09
Checking remote backup ...
  Listing remote folder ...
Missing file: duplicati-20190924T091416Z.dlist.zip.aes
Missing file: duplicati-bea66de64699443fca42451aad590ce71.dblock.zip.aes
...
Found 43 files that are missing from the remote storage, please run repair
Backend verification failed, attempting automatic cleanup => Found 43 files that are missing from the remote storage, please run repair

Repairing with --rebuild-missing-dblock-files <- looks good, testing it out.

OK, as expected, the rebuild wasn’t possible in this specific case, because I had also deleted lots of blocks created by compaction.

Let’s see what steps are needed to fix the state. (I’ll automate this in my own scripts when necessary; that’s why I need to know.) I deeply hate manual fixes.

list-broken-files, purge-broken-files, repair, verify, restore
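For scripting, that sequence could be wrapped roughly like this - the step names come from this thread, but the executable name and exact argument layout are assumptions to adapt to your own setup:

```python
import subprocess

# Recovery steps in the order listed above; command names as used in the
# thread, invocation shape (CLI name, URL placement) is an assumption.
REPAIR_STEPS = ["list-broken-files", "purge-broken-files",
                "repair", "verify", "restore"]

def build_plan(cli, storage_url, extra_args=()):
    """Return the fixed recovery sequence as ready-to-run argument lists."""
    return [[cli, step, storage_url, *extra_args] for step in REPAIR_STEPS]

def run_plan(plan):
    """Run each step, stopping at the first non-zero exit code.
    Returns (failed step name, exit code), or (None, 0) on full success."""
    for cmd in plan:
        result = subprocess.run(cmd)
        if result.returncode != 0:
            return cmd[1], result.returncode
    return None, 0
```

Separating plan construction from execution makes the sequence easy to log and to dry-run before letting it touch a real backup.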

I could have done one more test, which I think is the most complex: start with a normal, valid backup, then intentionally compact it, and then restore an old database version. Now the database and data storage are out of sync, but there are some extra files which contain data belonging to the missing files. Does the recovery process check the extra files’ content, to see if they contain needed blocks, when repairing / rebuilding blocks?

While running verify after repair, this looks a bit strange:

Failed to process file duplicati-20190918T153015Z.dlist.zip.aes => Value cannot be null.
Parameter name: stream
  Downloading file (541 bytes) ...
...
Failed to process file duplicati-20190921T153015Z.dlist.zip.aes => Value cannot be null.

Something is slightly off - why does it check the same file several times? Or is it assumed to be a data-transfer issue, and that’s why it’s retried?

Interestingly, the verify shows many more errors, especially file size / hash mismatches, which I really wasn’t expecting. The changes I made didn’t touch those files or their sizes at all. - This also seems slightly off. It’s also strange that the files are downloaded multiple times - probably, once again, because the system assumes a data-transfer issue is causing the file mismatch. I’ve also seen similar issues earlier, where there’s just a slight size mismatch. AFAIK Duplicati won’t rewrite or append to files, so what’s causing the size / hash issue? Shouldn’t all files / storage objects be static?

After all of these steps, verify says that there are files with incorrect sizes / hashes; repair says:
remote file duplicati-ia64c5ccb08d842119f94f49905c9187d.dindex.zip.aes is listed as Verified with size 82973 but should be 103165, please verify the sha256 hash “v5G9my6/LS7EIXQpMUVhtBg9brZIiETRJhKQBb+pLZw=”

remote file duplicati-if5ee7b555c664c749c2e7f72e7337d39.dindex.zip.aes is listed as Verified with size 1773933 but should be 180893, please verify the sha256 hash “bXRXEEyHYOawL7FafVvbySMm5Z9dUhJ2TNUPzEyzhWs=”

Failed to perform verification for file: duplicati-ia64c5ccb08d842119f94f49905c9187d.dindex.zip.aes, please run verify; message: Remote verification failure: [Extra, +0bP8nxrab2YH1spbtqSTNSq01AOuIKZQ7JBP+35Das=] => Remote verification failure: [Extra, +0bP8nxrab2YH1spbtqSTNSq01AOuIKZQ7JBP+35Das=]
...
Failed to perform verification for file: duplicati-if5ee7b555c664c749c2e7f72e7337d39.dindex.zip.aes, please run verify; message: Remote verification failure: [Extra, ++JiEgbuo4dt+cJ58dQHGVB4v8P6Or+4BOxhDPqI5do=] => Remote verification failure: [Extra, ++JiEgbuo4dt+cJ58dQHGVB4v8P6Or+4BOxhDPqI5do=]
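Since those messages quote base64-encoded SHA-256 digests, a suspect file that has already been downloaded can be checked by hand with a small sketch like this:

```python
import base64
import hashlib
import os

def file_sha256_b64(path, chunk_size=1 << 20):
    """Base64-encoded SHA-256 of a file, matching the hash format
    quoted in the "please verify the sha256 hash" messages."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return base64.b64encode(digest.digest()).decode("ascii")

def check_volume(path, expected_b64, expected_size):
    """Return (size_ok, hash_ok) for a downloaded remote volume."""
    size_ok = os.path.getsize(path) == expected_size
    return size_ok, file_sha256_b64(path) == expected_b64
```

Comparing both the size and the hash against what repair reports shows whether the remote file or the local record is the one that is wrong.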

And the state can’t be restored. The repair / recovery still doesn’t work as well as it could, because just deleting files for one day should be quite simple to recover from. Yet in this case compaction was also run in between. So sure, it might lead to a situation where there aren’t any fully restorable backup versions left, because I deleted new files. If I had rolled back the data storage, there should be old blocks (extra files) which contain the data required to recover the old versions.

But now we’re again in a situation where the backup won’t run without a full backup reset.

Just to play with it, after those steps I restored the files I had removed, to test the recovery. Now repair deleted the files as extra files. Interestingly, it detected the new index file - but rejected it.

After all tests these errors remain:
remote file duplicati-idff8102b359c48118ecc150105e40256.dindex.zip.aes is listed as Verified with size 1028509 but should be 1450173, please verify the sha256 hash “0f6My7NXgw2NYcC+E7ovwvROs9WmDRiLMNLafXHkmPU=”

remote file duplicati-iae3eae383b3a40a4b591091050c6772d.dindex.zip.aes is listed as Verified with size 142109 but should be 297501, please verify the sha256 hash “ammoBgBhtDeP5tNqQcC1N/d2XKdZJXsHFJ8znFn1ndw=”

I’ll try one more thing: just manually deleting those files to see what happens. Then I’ll run repair, verify and a restore test.

After repair, those files were re-uploaded. But the state still isn’t properly recovered. Repair says that everything is OK, but verify fails:

Failed to process file duplicati-20190920T153015Z.dlist.zip.aes => Value cannot be null.
Parameter name: stream
  Downloading file (31.91 MB) ...
duplicati-20190920T153015Z.dlist.zip.aes: 1 errors
        Error: Value cannot be null.
Parameter name: stream

These are quite simple failure modes and shouldn’t cause any trouble.

Anyway, I just restored the backup database and the data storage state, so everything resumes as normal. But this was still a good test showing that the recovery isn’t nearly perfect yet. And tests for these kinds of situations are missing.

AFAIK, rolling back the data storage one or a few versions should be among the basic recovery tests. Sure, it sucks, but that’s once again something that can happen in the real world, and it should be recoverable.

So both cases - the database rolled back, or the data storage rolled back - should be recoverable. Extra bonus if a compaction between those events is correctly handled by reading the available data from the new / old files which are out of sync.

Anyway, I personally feel that reverse engineering / testing these cases is a bit strange. The best way to approach these things is to make sure the logic handling these situations already covers them. - Yet this is a long-term problem in software development: do we write some code and see what happens, or do we first design what we want to do, specify the failure modes, and properly design, implement and test those? - Sure, it’s tedious, but I think the latter model is the only way to make really solid software.

I absolutely agree with you when the product is being used in a supported way, but I don’t feel as strongly when it is being used incorrectly (in this case - two separate backups writing to the same back end location).

As you noted, this was not a good test due to a compaction event. You didn’t roll back one version; you manually deleted the files with the most recent timestamps on the back end. If instead you had used a true older version of your backup (a backup of the backup), it probably would have worked fine. You’d be able to recreate the database and be on your merry way. (I have done this test successfully more than once.)

Don’t get me wrong, I’m not saying the devs should make no effort to improve robustness of Duplicati when it comes to user error or otherwise intentionally doing something like deleting back end files manually. I personally would prefer they spend time fixing the remaining bugs in core product functionality - there are still some bugs in using supported Duplicati functionality without intentionally trying to break things.


This may be related to a new issue that was introduced in 2.0.4.29:

True. Yet that still shouldn’t prevent recovery from working. Even if this - probably, with my data sets containing lots of long-term data - will lead to a situation where none of the versions is completely intact. Even in that case, technically, state recovery should be successful, even if that state means files X, Y, Z will be broken in all versions.

Backup programs are the genre where I especially expect that. In normal circumstances, if there’s detectable corruption, the recommendation is always to stop and return to the last known good state. But in the case of backup systems, the data is so important that complete failure without a good reason shouldn’t happen.

Yet of course there needs to be a very clear notice to the user if restoring X won’t lead to a perfect restore. But whatever your content is, and in the case of disaster recovery, getting back 98% of your data - even if only 90% of files are completely intact - is still 98% more than no data at all.

That’s one of the reasons why, with my 7-zip backups, I use an additional layer of par2 files, which allows me to restore all data even in case of bit rot in some of the files. - Yet this ECC layer isn’t what I’m asking for; only the option to recover the state (as well as possible) and to allow restoring a state which contains known issues. Of course this is an exception, but it should be possible - ideally at de-duplication block granularity. The blocks which are corrupt or missing could be left as zeroes, so it’s obvious that something is missing from the file.
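The zero-fill idea could be sketched like this - a hypothetical reassembly step that substitutes zeroes for unrecoverable blocks instead of failing the whole restore:

```python
def restore_with_gaps(block_hashes, block_store, block_size):
    """Reassemble a file from its block list, substituting zeroes for
    blocks that are missing or corrupt, so a partially damaged backup
    still yields as much of the file as possible.

    block_hashes: ordered list of block hashes making up the file
    block_store:  dict of hash -> bytes for the blocks we could recover
    block_size:   nominal block size, used for the zero padding
    Returns (file data, count of missing blocks).
    """
    parts, missing = [], 0
    for h in block_hashes:
        block = block_store.get(h)
        if block is None:
            block = b"\x00" * block_size   # obvious gap, not silent loss
            missing += 1
        parts.append(block)
    return b"".join(parts), missing
```

The non-zero missing count gives the restore exactly the kind of clear "this file is 98% intact" warning the paragraph above asks for.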

I’ve been in disaster recovery situations which haven’t been pretty. As an example: reading as much data from a disk as possible where the metadata had been completely lost. In some of those cases, with a known file format, we scanned all the copied clusters from the disk with custom software and reassembled the data as well as possible. I’ve recovered source code, logs, database tables, etc. like that. - That’s a freaking nightmare: very expensive, slow, and with no guarantee of good results. But still a lot better than no data at all. And also a very good reason to keep your backups working. Haha… As a bonus: if you never defrag your drive, often get it full, and then delete a bunch of small files to free up some space - that’s the way to make the job much more fun.