OK, I repeated the whole procedure, probably for the third time now, and I think I’ve got a good clue.
Here’s the summary from the end:
After all this testing, only one conclusion:
The backup is corrupted and good at the same time. There’s clearly flawed and very inconsistent logic somewhere. - Again, this should drive anyone responsible for data integrity absolutely mad, and I personally find it extremely worrying.
This is the reason why I recommend that everyone run full restore tests on a separate system, because the test feature (?) seems to be dangerously and misleadingly broken.
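For clarity, this is roughly what I mean by a full restore test on a separate system (placeholder URL, paths and passphrase; option spellings from memory, so check them against Duplicati’s help first):
"
# On a machine that has no local Duplicati database for this backup:
# --no-local-db forces the restore to work purely from the remote files,
# just like a real disaster-recovery restore would.
Duplicati.CommandLine.exe restore "ftps://backup.example.com/set1" "*" --restore-path="D:\restore-test" --no-local-db --passphrase="..."
"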
Either the database rebuild is somehow broken, or, if it isn’t, the test feature is fatally flawed. Unfortunately I can’t tell exactly where the disparity occurs.
If I test the data set with the original database (from before anything I’ve done here), the test claims the backup is good.
I’ve also confirmed at the data source, where the backups are created, that the file (corrupted-file.dat) currently exists. If the corruption were at the source, the file could easily have been refreshed into the backup on any day, but that isn’t happening: the file has remained broken in the backup set for an extended time. Which probably means the corruption is in blocks that aren’t changed often.
… And then the nitty gritty details …
I started by creating a test environment, where I tried a few “expected” scenarios that could cause this problem. But of course none of my tests triggered the issue. It would have been just so nice to work with data sets of kilobytes instead of tens of gigabytes. - Oh well…
… To be continued … Several hours later …
Log from the external test, run without a local database:
"
2021-05-04 11.40.54 +03 - [Verbose-Duplicati.Library.Main.Operation.RecreateDatabaseHandler-ProcessingBlocklistVolumes]: Pass 3 of 3, processing blocklist volume 149 of 149
2021-05-04 11.42.30 +03 - [Information-Duplicati.Library.Main.Operation.RecreateDatabaseHandler-RecreateCompletedCheckingDatabase]: Recreate completed, verifying the database consistency
"
Output:
"
2021-05-04 11:42:37,899: Duplicati error output:
ErrorID: DatabaseIsBrokenConsiderPurge
Recreated database has missing blocks and 12 broken filelists. Consider using “list-broken-files” and “purge-broken-files” to purge broken data from the remote store and the database.
"
But if I run these tasks from the primary server with the database, it says there’s nothing to fix (test and repair both pass quickly without any action).
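To be explicit about the two cases being compared here (same caveats: placeholder URLs and paths, options from memory):
"
# Test machine: rebuild the database purely from the remote files.
# Pointing --dbpath at a file that doesn't exist yet forces a full recreate;
# this is the run that ends in DatabaseIsBrokenConsiderPurge.
Duplicati.CommandLine.exe repair "ftps://backup.example.com/set1" --dbpath="D:\test\recreated.sqlite" --passphrase="..."

# Primary server: test and repair against the existing local database.
# Both pass without finding anything to fix.
Duplicati.CommandLine.exe test "ftps://backup.example.com/set1" all --dbpath="C:\Duplicati\set1.sqlite" --passphrase="..."
Duplicati.CommandLine.exe repair "ftps://backup.example.com/set1" --dbpath="C:\Duplicati\set1.sqlite" --passphrase="..."
"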
Next: OK, I really don’t want to modify the real production backup, because it’s only slightly broken. I’ll create a copy of the backup set and run repair on it in a separate test environment.
… To be continued … Several hours later …
It progresses quite quickly, as expected, up to this point:
Output:
"
Backend event: Get - Completed: duplicati-ife2631cc922e4e909388950c6c7acf7f.dindex.zip.aes (173.28 KB)
Processing indexlist volume 151 of 151
"
And now it’s taking surprisingly long without any visible progress. Yet file-system I/O statistics reveal that it’s clearly working with the SQLite database and writing data to its temp file.
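(For anyone wanting to check the same thing: SQLite names its on-disk temp files with an etilqs_ prefix, so on Linux something like the following shows the activity; these are generic example commands, not my exact ones.)
"
# Show the SQLite temp files the Duplicati process has open.
lsof -p "$(pgrep -f Duplicati | head -n1)" | grep -i etilqs
# Watch disk throughput while the recreate runs.
iostat -x 5
"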
The actions the repair takes should indicate what the problem with the data set is. The question remains: why did the database rebuild fail earlier with those errors, if there’s nothing wrong with the data?
I’ll need to check what that rebuilt database contains, and whether I can send it to you guys. Or maybe that’s not even necessary after this analysis.
… To be continued … Several hours later …
Now it says:
"
Pass 3 of 3, processing blocklist volume 1 of 149
"
… To be continued … At least an hour later … Still 1 of 149?
It seems that this might take, well, more than several hours. Maybe tomorrow …
I reviewed the verbose log, and according to it there’s nothing wrong with the data or the recreate process. OK. Yet a final verdict arrives anyway.
Based on the earlier error summary, it can be assumed that this is the same case:
"
ErrorID: DatabaseIsBrokenConsiderPurge
Recreated database has missing blocks and 12 broken filelists. Consider using “list-broken-files” and “purge-broken-files” to purge broken data from the remote store and the database.
"
I took a full copy of all the state data at this point (DB, backup blocks, etc.). I’ll now run those steps; if I remember correctly this is quite a quick operation.
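The steps in question, roughly (placeholders as before; if I remember right, --dry-run lets purge-broken-files preview what it would remove before touching anything):
"
# Against the copied backup set, using the recreated database.
Duplicati.CommandLine.exe list-broken-files "ftps://backup.example.com/set1-copy" --dbpath="D:\test\recreated.sqlite" --passphrase="..."
# Preview first, then run again without --dry-run.
Duplicati.CommandLine.exe purge-broken-files "ftps://backup.example.com/set1-copy" --dbpath="D:\test\recreated.sqlite" --passphrase="..." --dry-run
"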
Output from this operation:
"
Operation ListBrokenFiles started
11 : 28.3.2021 21.30.26 (1 match(es))
corrupted-file.dat (783.22 MB)
10 : 4.4.2021 21.30.27 (1 match(es))
corrupted-file.dat (783.22 MB)
9 : 11.4.2021 21.30.29 (1 match(es))
corrupted-file.dat (783.22 MB)
8 : 19.4.2021 21.30.29 (1 match(es))
corrupted-file.dat (783.22 MB)
7 : 26.4.2021 21.30.29 (1 match(es))
corrupted-file.dat (783.22 MB)
6 : 27.4.2021 21.30.28 (1 match(es))
corrupted-file.dat (783.22 MB)
5 : 28.4.2021 21.30.26 (1 match(es))
corrupted-file.dat (783.22 MB)
4 : 29.4.2021 21.30.26 (1 match(es))
corrupted-file.dat (783.22 MB)
3 : 30.4.2021 21.30.26 (1 match(es))
corrupted-file.dat (783.22 MB)
2 : 1.5.2021 21.30.26 (1 match(es))
corrupted-file.dat (783.22 MB)
1 : 2.5.2021 21.30.26 (1 match(es))
corrupted-file.dat (783.22 MB)
0 : 3.5.2021 21.30.29 (1 match(es))
corrupted-file.dat (783.22 MB)
"
So the file is present in many of the backup versions. Why does test then claim that the backup is valid?
Let’s run purge… And the output:
… several similar segments omitted, as unnecessary …
"
Purging 1 file(s) from fileset 3.5.2021 21.30.29
Starting purge operation
Processing filelist volume 1 of 1
Replacing fileset duplicati-20210503T183029Z.dlist.zip.aes with duplicati-20210503T183030Z.dlist.zip.aes which has with 1 fewer file(s) (783.22 MB reduction)
Purging file corrupted-file.dat (783.22 MB)
Writing files to remote storage
Backend event: Put - Started: duplicati-20210503T183030Z.dlist.zip.aes (43.36 KB)
Uploading file (43.36 KB) …
Backend event: Put - Completed: duplicati-20210503T183030Z.dlist.zip.aes (43.36 KB)
Backend event: Delete - Started: duplicati-20210503T183029Z.dlist.zip.aes ()
Deleting file duplicati-20210503T183029Z.dlist.zip.aes …
Backend event: Delete - Completed: duplicati-20210503T183029Z.dlist.zip.aes ()
Database was previously marked as in-progress, checking if it is valid after purging files
Purge completed, and consistency checks completed, marking database as complete
"
Now that it’s good, I’ll rerun test and full restore. Let’s see what happens.
Repair passed, all good.
Test … Now it passed.
Finally, a full restore of the latest version to confirm that restore now works.
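In other words, roughly this sequence (placeholder URLs and paths again; if I remember right, --full-remote-verification makes test check the volume contents more thoroughly than the default sampling):
"
Duplicati.CommandLine.exe repair "ftps://backup.example.com/set1-copy" --dbpath="D:\test\recreated.sqlite" --passphrase="..."
Duplicati.CommandLine.exe test "ftps://backup.example.com/set1-copy" all --full-remote-verification=true --dbpath="D:\test\recreated.sqlite" --passphrase="..."
Duplicati.CommandLine.exe restore "ftps://backup.example.com/set1-copy" "*" --restore-path="D:\restore-test" --dbpath="D:\test\recreated.sqlite" --passphrase="..."
"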
Well, now the restore worked and claimed that everything is OK. But unsurprisingly, the one file that got purged earlier is missing from the restored data set. Technically the restore worked and claimed to be OK, but as we now well know, it isn’t.
I’ll keep everything around for a while, in case additional detail is needed from any of the steps. I’m of course not excluding potential data corruption in the storage system or in transport, even if it should be unlikely. Yet if that’s the case, testing should catch the problem. - This case is repeatable; it’s not some random cosmic radiation from the sun.
I’m also curious whether I should actually proceed and repair the actual backup. So far I’ve only repaired the clone of the backup set, because, as mentioned earlier, test, purge and repair don’t indicate that anything is wrong on the original, so naturally I haven’t touched that data. Yet the fact that a single file is missing after a restore from the backup is highly disturbing.
I also tried to reproduce the problem with a very small backup by tampering with the backup files, but that didn’t lead to this kind of situation, where test indicates the database is good but restore fails. That would have been so much nicer.
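For the record, the attempt was roughly along these lines (simplified, with placeholder paths; a local file destination keeps the round trips fast):
"
# Make a tiny backup, corrupt one dblock by hand, then check whether
# test notices the damage and whether restore fails because of it.
Duplicati.CommandLine.exe backup "file://D:\test-dest" "D:\test-src" --passphrase="..."
# ... flip some bytes in one duplicati-*.dblock.zip.aes file in D:\test-dest ...
Duplicati.CommandLine.exe test "file://D:\test-dest" all --full-remote-verification=true --passphrase="..."
Duplicati.CommandLine.exe restore "file://D:\test-dest" "*" --restore-path="D:\test-restore" --no-local-db --passphrase="..."
"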
Would you prefer to get the database from the source, or the recreated database? Or both? I can run both → create-report from the data source and create-report from the rebuilt database. Does that sound like a plan?
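That would be one create-report run per database, along these lines (output file names are just examples):
"
# Report 1: origin server, with its existing database.
Duplicati.CommandLine.exe create-report "ftps://backup.example.com/set1" report-origin.zip --dbpath="C:\Duplicati\set1.sqlite" --passphrase="..."
# Report 2: test environment, with the database rebuilt from the data files.
Duplicati.CommandLine.exe create-report "ftps://backup.example.com/set1-copy" report-recreated.zip --dbpath="D:\test\recreated.sqlite" --passphrase="..."
"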
P.S. Unfortunately I’m a developer too, so nothing is more annoying than rare, randomly occurring issues that are serious but never happen “on my computer” or with limited test data sets. Only in production, and rarely.
The repair process only touched the dlist files, removing the references to the one file that was missing from the restore.
But I still have the data (the dindex / dblock / dlist files) from before the rebuild / repair. Any specific tests or options that would be preferred?
Some background to answer the questions:
How did I initially find out about this problem?
- As a nerd, I don’t trust anything, so I run scheduled restores from the backup files on all of my backup sets (see the sketch after this list). - With Duplicati this is essential; otherwise you don’t know if the backups are corrupted (!).
- I also run tests with the database from the source servers. - But with Duplicati this seems to be an unreliable and untrustworthy method (!).
- At times I find broken backups in step 1. - This is one of those cases. Sometimes the damage is less serious, and running repair on the source system fixes the issue (yet that shouldn’t happen either!). - But in this case there’s the disparity I’ve reported above.
- Usually when a restore fails, I try to run repair and purges on the source server (yet in this case nothing happened).
- After taking a copy of the data set (Duplicati data files, no database), I ran repair, because I knew the set was already broken. It recommended running the purges, which I did. And after that the restore works (yet the purged files are now missing).
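In principle the scheduled check does something like this (a simplified sketch with placeholder paths, not the actual script):
"
#!/bin/sh
set -e
# 1) Restore the newest version to an empty scratch directory on another
#    machine, without a local database (run through mono on non-Windows hosts).
Duplicati.CommandLine.exe restore "ftps://backup.example.com/set1" "*" --restore-path="/srv/restore-test" --no-local-db --passphrase="..."
# 2) Compare the restored tree against a snapshot of the source.
#    Any difference means a failed backup, no matter what "test" said.
diff -r /srv/source-snapshot /srv/restore-test
"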
I hope this clarifies how I found the problem.
This also makes me very, very worried that many people might have corrupted backups without knowing it, because running test with the database doesn’t reveal the corruption.
If there are any specific questions, I’m happy to answer.
It also seems that the data set I’m talking about was compacted just last night, without problems. But the corrupted blocks probably weren’t touched, because if I remember right, I’ve sometimes seen compaction catch broken data. - I assume it didn’t change anything here.
Let’s see if this can be sorted out. - Thank you!
Edit: Just a note that I’ve delivered the reports to @ts678: Set 1 from the origin server, Set 2 from the environment where I rebuilt the database from the data files.
Edit: Some words about the back end being used.
It’s our own hardware; FTPS is the protocol, served by a common FTPS server, and the storage is NTFS on RAID 5 with WD Gold Enterprise drives. There are scheduled maintenance tasks, like updates and reboots, and no time windows are reserved for them, so they can overlap with backups.
Yet as said earlier, with a correctly working system that shouldn’t make any difference. Any operation can be halted without warning at any time, and afterwards the system should continue from the last known good state, either by rolling forward or rolling back.
Actual data corruption should be highly unlikely in this case, yet it’s not impossible; for example, the system isn’t using ECC memory.
The main point still stands: if there’s actual corruption, test should catch it. And if there isn’t, why do the rebuild and restore fail?