Backup valid, but still unrestorable?

You might be thinking of --auto-cleanup, but I'm not totally convinced you have excess files when it says you have a missing file. You could try feeding both the missing dblock name and the index file referencing it into the affected command, for example by opening the job's Commandline screen: change the command to affected and put the remote file names where the source file list would have been for the default backup command.
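From the command-line client, the same idea might look roughly like this (the GUI Commandline screen takes the same arguments); the URL, database path and file names below are placeholders:

```python
# Hypothetical sketch only: ask Duplicati which source files depend on given
# remote volumes, using the "affected" command in place of "backup".
# The URL, database path and file names are placeholders.
import subprocess

remote_url = "ftps://backup.example.com/duplicati"      # placeholder destination
remote_files = [
    "duplicati-bXXXXXXXX.dblock.zip.aes",               # the missing dblock
    "duplicati-iYYYYYYYY.dindex.zip.aes",               # the dindex referencing it
]

subprocess.run(
    ["Duplicati.CommandLine.exe", "affected", remote_url, *remote_files,
     "--dbpath=C:/path/to/job.sqlite"],                  # placeholder database path
    check=True,
)
```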

Inventory of files that are going to be corrupted gives an example where a damage forecast is done before intentionally creating a missing dblock file and then recovering. The fact that --list-broken-files has no complaint does suggest you may just have an extra dindex. You could also test moving it elsewhere to see whether that helps or hurts.

If you like, you could create a database bug report, put it somewhere, and link to it. I can attempt to study it.

Your failure, if I follow correctly, was on direct restore, so the following might be relevant to getting your log:

View logs for direct restore from backup files

Whether even --log-file-log-level=profiling would help me (it includes SQL, but that's hard to read) isn't clear.

It would be wonderful if Duplicati someday got better at its repair capabilities, but writing code takes time…

That's what I thought before, and it's nice that you thought the same way. Now I've had time to recheck and confirm it. The restore completed in 30 seconds once I manually removed the "offending file". Which clearly proves three points:

  1. File deletion is broken; it doesn't work transactionally.
  2. File cleanup is broken; it failed after step 1 did.
  3. Restore / recovery logic is broken; it failed after steps 1 and 2 did.

That's three overlapping bugs causing one problem. A correctly working program would have had three chances to fix the situation before I posted to this forum.

It’s confirmed and there are three bugs to fix.

Today, while running full DR restore tests, I again found backup sets which are claimed to be good but whose restore still completely fails. Nothing has changed since.

Just cross linking: Is Duplicati 2 ready for production?

Edit: more than one such case already. Let's see what the final statistics will be; I'll be updating this post later. The tests will take several days, but I'll get back.

Finally some results. I only have a 1 Gbit connection, so restores take a long time, but here are the results from a bunch of backup sets, rounded to hide the actual numbers:

81% Good (Code 0)
04% Restore completely failed (Code 100)
15% Repair required (Code 2)

Of course I would very much hope the good cases to be 100%, and any other result should be really rare. The reasons for the complete failures are:

  • DatabaseIsBrokenConsiderPurge
  • ExtraRemoteFiles

Other common warnings:

  • MissingVolumes (Scary, can lead to failure at any time later)

This is just an obvious reminder: regularly run a full restore test. Otherwise you'll end up in a very bad situation sooner or later. And do not use Duplicati as your only backup solution.
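For what it's worth, that restore drill can be scripted; a rough sketch, assuming the command-line client, placeholder URLs and passphrases, and the exit codes quoted above (0 good, 2 repair required, 100 failed):

```python
# Rough sketch of the restore drill: restore the newest version of every
# backup set into a scratch directory and bucket the results by exit code
# (0 = good, 2 = repair required, 100 = restore failed, as quoted above).
# URLs, passphrases and paths are placeholders.
import subprocess
import tempfile

backup_sets = {
    "jobA": ("ftps://backup.example.com/jobA", "passphraseA"),  # placeholders
    "jobB": ("ftps://backup.example.com/jobB", "passphraseB"),
}

results = {}
for name, (url, passphrase) in backup_sets.items():
    with tempfile.TemporaryDirectory() as scratch:
        proc = subprocess.run(
            ["Duplicati.CommandLine.exe", "restore", url, "*",
             f"--restore-path={scratch}",
             f"--passphrase={passphrase}",
             "--no-local-db=true"],       # direct restore, as in a real DR test
            capture_output=True, text=True)
        results[name] = proc.returncode

for name, code in sorted(results.items()):
    print(f"{name}: exit code {code}")
```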

I’m going to purge and repair the problem cases, and retest those to see if integrity gets fully restored.

I've still got one huge (several terabytes) backup to restore. It has been running for several days.

And again, two more backup sets that are good according to all of the tests Duplicati provides, but still impossible to restore because the database is broken.

The program and its logic are inherently broken.

I guess that's all on this topic for now, until the issue gets fixed. In general, though, software like this gives me zero trust.

On top of this, I don't know what the status of one backup is, because restoring it would have taken more than two weeks; I aborted at two weeks. I decided to reset that backup set as well, just to be sure.

Out of curiosity what back end are you using?

Currently FTPS, but it's not about that; similar issues also occur with the local file system, etc. As all developers know, all it takes is one rarely occurring edge case, and that's it. Then it will happen sooner or later.

As an example, with my current testing cycle there will be around 20k+ backup runs, plus separate compacts (which of course mostly do not trigger) between full restore test runs. I log everything and have collected extensive statistical data; all tests will be logged and collected in a single table for further analysis.

After all this, it comes down to a statistical question: what is an acceptable failure level? If after 20k+ backup runs 4% of the backups are broken, is that acceptable or not? I think it's not, but honestly the situation could be much worse in a system as complex as Duplicati is by design.

My personal opinion is that the current flaw is systemic, and it could be remedied at several stages. The best outcome would of course be if every stage were able to handle a situation like this; the issue right now is that none of the stages handles the case.

A classic design / reliability problem. Trust me, I've seen similar issues in countless programs, including many of my own. My motto is simply that if there's a clear malfunction which requires manual intervention, that's something that needs to be fixed. But I've seen so many IT guys out there who "fix it manually" and claim that now it's fixed. Which is of course nonsense, because the root cause isn't fixed; it will happen again sooner or later. So it isn't fixed.

A classic example is writing a file to disk without syncing it, and then updating some other state based on that. If the system is hard reset or loses power at just the right moment, the file (and possibly the dependent data) will be lost. You can deal with this in two ways: sync the data so the OS confirms the file is written to disk and not stored only in the write cache, and/or make the code check afterwards that the file is OK, and if it isn't, redo the task that created it. In critical production systems I've often implemented both options. As an example, someone might manually do something crazy, like delete the temp file. If the code is written badly, it will look for the temp file and crash because it's missing. One bad option would be ignoring the lack of the file and its data; the best thing is to redo the task correctly.
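The pattern described here, as a minimal Python sketch (the general technique, not Duplicati code):

```python
# Minimal sketch of the two defenses described above (general technique, not
# Duplicati code): 1) flush + fsync before updating any state that assumes the
# file exists, and 2) verify afterwards and redo the work instead of crashing
# on a missing or truncated file.
import os

def write_durably(path: str, data: bytes) -> None:
    tmp = path + ".tmp"
    with open(tmp, "wb") as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())      # force the data out of the OS write cache
    os.replace(tmp, path)         # atomic rename on the same filesystem
    # (a fully durable rename would also fsync the containing directory)

def ensure_file(path: str, expected_size: int, redo) -> None:
    # If someone deleted the file or it is truncated, redo the task that
    # produces it rather than crashing or silently ignoring the gap.
    if not os.path.exists(path) or os.path.getsize(path) != expected_size:
        redo()
```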

The more systems, transactions and data you've got, the more often you're going to see the problem.

Because of that, and because many of these are key production systems, I'm very careful with tasks like this. Fixing the mess that bad code can make is an enormous and extremely painful task, which, especially if done manually, can unfortunately lead to an even worse inconsistent state. Sometimes special ad-hoc, case-specific repair programs are required. But that's really time consuming, and even that won't guarantee perfect results if the primary process has failed badly. That's why doing it once, and correctly, is so important.

I intentionally ran a backup job in a way which would have totally messed up older Duplicati versions, and this time it worked as expected. In this case the remote storage was updated from another client (the old server) a few times before I started using an old copy of it on a new server, which of course creates a logical version / integrity conflict.

But the recovery worked as expected. I wonder whether the situation would have been worse if I had compacted the remote storage in between; but if recovery works correctly, even that shouldn't make a major difference.

Version: 2.0.4.30 canary 2019-09-20

Backup
Checking remote backup …
Listing remote folder …
Extra unknown file: duplicati-20190923T153013Z.dlist.zip.aes
Extra unknown file: duplicati-b9a9cc69c87d84c4aa32a3c2e1cb3d38b.dblock.zip.aes
Extra unknown file: duplicati-baa03eaed547e43449d315084131277e3.dblock.zip.aes
Extra unknown file: duplicati-ia18977bcbb524e41a096df73f2c090ad.dindex.zip.aes
Extra unknown file: duplicati-iaa9629029c614d03a2107ff52defd276.dindex.zip.aes
Missing file: duplicati-20190821T153014Z.dlist.zip.aes
Found 5 remote files that are not recorded in local storage, please run repair
Backend verification failed, attempting automatic cleanup => Found 5 remote files that are not recorded in local storage, please run repair
This is something which shouldn't happen; there's still some kind of logic error in the code base.
Failed to read local db XXX.sqlite, error: database is locked
database is locked => database is locked
database is locked
Fatal error => The process cannot access the file because it is being used by another process.

Repair
Listing remote folder …
Deleting file duplicati-20190923T153013Z.dlist.zip.aes (39,20 KT) …
Deleting file duplicati-b9a9cc69c87d84c4aa32a3c2e1cb3d38b.dblock.zip.aes (31,91 MT) …
Deleting file duplicati-baa03eaed547e43449d315084131277e3.dblock.zip.aes (139,42 KT) …
Downloading file (unknown) …
Failed to accept new index file: duplicati-ia18977bcbb524e41a096df73f2c090ad.dindex.zip.aes, message: Volume duplicati-b9a9cc69c87d84c4aa32a3c2e1cb3d38b.dblock.zip.aes has local state Deleting => Volume duplicati-b9a9cc69c87d84c4aa32a3c2e1cb3d38b.dblock.zip.aes has local state Deleting
Deleting file duplicati-ia18977bcbb524e41a096df73f2c090ad.dindex.zip.aes (1,35 MT) …
Downloading file (unknown) …
Failed to accept new index file: duplicati-iaa9629029c614d03a2107ff52defd276.dindex.zip.aes, message: Volume duplicati-baa03eaed547e43449d315084131277e3.dblock.zip.aes has local state Deleting => Volume duplicati-baa03eaed547e43449d315084131277e3.dblock.zip.aes has local state Deleting
Deleting file duplicati-iaa9629029c614d03a2107ff52defd276.dindex.zip.aes (52,37 KT) …
Uploading file (541 bytes) …

But after running the repair manually, the backup worked. Let's restore it and see whether that works as well; usually in this kind of situation the backup got messed up and became unrestorable.

Test
Listing remote folder …
Examined 87 files and found no errors

So far so good… Let's restore the backup and see the final verdict; the restore often used to fail in this kind of situation.

Restore
Restored 785 (4,54 GT) files to XXX
Duration of restore: 00:07:09

Phew… Honestly, this is starting to look really good, finally. When the next beta comes out, I'll need to push the update to all clients, hoping this will also fix the unrestorable-database cases in the future.

TempDir - Once again
Anyway, --tempdir is once again totally broken and doesn't work, at least when restoring: neither via the TMPDIR environment variable nor as a command-line parameter. Kind of legendary. I seriously dislike inconsistent code. It's a really minor thing, but it would vastly improve restore performance in this case. And at the same time it's such a simple thing that I wonder how it can be broken all the time. It must say something about bad development processes and habits?
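For reference, these are the two routes a temporary directory is normally expected to be picked up (the report above is that neither takes effect on restore in this build); paths and the URL are placeholders:

```python
# The two routes for giving a temporary directory: the TMPDIR environment
# variable and the --tempdir option. Paths and the URL are placeholders;
# the complaint above is that neither is honored on restore.
import os
import subprocess

env = dict(os.environ, TMPDIR="/mnt/fast-scratch")      # environment variable route

subprocess.run(
    ["Duplicati.CommandLine.exe", "restore", "ftps://backup.example.com/job", "*",
     "--restore-path=/restore/scratch",
     "--tempdir=/mnt/fast-scratch"],                    # command-line option route
    env=env, check=True)
```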

I do find it puzzling that you are intentionally trying to do things like that… but I guess it’s good it was able to recover!

Can you open an issue for this on GitHub? If, as you say, it's a simple fix, it should be easy to resolve.

Thanks for the update!

Why is that strange? In robust systems, all possible failure models and outcomes should already be defined in the unit tests. If there's something you can break, then something has been done wrong. See: Chaos engineering - Wikipedia

There should be two departments, one creating the software and another trying to break it. This reveals the issues that so many programs are full of: what if something goes wrong, is the system designed and implemented correctly, what should happen, and so on. Robust systems deal with it; most systems break down. There are things you can't recover from, but all recoverable issues should be dealt with automatically.

Of course this includes the classic question of which kinds of errors should be handled automatically, and which should just say that this failure is now so serious that somebody has to check it out and fix it manually, and, if possible, automatically recover to the last consistent state where data integrity can be verified.
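As a sketch of what such a "break it on purpose" test could look like against a throwaway local-folder backup (everything here, including paths, URL, passphrase and the exact recovery sequence, is hypothetical and simplified):

```python
# Hypothetical fault-injection test: take a known-good backup in a throwaway
# local-folder destination, break it in one deliberate way, then assert that
# the documented recovery path brings it back to a restorable state.
import pathlib
import random
import subprocess

REMOTE = pathlib.Path("/srv/backup-target/testjob")     # placeholder destination
URL = f"file://{REMOTE}"

def duplicati(command: str, *extra: str) -> int:
    return subprocess.run(
        ["Duplicati.CommandLine.exe", command, URL, *extra, "--passphrase=test"],
        check=False).returncode

def inject_fault() -> None:
    # One failure model per run: delete, truncate or corrupt a random volume.
    victim = random.choice(list(REMOTE.glob("duplicati-*.zip.aes")))
    action = random.choice(["delete", "truncate", "flip"])
    if action == "delete":
        victim.unlink()
    elif action == "truncate":
        victim.write_bytes(victim.read_bytes()[:-1024])
    else:
        data = bytearray(victim.read_bytes())
        data[len(data) // 2] ^= 0xFF
        victim.write_bytes(bytes(data))

def test_recovery_after_fault() -> None:
    inject_fault()
    duplicati("repair")
    duplicati("purge-broken-files")
    assert duplicati("test") == 0, "test should pass after recovery"
    assert duplicati("restore", "*", "--restore-path=/tmp/drill",
                     "--no-local-db=true") == 0, "restore should succeed"
```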

Tests continued

I continued by testing the situation I had a year or two ago, where the backup system's file system got corrupted due to a RAID & NTFS failure. In that case I just restored the backup of the backups, though due to backup schedules I lost some of the latest files. Back then it forced me to restart all backups, because recovery was totally broken even though the data storage was simply rolled back "one version".

I just deleted the latest files from the backup data storage.

Backup started at 25.9.2019 8.12.09
Checking remote backup ...
  Listing remote folder ...
Missing file: duplicati-20190924T091416Z.dlist.zip.aes
Missing file: duplicati-bea66de64699443fca42451aad590ce71.dblock.zip.aes
...
Found 43 files that are missing from the remote storage, please run repair
Backend verification failed, attempting automatic cleanup => Found 43 files that are missing from the remote storage, please run repair

repairing with --rebuild-missing-dblock-files <- Looks good, testing it out.

OK, as expected the rebuild wasn't possible in this specific case, because I had also deleted lots of blocks created by compaction.

Let's see what steps are needed to fix the state. (I'll automate this in my own scripts when necessary; that's why I need to know.) I deeply hate manual fixes.

list-broken-files, purge-broken-files, repair, verify, restore
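Roughly what scripting that sequence might look like with the command-line client; the URL, passphrase and paths are placeholders, "verify" is assumed to map to the test command, and error handling is left out:

```python
# Sketch of scripting that sequence with the command-line client. URL,
# passphrase and paths are placeholders; "verify" is assumed to map to the
# test command, and error handling is left out.
import subprocess

URL = "ftps://backup.example.com/job"                    # placeholder
COMMON = ["--passphrase=***", "--dbpath=/var/lib/duplicati/job.sqlite"]

steps = [
    ["list-broken-files", URL],
    ["purge-broken-files", URL],
    ["repair", URL],
    ["test", URL, "all"],                                # the "verify" step
    ["restore", URL, "*", "--restore-path=/restore/drill"],
]

for step in steps:
    subprocess.run(["Duplicati.CommandLine.exe", *step, *COMMON], check=True)
```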

There's one more test I could have done, which I think is the most complex: take a normal valid backup, intentionally compact it, and then restore an old database version. Now the database and the data storage are out of sync, but there are some extra files which contain the data belonging to the missing files. Does the recovery process check the content of the extra files to see whether they contain the needed blocks when repairing / rebuilding?

While running verify after repair, this looks a bit strange:

Failed to process file duplicati-20190918T153015Z.dlist.zip.aes => Value cannot be null.
Parameter name: stream
  Downloading file (541 bytes) ...
...
Failed to process file duplicati-20190921T153015Z.dlist.zip.aes => Value cannot be null.

Something is slightly off; why does it check the same file several times? Or is it assumed to be a data transfer issue, and that's why it's retried?

Interesting: verify is showing many more errors, especially file size / hash mismatches, which I really wasn't expecting. The changes I made didn't touch those files or sizes at all, so this also seems slightly off. It's also strange that the files are downloaded multiple times, probably again because the system assumes a data transfer issue is causing the mismatch. I've seen similar issues earlier, where there's just a slight size mismatch. AFAIK Duplicati won't rewrite or append to files, so what's causing the size / hash issue? Shouldn't all files / storage objects be static? What's causing this?

After all of these steps, verify says that there are files with incorrect sizes / hashes, and repair says:
remote file duplicati-ia64c5ccb08d842119f94f49905c9187d.dindex.zip.aes is listed as Verified with size 82973 but should be 103165, please verify the sha256 hash “v5G9my6/LS7EIXQpMUVhtBg9brZIiETRJhKQBb+pLZw=”

remote file duplicati-if5ee7b555c664c749c2e7f72e7337d39.dindex.zip.aes is listed as Verified with size 1773933 but should be 180893, please verify the sha256 hash “bXRXEEyHYOawL7FafVvbySMm5Z9dUhJ2TNUPzEyzhWs=”

Failed to perform verification for file: duplicati-ia64c5ccb08d842119f94f49905c9187d.dindex.zip.aes, please run verify; message: Remote verification failure: [Extra, +0bP8nxrab2YH1spbtqSTNSq01AOuIKZQ7JBP+35Das=] => Remote verification failure: [Extra, +0bP8nxrab2YH1spbtqSTNSq01AOuIKZQ7JBP+35Das=]
...
Failed to perform verification for file: duplicati-if5ee7b555c664c749c2e7f72e7337d39.dindex.zip.aes, please run verify; message: Remote verification failure: [Extra, ++JiEgbuo4dt+cJ58dQHGVB4v8P6Or+4BOxhDPqI5do=] => Remote verification failure: [Extra, ++JiEgbuo4dt+cJ58dQHGVB4v8P6Or+4BOxhDPqI5do=]

And the state can't be restored. The repair / recovery still doesn't work as well as it could, because just deleting one day's files should be quite simple to recover from. In this case compaction was also run in between, so sure, that might lead to a situation where no fully restorable backup version is left, because I deleted new files. If I had rolled back the data storage instead, there should be old blocks (extra files) containing the data required to recover the old versions.

But now we're again in the situation where the backup won't run without a full backup reset.

Just to play with it, after those steps I restored the files I had removed, to test the recovery. Now repair deleted them as extra files. Interestingly, it detected the new index file but rejected it.

After all the tests, these errors remain:
remote file duplicati-idff8102b359c48118ecc150105e40256.dindex.zip.aes is listed as Verified with size 1028509 but should be 1450173, please verify the sha256 hash “0f6My7NXgw2NYcC+E7ovwvROs9WmDRiLMNLafXHkmPU=”

remote file duplicati-iae3eae383b3a40a4b591091050c6772d.dindex.zip.aes is listed as Verified with size 142109 but should be 297501, please verify the sha256 hash “ammoBgBhtDeP5tNqQcC1N/d2XKdZJXsHFJ8znFn1ndw=”

I'll try one more thing: manually deleting those files to see what happens. Then I'll run repair, verify and a restore test.

After repair, those files were re-uploaded, but the state still isn't properly recovered. Repair says that everything is OK, but verify fails:

Failed to process file duplicati-20190920T153015Z.dlist.zip.aes => Value cannot be null.
Parameter name: stream
  Downloading file (31,91 MT) ...
duplicati-20190920T153015Z.dlist.zip.aes: 1 errors
        Error: Value cannot be null.
Parameter name: stream

These are quite simple failure models, and shouldn’t cause any trouble.

Anyway, I just restored the backup database and the data storage state, so everything resumes as normal. But this was still a good demonstration that recovery isn't nearly perfect yet, and that tests for these kinds of situations are missing.

AFAIK, rolling the data storage back one or a few versions should be among the basic recovery tests. Sure it sucks, but it's once again something that can happen in the real world, and it should be recoverable.

So both cases, database rolled back or data storage rolled back, should be recoverable. An extra bonus would be if a compaction between those events were handled correctly by reading the available data from the new / old files that are out of sync.

Anyway, I personally feel that reverse engineering / testing these cases is a bit strange; the best approach is to make sure the logic handling these situations already covers them. This is a long-term problem in software development: do we write some code and see what happens, or do we first decide what we want to do, specify the failure models, and properly design, implement and test those? Sure it's tedious, but I think the latter model is the only way to make really solid software.

I absolutely agree with you when the product is being used in a supported way, but I don’t feel as strongly when it is being used incorrectly (in this case - two separate backups writing to the same back end location).

As you noted, this was not a good test due to a compaction event. You didn’t roll back one version, you manually deleted files on the back end with the most recent timestamps. If instead you actually did use a true older version of your backup (backup of backup), it probably would have worked fine. You’d be able to recreate the database and be on your merry way. (I have done this test successfully more than once.)

Don’t get me wrong, I’m not saying the devs should make no effort to improve robustness of Duplicati when it comes to user error or otherwise intentionally doing something like deleting back end files manually. I personally would prefer they spend time fixing the remaining bugs in core product functionality - there are still some bugs in using supported Duplicati functionality without intentionally trying to break things.


This may be related to a new issue that was introduced in 2.0.4.29:

True. Yet that still shouldn't prevent recovery from working, even if with my data sets, which contain lots of long-term data, it will probably lead to a situation where none of the versions is completely intact. Even in that case, technically the state recovery should succeed, even if the resulting state means that files X, Y and Z are broken in all versions.

Backup programs are the genre where I especially expect that. In normal circumstances, if there's detectable corruption, the recommendation is always to stop and return to the last known good state. But with backup systems the data is so important that complete failure without a good reason shouldn't happen.

Of course there needs to be a very clear notice to the user if restoring X won't lead to a perfect restore. But whatever your content is, in a disaster recovery situation getting back 98% of the data, even if only 90% of the files are completely intact, is still 98% more than no data at all.

That's one of the reasons why I use an additional layer of par2 files with my 7-zip backups; it allows me to restore all the data even if some of the files suffer bit rot. Yet this kind of ECC isn't what I'm asking for here, only the option to recover the state as well as possible and to allow restoring a state which contains known issues. Of course this is an exception, but it should be possible, at something like deduplication-block granularity: blocks which are corrupt or missing could be left as zeroes, so it's obvious that something is missing from the file.
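That parity layer sits outside Duplicati; with par2cmdline it is roughly this (the archive path is a placeholder, and the exact option syntax may vary by par2 build):

```python
# Sketch of the parity layer (outside Duplicati), using par2cmdline: create
# about 10% recovery data next to the archive, verify it later, and repair if
# bit rot is found. The archive path is a placeholder; check the option
# syntax of your par2 build.
import subprocess

archive = "/backups/weekly.7z"                                       # placeholder

subprocess.run(["par2", "create", "-r10", archive], check=True)      # ~10% redundancy
subprocess.run(["par2", "verify", archive + ".par2"], check=True)    # detect damage
# subprocess.run(["par2", "repair", archive + ".par2"], check=True)  # fix if needed
```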

I've been in disaster recovery situations which haven't been pretty: for example, reading as much data as possible from a disk where the metadata has been completely lost. In some of those cases, with a known file format, we've scanned all the copied clusters from the disk with custom software and reassembled the data as well as possible. I've recovered source code, logs, database tables, etc. like that. It's a freaking nightmare: very expensive, slow, with no guarantee of good results, but still a lot better than no data at all, and also a very good reason to keep your backups working. Haha… As a bonus, if you never defrag your drive, frequently let it fill up, and then delete a bunch of small files to free up some space, that's the way to make the job much more fun.

I've got a similar situation again, and I'm wondering what exactly the correct steps are to repair the backup data files. To make it really short and compact:

  1. Backup works.
  2. Restore (with db) works.
  3. Testing the backup (with db) doesn't indicate any problems.
  4. Restore (without db) doesn't work; it ends with an error that the backup is corrupted.
  5. If I run repair and then restore, it fixes step 4.
  6. This is a really annoying systemic inconsistency, which drives anyone who cares about logic, data consistency and integrity totally crazy.

What's the process to bring the backup state back to consistent? Repair does change a bunch of files; is it safe to run the repair on a secondary system, or will it break the primary backup? And why doesn't the primary system fix the problems? It always says that everything is good, except it just isn't.

I think problems like this badly undermine trust and reliability, when the situation is very confusing and there's no way to say: OK, I acknowledge the problem, fix it. So this is exactly like the situation in the first post; unfortunately nothing has changed since then.

I’m now running the latest version: 2.0.6.0_experimental_2021-04-09.

  • Thank you

Need some details. Did it make a log in the usual spot for per-backup logs? How about server log?
I’d think one of those would have something, but if it’s the per-backup log it might not have enough.
The per-backup log is summarized to a message's first line; the detail in later lines is skipped. The server live log, a log file, or reporting
(such as email or http) are pretty good on detail, but if it gets to actual debugging, even more is needed.

Best way to debug is to figure out how to reproduce failure, then let an expert examine with favorite debugging tools active. They might use a profiling log, series of databases (and a DB browser), etc.

It is possible to submit a database bug report which is sanitized for privacy, but hand analysis gets impossible quickly if it’s a situation where thousands of things must be checked. Much easier to let computer do such checks and give messages to guide manual search, so first step is messages.

For a lightweight message set, you might set up About → Show log → Live → Verbose, or a similar log file.
This log level might reveal source paths, unlike the Information or Retry levels, which are safer.
If posting more detailed logs, you can also just look through them first to see whether they contain only random destination names, etc.
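For the direct-restore case, the file-based equivalent of that Verbose live log might look roughly like this from the command line (URL and paths are placeholders):

```python
# Roughly how a Verbose (or Profiling) file log could be captured for the
# direct-restore case from the command line. URL and paths are placeholders.
import subprocess

subprocess.run(
    ["Duplicati.CommandLine.exe", "restore", "ftps://backup.example.com/job", "*",
     "--restore-path=/restore/drill",
     "--no-local-db=true",
     "--log-file=/tmp/restore-debug.log",
     "--log-file-log-level=Verbose"],     # Profiling adds SQL but is much noisier
    check=True)
```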

You can also get logs of Repair; they might say what it did. You can also check the destination file content:
did the number of files change? Sorting by timestamp, are there any new files? This level of thing
suits Information and Retry situations well. You can also make DB bug reports before and after the
Repair operation to compare them; however (as usual) the easiest hope is that the action was logged.
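For a local-folder (or locally mounted) destination, that before/after check could be scripted roughly like this; paths are placeholders, and for FTPS the listing would come from the server instead:

```python
# Sketch of the before/after comparison for a local-folder (or mounted)
# destination: snapshot name, size and mtime of every remote volume, run the
# Repair, snapshot again, and print what changed. Paths are placeholders.
import pathlib

def snapshot(folder: str) -> dict:
    return {p.name: (p.stat().st_size, int(p.stat().st_mtime))
            for p in pathlib.Path(folder).glob("duplicati-*")}

before = snapshot("/srv/backup-target/job")
input("Run the Repair now, then press Enter...")
after = snapshot("/srv/backup-target/job")

for name in sorted(set(before) | set(after)):
    if before.get(name) != after.get(name):
        status = ("added" if name not in before
                  else "removed" if name not in after else "changed")
        print(f"{name}: {status}")
```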

Agreed, but little information is given. Some is difficult to gather, thus the focus on getting messages.
The database should always be recreatable from the destination files, typically using Recreate button.
Direct restore from backup files is a Recreate tailored for that restore. It’s also a temporary database,
which makes me wonder how you get from step 4 (restore) to step 5 (Repair). Could you clarify that?

Although clearly next Beta (assuming Experimental goes Beta) isn’t fixing everything (that will be slow, especially with the difficulty of reproducing problems for analysis, and the field debugging challenges), hopefully the “similar situation” just means not everything was fixed (expected), and it’s not a new bug.

If I recall, you have multiple systems. If across all of them you get failures at a pretty steady rate, these possibly could be good chances for investigation. Investigations can do little without failure specifics…

OK, I repeated all the stuff I've now done probably three times, and I think I've got a good clue.

Here’s the summary from the end:

After all this testing only one conclusion:

The backup is corrupted and good at the same time. There's clearly flawed and very inconsistent logic somewhere. Again, this should drive anyone responsible for data integrity absolutely mad, and I personally find it extremely worrying.

This is the reason why I recommend that everyone run full restore tests on an alternate system, because the test feature (?) seems to be very dangerously and misleadingly broken.

Either the database rebuild is somehow broken, or, if it isn't, the test feature is fatally flawed. Unfortunately I can't tell exactly where the disparity occurs.

If I test the data set with the original database (before any of these things I’ve done here) the test claims the backup is good.

I've also confirmed, at the data source where the backups are created, that the file (corrupted-file.dat) does currently exist. If there were any corruption at the source, it could easily have been refreshed into the backup any day, but that isn't happening; the file remains broken in the backup set for an extended time. Which probably means the corruption is in blocks that aren't changed often.

… And then the nitty gritty details …

I started by creating a test environment where I tried a few "expected" scenarios that might cause this problem, but of course none of my tests triggered the issue. It would have been so nice to work with data sets measured in kilobytes instead of tens of gigabytes. Oh well…

… To be continued … Several hours later …

Log from external test without local db:
"
2021-05-04 11.40.54 +03 - [Verbose-Duplicati.Library.Main.Operation.RecreateDatabaseHandler-ProcessingBlocklistVolumes]: Pass 3 of 3, processing blocklist volume 149 of 149
2021-05-04 11.42.30 +03 - [Information-Duplicati.Library.Main.Operation.RecreateDatabaseHandler-RecreateCompletedCheckingDatabase]: Recreate completed, verifying the database consistency
"

Output:
"
2021-05-04 11:42:37,899: Duplicati error output:
ErrorID: DatabaseIsBrokenConsiderPurge
Recreated database has missing blocks and 12 broken filelists. Consider using “list-broken-files” and “purge-broken-files” to purge broken data from the remote store and the database.
"

But if I run these tasks from the primary server with the database, it says there's nothing to fix (test and repair both pass quickly without any action).

Next: OK, I really don't want to modify the real production backup, because it's only slightly broken. I'll create a copy of the backup set and run repair on it in a separate test environment.

… To be continued … Several hours later …

It progresses quite quickly, as expected, to this point:

Output:
"
Backend event: Get - Completed: duplicati-ife2631cc922e4e909388950c6c7acf7f.dindex.zip.aes (173,28 KT)
Processing indexlist volume 151 of 151
"

And now it's taking surprisingly long without any apparent progress. Yet file system I/O statistics reveal that it's clearly working with the SQLite database and writing data to its temp file.

The actions the repair takes should indicate what the problem with the data set is. The question remains: why did the database rebuild fail earlier with those errors, if there's nothing wrong with the data?

I'll need to check what that rebuilt database contains and whether I can send it to you guys. Or maybe that's not even necessary after this analysis.

… To be continued … Several hours later …

Now it says:
"
Pass 3 of 3, processing blocklist volume 1 of 149
"

… To be continued … At least hour later … Still 1 of 149?

It seems that this might take well, more than several hours. Maybe tomorrow …

I reviewed the verbose log, and there's nothing wrong with the data or the recreate process. OK. Yet the final verdict is:

Based on the earlier error summary, it can be assumed that this is the same case:
"
ErrorID: DatabaseIsBrokenConsiderPurge
Recreated database has missing blocks and 12 broken filelists. Consider using “list-broken-files” and “purge-broken-files” to purge broken data from the remote store and the database.
"

I took a full copy of all the state data at this point: DB, backup blocks, etc. I'll now run those steps; if I remember correctly, this is quite a quick operation.

Output from this operation:
"
The operation ListBrokenFiles has started
11 : 28.3.2021 21.30.26 (1 match(es))
corrupted-file.dat (783,22 MT)
10 : 4.4.2021 21.30.27 (1 match(es))
corrupted-file.dat (783,22 MT)
9 : 11.4.2021 21.30.29 (1 match(es))
corrupted-file.dat (783,22 MT)
8 : 19.4.2021 21.30.29 (1 match(es))
corrupted-file.dat (783,22 MT)
7 : 26.4.2021 21.30.29 (1 match(es))
corrupted-file.dat (783,22 MT)
6 : 27.4.2021 21.30.28 (1 match(es))
corrupted-file.dat (783,22 MT)
5 : 28.4.2021 21.30.26 (1 match(es))
corrupted-file.dat (783,22 MT)
4 : 29.4.2021 21.30.26 (1 match(es))
corrupted-file.dat (783,22 MT)
3 : 30.4.2021 21.30.26 (1 match(es))
corrupted-file.dat (783,22 MT)
2 : 1.5.2021 21.30.26 (1 match(es))
corrupted-file.dat (783,22 MT)
1 : 2.5.2021 21.30.26 (1 match(es))
corrupted-file.dat (783,22 MT)
0 : 3.5.2021 21.30.29 (1 match(es))
corrupted-file.dat (783,22 MT)
"

So there is a broken file in many of the backup versions. Why does test then claim that the backup is valid?

Let’s run purge… And the output:
… several similar segments omitted, as unnecessary …
"
Purging 1 file(s) from fileset 3.5.2021 21.30.29
Starting purge operation
Processing filelist volume 1 of 1
Replacing fileset duplicati-20210503T183029Z.dlist.zip.aes with duplicati-20210503T183030Z.dlist.zip.aes which has with 1 fewer file(s) (783,22 MT reduction)
Purging file corrupted-file.dat (783,22 MT)
Writing files to remote storage
Backend event: Put - Started: duplicati-20210503T183030Z.dlist.zip.aes (43,36 KT)
Uploading file (43,36 KT) …
Backend event: Put - Completed: duplicati-20210503T183030Z.dlist.zip.aes (43,36 KT)
Backend event: Delete - Started: duplicati-20210503T183029Z.dlist.zip.aes ()
Deleting file duplicati-20210503T183029Z.dlist.zip.aes …
Backend event: Delete - Completed: duplicati-20210503T183029Z.dlist.zip.aes ()
Database was previously marked as in-progress, checking if it is valid after purging files
Purge completed, and consistency checks completed, marking database as complete
"

Now that it's supposedly good, I'll rerun the test and a full restore. Let's see what happens.

Repair passed, all good.

Test … Now it passed.

Finally full restore of latest version to confirm that restore is now working.

Well, now the restore worked and claims that everything is OK. But unsurprisingly, the one file that got purged earlier is missing from the restored data set. Technically the restore worked and claimed to be OK, but as we now well know, it isn't.

I'll keep the data for a while in case additional detail is needed from any of the steps. I'm of course not excluding potential data corruption in the storage system or transport, even if it should be unlikely. Yet if that's the case, testing should catch the problem. This case is repeatable; it's not some random cosmic radiation from the sun.

I'm also curious whether I should actually proceed and repair the real backup; so far I've only repaired the clone of the backup set. As mentioned earlier, test, purge and repair don't indicate that anything is wrong, so naturally I haven't touched the data. Yet the fact that a single file is missing after restoring from the backup is highly disturbing.

I also tried to reproduce the problem with a very small backup by tampering with the backup files, but that didn't lead to this kind of situation where test indicates the DB is good but restore fails. That would have been so much nicer.

Would you prefer to get the database from the source, or the recreated database? Or both? I can run both: create a report from the data source and create a report from the rebuilt database. Does that sound like a plan?

P.S. Unfortunately I'm a developer too, so nothing is more annoying than rare, randomly occurring issues which are serious but never happen "on my computer" or with limited test data sets; only in production, and rarely. :smiling_face_with_three_hearts:

The repair process only touched the dlist files, removing references to the one file missing from the restore.

But I still have the data (dindex, dblock and dlist files) from before I did the rebuild / repair. Are there any specific tests or options that would be preferred?

Some background to answer the questions:

How did I initially find out this problem?

  1. As a nerd, I don't trust anything: I run scheduled restores from backup files on all of my backup sets. With Duplicati this is essential, otherwise you don't know whether the backups are corrupted (!).
  2. I also run tests with the database on the source servers, but with Duplicati this seems to be an unreliable and untrustworthy method (!).
  3. At times I find broken backups in step 1; this is one of those cases. Sometimes it's something less serious, and running repair on the source system fixes the issue (which shouldn't happen either!), but in this case there's the disparity I've reported above.
  4. Usually when a restore fails, I try to run repair and purges on the source server (yet in this case nothing happened).
  5. After taking a copy of the data set (Duplicati data files, no database), I ran repair, because I knew it was already broken. It recommended running the purges, I did that, and after that the restore works (although the purged files are now missing).

I hope this clarifies how I found out the problem.

:exclamation: This also makes me very, very worried that many people might have corrupted backups and not know about it, because running test with the database doesn't reveal the corruption. :exclamation:

If there are any specific questions, I’m happy to answer.

It also seems that the data set I'm talking about was compacted just last night, without problems, but probably the corrupted blocks weren't touched. If I remember right, I've sometimes seen compaction catch broken data. I assume it didn't change anything here.

Let’s see if this can be sorted out. - Thank you!

Edit: Just a note that I've delivered the reports to @ts678: set 1 from the origin server, set 2 from the environment where I rebuilt the database from the data files.

Edit: Some words about the back end being used.

It's our own hardware; FTPS is the protocol, it runs a common FTPS server, and the storage is NTFS on RAID 5 with WD Gold enterprise drives. There are some scheduled maintenance tasks, like updates and reboots, and no time windows are reserved for those to avoid overlapping with backups.

Yet as said earlier, with a correctly working system that shouldn't make any difference: any operation can be halted without warning at any time, and after that the system should continue from the last known good state, either rolling forward or rolling back.

Actual data corruption in this case should be highly unlikely, yet it's not impossible; for example, the system isn't using ECC memory.

The main point is still that if there's actual corruption, test should catch it; and if there isn't, why does rebuild / restore fail?

Just bumping: this issue still persists with the very latest version. Never, ever trust any test or backup verification; only a clean restore will tell you whether the data is restorable or not. I'm just fixing a few backup sets again manually.

delete database
run repair
run list-broken-files
run purge-broken-files
run repair
run backup
run compact (optional, but recommended)

Slow and painful, and without these steps the backup stays broken. This should not happen.

Btw, this is happening a lot: around 10% of backups are broken in this specific way six months after the last test, with two daily backup runs, which is really worrying. Yes, the Duplicati jobs are aborted at times, and the file server also aborts some transfers in certain situations, but with correctly working programs this shouldn't be a problem. It actually happens so often that I've now scripted the fix.

But it's no help if the brown stuff hits the propeller and there's a need to restore data which can't be restored.

It's worth noting that if the tests are run locally on the server with the database, no errors are reported. So the real horror is only revealed when there's an actual need to do a disaster recovery restore.

Debug work hit an impasse when you seemingly would not allow your database bug report to be shared with anybody but me. There is a small chance that Duplicati staff could help, especially if original author could be persuaded to get involved. See the PMs on this for details. I asked twice, on May 8 and May 13.

To summarize the unusual situation I saw in the recreated database, the Block table normally has a Size (limited by your configured blocksize which defaults to 102400, but not all are that big) and information on which destination volume the block is in (e.g. so that a restore knows which dblock file it must download).

Sorting by size, here's where the recreated database somehow went into not knowing the block source and saying that the block size is negative; the most negative was -21487616. I don't know whether this is a specific unusual usage that has special meaning to an expert in the recreate process, or whether something just goofed. If so, did the goof begin from bad remote data? All this requires more expertise than I can offer…

[Screenshot: recreated database's Block table sorted by Size, showing negative sizes and unknown volume references]
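The kind of check described above could also be run directly against a copy of the recreated database with something like this; the Block table with its Size and VolumeID columns is as described here, but treat the exact schema and the "negative VolumeID means unknown source" convention as assumptions to verify against your version:

```python
# Sketch: list blocks in a recreated database whose Size is negative or larger
# than --blocksize, or whose volume reference is missing. Schema details
# (Block.Size, Block.VolumeID, negative VolumeID for unknown source) are
# assumptions to verify against the actual database version.
import sqlite3

BLOCKSIZE = 102400                                       # default --blocksize
con = sqlite3.connect("/path/to/recreated.sqlite")       # placeholder path

rows = con.execute(
    """SELECT ID, Size, VolumeID
         FROM Block
        WHERE Size < 0 OR Size > ? OR VolumeID < 0
        ORDER BY Size ASC""",
    (BLOCKSIZE,)).fetchall()

for block_id, size, volume_id in rows:
    print(f"block {block_id}: size={size}, volume={volume_id}")
```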

Are you SamiLehtinen on GitHub? I see that person opened an issue for development, but I don’t see this. Perhaps going in that way will get some people involved. Bumping a topic on a user forum might not help. Even if people get involved, they will probably need to work with you a certain amount. I hope you will help.

Sure you can share the backup debug dump with the core Duplicati team.

If required, I can probably also provide new similar reports, if there's a need to confirm this with the latest version, because it seems to be a recurring thing.

This is something which does happen with the current setup, where we don't "care" about the backup jobs being aborted for various reasons.

  Deleting file duplicati-20210805T161112Z.dlist.zip.aes (41,79 KT) ...
  Deleting file duplicati-20210812T161113Z.dlist.zip.aes ...
  Deleting file duplicati-20210812T161113Z.dlist.zip.aes ...
  Deleting file duplicati-20210812T161113Z.dlist.zip.aes ...
  Deleting file duplicati-20210812T161113Z.dlist.zip.aes ...
  Deleting file duplicati-20210812T161113Z.dlist.zip.aes ...
Fatal error => The remote server returned an error: (550) File unavailable (e.g., file not found, no access).

This happens, for example, if the backup destination server reboots while the remote backup(s) are running. If the client behaves correctly in a transactional way, it shouldn't be a problem, but if the client's logic is broken, situations like this make integrity issues much more likely than they would be otherwise. Something like this happening during a backup or compaction could well trigger the situation I'm seeing. Of course it's "unlikely", but with about 365 backup runs per set every six months, and plenty of backup sets, the chances accumulate that something like this will happen.

Finally, confirming that that is my GitHub account. I still don't know what the root cause is, but today I fixed tens of backups using two different methods depending on backup size: a full reset (delete the database and the destination data and rerun), and, for cases where history is important, deleting just the database and running the commands mentioned above. Right now I'm again running full restore batch jobs to confirm whether these measures actually lead to a consistent backup state. I'll update this post with the results when I get them; I assume it'll take several hours.

The repair (recreate) / re-backup tasks ended up causing around 10 TB of traffic each way (20 TB combined) to and from the backup back end, which is by the way the latest FileZilla Server, in case that rings any bells as a cause of problems (?).

– Rebuild / Repair + full restore process confirmation data –
Haha, let's start with a confession: one failure was due to an invalid decryption passphrase.

Some larger backups were too slow to repair, so I had to delete them; some smaller sets were repaired successfully.

But in general, I assume I'll find more (fatally!) broken backups when I do this check next time. I think backup testing should be done more often, because the current round clearly shows that backup data corruption beyond restorability seems to be a "normal feature".

I'll have to continue testing later; this is way too slow and I've got no time for it right now.

This might be one of the reasons why I'm seeing this pattern: the database rebuild probably doesn't handle extra files very well. (Of course, while the database is being rebuilt, it's kind of hard to tell which files should be excluded and which are necessary.)