OK, I repeated the whole procedure, probably for the third time now, and I think I’ve got a good clue.
Here’s the summary from the end:
After all this testing, only one conclusion:
The backup is corrupted and good at the same time. There’s clearly flawed and very inconsistent logic somewhere. - Again, this should drive anyone responsible for data integrity absolutely mad, and I personally find it extremely worrying.
This is the reason why I recommend that everyone run full restore tests on a separate system, because the test feature (?) seems to be dangerously and misleadingly broken.
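For clarity, this is roughly what I mean by a full restore test on a separate system (placeholder URL, paths and passphrase; option spellings from memory, so check them against Duplicati’s help first):
"
# On a machine that has no local Duplicati database for this backup:
# --no-local-db forces the restore to work purely from the remote files,
# just like a real disaster-recovery restore would.
Duplicati.CommandLine.exe restore "ftps://backup.example.com/set1" "*" --restore-path="D:\restore-test" --no-local-db --passphrase="..."
"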
Either the database rebuild is somehow broken, or, if it isn’t, the test feature is fatally flawed. Unfortunately I can’t tell exactly where the disparity occurs.
If I test the data set with the original database (from before anything I’ve done here), the test claims the backup is good.
I’ve also confirmed at the data source, where the backups are created, that the file (corrupted-file.dat) currently exists. If the corruption were at the source, the file could easily have been refreshed into the backup on any day, but that isn’t happening: the file has remained broken in the backup set for an extended time. Which probably means the corruption is in blocks that aren’t changed often.
… And then the nitty gritty details …
I started by creating a test environment, where I tried a few “expected” scenarios that could cause this problem. But of course none of my tests triggered the issue. It would have been just so nice to work with data sets of kilobytes instead of tens of gigabytes. - Oh well…
… To be continued … Several hours later …
Log from the external test, run without a local database:
"
2021-05-04 11.40.54 +03 - [Verbose-Duplicati.Library.Main.Operation.RecreateDatabaseHandler-ProcessingBlocklistVolumes]: Pass 3 of 3, processing blocklist volume 149 of 149
2021-05-04 11.42.30 +03 - [Information-Duplicati.Library.Main.Operation.RecreateDatabaseHandler-RecreateCompletedCheckingDatabase]: Recreate completed, verifying the database consistency
"
Output:
"
2021-05-04 11:42:37,899: Duplicati error output:
ErrorID: DatabaseIsBrokenConsiderPurge
Recreated database has missing blocks and 12 broken filelists. Consider using “list-broken-files” and “purge-broken-files” to purge broken data from the remote store and the database.
"
But if I run these tasks from the primary server with the database, it says there’s nothing to fix (test and repair both pass quickly without any action).
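To be explicit about the two cases being compared here (same caveats: placeholder URLs and paths, options from memory):
"
# Test machine: rebuild the database purely from the remote files.
# Pointing --dbpath at a file that doesn't exist yet forces a full recreate;
# this is the run that ends in DatabaseIsBrokenConsiderPurge.
Duplicati.CommandLine.exe repair "ftps://backup.example.com/set1" --dbpath="D:\test\recreated.sqlite" --passphrase="..."

# Primary server: test and repair against the existing local database.
# Both pass without finding anything to fix.
Duplicati.CommandLine.exe test "ftps://backup.example.com/set1" all --dbpath="C:\Duplicati\set1.sqlite" --passphrase="..."
Duplicati.CommandLine.exe repair "ftps://backup.example.com/set1" --dbpath="C:\Duplicati\set1.sqlite" --passphrase="..."
"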
Next: OK, I really don’t want to modify the real production backup, because it’s only slightly broken. I’ll create a copy of the backup set and run repair on it in a separate test environment.
… To be continued … Several hours later …
It progresses quite quickly, as expected, up to this point:
Output:
"
Backend event: Get - Completed: duplicati-ife2631cc922e4e909388950c6c7acf7f.dindex.zip.aes (173.28 KB)
Processing indexlist volume 151 of 151
"
And now it’s taking surprisingly long without any visible progress. Yet file-system I/O statistics reveal that it’s clearly working with the SQLite database and writing data to its temp file.
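(For anyone wanting to check the same thing: SQLite names its on-disk temp files with an etilqs_ prefix, so on Linux something like the following shows the activity; these are generic example commands, not my exact ones.)
"
# Show the SQLite temp files the Duplicati process has open.
lsof -p "$(pgrep -f Duplicati | head -n1)" | grep -i etilqs
# Watch disk throughput while the recreate runs.
iostat -x 5
"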
The actions the repair takes should indicate what the problem with the data set is. The question remains: why did the database rebuild fail earlier with those errors, if there’s nothing wrong with the data?
I’ll need to check what that rebuilt database contains, and whether I can send it to you guys. Or maybe that’s not even necessary after this analysis.
… To be continued … Several hours later …
Now it says:
"
Pass 3 of 3, processing blocklist volume 1 of 149
"
… To be continued … At least an hour later … Still 1 of 149?
It seems that this might take, well, more than several hours. Maybe tomorrow …
I reviewed the verbose log, and according to it there’s nothing wrong with the data or the recreate process. OK. Yet a final verdict arrives anyway.
Based on the earlier error summary, it can be assumed that this is the same case:
"
ErrorID: DatabaseIsBrokenConsiderPurge
Recreated database has missing blocks and 12 broken filelists. Consider using “list-broken-files” and “purge-broken-files” to purge broken data from the remote store and the database.
"
I took a full copy of all the state data at this point (DB, backup blocks, etc.). I’ll now run those steps; if I remember correctly this is quite a quick operation.
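The steps in question, roughly (placeholders as before; if I remember right, --dry-run lets purge-broken-files preview what it would remove before touching anything):
"
# Against the copied backup set, using the recreated database.
Duplicati.CommandLine.exe list-broken-files "ftps://backup.example.com/set1-copy" --dbpath="D:\test\recreated.sqlite" --passphrase="..."
# Preview first, then run again without --dry-run.
Duplicati.CommandLine.exe purge-broken-files "ftps://backup.example.com/set1-copy" --dbpath="D:\test\recreated.sqlite" --passphrase="..." --dry-run
"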
Output from this operation:
"
Operation ListBrokenFiles started
11 : 28.3.2021 21.30.26 (1 match(es))
corrupted-file.dat (783.22 MB)
10 : 4.4.2021 21.30.27 (1 match(es))
corrupted-file.dat (783.22 MB)
9 : 11.4.2021 21.30.29 (1 match(es))
corrupted-file.dat (783.22 MB)
8 : 19.4.2021 21.30.29 (1 match(es))
corrupted-file.dat (783.22 MB)
7 : 26.4.2021 21.30.29 (1 match(es))
corrupted-file.dat (783.22 MB)
6 : 27.4.2021 21.30.28 (1 match(es))
corrupted-file.dat (783.22 MB)
5 : 28.4.2021 21.30.26 (1 match(es))
corrupted-file.dat (783.22 MB)
4 : 29.4.2021 21.30.26 (1 match(es))
corrupted-file.dat (783.22 MB)
3 : 30.4.2021 21.30.26 (1 match(es))
corrupted-file.dat (783.22 MB)
2 : 1.5.2021 21.30.26 (1 match(es))
corrupted-file.dat (783.22 MB)
1 : 2.5.2021 21.30.26 (1 match(es))
corrupted-file.dat (783.22 MB)
0 : 3.5.2021 21.30.29 (1 match(es))
corrupted-file.dat (783.22 MB)
"
So the file is present in many of the backup versions. Why does test then claim that the backup is valid?
Let’s run purge… And the output:
… several similar segments omitted, as unnecessary …
"
Purging 1 file(s) from fileset 3.5.2021 21.30.29
Starting purge operation
Processing filelist volume 1 of 1
Replacing fileset duplicati-20210503T183029Z.dlist.zip.aes with duplicati-20210503T183030Z.dlist.zip.aes which has with 1 fewer file(s) (783.22 MB reduction)
Purging file corrupted-file.dat (783.22 MB)
Writing files to remote storage
Backend event: Put - Started: duplicati-20210503T183030Z.dlist.zip.aes (43.36 KB)
Uploading file (43.36 KB) …
Backend event: Put - Completed: duplicati-20210503T183030Z.dlist.zip.aes (43.36 KB)
Backend event: Delete - Started: duplicati-20210503T183029Z.dlist.zip.aes ()
Deleting file duplicati-20210503T183029Z.dlist.zip.aes …
Backend event: Delete - Completed: duplicati-20210503T183029Z.dlist.zip.aes ()
Database was previously marked as in-progress, checking if it is valid after purging files
Purge completed, and consistency checks completed, marking database as complete
"
Now that it’s good, I’ll rerun test and full restore. Let’s see what happens.
Repair passed, all good.
Test … Now it passed.
Finally, a full restore of the latest version to confirm that restore now works.
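In other words, roughly this sequence (placeholder URLs and paths again; if I remember right, --full-remote-verification makes test check the volume contents more thoroughly than the default sampling):
"
Duplicati.CommandLine.exe repair "ftps://backup.example.com/set1-copy" --dbpath="D:\test\recreated.sqlite" --passphrase="..."
Duplicati.CommandLine.exe test "ftps://backup.example.com/set1-copy" all --full-remote-verification=true --dbpath="D:\test\recreated.sqlite" --passphrase="..."
Duplicati.CommandLine.exe restore "ftps://backup.example.com/set1-copy" "*" --restore-path="D:\restore-test" --dbpath="D:\test\recreated.sqlite" --passphrase="..."
"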
Well, now the restore worked and claimed that everything is OK. But unsurprisingly, the one file that got purged earlier is missing from the restored data set. Technically the restore worked and claimed to be OK, but as we now well know, it isn’t.
I’ll keep everything around for a while, in case additional detail is needed from any of the steps. I’m of course not excluding potential data corruption in the storage system or in transport, even if it should be unlikely. Yet if that’s the case, testing should catch the problem. - This case is repeatable; it’s not some random cosmic radiation from the sun.
I’m also curious whether I should actually proceed and repair the actual backup. So far I’ve only repaired the clone of the backup set, because, as mentioned earlier, test, purge and repair don’t indicate that anything is wrong on the original, so naturally I haven’t touched that data. Yet the fact that a single file is missing after a restore from the backup is highly disturbing.
I also tried to reproduce the problem with a very small backup by tampering with the backup files, but that didn’t lead to this kind of situation, where test indicates the database is good but restore fails. That would have been so much nicer.
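For the record, the attempt was roughly along these lines (simplified, with placeholder paths; a local file destination keeps the round trips fast):
"
# Make a tiny backup, corrupt one dblock by hand, then check whether
# test notices the damage and whether restore fails because of it.
Duplicati.CommandLine.exe backup "file://D:\test-dest" "D:\test-src" --passphrase="..."
# ... flip some bytes in one duplicati-*.dblock.zip.aes file in D:\test-dest ...
Duplicati.CommandLine.exe test "file://D:\test-dest" all --full-remote-verification=true --passphrase="..."
Duplicati.CommandLine.exe restore "file://D:\test-dest" "*" --restore-path="D:\test-restore" --no-local-db --passphrase="..."
"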
Would you prefer to get the database from the source, or the recreated database? Or both? I can run both → create-report from the data source and create-report from the rebuilt database. Does that sound like a plan?
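That would be one create-report run per database, along these lines (output file names are just examples):
"
# Report 1: origin server, with its existing database.
Duplicati.CommandLine.exe create-report "ftps://backup.example.com/set1" report-origin.zip --dbpath="C:\Duplicati\set1.sqlite" --passphrase="..."
# Report 2: test environment, with the database rebuilt from the data files.
Duplicati.CommandLine.exe create-report "ftps://backup.example.com/set1-copy" report-recreated.zip --dbpath="D:\test\recreated.sqlite" --passphrase="..."
"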
P.S. Unfortunately I’m a developer too, so nothing is more annoying than rare, randomly occurring issues that are serious but never happen “on my computer” or with limited test data sets. Only in production, and rarely.
The repair process only touched the dlist files, removing the references to the one file that was missing from the restore.
But I still have the data (the dindex / dblock / dlist files) from before the rebuild / repair. Any specific tests or options that would be preferred?
Some background to answer the questions:
How did I initially find out about this problem?
- As a nerd, I don’t trust anything, so I run scheduled restores from the backup files on all of my backup sets (see the sketch after this list). - With Duplicati this is essential; otherwise you don’t know if the backups are corrupted (!).
- I also run tests with the database from the source servers. - But with Duplicati this seems to be an unreliable and untrustworthy method (!).
- At times I find broken backups in step 1. - This is one of those cases. Sometimes the damage is less serious, and running repair on the source system fixes the issue (yet that shouldn’t happen either!). - But in this case there’s the disparity I’ve reported above.
- Usually when a restore fails, I try to run repair and purges on the source server (yet in this case nothing happened).
- After taking a copy of the data set (Duplicati data files, no database), I ran repair, because I knew the set was already broken. It recommended running the purges, which I did. And after that the restore works (yet the purged files are now missing).
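In principle the scheduled check does something like this (a simplified sketch with placeholder paths, not the actual script):
"
#!/bin/sh
set -e
# 1) Restore the newest version to an empty scratch directory on another
#    machine, without a local database (run through mono on non-Windows hosts).
Duplicati.CommandLine.exe restore "ftps://backup.example.com/set1" "*" --restore-path="/srv/restore-test" --no-local-db --passphrase="..."
# 2) Compare the restored tree against a snapshot of the source.
#    Any difference means a failed backup, no matter what "test" said.
diff -r /srv/source-snapshot /srv/restore-test
"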
I hope this clarifies how I found the problem.
This also makes me very, very worried that many people might have corrupted backups without knowing it, because running test with the database doesn’t reveal the corruption.
If there are any specific questions, I’m happy to answer.
It also seems that the data set I’m talking about was compacted just last night, without problems. But the corrupted blocks probably weren’t touched, because if I remember right, I’ve sometimes seen compaction catch broken data. - I assume it didn’t change anything here.
Let’s see if this can be sorted out. - Thank you!
Edit: Just a note that I’ve delivered the reports to @ts678: Set 1 from the origin server, Set 2 from the environment where I rebuilt the database from the data files.
Edit: Some words about the back end being used.
It’s our own hardware; FTPS is the protocol, served by a common FTPS server, and the storage is NTFS on RAID 5 with WD Gold Enterprise drives. There are scheduled maintenance tasks, like updates and reboots, and no time windows are reserved for them, so they can overlap with backups.
Yet as said earlier, with a correctly working system that shouldn’t make any difference. Any operation can be halted without warning at any time, and afterwards the system should continue from the last known good state, either by rolling forward or rolling back.
Actual data corruption should be highly unlikely in this case, yet it’s not impossible; for example, the system isn’t using ECC memory.
The main point still stands: if there’s actual corruption, test should catch it. And if there isn’t, why do the rebuild and restore fail?