FTP delete not atomic / verified / transactional (?)

I think I’ve been asking what problems that causes. Some are known. Some are in the process of being fixed.

had been aging quite a while. Credit to @Jojo-1000 for jumping in. It still needs review, release, etc.

Read Error during compact forgot a dindex file deletion, getting Missing file error next run. #4129.

which basically follows

Its planned solution marks the dindex file as Deleting (just like its dblock) for later cleanup.
It’s not exactly a DB transaction issue, but it is a transaction issue, because it needs destination rollback.
Getting the state information fully (not half) stored in advance lets the destination get consistent after a crash.
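In pseudocode terms, the idea is roughly this two-phase pattern (a hedged sketch of the concept, not the actual PR code; the table and column names here are illustrative only, and the backend object is hypothetical):

import sqlite3

def delete_volume_pair(db: sqlite3.Connection, backend, dblock: str, dindex: str):
    # Phase 1: persist the intent for BOTH files before touching the destination.
    with db:  # the connection context manager commits (or rolls back) as a unit
        db.executemany(
            "UPDATE remotevolume SET state = 'Deleting' WHERE name = ?",
            [(dblock,), (dindex,)])
    # Phase 2: do the remote deletes; a crash here leaves 'Deleting' rows behind,
    # so the next run knows these files are expected to disappear and can finish
    # the cleanup instead of reporting them as unexpectedly missing.
    for name in (dblock, dindex):
        backend.delete(name)
        with db:
            db.execute("DELETE FROM remotevolume WHERE name = ?", (name,))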

were some of my statistics. Timeout means a kill. A “missing” is probably the dindex issue from above.
In CLI output, it’s looking for ErrorID: MissingRemoteFiles. An “extra” is a similar search of the output.
After interrupted backup, next backup failed with remote files that are not recorded in local storage #4485 is possibly that one, but although there are steps to reproduce, it’s not clear if a compact is required…
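The searches themselves are nothing fancy; roughly this kind of tally over a captured CLI log (my own sketch, with a hypothetical log file name):

import re
from collections import Counter

counts = Counter()
with open("backup_run.log", encoding="utf-8") as log:   # hypothetical captured CLI output
    for line in log:
        if m := re.search(r"ErrorID:\s*(\S+)", line):
            counts[m.group(1)] += 1   # e.g. MissingRemoteFiles
for error_id, n in counts.most_common():
    print(error_id, n)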

Because you’re a developer, would you have any interest in building the compact PR for your testing?
You do seem to have a good environment to test hard kills more meaningfully than my little test script.

You’ve probably also seen a few of these. Any characterization of what tends to lead to this big pain?
Maybe you have logs? I’m still looking for these, even doing a database recreate after every backup.
Because that’s rather slow, I also wrote a Python script to look over dlist and dindex files, to forecast.
But just like “a watched pot never boils”, the problem seems not to happen with two tools watching it.
If I gave you the script, would you have a slow-to-recreate destination that it could be tested against?

Not in this big thread, for sure. If you have a solid issue, best thing is to open a GitHub issue with steps.

The occasional visits are wonderful. Available time still appears short. It’d be great to see Kenneth more. Meanwhile, interim maintainer @gpatel-fr is doing lots of things and @Jojo-1000 is having a great start. There is still an abundant backlog of issues and pull requests to chip away at. Volunteers are still sought.

I wonder if that would be more clear if we added an extra stage for this to the progress bar, especially because it always takes so long.

Regarding the StopNow issue, I have updated all backends on the .NET 5 cancellation token branch to use a cancellation token. The operations should be able to be cancelled cleanly and quickly (except some backends use libraries that don’t allow cancellation). It still needs to be integrated into the task abort logic, though. I think that just does a thread kill right now.

Obviously such a big change will require a lot of testing for every backend, so that will still take a while until it is ready for release.

and trying to mix in other issues just makes it worse… But here we are – until it moves elsewhere.
Database recreate desperately needs improvement #4041 does offer some other options, such as

at least I would like to see a warning message if there are more than perhaps 10,000 metadata records to the effect that “Hey, this is going to be really, really slow and might not work; you might rather throw away all your savesets and start over instead.”

The specifics might not match, but the ideas are reasonable if they can somehow be wedged into UI.
Wedging into GUI might be more possible than having CLI (maybe in a script) stop to query the user.

The “throw away all” is probably improvable. There are probably specific blocks being looked for, and the impact of missing blocks can likely be found by SQL, maybe similar to affected or *-broken-files.
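For example, something roughly like this against the local job database could give a first estimate (a sketch only, using just the tables and columns that appear in the query quoted later in this thread; it is not what Duplicati itself runs, and the list of missing hashes has to come from some other check):

import sqlite3

def files_affected_by_missing_blocks(db_path: str, missing_hashes: list[str]) -> int:
    # Count FileLookup rows whose data blockset references a block whose hash
    # another tool has decided is absent from the destination.
    if not missing_hashes:
        return 0
    con = sqlite3.connect(db_path)
    placeholders = ",".join("?" * len(missing_hashes))
    sql = f"""
        SELECT COUNT(DISTINCT "ID") FROM "FileLookup"
        WHERE "BlocksetID" IN (
            SELECT DISTINCT "BlocksetID" FROM "BlocksetEntry"
            WHERE "BlockID" IN (
                SELECT "ID" FROM "Block" WHERE "Hash" IN ({placeholders})))"""
    (count,) = con.execute(sql, missing_hashes).fetchone()
    con.close()
    return count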

In the 90% to 100% range, it’s tough to watch it move pixel by pixel. Need to use live log to see activity.
My preference would be to figure out a quick way to forecast such a problem. Don’t know if it’s possible.
I haven’t read through the cited issue and all the others to see if anyone has given us clues. And Sami?

My current solution to recovering blocks is to do what I’ve said in many other threads: delete the backup sets and start fresh, assuming that the source data is still available. Attempting anything else seems to be mostly unreliable and consumes a lot of time and effort.

Of course the only optimal solution would be for the situation where blocks are lost to never happen in the first place. There’s no guarantee that the lost data can be recovered.

But it would be important that users recognize the difference if and when they say that restore / rebuild db is slow, because the difference is huge.

Here are the errors from backup restore tests.

ErrorID: DatabaseIsBrokenConsiderPurge
Recreated database has missing blocks and 11 broken filelists. Consider using "list-broken-files" and "purge-broken-files" to purge broken data from the remote store and the database.

ErrorID: DatabaseIsBrokenConsiderPurge
Recreated database has missing blocks and 11 broken filelists. Consider using "list-broken-files" and "purge-broken-files" to purge broken data from the remote store and the database.

ErrorID: DatabaseIsBrokenConsiderPurge
Recreated database has missing blocks and 3 broken filelists. Consider using "list-broken-files" and "purge-broken-files" to purge broken data from the remote store and the database.

ErrorID: DatabaseIsBrokenConsiderPurge
Recreated database has missing blocks and 11 broken filelists. Consider using "list-broken-files" and "purge-broken-files" to purge broken data from the remote store and the database.

This is actually quite interesting. These are all from four different backup sets, but the errors are strangely very, very similar. Yes, I’ve made sure of it: even if it seems suspect, these errors are collected from totally independent backup sets. I did immediately doubt it myself as well.

I’ll rerun the tests 28 hours later, in case there was something going on and the error was temporary. Let’s see if the errors magically disappear, yet I wouldn’t be so sure about that.

The percentage of corrupted backups has been going down. The missing file issue that was already referenced in this thread sounded good, and when it’s in a binary release, it’ll probably fix the most common reason why taking backups fails until repair is run.

Let’s say that the number of failed restores was still around 5% (rounded), which isn’t great. It should be at most one failure per year when it’s “working”. Currently I test backups twice a year, and fix all the broken backup sets (by deleting and recreating them).

Edit 1:

I did manually check all of the corrupted backups, and the b (dblock) and i (dindex) file counts at least match. As said, there’s very clearly some systematic error causing this issue. It couldn’t otherwise be the same issue and a similar kind of situation on four different backup sets.

Edit 2:

I also ran list-broken-files, purge-broken-files, and repair; nothing changed. There’s nothing wrong (as I’ve said earlier), yet the backup is broken and can’t be restored.

I also ran these tasks twice, with the previous and the latest version.

Is this database or files? If you have a preserved damaged database, that could be helpful.
We can either log Duplicati’s failed recreate better, or let my Python script look for troubles.

This is what I want to be able to forecast. If current tools can’t predict it, it needs better tools.
What to do to solve it is a different issue, but the surprise failures are bad to find at DR time.

People should of course be testing DR and DB recreate sometimes, but very few will do that.
Additionally, things could break between tests, so it’d be nice to understand this corruption…

My general assumption from long recreates, and ones that run long and still come up empty, is that
something is missing. There are others here who know the exact SQL check better, but I took
a kind of brute-force approach of reverse-engineering the file formats to check for missing blocks.

Database and destination internals has some output. I’ve decided duplicates are likely normal.
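For context, the sets are built by walking the decrypted files, roughly like this (simplified from checker12.py; the in-archive layout of the vol/ and list/ entries is what I found by inspection, so treat it as an assumption rather than a format spec, and the folder name is a placeholder):

import base64, hashlib, json, zipfile
from pathlib import Path

dindex_block_set = set()       # every block hash a dindex says some dblock holds
dindex_blocklist_set = set()   # every blocklist stored in a dindex "list/" entry

for dindex in Path("decrypted").glob("*dindex*.zip"):   # decrypted dindex copies
    with zipfile.ZipFile(dindex) as z:
        for name in z.namelist():
            if name.startswith("vol/"):
                # each vol/ entry describes one dblock and the blocks it holds
                for b in json.loads(z.read(name))["blocks"]:
                    dindex_block_set.add(b["hash"])
            elif name.startswith("list/"):
                # a blocklist's own hash is the hash of its content
                data = z.read(name)
                h = base64.b64encode(hashlib.sha256(data).digest()).decode()
                if h in dindex_blocklist_set:
                    print("Duplicate blocklist", h)
                dindex_blocklist_set.add(h)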

Here’s its ending, where it tries to find missing information of different sorts. Can you use this?

print(len(dindex_block_set), "blocks seen in dindex files")
print(len(dindex_blocklist_set), "blocklists seen in dindex files")
if missing := dindex_blocklist_set - dindex_block_set:
    print("Missing", missing)
print(len(dlist_blocklist_set), "blocks used by dlist files blocklists")
if missing := dlist_blocklist_set - dindex_blocklist_set:
    print("Missing", missing)
print(len(dindex_blocklist_block_set), "blocks used by dlist large files data")
if missing := dindex_blocklist_block_set - dindex_block_set:
    print("Missing", missing)
print(len(dlist_block_set), "blocks used by dlist small files data")
if missing := dlist_block_set - dindex_block_set - {"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU="}:
    print("Missing", missing)
print(len(dlist_meta_block_set), "blocks used by dlist files metadata")
if missing := dlist_meta_block_set - dindex_block_set:
    print("Missing", missing)
print(len(dindex_block_set - dlist_block_set - dlist_blocklist_set - dindex_blocklist_block_set - dlist_meta_block_set), "blocks unused")
print(len(dlist_blockset_set), "large blocksets in dlist files")
print(len(dlist_block_set), "small blocksets in dlist files")
print("small file blocksets that are also metadata blocksets:", dlist_block_set.intersection(dlist_meta_block_set))
print("small file blocksets that are also blocklists:", dlist_block_set.intersection(dindex_blocklist_set))

Operations log:
Retrieve original databases from source systems to test environment.
Copy full backups from destination repositories to test environment.
Backup set identifiers used in this document: we have four broken backup sets, called CW, JP, TU, WS
Let’s start with the smallest backup set for speed: TU

Modifying the CMD decrypt script (snippet) for my environment. It’s obvious that the cmd script wasn’t complete. No problem.

Run the python script to analyze the back end data.

Decrypted dlists and iblocks & python script:

Duplicate blocklist WbSrRA5LRNDoGge2nAF7xPD34hbnQ8vwkxm3bbzlbdE=
Duplicate blocklist WbSrRA5LRNDoGge2nAF7xPD34hbnQ8vwkxm3bbzlbdE=
Duplicate blocklist WbSrRA5LRNDoGge2nAF7xPD34hbnQ8vwkxm3bbzlbdE=
Duplicate blocklist WbSrRA5LRNDoGge2nAF7xPD34hbnQ8vwkxm3bbzlbdE=
Duplicate blocklist WbSrRA5LRNDoGge2nAF7xPD34hbnQ8vwkxm3bbzlbdE=
Duplicate blocklist WbSrRA5LRNDoGge2nAF7xPD34hbnQ8vwkxm3bbzlbdE=
Duplicate blocklist WbSrRA5LRNDoGge2nAF7xPD34hbnQ8vwkxm3bbzlbdE=
Duplicate blocklist WbSrRA5LRNDoGge2nAF7xPD34hbnQ8vwkxm3bbzlbdE=
Duplicate blocklist WbSrRA5LRNDoGge2nAF7xPD34hbnQ8vwkxm3bbzlbdE=
35969 blocks seen in dindex files
1666 blocklists seen in dindex files
1666 blocks used by dlist files blocklists
19796 blocks used by dlist large files data
2285 blocks used by dlist small files data
4753 blocks used by dlist files metadata
7469 blocks unused
1664 large blocksets in dlist files
2285 small blocksets in dlist files
small file blocksets that are also metadata blocksets: set()
small file blocksets that are also blocklists: set()

And then the classic steps, which I’ve repeated countless times with similar results, which tell me that… it’s just as bad as expected. Test says OK; repair / rebuild / restore (without db) says it’s broken; restore fails.

Summary (full logs available privately, if necessary).

Test:
Examined 31 files and found no errors

Rebuild / repair:

ErrorID: DatabaseIsBrokenConsiderPurge
Recreated database has missing blocks and 11 broken filelists. Consider using "list-broken-files" and "purge-broken-files" to purge broken data from the remote store and the database.

Restore:
Failed to restore file: "filename". File hash is NCt6A6KvhOJty4/pro6+BqK6wTRVR32xxiR6nO2IUI0=, expected hash is jhc6T6QrE48Uk3Z/B1BQK6Ayza83vSX6ll7e1/0j3DI= => Failed to restore file: "filename". File hash is NCt6A6KvhOJty4/pro6+BqK6wTRVR32xxiR6nO2IUI0=, expected hash is jhc6T6QrE48Uk3Z/B1BQK6Ayza83vSX6ll7e1/0j3DI=

This made me think, and I’ve got one extra question: when and how is the file hash determined for the source file? This could potentially be one source of the problem, if it’s not done using a “single pass read”, which is of course the only viable option.

During checking the destination backup folders I also found some

Because the errors with the other datasets were practically similar, I’ll skip doing the same steps with those. But I’ll keep this debug data frozen, stored statically for all sets, until the end of the month.

I’m going to reset all of the backup sets and start fresh, as usual, and repair the broken ones outside this kept state.

Feel free to reference this post in the rebuild & other discussions.

HTH

The python script in your output does not find any missing blocks. The database verify thinks some blocks are missing. So which analysis is correct?

Are you using test with full remote verification? If yes then you probably found a case where the test command logic is too simple, or maybe it is not supposed to catch these kinds of errors. It’s possible that it only checks that the block files contain what they say, not that all blocks are there somewhere.

These errors are probably due to the missing blocks. I am not sure, but I think the direct restore does not do a full database integrity check (that causes repair to fail), so it does not notice data is missing. When the files are restored, the missing blocks are probably left empty and the hash does not match.

The questions are:

  • are there missing blocks?
  • if so, why are blocks missing in the first place? Especially only a few, so it’s not due to a failed upload. Might be another compact bug
  • why does list/purge broken files not detect these and remove them?

In this case, did you do a database repair or database recreate? Why was repair working, but in your later tests it returned an error?

I wish I knew. But I assume that there’s some data missing, because the restore fails. Exactly this disparity is the issue that has been driving me mad.

TBH, normally when I run tests, it’s always with full remote verification. But this time I used a modified version of the CLI / CMD script provided by ts678, and in this case the full verification option was missing. But because you asked, I did rerun the test with the full-remote-verification option; let’s see.

Same result; using full-remote-verification makes no difference whatsoever.
Examined 31 files and found no errors.

Agreed, sounds very much plausible.

My primary suspect, as said. An aborted compact seems to be the situation where things get broken. First the compaction takes a very long time, and then it gets stopped by something like a Windows update / system reboot. At one stage (a long time ago), there was some missing file error, which was interesting: the database only got broken after a compact. So the database was OK; with missing files, repair was run, and it was still OK. But after compact it was broken (restore failed), while test still said OK. I’m not 100% sure, I don’t have the facts anymore, but this is a very strong intuition based on previous tests, from when I was hunting the sources of this “covert corruption”.

It does, if I run this with a repaired database. I tried fixing the broken backups at one point by rebuilding the database, running list/purge/repair from the rebuilt database, and then rerunning repair from the “source”. This was just my bad attempt to recover the backup sets as much as possible, without a full reset and rebuild. But afaik it didn’t work out, even if as a plan it sounds plausible: first detect errors, delete the broken data, and then from the source detect the changes made and re-upload the missing blocks, if possible. Of course this won’t fix “historical” corruption, but it would make the latest backup set solid with fresh data. → Nope, that didn’t work out. Even if everything is working as expected, it should make some sense. Right?

I always use the repair command to rebuild the database. Is that wrong? Afaik, if the database is missing, recreate is called by repair, restore, or whatever, when the database isn’t available. Did I get something wrong with that? As far as I know, it’s the same thing: if the database is missing, it’s first rebuilt before running repair, restore, etc.

  • Thanks

If there are any specific tests you want me to run, just let me know. I’ve stored frozen snapshots of all of the backups (4 different similar cases), even if the sets in the prod replica have been deleted and restarted.

Yes, detailed by The REPAIR command

Tries to repair the backup. If no local database is found or the database is empty, the database is re-created with data from the storage. If the database is in place but the remote storage is corrupt, the remote storage gets repaired with local data (if available).

so it has two distinct functions. If you say you ran repair, should I take that as the no-database function? Maybe it’d be best not to even mention repair unless you’re doing that try-to-align-things case as well…

I think you hinted somewhere recently, but to ask again – are you finding these in a disaster recovery test?
checker12.zip (1.3 KB)
What my script (which I’ll just post, even though you had to change it for your environment) is looking for is blocks in aggregate, and not much in specific other things that could be going wrong, yet it’s still disappointing that it saw nothing missing. I was manually making missing blocks yesterday, and did find one case it found that list-broken-files did not: a metadata block of a folder. Testing a file, both methods found a missing block in a file’s data or metadata. This all followed the expected recreate fail.
As expected, simply breaking the dindex didn’t cause that, as the recreate read the dblock instead. Breaking a dblock is more complex because it changes the dblock size and hash, so those need to go back into the dindex…
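If anyone wants to try, the bookkeeping after tampering with a dblock looks roughly like this (a hedged sketch for the decrypted test copies only; the volumehash/volumesize field names are my reading of the vol/ entry, not documented format, and it assumes the decrypted files keep their remote names):

import base64, hashlib, json, zipfile
from pathlib import Path

def patch_dindex_for_modified_dblock(dindex_path: str, dblock_path: str):
    # Recompute the tampered dblock's size and hash, then rewrite the matching
    # vol/ entry of the dindex so the two stay consistent for a recreate test.
    dblock = Path(dblock_path)
    new_size = dblock.stat().st_size
    new_hash = base64.b64encode(hashlib.sha256(dblock.read_bytes()).digest()).decode()
    with zipfile.ZipFile(dindex_path) as z:
        entries = {n: z.read(n) for n in z.namelist()}
    vol_name = "vol/" + dblock.name      # assumes the decrypted file kept its remote name
    desc = json.loads(entries[vol_name])
    desc["volumehash"], desc["volumesize"] = new_hash, new_size   # assumed field names
    entries[vol_name] = json.dumps(desc).encode()
    with zipfile.ZipFile(dindex_path, "w") as z:   # rewrite the archive in place
        for n, data in entries.items():
            z.writestr(n, data)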

There’s more potential fault insertion testing, e.g. I didn’t try creating a fault in the list folder, which is a convenience copy of a blocklist that’s also in the dblock, but one doesn’t want to read dblocks in a recreate.

So is the test sequence: run a recreate to see if it can (a really good test, IMO), find that it can’t, and then find that all tools (e.g. list-broken-files and my Python script) also can’t explain it? If so, that’s a pretty bad situation.

It sounds like you were trying pretty hard afterwards. The only good news is it seems a pretty solid fail…

Actual restores have more tricks in them, so I want to make sure I know what’s being run, and the order.
Are you doing a direct restore or a recreate, i.e. run a repair without a local database? That’s important.

The source code claims that the broken database is still allowable for restore, so is it a recreate and then an ordinary restore from the somewhat broken database? That should be allowed (but might have issues), and it separates actions. The best DR restore test uses no-local-blocks to ensure it’s only using the destination:

Duplicati will attempt to use data from source files to minimize the amount of downloaded data. Use this option to skip this optimization and only use remote data.

even if the test system has no access to the source (I don’t know), it would waste some time trying to get there.

Restore also has extensive logging that you can look at that possibly would shed light on a block failure.
We need as much data as possible, preferably from logs, because that output is designed for this and is more private.
Verbose level is as high as you need to go, I think, and you can sanitize any filenames as seems proper.

So I guess that’s more than the default test, which is 1 set, typically 3 files. Perhaps this was using all?

The file can have a different hash algorithm than its blocks, but its hash is computed one block at a time.
A restore scatters blocks from dblock files as needed. If a source file block is missing, the file hash notices it.
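In other words, conceptually something like this single pass (a sketch, not Duplicati’s code; the 100 KiB blocksize is just an example):

import base64, hashlib

def hash_file_single_pass(path: str, blocksize: int = 100 * 1024):
    # Per-block hashes and the whole-file hash come from the same read,
    # so they cannot disagree about the content that was backed up.
    file_hash = hashlib.sha256()
    block_hashes = []
    with open(path, "rb") as f:
        while block := f.read(blocksize):
            file_hash.update(block)                       # whole-file hash
            block_hashes.append(
                base64.b64encode(hashlib.sha256(block).digest()).decode())
    return base64.b64encode(file_hash.digest()).decode(), block_hashes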

This makes more sense, because recreate uses list-broken-files internally to check for missing blocks (which is the exact error you are getting). If I understand correctly, your attempt to fix it was:

  • Copy/move original backup database to another location
  • Recreate database, this gives an error
  • list-broken-files to get a list of broken files
  • purge-broken-files to get rid of broken files
  • Repair the new database (this should not be necessary, but also should not hurt)
  • Move original backup database back
  • Repair original backup database

Is this what you did? If yes, I see why it would not work. Repair (with an existing database) does not apply remote changes to the local database, but instead will try to restore the remote to the state in the local database. You would need to continue the backup with the new, recreated database instead.


This is probably an edge case of excluding empty files in the (huge and incomprehensible) query; maybe it will also work with a metadata block of an empty file? In any case, if you have good reproduction steps you should open a GitHub issue.

It also means that the missing block detection in your script is working. There has to be something strange in the dataset, such that list-broken-files does not find the blocks but your script does. Or your script does not notice that some blocks are missing, but the approach looks fine to me.

@Sami_Lehtinen it would maybe help to see if the blocks are really missing if you use the RecoveryTool to restore the files that failed in your test:

  • You already have the decrypted dlist and dindex, you also need to decrypt dblock files,
    or use Duplicati.CommandLine.RecoveryTool.exe download to decrypt the files from the source directory
  • Duplicati.CommandLine.RecoveryTool.exe index <localfolder> to create a txt file with all block hashes
  • Duplicati.CommandLine.RecoveryTool.exe restore <localfolder> [version] --targetpath=<restorefolder>

This uses a different restore approach than the normal restore handler, so maybe it does not have the same bugs. If the blocks are really missing, it should fail with the same files.

This has to be the case, otherwise the file could change between recording the blocks and calculating the hash. Then you have blocks that will restore a correct file with a wrong hash. For the hash function it does not matter how large the blocks are, so this will not result in an invalid hash.

I’m glad I’m not the only one who finds it hard to read. Admittedly it’s built in pieces. Formatted total is:

SELECT DISTINCT "B"."Timestamp",
 	"A"."FilesetID",
 	COUNT("A"."FileID") AS "FileCount"
FROM "FilesetEntry" A,
 	"Fileset" B
WHERE "A"."FilesetID" = "B"."ID"
 	AND "A"."FileID" IN (
 	 	SELECT DISTINCT "ID"
 	 	FROM (
 	 	 	SELECT "ID" AS "ID",
 	 	 	 	"BlocksetID" AS "BlocksetID"
 	 	 	FROM "FileLookup"
 	 	 	WHERE "BlocksetID" != - 100
 	 	 	 	AND "BlocksetID" != - 200
 	 	 	
 	 	 	UNION
 	 	 	
 	 	 	SELECT "A"."ID" AS "ID",
 	 	 	 	"B"."BlocksetID" AS "BlocksetID"
 	 	 	FROM "FileLookup" A
 	 	 	LEFT JOIN "Metadataset" B ON "A"."MetadataID" = "B"."ID"
 	 	 	)
 	 	WHERE "BlocksetID" IS NULL
 	 	 	OR "BlocksetID" IN (
 	 	 	 	SELECT DISTINCT "BlocksetID"
 	 	 	 	FROM (
 	 	 	 	 	SELECT "BlocksetID"
 	 	 	 	 	FROM "BlocksetEntry"
 	 	 	 	 	WHERE "BlockID" NOT IN (
 	 	 	 	 	 	 	SELECT "ID"
 	 	 	 	 	 	 	FROM "Block"
 	 	 	 	 	 	 	WHERE "VolumeID" IN (
 	 	 	 	 	 	 	 	 	SELECT "ID"
 	 	 	 	 	 	 	 	 	FROM "RemoteVolume"
 	 	 	 	 	 	 	 	 	WHERE "Type" = "Blocks"
 	 	 	 	 	 	 	 	 	)
 	 	 	 	 	 	 	)
 	 	 	 	 	
 	 	 	 	 	UNION
 	 	 	 	 	
 	 	 	 	 	SELECT "BlocksetID"
 	 	 	 	 	FROM "BlocklistHash"
 	 	 	 	 	WHERE "Hash" NOT IN (
 	 	 	 	 	 	 	SELECT "Hash"
 	 	 	 	 	 	 	FROM "Block"
 	 	 	 	 	 	 	WHERE "VolumeID" IN (
 	 	 	 	 	 	 	 	 	SELECT "ID"
 	 	 	 	 	 	 	 	 	FROM "RemoteVolume"
 	 	 	 	 	 	 	 	 	WHERE "Type" = "Blocks"
 	 	 	 	 	 	 	 	 	)
 	 	 	 	 	 	 	)
 	 	 	 	 	
 	 	 	 	 	UNION
 	 	 	 	 	
 	 	 	 	 	SELECT "A"."ID" AS "BlocksetID"
 	 	 	 	 	FROM "Blockset" A
 	 	 	 	 	LEFT JOIN "BlocksetEntry" B ON "A"."ID" = "B"."BlocksetID"
 	 	 	 	 	WHERE "A"."Length" > 0
 	 	 	 	 	 	AND "B"."BlocksetID" IS NULL
 	 	 	 	 	)
 	 	 	 	WHERE "BlocksetID" NOT IN (
 	 	 	 	 	 	SELECT "ID"
 	 	 	 	 	 	FROM "Blockset"
 	 	 	 	 	 	WHERE "Length" == 0
 	 	 	 	 	 	)
 	 	 	 	)
 	 	)
GROUP BY "A"."FilesetID"

and I verified that both the recreate and the list-broken-files do it. That’d take me a while to unravel.

Certainly possible, but complaining about deliberately introduced issues does feel a little bit strange.
Still, there’s no knowing whether or not this error could occur naturally. Those are the best targets…

This is a good second (or maybe third, here) opinion. I think it’s pretty vocal about missing blocks.
The commandline help says that filters work, so I guess a small restore is possible if it’s simpler…

Yes, it’s basing everything (block hash, file hash, data backup) on the same reads to be consistent.

EDIT:

RecoveryTool gets its increased robustness by ignoring dindex files and going directly to the dblock.
The regular DB recreate starts with dindex and goes to dblock if needed. This test should add clues.

Doing any kind of test other than a disaster recovery test is pointless, because those give you deceptive results. I think I’ve said this in prettier and less pretty forms. It could also be called covert sabotage, where things are said to be OK even if in reality they are mortally screwed up.

Thanks, this saved my day. Literally, lol, good one. Even if it’s true. Having a solid fail means that there’s also a solid bug in the code somewhere.

Always a direct restore; any other method, as said, is totally pointless, highly deceptive, and an outright dangerous thing to do.

I’ve got no local data that could be used in the restore testing environment, and of course I’m always using that parameter. The restore test paths are always clean.

Of course it doesn’t. It’s also important from a security point of view. Systems should be isolated enough that in case of compromise it’s not possible to sabotage different data locations.

I think I’ve done this already in the past, but let’s redo it.

Of course. That’s to figure out the true extent of the problem. And I say it’s pretty darn bad. All of the backup versions have the same corruption, even though the original data source is available and the error could be fixed at any time by replacing the corrupted block with a non-corrupted one from the source. → Again, a very treacherous situation, which should be totally intolerable and a total showstopper for normal operations. It’s just an accident waiting to happen.

Edit, quote tag fixed

Yes, this was just an attempt to see if I could fix the broken databases by using something other than the production systems to fix the corruption. But this issue again shows that there are at least three overlapping problems:

  1. Data is first corrupted. (This is the main reason for the whole mess.)
  2. The problem isn’t originally detected by the source system.
  3. Then, when the data storage is modified (corrupted data is deleted), the primary processes don’t detect it and correctly catch up.

I did this just so that it would be obvious that the corrupt data is corrupt, and so it would be quickly replaced from a solid source. But nope. Naturally, this shouldn’t be necessary, and it’s also inefficient, but technically, if all the logic is solid, it still should work. Yet I do understand if it isn’t completely automatic, for data safety reasons. This is something that the operator should be notified about, because it could potentially indicate serious issues.

Yeah, why not. Let’s see…

Decrypt - Ok
Index - Ok
Restore … Hmm, most interesting. All files restored successfully or with “done”, which is based on what?

No errors, but does that prove anything? Trust is so low in the different processes and error handling. I’ll now hash-compare the files between the restore (with Duplicati) and the restore with the recovery tool; let’s see what the results are. I’ll do this hash check first to see if there’s only one mismatching file.

Well, this result is absolutely the most interesting one I’ve seen in a while. The restore reported that file, let’s say, ID#3 is corrupted. But based on the hash test, the ID#3 hashes match between the Duplicati restore and the recovery tool restore. What’s more interesting is that the file ID#2 hashes do not match. WTF. It’s cartoon time.

I could also technically now restore to a copy of the already restored path and allow local blocks. I’ll do that too. Let’s see what happens.

It still claims it’s broken. - Madness

The only thing I’m sure about is that this is extremely screwed up. And it’s quite safe to say that data integrity with Duplicati restores is nightmarishly low.

HeisenBackup would be a more descriptive name. The backup is assumed to be good, until you try to restore it and find out it’s not.

There will be error messages Failed to read hash or Failed to read blocklist hash if data blocks are missing. It seems like the recovery tool does not apply metadata at all.
It only outputs done if the full file hash matches the hash in the dlist file. So based on that, I would say all of the backup data is there.

This is very strange. In both tools there is a hash verification check after they are finished. The only difference is that the normal restore uses the database to get the target hash, and the recovery restore uses the dlist file directly. Since you have decrypted the dlist files, you can look for the hash in them: open the dlist.zip of the version you restored (probably latest), get filelist.json, and search for the file. The size and hash should match the restored file. Both of these are read directly from the file on disk originally, not calculated from individual blocks.
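A rough sketch of that manual check, if you want to script it (the paths are placeholders, and the assumption that the entries use path/size/hash keys is mine, based on what filelist.json looks like):

import base64, hashlib, json, os, zipfile

def check_against_dlist(dlist_zip: str, backed_up_path: str, restored_path: str):
    # Find the recorded size/hash for one file in the decrypted dlist and
    # compare them against the file actually produced by the restore.
    with zipfile.ZipFile(dlist_zip) as z:
        entries = json.loads(z.read("filelist.json"))
    entry = next(e for e in entries if e.get("path") == backed_up_path)
    actual_size = os.path.getsize(restored_path)
    with open(restored_path, "rb") as f:
        actual_hash = base64.b64encode(hashlib.sha256(f.read()).digest()).decode()
    print("size:", entry["size"], "recorded vs", actual_size, "restored")
    print("hash:", entry["hash"], "recorded vs", actual_hash, "restored")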

So ID#2 is restored successfully by both tools, but has different data? I can’t imagine how that would happen, unless there is a hash collision in the file hash or there is a bug in database recreate that puts wrong file hashes in the database, and your case magically hits exactly this hash.

Local blocks are taken from the database, with their full paths, so if you restore on a different system than the original this will not work.


Ok, good. Because it doesn’t clearly state this, I was suspicious. I’ll manually check the sizes and hashes just to be sure.

But this seems to point strictly in a direction which indicates that the db-rebuild process before the restore itself is broken, and the actual data blocks aren’t missing or corrupted. Which is very good news! At least on some level. Of course this will be extremely frustrating, especially for non-techies who might encounter this situation by surprise, in a very desperate situation, while trying to restore backups in a rush. Then having to jump through all the gimmicks to restore the data might be a harrowing experience.

:white_check_mark: Verified: PowerShell Get-FileHash hashes match the hashes inside the original and latest dlist, after the RecoveryTool-based restore, for the files.
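(Note for anyone repeating this: Get-FileHash prints a hex digest while the dlist stores Base64, so the values need converting before they compare equal. For example, using the well-known empty-file SHA-256 that also shows up later in this thread:)

import base64

hex_digest = "e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
print(base64.b64encode(bytes.fromhex(hex_digest)).decode())
# -> 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=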

But technically I’m quite happy now. And now it should be clear where the problem roughly is. As stated, the other “corrupted backups” are very likely to be similarly corrupted and restorable using the RecoveryTool. I’ll try another set just to verify that this works.

Edit:
Second backup set test results:
Duplicati restore fails, as expected.
Recovery tool: well, this is again interesting, I got some errors. I’ll get some log snippets. So the problem scenario isn’t that consistent after all (?).

Wtf. We got different errors… Let’s grep the log and see details…

Line  561: 554: d:\duplicati-debug\Debug-Set2\restore\path_a\file (0 bytes) error: System.Exception: Unable to locate block with hash: 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=
Line  650: 641: d:\duplicati-debug\Debug-Set2\restore\path_b\file (0 bytes) error: System.Exception: Unable to locate block with hash: 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=
Line  655: 644: d:\duplicati-debug\Debug-Set2\restore\path_c\file (0 bytes) error: System.Exception: Unable to locate block with hash: 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=

After reading and actually looking at the error messages, I was like… REALLY!!! Could this be true?

0 bytes? Same hash…

Let’s see if that holds.

>>> base64.b64encode(hashlib.sha256(b'').digest())
b'47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU='

Hahahah, yeah it does.

Yeah, this might mislead tech n00bs quite badly. :wink: I would say, expert trolling tier stuff. - GG

We can conclude that the second set restored correctly (with errors, haha) using the recovery tool.

Oh, I was hoping it would also use the files existing in the restore-to path, and only patch where necessary. → In this case, having blocks already in that location would help.

if missing := dlist_block_set - dindex_block_set - {"47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU="}:
    print("Missing", missing)

is the missing-block checker checker12.py forgiving the dindex for not actually storing the empty block, after a redesign attempting to avoid problems that such a practice created. Some history is around, including an open issue

0-byte files are not restored with RecoveryTool #4884

Despite the mayhem found, there are some interesting new findings that might help find where the bug lies.

Is this a file restored two different ways, with RecoveryTool silent and a hash error from restore?
Is the size correct? You could certainly run a binary compare (e.g. with cmp or Windows fc /b) for clues.

EDIT:

The destination file is always built block by block (at whatever blocksize the backup is using), so
compare errors likely start and end at block boundaries. The first block is offset 0. The last may be short.
Assuming the boundaries end that way, the question is what is where the correct data should be?
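A quick sketch of that comparison, if it helps (the paths are placeholders, and the blocksize must match whatever the backup used):

def differing_blocks(path_a: str, path_b: str, blocksize: int = 100 * 1024):
    # Report which fixed-size blocks differ between two restores of the same file.
    diffs = []
    with open(path_a, "rb") as fa, open(path_b, "rb") as fb:
        index = 0
        while True:
            a, b = fa.read(blocksize), fb.read(blocksize)
            if not a and not b:
                break
            if a != b:
                diffs.append((index, index * blocksize))  # block number, byte offset
            index += 1
    return diffs

print(differing_blocks(r"restore_duplicati\file", r"restore_recoverytool\file"))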

Independent restore program
Using recovery tool with missing dlist files
both have Python scripts that could be used (or changed) if other opinions or more analysis helps.


If this fixes the (very simple!) issue for good, it would be amazing. Just wondering when a binary including this fix will be available? - Thank you

After this fix is available, I think the most common reason for data corruption would finally be remedied. I can provide feedback as soon as the binary is out: update a few instances, and then report back if the weekly automated restore test brings up any corruption issues.

At the moment this fix is in the main Duplicati trunk. All that remains is to actually cut a release.

I want mostly for this new release to finally include a fix for the ‘Duplicati is taking ages to rebuild my database’ issue that has been bugging Duplicati for years. I have it as a PR, but I still need some time to review it, and I’m constantly procrastinating by replying to posts on this forum, reviewing some of the many stranded PR fixes for other bugs and enhancements (interesting stuff for sure, but less crushing than the database rebuild problem, IMO), and, well, life generally.
But I’m determined to be quick with the reviewed PR and cut a new release, maybe this month, if the project owner is available to cater for it.
