Backup valid, but still unrestorable?

@drwtsn32 - Ref: Is Duplicati 2 ready for production? - #73 by drwtsn32

I’ll reply here, because the ready-for-production thread is the wrong place for this topic. I just thought it was important for people to know that there are several trust issues, and that it’s extremely important to actually test restores from the destination, without the local blocks or the Duplicati database at hand.

About the backend: I upgraded the FileZilla Server version. I updated all Duplicatis to the latest version and triple-checked everything on the backend, including all backups and other massive files with hash validation, so I know that the backend itself is not corrupting data. We also ran memory tests and full storage-system tests on several levels. Also, all transports are over TLS, so there’s an integrity check.
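(For anyone who wants to run the same kind of backend check, a SHA-256 manifest of the destination can be built and diffed between snapshots with something as small as this sketch; the destination path is just an example.)

```python
# Minimal sketch: print a SHA-256 manifest of every file under a backup
# destination, so two snapshots of the backend can be diffed for corruption.
import hashlib
import os
import sys

def sha256_of(path, chunk_size=1024 * 1024):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()

def manifest(root):
    for dirpath, _dirs, filenames in os.walk(root):
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            print(f"{sha256_of(path)}  {os.path.relpath(path, root)}")

if __name__ == "__main__":
    manifest(sys.argv[1])  # e.g. python manifest.py /srv/backups/duplicati
```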

Now to the problem, and why I didn’t suspect the backend to begin with (but I checked it anyway).
I saved the smallest backup + source configuration, so I can repeat the problem. I can’t readily reproduce how it initially formed, but I can reproduce the restore error, and I can run test and repair, which both end up saying that everything is good.

Unfortunately I’m also a business software developer, so I have a decades-long history of troubleshooting multiple complex business systems, including my own code, which fail in multiple different agonizing ways. I told my colleague just today that if this were my own code, I would be really pissed, because I would know it’s broken but couldn’t easily and quickly tell exactly what led to it.

The trail is similar: something ends up being wrong in the destination. It could be a leftover file or something, something which messes up the restore logic. I can’t say with confidence whether it’s the backup or the restore that is causing the problem.

Let’s recap what I know. The latest version of Duplicati seems to handle roll-forwards much better than the old versions, which, well, in some cases systemically failed. With the latest version I see that if it uploaded file A and the upload wasn’t finished, the file is deleted or re-uploaded. And compaction, even if aborted, seems to work in most cases as I would expect a transactional system to work.

This is not a single event. I run, well, many backup sets daily. In this case I had just two failing sets, but both with exactly the same problem. And now it’s important to remember that the test with full remote verification passes. So there are NO MISSING FILES, and ALL THE FILES HAVE THE CORRECT HASH… if we can assume that the test / repair itself isn’t flawed.

I also reset the failing sets today and re-backed them up from source. Both restores were successful after that, without any configuration changes. So that’s not the problem either.

Based on all this, I’m pretty confident that there’s a quite simple (!) logical mistake or implementation error somewhere. We’re very, very close to the goal.

Yet, as said, my experience is generic: file transfers, databases, transactions, and other failures and the related logic. I’m not a Duplicati expert. How could we move forward from this point? I’ve collected all the forensic evidence about the event.

It would all be so wonderfully easy if the backup set didn’t contain data that I can’t share with you guys; I would just upload the whole packet somewhere. In this case that’s not possible. But the data is practically meaningless to you, and I’ve got no problem with IP addresses or even credentials, because it’s all on a closed system.

I assume you, as the Duplicati experts, know best what I should share with you. I would really love to help; this is very frustrating for me as well. If I didn’t like Duplicati, I would have dropped it and swapped to Restic a good while ago. Yet I really did think about that hard again today. But this is hopefully the last logical issue somewhere?

I’ll share the numbers with you confidentially, so you’ll understand how many backups I’m running daily and how rare this issue is. By the way, even the filenames in the backup are totally meaningless, as long as the file content itself is kept private. So I could share directory listings, the whole Duplicati database, the destination file listing, and the hashes of all the files in the destination folder.

And depending on the database rebuild phase, in theory this all could be caused by something like one extra file being left at the target. Sure. Yet I still don’t understand why repair wouldn’t clear it up. That’s the major question: why are there two conflicting truths simultaneously?

How should we continue? After thinking a while, my assumption is that I could give you everything except the dblock files and the username and password to the destination: the full command lines I use for backup & compact and so on, but without the UID/password. I would include file hashes of all files. Would that be helpful? Would you need something else?

Update: data packet delivered, with logs and a reproducible case where test passes and restore fails, with full debug & profiling logs, including the database plus the dindex and dlist files.

I think it would be good to create a GitHub issue to track this bug down, if you haven’t already. I am curious: does it occur if you try to restore the files locally? (Move the backup to a machine with Duplicati on it and run it on that system.)

I think the most valuable thing for figuring out how this occurred is to reproduce it from square one on a test dataset. Only you know the exact process you went through to end up in your current state, so you will need to do that work and outline the exact setup/configuration and steps to reproduce. It would be good to have the full profiling log; you can regex-replace any sensitive filenames out of it (if it is a test dataset this shouldn’t be needed). Share it in the GitHub issue.

Seeing as you have run into this issue multiple times, I would say we have a very good chance of identifying the root cause.

There’s actually a PM thread on this. I’ve been thinking maybe you should be on it to help with SQL.
What I’d prefer is if it could be done more in the open, because then other people can learn or help.

This would be a whole lot simpler if it was reproducible with the profiling log and database histories.
By that I mean a UNIX-style rotation with names containing .1, .2, etc., so as to see earlier databases, instead of only seeing the suspicious dindex file that compact built (out of what?) and the recreate failure.
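(If anyone wants to copy the idea, a rotation of the job database could be as simple as this sketch, run while the backup job is idle; the database path is an example, not a real one.)

```python
# Minimal sketch of a UNIX-style rotation: keep N older copies of the Duplicati
# job database as <db>.1, <db>.2, ... so earlier states can be examined later.
import os
import shutil

def rotate(db_path, keep=7):
    oldest = f"{db_path}.{keep}"
    if os.path.exists(oldest):
        os.remove(oldest)                          # drop the oldest copy
    for n in range(keep - 1, 0, -1):
        src = f"{db_path}.{n}"
        if os.path.exists(src):
            os.replace(src, f"{db_path}.{n + 1}")  # shift .n -> .(n+1)
    if os.path.exists(db_path):
        shutil.copy2(db_path, f"{db_path}.1")      # newest copy becomes .1

if __name__ == "__main__":
    rotate(r"C:\Duplicati\data\EXAMPLEJOB.sqlite")  # hypothetical path
```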

My own production backup has even more history, as I rclone sync the destination to local, using
--backup-dir to also preserve the now-deleted-at-destination files that the compact compacted…

Until when/if the available data gets better, I have some things to try to increase my understanding.
I was trying to chop apart some SQL the other day, getting syntax errors, and wishing for your help.

ErrorID: DatabaseIsBrokenConsiderPurge
Recreated database has missing blocks and 3 broken filelists. Consider using “list-broken-files” and “purge-broken-files” to purge broken data from the remote store and the database.

is the result of a big SQL query with UNION, which I took apart (with difficulty) several ways, finding various oddities in various places. The Block table wound up with entries having -1 for VolumeID, which probably caused some uproar. The number of these exactly fits the length of a blocklist in that dindex file’s list folder (32 bytes per SHA-256 hash). There’s a second file in the same dindex with exactly the same name (the Base64 of the SHA-256) that looks like a superset of the first file…

which happens to be the last block hash in the blocklist whose hash matches its name (the other doesn’t).
I might have found where VolumeID becomes -1 on DB recreate, but the meaning of Size is elusive…

I fixed the problem by deleting the mysterious longer second copy of the blocklist. It turned out to be the more reasonable-looking shorter blocklist repeated twice. Figuring out how that could happen would need a trip through the C# code and SQL. Because I don’t develop in either, it would be nice if debugging were more open.
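(For anyone who wants to poke at an extracted blocklist the same way, here is a sketch based on the layout described above: the content is concatenated 32-byte SHA-256 block hashes, and the file’s name should be the Base64 of the SHA-256 of that content. It also flags the “shorter blocklist repeated twice” case.)

```python
# Sketch: sanity-check an extracted blocklist file from a dindex "list" folder.
import base64
import hashlib
import sys

def inspect_blocklist(path, claimed_name=None):
    data = open(path, "rb").read()
    if len(data) % 32:
        print(f"WARNING: length {len(data)} is not a multiple of 32")
    print(f"{len(data) // 32} block hashes in this blocklist")
    # The blocklist's name should be the Base64 of the SHA-256 of its content.
    expected_name = base64.b64encode(hashlib.sha256(data).digest()).decode()
    print(f"expected name: {expected_name}")
    if claimed_name and claimed_name != expected_name:
        print("MISMATCH: content does not hash to the claimed name")
    # Flag the "shorter blocklist repeated twice" pattern seen in this thread.
    half = len(data) // 2
    if len(data) and len(data) % 64 == 0 and data[:half] == data[half:]:
        print("SUSPICIOUS: second half is an exact repeat of the first half")

if __name__ == "__main__":
    inspect_blocklist(sys.argv[1], sys.argv[2] if len(sys.argv) > 2 else None)
```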

I show it in 7-Zip because that’s where I did it, and because File Explorer is confused by the duplicated files: when it tries to show me information on the longer file, it shows the size of the shorter one, so I don’t trust it.

As to why one participant in the PM found that a Direct restore worked OK: only versions 3, 4, and 5 reference the affected blocklist in that dlist file. Presumably the GUI restore just picked the latest version, 0.

FWIW, my “fixedup” folder for this experiment is the decrypted dlist and dindex files, plus empty files for the former dblock files. On those I leave the .aes extension, because they’re referenced in the dindex that way, although Duplicati can actually deal with the names as .zip. You just have to endure some messages like this:

Unable to find volume duplicati-b83bdf633ed5a424e929c69e5257d94f3.dblock.zip.aes, but mapping to matching file duplicati-b83bdf633ed5a424e929c69e5257d94f3.dblock.zip

EDIT:

The troublesome blocklist showed up in the Mar 19 backup and continued through Mar 20 and Mar 21; however, the first place it actually appears twice in the provided current snapshot is (according to the dindex) duplicati-b83bdf633ed5a424e929c69e5257d94f3.dblock.zip.aes. Please check yours to see what’s in it. That file was uploaded by a Mar 22 compact, but earlier databases, dindex files, and logs aren’t available. Depending on how the compact code works, maybe old issues can propagate. This needs a developer’s look.

Well, let me finish testing the DB rebuild performance changes, and hopefully you guys can get a more public foundation of the current progress on the issue. Maybe a full explanation of the configuration and layout of the systems running into the backup issue. Then we can brainstorm how to create a reproducible test case of the problem. I don’t really like PM conversations for this type of thing, because they increase the chance of duplicated work, which isn’t good for open-source projects that are short on manpower. I understand the C# and the SQL, but my domain knowledge is lacking, so we definitely need an open discussion on GitHub to make that type of conversation easier.

I think it would be good to create a GitHub issue to track this bug down, if you haven’t already. I am curious: does it occur if you try to restore the files locally? (Move the backup to a machine with Duplicati on it and run it on that system.)

Yes, it does. Anyway, the backup restore tests currently run over SMB and a 10 Gbit network. It makes no difference (confirmed) if I copy the files to a local disk.

I’ve provided in one private thread a data set which causes the database restore to crash. The filenames aren’t that sensitive, and those are what would actually matter. But this is just one of the problems; I now think there are at least three overlapping “show stopper” level quality issues.

@ts678 If tarianjed needs the backup data I provided, it’s ok to forward it.

Also, as I said, the database rebuild performance is an absolute nightmare. It usually seems to be related to the message “registering missing blocks”. The same kind of backup can restore in 5 minutes versus 5 days, depending on whether that message is seen. I don’t have 100% proof of that, but now I usually just abort the restore test as soon as I see that message. There might also be some issue with local blocks, because if I do not disable local-blocks, that message seems to make the performance even worse. This is just a feeling again, no guarantees. But both seem to indicate a situation where things first go slightly wrong and then the “fix” step is also problematic and leads to a very bad end state.

@ts678 Sounds good; whenever issues are located, it helps. And for sure, sometimes deep logic issues are really agonizing to locate. Been there, done that.

I hope the FTP issue thread also helps to fix several overlapping issues, because it clearly points out the problems I’ve strongly suspected earlier.

Oh, ya I was just referring to this work I have been doing:

If you could post more info about that “registering missing blocks” behavior, and maybe track down the exact message and post it there, that would be great.

This is explained (though not solved) on the forum, and is easier to see in the GUI or with logging.
Recreate’s end has 3 phases. In the GUI these run from 70% to 100%, as three 10% steps on the progress bar, during which dblock files are downloaded in search of missing information. The final 10% is exhaustive…

Sometimes exhaustive isn’t enough, and there are still missing blocks. Sometimes the long work fixes it.
Blocks should never go missing in an ideal world, but that’s not what we have here, so see above.

There are multiple aspects to recreate improvements. One (being worked on) is SQL performance; however, another (probably awaiting a developer) is whether any improvements can be made to stop the exhaustive search when it becomes ridiculously long. Sometimes it may be better to give up on it…

and in my view, waiting out the third phase (90% and up on the progress bar) is sometimes not worthwhile.
The problem is that sometimes it is. I’ve seen a post saying that it checks for success frequently, but looking at the code (to the extent I can), it looks like it doesn’t check for success during pass 3, only afterwards.

Checking occasionally during pass 3 would speed up some successful cases, but not the futile cases.
Futile cases should never happen in an ideal world, but all we can do is knock them down as they’re isolated.
Even doing that requires good isolation, and expert developers with time. I’ve written a lot about both.
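(To make the “check for success” idea concrete, here is a rough sketch, not Duplicati’s actual code, of such a check against a copy of the recreated job database. The assumption, taken from the observations earlier in this thread, is that blocks not yet located sit in the Block table with VolumeID = -1.)

```python
# Sketch: count blocks that the recreate has not located yet, assuming they are
# stored in the Block table with VolumeID = -1 (as observed in this thread).
import sqlite3
import sys

def missing_blocks(db_path):
    con = sqlite3.connect(db_path)
    try:
        (count,) = con.execute(
            "SELECT COUNT(*) FROM Block WHERE VolumeID < 0"
        ).fetchone()
        return count
    finally:
        con.close()

if __name__ == "__main__":
    n = missing_blocks(sys.argv[1])  # path to a copy of the job database
    print(f"{n} blocks still unresolved" if n else "no missing blocks")
```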


Although how the blocklist data went bad is still a mystery (it would be less so if we had more history), reproducing the final failure turns out to be easy if one manually inserts a bad blocklist into a dindex.

My goal with this was to have a small test case that would be easier to trace (even with a debugger).
The small test is easy to play with in variations, and it’s just dummy data so there’s no confidentiality.

Putting every recreate problem in one catch-all issue seems wrong, and I’d feel weird about opening another issue on basically garbage-in-garbage-out, but I’d do it. Or should we just get into this here?

The recipe summary: set blocksize=1024, back up a 1025-byte file to force a blocklist into the list folder, extract the blocklist, double it with a hex editor, put it back, recreate the DB, and watch the recreate error.
I also get the -1 VolumeID (for which I might see a source) and the negative Size that still puzzles me.
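(The hex-editor step could also be scripted. This sketch assumes an unencrypted backup, i.e. the dindex is a plain zip, and simply rewrites it with every blocklist under its list/ folder doubled; keep a copy of the original file.)

```python
# Sketch: corrupt a test dindex by appending each blocklist onto itself,
# reproducing the "blocklist repeated twice" shape discussed above.
import shutil
import sys
import zipfile

def double_blocklists(dindex_zip):
    backup = dindex_zip + ".orig"
    shutil.copy2(dindex_zip, backup)           # keep the original for comparison
    with zipfile.ZipFile(backup) as src, \
         zipfile.ZipFile(dindex_zip, "w", zipfile.ZIP_DEFLATED) as dst:
        for info in src.infolist():
            data = src.read(info)
            if info.filename.startswith("list/"):
                data = data + data             # blocklist repeated twice
            dst.writestr(info.filename, data)

if __name__ == "__main__":
    double_blocklists(sys.argv[1])  # path to a decrypted duplicati-i*.dindex.zip
```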

Do you still have the original data set with its dblock files? I was trying to see what else could be found in the materials you previously shared, and found that the extraneous blocklist referenced in the earlier image is the last file in the broken dindex; maybe this is significant. Can you look in the dblock for that block?

duplicati-b83bdf633ed5a424e929c69e5257d94f3.dblock.zip.aes would be the dblock that is indexed by duplicati-ibd07684ff5734e8f8cd14d6d7f1428a4.dindex.zip.aes that somehow got a bad entry at its end.

You can probably look for the block in the dblock file by name (which is the Base64 of the hash, as in the dindex). Failing to find it that way, you could try by size: in the dindex, the excess file is the only file with size 29632.
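(Something along these lines should do for the lookup, on decrypted files; pass whichever name or size you’re hunting for.)

```python
# Sketch: list entries in a decrypted dblock/dindex zip matching a block name
# (the Base64 hash) or a known uncompressed size such as 29632.
import sys
import zipfile

def find_block(zip_path, name=None, size=None):
    with zipfile.ZipFile(zip_path) as z:
        for info in z.infolist():
            if (name and info.filename.endswith(name)) or \
               (size is not None and info.file_size == size):
                print(f"{info.filename}  {info.file_size} bytes")

if __name__ == "__main__":
    find_block(sys.argv[1], size=29632)  # or name="<Base64 hash from the dindex>"
```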

I’m still not seeing this blocklist in the database, and I don’t have prior databases. I showed how I save databases in a series so I can go back; I also save deleted files with rclone, but that’s harder to set up. Ideally this would be entirely reproducible with exact steps in a debug environment. We’re not so lucky.

I don’t know if this can be solved with available information, but I thought I’d fish for more, and request additional information collection (if practical) in case this happens again. If not, it might get stuck again.

Unfortunately I don’t have the original one anymore. Strange; OK, that’s not exactly what I expected it to be. It’s actually more interesting. Why would an index file have an invalid reference? Probably some other bug somewhere else, and not directly related to the compaction abort I was blaming so many issues on.

But I might already have more similar cases. Unfortunately, those will likely be much larger data sets and therefore harder to analyze. The one I provided you was so wonderful because it was a really small set.

But this is probably the secondary cause, which is making some of the backups extremely slow to restore, and unreliable, on top of the missing files after compaction.

But currently I don’t know of any problems other than these two. Doesn’t sound too bad after all. Just yesterday I ran 100% restore tests again for all the sets, and these were the two cases: missing blocks, which then may or may not be restorable, and a database-locked error on the next backup run after a compaction abort (a clear bug!), which is trivial to remedy with repair.

Ref: FTP delete not atomic / verified / transactional (?) - #31 by Sami_Lehtinen

Edit continued:

Yes, it seems that we’ve already got such a case: registering missing blocks, an insane amount of single-threaded CPU waiting, and then finally an error. Yet again, this is a data set which passes backup. I can actually run a test right now; I’ll do it.

Error on restore:

ErrorID: DatabaseIsBrokenConsiderPurge
Recreated database has missing blocks and 3 broken filelists. Consider using "list-broken-files" and "purge-broken-files" to purge broken data from the remote store and the database.

Checked:
Backup runs just fine, restore fails, list-broken-files reports good, purge-broken-files reports good, repair didn’t fix a thing (no errors to fix), and test (I’ll update when done, this is going to take some time)… Yes, test passed without errors. Still totally logically and deadly dangerously broken.

@ts678 I can save this state temporarily, both the SQLite DB and the full backup storage, but this data set is around 5 GB. As mentioned, I can’t provide dblocks, but otherwise the DB contains nothing especially sensitive. The path / file / server names aren’t that sensitive.

I’ll temporarily save a snapshot of this state.

I’ve again got five new backup sets that all pass test but can’t be restored. Actually, I’ve got nothing to add at this point, except that it would be nice to hear that the basic logic has been fixed.

Any interrupted backup nearby? There’s one issue on GitHub with that message, but it’s thought to be fixed, although in a release after Oct 30, 2020, so whatever you’re seeing now probably still requires debugging.

Interrupting backup can leave hidden damage so Recreate will download all dblocks then fail from missing blocks #4341

How exactly to debug is a good question. A smaller dataset would help. Looking over the history since the last successful restore might be easier, so we’re again talking logs and possibly some database snapshots, although doing a recreate is going to break the change trail. Still, a comparison of old vs. new might help.

Full-size zeroed-out dblock files aren’t very useful (so far) except to see the names; they can’t be read. Possibly using empty files will make the data set smaller, but it’s still not possible to recreate from them.

It would be nice if there were a developer who knew the recreate code. The original author is rarely around; however, maybe some of the SQL folks who have begun looking at performance will learn the approach.


I just ran some tests, and it’s still systemically broken. Compact destroys data beyond repair. Restore fails, but test lies to the user that everything is good. Probably two overlapping, extremely simple issues.

Read: A classic example of broken code and basic logic.

It seems that the only way to fix the data set is a full delete and recreate. It’s FUBARed, in classic computing terms.

This means that there are two overlapping issues:

  1. Compact is broken, because it messes things up.
  2. Rebuild is broken, because it can’t rebuild even if all required data is present.
  3. Or, as an alternative to 2, test is totally broken and lies to the user that the data is OK, even if it isn’t.

… But it’s pointless to discuss these matters, because it’s not rocket science …

When you say “test”, I think of more than an error in normal use. Please describe the test, and provide whatever information might be useful to investigate. Just complaining (even loudly) does not solve anything.

As I’ve written to you before, test cases that are reproducible from scratch by developers can help,
however as I write here in the forum constantly, progress relies on volunteers. Help where you can.

Rather than post more heat than light, please consider aiding progress. You seek that, I’d assume.

Posting guidelines are rather generic, but give some good advice on general ways to interact here:

This is a Civilized Place for Public Discussion


Just cross linking: FTP delete not atomic / verified / transactional (?) - #35 by Sami_Lehtinen

Tested with the latest version; the repair / test / restore functions are still inconsistent.

Test → Ok

Successfully verified 70 remote files
Examined 70 files and found no errors

Repair → Ok
Listing remote folder ...

Restore → Fail
Recreated database has missing blocks and 13 broken filelists. Consider using "list-broken-files" and "purge-broken-files" to purge broken data from the remote store and the database.

This test was completed using the latest version (v2.0.7.100-2.0.7.100_canary_2023-12-27). The issue was probably created in this specific data set by an older version, but even the latest version doesn’t catch, fix, or work around this deadly treacherous problem.

Do you use the new ListAndIndexes option for full-remote-verification? It is supposed to allow a fuller test than ‘False’ (the default value) while not taking too much time (as ‘True’ does), because it skips the data blocks. The idea is to select it even for normal backups, so that problems are caught straight away. (When using it for backups, since the ‘all’ value used for the test command is not available, it’s necessary to use backup-test-percentage = 100 instead; I find this discrepancy irritating, but it’s how it is for now.)

makes me think that the mentioned Restore is a Direct restore from backup files, e.g. due to disaster.
Being able to test backups well enough to know they can do that is important. Does test do enough?

I don’t know exactly what it does, beyond file hash checks and some comparison of blocks to what’s expected.
There are several different test approaches. GetBrokenFilesets looks to be behind the *-broken-files commands.
Does The LIST-BROKEN-FILES command help at all with doing what the documentation claims it can do?

Checks the database for missing data that cause files to not be restorable. Files can become unrestorable if remote data files are defective or missing.

If you moved your database out of the way before the direct restore test, you’ll need to put it back first. Success here would be a problem detection. There’s a separate issue of how the problem got created.

EDIT:

You could actually try list-broken-files on the recreated database, if you like, to see what that gets.
Early detection is of course better. I made a Python script that tries to do this, but it wants unencrypted dlist and dindex files, in order to see whether a dlist references any missing blocks (possibly the problem you’re hitting).
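(Not the script itself, but a rough sketch of the kind of cross-check it attempts, run on unencrypted dlist and dindex files; treat the filelist.json field name “blocklists” as an assumption.)

```python
# Sketch: find blocklist hashes that a dlist references but no dindex provides.
import json
import sys
import zipfile

def referenced_blocklists(dlist_zip):
    # Assumption: file entries in filelist.json carry a "blocklists" array of
    # Base64 blocklist hashes for multi-block files.
    with zipfile.ZipFile(dlist_zip) as z:
        entries = json.loads(z.read("filelist.json"))
    refs = set()
    for entry in entries:
        refs.update(entry.get("blocklists") or [])
    return refs

def known_blocklists(dindex_zips):
    known = set()
    for path in dindex_zips:
        with zipfile.ZipFile(path) as z:
            for name in z.namelist():
                if name.startswith("list/"):
                    known.add(name[len("list/"):])
    return known

if __name__ == "__main__":
    dlist, dindexes = sys.argv[1], sys.argv[2:]
    missing = referenced_blocklists(dlist) - known_blocklists(dindexes)
    for h in sorted(missing):
        print("missing blocklist:", h)
    print(f"{len(missing)} referenced blocklist(s) not found in any dindex")
```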

If this is a direct restore, which recreates part of the database, this is likely an issue that was found recently involving incorrect index files. I got the same message with recreate. There is a fix, but it has not yet been tested sufficiently:

This should prevent this incorrect index generation and also contains additional checks to detect existing damage.

We have not yet seen this bug in existing backups that seem fine to the test command. If you want to test the fix, I can make a build which you can use for a direct restore. It also tells you which files are broken.
