Backup valid, but still unrestorable?

Debug work hit an impasse when you seemingly would not allow your database bug report to be shared with anybody but me. There is a small chance that Duplicati staff could help, especially if the original author could be persuaded to get involved. See the PMs on this for details; I asked twice, on May 8 and May 13.

To summarize the unusual situation I saw in the recreated database: the Block table normally has a Size (limited by your configured blocksize, which defaults to 102400 bytes, though not all blocks are that big) and information on which destination volume the block is in (e.g. so that a restore knows which dblock file it must download).

Sorting by size, here's where the recreated database somehow ended up not knowing the block's source volume and recorded a negative block size. The most negative was -21487616. I don't know if this is a specific unusual value that means something to an expert in the recreate process, so has special meaning, or if something just goofed. If so, did the goof begin with bad remote data? All this requires more expertise than I can offer…

[image: recreated database Block table sorted by Size, showing rows with negative sizes and no source volume]
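
For anyone who wants to reproduce that view, a few lines against the recreated database will list the suspicious rows. This is only a sketch: the table and column names (Block, Size, VolumeID) are the ones discussed in this thread, and the database path is a placeholder.

  # Sketch only: list suspicious rows in a recreated Duplicati database.
  # Table/column names (Block, Size, VolumeID) are assumptions from this
  # thread; the database path is a placeholder.
  import sqlite3

  con = sqlite3.connect("recreated-backup.sqlite")  # placeholder path
  rows = con.execute(
      """
      SELECT ID, Hash, Size, VolumeID
      FROM Block
      WHERE Size < 0 OR VolumeID < 0  -- negative size or unknown source volume
      ORDER BY Size ASC
      """
  ).fetchall()

  for block_id, block_hash, size, volume_id in rows:
      print(f"Block {block_id}: size={size}, volume={volume_id}, hash={block_hash}")
  print(f"{len(rows)} suspicious blocks")
  con.close()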

Are you SamiLehtinen on GitHub? I see that person opened an issue for development, but I don't see this one. Perhaps going in that way will get some people involved. Bumping a topic on a user forum might not help. Even if people get involved, they will probably need to work with you a certain amount. I hope you will help.

Sure, you can share the backup debug dump with the core Duplicati team.

If required, I can probably also provide new, similar reports if there's a need to confirm this with the latest version, because this seems to be a recurring thing.

This is something which does happen with the current setup, where we do not "care" about the backup jobs being aborted for various reasons.

  Deleting file duplicati-20210805T161112Z.dlist.zip.aes (41,79 KT) ...
  Deleting file duplicati-20210812T161113Z.dlist.zip.aes ...
  Deleting file duplicati-20210812T161113Z.dlist.zip.aes ...
  Deleting file duplicati-20210812T161113Z.dlist.zip.aes ...
  Deleting file duplicati-20210812T161113Z.dlist.zip.aes ...
  Deleting file duplicati-20210812T161113Z.dlist.zip.aes ...
Fatal error => The remote server returned an error: (550) File unavailable (e.g., file not found, no access).

This happens, for example, if the backup destination server reboots while the remote backup(s) are running. If the client behaves correctly in a transactional way, it shouldn't be a problem. But if the client's logic is broken, situations like this make it much more likely that I'll get integrity issues than it would be otherwise. Something like this happening during backup or compaction could well trigger the situation I'm seeing. Of course it's "unlikely", but if there are 365 backups being run every 6 months per backup set, and I've got plenty of backup sets, the chances accumulate that something like this could happen.

Finally confirming, that's my GitHub account. I still don't know what the root cause is. But today I fixed tens of backups, using two different methods depending on backup size: a full reset (delete the database and destination data and rerun), and, for cases where history is important, just deleting the database and running the commands mentioned above. Just now I'm running the full restore batch job(s) again to confirm whether these measures actually lead to a consistent backup state. I'll edit this post with the results when I've got them; I assume it'll take several hours.

The repair (recreate) / re-backup tasks ended up causing around 10 TB of traffic in and out (20 TB combined) of the backup backend, which is, by the way, the latest FileZilla Server, in case that rings any bells as a source of problems (?).

-- Rebuild / Repair + full restore process confirmation data --
Haha, let's start with a confession: one failure was due to an invalid decryption passphrase.

Some larger backups were too slow to repair, so I had to delete those, while some smaller sets were repaired successfully.

But in general, I assume I'll find more (fatally!) broken backups when I do this check next time. I think backup testing should be done more often, because the current testing clearly shows that backup data corruption beyond restorability seems to be a "normal feature".

I'll have to continue testing later; this is way too slow and I've got no time for it right now.

This might be one of the reasons why I'm seeing this pattern, because the database rebuild probably doesn't handle extra files very well. (Of course, while the database is being rebuilt it's kind of hard to tell which files should be excluded and which are necessary.)

I was actually wondering, from the first DB bug report, whether the negative sizes were in your extra dindex files:

duplicati-i130c97a0ed6540f7a85191b8cb680ff2.dindex.zip.aes (size 541)
duplicati-i7b41f10da9ff4c5cb1ff5e80ea6e8e52.dindex.zip.aes (size 541)

They seem too small to hold hundreds of negative block sizes, but feel free to decrypt and look inside.
A manual sanity check for extra dindex files is to see whether Index and Blocks have the same number of rows in the Remotevolume table. There is also an IndexBlockLink table that links those two; those extras aren't in it.
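
That check is easy to script against the local job database. A sketch, with the table and column names (Remotevolume, Type, IndexBlockLink, IndexVolumeID) taken from the description above, and the database path as a placeholder:

  # Sketch of the manual sanity check above, run against the local job database.
  # Table/column names are assumptions from this thread; the path is a placeholder.
  import sqlite3

  con = sqlite3.connect("duplicati-job.sqlite")  # placeholder path

  # Compare the number of Index vs Blocks volumes.
  for vol_type in ("Index", "Blocks"):
      (count,) = con.execute(
          "SELECT COUNT(*) FROM Remotevolume WHERE Type = ?", (vol_type,)
      ).fetchone()
      print(f"{vol_type} volumes: {count}")

  # dindex volumes not linked to any dblock would be the "extra" ones.
  extras = con.execute(
      """
      SELECT Name FROM Remotevolume
      WHERE Type = 'Index'
        AND ID NOT IN (SELECT IndexVolumeID FROM IndexBlockLink)
      """
  ).fetchall()
  for (name,) in extras:
      print("extra dindex:", name)
  con.close()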

A very brute-force way of looking for the origin of the negative values might be to decrypt and unzip a dindex, e.g.

[image: file listing from inside a decrypted dindex archive]

The file shown is inside a dindex; its name is the dblock it indexes, and its contents are blocks and sizes.
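
If you'd rather script it than click around, something like this should do, assuming the third-party pyAesCrypt package can read the AES Crypt containers that SharpAESCrypt writes (both implement the AES Crypt format). The file names and passphrase are placeholders.

  # Sketch only: decrypt a dindex with pyAesCrypt (AES Crypt format) and list
  # its contents. Names, paths, and the passphrase are placeholders.
  import zipfile
  import pyAesCrypt  # pip install pyAesCrypt

  ENCRYPTED = "duplicati-iXXXXXXXX.dindex.zip.aes"  # placeholder dindex name
  DECRYPTED = "dindex.zip"
  PASSPHRASE = "your-backup-passphrase"             # placeholder

  pyAesCrypt.decryptFile(ENCRYPTED, DECRYPTED, PASSPHRASE)

  with zipfile.ZipFile(DECRYPTED) as zf:
      # Expect a manifest, vol/<dblock name> entries, and possibly
      # list/<blocklist hash> entries.
      for entry in zf.infolist():
          print(entry.filename, entry.file_size)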

You could sample other broken recreates to see if there's a pattern of negative block sizes as in the image above.
A well-running Recreate reads only dlist and dindex files; dblock files are downloaded only if info is missing.
About → Show log → Live → Verbose can show you where you are. A full dblock download is very slow.

duplicati-i130c97a0ed6540f7a85191b8cb680ff2.dindex.zip.aes (size 541)
duplicati-i7b41f10da9ff4c5cb1ff5e80ea6e8e52.dindex.zip.aes (size 541)

The file size in your post immediately looked familiar. I do know where those come from: when I run repair, those files are uploaded as placeholders for missing files. I've seen that way too many times.

The best thing to do is to decrypt and unzip such a file that you have lying around, to see what's inside it.
This may or may not be relevant to the original issue, but it's a notable finding. One way to produce such a file the other way around is to delete the file in the vol folder that is supposed to give information on a dblock of that name.
Encrypting that with SharpAESCrypt gave me a 541-byte file.

Example of a small file in a dindex vol directory. It's named for the dblock it describes, and its contents are:

{"blocks":[{"hash":"VZrq0IJk1XldOQlxjN0Fq9SVcuhP5VWQ7vMaiKCP3/0=","size":1},{"hash":"hrBVpOwh+sSaiR+cg+GWpnSsnYPxuOKQ7/lmZBhkYM0=","size":137}],"volumehash":"kGXjjZVyOMpDk3O+ViQ/ODHR9K7JnXTbD1GJuiezko0=","volumesize":677}

Almost certainly the size 1 is from a 1-byte source file, and the size 137 is its metadata. I see 137 a lot. :wink:
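
Once decrypted, those vol entries are easy to sanity-check in a few lines. This is only a sketch against the JSON layout shown above (the field names blocks, hash, size, and volumesize come from that example; the zip path comes from the earlier decryption sketch):

  # Sketch: parse vol/ entries from a decrypted dindex and flag odd block sizes.
  # Field names come from the JSON example above; "dindex.zip" is a placeholder.
  import json
  import zipfile

  with zipfile.ZipFile("dindex.zip") as zf:
      for name in zf.namelist():
          if not name.startswith("vol/"):
              continue
          info = json.loads(zf.read(name))
          print(name, "blocks:", len(info["blocks"]),
                "declared volumesize:", info.get("volumesize"))
          for block in info["blocks"]:
              if block["size"] < 0:
                  print("  NEGATIVE size:", block["hash"], block["size"])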

If a dindex file goes missing, Repair is supposed to rebuild it from the database. Are yours missing content?

TL;DR: lots of random thoughts during the process… Result: inconclusive.

If a dindex file goes missing, Repair is supposed to rebuild it from the database. Are yours missing content?

Well, I just said that I've seen those 541-byte files so many times. I've never checked what's inside, but my guess was totally correct: 541-byte i-file → decrypted, 232 bytes (zipped) → unzipped, a 146-byte manifest. Content:
{"Version":2,"Created":"20180627T164703Z","Encoding":"utf8","Blocksize":102400,"BlockHash":"SHA256","FileHash":"SHA256","AppVersion":"2.0.2.1"}

Usually after that I've got low expectations about restorability. Even if test (100%) says all is good, a full restore test is likely to fail. (No, I didn't check this time; this is a gut feeling and I usually do it like this.) Usually the best approach is to delete the database and rebuild it, and only then will the next backup fill in the probably lost blocks.

This might have been initially caused by a broken compact. Just my best guess so far.

Yet I just realized this was a bad approach, because this file might not have been created by repair; it might be an empty file uploaded by Duplicati, since I've configured it to upload empty files… I'll need to dig up a repair-created 541-byte file the next time I encounter one.

I think this is strongly related to this problem, but now I've been digging a bit deeper into what's causing the issue, instead of just observing it:

Files lost during compaction, backup broken, and requires manual fix.

Based on the latest tests with the latest version, exactly the same problems still exist after several years. Data is dangerously corrupted and can't be restored, even when test and repair say there's nothing wrong with it.

It's just crazy for a backup application to be this seriously, systemically, and inherently broken.

It would be interesting to see if you could repeat your problem using back-end storage that is much more durable, like an S3 bucket.

After discussing the flaws and issues with the Restic team, I'll install Restic in parallel for a few key systems and monitor the situation for half a year… Then it should be pretty clear…


I'm interested to hear your experience!

Just a few things I want to add about Duplicati:

  1. I've never complained about hash mismatches.
  2. The biggest annoyance is that test passes but restore fails. This is something which definitely needs to be addressed, because if test passes then, afaik, restore should work too (!). And that can't be blamed on durability / bit rot.
  3. Duplicati is around 98% of the way to the goal, but the very small remaining issues ruin it.
  4. I also reset all Duplicati backup sets, because some are several years old and some of the issues could have been caused by older versions. Without a reset, the comparison wouldn't be fair.

Restic:

  1. I assume Restic will perform a bit better.
  2. I'm worried about Restic's memory footprint; the environment I'm running the backups in has large, mostly stale data sets, so Restic could cause memory issues. This could ruin point 1.
  3. Restic is very nicely wrapped in a single binary, which makes it wonderfully easy to just drop onto servers with a script.
  4. In some (rare) cases, I expect variable-block-size chunking to improve deduplication a lot, but this applies only to a few very specific data sets.
  5. On the other hand, point 4 could get ruined by the fact that I've read that large files usually get re-chunked a bit later again, which could make the process more inefficient, because in these cases it's exactly one very large file getting backed up.
  6. I'll use Restic with rest-server. Data is stored on the file system, the same as the Duplicati backups.
  7. Last edit for now: I've configured everything and the first backup is done. New backups will be created every 6 hours, and data retention (forget with --prune) is configured to be very similar to Duplicati's retention, though not exactly the same.
  8. Restic creates a lot of quite small chunks, pretty similar to Duplicati's default. Not a great fit for SMR storage / cloud storage where there might be per-file access latency.
  9. I also miss Duplicati's threshold option. Now when Restic forgets something, it also prunes the stuff out of the DB at the same time. Depending on environment and situation this is a good or a bad approach. In my case, I don't see any benefit from immediate pruning, because we've got plenty of basically free storage. Pruning (purge / compact) just adds extra I/O if it's done too often.

But I'll collect data and configure logging so that I've got full logs for the whole 6-month period, so if there's something I need to review, I can dig into the data.


Unfortunately it seems that Duplicati is working, ahem, the same way it has been working earlier:

I haven't had time to run the Restic restore tests yet; I'll do that next. But the fact is that Duplicati seems to be still just as broken as it was, and it… well, I've got no words for it anymore…

@drwtsn32 - Ref: Is Duplicati 2 ready for production? - #73 by drwtsn32

I'll reply here, because the ready-for-production thread is the wrong place for this topic. It was just important for people to know that there are several trust issues, and that it's extremely important to actually restore-test from the destination, without the local blocks or the Duplicati database at hand.

About the backend: I upgraded the FileZilla Server version. I updated all Duplicati installations to the latest version and triple-checked everything on the backend, including all backups and other massive files, with hash validation, so I know that the backend itself is not corrupting data. We also ran memory tests and full storage-system tests, on several levels. Also, all transports are over TLS, so there's an integrity check.

Then to the problem, and why I don't suspect the backend to begin with (but I also checked it).
I saved the smallest backup + source configuration, so I can repeat the problem. I can't really reproduce how the problem initially formed, but I can reproduce the restore error, plus run test and repair, which both end up saying that everything is good.

Unfortunately I'm also a business software developer, so I have a decades-long history of troubleshooting complex business systems, including my own code, which fail in multiple different agonizing ways. Just today I told my colleague that if this were my own code, I would be really pissed, because I would know it's broken, but I couldn't easily and quickly tell exactly what led to it.

The trail is similar: something ends up being wrong in the destination. It could be a leftover file or something else that messes up the restore logic. I can't say with confidence whether it's the backup or the restore that is causing the problem.

Let's recap what I know. The latest version of Duplicati seems to handle the roll-forwards much better than the old versions, which, well, in some cases systemically failed. With the latest version I see that if it uploaded a file A that wasn't finished, it gets deleted or re-uploaded. And the compaction, even if aborted, seems in most cases to work as I would expect a transactional system to work.

This is not a single event. I run, well, many backup sets daily. In this case I just had two failing sets, but both with exactly the same problem. And now it's important to remember that the test with full remote verification passes. So there are NO MISSING FILES, and ALL THE FILES HAVE THE CORRECT HASH… if we can assume that the test / repair itself isn't flawed.

I also reset the failing sets today and re-backed up from source. Both restores were successful after that, without any configuration changes, so the configuration isn't the problem either.

Based on all this, I'm pretty confident that there's a quite simple (!) logical mistake or implementation error somewhere. We're very, very close to the goal.

Yet, as said, my experience is generic: file transfers, databases, transactions, and other failures with related logic. I'm not a Duplicati expert. How could we move forward from this point? I've collected all the forensic evidence about the event.

It would all be so wonderfully easy if the backup set didn't contain data that I can't share with you guys; I would just upload the whole packet somewhere. But in this case that's not possible. Still, the data is practically meaningless to you, and I've got no problem with IP addresses, or even credentials, because it's all on a closed system.

I assume you, as a Duplicati expert, know best what I should share with you. I really would love to help; this is very frustrating for me as well. If I didn't like Duplicati I would have dropped it and swapped to Restic a good while ago. Yet I really did think hard about that again today. But this is hopefully the last logical issue somewhere?

I'll share the numbers with you confidentially, so you'll understand how many backups I'm running daily and how rare this issue is. Btw, even the filenames in the backup are totally meaningless, as long as the file content itself is kept private. So I could share directory listings, the whole database (the Duplicati database), as well as the destination file listing and all the hashes of the files in the destination folder.

And depending on the database restructuring phase, in theory this all could be caused by something like one extra file being left at the target. Sure. Yet I still don't understand why the repair wouldn't clear it up. That's the major question: why are there two conflicting truths simultaneously?

How should we continue? After thinking a while, my assumption is that I could give you everything except the dblock files and the username and password to the destination: the full command lines I use for backup & compact, etc., without uid/pwd. I would also include file hashes of all files. Would that be helpful? Would you need something else?

Update: data packet delivered with logs and a reproducible case where tests pass and restore fails, with full debug & profiling logs. Including the database & iblocks and dlists.

I think it would be good to create a GitHub issue to track this bug down, if you haven't already. I am curious: does it occur if you try to restore the files locally? (Move the backup to a machine with Duplicati on it and run the restore on that system.)

I think the most valuable way to figure out how this occurred is to reproduce it from square one on a test dataset. Only you know the exact process you went through to end up in your current state, so you will need to do that work and outline the exact setup/configuration and steps to reproduce. It would be good to have the full profiling log; you can regex-replace any sensitive filenames in it (if it is a test dataset this shouldn't be needed). Share it in the GitHub issue.

Seeing as you have run into this issue multiple times, I would say we have a very good chance of identifying the root cause.

There's actually a PM thread on this. I've been thinking maybe you should be on it to help with the SQL.
What I'd prefer is for it to be done more in the open, because then other people can learn or help.

This would be a whole lot simpler if it were reproducible with the profiling log and database histories.
By that, I mean a UNIX-style rotation with names containing .1, .2, etc., so as to see the earlier databases instead of only seeing the suspicious dindex file that compact built (out of what?) and the recreate failure.

My own production backup has even more history, as I rclone sync the destination to local, using
--backup-dir to also preserve the now-deleted-at-destination files that the compact compacted…

Until (and unless) the available data gets better, I have some things to try to increase my understanding.
I was trying to chop apart some SQL the other day, getting syntax errors, and wishing for your help.

ErrorID: DatabaseIsBrokenConsiderPurge
Recreated database has missing blocks and 3 broken filelists. Consider using "list-broken-files" and "purge-broken-files" to purge broken data from the remote store and the database.

is the result of a big SQL query with UNION, which I took apart (with difficulty) in several ways, finding various oddities in various places. The Block table wound up with entries having -1 for VolumeID, which probably caused some uproar. The number of these exactly matches the length of a blocklist in that dindex file's list folder (32 bytes per SHA-256 blocklist hash). There's a second file in the same dindex with exactly the same name (the Base64 of the SHA-256) that looks like a superset of the first file…

which happens to be the last block hash in the blocklist whose hash matches its name; the other doesn't.
I might have found where VolumeID becomes -1 on DB recreate, but the meaning of Size is elusive…

I fixed the problem by deleting the mysterious longer second copy of the blocklist. That turned out to be the more reasonable-looking shorter blocklist repeated twice. How that could happen would need a trip through the C# code and SQL. Because I don't develop in either, it would be nice if the debugging were more open.

I show it in 7-Zip because that's where I did it, and because File Explorer is confused by the duplicated files. When it tries to show me the information on the longer file, it shows the size of the shorter one, so I don't trust it.
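
A few lines of scripting sidestep both tools and show what a blocklist actually holds: 32 raw bytes of SHA-256 per block hash, whose Base64 form is what the database and the vol entries use. A sketch, with the input file name as a placeholder:

  # Sketch: split a blocklist file (from a dindex "list" folder) into its
  # SHA-256 entries, 32 raw bytes per block hash, printed Base64-encoded.
  # The input file name is a placeholder.
  import base64

  with open("blocklist-file", "rb") as f:
      data = f.read()

  assert len(data) % 32 == 0, "blocklist length should be a multiple of 32 bytes"
  hashes = [base64.b64encode(data[i:i + 32]).decode() for i in range(0, len(data), 32)]
  print(len(hashes), "block hashes")
  for h in hashes:
      print(h)
  # A blocklist that is really the shorter list repeated twice shows up here as
  # the same sequence of hashes appearing twice in a row.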

As to why one participant in the PM found that a Direct restore worked OK: only versions 3, 4, and 5 reference the affected blocklist in that dlist file. Presumably the GUI restore just picked the latest version, 0.
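
A quick way to check which versions reference a given blocklist is a plain text search inside each decrypted dlist, which avoids depending on exact JSON field names. A sketch, assuming the per-version file list inside a dlist zip is named filelist.json; the hash and folder are placeholders:

  # Sketch: see which backup versions (dlist files) mention a given blocklist
  # hash. Assumes the entry inside each dlist zip is named "filelist.json";
  # the hash and folder are placeholders.
  import zipfile
  from pathlib import Path

  TARGET_HASH = "REPLACE_WITH_BASE64_BLOCKLIST_HASH"

  for dlist in sorted(Path("decrypted-dlist").glob("*.dlist.zip")):
      with zipfile.ZipFile(dlist) as zf:
          content = zf.read("filelist.json").decode("utf8", errors="replace")
      if TARGET_HASH in content:
          print("referenced in", dlist.name)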

FWIW, my "fixedup" folder for this experiment is the decrypted dlist and dindex files, plus empty files standing in for the former dblock files. On those, I leave the .aes extension because they're referenced in the dindex that way, although it can actually deal with the names as .zip. You just have to endure some messages like this:

Unable to find volume duplicati-b83bdf633ed5a424e929c69e5257d94f3.dblock.zip.aes, but mapping to matching file duplicati-b83bdf633ed5a424e929c69e5257d94f3.dblock.zip

EDIT:

The troublesome blocklist showed up in the Mar 19 backup and continued through Mar 20 and Mar 21; however, the first place it actually appears twice in the provided current snapshot is (according to the dindex) duplicati-b83bdf633ed5a424e929c69e5257d94f3.dblock.zip.aes. Please check yours to see what's in it. That file was uploaded by a Mar 22 compact, but earlier databases, dindex files, and logs aren't available. Depending on how the compact code works, maybe old issues can propagate. This needs a developer's look.

Well, let me finish testing the DB rebuild performance changes, and hopefully you guys can then get a more public foundation for current progress on the issue: maybe a full explanation of the configuration and layout of the systems running into the backup issue. Then we can brainstorm how to create a reproducible test case for the problem. I don't really like PM conversations for this type of stuff because they increase the chance of duplicated work, which isn't good for open-source projects that are short on manpower. I understand the C# and the SQL, but my domain knowledge is lacking, so we definitely need an open discussion on GitHub to make that type of conversation easier.

I think it would be good to create a GitHub issue to track this bug down, if you haven't already. I am curious: does it occur if you try to restore the files locally? (Move the backup to a machine with Duplicati on it and run the restore on that system.)

Yes, it does. Anyway, currently the backup restore tests are run over SMB and a 10 Gbit network. It doesn't make any difference (confirmed) if I copy the files to a local disk.

I've provided, in one private thread, a data set which causes the database restore to crash. The filenames aren't that sensitive, and those are what would actually matter. But this is just one of the problems; now I think there are at least three overlapping "show stopper" level quality issues.

@ts678 If tarianjed needs the backup data I provided, it's OK to forward it.

Also, as I said, the database rebuild performance is an absolute nightmare. It seems that it's usually related to the message "registering missing blocks". The same kind of backup can restore in 5 minutes versus 5 days, depending on whether that message is seen. I haven't got 100% proof of that, but now I usually just abort the restore test as soon as I see that message. There might also be some issue with local blocks, because if I do not disable local-blocks, that message seems to make the performance even worse. This is just a feeling again, no guarantees. But both seem to indicate a situation where things first go slightly wrong and then the "fix" step is also problematic and leads to a very bad end state.

@ts678 That sounds good; whenever issues are located, it helps. And for sure, sometimes deep logic issues are really agonizing to track down. Been there, done that.

I hope the FTP issue thread also helps to fix several overlapping issues, because it clearly points out the problems I've strongly suspected earlier.