Slow - Missing volumes - Attempting to replace blocks from existing volumes

Sami_Lehtinen · September 12, 2018, 6:17am

I don’t know exactly what the process for recovering blocks is. All I know is that it’s extremely slow. And due to deduplication, probably also useless?

Hash errors have led to this situation: verify starts to give errors about bad blocks. Because of those errors, the bad blocks have been deleted. -> Everything is working, as long as nobody tries to restore the data. Which is of course an extremely dangerous situation.

When restore starts running, it says that blocks are missing, and starts recovery. But this recovery is somehow botched, I don’t know how, and it’s ultra slow. Based on my best linear projection skills, this recovery phase would take around 70 days for 22 gigabytes of backup (Duplicati files). Only after that could the restore process start. I’m also expecting that it can’t recover the missing blocks, so the time wasted is truly wasted.
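For reference, a linear projection like the one mentioned above can be sketched as follows. This is only an illustration of the arithmetic, not how Duplicati reports progress. The function name and the sample figures are hypothetical, since the measurements behind the 70-day estimate are not given here.

```python
def projected_days_remaining(done_units, total_units, elapsed_hours):
    """Linear ETA: assume the remaining work proceeds at the average rate so far.

    The units are arbitrary (blocks, volumes, percent), as long as done_units
    and total_units use the same one.
    """
    if done_units <= 0:
        raise ValueError("some progress is needed before extrapolating")
    remaining_hours = (total_units - done_units) * elapsed_hours / done_units
    return remaining_hours / 24


# Hypothetical figures: 1 of 70 equal-sized work units finished after 24 hours.
print(projected_days_remaining(1, 70, 24))
```

A projection like this is only as good as the assumption that the rate stays constant, which may not hold if the slow phase is dominated by random I/O.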

This isn’t really a great thing from a DR perspective.

Any thoughts about this? Anyone? Of course the volumes shouldn’t go missing / get corrupted in the very first place, that’s the root cause. But the recovery process itself is also somehow off.

Edit: Forgot to mention that on the source server, repair and verify both state that there’s nothing wrong with the backup set. So the issue arises only when trying to restore data. - Ouch!

Continued: I’ve retested the verify, repair and restore with the latest version, 2.0.3.11_canary_2018-09-05, and the results are still the same.

TPSMono · September 12, 2018, 5:35pm

As far as I know, dblock, dindex, or dlist files cannot be reconstructed if they become corrupt/missing. You must use the switch --purge-broken-files to remove the files from your backup that existed in the corrupt/missing dblock. You can use --list-broken-files to see what files are affected before using --purge-broken-files. These two switches will only work if a dblock has been marked as missing or corrupt in the database, usually by backup or repair processes. You mention a source server. Are you accessing these backup files from multiple Duplicati setups at the same time? If two different Duplicati installations are accessing the same backup set using their own databases, it is possible for one Duplicati setup to delete files created by the other setup, because it does not know about the new files and will mark those as extra files, then delete them.

One thing you can try is deleting/renaming the sqlite database for this backup, then run repair to rebuild it. This should hopefully discover the bad dblocks and mark them bad, so you can then use --list-broken-files and --purge-broken-files to get back on track.
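The multi-installation hazard described above comes down to comparing what is on the destination with what one local database knows about. A minimal sketch of that comparison follows. It is not Duplicati’s actual code: the function name and the file names are hypothetical and only illustrate the idea.

```python
def classify_remote_files(remote_files, known_to_db):
    """Compare a destination listing against one installation's database.

    "extra"   = on the destination but unknown to this database
    "missing" = recorded in this database but absent from the destination
    """
    remote = set(remote_files)
    known = set(known_to_db)
    return {
        "extra": sorted(remote - known),
        "missing": sorted(known - remote),
    }


# Hypothetical file names. Setup A wrote b1 and b2, setup B later wrote b3.
destination = ["b1.dblock", "b2.dblock", "b3.dblock"]
setup_a_database = ["b1.dblock", "b2.dblock"]

# From setup A's point of view, b3 looks like an extra file.
print(classify_remote_files(destination, setup_a_database))
```

If setup A cleaned up everything it classified as extra, setup B’s volume would disappear without B’s database knowing about it.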

ts678 · September 12, 2018, 11:40pm

How big is the restore? One file, several, all? If you mean a dblock file hash error, that varies with the restore.

Could you clarify “the bad blocks have been deleted”? Verify shouldn’t do that. If you did it, did you save the files?

When considering what will be affected if a dblock file gets deleted, you could use The AFFECTED command.

Attempting to replace blocks from existing volumes is (per Google search) actually in source file:

https://github.com/duplicati/duplicati/blob/master/Duplicati/Library/Main/Database/LocalRecreateDatabase.cs

so seemingly Duplicati was trying to fix the database, not the destination. Sometimes it’s hard to distinguish.

The default “repair” acts like it might do as little as a listing of the destination files, maybe checking sizes too. With a simple backend (many varieties, no special help), downloading everything to look inside gets too slow.

Unfortunately, that’s what recreating a database requires (if one opts to use the delete and recreate button). You can, if you like, use the menu About --> Show log --> Live --> (pick a level, maybe Information) to watch.

The “verify”, I believe, is similar to the verify done after the backup, in that it tests a random sampling of files. For the backup, one can set backup-test-samples, and for The TEST command, there’s a <samples> value. Using “all” causes all files to be verified. It’s probably almost as slow as a full restore, but it would be thorough. Running that option might require you to use the Commandline menu item on the job, then adjust it for “test”.

TL;DR is the default “repair” and “verify” seem to favor speed (they’re quick, right?) over slow-and-thorough. Recreating the database has no choice but to be slow; however, you can watch it to make certain it’s moving.
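To put a rough number on why sample-based verification can keep passing while some volumes are bad, here is a small sketch. It assumes samples are drawn uniformly at random without replacement, which may not match Duplicati’s actual selection logic. The function name and the example counts are illustrative only.

```python
from math import comb


def chance_sample_misses_all_bad(total_files, bad_files, samples):
    """Probability that a uniform random sample contains no bad file.

    Hypergeometric case: choose `samples` files out of `total_files`
    without replacement, where `bad_files` of them are corrupt.
    """
    if samples > total_files:
        raise ValueError("cannot sample more files than exist")
    return comb(total_files - bad_files, samples) / comb(total_files, samples)


# Illustrative counts: 512 remote files, 4 of them bad.
for k in (1, 10, 100, 512):
    print(k, chance_sample_misses_all_bad(512, 4, k))
```

With small samples, a single run is very likely to miss a handful of bad files, which is consistent with the quick checks favoring speed. Only testing all files makes that probability zero.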

Sami_Lehtinen · September 13, 2018, 7:31am

I’ve now double-checked everything, and there’s something very fundamentally wrong here. Repair, verify, backup, purge-broken-files, list-broken-files all pass without any issues and indicate that everything is ok. But when I try to restore, the backup set is broken. I could try running a full verify, but that could take a really long time.

To your questions:

I fully agree that reconstructing data without redundancy won’t work and is guaranteed to fail. Unless some kind of erasure coding or other more advanced techniques are being used, which of course also generate a lot of overhead.
As mentioned, list-broken-files and purge-broken-files indicate that everything is ok.
Of course I’m using multiple servers. It would be inherently kind of stupid to test only restoring to the origin server. Isn’t the whole point of backup being able to recover from the loss of that environment? - But to answer this question in a constructive way: yes, I’m using another server to restore the data, separate from the origin server and the backup storage platform. -> At least three different servers are involved. - The restore server restores data directly from the storage, of course. Requiring the origin server to be around would defeat the whole purpose of off-site backups. - And of course there are no parallel databases; only the origin server is using a database. The restore environment is configured to run without a database, and local blocks are disabled as well.
Deleting the database and running repair on the origin server would probably detect the issue, which now arises only when restoring. - Good approach, I like that. Yet it inherently sounds like the very wrong way to do it, and requiring it to be done this way means that the software is unreliable and the backup set can easily be in a state where everything looks good until you need to restore. - Which is a pretty devastating situation and defeats the whole purpose of backups.
The question is: why would deleting the database and repairing it be required to detect clearly missing files?

Thank you

Sami_Lehtinen · September 13, 2018, 8:08am

Backup size: Source data is around 100 gigabytes, and 100 files. The de-duplicated backup sets with reasonable history (not going into exact days) are around 22 gigabytes, and right now there are exactly 512 Duplicati files, including all file types. Mostly large files with just a limited number of pages being modified in every data set. -> Successfully recovering such files basically requires successful access to all previous versions. -> This is just the “problem” which de-duplication creates. If there’s anything wrong in the backup history, restore will fail.
Deleted files: Backup verify repeatedly stated that the file size is incorrect for the file. This triggers extra alerts several times daily on our side, in which case I usually delete the file(s), because I know that there’s no way to recover from that situation properly. -> A previously corrupted file becomes a missing file. Technically that shouldn’t change much of anything. - The root cause question of course is why there are corrupted files, where the hash mismatches and/or the file size is incorrect. That’s the great question which requires answering. If that weren’t a problem, this secondary problem wouldn’t be a problem either. Yet of course each part of the process should recover as well as possible from the different problems which might arise at different stages of the process.
Basically there’s no point trying to fix the database, because in the restore environment there’s no database to begin with. In any case, the reconstructed database should already be up to date(?). I can see that it’s updating some temporary database while DoSing the system with random I/O. In this situation there’s no need to fix anything in the remote storage or on the origin server. All that can be done is fixing or “trying to recover” from the lack of a few b-files on the restore server. -> For some reason that’s an incredibly slow process. Maybe it’s doing a lot of small database transactions? That’s a sure way to completely ruin performance, and this is what it looks like to me, based on a quick “what it seems like” analysis.
Agreed that looking inside every file is way too slow. That’s why (full) restore / verify tests are run in a separate environment from the origin / storage server(s) and seldom tested. This is also why it makes me go kind of ballistic when I notice at this stage that the backup is practically unrestorable, when everything seems to be good on the surface. - In our case the full verify isn’t that slow, because it can be run on the same rack with a 10G interconnect. -> It’s a lot faster than running anything from the remote “origin” servers. -> That’s also the reason why I avoid doing anything backup related on the origin servers for as long as possible.
Btw. Recreating the database isn’t a problem for us; I’ve done that thousands of times and it’s fast enough. Every restore test does that. The real problem is triggered when the “attempting to replace blocks from existing volumes” phase is started.
But I’ll check a few other backup sets with the verify all, just to be sure, and out of curiosity. Those worked flawlessly. In both cases I used no-local-db no-local-blocks and the restore is actually very (?) fast and nice, nothing to complain about. All of these tests were run on purpose using spinning rust drives.
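The guess above about many small database transactions can be illustrated with a generic SQLite sketch. This is not Duplicati’s code and says nothing about what the recovery phase really does. The table layout, function names, and row count are hypothetical. It only shows the general effect of committing once per row versus once per batch on an on-disk database.

```python
import os
import sqlite3
import tempfile
import time


def insert_rows(path, rows, batch):
    """Insert rows into a demo table, committing per row or once at the end."""
    conn = sqlite3.connect(path)
    conn.execute("CREATE TABLE IF NOT EXISTS block (hash TEXT, size INTEGER)")
    conn.execute("DELETE FROM block")  # start from an empty table each run
    conn.commit()
    start = time.perf_counter()
    for row in rows:
        conn.execute("INSERT INTO block VALUES (?, ?)", row)
        if not batch:
            conn.commit()  # one transaction, and one sync to disk, per row
    conn.commit()  # the only commit when batching
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM block").fetchone()[0]
    conn.close()
    return count, elapsed


def compare(n=200):
    """Run both variants against a temporary on-disk database."""
    rows = [(f"hash{i}", 1024) for i in range(n)]
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "demo.sqlite")
        per_row = insert_rows(path, rows, batch=False)
        batched = insert_rows(path, rows, batch=True)
    return per_row, batched


(per_row_count, per_row_time), (batch_count, batch_time) = compare()
print(f"per-row commits: {per_row_count} rows in {per_row_time:.3f}s")
print(f"single commit:   {batch_count} rows in {batch_time:.3f}s")
```

Both variants store the same rows. On spinning disks the per-row variant is typically much slower, because every commit waits for the disk, but the actual timings depend on the hardware and filesystem.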

It seems that this problem is probably due to at least three overlapping problems.

First, broken dblock(s): four files are actually missing now, because they were first corrupted and then deleted (by me)
Secondly, failure to detect that unless a full restore is attempted
Thirdly, extremely slow recovery while trying to recover