Duplicati Database Recreate running for a week

Database and destination internals: so you removed a value here?

blocklists: one or more blocklist hashes of a large file

The blocklist hash leads to the blocklist which defines the file. I don’t see how any repair can fix that.

It’s looking for references to blocks that aren’t in the index. Without a reference it can’t identify what’s missing.
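To put that in concrete terms, a minimal sketch (made-up hash values, nothing from Duplicati’s actual code): recreate can only report a block as missing if something still references its hash.

# Hypothetical illustration of why a lost reference cannot be detected.
referenced = {"hashA", "hashB", "hashC"}  # hashes referenced by dlist entries / blocklists
present = {"hashA", "hashC"}              # hashes actually found in dindex/dblock files

missing = referenced - present            # {'hashB'} can be reported and searched for
# A block whose reference itself was lost never appears in `referenced`,
# so no repair pass can even know it should exist.
print(missing)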

This syntax was introduced in Python 3.9. Any idea how one was supposed to do this on earlier ones?

Path objects

Path.open(mode='r', *, pwd, **)

Invoke ZipFile.open() on the current path. Allows opening for read or write, text or binary through supported modes: 'r', 'w', 'rb', 'wb'. Positional and keyword arguments are passed through to io.TextIOWrapper when opened as text and ignored otherwise. pwd is the pwd parameter to ZipFile.open().

Changed in version 3.9: Added support for text and binary modes for open. Default mode is now text.
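For completeness, a rough pre-3.9 equivalent (a sketch with a hypothetical archive and member name, not the library’s own implementation) is to call ZipFile.open(), which always yields a binary stream, and wrap it in io.TextIOWrapper yourself when text is wanted:

import io
import zipfile

with zipfile.ZipFile("archive.zip") as zf:        # hypothetical archive name
    with zf.open("member.txt") as raw:            # binary stream on any Python 3.x
        with io.TextIOWrapper(raw, encoding="utf-8") as text:
            print(text.read())                    # what Path.open(mode='r') gives in 3.9+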

yet fix it it does…
The blocklist block hash is found in the dlist file(s). When it’s missing from the index file, it can still be found in the dblock file - the index file holds a copy of all the blocklist block hashes in the corresponding dblock file. It’s read here:

The readBlockList routine does read the file from the downloaded dblock file.
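Roughly what that amounts to, as a standalone sketch (hypothetical helper and file names, assuming the .aes layer is already decrypted and the default SHA-256 block hashes of 32 bytes; this is not the actual readBlockList code):

import zipfile

HASH_SIZE = 32  # assuming the default SHA-256 block hash (32 bytes)

def read_blocklist(dblock_zip_path, member_name):
    # The dblock is a zip archive; the blocklist is just one member,
    # whose name corresponds to the blocklist hash from the dlist.
    with zipfile.ZipFile(dblock_zip_path) as zf:
        data = zf.read(member_name)
    # A blocklist is a plain concatenation of fixed-size block hashes.
    return [data[i:i + HASH_SIZE] for i in range(0, len(data), HASH_SIZE)]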

I will send you, by private message, the address where you can download the whole backend.

Without any blocklist hash in its blocklists list, it has no idea what data the file contains.
Possibly a “fix” would be to delete the file as hopelessly undefined – not exactly a great fix.

Not always, although taking anything other than the default seems bad (except maybe for special tests).
Maybe you should make sure that the new code handles all those cases reasonably presentably.

C:\Duplicati\duplicati-2.0.7.2_canary_2023-05-25>Duplicati.CommandLine.exe help index-file-policy
  --index-file-policy (Enumeration): Determines usage of index files
    The index files are used to limit the need for downloading dblock files
    when there is no local database present. The more information is recorded
    in the index files, the faster operations can proceed without the
    database. The tradeoff is that larger index files take up more remote
    space and which may never be used.
    * values: None, Lookup, Full
    * default value: Full
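For reference, the option is passed like any other advanced option on the command line (placeholder destination and source path here, with the default value shown explicitly); something like:

Duplicati.CommandLine.exe backup ssh://example.com/backup C:\data --index-file-policy=Full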

The dlist provides a hash. When Duplicati, in the dblock handling (the third pass), reads a dblock file, it has the list of all ‘files’ (the zip directory). When it finds a ‘file’ name corresponding to the hash, it reads this ‘file’ and interprets it as a collection of block hashes. That’s why it can find the blocklist hash there, while it can (theoretically at least) be missing from the index file’s list/ subdirectory.

If the user doesn’t want index files to be useful, that’s their call, but it would be rather strange to then complain that database repair is slow. The whole purpose of index files is to make db rebuilding faster.

There is no new code. The branch is not able to fix anything new. The only change is to make it finish faster, whether it fails or succeeds. I think it is valuable to learn in 12 hours, rather than in 12 days, that Duplicati will not be able to fix it, that’s all. If it does fix it, so much the better, since almost nobody will wait 12 days.

BTW, I think I forgot to say that the link I gave you is to an SFTP server.

What I was trying to clarify was whether this got removed from the blocklists ref list in the filelist.json:

Processing a large file (Duplicati manual)

  {
    "type": "File",
    "path": "C:\\data\\myvideo.mp4",
    "size": 215040,
    "hash": "4sGwVN/QuWHD+yVI10qgYa4e2F5M4zXLKBQaf1rtTCs=",
    "blocklists": [ "Uo1f4rVjNRX10HkxQxXauCrRv0wJOvStqt9gaUT0uPA=" ]
  }

The last line is a “block ref that IS a special blocklist hash” but not a block. Loss is not repairable.
The blocklist itself is a concatenation of block hashes, and is in the dblock and usually its dindex.
From the pushback, I will assume you’re removing the lists/ blocklist copy, not the dlist ref.
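The relationship can be checked by hand. A minimal sketch, assuming the defaults (SHA-256, base64 as in the dlist) and made-up block hashes:

import base64
import hashlib

# Raw 32-byte hashes of the file's blocks, in order (made-up values here,
# e.g. what read_blocklist() above would return for the real file).
block_hashes = [b"\x00" * 32, b"\x11" * 32]

# The blocklist block is just their concatenation...
blocklist_blob = b"".join(block_hashes)

# ...and the "blocklists" entry in the dlist is the base64 of its SHA-256,
# the same hash used to locate the blocklist in the dblock and in the
# dindex's list/ copy.
blocklist_ref = base64.b64encode(hashlib.sha256(blocklist_blob).digest()).decode()
print(blocklist_ref)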

It wasn’t a speed issue. The “presentably” meant no messages suggesting things were wrong.
I had the “Unexpected changes” warning in mind, but I’m not sure if dindex downgrade does it.

I have only messed with dindex files.

This is very much to make it more obvious which file is causing a problem. This message is more in the hope that once a full dblock scan becomes more useful, it will bring more chances to investigate the root cause(s).


Could you redo the build? It seems the artifacts have expired.

Done


Got it!

Come to think of it, there are definitely instances where the machine could have lost power during the 12-hour-long backups, which somehow could have contributed to the excess of index files. Bit of an edge case, but a real risk at this site unfortunately.

Wow, a 12-hour backup! Are you recreating your 400 GB backup every day, or is your network very slow?

Lots and lots and lots of mostly smallish files in a few locations, including network shares. The change rate isn’t particularly high; it just takes ages to run through and check all of it.

Kopia, in comparison, takes minutes, though I don’t particularly like their software.

Too bad, since it means that you can’t use USN (well, I think; I never tried it in this case, but it would make sense for USN not to work for network drives). If most of your files are local, maybe it could make sense to have a local backup (using USN) and an external backup.

Right, here goes: installed the canary, downloaded the Duplicati destination folder, imported the old job with the original metadata, and started an initial repair.

Okay, we’re back in the loop of death.

Pass 3 of 3, processing blocklist volume 892 of 7481

Seeing a few of these:
Unexpected changes caused by block duplicati-xxxx.dblock.zip.aes

Right, if you are watching the progress, can you give a rough estimate of the time taken to process a block when there are ‘unexpected changes’ and when there are none? Can you note at least a few of the offending blocks for further analysis, if it succeeds in an acceptable time?

Jotting some down as I spot them

Unexpected changes caused by block duplicati-b231b21ca170042aeb26b60ce885f5849.dblock.zip.aes
Unexpected changes caused by block duplicati-b2d3bcf0cd64547afa9f9a5710c01b57e.dblock.zip.aes
Unexpected changes caused by block duplicati-b2e58018b4d074407a1c182fc109ccdfb.dblock.zip.aes

We’re going a bit faster, but I’m not sure if it’s related to the code changes or to me overspeccing the VM as well.
Pass 3 of 3, processing blocklist volume 1404 of 7481

Blocks take seconds when there are no issues; I haven’t spotted an ‘unexpected change’ block completing while I’m watching, anyway.

Unexpected blocks seem to take 7-8 minutes to clear before it starts the next block.

This seems to nail it; before the change, each of the 1400 blocks done so far would have taken 7 minutes, so about 9800 minutes for the 1400 blocks (that is, more than six days).

Yep!
Aug 23, 2023 12:32 PM: Pass 3 of 3, processing blocklist volume 2559 of 7481

We’re on track for this taking days rather than weeks :smiley:

Compare with the other machine, where it has still been running since this forum thread was made:
23 Aug 2023 12:26: Pass 3 of 3, processing blocklist volume 1990 of 7481

At this speed it should take 1.5 days. Still abnormally slow; I don’t know why, since for me running a full test of a 1.7 TB backup (a task that does very similar work, since here the block size is 400K) takes 7 h on a server that doesn’t even have an SSD. Maybe that’s because it has fewer files (more big files).

Optimizing the database could be worth it at this point - I was using

CUSTOMSQLITEOPTIONS_DUPLICATI=cache_size=-300000;temp_store=2;journal_mode=WAL;synchronous=Normal

and

EXPERIMENTAL_DBRECREATE_DUPLICATI=1

environment variables.
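For anyone wanting to try the same thing: both are ordinary environment variables, so on Windows they can be set in the console that then launches the command-line recreate; something like the following (values as quoted above, destination URL a placeholder):

set CUSTOMSQLITEOPTIONS_DUPLICATI=cache_size=-300000;temp_store=2;journal_mode=WAL;synchronous=Normal
set EXPERIMENTAL_DBRECREATE_DUPLICATI=1
Duplicati.CommandLine.exe repair ssh://example.com/backup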

For db recreation I had another advantage: the backend was in perfect shape :slight_smile: so the recreation took 45 minutes with a classic hard disk and 25 minutes on a RAM disk.