Very nice (and rapid) analysis. I’ll put some comments below, but I think this is very much on the right track.
I’m not familiar with all the code or all the SQL, so some things may need further study to be more certain.
I think the design misleads decompression tools into this. It uses a file with a name that sounds like a .zip file, and the tool follows that cue. Actually the file name is the linked dblock file’s name, and the file content is a JSON string.
I realized I didn’t get into dlist files before (it got long enough without them), so here’s an example to help readers:
$ unzip -l duplicati-20220205T233452Z.dlist.zip
Archive:  duplicati-20220205T233452Z.dlist.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      146  2022-02-05 18:34   manifest
       24  2022-02-05 18:34   fileset
      284  2022-02-05 18:34   filelist.json
---------                     -------
      454                     3 files
$ unzip duplicati-20220205T233452Z.dlist.zip
Archive: duplicati-20220205T233452Z.dlist.zip
inflating: manifest
inflating: fileset
inflating: filelist.json
$ cat filelist.json
[{"type":"File","path":"C:\\backup source\\2KiB.txt","hash":"jtyAkw4swEOnLds7ZB/P6kHd8EncsXE8klW/Q4scfR4=","size":2048,"time":"20220205T233111Z","metahash":"I4wPQE8HSpTHOwUOgD8nBgCsNjTT7HMAXs2nRgO/UAM=","metasize":137,"blocklists":["zpo/GYuoxZFzTTU4tljfkou4c3vaJgW/x8MhOLMgfp8="]}]$
$ cat fileset
{"IsFullBackup":true}$
$ cat manifest
{"Version":2,"Created":"20220205T233452Z","Encoding":"utf8","Blocksize":1024,"BlockHash":"SHA256","FileHash":"SHA256","AppVersion":"2.0.6.100"}$
A manual sanity test I do sometimes (a bit more extensive than counting destination file names) is to see whether the count of dblock files in the Remotevolume table with State Verified matches the dindex count and the IndexBlockLink table row count. A lower index count might mean a lost dindex, which might mean a long recreate search through dblock files to find blocks.
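For anyone wanting to try that, a rough sketch of the check as SQL against the local database; the Remotevolume Type/State values and the IndexBlockLink table are from my reading of the schema, so treat the names as assumptions:

-- Compare Verified dblock count, Verified dindex count, and link rows.
SELECT
  (SELECT COUNT(*) FROM "Remotevolume"
    WHERE "Type" = 'Blocks' AND "State" = 'Verified') AS dblock_count,
  (SELECT COUNT(*) FROM "Remotevolume"
    WHERE "Type" = 'Index' AND "State" = 'Verified') AS dindex_count,
  (SELECT COUNT(*) FROM "IndexBlockLink") AS link_count;
-- If all three match, dblock/dindex pairing looks healthy.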
There are also cases with more dindex files than dblock files; I suspect the extras are sometimes redundant. Ideally, I think, dindex and dblock files are paired one-to-one unless one asks for no dindex files. Anything else is suspect. There might be room for better checking to prevent recreate surprises, and this wouldn’t need big redesigns.
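Going one step further, a sketch (same schema assumptions as above) of listing dblock files that no dindex claims to cover:

-- Verified dblock volumes with no IndexBlockLink row pointing at them,
-- i.e. candidates for a lost or never-written dindex.
SELECT rv."Name"
FROM "Remotevolume" rv
WHERE rv."Type" = 'Blocks'
  AND rv."State" = 'Verified'
  AND NOT EXISTS (
    SELECT 1 FROM "IndexBlockLink" ibl
    WHERE ibl."BlockVolumeID" = rv."ID"
  );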
They can refer to blocks anywhere on the destination, thanks to block-level deduplication; the penalty is that a file might have its blocks scattered among multiple dblock files, which can make a single-file restore slower. Full restore, I believe, doesn’t go source file by source file. It reads destination dblocks and scatters their blocks into the files being restored.
I think that’s correct when referring to where the blocklist resides. Presence in a dindex depends on the index file policy.
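On the database side, my understanding (names assumed from my reading of the schema) is that a blockset big enough to need a blocklist gets rows in a BlocklistHash table, one per blocklist block. A sketch for peeking at that:

-- One row per blocklist a blockset needs, ordered by position.
SELECT bs."ID" AS blockset_id, bs."Length", blh."Index", blh."Hash"
FROM "Blockset" bs
JOIN "BlocklistHash" blh ON blh."BlocksetID" = bs."ID"
ORDER BY bs."ID", blh."Index";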
I think most SQL name references go to the File view, which was once a table but is now backed by a PathPrefix table along with a FileLookup table, so the database avoids storing redundant path prefixes the way a dlist does. This possibly traded reduced space for increased time, but I don’t know whether any timing benchmarks were run.
Feature/fix path storage2 #3468
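As a rough idea of how the split storage still presents full paths, something along these lines; the real view definition is in the PR above, and my column names are assumptions:

-- Sketch of a File view reassembling full paths by joining the stored
-- prefix back onto the file name; the actual definition may differ.
CREATE VIEW "File" AS
SELECT fl."ID" AS "ID",
       pp."Prefix" || fl."Path" AS "Path",
       fl."BlocksetID" AS "BlocksetID",
       fl."MetadataID" AS "MetadataID"
FROM "FileLookup" fl
JOIN "PathPrefix" pp ON pp."ID" = fl."PrefixID";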
A source file is linked to two blocksets: one for its data content, and one for its metadata via the Metadataset table.
I agree on this, then got lost in the words. There was an example of a two-block file earlier. Let’s try a one-byte file.
$ cat filelist.json
[{"type":"File","path":"C:\\backup source\\B.txt","hash":"335w5QIVRPSDS77mSp43if68S+gUcN9inK1t2wMyClw=","size":1,"time":"20220203T000452Z","metahash":"iMKZleU/S7wBTUhf9pXTegajHh1gh+fS/oyH+qYE1Tw=","metasize":137}]
Repeating the two-block output but with word wrap:
$ cat filelist.json
[{"type":"File","path":"C:\\backup source\\2KiB.txt","hash":"jtyAkw4swEOnLds7ZB/P6kHd8EncsXE8klW/Q4scfR4=","size":2048,"time":"20220205T233111Z","metahash":"I4wPQE8HSpTHOwUOgD8nBgCsNjTT7HMAXs2nRgO/UAM=","metasize":137,"blocklists":["zpo/GYuoxZFzTTU4tljfkou4c3vaJgW/x8MhOLMgfp8="]}]
and now there are blocklists involved because the file is no longer a tiny one where the block hash is the file hash.
In contrast, the database always sends file contents through a blockset, even if there’s only one block in the set.
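A sketch of checking that in the local database (table and column names are again my assumptions):

-- Even a file no larger than one block gets a Blockset row plus a single
-- BlocksetEntry row at "Index" 0; 1024 matches the example's --blocksize.
SELECT bs."ID", bs."Length", bs."FullHash", be."Index", b."Hash", b."Size"
FROM "Blockset" bs
JOIN "BlocksetEntry" be ON be."BlocksetID" = bs."ID"
JOIN "Block" b ON b."ID" = be."BlockID"
WHERE bs."Length" <= 1024;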
It might be possible for a metadata blockset to have more than one block. One would have to look in the code, because metadata that large would be unusual, and sometimes the code puts off implementing a case until it’s needed.
A similar but maybe more plausible big-string situation is a source file that needs more than one blocklist. I believe a blocklist is itself stored as a block, so it is limited to --blocksize bytes of 32-byte hashes; at the example’s 1 KiB blocksize, one blocklist covers only 1024 / 32 = 32 blocks, i.e. 32 KiB of file data.
A blockset is a multi-purpose sequence of bytes, and some blocksets aren’t related to a blocklist and blocklisthash. Normalization wasn’t specifically mentioned earlier, but the examples above show that the external and database formats differ.
Terrific job on figuring this out and expanding on my brief linked summary, which I’ll post here for any fixes:
Here’s my (maybe wrong) understanding of things (a query sketch tying the tables together follows the list):
- Fileset shows backup versions, basically sets of files.
- FilesetEntry shows what files a given Fileset ID holds.
- File view shows data and metadata content of a file.
- Data contents are represented directly via Blockset.
- Metadata goes through Metadataset to get to that.
- Blockset is a generic byte sequence made of blocks. A small blockset might have only one. Larger ones need their blocks listed.
- BlocksetEntry shows which blocks are in a Blockset. The Index is the 0-based index of a given block in the assembled sequence.
- Blocklist is an external representation of a Blockset. This permits DB recreation from backup. It identifies blocks by hash values.
- Hash is SHA-256, and is used in many places, either directly as a 32-byte value or as a Base64 encoding of it (which gives the 44-character strings seen in filelist.json).
- Block shows the blocks and says where they reside. Blocks are mostly size --blocksize, but sometimes a short block is required.
- Remotevolume is destination files, including blocks. A Fileset gets a dlist file. Blocks are stored in .zip files called dblock files. dindex files speed up things like DB recreate by saying which blocks can be found in which dblock file, and give other aids.
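To make the relationships concrete, here’s a sketch that walks those tables for one backup version. Names follow my reading of the schema, and picking the newest Fileset via MAX(ID) is a simplification:

-- For the newest Fileset, list each file with the dblock volume that
-- holds each of its data blocks, in block order.
SELECT f."Path", be."Index", b."Hash", rv."Name" AS dblock_file
FROM "FilesetEntry" fe
JOIN "File" f ON f."ID" = fe."FileID"
JOIN "BlocksetEntry" be ON be."BlocksetID" = f."BlocksetID"
JOIN "Block" b ON b."ID" = be."BlockID"
JOIN "Remotevolume" rv ON rv."ID" = b."VolumeID"
WHERE fe."FilesetID" = (SELECT MAX("ID") FROM "Fileset")
ORDER BY f."Path", be."Index";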
Thanks for your interest! By the way, if you have a specific restore problem, you can open a new topic here. The current thread seems to be aimed at speed, and possibly robustness improvements to avoid such issues.