Very nice (and rapid) analysis. I’ll put some comments below, but I think this is very much on the right track.
I’m not familiar with all the code or all the SQL, so some things may need further study to be more certain.
I think the design misleads decompression tools into this. It uses a file with a name that sounds like a .zip file, and the tool follows that cue. Actually the file name is the linked dblock file’s name, and the file content is a JSON string.
I realized I didn’t get into dlist files before (it got long enough without them), so here’s an example to help readers:
$ unzip -l duplicati-20220205T233452Z.dlist.zip
Archive:  duplicati-20220205T233452Z.dlist.zip
  Length      Date    Time    Name
---------  ---------- -----   ----
      146  2022-02-05 18:34   manifest
       24  2022-02-05 18:34   fileset
      284  2022-02-05 18:34   filelist.json
---------                     -------
      454                     3 files
$ unzip duplicati-20220205T233452Z.dlist.zip
Archive: duplicati-20220205T233452Z.dlist.zip
inflating: manifest
inflating: fileset
inflating: filelist.json
$ cat filelist.json
[{"type":"File","path":"C:\\backup source\\2KiB.txt","hash":"jtyAkw4swEOnLds7ZB/P6kHd8EncsXE8klW/Q4scfR4=","size":2048,"time":"20220205T233111Z","metahash":"I4wPQE8HSpTHOwUOgD8nBgCsNjTT7HMAXs2nRgO/UAM=","metasize":137,"blocklists":["zpo/GYuoxZFzTTU4tljfkou4c3vaJgW/x8MhOLMgfp8="]}]$
$ cat fileset
{"IsFullBackup":true}$
$ cat manifest
{"Version":2,"Created":"20220205T233452Z","Encoding":"utf8","Blocksize":1024,"BlockHash":"SHA256","FileHash":"SHA256","AppVersion":"2.0.6.100"}$
A manual sanity test I do sometimes (a bit more extensive than counting destination file names) is to see whether the count of dblock files in the Remotevolume table with State Verified matches the dindex count and the IndexBlockLink table row count. A lower index count might mean a lost dindex, which might mean a long recreate search through dblock files to find blocks.
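For anyone wanting to try that, a rough sketch of the check as SQL against the local database; the Remotevolume Type/State values and the IndexBlockLink table are from my reading of the schema, so treat the names as assumptions:

-- Compare Verified dblock count, Verified dindex count, and link rows.
SELECT
  (SELECT COUNT(*) FROM "Remotevolume"
    WHERE "Type" = 'Blocks' AND "State" = 'Verified') AS dblock_count,
  (SELECT COUNT(*) FROM "Remotevolume"
    WHERE "Type" = 'Index' AND "State" = 'Verified') AS dindex_count,
  (SELECT COUNT(*) FROM "IndexBlockLink") AS link_count;
-- If all three match, dblock/dindex pairing looks healthy.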
There are also cases with more dindex files than dblock files; I suspect the extras are sometimes redundant. Ideally, I think, dindex and dblock files are paired one-to-one unless one asks for no dindex files. Anything else is suspect. There might be room for better checking to prevent recreate surprises, and this wouldn’t need big redesigns.
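Going one step further, a sketch (same schema assumptions as above) of listing dblock files that no dindex claims to cover:

-- Verified dblock volumes with no IndexBlockLink row pointing at them,
-- i.e. candidates for a lost or never-written dindex.
SELECT rv."Name"
FROM "Remotevolume" rv
WHERE rv."Type" = 'Blocks'
  AND rv."State" = 'Verified'
  AND NOT EXISTS (
    SELECT 1 FROM "IndexBlockLink" ibl
    WHERE ibl."BlockVolumeID" = rv."ID"
  );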
They can refer to blocks anywhere on the destination, thanks to block-level deduplication; the penalty is that a file might have its blocks scattered among multiple dblock files, which can make a single-file restore slower. Full restore, I believe, doesn’t go source file by source file. It reads destination dblocks and scatters their blocks into the files being restored.
I think that’s correct when referring to where the blocklist resides. Presence in a dindex depends on the index file policy.
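On the database side, my understanding (names assumed from my reading of the schema) is that a blockset big enough to need a blocklist gets rows in a BlocklistHash table, one per blocklist block. A sketch for peeking at that:

-- One row per blocklist a blockset needs, ordered by position.
SELECT bs."ID" AS blockset_id, bs."Length", blh."Index", blh."Hash"
FROM "Blockset" bs
JOIN "BlocklistHash" blh ON blh."BlocksetID" = bs."ID"
ORDER BY bs."ID", blh."Index";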
I think most SQL name references go to the File view, which was once a table but is now backed by a PathPrefix table along with a FileLookup table, so the database avoids storing redundant path prefixes the way a dlist does. This possibly traded reduced space for increased time, but I don’t know whether any timing benchmarks were run.
Feature/fix path storage2 #3468
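As a rough idea of how the split storage still presents full paths, something along these lines; the real view definition is in the PR above, and my column names are assumptions:

-- Sketch of a File view reassembling full paths by joining the stored
-- prefix back onto the file name; the actual definition may differ.
CREATE VIEW "File" AS
SELECT fl."ID" AS "ID",
       pp."Prefix" || fl."Path" AS "Path",
       fl."BlocksetID" AS "BlocksetID",
       fl."MetadataID" AS "MetadataID"
FROM "FileLookup" fl
JOIN "PathPrefix" pp ON pp."ID" = fl."PrefixID";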
A source file is linked to two blocksets: one for its data content, and one for its metadata via the Metadataset table.
I agree on this, then got lost in the words. There was an example of a two-block file earlier. Let’s try a one-byte file.
$ cat filelist.json
[{"type":"File","path":"C:\\backup source\\B.txt","hash":"335w5QIVRPSDS77mSp43if68S+gUcN9inK1t2wMyClw=","size":1,"time":"20220203T000452Z","metahash":"iMKZleU/S7wBTUhf9pXTegajHh1gh+fS/oyH+qYE1Tw=","metasize":137}]
Repeating the two-block output but with word wrap:
$ cat filelist.json
[{"type":"File","path":"C:\\backup source\\2KiB.txt","hash":"jtyAkw4swEOnLds7ZB/P6kHd8EncsXE8klW/Q4scfR4=","size":2048,"time":"20220205T233111Z","metahash":"I4wPQE8HSpTHOwUOgD8nBgCsNjTT7HMAXs2nRgO/UAM=","metasize":137,"blocklists":["zpo/GYuoxZFzTTU4tljfkou4c3vaJgW/x8MhOLMgfp8="]}]
and now there are blocklists involved because the file is no longer a tiny one where the block hash is the file hash.
In contrast, the database always sends file contents through a blockset, even if there’s only one block in the set.
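A sketch of checking that in the local database (table and column names are again my assumptions):

-- Even a file no larger than one block gets a Blockset row plus a single
-- BlocksetEntry row at "Index" 0; 1024 matches the example's --blocksize.
SELECT bs."ID", bs."Length", bs."FullHash", be."Index", b."Hash", b."Size"
FROM "Blockset" bs
JOIN "BlocksetEntry" be ON be."BlocksetID" = bs."ID"
JOIN "Block" b ON b."ID" = be."BlockID"
WHERE bs."Length" <= 1024;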
It might be possible for a metadata blockset to have more than one block. One would have to look in the code, because metadata that large would be unusual, and sometimes the code puts off implementing a case until it’s needed.
A similar but maybe more plausible big-string situation is a source file that needs more than one blocklist. I believe a blocklist is itself stored as a block, so it is limited to --blocksize bytes of 32-byte hashes; at the example’s 1 KiB blocksize, one blocklist covers only 1024 / 32 = 32 blocks, i.e. 32 KiB of file data.
A blockset is a multi-purpose sequence of bytes, and some blocksets aren’t related to a blocklist and blocklisthash. Normalization wasn’t specifically mentioned earlier, but the examples above show that the external and database formats differ.
Terrific job on figuring this out and expanding on my brief linked summary, which I’ll post here for any fixes:
Here’s my (maybe wrong) understanding of things (a query sketch tying the tables together follows the list):
- Fileset shows backup versions, basically sets of files.
- FilesetEntry shows what files a given Fileset ID holds.
- File view shows data and metadata content of a file.
- Data contents are represented directly via Blockset.
- Metadata goes through Metadataset to get to that.
- Blockset is a generic byte sequence made of blocks. A small blockset might have only one. Larger ones need their blocks listed.
- BlocksetEntry shows which blocks are in a Blockset. The Index is the 0-based index of a given block in the assembled sequence.
- Blocklist is an external representation of a Blockset. This permits DB recreation from backup. It identifies blocks by hash values.
- Hash is SHA-256, and is used in many places, either directly as a 32-byte value or as a Base64 encoding of it (which gives the 44-character strings seen in filelist.json).
- Block shows the blocks and says where they reside. Blocks are mostly size --blocksize, but sometimes a short block is required.
- Remotevolume is destination files, including blocks. A Fileset gets a dlist file. Blocks are stored in .zip files called dblock files. dindex files speed up things like DB recreate by saying which blocks can be found in which dblock file, and give other aids.
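To make the relationships concrete, here’s a sketch that walks those tables for one backup version. Names follow my reading of the schema, and picking the newest Fileset via MAX(ID) is a simplification:

-- For the newest Fileset, list each file with the dblock volume that
-- holds each of its data blocks, in block order.
SELECT f."Path", be."Index", b."Hash", rv."Name" AS dblock_file
FROM "FilesetEntry" fe
JOIN "File" f ON f."ID" = fe."FileID"
JOIN "BlocksetEntry" be ON be."BlocksetID" = f."BlocksetID"
JOIN "Block" b ON b."ID" = be."BlockID"
JOIN "Remotevolume" rv ON rv."ID" = b."VolumeID"
WHERE fe."FilesetID" = (SELECT MAX("ID") FROM "Fileset")
ORDER BY f."Path", be."Index";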
Thanks for your interest! By the way, if you have a specific restore problem, you can open a new topic here. The current thread seems to be aimed at speed, and possibly robustness improvements to avoid such issues.