Hey,
Coming from: https://github.com/duplicati/duplicati/issues/4041
Background: I found out the hard way that (local-side) disaster recovery with Duplicati 2 is currently broken: the software can't realistically rebuild the local database when there are too many blocks, because of serious performance problems. After a short exchange on Issue #4041, I decided to take a shot at fixing it.
I’ve started my research on the local database and remote file format. From anyone who has this knowledge and can spare the time, I would like to ask some questions to deepen my understanding of the thing at hand. Keep in mind that these are asked from the point of view of database optimization.
Question #1: remote filenames
dindex files are associated with their dblock files through one of the contained zip entries - namely the entry in vol/. I’m wondering whether there’s a reason why this association is not made through a natural key - namely the filename. I.e., why not have i-xxx.dindex be the index of b-xxx.dblock? For one, this would enable instantaneous sanity checks of remote folders, and might finally lead to useful bug reports for issues such as this.
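To make the idea concrete, here is a rough sketch of the kind of sanity check a shared ID would allow. It assumes the hypothetical naming scheme proposed above (the dindex and dblock files of a pair carry the same hex ID after the b/i prefix), which is not how Duplicati names files today; the function name check_pairing is made up for illustration.

```python
import re

# Matches names such as duplicati-b<hex>.dblock.zip and duplicati-i<hex>.dindex.zip.
# ASSUMPTION: under the proposed scheme, a dindex and its dblock share the same <hex> ID.
VOLUME_NAME = re.compile(r"^duplicati-[bi](?P<id>[0-9a-f]+)\.(?P<ext>dblock|dindex)\.")


def check_pairing(filenames):
    """Return (dblock IDs with no dindex, dindex IDs with no dblock), both sorted."""
    blocks, indexes = set(), set()
    for name in filenames:
        match = VOLUME_NAME.match(name)
        if match is None:
            continue  # dlist files and unrelated files are ignored
        target = blocks if match["ext"] == "dblock" else indexes
        target.add(match["id"])
    return sorted(blocks - indexes), sorted(indexes - blocks)
```

Feeding it a plain directory listing of the remote folder (e.g. os.listdir) would be enough to spot orphans without opening a single archive.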
Question #2: missing index entries
I generated a tree containing 1000 files with random sizes (512 bytes to a few MB) spread across 20 directories, and ran a fresh backup with a local filesystem as destination, no encryption, default compression. I then stopped Duplicati and examined the database and the local filesystem. I picked one of the blocks at random and:
/shared/DuplicatiTestSet/remote
➜ unzip -l duplicati-bcc1d395590a442809da7491a24d0ba3c.dblock.zip | grep files
52252544 788 files
/shared/DuplicatiTestSet/remote
➜ unzip -l duplicati-i5a6d8922b3054abd9d808332af816e61.dindex.zip | grep files
71584 96 files
/shared/DuplicatiTestSet/remote
➜ unzip -l duplicati-i5a6d8922b3054abd9d808332af816e61.dindex.zip | grep dblock
54382 2022-02-05 16:09 vol/duplicati-bcc1d395590a442809da7491a24d0ba3c.dblock.zip
As you can see, the index file contains a minuscule number of entries compared to the associated block file (96 versus 788). And indeed, randomly sampling the index file shows that all the files in there are also in the block file, but (obviously) not vice versa.
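For anyone who wants to repeat the comparison without eyeballing unzip output, here is a minimal sketch using only Python's standard zipfile module. It compares zip entry names only and does not look inside any entry; matching on the basename (so that a path prefix inside the index archive is ignored) is my assumption about how the two archives line up, and the function names are made up.

```python
import zipfile
from pathlib import PurePosixPath


def entry_names(archive_path):
    """Return the set of entry basenames in a zip archive, skipping directory entries."""
    with zipfile.ZipFile(archive_path) as zf:
        return {
            PurePosixPath(info.filename).name
            for info in zf.infolist()
            if not info.is_dir()
        }


def compare_index_to_block(dindex_path, dblock_path):
    """Summarise how the entry names of a dindex archive relate to those of a dblock archive."""
    index_names = entry_names(dindex_path)
    block_names = entry_names(dblock_path)
    return {
        "index_entries": len(index_names),
        "block_entries": len(block_names),
        # Entries present in the block file but with no same-named entry in the index file.
        "only_in_block": len(block_names - index_names),
        # Entries that exist only in the index file (for example the vol/ entry).
        "only_in_index": sorted(index_names - block_names),
    }
```

Calling compare_index_to_block with the two paths from the listing above should give the same picture as the unzip output, plus the exact list of index-only entries.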
Unless I’m missing something, this is a gigantic issue to fix along with the database performance stuff.
Question #3: volume hash
In the Remotevolume table, what is the Hash column? Is it just the hash of the volume file? Has it got any use?
Question #4: denormalize volume types
Is there a specific reason why the database was designed to have the three different types of volume (dlist, dindex, dblock) represented in a single table? If not, they should be separated: they have different semantics and indeed are always queried separately. Relationships also clearly express this difference in semantics. Also, presuming that most installations will see far fewer dlist files than dindex and dblock files (i.e. the best-case scenario), indices are loaded at ~2 times the size they could be, for no good reason!
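To illustrate what I mean by separating them, here is a toy sketch using Python's built-in sqlite3 module. The table and column names (DlistVolume, DblockVolume, DindexVolume, DblockVolumeID) are hypothetical and are not Duplicati's actual schema; the point is only that the dindex-to-dblock relationship becomes an explicit foreign key instead of being implied by a type column.

```python
import sqlite3


def build_split_schema():
    """Create an in-memory database with one table per volume type (hypothetical schema)."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE DlistVolume  (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
        CREATE TABLE DblockVolume (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL UNIQUE);
        CREATE TABLE DindexVolume (
            ID INTEGER PRIMARY KEY,
            Name TEXT NOT NULL UNIQUE,
            -- Each index volume points at exactly one block volume.
            DblockVolumeID INTEGER NOT NULL REFERENCES DblockVolume(ID)
        );
        CREATE INDEX DindexByDblock ON DindexVolume (DblockVolumeID);
        """
    )
    return conn


def dblocks_without_index(conn):
    """Return the names of block volumes that no index volume refers to."""
    rows = conn.execute(
        """
        SELECT b.Name
        FROM DblockVolume AS b
        LEFT JOIN DindexVolume AS i ON i.DblockVolumeID = b.ID
        WHERE i.ID IS NULL
        ORDER BY b.Name
        """
    )
    return [name for (name,) in rows]
```

With this layout each table's indices only cover rows of one volume type, and a query like dblocks_without_index never has to filter on a type column.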
I might have more later as I continue to investigate.