Database rebuild



Background: I found out the hard way that (local-side) disaster recovery with Duplicati 2 is currently broken: unfortunately, the software does not seem to be realistically able to rebuild the local database when there are too many blocks, due to serious performance problems. After a short message exchange on Issue #4041, I decided to take a shot at fixing it.

I've started my research on the local database and remote file format. I'd like to ask a few questions of anyone who has this knowledge and can spare the time, to deepen my understanding of the matter at hand. Keep in mind that these are asked from the point of view of database optimization.

Question #1: remote filenames
dindex files are associated with their dblock files through one of the contained zip entries - namely the entry in vol/. I'm wondering whether there's a reason why this association is not made through a natural key - namely the filename. I.e., why not have i-xxx.dindex be the index of b-xxx.dblock? For one, this would enable instantaneous sanity checks of remote folders, and might finally lead to useful bug reports for issues such as this.

Question #2: missing index entries
I generated a tree containing 1000 files with random sizes (512 bytes to a few MBytes) divided into 20 directories, and ran a fresh backup with a local filesystem as destination, no encryption, and default compression. I then stopped Duplicati and examined the database and the local filesystem. I picked one of the blocks at random and:

➜ unzip -l | grep files
 52252544                     788 files

➜ unzip -l | grep files
    71584                     96 files

➜ unzip -l | grep dblock
    54382  2022-02-05 16:09   vol/

As you can see, the index file contains a minuscule number of entries compared to the associated block file. And indeed, randomly sampling the index file shows that all the files in there are in the block file, but (obviously) not vice versa.

Unless I’m missing something, this is a gigantic issue to fix along with the database performance stuff.

Question #3: volume hash
In the Remotevolume table, what is the Hash column? Is it just the hash of the volume file? Does it have any use?

Question #4: denormalize volume types
Is there a specific reason why the database was designed to have the three different types of volume (dlist, dindex, dblock) represented in a single table? If not, they should be separated - they have different semantics and indeed are always queried separately. Relationships also clearly express this difference in semantics. Also, presuming that most applications will see far fewer dlist files than dindex and dblock files (i.e. the best-case scenario), indices are roughly twice as large as they could be for no good reason!

I might have more later as I continue to investigate.


Additional information: I used the Recreate option to try to recreate the database. Duplicati did something but did not write to the database. I subsequently deleted the database and tried recreating it from scratch from the remote, which did not work (empty database).

Welcome to the forum @vmsh0

The developer ranks are extremely thin, and most seem to be on GitHub, but since GitHub sent you here, perhaps we can take some guesses about “why” questions. Potentially someone who knows will stop by.
I’m not a Duplicati developer, but can sometimes figure out C# code, SQL queries, and database innards.

Upload errors leave a bit of a mystery (there was also a destination bug), so a retry uses a different file name.
I think the dlist filename is incremented by 1 second, while dblock and dindex get new GUID-based names.
One might say this means a dblock retry needs a new name inside its dindex. I think that was once a bug.

There might have been a future ambition to make file relation more flexible, but I don’t know the motive:

Final guess is maybe class hierarchy led to the volume names being easily GUID-named independently.

index-file-policy of none seems (experimentally) to do what it says; however, I'm not sure how useful it is.
I’m guessing that lookup adds the vol file, and back here I see I thought full adds blocklist information.
Developer documentation gets into this more than How the backup process works in user manual could.

I do like the sanity-check idea, because I’m guessing lack of a dindex could lead to searching all dblocks.

I think the idea of index-file-policy=full is that it has blocklist information that’s redundant with what could be obtained by downloading the dblock file, but it saves DB recreate from having to do a download.

Expanding blocklists in How the restore process works gets into this.
Processing a large file in How the backup process works also does, and mentions some paths-not-taken.

Those are probably the folders you have in your dindex (depending on index-file-policy) and what they do.
The big file count mismatch compared to files inside the block might be comparing very different things.

I think so (after Base64 encoding). You might find a modified Base64 in the .zip, where + is - and / is _.
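The remapping can be checked in a few lines. This is just an illustration of the standard vs URL-safe Base64 alphabets (the digest here is example data, not a hash from the thread):

```python
import base64
import hashlib

# SHA-256 digest of some example block data.
digest = hashlib.sha256(b"example block data").digest()

standard = base64.b64encode(digest).decode()         # may contain '+' and '/'
urlsafe = base64.urlsafe_b64encode(digest).decode()  # same value with '-' and '_'

# The two encodings differ only by the character remapping.
assert urlsafe == standard.replace("+", "-").replace("/", "_")
print(len(standard))  # 44 characters for a 32-byte hash
```

So a 32-byte SHA-256 hash always Base64-encodes to 44 ASCII characters, which matches the filenames seen in the listings later in this thread.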

Integrity test of the file after downloading it. Encrypted files have their own integrity test. A bit redundant?

I can’t comment much without profiling information, but there are hugely bigger tables than this due to the small default 100 KB blocksize and the multi TB files some people back up – and find things get too slow.
Duplicati profiles individual SQL statements, but some might add up, e.g. anything done for every block…

I’m not a DB expert, but I think both computers and DBs sometimes work faster if operations are bundled.

Can’t think of one, after you’ve looked into the actual usage, and I’ve looked at the Local database format.

I suppose I could ask @mnaiman who did the diagram and added highly helpful Log sql variables #3314.
The best person to consult on code and design history is original author @kenkendk – or fish in GitHub.

Thank you @ts678, I appreciate your help. Your response does clarify some of the points and gives some useful pointers. It is quite insightful, particularly because it dispels some assumptions I had made.

TL;DR: from what I’ve seen, indices are broken, and ideally they should be fixed before even taking the first look at the queries, as such complex queries should not be necessary in the first place. The biggest way in which they are broken is that they can be absent by design.

So let’s address the elephant in the room: the biggest thing is still #2. I left index-file-policy alone (and the default is “full”), but as I mentioned Duplicati did not include full indices in the dindex files. It only included 96 block hashes instead of the full 788.

This is actually a pretty big issue as you don’t get to do the following:

  • Tell if a block file has an index file just from the directory listing of the remote
  • Tell if the index file for a block will be complete or partial

It’s a bit like having a file system which doesn’t have a data structure suitable for file listing… not good :slight_smile:

So after today, the conclusion I reached is that while Duplicati seems to be relatively reliable at pushing data to the remote, it fails at building a data structure suitable for recovering it efficiently. Indices are just plain broken. I get the impression that the complex queries during rebuilding exist only to work around this broken index system, because the operation is otherwise conceptually very simple: make a local copy of the index. You can put it this way: while the local database is not the pinnacle of relational design, it is not horrible either.

I came here with the plan of fixing the queries, and now I've concluded that there's not much point in doing that as of now. So I want to put forward a couple of approaches to fix what I think is really broken:

  • Make the indices useful. Have dindex files with the same name as the dblocks, always upload them together as data+metadata, and make not having both of them a “remote data loss” special case. This would be very easy to implement: building dindex files is very cheap while the dblocks still reside on the local side. Rebuilding the local database from this would be trivial: download all the dindex files and make a row in the database for each entry. With the current dindex format, this would give about ~1.5% space overhead with 50 MB dblocks and 100 KB blocks. If this overhead really is a concern, a better serialization format for indices can be devised (the current one uses zip file directory listings), and shorter hashes with post-facto collision detection can be used. While I'm not very familiar with the zip format, I'm quite sure that the current system has a ton of pointless overhead, and the 1.5% figure can definitely be improved a lot. Hashes are 44 ASCII characters long, that's ~0.05% of 100 KB, so there's no good reason why the overhead can't be reduced by at least an order of magnitude.
  • Make the indices easy to use. Keep indices separate and redundant as they are now, but at least divide them into per-hash buckets instead of per-block, so that files can be recovered without having to build the entire local database. I see this approach as significantly less solid, because the other approach has the great advantage of never having to update any files on the remote, ever. This is an advantage that, I believe, already exists with the current system.
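The per-block overhead figure quoted above is easy to sanity-check with back-of-envelope arithmetic (this assumes the 44-character Base64 hash and the default 100 KB block size mentioned elsewhere in the thread):

```python
# Rough check of the ~0.05% per-block index overhead figure.
hash_chars = 44          # Base64 encoding of a 32-byte SHA-256 hash
blocksize = 100 * 1024   # default --blocksize of 100 KB (taken here as KiB)

per_block_overhead = hash_chars / blocksize
print(f"{per_block_overhead:.4%}")  # roughly 0.04%
```

The raw hash is indeed a tiny fraction of a block, so any index format costing ~1.5% is dominated by container overhead rather than by the hashes themselves.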

Compare both to right now, where Duplicati uses the worst of both worlds: indices are not a mandatory piece of metadata for blocks, and it is not easy to discover the right index file for a file.

I'm launching one more database rebuild of my laptop's backup before going to sleep, and tomorrow I'll see if I can do anything about optimizing the queries, as a sort of stopgap (i.e. in an attempt to salvage existing backups, should they ever need to be restored). But I'm going to be honest: from what I've seen, I think Duplicati indices are broken by design and need to be fixed to make Duplicati useful and competitive. Had I known of this issue, I would not have made my own backups with it.

I'm thinking there's a solid possibility that Duplicati maintainers have had this in mind for a long time, and simply don't have the manpower to implement the required fixes. And I'm sorry, but while I would have been happy to send in some fixes (which I already did for something else), I'm really not sure I can help with fixing something so major, simply because I can't get that invested in a project that doesn't even use my technology stack of choice.

I can still help, if you have any use for it, to design the data structures, including the index files and the database. But I'm figuring it's not really what you need :slight_smile:

Please refer to my previous explanation and materials, and let me also give an example.

The numbers aren’t supposed to match. The storage in the .zip is completely different.

The dblock has a file per block, plus manifest. The dindex indexes all blocks in one file.
Additionally it may have blocklist storage in the list folder. Volume index is in vol folder.

Here’s a test using blocksize=1024, with 2048 byte source with same first and last 1024:
The expectation is two blocks, but deduplication boils it down to 1. There’s metadata too.
This is timestamps and such data. On Windows these are commonly around 137 bytes.
A block is known by its 32 byte SHA-256 hash, so the blocklist for two blocks is 64 bytes.
Files below are the manifest, a source file metadata block, a data block, and a blocklist:

$ unzip -l
  Length      Date    Time    Name
---------  ---------- -----   ----
      146  2022-02-05 18:34   manifest
      137  2022-02-05 18:34   I4wPQE8HSpTHOwUOgD8nBgCsNjTT7HMAXs2nRgO_UAM=
     1024  2022-02-05 18:34   Xmka5msPk2C-P8wucfflOnLEKZJ2Q3OsmMZ1tnQtFRc=
       64  2022-02-05 18:34   zpo_GYuoxZFzTTU4tljfkou4c3vaJgW_x8MhOLMgfp8=
---------                     -------
     1371                     4 files

The dindex has two folders. vol indexes the blocks in the volume. list is for redundant blocklist copies.

$ unzip -l
  Length      Date    Time    Name
---------  ---------- -----   ----
      146  2022-02-05 18:34   manifest
      290  2022-02-05 18:34   vol/
       64  2022-02-05 18:34   list/zpo_GYuoxZFzTTU4tljfkou4c3vaJgW_x8MhOLMgfp8=
---------                     -------
      500                     3 files
$ unzip
  inflating: manifest                
  inflating: vol/  
  inflating: list/zpo_GYuoxZFzTTU4tljfkou4c3vaJgW_x8MhOLMgfp8=  
$ cat vol/
$ hexdump -C list/zpo_GYuoxZFzTTU4tljfkou4c3vaJgW_x8MhOLMgfp8=
00000000  5e 69 1a e6 6b 0f 93 60  be 3f cc 2e 71 f7 e5 3a  |^i..k..`.?..q..:|
00000010  72 c4 29 92 76 43 73 ac  98 c6 75 b6 74 2d 15 17  |r.).vCs...u.t-..|
00000020  5e 69 1a e6 6b 0f 93 60  be 3f cc 2e 71 f7 e5 3a  |^i..k..`.?..q..:|
00000030  72 c4 29 92 76 43 73 ac  98 c6 75 b6 74 2d 15 17  |r.).vCs...u.t-..|

Notice that the 64 byte blocklist has identical first and second 32. That matches the source file’s design.
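The deduplication and blocklist layout just described can be reproduced in a few lines. This is a sketch, assuming SHA-256 hashing and blocksize=1024 as in the test above (not Duplicati's actual code):

```python
import hashlib

blocksize = 1024
data = b"A" * 1024 + b"A" * 1024  # 2048-byte source, identical halves

# Split into fixed-size blocks and hash each one.
blocks = [data[i:i + blocksize] for i in range(0, len(data), blocksize)]
hashes = [hashlib.sha256(b).digest() for b in blocks]

# Deduplication: two blocks, but only one distinct block to store.
assert hashes[0] == hashes[1]
print(len(set(hashes)))  # 1 distinct block

# The blocklist is the concatenation of the raw 32-byte hashes: 64 bytes,
# with identical first and second halves, matching the hexdump above.
blocklist = b"".join(hashes)
assert len(blocklist) == 64
assert blocklist[:32] == blocklist[32:]
```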
I'll reformat the vol/ file describing that volume:


There you see the metadata block, the data block, and the blocklist for how to reassemble the source file.
The three block hashes refer to the three hash-named files in the dblock, except for slight alphabet remaps.
For example, the block hash that starts with zpo/ in the dindex text becomes zpo_ when used as a file name in the dblock.

Below is the Create bug report sanitized-and-downloaded database for the example test I just described. (7.6 KB)

Some notes on how I think some of the database tables work and relate are found in the following issue

Backup gets stuck executing a query #4645

Below is a .zip file of the destination folder (called blocklist) for this test. Please have a look at these. (3.0 KB)

I think it’s based on misunderstanding, so I’ll stop here until there’s some basic agreement on the design.
There are some ideas later on that can be discussed after this is cleared up. Thank you for your interest!

Ok - I get it now. Thank you for the additional clarifications. What I had gathered from the material was that the information in index files was based on zip directory listings. What I was missing is that the files have actual content, because at one point I had tried to open the entry in vol/ in a dindex file, but unzip complained that the entry was corrupted. Probably a random event, which cost me a couple hours of research.

So, mainly for my future reference, let me summarize all that information:

  • There are volumes, which are compressed and optionally encrypted archives containing either backup lists (dlist), any kind of block (dblock), or indices and blocklist blocks (dindex)
    • These are simply files in the remote
    • dlists have a list of all the source files contained inside a backup, with their hash, metadata hash, and if necessary the hashes of the blocklist(s) that must be used to put them back together
      • They are represented as Filesets, where each file is associated with the set by a FilesetEntry
      • Filesets are also linked back to their source dlist file
    • dblocks have blocks named after their hash
    • dindexes have blocklist blocks named after their hash in list/ and an index of their dblock named after the dblock in vol/
    • They are represented as Remotevolumes in the database, where dblocks and dindexes can also be linked together through IndexBlockLink
  • There are blocks, which are binary blobs containing either source files, metadata, or blocklists (i.e. lists of blocks which compose a source file)
    • Blocks exist in the remote in volumes, either in a dblock (all types of blocks), or in the list/ directory of a dindex (only blocklists)
    • They are represented as Blocks in the local database, belonging to a Remotevolume. In the case of blocklists, which belong to more than one volume, they are (as far as I can see) always represented as part of a dblock, even when they are also part of a dindex
  • Source files are… well, we know what they are :slight_smile:
    • They are listed for each backup in the backup’s dlist file
    • In the same file, they are linked to their metadata
    • In the same file, they are either linked to a single block through its hash (which is also the file's hash), or to a list of blocklists, through the hashes of the blocklists
    • A file is represented as a FileLookup, which belongs to a Fileset and has a Blockset
  • Metadata exists only as part of a source file, in the form described earlier
    • It is represented as Metadatasets
  • Blocklists are binary files as described earlier
    • They are represented by Blocksets, where each block is linked to it by a BlocksetEntry (which obviously points to a Block)
    • Note that this system departs from the remote representation: when there is more than one blocklist for a file, that's represented as a single Blockset in the database, which thus has the hash of the file; and when there is no blocklist for a file, a Blockset containing a single block is represented, with the same hash as the block and thus the file
    • The above is also true for metadata, represented as Metadataset (which however is not a set in the same sense as a Blockset or a Fileset), which points to a Blockset with a single Block, having the same hash as the Blockset itself
    • The blocklists themselves are not represented as Blocksets with a single entry (i.e. there are no Blocksets with the same hash as the hash of a list), but as BlocklistHashes, each associated with a Blockset
      • This means that not every Blockset has a BlocklistHash, because as I mentioned earlier some Blocksets are made up for the sake of normalization and don’t have a corresponding representation in the remote - or a corresponding blocklist
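The relationships in the summary above can be sketched roughly as Python dataclasses. This is a hedged sketch: the class names come from the summary, but the field layout is illustrative and not the exact SQLite schema (which uses integer IDs and foreign keys):

```python
from dataclasses import dataclass

@dataclass
class Remotevolume:      # one remote file: a dlist, dindex, or dblock
    name: str
    type: str            # "Files", "Index", or "Blocks"

@dataclass
class Block:             # one binary blob, stored inside a dblock volume
    hash: str
    size: int
    volume: Remotevolume

@dataclass
class Blockset:          # an ordered byte sequence (file data, metadata, ...)
    full_hash: str
    length: int

@dataclass
class BlocksetEntry:     # which block sits at which index of a blockset
    blockset: Blockset
    index: int
    block: Block

@dataclass
class BlocklistHash:     # ties a blockset to its remote blocklist block(s)
    blockset: Blockset
    index: int
    hash: str

@dataclass
class Metadataset:       # metadata is a blockset reached via this table
    blockset: Blockset

@dataclass
class FileLookup:        # one source path, with data and metadata blocksets
    path: str
    blockset: Blockset
    metadataset: Metadataset

@dataclass
class Fileset:           # one backup version (one dlist file)
    timestamp: int

@dataclass
class FilesetEntry:      # membership of a file in a backup version
    fileset: Fileset
    file: FileLookup
```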

Very nice (and rapid) analysis. I’ll put some comments below, but I think this is very much on the right track.
I’m not familiar with all the code or all the SQL, so some things may have to be studied to get more certain.

I think the design misleads decompression tools into this. It uses a file with a name sounding like a .zip file, and the tool follows it. Actually, the file name is the linked dblock file name, and the file content is a JSON string.

I realized I didn’t get into dlist files before (it got long enough without them), so an example to help readers:

$ unzip -l
  Length      Date    Time    Name
---------  ---------- -----   ----
      146  2022-02-05 18:34   manifest
       24  2022-02-05 18:34   fileset
      284  2022-02-05 18:34   filelist.json
---------                     -------
      454                     3 files
$ unzip
  inflating: manifest                
  inflating: fileset                 
  inflating: filelist.json           
$ cat filelist.json
[{"type":"File","path":"C:\\backup source\\2KiB.txt","hash":"jtyAkw4swEOnLds7ZB/P6kHd8EncsXE8klW/Q4scfR4=","size":2048,"time":"20220205T233111Z","metahash":"I4wPQE8HSpTHOwUOgD8nBgCsNjTT7HMAXs2nRgO/UAM=","metasize":137,"blocklists":["zpo/GYuoxZFzTTU4tljfkou4c3vaJgW/x8MhOLMgfp8="]}]$ 
$ cat fileset
$ cat manifest
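The filelist.json entry shown above can be consumed with a plain JSON parser. A small sketch (the JSON is copied from the dlist example; nothing beyond the standard library is assumed):

```python
import json

# The filelist.json content from the dlist example above.
entry_json = ('[{"type":"File","path":"C:\\\\backup source\\\\2KiB.txt",'
              '"hash":"jtyAkw4swEOnLds7ZB/P6kHd8EncsXE8klW/Q4scfR4=",'
              '"size":2048,"time":"20220205T233111Z",'
              '"metahash":"I4wPQE8HSpTHOwUOgD8nBgCsNjTT7HMAXs2nRgO/UAM=",'
              '"metasize":137,'
              '"blocklists":["zpo/GYuoxZFzTTU4tljfkou4c3vaJgW/x8MhOLMgfp8="]}]')

files = json.loads(entry_json)
for f in files:
    # Small files carry only "hash" (block hash == file hash); larger files
    # additionally carry "blocklists" describing how to reassemble them.
    needs_blocklists = "blocklists" in f
    print(f["path"], f["size"], needs_blocklists)
```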

A manual sanity test I do sometimes (a bit more extensive than counting destination file names) is to see if the Remotevolume table's State Verified dblock count matches the dindex count and the IndexBlockLink table row count. A lower index count might mean a lost dindex, which might mean a long recreate search to find a block's dblock.
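That sanity check can be scripted. The sketch below builds a toy in-memory schema so it is runnable end to end; against a real Duplicati database, only the three SELECTs at the bottom would be needed, and the column names (Type, State) are assumptions based on the Local database format page, not verified against the code:

```python
import sqlite3

# Toy schema mimicking the two tables named above (assumed layout).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE Remotevolume (ID INTEGER PRIMARY KEY, Name TEXT, Type TEXT, State TEXT);
CREATE TABLE IndexBlockLink (IndexVolumeID INTEGER, BlockVolumeID INTEGER);
INSERT INTO Remotevolume VALUES (1,'b-1.dblock','Blocks','Verified'),
                                (2,'i-1.dindex','Index','Verified'),
                                (3,'b-2.dblock','Blocks','Verified');
INSERT INTO IndexBlockLink VALUES (2,1);
""")

dblocks = con.execute(
    "SELECT COUNT(*) FROM Remotevolume WHERE Type='Blocks' AND State='Verified'"
).fetchone()[0]
dindexes = con.execute(
    "SELECT COUNT(*) FROM Remotevolume WHERE Type='Index' AND State='Verified'"
).fetchone()[0]
links = con.execute("SELECT COUNT(*) FROM IndexBlockLink").fetchone()[0]

# A mismatch hints at a lost dindex, forcing a slow search during recreate.
print(dblocks, dindexes, links)  # here: 2 1 1 -> one dblock lacks a dindex
```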

There are also cases having more dindex files than dblock files. I suspect sometimes extras are redundant. Ideally, I think the dindex and dblock files are paired unless one asks for no dindex. Anything else is suspect. There might be room for better checking to prevent recreate surprises, and this doesn’t need big redesigns.

They can refer to blocks anywhere on the destination, thanks to block level deduplication, and the penalty is that a file might have its blocks scattered among multiple dblock files, which can make a file restore slower. Full restore, I believe, doesn’t go source file by source file. It reads destination dblocks and scatters blocks.

I think that’s correct when referring to where the blocklist resides. Presence in a dindex depends on policy.

I think most SQL name references go to the File view which was once a table but now has a PathPrefix table along with a FileLookup table to avoid making the database store redundant prefixes like a dlist does. This possibly traded reduced space for increased time, but I don’t know if any timing benchmarks were run.

Feature/fix path storage2 #3468

Source file is linked to two blocksets, one for its data content and one for its metadata via Metadataset table.

I agree on this, then got lost in words. There was an example of a two-block file earlier. Let’s try a one-byte.

$ cat filelist.json
[{"type":"File","path":"C:\\backup source\\B.txt","hash":"335w5QIVRPSDS77mSp43if68S+gUcN9inK1t2wMyClw=","size":1,"time":"20220203T000452Z","metahash":"iMKZleU/S7wBTUhf9pXTegajHh1gh+fS/oyH+qYE1Tw=","metasize":137}]

Repeating the two-block output but with word wrap:

$ cat filelist.json
[{"type":"File","path":"C:\\backup source\\2KiB.txt","hash":"jtyAkw4swEOnLds7ZB/P6kHd8EncsXE8klW/Q4scfR4=","size":2048,"time":"20220205T233111Z","metahash":"I4wPQE8HSpTHOwUOgD8nBgCsNjTT7HMAXs2nRgO/UAM=","metasize":137,"blocklists":["zpo/GYuoxZFzTTU4tljfkou4c3vaJgW/x8MhOLMgfp8="]}]

and now there are blocklists involved because the file is no longer a tiny one where block hash is file hash.
In contrast, the database always sends file contents through a blockset, even if there's only one block in the set.

It might be possible for a metadata blockset to have more than one block. One would have to look in code because this would be unusually large, and sometimes the code puts off implementation until it’s needed.

A similar but maybe more plausible big-string situation is a source file that needs more than one blocklist.

A blockset is a multi-purpose sequence of bytes, and some aren’t related to a blocklist and blocklisthash. Normalization isn’t specifically mentioned earlier, but it’s proven that external and database formats differ.

Terrific job on figuring this out and expanding on my brief linked summary which I’ll post here for any fixes:

Here’s my (maybe wrong) understanding of things:

  • Fileset shows backup versions, basically sets of files.
  • FilesetEntry show what files a given Fileset ID holds.
  • File view shows data and metadata content of a file.
  • Data contents are represented directly via Blockset.
  • Metadata goes through Metadataset to get to that.
  • Blockset is a generic byte sequence made of blocks. A small blockset might have only one. Larger ones need their blocks listed.
  • BlocksetEntry shows which blocks are in a Blockset. The Index is the 0-based index of a given block in the assembled sequence.
  • Blocklist is an external representation of a Blockset. This permits DB recreation from backup. It identifies blocks by hash values.
  • Hash is SHA-256, and is used many places, either directly as a 32 byte value, or as some sort of Base-64 encoding of its value.
  • Block shows the blocks and says where they reside. Blocks are mostly size --blocksize, but sometimes a short block is required.
  • Remotevolume is destination files, including blocks. Fileset gets dlist file. Blocks are stored in .zip files called dblock files.
  • .dindex files speed up things like DB recreate by saying which blocks can be found in which dblock file, and give other aids.

Thanks for your interest! By the way, if you have a specific restore problem you can open a new topic here. The current thread seems to be aimed at speed, and possibly robustness improvements to avoid such issues.


Although it’s sometimes difficult to isolate the code you want to measure, I’ve found BenchmarkDotNet to be a useful tool for benchmarking.

Thank you for all the info. Just as an update: I'm having Internet connection issues, so it's been hard to test things on a real backend.

I still plan to take a look at the queries and see if I can help.


I have gotten my Internet connection back, and attempted a new database rebuild of my backup.

Seems to me like the big performance killer might be FindMissingBlocklistHashes @ LocalRecreateDatabase.cs:179. That query takes >3 minutes every time, and I have no clue how many times it's supposed to be executed.

I’m at 70% of the recovery process, and I’m guessing it won’t move much further anymore just because of this query.

I'm trying to understand what it does. What I gather from the name is that it attempts to gather the hashes of all missing blocklists. In other words, it finds instances of data with relationships to a Blockset where the Blockset is not there (or has no entries? this is an important detail).

This is quite confusing: glancing at the query, what I see is that some data gets inserted into the database - so it's not really just a “Find” after all :slight_smile: The method has no documentation. Can anyone offer some help?

In the meantime, I'll try to make sense of the queries later today.

After almost 24 hours, it failed with:

2022-02-20 07:01:47 +01 - [Error-Duplicati.Library.Main.Operation.RecreateDatabaseHandler-IndexFileProcessingFailed]: Failed to process index file:

(not sure about the timestamp, it failed around 12:45 GMT+1)

This is where historical knowledge of the code would be useful. However, those who were around back then are quite short on time (@kenkendk?). The commit message (along with others that touch the method) offers a few clues:

I think you have described many details of the algorithm and the parts.

But just to make it clear: it is possible to restore files entirely without the dindex files.
The only purpose the dindex files serve is to reduce the download required to rebuild a local database. The local database is intended to be an easily queryable dataset describing the remote storage.

The dindex files are generally 1-to-1 with the dblock files, but in some cases (handling partial data removal) the dindex files can cover multiple dblock files, and do not need to be tied to anything. In principle, there could be a single dindex file for the entire backup.

And I agree, the queries required to reconstruct the database are way too complex. The complexity is caused by a goal of keeping a “reasonable” memory requirement, keeping all lookups in the database. It succeeds in that respect as I have been able to recreate/restore a 30GiB backup on a machine with 256MiB total memory, but it does have performance costs.

Yes, that is also how I recall it. The key to understanding that function is to understand the recreate process, implemented in RecreateDatabaseHandler.cs.

The process starts by grabbing one (or more) dlist files and creates a database that contains an entry with path, hash and metadatahash for each item that needs to be restored. This can be done by only examining a single dlist file, and can be filtered if only a part of the data needs to be restored.

The next step is downloading all the dindex files. This includes recording which datablocks are in which remote dblock files. If there are any blocklists, these are also added to the database.

At this point, the database is supposed to be complete and correct. It is possible to compute the number of blocks in each file, and this makes it possible to figure out if there are any blocklists that are incomplete. Blocklists are essentially a list of hashes, stored as a block in the remote data, and described as multiple rows in the database.

The call to FindMissingBlocklistHashes will check if any blocklists are incomplete. It does this by inserting the needed data into BlocksetEntry.
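In rough Python, my reading of that expansion step is something like the following. This is a hedged sketch of the idea only, not Duplicati's actual SQL; the function name is made up, and it reuses the 64-byte blocklist from the earlier two-block example:

```python
import base64
import hashlib

def blocklist_to_entries(blocklist: bytes):
    """Expand a blocklist (concatenated 32-byte hashes) into ordered
    (index, block hash) rows, i.e. the data needed to complete
    BlocksetEntry. A sketch of the concept, not the real implementation."""
    assert len(blocklist) % 32 == 0
    return [(i // 32, base64.b64encode(blocklist[i:i + 32]).decode())
            for i in range(0, len(blocklist), 32)]

# The 64-byte blocklist from the earlier example: two entries, both
# pointing at the same deduplicated block.
h = hashlib.sha256(b"A" * 1024).digest()
entries = blocklist_to_entries(h + h)
print(entries[0][0], entries[1][0])    # indices 0 and 1
print(entries[0][1] == entries[1][1])  # True: same block hash
```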

The related call is then made to GetMissingBlockListVolumes, which will retrieve the dblock files that are needed to complete the database tables. It does this in three steps:

  • step 0 will return dblock files that are expected to contain the missing information
  • step 1 will return any dblock files that are present remotely but have no links in the database
  • step 2 will return all dblock files

The idea behind the approach is that step 0 is the most targeted, and should get the database back on track. Step 1 is a guess that the database is not yet correct due to some dangling files. And finally there is the dreaded step 2, which will take forever, as it downloads all remote data in a last attempt at finding the missing pieces.

The FindMissingBlocklistHashes call will update the BlocksetEntry table, which is used to see if there is any missing data. So the method is called once before running the three-level algorithm, to check if the database is already complete. Then it is called after each dblock file has been processed, to see if we can stop early.

So the method will be called at most once per dblock file, plus one initial time.
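The three-level retrieval just described can be summarized as a small dispatch. This is only a sketch of the control flow with made-up volume names, not the actual code in RecreateDatabaseHandler.cs:

```python
def volumes_to_fetch(step, expected, unlinked, all_dblocks):
    """Sketch of the three-level algorithm described above.
    step 0: dblocks expected to contain the missing information;
    step 1: dblocks present remotely but unlinked in the database;
    step 2: all dblocks (the slow last resort)."""
    if step == 0:
        return expected
    if step == 1:
        return unlinked
    return all_dblocks

# Illustrative usage with hypothetical volume names:
print(volumes_to_fetch(0, ["b-3.dblock"], ["b-9.dblock"],
                       ["b-1.dblock", "b-3.dblock", "b-9.dblock"]))
# ['b-3.dblock']
```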

Also, to explain the process, there is an article on the Duplicati restore process that explains the details with a small example.


Just wanted to update here that I have been working on some performance improvements in the database rebuild process. Currently looking at around a 3x performance increase.


If you can maintain 3x and stability (or even improve that too), that will be fantastic :smiley: