Recreating DB triggers downloads of all DBLOCKs

OS: Windows 10 (1703), 64 bit
Version: 2.0.2.2_canary_2017-08-30 (clean install, profile deleted prior to install)
Backend: Amazon Cloud Drive
Backup size: 100k+ files, 450GB

After a clean reinstall of the latest version, I initiated a rebuild of the DB. It downloaded the filelists, followed by the index files, and then proceeded to download the DBLOCK files. This is likely to take several weeks. I was under the impression that the index files exist precisely so that the DBLOCKs do not have to be downloaded.

The backups have been created over the past year using various “canary” pre-releases.

I like Duplicati and I am seriously considering making it my only backup, but this sort of behaviour makes me rather nervous.

Please help me get to the bottom of this. I have kept a copy of a previous DB, so this is not the end of the world; however, I am concerned that this is either caused by a critical bug or that there is deep-rooted corruption.

> 2 Sep 2017 10:16: Message
> Processing all of the 9139 volumes for blocklists
> 
> 2 Sep 2017 09:54: Message
> Processing required 5 blocklist volumes
> 
> 2 Sep 2017 09:27: Message
> Filelists restored, downloading 9144 index files
> 
> 2 Sep 2017 09:23: Message
> Mismatching number of blocklist hashes detected on blockset 247624. Expected 0 blocklist hashes, but found 1
> 
> 2 Sep 2017 08:39: Message
> Rebuild database started, downloading 199 filelists

Thank you.

A good precaution is to schedule a backup of the local database right after the last backup task, in a similar way.

If data has to be restored to a new system, first restore the database backup itself. Then restore the data, pointing dbpath at the database you restored a few minutes earlier.

If this is already your latest database, just point dbpath at it and start your data restore.

Having the latest database backup to hand during a bare-metal restore can save hours or even days.

@saviodsouza: thank you for your reply.

I agree that having the latest DB is a huge time saver. Having said that, I am still concerned about the fact that the engine cannot rebuild the DB without embarking on a massive download of the whole archive. This is a question of usability and reliability. Relying on a DB being available is just a workaround in my mind.

Is there any useful troubleshooting I can perform so that the reason for this behaviour can be understood?

Thank you.


BTW: A good way of showing your appreciation for a post is to like it: just press the :heart: button under the post.

If you asked the original question, you can also mark the reply that solved your problem as the accepted answer, using the tick-box button you see under each reply.

All of this also helps the forum software distinguish interesting from less interesting posts when compiling summary emails.


As the output states, there is some error in the index files. When Duplicati attempts to rebuild the database, it figures out that there is a problem (some data is not found in the dindex files). Since there is no way of knowing which dblock file has the missing data, it just starts pulling down the files until it finds what it needs.
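In rough pseudo-code, the fallback looks something like this (a sketch for illustration only, not the actual implementation):

```python
def resolve_missing_blocks(missing_hashes: set, volumes: dict) -> dict:
    """`volumes` maps a dblock file name to the list of block hashes it
    contains (a hypothetical in-memory stand-in for downloading and opening
    each remote file)."""
    located = {}
    for volume_name, hashes in volumes.items():
        if not missing_hashes:
            break  # everything found; no need to touch further volumes
        for block_hash in hashes:
            if block_hash in missing_hashes:
                located[block_hash] = volume_name
                missing_hashes.discard(block_hash)
    return located
```

In the worst case the missing blocks sit in the very last volumes examined, which is why the scan can end up downloading everything.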

I wonder if there is a middle ground. If I restore an old DB against the same, potentially corrupt, archive, the DB “repairs” very quickly. What I mean is: is there a way to fix the corruption quickly, using the data already in the old DB? Thank you.

Not really, but you can force it.
Simply remove (or rename) all the dindex files on the remote store, then run the “Repair” command and it will re-create all the missing dindex files, which should fix the problem.


> Simply remove (or rename) all the dindex files on the remote store, then run the “Repair” command and it will re-create all the missing dindex files, which should fix the problem.

This gave me an idea. I downloaded the whole archive to a local drive (it took less than 24h; thank God for broadband). Then I removed all the DINDEX files, so all that was left were the DLIST and DBLOCK files. Finally, I removed the old DB and initiated a repair.

Once the repair engine got to the DBLOCK files, I observed a processing speed of around two 50MB DBLOCK files per minute, or around 100MB/min. At this rate it is likely to take around 75 hours to ingest a 450GB archive.

The profiler is displaying a lot of

Starting - ExecuteScalarInt64: SELECT "VolumeID" FROM "Block" WHERE "Hash" = ? AND "Size" = ?

in between downloads of DBLOCK files.

Perhaps this is the area that could be improved with a slightly different logic, maybe an in-memory structure that would not rely on querying the DB so much?

Thank you.

We tried this for other operations earlier and it had a large memory overhead and did not improve execution time. But maybe we could batch the volume lookups to improve the performance here.
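Something along these lines, perhaps (a rough Python sketch of the batching idea; the table and column names follow the query shown in the log, everything else is hypothetical):

```python
import sqlite3

def lookup_volume_ids(conn: sqlite3.Connection, blocks):
    """Hypothetical batched form of the guarding lookup shown in the log.

    `blocks` is a list of (hash, size) tuples gathered while reading one
    dblock file; a single query resolves the whole batch instead of one
    SELECT per block. Batches must stay below SQLite's parameter limit
    (999 by default), so a real implementation would chunk the list."""
    if not blocks:
        return {}
    placeholders = ",".join("?" for _ in blocks)
    rows = conn.execute(
        'SELECT "Hash", "Size", "VolumeID" FROM "Block" '
        'WHERE "Hash" IN (' + placeholders + ')',
        [h for h, _ in blocks],
    ).fetchall()
    known = {(h, s): vol for h, s, vol in rows}
    # Pairs missing from `known` are new blocks that still need inserting.
    return {(h, s): known.get((h, s)) for h, s in blocks}
```

Whether this actually beats the per-block lookups would of course need measuring.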

Another experiment… After creating a RAM disk and placing the SQLite DB there, I can now see 6 x 50MB DBLOCKs ingested per minute, compared with 2 x 50MB when the DB lives on an SSD drive. A three-fold increase in speed.

EDIT: overnight, the speed of ingestion has dropped to 2 x 50MB DBLOCKs per minute. This is in line with the speed observed on an SSD drive. I guess this makes sense: if the ingestion procedure relies heavily on a SELECT operation, the database slows down as it grows bigger. What I cannot understand is this (without having read the source code, something I have not done yet). Based on the whitepaper, I understand that a DBLOCK is nothing more than a collection of data blocks and their hashes/sizes. First, if a DINDEX does exist, what does the engine expect to find in a DBLOCK that it cannot find in the DINDEX for that DBLOCK? Second, why does the DBLOCK ingestion rely so heavily on SELECTs? Is it trying to establish that each data block only appears once across all volumes?

Thank you.

It looks for hashes that are mentioned by the dlist files but are not mentioned in the dindex files. If you look in the database, you should see that some entries in the “Block” table have -1 as the VolumeID, indicating that Duplicati does not know where the block is located.

The lookup that you report as the problem is a guarding lookup that ensures that we only have each block mentioned once in the database.
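If you want to see how many blocks the rebuild is still looking for, a quick sketch like this works against the local database (the path is only an example):

```python
import sqlite3

# Point this at the database that is being rebuilt.
conn = sqlite3.connect(r"C:\Users\me\AppData\Local\Duplicati\backup.sqlite")
unresolved = conn.execute(
    'SELECT COUNT(*) FROM "Block" WHERE "VolumeID" = -1'
).fetchone()[0]
print(f"{unresolved} blocks are not yet mapped to a dblock volume")
```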

My use case is, if I am unavailable (erm, e.g. in case of death), give relatives:

a) oauth key
b) backup key and paths
c) Duplicati

and enable them to restore specific files. But the DB rebuild is a problem, and making them juggle database files is too difficult for them.

The restore process should be as easy as possible!

I am probably thinking too simply, but… couldn’t Duplicati simply save a complete database server-side and (temporarily?) download it in case of a restore?

> It looks for hashes that are mentioned by the dlist files but are not mentioned in the dindex files. If you look in the database, you should see that some entries in the “Block” table have -1 as the VolumeID, indicating that Duplicati does not know where the block is located.

Hm, this makes it sound as if the Duplicati engine does not quite trust the index files… Is there a good reason for that? I could imagine a situation where some DBLOCK files have simply been deleted by a user, along with their corresponding DINDEX files…

Maybe this could be a user-selectable setting, disabled by default, called something like “Perform a full DBLOCK scan in case of missing chunks”. Or maybe even a separate maintenance command. In my view, trusting the indexes should be the default stance.

> The lookup that you report as the problem is a guarding lookup that ensures that we only have each block mentioned once in the database.

Looking at the Block table and at the guarding query, two questions come to mind:

  1. The guarding query checks both the Hash and the Size. AFAIK, strong hashing functions like SHA-256 guarantee, for all practical purposes, that no two blocks will have the same Hash, regardless of their sizes. For that reason I think checking against the Hash is sufficient and the Size does not need to be taken into account.

  2. The Block table has an integer-based primary key. Would it not be practical to make the Hash the primary key, as it uniquely and reliably describes each block? As an added bonus, there would be an enforced check prohibiting the addition of another block with the same hash. Instead of relying on a guarding query, perhaps the same result could be achieved through error handling during an INSERT operation (see the sketch below)?
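To illustrate point 2, here is a minimal sketch of what I mean (an illustrative schema only, not Duplicati’s actual one):

```python
import sqlite3

# With the hash as the primary key, a duplicate block is rejected (or
# silently skipped) by the INSERT itself, so no separate guarding SELECT
# is needed.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "Block" ('
    '"Hash" TEXT NOT NULL PRIMARY KEY, '
    '"Size" INTEGER NOT NULL, '
    '"VolumeID" INTEGER NOT NULL)'
)
conn.execute('INSERT INTO "Block" VALUES (?, ?, ?)', ("abc123", 1024, 7))
conn.execute('INSERT OR IGNORE INTO "Block" VALUES (?, ?, ?)', ("abc123", 1024, 9))
print(conn.execute('SELECT COUNT(*) FROM "Block"').fetchone()[0])  # prints 1
```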

This is done to make Duplicati fault tolerant and work without the index files. The check for duplicates is required in case the index files are broken or duplicated.

That is exactly what it does. But when it figures out that the index files are broken, it will automatically scan the blocks. Stopping instead would leave you with a broken database.

In theory you are correct, but we did see a reported collision without the size check. This seems to happen because the smaller files provide too little info. Maybe someone with math/stats skills can explain why.
The query is not affected by this, as SQLite will use the index, scanning the hash first.
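For illustration, a minimal sketch of that behaviour (assuming a unique index over Hash and Size, as described; this is not the real schema):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE "Block" ("ID" INTEGER PRIMARY KEY, '
    '"Hash" TEXT, "Size" INTEGER, "VolumeID" INTEGER)'
)
conn.execute('CREATE UNIQUE INDEX "BlockHashSize" ON "Block" ("Hash", "Size")')
plan = conn.execute(
    'EXPLAIN QUERY PLAN '
    'SELECT "VolumeID" FROM "Block" WHERE "Hash" = ? AND "Size" = ?',
    ("abc123", 1024),
).fetchall()
print(plan)  # reports a search on Block using the BlockHashSize index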

Yes, but this would require storing the larger hash in all the places where it is needed, instead of the smaller integer ID. The check for duplicates is enforced with a unique index. We could do an INSERT IF NOT EXISTS, but it seemed to have little impact on performance.

> In theory you are correct, but we did see a reported collision without the size check. This seems to happen because the smaller files provide too little info. Maybe someone with math/stats skills can explain why.

This is an interesting thought. I have checked online, and according to this fairly recent discussion the only member of the family to have known collisions is SHA-1. But Duplicati uses SHA-2, for which there are at present no known collisions.

If a collision in SHA-2 had indeed been found by the Duplicati project, it would likely make headlines worldwide. It would be interesting to take the suspected colliding blocks and compare them using an alternative hashing tool.
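For scale, a quick back-of-the-envelope estimate of the odds of a purely random SHA-256 collision (the block count is an illustrative assumption):

```python
# ~450 GB at a 100 KB block size is roughly 4.5 million distinct blocks.
n = 4_500_000
expected_colliding_pairs = (n * (n - 1) / 2) / 2 ** 256
print(f"~{expected_colliding_pairs:.1e}")  # on the order of 1e-64
```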

I think you are referring to collisions as a security weakness, where you are able to replace (parts of) a file while maintaining the same hash value. Such attacks are really bad security-wise, but they have little impact on Duplicati (unless you try to trick the backup into not storing a particular piece).

If you like, you can also opt to use SHA-1 or MD5 in Duplicati.

If it is a random collision, I don’t think it will make headlines. I usually do not get to examine people’s files, so I cannot say whether it did happen, but it sounded plausible that it did.

> That is exactly what it does. But when it figures out that the index files are broken, it will automatically scan the blocks. Stopping instead would leave you with a broken database.

From my (limited) observation, the database “repair” process ingests DINDEX files pretty quickly, then moves on to the DBLOCKs. The ingestion of DBLOCKs is pretty slow, and it gets slower as the DB grows bigger. What I really had in mind with my original question is this: if index corruption is detected (or suspected), instead of ingesting DBLOCKs into the DB, why not create new index files without any interaction with the DB? As I understood from the whitepaper, the indexes only show which hashes live within each DBLOCK. Once the indexes are ready, they could be ingested into the DB, which seems to be a speedy process.
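To make the idea concrete, here is a very rough sketch (it assumes the dblock archives are already downloaded and decrypted locally, and that blocks are stored as zip entries named by their hash, which is how I read the whitepaper; real dindex files carry more than this, so it is only an illustration):

```python
import json
import os
import zipfile

def summarise_dblocks(folder: str) -> dict:
    """List which block hashes (and sizes) live in which dblock archive,
    without touching the database at all."""
    volumes = {}
    for name in os.listdir(folder):
        if not name.endswith(".dblock.zip"):
            continue
        with zipfile.ZipFile(os.path.join(folder, name)) as archive:
            # One entry per block; a real implementation would skip the
            # archive's manifest entry.
            volumes[name] = [
                {"hash": entry.filename, "size": entry.file_size}
                for entry in archive.infolist()
            ]
    return volumes

if __name__ == "__main__":
    summary = summarise_dblocks(r"D:\duplicati-local-copy")  # example path
    print(json.dumps({k: len(v) for k, v in summary.items()}, indent=2))
```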

I am sorry if this is a gross oversimplification, but I thought I would try to articulate my thought.