Recreating DB triggers downloads of all DBLOCKs

The dindex files also contain copies of “indirection” blocks (aka blocklists). These are “hashes of hashes” and are needed to work out which hashes are actually required.

If the update process is slow, I think a safe way to speed it up would be to do batch updates, where all hashes are injected into a temporary table and a single batch update statement is then run.
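As a rough illustration of that idea, here is a minimal Python/sqlite3 sketch that stages hashes in a temporary table and then applies one set-based statement instead of one UPDATE per hash. The table and column names (Block, Hash, VolumeID) are illustrative, not necessarily Duplicati’s actual schema.

```python
import sqlite3

def mark_blocks_present(conn: sqlite3.Connection, hashes, volume_id: int):
    # Stage all hashes in a temporary table first...
    conn.execute("CREATE TEMP TABLE IF NOT EXISTS TempHash (Hash TEXT PRIMARY KEY)")
    conn.execute("DELETE FROM TempHash")
    conn.executemany("INSERT OR IGNORE INTO TempHash (Hash) VALUES (?)",
                     ((h,) for h in hashes))
    # ...then run one set-based update instead of thousands of single-row ones.
    conn.execute(
        "UPDATE Block SET VolumeID = ? "
        "WHERE Hash IN (SELECT Hash FROM TempHash)",
        (volume_id,))
    conn.commit()
```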

The dindex files also contain copies of “indirection” blocks (aka blocklists). These are “hashes of hashes” and are needed to work out which hashes are actually required.

Thank you for your patience with me. I have re-read the whitepaper and I think I understood what you had in mind. A follow-up question: which blocklist(s) does each DINDEX file refer to? The problem is that DINDEX files are DBLOCK specific, but blocklists are DLIST specific…

If the update process is slow, I think a safe way to speed it up would be to do batch updates, where all hashes are injected into a temporary table and a single batch update statement is then run.

If you could code this in one of the upcoming releases, I would be a very happy guinea pig :slight_smile:

There is no dependency in the names, so you cannot know without reading the contents of the dindex file. In other words, the dindex and dblock files are both named randomly, independently of each other.
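For anyone who wants to poke at this, here is a hedged sketch of reading a dindex file to see which dblock it describes. It assumes (per my reading of the Duplicati docs) that a dindex file is a zip archive whose “vol/” entries are named after the dblock file they index, and that the file has already been decrypted; if the layout differs in your version, treat this as pseudocode.

```python
import sys
import zipfile

def dblocks_referenced_by(dindex_path: str):
    # List the "vol/" entries; each is assumed to carry the name
    # of the dblock file this dindex describes.
    with zipfile.ZipFile(dindex_path) as z:
        return [name.split("/", 1)[1]
                for name in z.namelist()
                if name.startswith("vol/") and name != "vol/"]

if __name__ == "__main__":
    print(dblocks_referenced_by(sys.argv[1]))
```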

Thanks!

I have been rebuilding the DB for a few days now… the backup sits on a Banana Pi which is good for around 10 MByte/s - but judging by the network throughput to the BPi, this is not the bottleneck. The size of the backup is around 500 GB (from 53 sessions), data chunks are 50 MByte (9,338 dblock files), and the index files take around 500 MB of space in 9,285 files. I did not save the DB before rebuilding (I am doing the rebuild just out of “curiosity”), so I also don’t know the size of the DB.

Although I understood most of the above discussion and have also read the whitepaper, there is a really important topic right now:

For fast, complete and reliable disaster recovery - meaning the complete loss of all source technology / data - a solution is needed.

After having lost my laptop, I would like to buy a new one, connect my Duplicati drive (or NAS or cloud) to it, remember my password (uh :slight_smile: ), and then restore my data with an acceptable overhead compared to the raw file-get throughput.

The normal assumption - that I have a complete backup - should be taken into consideration for the design. So if it helps to store the DB on the remote system after every backup session and then simply restore that DB to reach the goal of a fast recovery, why not… If the system then finds, after checking this DB, that it is incomplete, it should still make a best effort to restore everything that can be found completely, and then continue with older versions or whatever. Just for my understanding: the consistency of the DB is only checked against the dindex files, which should not be that hard…

Should some “check” be performed regularly to ensure index files are not broken? If I do have corruption somewhere, I’d want to know sooner rather than later.


What I tested next was to restore my home directory “from scratch” (i.e. without the database) from a local USB3 drive. This seemed to be successful and took around 12h for 250 GB (about 5 MByte/s). Is this in the range of what has to be expected?

Still, this does not help with the issue of rebuilding the DB: as mbrijun wrote, it is not mainly limited by I/O throughput from the backup location but by the speed of updating SQLite…

For the situation right now, this means: disaster recovery (AKA loss of the source system and a total restore from backup) is feasible, albeit slowly. But continuing to use this backup to append new data is not really feasible, as the DB rebuild takes far too long…

Thanks for providing some real world stats!

The database performance is definitely an issue in certain circumstances.

Hopefully the threading and hashing updates currently being worked on will improve those restore times enough that people stay confident in Duplicati’s functionality / performance while waiting for the database rewrite. I expect that’s when we’ll see the biggest gains on large / “long” (many-version) backups.

Regarding the “space consumption” vs. “reliability” trade-off: why wouldn’t it be an idea to handle the database for the backup as a very special file that is backed up after every backup run on this profile? This would - at the price of a bit more data at the destination (my 1-year-old DB was around 2.5 GB for 250 GB of home dir and 450 GB of historical backup in >50 versions) - provide the fastest possible route to a disaster restore with a fast and complete view of files and versions. Versions also matter for disaster restore when the restore is necessary because of infected machines…

Funnily enough, I have excluded the DB file from the homedir backups for a year now, as Duplicati obviously complained about the constantly changing file :slight_smile:

I’m personally not against the idea, but usually the discussion around trying to do such a thing stops when we get to one or more of these points:

  1. The database files can get quite big (as you’ve seen)
  2. Backing up the database file after a backup is done is, itself, a backup - which would use the database being backed up, so there’s a lot of new coding needed to support this type of functionality
  3. Databases don’t always put new data at the end of their files, so a small change in contents can cause a large portion of the database to be backed up again, potentially chewing up more destination space than one might otherwise expect

One thing you could do is create a “database backup” job that backs up all the files EXCEPT the ones related to the new “database backup” job itself, then fire that job off when a normal “data backup” job runs via a --run-script-after parameter (see the sketch below).

This gets your database backed up in the cloud; however, restoring data would mean two restores - one to restore the database, then another to restore the data using the restored database. Unfortunately, it’s not a very smooth process at this time.
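To make that a bit more concrete, here is a minimal sketch of a script that could be passed to the “data backup” job via --run-script-after and that simply fires the separate “database backup” job through the Duplicati command line. The paths, target URL and passphrase are placeholders, and the exact options may need adjusting for your setup.

```python
#!/usr/bin/env python3
import os
import subprocess

# Placeholders - adjust to your own jobs and destination.
DATA_JOB_DB = "/home/user/.config/Duplicati/ABCDEFGHIJ.sqlite"       # DB of the "data backup" job
DB_JOB_DB   = "/home/user/.config/Duplicati/duplicatidb-job.sqlite"  # DB of the "database backup" job
TARGET      = "file:///mnt/backup/duplicatidb"                       # destination of the "database backup" job

def main():
    # Duplicati exposes the operation name to run-script-after scripts;
    # only trigger the second job after an actual backup run.
    if os.environ.get("DUPLICATI__OPERATIONNAME", "Backup") != "Backup":
        return
    subprocess.run(
        ["duplicati-cli", "backup", TARGET, DATA_JOB_DB,
         "--dbpath=" + DB_JOB_DB,
         "--passphrase=change-me"],
        check=True)

if __name__ == "__main__":
    main()
```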

I will track the DB size regularly from now on; spending 1-2% of the backup size/costs on having much faster restores would be a good trade-off, I guess.

I have no clue about the whole implementation of the functionality right now. But would a scripted implementation of the “database backup” job you recommend be a straightforward approach within the existing architecture? Maybe just prefix everything related to the db backup with duplicatidb- and that’s it.

Wouldn’t that be another demonstration of Duplicati’s power of deduplication? BTW: the 2.5 GB DB could be gzipped down to 800 MB.

I will definitely do so and note down the 1st and 2nd DB sizes over the next weeks. But as you mention, it is not that smooth and will need an expert when it comes to “Steve’s afterlife data restoration”, like Tapio mentioned above.

I’d be curious how big the user base of Duplicati 2 is right now, and what the backup sizes, history depths and DB sizes are… That way one could estimate how long a DB rewrite process would take…

Thanks! That’s one of the things I’d like to see added to the stats tracking (see below).


Yes and no. Duplicati uses a fixed block size - let’s say 100KB. So if we have a 1,000 KB database and 1 byte is CHANGED anywhere, the deduplication makes sure we only need to back up 100KB to include the changed block.

But if even 1 byte is ADDED, say right in the middle of the database, then everything after it gets “pushed” by 1 byte. Now the deduplication will see the entire 2nd half of the 1,000 KB database as different and we back up 500KB, even though only 1 byte was added.
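This is easy to reproduce with a toy script. The sketch below hashes a random 1,000 KB buffer in fixed 100 KB blocks and compares the block hashes after a 1-byte change versus a 1-byte insertion; the insertion shifts everything behind it, so roughly half the blocks come out as new.

```python
import hashlib
import os

BLOCK = 100 * 1024  # fixed 100 KB block size, as in the example above

def block_hashes(data: bytes):
    # Hash every fixed-size block of the buffer.
    return {hashlib.sha256(data[i:i + BLOCK]).hexdigest()
            for i in range(0, len(data), BLOCK)}

original = os.urandom(1000 * 1024)          # a 1,000 KB "database"

changed = bytearray(original)
changed[500 * 1024] ^= 1                    # CHANGE one byte in place
inserted = original[:500 * 1024] + b"\x00" + original[500 * 1024:]  # ADD one byte

known = block_hashes(original)
print("new blocks after a 1-byte change:", len(block_hashes(bytes(changed)) - known))  # 1 block
print("new blocks after a 1-byte insert:", len(block_hashes(inserted) - known))        # ~6 blocks (the shifted half)
```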


Some of that information is available now, but when it was implemented it wasn’t designed for this type of performance debugging, so it doesn’t include everything one would want in a scenario like this.


Personally, I’d like to see a “package restore process” feature that would run at the end of a backup (or as often as configured). This would basically package up the job settings and databases into an executable. Somebody could then run that executable and, after entering the correct password, it would set up a portable version of Duplicati with the job and database intact, open a web browser, connect to it, and leave the user on the Restore page.

But that’s kind of a lot of work for something that is likely to be used infrequently. :frowning:

Yes, there are regular checks that test the index files for corruption (the same as for the data files). I am not sure when the problem happens, but it must be a case where the database does not correctly record the need for a particular index block.

I agree. The current solution is to use this Python script to do it: duplicati/Tools/Commandline/RestoreFromPython at master · duplicati/duplicati · GitHub

The “acceptable overhead” part is probably not covered here, and does not have a simple solution.

Thanks for the link, I will have a look at that. To get some idea about times and volumes, my current plan is to set up a dedicated system to which I copy my home directory weekly. On this system I will then back up this directory with both duplicati and duplicity in an incremental manner - with duplicati I will also back up the database as a separate backup. I will then restore with duplicity, with duplicati, and with duplicati plus the DB backup to get an idea about speed. For duplicati I will also try to rebuild the DB from scratch. As this test system should then run for several months, I would love to automate it as much as possible, so it needs some time to set up. A similar setup with Windows cloud backup solutions would be great, but I think my resources are too limited for that…

A first manual check of restoring from scratch with a full backup led to the following numbers: 250 GB homedir → Duplicity: 175 GB backup, 12h restore; Duplicati: 165 GB backup, 10h30 restore, DB size 720 MB, DB backup size 330 MB.
Especially the DB size looks IMHO promising enough to store it regularly.

I hope that what I am planning helps with development. As soon as I have the setup done, I will start a new thread in Comparisons or so.

For what it’s worth, when I started off I was backing up my Duplicati databases after every backup, but I eventually gave up on this practice. I’m sure everyone’s experience will vary, but at the time my sqlite DB was about 4GB, and there was enough change from one backup to the next that it was uploading almost the entire 4GB even when very little of the source data changed.


That was my experience as well.

The beta version does a “vacuum” operation at the end, meaning that it rewrites the entire database. The canary build does not do that. Since SQLite uses “pages” internally, it should be more efficient to back up the database with the recent versions.
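If you want to see what this means for your own database, the page layout can be inspected from Python’s sqlite3 module; the path below is a placeholder. A VACUUM rewrites the whole file (so nearly every block looks changed to the deduplication), whereas skipping it leaves unchanged pages in place.

```python
import sqlite3

conn = sqlite3.connect("/path/to/backup-job.sqlite")  # placeholder path

page_size  = conn.execute("PRAGMA page_size").fetchone()[0]
page_count = conn.execute("PRAGMA page_count").fetchone()[0]
freelist   = conn.execute("PRAGMA freelist_count").fetchone()[0]
print(f"{page_count} pages of {page_size} bytes, {freelist} on the free list")

# conn.execute("VACUUM")  # rewrites the whole file - this is the step the canary builds skip
```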

That’s great news!

Now we’ll just see which happens first: a new beta build, or me getting over my laziness and playing with canary :slight_smile:
