How long should a repair of a database take?

tiberius · November 10, 2017, 11:57pm

Hi, I’ve been running a repair for about 3 days now. It shows the status as “starting” and task manager shows an average of about 30% CPU usage consistently throughout the time the db repair has been running, so I figured it’s still working? Just wanted to get confirmation. I’m backing up to onedrive for business in the cloud and so, I figure maybe there are so many files, plus the fact that it’s got to compare them from the cloud to the local db, that maybe this is expected? I just wanted to confirm and make sure I’m not waiting around for nothing and should try something else. The original issue was that my backup to the cloud was failing due to missing files and it said to run a repair. Thanks in advance!

JonMikelV · November 11, 2017, 5:00am

The length a repair varies quite a bit based on the number and length of file paths as well as the number of versions and, I think, individual blocks.

The “compare them to the cloud” part probably isn’t as much of the time as you might think because the actual archive files (likely) aren’t being processed so much as the remote index files, which are much smaller.

Three days does seem a longer than normal duration, I’ve heard of at least one process that took 5 days. However, it is still pretty long…

If you look at your Duplicati main menu “Show log” page then direct Live and Profiling, do you see transactions being logged?

tiberius · November 12, 2017, 4:25am

No actually, there have not been any transactions logged in the debug log since the day I started the repair.

JonMikelV · November 12, 2017, 4:43am

I’ll try to do a repair on a larger test backup and watch the log to see if I can identify potential log entries to expect…

tiberius · November 18, 2017, 1:16am

It was still running and it had been almost two weeks, so I decided to do a delete and repair, and now that failed with: “No files were found at the remote location, perhaps the target url is incorrect?” I tested the connection from Duplicati to OneDrive and it’s fine, and it actually did create a new file in the folder path for the backup when I did the Recreate (delete/repair) option… So I fear there is some sort of issue with duplicati finding existing files and now I may have to start over from scratch…

JonMikelV · November 20, 2017, 6:09pm

Sorry for the delay in following up. When you tested the connection to OneDrive did you see Duplicati files in the OneDrive destination?

I’m not sure how long a repair “should take” but in my high-file-count-low-block-count (local destination) test job a repair of a 300M .sqlite file WITH NOTHING WRONG took about a minute. However a recreate (delete-and-repair) took 65 minutes.

Note that this test backup is a stress test of 500,000 files (low version count), so there was a lot data to be rebuilt - however since the destination was a local file system, it may not accurately take into account file transfer overhead.

tiberius · January 4, 2018, 8:44pm

Yes, I have plenty of files in OneDrive. They are about 50MB a piece, and too many files to count. This is the error I’m getting now.

Duplicati.Library.Interface.UserInformationException: The database was attempted repaired, but the repair did not complete. This database may be incomplete and the backup process cannot continue. You may delete the local database and attempt to repair it again.
   at Duplicati.Library.Main.Operation.BackupHandler.Run(String[] sources, IFilter filter)
   at Duplicati.Library.Main.Controller.<>c__DisplayClass16_0.<Backup>b__0(BackupResults result)
   at Duplicati.Library.Main.Controller.RunAction[T](T result, String[]& paths, IFilter& filter, Action`1 method)
   at Duplicati.Library.Main.Controller.Backup(String[] inputsources, IFilter filter)
   at Duplicati.Server.Runner.Run(IRunnerData data, Boolean fromQueue)

JonMikelV · January 6, 2018, 3:36am

Well, that certainly sounds like things are pointing to where they should be…

Normally a database repair tries to download the necessary dindex files (smaller than the dblock archive files) to fill in it’s missing gaps. A database recreate would need to download all the dindex files. If any dindex files are missing or broken then the larger dblock archive files are downloaded. If dindex files are broken so badly that Duplicati can’t figure out which dblock files to download, then dblocks five has to be downloaded until the missing index (and database) data can be recreated. In the case of recreate with no dindex file, then EVERY dblock file would need to be downloaded.

So the actual length of a repair, from a downloads point of view, can vary greatly depending on what actually went wrong. On top of the downloads, there is also a lot of database processing that has to happen - and some of that code is probably not as optimal as it could be.

At this point, you’ll have to try the Recreate (delete/repair) option again to get the local database working.

Note that you should still be able to do a “Direct restore from backup files …” even if there is no local database, so the already done backups are still usable.

In fact, that might be a good test to make sure everything is pointing to the right place. If you can do a test restore of a single file, then we know the job is pointing to the right place which means if the database repair still won’t work then there’s a failure point somewhere.

Note that for test restores you should either restore an older version of a file or make sure to enable the --no-local-blocks parameter (which tells Duplicati to not use the local file to speed up the restore).

By the way, I edited your post by adding “~~~” before and after the error message to make it a bit easier to read.

conrad · March 19, 2018, 1:58pm

Just as a late note on this, it took two weeks to recreate the database for a 1.4TB backup of mine after “repair” repeatedly wouldn’t work. This appears to have been largely constrained by mechanical hard drive I/O I think. It seems to take pretty much as long as the initial backup run, unfortunately.

JonMikelV · March 20, 2018, 8:28pm

Thanks for the followup! Not many people have the patience to let the process finish if it takes that long.

I have a local 1TB backup that I think I might try a recreate on just for comparison as it seems for some people it’s relatively quick and for others it literally takes weeks.

conrad · March 20, 2018, 9:18pm

It does seem oddly long: if it had to read every byte of the source dataset, that implies a read speed of 1.4TB / (14 days) / (86400 seconds per day) = 1.2MB/s. Given the drives that are in there it should have been able to do that maybe a hundred times that fast if the data were accessed linearly. So does the rebuild scan the source dataset more or less randomly, meaning the bottleneck is seek speed/IOPS, not bandwidth?

JonMikelV · March 21, 2018, 3:01pm

Actually, the rebuild doesn’t look at the source data at all - it’s purely the destination data, otherwise no history would be included. Depending on the retention policy, age of the backup job, and amount of changes that happen with your source data that might mean even though you have a 1.4TB source you could have double that in the backup destination.

Those destination files (not necessarily all of them) need to be downloaded and are used to rebuild the local database. Technically, most of the local database is just a faster way to access all the history and hashes that exist in the destination - otherwise things like deduplication would require TONS of downloads with each backup job.

So a better throughput calculation would be to look at the destination size (either directly or in the job summary) and divide that by repair time. And when looking at potential bottlenecks like RAM and IOPS you also need to consider network bandwidth. (Of course if the destination is a physical drive attached to the machine, network bandwidth can be ignored.)

I started my recreate of a 94.74 GB backup (destination size) with 17 versions about 11 hours ago. The original sqlite file was 853MB and in-process one is at 253MB, so it’s looking like I’m running at about 23 MB/hr (sqlite size) thus will need about 37 hours total for the recreate from a LAN based destination over (old / slow) wifi.

Of course most people won’t know their original database size before doing the recreate, so that’s not the best way to do estimates…

I have a profiling log file being generated so hopefully when it’s done I can dig through that and look for slow points.

Edit:
Here I am about 8 hours later and the sqlite file size is still 253 MB, no new entries in the job log since around 2 AM, but the Live Profiling log shows DB transactions as recently as a minute ago. So maybe DB file size isn’t such a good estimation method…

conrad · March 21, 2018, 3:41pm

Ok then that’s super weird. As I said, my NAS was completely I/O bound during the rebuild — which was running on the NAS, so the source was local disc. The destination is BackBlaze B2, but I have a fairly fast internet connection (70Mbps down), never saw any significant use of download during the recreate, and total download traffic from B2 over that time looks to have been around 30GB.

Since I’ve only recently started running remote backups, source and destination are actually around the same size (both ~1.4TB, 40 versions).

I wish I’d instrumented the recreate better myself. Could the slowness be down to nasty database access patterns then? My database is 4.7GB: 14 days’ recreate time would imply 14MB/hr, not far off your 23MB/hr.

JonMikelV · March 21, 2018, 9:05pm

Sure could.

There are a number of known areas that could benefit from SQL optimization (“easy”), and a few known design decisions that could be revisited (“hard”)…

conrad · March 22, 2018, 11:21am

Noting your 8 hours of busy stasis above, that’s exactly the kind of thing I experienced during my recreate runs: days-long periods during which the progress meter didn’t budge by 1 pixel, but the machine was nevertheless working very hard; the process seems non-deterministic, in that these points were different between my two recreate runs on the same data (the first died at 11 days for other reasons). If I get time at some point I may try running a recreate on another machine with some database instrumentation.

JonMikelV · March 22, 2018, 3:19pm

Yes. The problem with progress bars is that you have to have a target something for completion. With a database recreate we don’t actually know what the “I’m done” target is, so the progress bar is very hard to get right.

I’m gonna throw this out there for @Pectojin to see what he thinks - would it make sense to include the current database “block table” record count as part of the manifest file in the dlist zip so that in cases of doing a recreate we can display the progress bar based on how many records we’ve written vs. the most recent “known” number?

Pros:

a progress bar that progresses

Cons:

small amount of “wasted” space in each dlist for something that will (hopefully) never be needed and even then is only for GUI purposes
if there are no dlist files we’re back to a practically static progress bar

TowerBR · March 22, 2018, 4:14pm

I think the main issue here is more basic: if you need a restore, would you be willing (could) wait days (or weeks) to do it?

Okay. You simply identified a backup problem, but (fortunately) you don’t need a restore, you just want your backup to work again. Question 2: Would you be willing (could) suspend your backup for days (or weeks) until the database repair is finished?

JonMikelV · March 22, 2018, 8:47pm

It depends on the scenario. If you need to restore everything, then yes - this would take quite a while (I still have some testing scheduled for that scenario once my current test is done).

If you’re only restoring a few files then you can restore directly from the destination and not need an entire database recreate to be completed first.

At the moment I’m assuming the intent of the OP’s question is to know how long it will take for the repair / recreate to finish - aka, how long will their machine be stuck NOT being able to make backups thus leaving fresh data at risk.

Pectojin · March 23, 2018, 8:09pm

Hmm, there is progress indication code in the recovery method, but I’m not sure it works as intended.

The way it’s implemented in the code is how I was imagining doing it, but there must be something wrong

The recreate method lists all the files from backend and then it keeps track of how many files it downloaded and processed vs how many there are in total. That way you should see an almost linear progression as each file you’ve downloaded and processed puts you n+1/total_files percent closer to being done.

JonMikelV · March 24, 2018, 4:22pm

This is just a guess but I suspect the culprit is the “does this hash already exist” SQL that doesn’t scale well as the record increase.

So the per-file progress works, but as more and more hashes are added each file of size X appears to take longer and longer. Toss a large file in there and the hash lookups take so long it appears to stall.

That’s why I was proposing basing it on hash records instead of files. The “more = slower” issue would still exist, but there progress bar would still progress with each hash.

And if course that SQL should be reviewed. If it can’t be optimized in place then a skinny temp table might be in order.

All in theory, of course…