Resource usage of recreation of local database

MaloW · June 5, 2019, 5:48pm

I just hit an “Unexpected difference in fileset” error, and after some googling I tried recreating the local database to solve it. I was surprised by how long the recreation took, so I looked into why, and found that the resource usage on my computer was quite odd. This is my backup size:
Source: 179.87 GB
Backup: 171.26 GB / 14 Versions
It is located on a hard-drive inside a NAS and accessed over the network from my computer.

First off, I noticed CPU usage never really went above 20% on my 4c/8t CPU. I suppose the recreation process is mostly single-threaded; it would be nice if more cores were used to speed it up, but that’s not really odd, so that’s fine.
However I noticed that there was a VERY large amount of data written to my disk: https://i.imgur.com/6sLIwK8.png
At the same time I noticed that VERY little RAM was being used by Duplicati: https://i.imgur.com/j6SMQOJ.png
Checking the TBW (terabytes written) on my SSDs confirms the heavy writing: https://i.imgur.com/BiyuqNI.png
The whole recreation process has taken about 8 hours so far, and it’s about 80% complete. Measured over 1 hour, about 300 GB was written to my SSD.
Looking at the actual local database that has been created it comes in at about 2GB: htt ps://i.imgur.com/qxqxXb8.png (had to break this link due to limit of 3 links per post as a new user)

To me these stats seem quite odd. I can’t come up with an explanation for why several terabytes of data need to be written to my SSD (reducing its life-span, btw) when it all ends up as just a 2 GB file. If the process needs to iterate with writes, isn’t it better to do so in RAM and only write to the disk when complete? That whole 2 GB file could easily fit in my RAM, and if not, it could be paged to disk by the OS. I feel like doing so would probably speed up the whole process substantially. If I ever need to run a recreation again, I will at least make sure to have the local database on a RAM-disk while it runs and just copy over the completed files afterwards; I assume that would work. Or is there something weird with my system, and I’m the only one experiencing this odd resource usage?
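As a rough back-of-the-envelope check of those figures, here is a short Python calculation. It assumes the roughly 300 GB/hour write rate, which was measured over a single hour, held for all 8 hours observed so far; that is an extrapolation, not a measured total.

```python
# Rough estimate of how much was written versus how big the result is.
# Assumption: the ~300 GB/hour measured over one hour applies to all 8 hours.
WRITE_RATE_GB_PER_HOUR = 300   # measured over 1 hour via the SSD's TBW counter
HOURS_ELAPSED = 8              # recreation time so far (~80% complete)
FINAL_DB_SIZE_GB = 2           # approximate size of the recreated database


def estimate_written_gb(rate_gb_per_hour: float, hours: float) -> float:
    """Total data written if the rate is constant over the given time."""
    return rate_gb_per_hour * hours


def write_amplification(written_gb: float, final_size_gb: float) -> float:
    """How many times the final file size was written to disk."""
    return written_gb / final_size_gb


written = estimate_written_gb(WRITE_RATE_GB_PER_HOUR, HOURS_ELAPSED)
ratio = write_amplification(written, FINAL_DB_SIZE_GB)
print(f"Estimated written: {written:.0f} GB (~{written / 1000:.1f} TB)")
print(f"Writes relative to final database size: ~{ratio:.0f}x")
```

Under that assumption the writes come to about 2.4 TB, on the order of a thousand times the size of the finished database file.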

ts678 · June 5, 2019, 7:49pm

Hello @MaloW and welcome to the forum!

What Duplicati version are you using? I’m hoping your version is not 2.0.4.18 canary, because this should have been made better, though extended reads of the remote can still happen if data is genuinely missing.

Before the fix, there was a false positive on missing data because of a change where empty files weren’t actually put on the remote. The recreate code hadn’t been adjusted, so kept on looking for an empty file…

Empty source file can make Recreate download all dblock files fruitlessly with huge delay #3747

Although having logs (got any?) would be better, one can estimate what Recreate is doing using the following:

Recreating database logic/understanding/issue/slow

Server logs at Information level or above (e.g. Retry) can show what you’re fetching. I’d guess it’s dblocks.

Channels describes what canary is. If you decide to test it to see if it can Recreate faster, that’d be helpful. Don’t upgrade a production system to canary. It’s difficult to downgrade. Also, if you Recreate onto another system, don’t actually do a backup from there. You never want two systems backing up to one destination.

Typically, the first thing to try for “Unexpected difference in fileset” is to delete the version mentioned, perhaps using the Command option of the GUI. The syntax there is set up by default to run a backup, so generally you just need to remove the source paths that backup would need, and add the --version that delete needs.

Having tried to explain the busy disk as getting remote volumes in, I’m bothered to see 0% network activity.

EDIT: Are you a developer? I suspect 2.0.4.5 might be field-patchable using a debugger if you want to hack.

MaloW · June 5, 2019, 8:02pm

I’m using version 2.0.4.5, which I thought was the latest version, since “Check for updates now” in Duplicati says so. So I guess you can disregard everything I wrote if the later version already improved it! I didn’t know later versions are posted here on the forums.
I am indeed a developer, but I’ll probably just go ahead and install the latest version and hopefully it should be quicker if I ever have to recreate again in the future! If my issue persists after recreate I’ll go ahead and try the delete-method you advised.
Thanks!

ts678 · June 5, 2019, 8:11pm

You have the latest beta. You do not have the latest canary, which is bleeding-edge; people who are willing to use it for non-critical backups and to report the problems they find are very helpful for the next beta.

A chronological view of releases, along with the actual download locations, is here in GitHub. Canary can add nice features (sometimes having bugs), known bug fixes, and unknown issues (sometimes rather severe…).

Settings in Duplicati explains how you can adjust the update system. You can probably either set Canary, and update the beta when it finds one (updates go to a different directory and are detected at startup), or uninstall the beta and install canary as a fresh install. Uninstalling Duplicati should keep your old backup configuration.

MaloW · December 7, 2019, 12:51am

Just had another random “Unexpected difference in fileset” error happen, and again tried delete + recreate on the local database because nothing else worked to fix it. Again I’m having the exact same issues with the recreate as last time, namely HUGE amounts of data written to my SSD, with progress stuck at around 80% and seemingly no other resource of my computer being used. After letting it run for 8 hours I had to cancel it, after it had eaten up 3 TB written (that’s 1% of the warranty-covered life-span of my SSD).

This time I’m on version Duplicati - 2.0.4.23_beta_2019-07-14

ts678 · December 7, 2019, 2:07am

v2.0.4.23-2.0.4.23_beta_2019-07-14

Changes in this version:
This update only contains warnings to inform users on Amazon Cloud Drive that the service is discontinued.

meaning you’re effectively still on 2.0.4.5. Until such time as the next beta can be made, can you run Canary instead? It’s quite good now except for Stop after current file button bugs (avoid that).

MaloW · December 7, 2019, 10:15am

Oh, that’s a very confusing versioning system you guys are using. You’re telling me that version 2.0.4.23 does not include the things that version 2.0.4.18 did? Do the canary and beta branches work completely independently while still sharing a number-sequence for the versions?

I’ve been hesitant to swap to the canary builds due to the warning about not using for important data, since the backup I’m running is kinda important. But since the Beta builds aren’t working very well for me I’ll give the canaries a try and report back.

drwtsn32 · December 7, 2019, 10:32am

Yeah, this was very unfortunate, but I think it has only happened once. There was an important change that needed to be added, but version 2.0.4.6 (and many later ones) were already used by Canary releases. There have been discussions about how to avoid this in the future, but I don’t know if any decision was made.

I agree that in general I would not recommend Canary releases for your production backups, but the latest canary is quite stable (in many ways better than the current beta) and is getting close to being promoted to an experimental and/or beta release… So I am comfortable recommending its usage, too. I would switch back to the beta channel the next time one is released though.

ts678 · December 7, 2019, 2:07pm

The current release manager (who is the original author, but would be happy to have someone else do it) likes the Duplicati version numbering plan, whose main idea is (I think) that instead of bumping the number on the way to the Beta (e.g. at Experimental), the bump is done after a release that might be patched, which opens up lots of number space. Other releases such as Canary or Nightly (not yet running) get replaced, not patched, so Canary is possibly pretty much Nightly plus a human-made release note, which people may or may not read. If you stay on Canary instead of going back to Beta per the earlier note, then you should probably follow the forum’s Releases category for a while before actually taking a Canary update.

v2.0.4.1-2.0.4.1_experimental_2018-11-08

Changes in this version:
This is a collection release that is based on the 2.0.3.14 canary build.

shows previous bump-at-Experimental into 2.0.4.x range used for Beta, Beta patch, and also Canary…

The scheme might get finessed further, but to use a specific example about “Unexpected difference” fix:

v2.0.4.22-2.0.4.22_canary_2019-06-30

Fixed data corruption caused by compacting, thanks @ts678 and @warwickmm

If that had been v2.0.5.17-2.0.5.17_canary_2019-06-30, would 2.0.4.23_beta_2019-07-14 be assumed to include the 2.0.5.x fix or not? Based on number, not, but would someone rely on the date?
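The number-versus-date ambiguity can be made concrete with a small Python sketch. It parses release names quoted in this thread; the `parse_release` helper and the `Release` tuple are made up for illustration and are not part of Duplicati.

```python
from datetime import date
from typing import NamedTuple, Tuple


class Release(NamedTuple):
    number: Tuple[int, ...]
    channel: str
    released: date


def parse_release(name: str) -> Release:
    """Split a name like '2.0.4.23_beta_2019-07-14' into its three parts."""
    number, channel, day = name.split("_")
    return Release(
        number=tuple(int(part) for part in number.split(".")),
        channel=channel,
        released=date.fromisoformat(day),
    )


beta = parse_release("2.0.4.23_beta_2019-07-14")
canary_actual = parse_release("2.0.4.22_canary_2019-06-30")
canary_hypothetical = parse_release("2.0.5.17_canary_2019-06-30")

# Actual numbering: the beta is newer by number AND by date, so a reader could
# assume it contains the canary fix (per this thread, it does not).
print(beta.number > canary_actual.number, beta.released > canary_actual.released)

# Hypothetical numbering: the number says the beta is behind the canary line,
# while the date still says it is newer.
print(beta.number > canary_hypothetical.number, beta.released > canary_hypothetical.released)
```

With the real names both comparisons agree and mislead together; with the hypothetical 2.0.5.x name they disagree, which at least signals that the two lines are separate.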

Above gets solved by the proposal requiring a number bump after Beta or Stable (not yet running) that might avoid getting numbers quite so mixed together (for those who look at numbers instead of dates). Experimental is currently like a pre-Beta which helps weed out update issues for infrequent updaters…

There have also been proposals around downgrade compatibility (database changes format sometimes in some Canary typically) being reflected in the version numbering, and also a proposal to admit defeat, and just do Nightly and Canary, but that seems to be a minority view. The problem is that extended user testing leading to Beta and Stable takes awhile, but it does offer a range of new-versus-better-proven… Nobody’s been keen on being fancier than a single GitHub master except maybe when releases occur.

Comments?

AimoE · December 7, 2019, 2:38pm

That’s an odd proposal. I, for one, would be very happy to find db version mentioned in release notes (it still never is), but I cannot see any reason to build up a database of feature data into version numbering. Version number and file names are a bad choice for implementing a database.

ts678 · December 7, 2019, 3:06pm

Building compatibility clues into version numbers is the primary focus of Semantic Versioning, however IMO it’s more important for things like libraries. Duplicati doesn’t have a published API, however it does depend on format consistency, especially of the database against the code that’s using the database…

This has not been proposed, and the name/number scheme could probably not contain sufficient detail. A hint of possible feature change can be seen in the release number, as one can see in number bumps which are now proposed to be after broad releases rather than at start of the ramp towards the release.

MaloW · December 7, 2019, 4:09pm

So I updated to version “2.0.4.34_canary_2019-11-05” and started a recreation of my local database again. After 5 hours it’s around 80% done, which it has been for the past 4 hours. Over these 5 hours another 1 TBW has been wasted on my SSD. So it seems nothing has changed compared to the previous version I was using, and the issue still persists.

MaloW · December 7, 2019, 4:24pm

Regarding versioning: I have no idea how your development process works, so I’m not sure if this will work for you, but in my experience “beta”, “canary” or “experimental” builds are generally not given their own versions. Instead they’re made as release candidates of a future stable release. For example, if the latest stable release is v2.0.4, then the latest beta version might be v2.0.5-rc1. This beta version might go through a bunch of revisions before it’s deemed stable enough to properly release. It might take until v2.0.5-rc37, and then a v2.0.5 is released with contents identical to v2.0.5-rc37, and the next beta build will be named v2.0.6-rc1. If you want nightly/experimental builds on top of that, give them a suffix as well, like v2.0.5-nb12347 (where 12347 is just a number incremented for each build).
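A minimal Python sketch of how tags in that style could be ordered follows. The tag grammar and the choice to sort nightly builds before release candidates of the same version are assumptions made for this example, not something Duplicati uses.

```python
import re
from typing import Tuple

# Assumed ordering of build kinds within one version: nightly builds first,
# then release candidates, then the final release (no suffix).
_KIND_RANK = {"nb": 0, "rc": 1, "": 2}
_PATTERN = re.compile(r"^v(\d+)\.(\d+)\.(\d+)(?:-(nb|rc)(\d+))?$")


def sort_key(tag: str) -> Tuple[int, int, int, int, int]:
    """Turn a tag like 'v2.0.5-rc37' into a tuple that sorts in release order."""
    match = _PATTERN.match(tag)
    if match is None:
        raise ValueError(f"unrecognised version tag: {tag!r}")
    major, minor, patch, kind, build = match.groups()
    return (
        int(major),
        int(minor),
        int(patch),
        _KIND_RANK[kind or ""],  # a missing suffix means a final release
        int(build or 0),
    )


tags = ["v2.0.6-rc1", "v2.0.5", "v2.0.5-rc37", "v2.0.4", "v2.0.5-rc1", "v2.0.5-nb12347"]
print(sorted(tags, key=sort_key))
```

The point of the scheme is that every pre-release sorts below the stable release it leads up to, so the number alone says which stable line a build belongs to.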

In terms of how to handle it branch-wise in Git I’ve had success using a system where you have 3 separate branches for Major, Minor and Patch releases. Everything first gets merged into the Major branch since you want all new features and changes to be included in the v3.0.0 release. Then part of the commits going there are cherry-picked onto the Minor branch, namely anything you wish to include in your v2.1.0 release. And then part of those commits are cherry-picked onto the Patch branch, namely anything you want to include in your v2.0.5 release.

Like I said, I don’t know if any of this will work for you guys, but that strategy is what I’ve had the most success with when it comes to versioning.

ts678 · December 7, 2019, 9:33pm

Thanks for the input. I’m not completely familiar with the automatic update code, but there’s quite a lot.
Channels describes that 2016 design in user terms, and we probably want to keep something like that.

Changing to something different from a version number plus channel is of unknown difficulty to me. I’m unsure how separate they are – accidental 2.0.4.3 Beta after 2.0.4.3 Canary was fixed by 2.0.4.5 Beta.

Whether or not build changes can fit into the scheme hasn’t specifically been discussed, but the use of multiple branches has been slightly discussed – and part of the challenge is a volunteer to cherry-pick. Currently the focus is on getting a long-awaited Beta out, then perhaps more process work is possible.

Issues to address to get out of beta pointed to some specific somewhat complex schemes. There was also a post pointing to Discussion: release cycle. There is disagreement over how hard it is to manage, so for the moment the one way to stabilize seems to be to slow down for awhile on less critical change.

If this means in your own work, where you’re quite aware of everything, that’s how Duplicati once was, however Duplicati has been growing into a multi-person project (with a too-tiny value of multi) for some time…
Stay tuned and we’ll see what happens. Sometimes good volunteers join and things become less tight.

MaloW · December 7, 2019, 11:49pm

This comes from my professional development experience; I’ve been on a couple of huge projects (500+ employees working on the same code-base, with multiple customers subscribing to different versions). Something similar to what I described is what I’ve seen work best, but like I said, that doesn’t mean it will work well for you guys. Cherry-picking commits down to different branches does suck ass, and I can imagine that, this being an open-source project, it might be hard to get people to take that role. My biggest recommendation for making that easier is to deprecate branches quickly, to reduce the number of branches you need to cherry-pick to. Going from a Major/Minor/Patch system to a Major/Minor system might help reduce the branches as well, if only 2 numbers in the version are enough for stable releases (1 for bug-fixes and 1 for larger changes).

drwtsn32 · December 8, 2019, 1:44am

If I were to guess, your database recreation is requiring dblock downloads. Can you check About → Show Log → Live → Verbose to see what is happening? If you are running 2.0.4.34 and it’s downloading dblocks, you should see something like ‘… processing blocklist volume X of Y’:

This will at least give you an idea of how many dblocks are needed, and help you estimate the time.
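As a rough illustration of that estimate, the Python sketch below pulls X and Y out of a log line and extrapolates the remaining time. The exact log wording is an assumption based on the fragment quoted above, the sample line and the seconds-per-volume figure are invented, and the linear extrapolation ignores that volumes can differ in size.

```python
import re
from typing import Optional, Tuple

# Assumed message shape, based on the '... processing blocklist volume X of Y'
# fragment quoted above; the real log line may be worded differently.
_VOLUME_RE = re.compile(r"processing blocklist volume (\d+) of (\d+)")


def parse_progress(line: str) -> Optional[Tuple[int, int]]:
    """Return (current, total) if the line reports dblock progress, else None."""
    match = _VOLUME_RE.search(line)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def estimate_remaining_hours(current: int, total: int, seconds_per_volume: float) -> float:
    """Linear extrapolation: remaining volumes times average time per volume."""
    remaining = max(total - current, 0)
    return remaining * seconds_per_volume / 3600


# Hypothetical sample line and timing, for illustration only.
sample = "Pass 1 of 3, processing blocklist volume 120 of 3400"
progress = parse_progress(sample)
if progress is not None:
    current, total = progress
    hours = estimate_remaining_hours(current, total, seconds_per_volume=9.0)
    print(f"{total - current} volumes left, roughly {hours:.1f} hours")
```

To get a usable seconds-per-volume value, one could time the gap between two consecutive log lines and average over a few of them.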

Normally dblocks should NOT be needed to recreate the database. I have experimented with a way to fix the issue of requiring dblocks, but it requires a functioning database first.

drwtsn32 · December 8, 2019, 1:48am

By the way, the best way to deal with the “unexpected difference” issue is NOT to do a database recreation, but to do a simple version deletion. Do you happen to have a copy of the database before you started the recreation? If so, I would cancel the recreation, restore your database file, and then delete the offending backup version. Once the error is gone, you won’t experience it again now that you’re on 2.0.4.34.

MaloW · December 8, 2019, 10:06am

I ended up deleting my entire backup and starting from scratch. I’m only backing up ~160 GB, so it only took an hour or so to create a completely new backup. Thanks for all the help though; hopefully I won’t be back in another 6 months with the same issue again, now that I’m running the Canary version.

ts678 · December 8, 2019, 1:59pm

This is often the best way out. I usually advise people not to have Duplicati be the only backup of long-term file history that they would really hate to lose. I used to get the “Unexpected difference” too often, on a production backup of a not-too-critical system, and abandon-and-restart was my usual approach.

“Unexpected difference in fileset” test case and code clue #3800 seems at least 100x better after a fix.

Empty source file can make Recreate download all dblock files fruitlessly with huge delay #3747 was a problem where “empty source file” and “fruitlessly” are the key points. Empty source files no longer produced anything on the remote, due to a design change, but the recreate code hadn’t been changed, so it looked everywhere for them. The fix took care of the bogus search, but a legitimate search can still occur as part of recreate design.

Looking everywhere is still possible if a block mentioned in a dlist file is actually missing, and there’s no option yet to say “just give up on those files”, which is basically what a version delete does (aiming with little precision), however a missing block can be in several versions because deduplication reuses them.
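Why one missing block can touch several versions is easier to see in a toy model. The Python sketch below is not Duplicati’s actual storage format: the block size, the hashing, and the `backup_version` and `versions_affected` helpers are simplified assumptions chosen only to show how deduplication shares blocks between versions.

```python
import hashlib
from typing import Dict, List, Set

BLOCK_SIZE = 4  # tiny on purpose; real block sizes are far larger


def split_blocks(data: bytes) -> List[bytes]:
    """Cut data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + BLOCK_SIZE] for i in range(0, len(data), BLOCK_SIZE)]


def backup_version(data: bytes, store: Dict[str, bytes]) -> List[str]:
    """Store only blocks not seen before; return the hash list for this version."""
    hashes = []
    for block in split_blocks(data):
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # deduplication: reuse an existing block
        hashes.append(digest)
    return hashes


def versions_affected(missing: str, versions: Dict[int, List[str]]) -> Set[int]:
    """Every version whose hash list references the missing block."""
    return {number for number, hashes in versions.items() if missing in hashes}


store: Dict[str, bytes] = {}
versions = {
    0: backup_version(b"AAAABBBBCCCC", store),
    1: backup_version(b"AAAABBBBDDDD", store),  # first two blocks reused
    2: backup_version(b"AAAAEEEEDDDD", store),  # first and last block reused
}

lost = hashlib.sha256(b"BBBB").hexdigest()
del store[lost]  # simulate a block that is missing at the destination
print(len(store), "blocks stored; versions hit:", sorted(versions_affected(lost, versions)))
```

In this toy model, deleting only version 0 would not help, because version 1 still references the same lost block.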

How the backup process works

github.com

duplicati/duplicati/blob/07c55ea9fcb480d79ca9e2e87eebb9bb8bc08bd5/Duplicati/Library/Main/Operation/RecreateDatabaseHandler.cs#L424-L429

          // We have now grabbed as much information as possible,
          // if we are still missing data, we must now fetch block files
          restoredb.FindMissingBlocklistHashes(hashsize, m_options.Blocksize, null);

          //We do this in three passes
          for(var i = 0; i < 3; i++)

If your progress bar gets into the 70% to 100% range, it’s fetching dblocks. The last 10% is the heaviest. Downloads and uploads both pass through --tempdir, and some SSD users point that to a RAM disk… Duplicati used to be bad about leaving temporary files behind (named dup-<GUID>) but it’s better now.

I sure hope so. You might want to monitor release notice feedback if you keep taking Canary releases, or just change Settings back to Beta (or Experimental as a pre-Beta) to avoid unexpected surprises…

ts678 · December 8, 2019, 2:07pm

Thanks for the experimenting, and thanks for picking up the thread after I missed the first of the two nearby posts and only responded to the one about process instead of the still-having-trouble one. If you suspect there’s a more elegant fix than version delete, I’d love to know, but I also hope this issue will end soon.