Database recreate performance

In addition to those, damage can be caused by software handling files wrong. That is more controllable.
Damage is also in the eye of the software. In one open issue, an interrupted compact got confused.
It deletes a dindex file (as it should) but gets interrupted before the commit, so the next run sees a “missing” file.
This is a fairly harmless nuisance, though. I’d like to know whether the software can currently lose a dindex file.

Another approach, in addition to speeding up the worst case, might be detecting pending recreate risk. Yesterday I was working on a Python script headed toward sanity-checking dblocks versus dindex files.
A step after detection might be Duplicati code that recreates any missing dindex files before it’s too late…
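
Not Duplicati code, but a minimal sketch of the kind of sanity check I mean, assuming unencrypted zip-format destination files in a local folder (the folder path and the vol/ naming convention inside dindex files are assumptions based on notes later in this thread):

```python
import zipfile
from pathlib import Path

# Hypothetical path to a local copy of the backup destination.
DEST = Path("/backups/duplicati-dest")

dblocks = {p.name for p in DEST.glob("*.dblock.zip")}

# Each dindex is assumed to contain a vol/<dblock-name> entry for the
# dblock file it describes, so collect every dblock name referenced.
indexed = set()
for dindex in DEST.glob("*.dindex.zip"):
    with zipfile.ZipFile(dindex) as z:
        indexed.update(
            name.split("/", 1)[1]
            for name in z.namelist()
            if name.startswith("vol/")
        )

missing = sorted(dblocks - indexed)
for name in missing:
    print(f"No dindex describes {name}")
print(f"{len(missing)} of {len(dblocks)} dblock files lack a dindex")
```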

No argument with any of that. Just wanting to say this may be difficult, just as fixing bugs is difficult…

How To Corrupt An SQLite Database File

If an application crash, or an operating-system crash, or even a power failure occurs in the middle of a transaction, the partially written transaction should be automatically rolled back the next time the database file is accessed.

SQLite is Transactional

The claim of the previous paragraph is extensively checked in the SQLite regression test suite using a special test harness that simulates the effects on a database file of operating system crashes and power failures.

is SQLite’s claim, anyway. I know some systems must have something in them that goes wrong, as there are times when SQLite flat out can’t open its database – a step worse than bad data inside.

It’s not that it’s unmaintainable. It’s that it’s a lot, and apparently too much. Things sit for many years. For a backup application where stability is vital, it’s too slow at fixing things. The only two ways around that are more time spent fixing things and slimming it down.

If you’re fine with various things being broken for years, then it’s fine. I’m also fine with it, as I’m not experiencing any issues at the moment, and the issues I know about I know how to avoid.

But I’d still instantly axe a bunch of things to make it easier to maintain. I do that with my own projects where I need to.

Of course, recreate might be worth keeping, assuming it receives enough improvements. It’s a valid viewpoint to say it’s necessary. Personally, I’d never wait more than a day on it. I’d find another way of doing things, or focus on its performance until it can be made fast enough to be happy with.

I would not send PRs to fix them if I were fine with it.
It’s not right, these days, to have dog-slow performance while recreating the DB for a data size of 500 GB. In my tests I have seen performance begin to degrade at 1.2 M blocks (equivalent to 120 GB of data with the standard block size). Having to change the block size at 10 TB of data or more could be expected: many users would ask themselves whether backing up 10 TB needs special consideration before starting to configure a backup. Not many will with 500 GB.

Try not backing up the database. APFS local snapshots are probably invisibly low-level.
Windows NTFS ones are that way at least, but they do cause a brief suspension of I/O.

So the situation is just what the message says. It’s not really corruption, just unfinished work.
The way to avoid this is to avoid intentional interruption. That may be hard to do with it running so slowly…

Is the database on that drive? If so, don’t do that if the drive is prone to being unplugged.
If the database is on the permanent drive, can a source disconnect reliably break a test DB?
How exactly is it messaged? Source drive errors are usually caught and just produce a warning.

Agree with @gpatel-fr’s question. Maybe you’re thinking of the Options screen’s Remote volume size.

[screenshot: Options screen, Remote volume size]

There’s a link on the GUI screen there to click, or the direct link is Choosing sizes in Duplicati.
Remote volume size is a simpler-but-still-quite-confusing term for the dblock-size option.

It’s got to be from something. I don’t know macOS much, so I can’t give step-by-step advice.
A Google search for troubleshooting macOS performance claims 43 million hits, though.

This is interesting. For me, it reliably stops after finishing what it’s doing and uploading data.
I’m on Windows. Any one of the Duplicati folks want to see if they can reproduce this issue?

It’s pretty clear some resources are being exhausted, so keep looking for what that could be.
Do you know how to measure the size of processes, such as the mono ones running Duplicati?
One Linux user was observing memory growth, although we don’t know exactly which operation.

A little too vague, though I have some guesses of similar ones.

looks like (on mine) the same as the Started to Completed time I posted. It still leaves open the
question of whether there’s some other work such as decryption that might be pushing time up.

2022-10-17 16:52:39 -04 - [Profiling-Timer.Finished-Duplicati.Library.Main.BackendManager-RemoteOperationGet]: RemoteOperationGet took 0:00:00:06.617

If that’s true of the external drive, then how a source error damages a DB gets more mysterious,
although the definition of “corrupted” DB being used here isn’t the usual one.

as does disk congestion, which is why I’ve been asking. There’s a CLI-based test I described too.

Make sure you understand my comment on CPU cores, but if normal use means not straining it with a
database recreate or the like, are you saying that slows things down even though all the monitors look fine?

If you somehow back it up during a backup, it goes instantly obsolete, as it’s changing during the backup.
If you back it up while idle and restore a stale copy later, it mismatches, and Repair hurts the destination.
Database and destination must always match, e.g. you can copy it with a Duplicati script after the backup.
Configuration data in Duplicati-server.sqlite is less active, but usually you Export and save the config.
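
As one illustration, here is a minimal sketch of such a copy step, assuming it is hooked in through Duplicati’s run-script-after option; the paths are placeholders, not real ones:

```python
#!/usr/bin/env python3
"""Copy the job database to another disk right after a backup finishes.

Intended to be called via Duplicati's run-script-after option; both
paths below are placeholders for illustration only.
"""
import shutil
from pathlib import Path

JOB_DB = Path("/home/user/.config/Duplicati/ABCDEFGHIJ.sqlite")  # example name
DEST_DIR = Path("/mnt/other-disk/duplicati-db-copies")

DEST_DIR.mkdir(parents=True, exist_ok=True)
shutil.copy2(JOB_DB, DEST_DIR / JOB_DB.name)
print(f"Copied {JOB_DB} to {DEST_DIR}")
```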

To soften the impact a little, if space allows and the old backup is still intact, save it in case you need older files.
A newer one, with a better blocksize and who knows what other old problems removed, may work better.
Hidden damage is sometimes possible and may reveal itself in a few ways, e.g. during dblock downloading.

I can think of an ISP action that would be a lot worse than that. It’s nice that yours wasn’t very upset…

If the block size is at the default value, that is, 100 KB, a total data size of 6 TB would mean 60 million blocks to manage. Database size could be, well, maybe 30 GB? In the abstract, such a database size could work all right with simple, optimized queries. But with the current queries it’s way too much. Recreating the backup with a block size of 5 MB would reduce the block count to about 1.2 million, and that would be much more manageable.
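
The block counts above are just arithmetic, e.g. this rough calculation (order-of-magnitude figures; real counts also depend on deduplication and blocklist overhead):

```python
# Rough block-count arithmetic for a few blocksize choices.
def block_count(total_bytes: int, blocksize_bytes: int) -> int:
    return -(-total_bytes // blocksize_bytes)  # ceiling division

KB, MB, TB = 1000, 1000**2, 1000**4

for label, bs in [("100 KB (default)", 100 * KB), ("1 MB", MB), ("5 MB", 5 * MB)]:
    print(f"6 TB at {label:>17}: {block_count(6 * TB, bs):>12,} blocks")
```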

@StevenKSanford : open an SQLite DB browser, open the job database, and enter
select count(*) from block
If it’s over 2 million, you have to consider raising the block size.
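
If you prefer a script to a GUI browser, a small sketch using Python’s built-in sqlite3 module runs the same query; the database path is only an example, and it’s best pointed at a copy of the job database while Duplicati is idle:

```python
import sqlite3

# Example path only; your job database has a random-looking name.
db_path = "/home/user/.config/Duplicati/ABCDEFGHIJ.sqlite"

con = sqlite3.connect(db_path)
(blocks,) = con.execute("SELECT COUNT(*) FROM Block").fetchone()
con.close()

print(f"{blocks:,} blocks")
if blocks > 2_000_000:
    print("Consider a larger blocksize (requires recreating the backup).")
```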

For example DB Browser for SQLite, or any one you like.
While overloading Duplicati is known but hard to fix, how
exactly that leads to a whole-system slowdown is not clear.

Another approach might be to measure the I/O done by mono, which maybe doesn’t actually hit the SSD.
OS caching can do that, but keeping macOS busy handling I/O requests might slow other programs.

It’s nice (in an odd way) to see your idea of the rough limit is the same one I got. I think I tested with increasingly small blocksize, and at some point it spent all its energy endlessly reading etilqs files, making me wonder if making some cache larger might avoid that – same idea as you’ve been trying.
Watching file activity on Windows would be with Process Monitor. I suppose Linux could run strace.

On another topic, I might test disconnecting a USB-attached SD while backing up to see how it goes.

That’s exactly what I did. I did not want to mess with hundreds of gigabytes to do tests, so I set block size and dblock size to default / 10.


The test with the USB SD card was to do a first backup with one file to get at least one dlist up.
Next I added a folder, began a backup, pulled the USB, and got a yellow popup at the bottom that said:

[screenshot: warning popup]

The log truncates that list and only shows the first line (I opened an issue asking for more), e.g.

2023-02-18 18:20:57 -05 - [Warning-Duplicati.Library.Main.Operation.Backup.FileBlockProcessor.FileEntry-PathProcessingFailed]: Failed to process path: I: …

The profiling log that I run gives the important second line, with an exception summary, e.g.

System.IO.DirectoryNotFoundException: Could not find a part of the path '\?\I: …

The live log could have found that too. One can click for details, but I’d prefer a better regular log.

The Verify files button has no complaints, so at first glance I’m not sure I can reproduce the bug.
The next backup starts just fine and gets 86 warnings when I kill it prematurely by disconnecting.
If macOS is different, I can’t run it, but I’d still like test-backup steps from @StevenKSanford

@gpatel-fr

“blocksize” in my reply is “--dblock-size”, which I set using the GUI config screen #5 to 1 GB.

My local SQLite DB in the ~/.config/Duplicati folder, for the recreate that just failed (froze my system), is currently 13 GB. This is the largest DB in that directory, although I have several more that are nearly as large.

So your block size is still at the default value of 100 KB. That’s way too low for good performance currently. You should raise it to 5 MB and recreate your backups.

Seems to me that raising the default -blocksize to a larger value could be a good thing moving forward; I’m thinking 100 KB just isn’t enough for the amounts of data people are now backing up.

@gpatel-fr Thanks for the effort on a fix.

This is in the developer section, so I’ll toss a bit more onto the fire…

1- As mentioned above, changing the default -blocksize to a larger value (YTBD) is likely a simple change that could be implemented in the next release (I know, also YTBD). Beyond recompiling with the new default there shouldn’t be anything else required: no rewriting of code, no conflicts. It’s just a default change that would likely prevent a lot of user frustration moving forward, as this sort of thing seems to be coming up more and more often.

If that’s not in the books for the foreseeable future, could we create a “hot fix” of sorts, something users can run post-install to adjust the -blocksize setting to a larger default value? This leads to item 3, but stop by item 2 first.

2- Change the wording in the Duplicati documentation around -blocksize to emphasize how it should be scaled up based on data size. Maybe a note like: if you’re backing up more than 0.5 TB, you absolutely should increase the value to “xyz KB / xyz MB” or more. I started rewriting the manual page on the topic but lost the edit in a stray reboot; I’ve been trying to get back to it, but things have been busy as of late.

3- Create a routine that can change the -blocksize for an existing backup set. Now, I haven’t really looked into this all that hard, but I know there are many threads around here on the topic, and at the end of the day it can’t currently be done. But holy moly would it be nice.

I get that the whole DB and backup set would probably have to be recreated and that storage would likely increase following the process. I also expect the process would require at least 2x the storage space in the first place to make it even close to safe, but I don’t think that’s out of reach for that many users. I think this should be a local process rather than processing directly against a remote destination.

Storage is cheap and time is money… An external 12 TB USB3 drive is only a few hundred bucks and getting cheaper by the day. If the process has to download the existing backup set to a sufficiently large local drive, convert/verify it, remove the old one (with an option to keep either version locally), purge the old backup set, then upload the new set, that doesn’t seem like the worst thing. You can at least reuse the external drive afterwards to create an additional local backup.

I’m by no means saying that’s a simple set of tasks to complete, but there has to be a way it could be done, and if so it would probably really help with user retention.

If the user’s backup set is too large to be converted locally, then they would need to create a new backup with a “better” -blocksize value. That brings up another thing: in a changing data set, the “best” -blocksize could change next week, or most certainly after a few years. Chances are your data is going to grow, not shrink, so being able to move up to a new -blocksize at some point seems like a really valuable feature.

Sure, if you’d rather just make a new backup set then by all means, but I really think that option only appeals to a very small percentage of users when faced with that reality. One of Duplicati’s best features, in my mind, is its versioning, and to lose years of versions would be a huge hit for many, if not most.

I’ll do some more reading on the subject…

4- I’ll try to get some tests set up to see if I can catch TM and Duplicati clashing.


re: --blocksize

@gpatel-fr

Thanks for the clarifications on blocksize vs dblock-size. The backup I’m currently repairing only had two versions, so I’ve deleted it and am recreating with a 5 MB blocksize and 1 GB dblock-size. Let’s see how this works.

@JimboJones

Thanks for your thoughts on hash blocksize. Being stuck with an initial blocksize for years of backups is, as you rightly pointed out, undesirable. I really like your idea for a utility to re-blocksize the backups without losing all of one’s versions.

I’d love to contribute to the coding efforts, but my programming skills were built in the 1970s-1980s. COBOL anyone? :smiley:

Maybe here:

I’d agree, although there might still be improvements possible for some areas such as recreate.
If, for example, it’s doing anything (e.g. SQL) per-block, that will really add up with many blocks.
@gpatel-fr has likely looked at that more than I have, in addition to being better at C# and SQL.

That would simplify it, if it winds up being a standalone. A less complex rewriter is described at:

Re encrypt remote back end files (referencing a Python script from forum user @Wim_Jansen)

and it also gets into more of the little details of the files. The first thing to do might be to open a few.
The format doesn’t seem super complicated. A file is either a block or a blocklist that lists its blocks.
Both cases identify a block by its SHA-256 hash, sometimes directly, sometimes after a Base64 encoding.

The challenge is that a hash is one-way, so the hash of (say) 10 blocks in sequence can’t be known
unless the blocks are around. Unfortunately they’re packed into dblock files. That can be undone,
but it means more temporary storage space while repacking the data into larger blocks per the spec.
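
For example, a block identifier could be computed from raw block bytes roughly like this, assuming the Base64-of-SHA-256 convention just described (the exact Base64 variant used in the file names is worth confirming against a real dblock):

```python
import base64
import hashlib

def block_id(data: bytes) -> str:
    """Base64 encoding of the SHA-256 digest of the block contents."""
    return base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")

# Arbitrary example bytes standing in for one block of file data.
print(block_id(b"example block contents"))
```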

Following the idea of opening a few files, unzipping a dblock just drops all its blocks right there.
The dindex file has a vol folder with a file of JSON data describing the contents of its dblock file.
There might also be a list folder holding redundant copies of blocklists that are also in its big dblock file.
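
A tiny sketch of peeking at an unencrypted dindex along those lines (the file name is made up, and the vol/ and list/ layout is assumed to be as described above):

```python
import json
import zipfile

# Made-up example name; real ones look like duplicati-i<hex>.dindex.zip
DINDEX = "duplicati-i0123456789abcdef.dindex.zip"

with zipfile.ZipFile(DINDEX) as z:
    for name in z.namelist():
        if name.startswith("vol/"):
            # JSON describing the contents of the dblock named in the entry.
            doc = json.loads(z.read(name))
            keys = sorted(doc) if isinstance(doc, dict) else type(doc).__name__
            print(f"{name}: top-level JSON keys {keys}")
        elif name.startswith("list/"):
            print(f"{name}: redundant copy of a blocklist")
```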

EDIT:

Duplicati.CommandLine.RecoveryTool.exe is another rewriter, in C#, working on recompression.
Duplicati.CommandLine.BackendTool.exe might possibly be the basis for a file uploader where it
matters that the uploaded files wind up looking like Duplicati made them. Google Drive likes that.

Not knowing the code at all, this may be impossible:

Could “blocksize” be made an attribute of the backup version? That way, one could change the blocksize going forward, like “dblock-size” can be changed. I imagine the local database might have to keep separate tables for each version, though, unless it’s already doing that now.

That would open up an opportunity to re-blocksize a single version at a time, which would use local temp space to download the files, but not all the versions at once?

Infeasible would be a better word. Given unlimited resources, lots of things would become possible.

It’s not. A fixed block size is quite firmly embedded. Short blocks (e.g. at the end of a file) are of course OK.

Files of a backup version are mostly older blocks (only changes are uploaded), so the concept doesn’t fit.
There might be some similar way to do this, but developer resources are too scarce. Any volunteers?