DB Query Performance Testing, Fixes, and Maintainability

I've realized that my support post (Identified another slow query during backup) has expanded past the Support tag it's under, so I'm making this post as a working thread for some of the things I brought up.

I wanted to bring up a couple of things I noticed while fixing my query performance issues, and to open a thread for discussing general DB issues.

Indexes -
I identified some indexes that solved most of the query performance issues I was experiencing on a large, many-file backup with blocksize left at the default.

CREATE INDEX "nnc_Metadataset" ON "Metadataset" ("ID", "BlocksetID")
CREATE INDEX "nn_FilesetentryFile" ON "FilesetEntry" ("FilesetID", "FileID")

-- Existing temporary indexes at LocalBackupDatabase.cs lines 602 & 603:
-- CREATE INDEX "tmpName1" ON "{0}" ("Path")
-- CREATE INDEX "tmpName2" ON "{0}" ("Path")

CREATE INDEX "nn_FileLookup_BlockMeta" ON "FileLookup" ("BlocksetID", "MetadataID")

CREATE INDEX "nnc_BlocksetEntry" ON "BlocksetEntry" ("Index", "BlocksetID", "BlockID")

I will create a PR once I figure out how to integrate them into the code.
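As a sanity check for index proposals like these, `EXPLAIN QUERY PLAN` shows whether SQLite actually picks an index up. A minimal sketch using Python's built-in sqlite3, with a simplified stand-in for the FilesetEntry table (the real schema has more columns):

```python
import sqlite3

# Simplified stand-in schema; the real FilesetEntry table has more columns.
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE "FilesetEntry" ("FilesetID" INTEGER, "FileID" INTEGER, "Lastmodified" INTEGER)')
con.execute('CREATE INDEX "nn_FilesetentryFile" ON "FilesetEntry" ("FilesetID", "FileID")')

# The plan's detail column should mention the index (here it is even covering,
# since both referenced columns are in the index).
plan = con.execute(
    'EXPLAIN QUERY PLAN SELECT "FileID" FROM "FilesetEntry" WHERE "FilesetID" = 1'
).fetchall()
print(plan[0][3])
```

Running the same check against the real queries from the profiling log is a quick way to confirm each added index earns its keep.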

Implicit Joins-
One of the main things I noticed while delving into these queries was the implicit joins.
They make the queries hard to read and understand, and they can expose the code base to accidental cross joins. Basically, it means lower long-term maintainability.

If I made a PR with all of the implicit joins made explicit would that be something likely to be merged?
For all the queries I touched to fix my performance issues, I already had to do the conversion just to read them properly, so it isn't a lot of work.
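For anyone following along, the conversion itself is mechanical. A small sketch with made-up tables, showing that the comma join and the explicit JOIN return the same rows on an inner join:

```python
import sqlite3

# Hypothetical tables, loosely modeled on the kind of schema under discussion.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE "File" ("ID" INTEGER PRIMARY KEY, "BlocksetID" INTEGER);
CREATE TABLE "Blockset" ("ID" INTEGER PRIMARY KEY, "Length" INTEGER);
INSERT INTO "File" VALUES (1, 10), (2, 20), (3, 99);
INSERT INTO "Blockset" VALUES (10, 1024), (20, 2048);
""")

# Implicit (SQL-89 comma) join: the join condition hides in the WHERE clause.
implicit = con.execute("""
    SELECT "File"."ID", "Blockset"."Length"
    FROM "File", "Blockset"
    WHERE "File"."BlocksetID" = "Blockset"."ID"
""").fetchall()

# Explicit (SQL-92) join: the same condition, stated where it belongs.
explicit = con.execute("""
    SELECT "File"."ID", "Blockset"."Length"
    FROM "File"
    JOIN "Blockset" ON "File"."BlocksetID" = "Blockset"."ID"
""").fetchall()

assert sorted(implicit) == sorted(explicit)
```

The explicit form also makes it obvious when a join condition is missing, which is exactly the accidental-cross-join risk mentioned above.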

General-
Please share any particular DB pain points that I should look at. I am new to Duplicati, so I don't really know the current state of things. But I am comfortable with all the languages in the code base and have strong SQL and RDBMS maintenance/performance-tuning skills. I will check out the issues on GitHub after I make my PR, but if you feel there are any high-priority ones, link them here.


PR for the indexes -


This is great! Indexes have been added before, but you’re clearly getting improvements.

The sqlite3_analyzer.exe utility program did show the indexes using a lot of space on my DB.
That’s probably not generally run, so it might be useful to see which indexes are worth it.
That’s easier said than done though, because it may depend on the individual database.

Although I don't do much SQL (and am not actually a Duplicati developer or pull request merger),
I guess these are what are also called comma joins, and they seem quite out of favor, from what I can find.
Unfortunately I didn't ask the original author why they were used; the author is less available lately.

The Joins section of the SQLite Query Optimizer Overview says (and I can't see in the changelog that it's new):

Thus with SQLite, there is no computational advantage to use the newer SQL92 join syntax over the older SQL89 comma-join syntax. They both end up accomplishing exactly the same thing on inner joins.

One warning on any conversions (which you likely know, but I didn’t) is that precedence is different.
Beyond that, I’m all in favor of SQL that is easier to read and maintain well, maybe by non-experts, although I hope at some point an expert (perhaps you?) becomes a regular. The need is regular…

My personal way of trying to understand queries is to copy an actual query out of a profiling log and reformat it at poorsql.com, which changes the usual run-together line into something more readable.
I hoped the indented formatting would also help me understand the evaluation order, groupings, etc.

Another way to understand queries was to follow the source and paste things together, as the code does.
I guess it’s helpful to have somewhat mnemonic names for subparts, but it’s quite a different “view”.
I’ll also pick text out of a profiling log, and try a source code search to try to find where SQL is from.

Although I don’t expect instant solutions to all this, I’ll nominate recreate, scaling, and transactions.

DB recreate is a big one. I did one today and watched lots of individual INSERT operations go by…
My understanding of SQLite performance is that fewer but larger operations sees better throughput.
There might be some limits beyond which it just collapses from size, as opposed to chugging along.
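To illustrate the "fewer but larger operations" point, a minimal sketch of batching many inserts into one transaction with Python's sqlite3 (table name is just an example):

```python
import sqlite3

# Example table standing in for any bulk-loaded Duplicati table.
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE "Block" ("Hash" TEXT, "Size" INTEGER)')

rows = [("hash%d" % i, 1024) for i in range(10_000)]

# One transaction around one executemany: on a file-backed database SQLite
# only has to sync the journal once at commit, rather than once per row as
# per-statement autocommit would.
with con:  # opens a transaction, commits on clean exit
    con.executemany('INSERT INTO "Block" VALUES (?, ?)', rows)

count = con.execute('SELECT COUNT(*) FROM "Block"').fetchone()[0]
```

The difference between row-at-a-time autocommit and one big transaction is typically orders of magnitude on spinning disks, which is why the recreate's insert pattern matters so much.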

Scaling up in general has been somewhat troublesome. The only tool we have now is the blocksize.

Transaction design concerns me. I haven’t seen a design document on high level approach, haven’t researched in the code, and have neither the time nor the C#, .NET Framework, and SQL skills for it.

There are times when one wants to commit just one table, but all one can do is commit every table, possibly interfering with the carefully laid-out commit plan that some other code thread had intended.

This would probably not be a performance issue, but some functional error, e.g. if Duplicati got killed.
A nearly-ideal end goal would be to be able to kill Duplicati anytime, and have it recover next backup.
Frequently it can, but sometimes it cannot. This is a rich area for test and fix, if resources are around.


Had some time to delve into the DB recreate, as my recreate is getting stuck on a specific query. I noticed what you were talking about with the many inserts. They are actually done in one transaction; the whole database recreate is one transaction, actually. Not sure how I feel about that.

Some improvements I have identified and am still testing/implementing on my custom beta version -

The heaviest queries are the two inserts that occur in LocalRecreateDatabase.cs lines 208-255. I don't see a good way to rewrite these queries without changing how the data is stored, and I am not familiar enough with the database design and underlying concepts to see if there is a better way to store the data.

Also, all the selects that occur to check whether a block needs to be updated (LocalRecreateDatabase.cs line 382) could probably be batch-selected, stored app-side, and then compared. Or we could create a temp table, insert all pending blocks to be checked, and run one query that returns only the updates we need to do. The selects take more time than the inserts for my specific database restore (pointed at a healthy backup); the majority of the database rebuild is spent on these selects.
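A small sketch of the temp-table variant, with guessed table and column names: load the pending hashes once, then one set-based query returns only the blocks that actually need updating, instead of one SELECT per block:

```python
import sqlite3

# Guessed schema: VolumeID = -1 marks a block not yet assigned to a volume.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE "Block" ("Hash" TEXT PRIMARY KEY, "Size" INTEGER, "VolumeID" INTEGER);
INSERT INTO "Block" VALUES ('a', 100, -1), ('b', 200, 7), ('c', 300, -1);
CREATE TEMP TABLE "PendingBlock" ("Hash" TEXT PRIMARY KEY);
""")

# One batched load of everything we would otherwise probe row by row.
con.executemany('INSERT INTO "PendingBlock" VALUES (?)', [('a',), ('b',)])

# One set-based query replaces N per-block SELECTs.
needs_update = con.execute("""
    SELECT b."Hash"
    FROM "PendingBlock" AS p
    JOIN "Block" AS b ON b."Hash" = p."Hash"
    WHERE b."VolumeID" = -1
""").fetchall()
```

Here only block 'a' comes back: 'b' is already assigned and 'c' was never pending. The per-block round trips are where the 36 million selects would collapse into a handful of queries.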

This also brings up that we could see much better insert performance if we dropped the indexes before the large batches of inserts and recreated them after, though that requires us to do the selects needed for the inserts beforehand. With my current setup I am getting 200 inserts/second. We will see how batch inserts affect this rate.
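A minimal sketch of the drop-then-rebuild pattern (index and table names taken from the earlier list, data is synthetic). Building an index once over the finished table is generally cheaper than maintaining it incrementally across millions of inserts:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE "BlocksetEntry" ("BlocksetID" INTEGER, "Index" INTEGER, "BlockID" INTEGER)')
con.execute('CREATE INDEX "nnc_BlocksetEntry" ON "BlocksetEntry" ("Index", "BlocksetID", "BlockID")')

rows = [(i // 10, i % 10, i) for i in range(50_000)]

# Drop the index so the bulk load pays no per-row index maintenance...
con.execute('DROP INDEX "nnc_BlocksetEntry"')
with con:
    con.executemany('INSERT INTO "BlocksetEntry" VALUES (?, ?, ?)', rows)
# ...then rebuild it in one pass over the completed table.
con.execute('CREATE INDEX "nnc_BlocksetEntry" ON "BlocksetEntry" ("Index", "BlocksetID", "BlockID")')

count = con.execute('SELECT COUNT(*) FROM "BlocksetEntry"').fetchone()[0]
```

The caveat noted above still applies: any selects that relied on the dropped index have to happen before the drop, or they turn into full table scans mid-load.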

To give some perspective on the inserts vs selects -
This is my log file (profiling level) scrollbar, red is inserts -

It is currently 36,550,333 lines long. Only 424,264 inserts, 36,197,941 selects (math doesn’t add up because I was dealing with live numbers constantly updating, point still stands).

I think a hybrid in-memory work database, whose results we persist to the on-disk database, is probably the most effective method. Let me know what you think.
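A rough sketch of how the hybrid could work using SQLite's backup API, which Python's sqlite3 exposes: do the heavy rebuild in a `:memory:` database, then persist it to disk in one pass (table name and columns are just examples):

```python
import os
import sqlite3
import tempfile

# Heavy work happens in a pure in-memory database: no journal, no fsync.
mem = sqlite3.connect(":memory:")
mem.execute('CREATE TABLE "Block" ("Hash" TEXT PRIMARY KEY, "Size" INTEGER)')
with mem:
    mem.executemany('INSERT INTO "Block" VALUES (?, ?)',
                    [("h%d" % i, 1024) for i in range(1000)])

# One sequential pass persists the finished result to the on-disk database.
path = os.path.join(tempfile.mkdtemp(), "rebuilt.sqlite")
disk = sqlite3.connect(path)
mem.backup(disk)  # SQLite online backup API: copies the whole database

disk_count = disk.execute('SELECT COUNT(*) FROM "Block"').fetchone()[0]
```

The obvious trade-off is memory footprint versus crash recovery: an interrupted in-memory rebuild loses everything, which ties back into the consistency concerns above.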

I'm not enough of a DB expert to comment; however, one thing I worry about is my generally limited understanding of the underlying design philosophy of keeping relationships consistent despite whatever catastrophes (e.g. hard kills) may occur. The need for consistency (or at least repairability) also extends externally, especially for ongoing backups. Maybe it's less so for things like DB recreate that can't be resumed.

Processing IIRC does dlist files first, so dangling block references will exist until the dindex files are read.
Perhaps this order suits Direct restore from backup files best (where one picks the version initially)?
Sorry I’m not familiar with the code internals. Most of my experience has been external observation.

Write-Ahead Logging and Enable write-ahead logging and memory mapped IO #4612 may fit in too, especially when talking about performance, and potentially temporary file sizes during DB recreate.
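For reference, enabling write-ahead logging is a one-line PRAGMA on a file-backed database (it does not apply to `:memory:` databases); a minimal sketch, with the mmap size chosen arbitrarily for illustration:

```python
import os
import sqlite3
import tempfile

# WAL requires a real file; :memory: databases report journal_mode 'memory'.
path = os.path.join(tempfile.mkdtemp(), "test.sqlite")
con = sqlite3.connect(path)

# PRAGMA journal_mode returns the mode actually in effect.
mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# Memory-mapped I/O (the other half of issue #4612) is also a PRAGMA;
# 256 MiB here is an arbitrary example value, not a recommendation.
con.execute("PRAGMA mmap_size=268435456")
```

WAL mainly helps when readers and writers overlap, which does happen with Duplicati's concurrent processing, so it fits this discussion even though it is not a cure-all for the recreate cost.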

In terms of external files that don’t have a transaction model, there’s a state-tracking mechanism to hopefully make it possible to clean up whatever sometimes-partially-done work is at the destination.
Some of this cleanup, though, relies on the database which in theory has some aids to transactions.

Question is, did we use transactions correctly, especially for the concurrent processing that occurs?
Sometimes one thread must flush something it wants out, but does that break the plans elsewhere?

One challenge that Duplicati faces on memory is the extremely diverse size of systems that it runs on.
Maybe a challenging case would be a large backup running on a NAS, perhaps with Duplicati Docker (because Synology as of DSM 7 no longer will install the current version – Docker is the workaround).

An excellent person to bring into this would be the hard-to-find original author who also tried a rewrite. Unfortunately, it never emerged. I wonder if you would be a suitable person to try to pick up the effort?
It seems a huge waste to discard something possibly so near at least some initial level of “completed”.

The original author still seems short on time, though, but as someone who's done a lot, you deserve an ask. Your willingness to dig in is fantastic. Maybe you'll be invited to join the staff (which is quite thin right now).

@kenkendk what do you think about this seemingly very relevant topic, and also recreate questions?

Posted a targeted question in the dedicated recreate-DB GitHub issue:


Yeah, I understand that limitation; we could probably have multiple options: a high-memory recreate and a low-memory recreate. Doing that in code in a maintainable way would be the challenge.

Adding this note here for future reference:

To improve performance on the block table, we need to convert it to a WITHOUT ROWID table and add a clustered index. This is similar to how the BlocksetEntry and FilesetEntry tables are configured. This will require us to add code to handle Block.ID throughout the code base and to adjust queries to take advantage of the clustered index.
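A sketch of what that could look like, with guessed columns: in a WITHOUT ROWID table the primary key is the clustered index, so lookups on it are a single B-tree search with no rowid indirection:

```python
import sqlite3

# Guessed shape for the block table; the real one keys differently and
# currently carries an ID column that this change would remove.
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE "Block" (
    "Hash" TEXT NOT NULL,
    "Size" INTEGER NOT NULL,
    "VolumeID" INTEGER NOT NULL,
    PRIMARY KEY ("Hash", "Size")
) WITHOUT ROWID
""")
con.execute('INSERT INTO "Block" VALUES (?, ?, ?)', ("abc", 1024, 1))

# The plan shows a direct search on the clustered primary key.
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT "VolumeID" FROM "Block" WHERE "Hash" = 'abc' AND "Size" = 1024
""").fetchall()
print(plan[0][3])
```

The flip side, as noted, is that every piece of code that joins on the old integer ID has to switch to the new key, which is why this is a cross-cutting change rather than a local one.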