DB Query Performance Testing, Fixes, and Maintainability

I've realized that my support post (Identified another slow query during backup) has expanded past the Support tag it's under, so I'm making this post as a working thread for some of the things I brought up.

I wanted to bring up a couple of things I noticed while fixing my query performance issues, and to open a thread for discussing general DB issues.

Indexes -
I identified some indexes that solved most of the query performance issues I was experiencing on a large, many-file backup with blocksize left at the default.

CREATE INDEX "nnc_Metadataset" ON "Metadataset" ("ID", "BlocksetID")
CREATE INDEX "nn_FilesetentryFile" ON "FilesetEntry" ("FilesetID", "FileID")

-- Existing temporary indexes at LocalBackupDatabase.cs lines 602 & 603:
-- CREATE INDEX "tmpName1" ON "{0}" ("Path")
-- CREATE INDEX "tmpName2" ON "{0}" ("Path")

CREATE INDEX "nn_FileLookup_BlockMeta" ON "FileLookup" ("BlocksetID", "MetadataID")

CREATE INDEX "nnc_BlocksetEntry" ON "BlocksetEntry" ("Index", "BlocksetID", "BlockID")

I will create a PR once I figure out how to integrate them into the code.
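As a sanity check for index proposals like these, `EXPLAIN QUERY PLAN` shows whether SQLite actually picks an index up. A minimal sketch using Python's built-in sqlite3, with a simplified stand-in for the FilesetEntry table (the real schema has more columns):

```python
import sqlite3

# Simplified stand-in schema; the real FilesetEntry table has more columns.
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE "FilesetEntry" ("FilesetID" INTEGER, "FileID" INTEGER, "Lastmodified" INTEGER)')
con.execute('CREATE INDEX "nn_FilesetentryFile" ON "FilesetEntry" ("FilesetID", "FileID")')

# The plan's detail column should mention the index (here it is even covering,
# since both referenced columns are in the index).
plan = con.execute(
    'EXPLAIN QUERY PLAN SELECT "FileID" FROM "FilesetEntry" WHERE "FilesetID" = 1'
).fetchall()
print(plan[0][3])
```

Running the same check against the real queries from the profiling log is a quick way to confirm each added index earns its keep.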

Implicit Joins-
One of the main things I noticed while delving into these queries was the implicit joins.
They make the queries hard to read and understand, and they can expose the code base to accidental cross joins. Basically, it means lower long-term maintainability.

If I made a PR with all of the implicit joins made explicit would that be something likely to be merged?
For all the queries I touched to fix my performance issues, I already had to do the conversion just to read them properly, so it isn't a lot of work.
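For anyone following along, the conversion itself is mechanical. A small sketch with made-up tables, showing that the comma join and the explicit JOIN return the same rows on an inner join:

```python
import sqlite3

# Hypothetical tables, loosely modeled on the kind of schema under discussion.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE "File" ("ID" INTEGER PRIMARY KEY, "BlocksetID" INTEGER);
CREATE TABLE "Blockset" ("ID" INTEGER PRIMARY KEY, "Length" INTEGER);
INSERT INTO "File" VALUES (1, 10), (2, 20), (3, 99);
INSERT INTO "Blockset" VALUES (10, 1024), (20, 2048);
""")

# Implicit (SQL-89 comma) join: the join condition hides in the WHERE clause.
implicit = con.execute("""
    SELECT "File"."ID", "Blockset"."Length"
    FROM "File", "Blockset"
    WHERE "File"."BlocksetID" = "Blockset"."ID"
""").fetchall()

# Explicit (SQL-92) join: the same condition, stated where it belongs.
explicit = con.execute("""
    SELECT "File"."ID", "Blockset"."Length"
    FROM "File"
    JOIN "Blockset" ON "File"."BlocksetID" = "Blockset"."ID"
""").fetchall()

assert sorted(implicit) == sorted(explicit)
```

The explicit form also makes it obvious when a join condition is missing, which is exactly the accidental-cross-join risk mentioned above.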

General-
Please share any particular DB pain points that I should look at. I am new to Duplicati, so I don't really know the current state of things. But I am comfortable with all the languages in the code base and have strong SQL and RDBMS maintenance/performance-tuning skills. I will check out the issues on GitHub after I make my PR, but if you feel there are any high-priority ones, link them here.


PR for the indexes -


This is great! Indexes have been added before, but you’re clearly getting improvements.

The sqlite3_analyzer.exe utility program did show the indexes using a lot of space on my DB.
That’s probably not generally run, so it might be useful to see which indexes are worth it.
That’s easier said than done though, because it may depend on the individual database.

Although I don't do much SQL (and am not actually a Duplicati developer or pull request merger),
I guess these are what are also called comma joins, and they seem quite out of favor, from what I can find.
Unfortunately I didn't ask the original author why they were used; the author is less available lately.

The Joins section of the SQLite Query Optimizer Overview says (and I can't see in the changelog that it's new):

Thus with SQLite, there is no computational advantage to use the newer SQL92 join syntax over the older SQL89 comma-join syntax. They both end up accomplishing exactly the same thing on inner joins.

One warning on any conversions (which you likely know, but I didn’t) is that precedence is different.
Beyond that, I’m all in favor of SQL that is easier to read and maintain well, maybe by non-experts, although I hope at some point an expert (perhaps you?) becomes a regular. The need is regular…

My personal way of trying to understand queries is to copy an actual query out of a profiling log and reformat it at poorsql.com, which changes the usual run-together line into something more readable.
I hoped the indented formatting would also help me understand the evaluation order, groupings, etc.

Another way to understand queries was to follow the source and paste things together, as the code does.
I guess it’s helpful to have somewhat mnemonic names for subparts, but it’s quite a different “view”.
I’ll also pick text out of a profiling log, and try a source code search to try to find where SQL is from.

Although I don’t expect instant solutions to all this, I’ll nominate recreate, scaling, and transactions.

DB recreate is a big one. I did one today and watched lots of individual INSERT operations go by…
My understanding of SQLite performance is that fewer but larger operations sees better throughput.
There might be some limits beyond which it just collapses from size, as opposed to chugging along.
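To illustrate the "fewer but larger operations" point, a minimal sketch of batching many inserts into one transaction with Python's sqlite3 (table name is just an example):

```python
import sqlite3

# Example table standing in for any bulk-loaded Duplicati table.
con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE "Block" ("Hash" TEXT, "Size" INTEGER)')

rows = [("hash%d" % i, 1024) for i in range(10_000)]

# One transaction around one executemany: on a file-backed database SQLite
# only has to sync the journal once at commit, rather than once per row as
# per-statement autocommit would.
with con:  # opens a transaction, commits on clean exit
    con.executemany('INSERT INTO "Block" VALUES (?, ?)', rows)

count = con.execute('SELECT COUNT(*) FROM "Block"').fetchone()[0]
```

The difference between row-at-a-time autocommit and one big transaction is typically orders of magnitude on spinning disks, which is why the recreate's insert pattern matters so much.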

Scaling up in general has been somewhat troublesome. The only tool we have now is the blocksize.

Transaction design concerns me. I haven’t seen a design document on high level approach, haven’t researched in the code, and have neither the time nor the C#, .NET Framework, and SQL skills for it.

There are times when one wants to commit just one table, but all one can do is commit every table, possibly interfering with the carefully laid-out commit plan that some other code thread had intended.

This would probably not be a performance issue, but some functional error, e.g. if Duplicati got killed.
A nearly-ideal end goal would be to be able to kill Duplicati anytime, and have it recover next backup.
Frequently it can, but sometimes it cannot. This is a rich area for test and fix, if resources are around.


Had some time to delve into the DB recreate, as my recreate is getting stuck on a specific query. I noticed what you were talking about with the many inserts. They are actually done in one transaction; the whole database recreate is one transaction, actually. Not sure how I feel about that.

Some improvements I have identified and am still testing/implementing on my custom beta version -

The heaviest queries are the two inserts that occur in LocalRecreateDatabase.cs lines 208-255. I don't see a good way to rewrite these queries without changing how the data is stored, and I am not familiar enough with the database design and underlying concepts to see if there is a better way to store the data.

Also, all the selects that occur to check whether a block needs to be updated (LocalRecreateDatabase.cs line 382) could probably be batch-selected, stored app-side, and then compared. Or we could create a temp table, insert all pending blocks to be checked, and run one query that returns only the updates we need to do. The selects take more time than the inserts for my specific database restore (pointed at a healthy backup); the majority of the database rebuild is spent on these selects.
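A small sketch of the temp-table variant, with guessed table and column names: load the pending hashes once, then one set-based query returns only the blocks that actually need updating, instead of one SELECT per block:

```python
import sqlite3

# Guessed schema: VolumeID = -1 marks a block not yet assigned to a volume.
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE "Block" ("Hash" TEXT PRIMARY KEY, "Size" INTEGER, "VolumeID" INTEGER);
INSERT INTO "Block" VALUES ('a', 100, -1), ('b', 200, 7), ('c', 300, -1);
CREATE TEMP TABLE "PendingBlock" ("Hash" TEXT PRIMARY KEY);
""")

# One batched load of everything we would otherwise probe row by row.
con.executemany('INSERT INTO "PendingBlock" VALUES (?)', [('a',), ('b',)])

# One set-based query replaces N per-block SELECTs.
needs_update = con.execute("""
    SELECT b."Hash"
    FROM "PendingBlock" AS p
    JOIN "Block" AS b ON b."Hash" = p."Hash"
    WHERE b."VolumeID" = -1
""").fetchall()
```

Here only block 'a' comes back: 'b' is already assigned and 'c' was never pending. The per-block round trips are where the 36 million selects would collapse into a handful of queries.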

This also brings up that we could see much better insert performance if we dropped the indexes before the large batches of inserts and recreated them after, though that requires us to do the selects needed for the inserts beforehand. With my current setup I am getting 200 inserts/second. We will see how batch inserts affect this rate.
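A minimal sketch of the drop-then-rebuild pattern (index and table names taken from the earlier list, data is synthetic). Building an index once over the finished table is generally cheaper than maintaining it incrementally across millions of inserts:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute('CREATE TABLE "BlocksetEntry" ("BlocksetID" INTEGER, "Index" INTEGER, "BlockID" INTEGER)')
con.execute('CREATE INDEX "nnc_BlocksetEntry" ON "BlocksetEntry" ("Index", "BlocksetID", "BlockID")')

rows = [(i // 10, i % 10, i) for i in range(50_000)]

# Drop the index so the bulk load pays no per-row index maintenance...
con.execute('DROP INDEX "nnc_BlocksetEntry"')
with con:
    con.executemany('INSERT INTO "BlocksetEntry" VALUES (?, ?, ?)', rows)
# ...then rebuild it in one pass over the completed table.
con.execute('CREATE INDEX "nnc_BlocksetEntry" ON "BlocksetEntry" ("Index", "BlocksetID", "BlockID")')

count = con.execute('SELECT COUNT(*) FROM "BlocksetEntry"').fetchone()[0]
```

The caveat noted above still applies: any selects that relied on the dropped index have to happen before the drop, or they turn into full table scans mid-load.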

To give some perspective on the inserts vs selects -
This is my log file (profiling level) scrollbar, red is inserts -

It is currently 36,550,333 lines long. Only 424,264 inserts, 36,197,941 selects (math doesn’t add up because I was dealing with live numbers constantly updating, point still stands).

I think a hybrid in-memory work database, whose results we persist to the on-disk database, is probably the most effective method. Let me know what you think.
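A rough sketch of how the hybrid could work using SQLite's backup API, which Python's sqlite3 exposes: do the heavy rebuild in a `:memory:` database, then persist it to disk in one pass (table name and columns are just examples):

```python
import os
import sqlite3
import tempfile

# Heavy work happens in a pure in-memory database: no journal, no fsync.
mem = sqlite3.connect(":memory:")
mem.execute('CREATE TABLE "Block" ("Hash" TEXT PRIMARY KEY, "Size" INTEGER)')
with mem:
    mem.executemany('INSERT INTO "Block" VALUES (?, ?)',
                    [("h%d" % i, 1024) for i in range(1000)])

# One sequential pass persists the finished result to the on-disk database.
path = os.path.join(tempfile.mkdtemp(), "rebuilt.sqlite")
disk = sqlite3.connect(path)
mem.backup(disk)  # SQLite online backup API: copies the whole database

disk_count = disk.execute('SELECT COUNT(*) FROM "Block"').fetchone()[0]
```

The obvious trade-off is memory footprint versus crash recovery: an interrupted in-memory rebuild loses everything, which ties back into the consistency concerns above.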

I'm not enough of a DB expert to comment; however, one thing I worry about is my generally limited understanding of the underlying design philosophy of keeping relationships consistent despite whatever catastrophes (e.g. hard kills) may occur. The need for consistency (or at least repairability) also extends externally, especially for ongoing backups. Maybe it's less so for things like DB recreate that can't be resumed.

Processing IIRC does dlist files first, so dangling block references will exist until the dindex files are read.
Perhaps this order suits Direct restore from backup files best (where one picks the version initially)?
Sorry I’m not familiar with the code internals. Most of my experience has been external observation.

Write-Ahead Logging and Enable write-ahead logging and memory mapped IO #4612 may fit in too, especially when talking about performance, and potentially temporary file sizes during DB recreate.
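For reference, enabling write-ahead logging is a one-line PRAGMA on a file-backed database (it does not apply to `:memory:` databases); a minimal sketch, with the mmap size chosen arbitrarily for illustration:

```python
import os
import sqlite3
import tempfile

# WAL requires a real file; :memory: databases report journal_mode 'memory'.
path = os.path.join(tempfile.mkdtemp(), "test.sqlite")
con = sqlite3.connect(path)

# PRAGMA journal_mode returns the mode actually in effect.
mode = con.execute("PRAGMA journal_mode=WAL").fetchone()[0]

# Memory-mapped I/O (the other half of issue #4612) is also a PRAGMA;
# 256 MiB here is an arbitrary example value, not a recommendation.
con.execute("PRAGMA mmap_size=268435456")
```

WAL mainly helps when readers and writers overlap, which does happen with Duplicati's concurrent processing, so it fits this discussion even though it is not a cure-all for the recreate cost.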

In terms of external files that don’t have a transaction model, there’s a state-tracking mechanism to hopefully make it possible to clean up whatever sometimes-partially-done work is at the destination.
Some of this cleanup, though, relies on the database which in theory has some aids to transactions.

Question is, did we use transactions correctly, especially for the concurrent processing that occurs?
Sometimes one thread must flush something it wants out, but does that break the plans elsewhere?

One challenge that Duplicati faces on memory is the extremely diverse size of systems that it runs on.
Maybe a challenging case would be a large backup running on a NAS, perhaps with Duplicati Docker (because Synology as of DSM 7 no longer will install the current version – Docker is the workaround).

An excellent person to bring into this would be the hard-to-find original author who also tried a rewrite. Unfortunately, it never emerged. I wonder if you would be a suitable person to try to pick up the effort?
It seems a huge waste to discard something possibly so near at least some initial level of “completed”.

The original author still seems short on time, though, but as someone who's done a lot, you deserve an ask. Your willingness to dig in is fantastic. Maybe you'll be invited to join the staff (which is quite thin right now).

@kenkendk what do you think about this seemingly very relevant topic, and also recreate questions?

Posted a targeted question in the dedicated recreate-DB GitHub issue:


Yeah, I understand that limitation; we could probably have multiple options: a high-memory recreate and a low-memory recreate. Doing that in code in a maintainable way would be the challenge.

Adding this note here for future reference:

To improve performance on the block table, we need to convert it to a WITHOUT ROWID table and add a clustered index. This is similar to how the BlocksetEntry and FilesetEntry tables are configured. This will require us to add code to handle Block.ID throughout the code base and to adjust queries to take advantage of the clustered index.
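A sketch of what that could look like, with guessed columns: in a WITHOUT ROWID table the primary key is the clustered index, so lookups on it are a single B-tree search with no rowid indirection:

```python
import sqlite3

# Guessed shape for the block table; the real one keys differently and
# currently carries an ID column that this change would remove.
con = sqlite3.connect(":memory:")
con.execute("""
CREATE TABLE "Block" (
    "Hash" TEXT NOT NULL,
    "Size" INTEGER NOT NULL,
    "VolumeID" INTEGER NOT NULL,
    PRIMARY KEY ("Hash", "Size")
) WITHOUT ROWID
""")
con.execute('INSERT INTO "Block" VALUES (?, ?, ?)', ("abc", 1024, 1))

# The plan shows a direct search on the clustered primary key.
plan = con.execute("""
    EXPLAIN QUERY PLAN
    SELECT "VolumeID" FROM "Block" WHERE "Hash" = 'abc' AND "Size" = 1024
""").fetchall()
print(plan[0][3])
```

The flip side, as noted, is that every piece of code that joins on the old integer ID has to switch to the new key, which is why this is a cross-cutting change rather than a local one.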