Sorry! Older bug report.
I just ran the simpler “hanging” query in DB Browser for SQLite; it is reformatted here for clarity:
SELECT "CalcLen", "Length", "A"."BlocksetID", "File"."Path" FROM ( SELECT "A"."ID" AS "BlocksetID", IFNULL("B"."CalcLen", 0) AS "CalcLen", "A"."Length" FROM "Blockset" A LEFT OUTER JOIN ( SELECT "BlocksetEntry"."BlocksetID", SUM("Block"."Size") AS "CalcLen" FROM "BlocksetEntry" LEFT OUTER JOIN "Block" ON "Block"."ID" = "BlocksetEntry"."BlockID" GROUP BY "BlocksetEntry"."BlocksetID" ) B ON "A"."ID" = "B"."BlocksetID" ) A, "File" WHERE "A"."BlocksetID" = "File"."BlocksetID" AND "A"."CalcLen" != "A"."Length"
First of all, I needed to manually run the commands below, taken from the Duplicati/Library/Main/Database/Database schema/Schema.sql file, because the database @Kahomono sent didn’t have these tables and view:
/* The PathPrefix contains a set of path prefixes, used to minimize the space required to store paths */
CREATE TABLE "PathPrefix" (
    "ID" INTEGER PRIMARY KEY,
    "Prefix" TEXT NOT NULL
);
CREATE UNIQUE INDEX "PathPrefixPrefix" ON "PathPrefix" ("Prefix");

/* The FileLookup table contains an ID for each path and each version of the data and metadata */
CREATE TABLE "FileLookup" (
    "ID" INTEGER PRIMARY KEY,
    "PrefixID" INTEGER NOT NULL,
    "Path" TEXT NOT NULL,
    "BlocksetID" INTEGER NOT NULL,
    "MetadataID" INTEGER NOT NULL
);

/* Fast path based lookup, single properties are auto-indexed */
CREATE UNIQUE INDEX "FileLookupPath" ON "FileLookup" ("PrefixID", "Path", "BlocksetID", "MetadataID");

/* The File view contains an ID for each path and each version of the data and metadata */
CREATE VIEW "File" AS
SELECT "A"."ID" AS "ID", "B"."Prefix" || "A"."Path" AS "Path", "A"."BlocksetID" AS "BlocksetID", "A"."MetadataID" AS "MetadataID"
FROM "FileLookup" "A", "PathPrefix" "B"
WHERE "A"."PrefixID" = "B"."ID";
Any ideas why the sent database didn’t contain these tables and view?
I tried the “hanging” query anyway.
As I suspected, the innermost query, with its OUTER JOIN/GROUP BY/SUM on a table with more than 14 million rows (BlocksetEntry) against another with more than 14 million rows (Block), takes too long and, AFAICT, produces tons of unnecessary results, since later in the query only the BlocksetIDs present in the “File” view are actually kept.
To make it faster I filtered against “File”.“BlocksetID” earlier, by adding an early “INNER JOIN” between the “BlocksetEntry” table and the “File” view inside this innermost query.
To be fair, the instant results I’m getting come both from the change I made AND from the fact that my “File” view has no rows at all, which makes the entire query return zero rows. But as I expect (hope?) that the “File” view will always have considerably fewer rows than the “BlocksetEntry” table, this early INNER JOIN should always be a powerful shortcut. Here is the complete query with the new “INNER JOIN”:
SELECT "CalcLen", "Length", "A"."BlocksetID", "File"."Path" FROM ( SELECT "A"."ID" AS "BlocksetID", IFNULL("B"."CalcLen", 0) AS "CalcLen", "A"."Length" FROM "Blockset" A LEFT OUTER JOIN ( SELECT "BlocksetEntry"."BlocksetID", SUM("Block"."Size") AS "CalcLen" FROM "BlocksetEntry" INNER JOIN "File" ON "BlocksetEntry"."BlocksetID" = "File"."BlocksetID" LEFT OUTER JOIN "Block" ON "Block"."ID" = "BlocksetEntry"."BlockID" GROUP BY "BlocksetEntry"."BlocksetID" ) B ON "A"."ID" = "B"."BlocksetID" ) A, "File" WHERE "A"."BlocksetID" = "File"."BlocksetID" AND "A"."CalcLen" != "A"."Length"
Can someone actually test the code? I just tested the SQL straight in DB Browser for SQLite.
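If it helps whoever tests this: a quick way to see whether SQLite actually pushes the “File” filter into the aggregation is to run EXPLAIN QUERY PLAN on just the inner part (this works in DB Browser’s Execute SQL tab as well as the sqlite3 shell). This is only a sketch and assumes the tables/view from Schema.sql above:

-- Sketch only: EXPLAIN QUERY PLAN over the inner aggregation with the new INNER JOIN
EXPLAIN QUERY PLAN
SELECT "BlocksetEntry"."BlocksetID", SUM("Block"."Size") AS "CalcLen"
FROM "BlocksetEntry"
INNER JOIN "File" ON "BlocksetEntry"."BlocksetID" = "File"."BlocksetID"
LEFT OUTER JOIN "Block" ON "Block"."ID" = "BlocksetEntry"."BlockID"
GROUP BY "BlocksetEntry"."BlocksetID";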
Another thing: are there tests for Duplicati code? I couldn’t find them.
Humm… I just found out that the CreateBugReportHandler drops the missing tables and view, probably to create a smaller database.
If these tables are big enough to warrant being deleted for the bug reports, my “fix” might not actually save time after all.
It would be great to have a complete database so I could test with the complete real data. Does anybody have a complete database big enough to make the above query too slow?
Probably from the sanitization, and maybe some cleanup of tables thought to be unneeded. Are they needed?
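A quick way to see what actually survived into a bug-report database is just a standard sqlite_master query, nothing Duplicati-specific:

-- Lists the tables and views present in the (possibly sanitized) database
SELECT "type", "name" FROM "sqlite_master" WHERE "type" IN ('table', 'view') ORDER BY "type", "name";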
FWIW, running the slow query on my home-grown multi-block DB bug report only needed File swapped for FixedFile.
Took 31 seconds, and 6 on the second run, presumably due to the relevant data being cached somewhere.
Probably need to consult @warwickmm about the unit tests. There are a lot that run on PR submission.
There might be one more forum regular who builds from source. I’m not set up here to do any such thing.
P.S. you found the table mods already. The issue with an unmodified DB is it’s not particularly sanitized…
I’ll leave it to @Kahomono whether he’s willing to PM one that shows filenames. Log data (especially on clean runs) is pretty harmless (I think). If need be, whole table could probably be discarded for SQL tests.
There should never be any true source file data in the SQL-level representation of the DB as far as I know.
An alternative path might be to try to figure out why my not-particularly-private backup does a query faster.
I’m at about 5 million blocks (although 1 KB blocks) over 22445 files (I just backed up Program Files (x86)).
Sorry - the file names themselves are rather descriptive; I need to keep them to myself.
My tests didn’t hit the wall until around 13.6 million blocks. 6.8 million ran just fine, it was only a hair slower than 1.36 million.
At a 1 KB blocksize (the minimum), I guess that means about that many GB. I don’t have that much lying around as installed programs (less likely to have personal paths), so maybe I can ransack my collection of Windows ISOs (all obsolete now…), copy those, and hope that having only a few files still shows the issue.
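As a back-of-envelope check in SQL form (this assumes “Block”.“Size” is in bytes, which is my reading of the schema), this shows how close a test database is to the roughly 13.6 million blocks where things reportedly fell over:

-- Approximate source data size implied by the Block table (Size assumed to be bytes)
SELECT COUNT(*) AS "blocks", SUM("Size") / 1024.0 / 1024 / 1024 AS "data_GiB" FROM "Block";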
Before trying that, I guess I should ask @rsevero if some similar approach is possible on local systems, because transferring the database might be challenging. I might try a WeTransfer, and I think it does 2 GB.
@ts678 if you could create the database, it would be great as I’m a complete Duplicati newbie. I believe WeTransfer should deal well with the resulting database.
Isn’t it possible for you to just scramble the filenames? I believe the actual filenames aren’t really important for these tests.
The unit tests are here:
Running now, but the poor mechanical hard drive in this desktop is very busy (queue length is about 5).
This is just an unencrypted backup with 1 KB blocksize of four folders checked in GUI to a local folder.
The only slightly tricky part should be that blocksize is on the Options screen (screen 5) when setting up the job.
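If the GUI gets tedious for repeated experiments, a command-line run along these lines should produce the same kind of database; the source, target and dbpath below are made up, and --blocksize, --no-encryption and --dbpath are standard Duplicati options as far as I know:

Duplicati.CommandLine.exe backup "file://D:\TestTarget" "C:\TestSource" --blocksize=1KB --no-encryption=true --dbpath="D:\test.sqlite"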
I can certainly help with any usage questions, and to some extent on the DB layout and how it’s used.
How the backup process works is helpful too, although it’s not really getting down to SQL table levels.
As for my testing to build a big database, I got one but it’s not slow. I wonder if more files would slow it?
I’m not too keen on going wholly custom on data building. I wonder if throwing in small files might slow it down?
Row statistics comparison:
                  Kahomono    ts678
Block             14973745    16230200
BlocksetEntry     14943985    15980592
FixedFile/File       39489         341
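For reference, these counts can be pulled with a single statement like the one below (FixedFile only exists in bug-report databases; on a live database query the File view instead):

SELECT 'Block' AS "tbl", COUNT(*) AS "row_count" FROM "Block"
UNION ALL SELECT 'BlocksetEntry', COUNT(*) FROM "BlocksetEntry"
UNION ALL SELECT 'FixedFile', COUNT(*) FROM "FixedFile";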
The Kahomono DB ran the new query (after editing it to use FixedFile) for 11 hours before I stopped it.
Process Monitor showed it reading the database heavily. Process Explorer suggested 250 MB/sec; however, almost none of it got through Windows to the disk, which Task Manager often showed idle.
DB Browser for SQLite results of the query on my database (not using a dump, just directly with it):
        old    new
rows      0    2970 (with 760 good, meaning the functionality of the query changed)
time     20    325 (seconds)
I really need access to a complete database that shows slowness to further work on this issue.
One approximation would be to use the FixedFile view. Admittedly performance behavior may differ from actual because there’s no longer a File view that has to go to other tables to return the result to Duplicati.
CREATE VIEW "File" AS
SELECT "A"."ID" AS "ID", "B"."Prefix" || "A"."Path" AS "Path", "A"."BlocksetID" AS "BlocksetID", "A"."MetadataID" AS "MetadataID"
FROM "FileLookup" "A", "PathPrefix" "B"
WHERE "A"."PrefixID" = "B"."ID"
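Another option, if editing every query gets tedious, is the sketch below: recreate a “File” view on top of the bug report’s FixedFile so the original SQL can run unmodified. This assumes FixedFile carries the same ID, Path, BlocksetID and MetadataID columns, which is how it appears here:

-- Assumes FixedFile exposes the same columns the File view used to
CREATE VIEW "File" AS
SELECT "ID", "Path", "BlocksetID", "MetadataID" FROM "FixedFile";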
Do you have an opinion on how to slow down the query after looking at the statistics differences between the slow-but-private DB and my not-so-slow-but-not-so-private DB? Is File count a plausible guess for it?
@Kahomono do you still have equipment (and can you do scripting) to write dummy data to try to match problem on your actual data? Guidance from the first paragraph might help figure out the recipe for “slow”.
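If it comes to that, a rough sketch of what I mean by scripting is below: bulk-generate filler rows with a recursive CTE rather than running real backups. The column list follows Schema.sql, but the hash values, size and VolumeID are made up, so only try this on a throwaway copy of a database (and BlocksetEntry/Blockset would need the same treatment to really exercise the join):

-- Filler rows only: fake hashes, fixed 1 KB size, VolumeID 1 is assumed to exist
WITH RECURSIVE n("i") AS (
    SELECT 1 UNION ALL SELECT "i" + 1 FROM n WHERE "i" < 1000000
)
INSERT INTO "Block" ("Hash", "Size", "VolumeID")
SELECT 'filler-' || "i", 1024, 1 FROM n;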
Probably possible given enough tools and coding. A fixed value might be easier, but will that defeat tuning?
What I’m talking about is scrambling in the DB (we’re not DB experts), assuming that is what was meant. Duplicati’s dump does a nice job, but it also gets rid of the view for a table (where performance may vary).
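A minimal sketch of what I have in mind, assuming the goal is only to hide real names while keeping row counts and join behaviour intact, would be to overwrite every prefix and path with an opaque value derived from its row ID:

-- Scramble in place: values derive from row IDs, so uniqueness is kept but real names are destroyed
UPDATE "PathPrefix" SET "Prefix" = 'prefix-' || "ID" || '/';
UPDATE "FileLookup" SET "Path" = 'file-' || "ID";

Path lengths change, of course, so this runs into the same tuning question as a fixed value would.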
The options might be:
Option 1 (simply adding more files) is not going well. I added 40000 empty files and it didn’t slow down, and the odd thing is that the File view showed only the folder, not its files. I made the files non-empty. Still no change.
What finally got the File view updated (the Refresh button wasn’t enough) was closing and reopening the browser; however, the original query is still pretty fast (22 seconds). One interesting thing is that the same query running in Duplicati is hitting the disk hard (unlike the query in the browser), and that is making it slow, but not ridiculously so.
Still looking for a good recipe for a complete database that shows slowness…
Just wanted to add that this is my experience too, both with Duplicati and with other apps that use SQLite. There is a certain point beyond which performance drops off a cliff.
My interim report on this slow query is at Query during backup takes too long, does not finish in the hope of seeing if its other slow query shows the same symptoms under lots of general-purpose performance tools.
Summary is that SQLite is reading at a crazy rate (maybe about 200 MB/sec, but not actually to the drive). Writing doesn’t seem to be happening much, which makes me wonder if 17 days’ run made any headway.
The architecture of SQLite makes me wonder if the super heavy read sequences come from the Pager using up all its cached pages (the cache is indirectly configurable, so experiment is possible) and thrashing I/O instead.
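If anyone wants to poke at that theory in DB Browser, the page cache can be enlarged for a session with a plain SQLite pragma before running the slow query; the size below is just a guess (a negative value means KiB):

-- Ask for roughly 1 GiB of page cache for this connection (negative cache_size = KiB)
PRAGMA cache_size = -1048576;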
The other issue at least maybe has hope of getting a database bug report with the relevant tables intact to attempt a performance analysis (and maybe SQL revamp) because it doesn’t use
It also had a query that was semi-slow before things fell off the cliff. Maybe that one could be tuned instead.
Ick. I’ve been sort of wondering if there are any SQLite issues in here, but it’s hard to prove. Have you seen any well-explored-and-documented writeups? If any workarounds were identified, that would be even better.
Sorry, this was a fair way back, maybe 7-8 years ago, and the only solution I found was to avoid the database getting too large by splitting it up.
Hopefully I will have a better internet connection soon, in which case I might start over with one of those large backups and see how it performs with a much larger block size.
I skimmed this thread and appreciate all the research you’ve done. It seems I would benefit from increasing blocksize from 0.5MB to 5MB. Is a blocksize conversion feature likely to be added at some point? It sure would be helpful so we could keep our backup history without doubling the backup size.
Looks unlikely with current level of community support. All progress depends on community volunteers.
Volunteers are a bit unpredictable. Sometimes they go. Sometimes new ones arrive. It’s hard to plan…
Please read further here on why IMO this is more than just a feature. I’ll highlight, but offer good news.
DB Query Performance Testing, Fixes, and Maintainability had many great thoughts on possibilities, however all of this could probably use some community help. If nothing else, see if below fix helped.
I am fairly sure I fixed this issue with an index in this pull request: #4687
I had this exact issue but now I don’t after applying the indexes.
I would still like to validate that though on someone else’s dataset.
The 2022-03-13 Canary build is a not-publicized, not-autoupdating preliminary release with:
Improved database query performance, thanks @jedthe3rd
so in theory you can evaluate it on your own, however there are other untested changes in there too.
The usual Canary advice applies. Be careful because it contains new fixes and potentially new bugs.
This one is unusually rough in packaging (you can do an install from the .zip file) due to signing issues.
It was mainly a quick attempt to solve a Dropbox problem, but it does exist and I’ll be running it soon.
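For anyone who would rather experiment on a copy of their existing database before trying a Canary: I haven’t dug into the PR itself, so the statements below are only the general shape of an index-based fix for this query, with purely illustrative names, and not necessarily the indexes the PR actually adds:

-- Illustrative only, not taken from the PR: covering indexes over the join/group-by columns
CREATE INDEX IF NOT EXISTS "illustrative_BlocksetEntryBlocksetIDBlockID" ON "BlocksetEntry" ("BlocksetID", "BlockID");
CREATE INDEX IF NOT EXISTS "illustrative_BlockIDSize" ON "Block" ("ID", "Size");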
I don’t actually have a working backup job yet to validate, but the sanitized database will probably be OK as something to look at. Since it doesn’t come with a backup that we could run operations against, all it will be useful for is seeing table sizes and one-off queries. So it would be good for troubleshooting one person’s issue, but not for broad performance troubleshooting.