Identified another slow query during backup

So I just started testing out Duplicati and ran into some major issues with query performance causing major delays in my backup successfully completing.

They are running on a ~300GB dataset that is a subset of the almost 40TB I will be eventually backing up. (more info here: Poor performance on TrueNAS)

Luckily I am good with databases so I will be following up with some proposed solutions after this backup successfully completes.

Problem queries:

Runtime - 3:42:24.304 - Location LocalDatabase.cs Line 756

SELECT "CalcLen", "Length", "A"."BlocksetID", "File"."Path"
FROM ( 
    SELECT "A"."ID" AS "BlocksetID", IFNULL("B"."CalcLen", 0) AS "CalcLen", "A"."Length"
    FROM "Blockset" A 
    LEFT OUTER JOIN ( 
        SELECT "BlocksetEntry"."BlocksetID", SUM("Block"."Size") AS "CalcLen"
        FROM "BlocksetEntry" 
        LEFT OUTER JOIN "Block" 
            ON "Block"."ID" = "BlocksetEntry"."BlockID"
        GROUP BY "BlocksetEntry"."BlocksetID" 
    ) B 
    ON "A"."ID" = "B"."BlocksetID" 
) A, "File"
WHERE "A"."BlocksetID" = "File"."BlocksetID" 
    AND "A"."CalcLen" != "A"."Length"

Runtime - 0:12:47:47.152 (currently running multiple times, backup stuck running these 13hr long queries) - Location LocalBackupDatabase.cs Line 586

SELECT COUNT(*)
FROM "File" 
INNER JOIN "FilesetEntry" ON "File"."ID" = "FilesetEntry"."FileID"
WHERE "FilesetEntry"."FilesetID" = 2 
    AND "File"."BlocksetID" = -200 
    AND NOT "File"."Path" IN (
        SELECT "Path"
        FROM "File" 
        INNER JOIN "FilesetEntry" 
            ON "File"."ID" = "FilesetEntry"."FileID"
        WHERE "FilesetEntry"."FilesetID" = 1
    )

I am going to have to monitor this backup until it successfully completes (probably in a week based on current query performance) then do some tests to see what can be done.

2 Likes

Fantastic - it is known that some of the database queries could use optimization. If you have the skill to help optimize, your help would be greatly appreciated. Please check out the project’s Github issues page and you’ll see some related to database query performance.

After doing some reading it seems the blocksize parameter and query performance are significantly related. (Why the heck CAN'T we change the blocksize? - #45 by ts678)
It also seems like the number of files being backed up might also have a major effect. I will probably test this with a large but few file backup and a large but many file backup.

Can anyone point me to a in depth explainer on the architecture and database design, probably specifically around how blocksize affects data on the database if any. I am reading through the stuff in the manual and wiki but anything else will help. (How the Backup Process Works - Duplicati 2 User's Manual) (Local database format · duplicati/duplicati Wiki · GitHub)

I think there is enough evidence to confidently say we should edit this page in the manual (Choosing Sizes in Duplicati - Duplicati 2 User's Manual) to say small and even default blocksize will cause poor backup performance. It should probably also be added under general options, with some rules of thumb (which we don’t really have a good idea of atm).

If there are any other tests people would want me to try let me know. I have a pretty robust environment with 10GbE to my destination (local windows pc) so we can nail down some rules of thumb performance wise if I know what tests to do.

I identified some indexes that basically solved my query performance issues:

CREATE INDEX "nnc_Metadataset" ON Metadataset ("ID","BlocksetID")
CREATE INDEX "nn_FilesetentryFile" on FilesetEntry ("FilesetID","FileID")

-- Line 602 & 603 LocalBackupDatabase.cs 
-- CREATE INDEX "tmpName1" ON "{0}" ("Path"),tmpName1
-- CREATE INDEX "tmpName2" ON "{0}" ("Path"),tmpName2

CREATE INDEX "nn_FileLookup_BlockMeta" ON FileLookup ("BlocksetID", "MetadataID")

CREATE INDEX "nnc_BlocksetEntry" ON "BlocksetEntry" ("Index", "BlocksetID", "BlockID")

I will create a PR once I figure out how to implement them into the code.

Also what is the appetite for rewriting all the implicit joins as explicit joins?
It is an antipattern to use implicit joins as they are harder to understand and follow. They can also lead to accidental cross joins.

2 Likes

I don’t think there’s anything that deep, but there are some additional thoughts about it at

Followup ensued, original author added some specifics, and I gave my general summary:

Developer category has some other information, e.g. if you want to get into the Channels.

Channel Pipeline might help if concurrency issues in DB handling arise. I suspect they do.

This is fantastic, as expertise is all too rare. There are others trying, but it’s hard to get into.
Sometimes I can answer questions based on looking at databases rather than writing SQL.

Moving the rest of this convo to this post (DB Query Performance Testing, Fixes, and Maintainability) since it fits better under the Developer category.

1 Like

Hi @tarianjed,

Give this a whirl, for the first query, it halves execution time for me:

select CalcLen, Length, F.BlocksetId, Path
from
(	
	select ifnull(sum(B.Size),0) as CalcLen, BS.Length, BS.Id as BlocksetId
    from Blockset as BS
		left outer join BlocksetEntry as BSE
			on BS.Id = BSE.BlocksetID
		left outer join Block as B
			on BSE.BlockId = B.Id
	group by BS.Id, BS.Length
	having BS.Length != ifnull(sum(B.Size),0)
) as BlockLenIssues
	inner join File as F
		on F.BlocksetId = BlockLenIssues.BlocksetId

I moved onto something else and then realised I can go further. This is 2.5 times faster:

select CalcLen, BS.Length, BS.Id as BlocksetId, Path
from Blockset as BS
	inner join File as F
		on F.BlocksetId = BS.Id
	left outer join
		(
			select BSE.BlocksetId, ifnull(sum(B.Size),0) as CalcLen
			from BlocksetEntry as BSE
				left outer join Block as B
					on BSE.BlockId = B.Id
			group by BSE.BlocksetId
		) as BlocksetLengths
		on BS.Id = BlocksetLengths.BlocksetID
where CalcLen != BS.Length

Hi @tarianjed,

That second query contains a “not in” which is why it’s performing poorly. Would you like me to fix it or are you under control?

Something like this right?

SELECT COUNT(*)
FROM File 
INNER JOIN FilesetEntry ON File.ID = FilesetEntry.FileID
LEFT JOIN (
        SELECT Path
        FROM File 
        INNER JOIN FilesetEntry 
            ON File.ID = FilesetEntry.FileID
        WHERE FilesetEntry.FilesetID = 1
) as F
ON F.Path = File.Path
WHERE FilesetEntry.FilesetID = 2 
    AND File.BlocksetID = -200 
    AND F.Path IS NULL

I have mainly been focusing on indexes because they are easier to get added into the codebase, but there is definitely room for most of these queries to be rewritten.

The indexes I created did resolve most of the performance issues I was experiencing but I will also include these query changes because it will make an even greater impact.

For some of the indexes I have been working on, I have been referencing the SQLite Query Optimizer docs as it isn’t apples to apples with SQL Server or MySQL (the two RDBMS I use regularly).
https://www.sqlite.org/optoverview.html
I also ran into a caveat with how clustered indexes work in SQLite: Clustered Indexes and the WITHOUT ROWID Optimization

You must have a really large dataset to work on. My dataset comes back quickly. The query I came up with is slightly different:

select count(*)
from FileLookup as FL1
	inner join FilesetEntry as FE1
		on FL1.Id = FE1.FileID
			and FL1.BlocksetId = -200
			and FE1.FilesetID = 1112
	left outer join
		(
			select FL2.PrefixID, FL2.Path
			from FileLookup as FL2
				inner join FilesetEntry as FE2
					on FL2.Id = FE2.FileID
						and FE2.FilesetID = 1111
		) as PathsForPreviousFileset
		on PathsForPreviousFileset.PrefixID = FL1.PrefixID
			and PathsForPreviousFileset.Path = FL1.Path
where PathsForPreviousFileset.PrefixID is null

Primarily because I’m not using the view which contains a string concatenation, I’m joining on the raw table by the two items being concatenated. Joining to a view can be bad news. According to my tests what I have above is faster… but I don’t have enough data to really tell. The strange thing is that the original query pulls back a significantly faster time, maybe its an index thing, but ideally the original query should run poorly. In any case please try them out on your dataset to see what happens. Please let me know, I’m curious.

The other difference is that I tend to put filtering criteria on their associated join… it’s just the way I think about the query. I’m also paranoid that the query optimiser will join the tables in their entirety and then filter the result if the filter criteria is in the where clause.

Thank you, I did browse this document the other day, but I must have another look. There’s also the query plan using the “explain” clause if you haven’t played with it: https://www.sqlite.org/eqp.html

I’ve been reading up about the clustered indexes… there a bit of a mess really. It doesn’t look like we can have clustered non-unique indexes either which is a bit of a shame.

In any case see how you go with the queries. If the original comes out as the fastest then I’d say it’s because of some serious indexing, I can’t see how the other two queries can’t beat it. The “explain” command will help you determine whether indexes are being used.