So I just started testing out Duplicati and ran into some major query performance issues that are badly delaying my backup's completion.
The queries are running against a ~300GB dataset, a subset of the almost 40TB I will eventually be backing up. (more info here: Poor performance on TrueNAS)
Luckily I am good with databases, so I will follow up with some proposed solutions after this backup successfully completes.
Problem queries:
Runtime - 3:42:24.304 - Location LocalDatabase.cs Line 756
SELECT "CalcLen", "Length", "A"."BlocksetID", "File"."Path"
FROM (
    SELECT "A"."ID" AS "BlocksetID", IFNULL("B"."CalcLen", 0) AS "CalcLen", "A"."Length"
    FROM "Blockset" A
    LEFT OUTER JOIN (
        SELECT "BlocksetEntry"."BlocksetID", SUM("Block"."Size") AS "CalcLen"
        FROM "BlocksetEntry"
        LEFT OUTER JOIN "Block" ON "Block"."ID" = "BlocksetEntry"."BlockID"
        GROUP BY "BlocksetEntry"."BlocksetID"
    ) B ON "A"."ID" = "B"."BlocksetID"
) A, "File"
WHERE "A"."BlocksetID" = "File"."BlocksetID"
  AND "A"."CalcLen" != "A"."Length"
Runtime - 0:12:47:47.152 (currently running multiple times; the backup is stuck running these 13-hour queries) - Location LocalBackupDatabase.cs Line 586
SELECT COUNT(*)
FROM "File"
INNER JOIN "FilesetEntry" ON "File"."ID" = "FilesetEntry"."FileID"
WHERE "FilesetEntry"."FilesetID" = 2
  AND "File"."BlocksetID" = -200
  AND NOT "File"."Path" IN (
      SELECT "Path"
      FROM "File"
      INNER JOIN "FilesetEntry" ON "File"."ID" = "FilesetEntry"."FileID"
      WHERE "FilesetEntry"."FilesetID" = 1
  )
I am going to have to monitor this backup until it successfully completes (probably in a week at the current query performance) and then do some tests to see what can be done.
Fantastic - it is known that some of the database queries could use optimization. If you have the skill to help optimize, your help would be greatly appreciated. Please check out the project's GitHub issues page and you'll see some related to database query performance.
After doing some reading it seems the blocksize parameter and query performance are significantly related. (Why the heck CAN'T we change the blocksize? - #45 by ts678)
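To put rough numbers on why (an assumption on my part: I believe the default blocksize is 100KB): every block becomes a row in the Block table, so table sizes grow quickly with dataset size, before even accounting for deduplication.
-- Back-of-envelope Block row counts at a 100KB blocksize
-- (deduplication would reduce these somewhat):
SELECT 300.0 * 1024 * 1024 / 100 AS blocks_300GB,          -- ~3.1 million
       40.0 * 1024 * 1024 * 1024 / 100 AS blocks_40TB;     -- ~430 million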
It also seems like the number of files being backed up might have a major effect. I will probably test this with two large backups, one containing few files and one containing many.
I think there is enough evidence to confidently say we should edit this page in the manual (Choosing Sizes in Duplicati - Duplicati 2 User's Manual) to note that a small (or even the default) blocksize can cause poor backup performance. It should probably also be added under general options, with some rules of thumb (which we don't really have a good idea of at the moment).
If there are any other tests people would want me to try, let me know. I have a pretty robust environment with 10GbE to my destination (a local Windows PC), so we can nail down some performance rules of thumb if I know what tests to run.
I identified some indexes that basically solved my query performance issues:
CREATE INDEX "nnc_Metadataset" ON "Metadataset" ("ID", "BlocksetID");
CREATE INDEX "nn_FilesetentryFile" ON "FilesetEntry" ("FilesetID", "FileID");
-- Line 602 & 603 LocalBackupDatabase.cs
-- CREATE INDEX "tmpName1" ON "{0}" ("Path"),tmpName1
-- CREATE INDEX "tmpName2" ON "{0}" ("Path"),tmpName2
CREATE INDEX "nn_FileLookup_BlockMeta" ON "FileLookup" ("BlocksetID", "MetadataID");
CREATE INDEX "nnc_BlocksetEntry" ON "BlocksetEntry" ("Index", "BlocksetID", "BlockID");
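One caveat from general SQLite practice (I'm assuming Duplicati doesn't already do this): after creating indexes, it's worth refreshing the planner statistics so the query planner actually picks them up.
-- Refresh SQLite's planner statistics after adding indexes:
ANALYZE;
-- or, on newer SQLite versions, the lighter-weight:
PRAGMA optimize;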
I will create a PR once I figure out how to implement them into the code.
Also, what is the appetite for rewriting all the implicit joins as explicit joins?
Implicit joins are an antipattern: they are harder to understand and follow, and they can lead to accidental cross joins.
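A concrete illustration using the tables from the first problem query above (column list trimmed for brevity):
-- implicit join (comma syntax), as in the first problem query:
SELECT "File"."Path"
FROM "Blockset", "File"
WHERE "Blockset"."ID" = "File"."BlocksetID";

-- equivalent explicit join: dropping the WHERE above silently becomes a
-- cross join, while dropping the ON below is a syntax error:
SELECT "File"."Path"
FROM "Blockset"
INNER JOIN "File" ON "Blockset"."ID" = "File"."BlocksetID";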
I don’t think there’s anything that deep, but there are some additional thoughts about it at
A follow-up ensued, the original author added some specifics, and I gave my general summary:
The Developer category has some other information, e.g. if you want to get into the Channels.
Channel Pipeline might help if concurrency issues in DB handling arise. I suspect they do.
This is fantastic, as expertise is all too rare. There are others trying, but it’s hard to get into.
Sometimes I can answer questions based on looking at databases rather than writing SQL.
Give this a whirl for the first query; it halves execution time for me:
select CalcLen, Length, F.BlocksetId, Path
from
(
    select ifnull(sum(B.Size), 0) as CalcLen, BS.Length, BS.Id as BlocksetId
    from Blockset as BS
    left outer join BlocksetEntry as BSE on BS.Id = BSE.BlocksetID
    left outer join Block as B on BSE.BlockId = B.Id
    group by BS.Id, BS.Length
    having BS.Length != ifnull(sum(B.Size), 0)
) as BlockLenIssues
inner join File as F on F.BlocksetId = BlockLenIssues.BlocksetId
I moved on to something else and then realised I could go further. This is 2.5 times faster:
select CalcLen, BS.Length, BS.Id as BlocksetId, Path
from Blockset as BS
inner join File as F on F.BlocksetId = BS.Id
left outer join
(
    select BSE.BlocksetId, ifnull(sum(B.Size), 0) as CalcLen
    from BlocksetEntry as BSE
    left outer join Block as B on BSE.BlockId = B.Id
    group by BSE.BlocksetId
) as BlocksetLengths on BS.Id = BlocksetLengths.BlocksetID
where CalcLen != BS.Length
Here is the second query rewritten to avoid the NOT IN, using a LEFT JOIN anti-join instead:
SELECT COUNT(*)
FROM File
INNER JOIN FilesetEntry ON File.ID = FilesetEntry.FileID
LEFT JOIN (
    SELECT Path
    FROM File
    INNER JOIN FilesetEntry ON File.ID = FilesetEntry.FileID
    WHERE FilesetEntry.FilesetID = 1
) as F ON F.Path = File.Path
WHERE FilesetEntry.FilesetID = 2
  AND File.BlocksetID = -200
  AND F.Path IS NULL
I have mainly been focusing on indexes because they are easier to get added into the codebase, but there is definitely room for most of these queries to be rewritten.
The indexes I created did resolve most of the performance issues I was experiencing, but I will also include these query changes because they will make an even greater impact.
You must have a really large dataset to work on. My dataset comes back quickly. The query I came up with is slightly different:
select count(*)
from FileLookup as FL1
inner join FilesetEntry as FE1
    on FL1.Id = FE1.FileID
    and FL1.BlocksetId = -200
    and FE1.FilesetID = 1112
left outer join
(
    select FL2.PrefixID, FL2.Path
    from FileLookup as FL2
    inner join FilesetEntry as FE2
        on FL2.Id = FE2.FileID
        and FE2.FilesetID = 1111
) as PathsForPreviousFileset
    on PathsForPreviousFileset.PrefixID = FL1.PrefixID
    and PathsForPreviousFileset.Path = FL1.Path
where PathsForPreviousFileset.PrefixID is null
Primarily because I'm not using the view, which contains a string concatenation; I'm joining on the raw table by the two items that get concatenated (see the sketch below). Joining to a view can be bad news. According to my tests what I have above is faster… but I don't have enough data to really tell. The strange thing is that the original query comes back significantly faster; maybe it's an index thing, but by rights the original query should run poorly. In any case please try them out on your dataset to see what happens. Please let me know, I'm curious.
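Roughly what I mean about the view (a sketch from memory, not the real definition; I'm also assuming the prefix table is named PathPrefix, based on the PrefixID column above):
-- Sketch only: the real definition lives in Duplicati's schema. The
-- point is that "Path" is computed by concatenation, so a join on
-- "File"."Path" can't use indexes on the underlying columns.
CREATE VIEW "File" AS
SELECT "FL"."ID" AS "ID",
       "PP"."Prefix" || "FL"."Path" AS "Path",
       "FL"."BlocksetID" AS "BlocksetID",
       "FL"."MetadataID" AS "MetadataID"
FROM "FileLookup" "FL"
INNER JOIN "PathPrefix" "PP" ON "FL"."PrefixID" = "PP"."ID";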
The other difference is that I tend to put filtering criteria on their associated join… it's just the way I think about the query. I'm also paranoid that the query optimiser will join the tables in their entirety and then filter the result if the filter criteria are in the where clause.
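To spell that out with the join above (for an inner join the two forms are equivalent and, as far as I know, SQLite plans them identically; for an outer join the placement changes the meaning):
-- filter attached to the join:
select count(*)
from FileLookup as FL1
inner join FilesetEntry as FE1
    on FL1.Id = FE1.FileID
    and FE1.FilesetID = 1112;

-- filter in the where clause: same result and plan for an inner join,
-- but for a left outer join this form would discard the null rows:
select count(*)
from FileLookup as FL1
inner join FilesetEntry as FE1
    on FL1.Id = FE1.FileID
where FE1.FilesetID = 1112;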
Thank you, I did browse this document the other day, but I must have another look. There’s also the query plan using the “explain” clause if you haven’t played with it: https://www.sqlite.org/eqp.html
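For example, prefixing one of the queries above (the exact output format varies with SQLite version):
explain query plan
select count(*)
from FileLookup as FL1
inner join FilesetEntry as FE1
    on FL1.Id = FE1.FileID
    and FE1.FilesetID = 1112;
-- hope to see SEARCH ... USING INDEX (or COVERING INDEX) rather than
-- SCAN TABLE for each table in the output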
I've been reading up about clustered indexes… they're a bit of a mess really. It doesn't look like we can have clustered non-unique indexes either, which is a bit of a shame.
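For anyone following along: the closest SQLite seems to offer is a WITHOUT ROWID table, which stores rows in primary-key order, and the primary key must be unique, hence no non-unique clustering. A purely hypothetical illustration:
-- Hypothetical table, not part of the Duplicati schema: rows would be
-- stored physically ordered by (BlocksetID, "Index").
CREATE TABLE "BlocksetEntryClustered" (
    "BlocksetID" INTEGER NOT NULL,
    "Index" INTEGER NOT NULL,
    "BlockID" INTEGER NOT NULL,
    PRIMARY KEY ("BlocksetID", "Index")
) WITHOUT ROWID;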
In any case, see how you go with the queries. If the original comes out as the fastest then I'd say it's because of some serious indexing; I can't see how the other two queries could fail to beat it. The "explain" command will help you determine whether indexes are being used.