Agree. I don’t know whether the historical design choice of keying on hash along with size was due to collision worry. Looking at the Block table’s index set, there’s no index on hash alone, although there is one on hash + size:
CREATE UNIQUE INDEX "BlockHashSize" ON "Block" ("Hash", "Size")
A UNIQUE INDEX gives me the best confidence of uniqueness, and it’s a smaller change than reworking the C# code, etc.
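For anyone who wants to double-check the index set on their own database, this is plain SQLite (nothing Duplicati-specific) and runs in any SQL browser:

-- List the indexes on the Block table along with their definitions.
-- "sql" is NULL for indexes SQLite created automatically.
SELECT "name", "sql"
FROM "sqlite_master"
WHERE "type" = 'index' AND "tbl_name" = 'Block';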
Also bear in mind that the one new maintainer is kind of swamped, so good focus on what to fix (and good preparation such as this) may help get a PR in. I’d note that Duplicati still has integrity bugs, which I’m trying to get handled; you helped on one recently, thanks. In general, though, I’d prioritize as
Writing Quality Code: practicing “Make It Work, Make It Right, Make It Fast” when picking what to work on.
Rules Of Optimization has other thoughts, but at least we got the profiling data for two slow statements:
2023-06-14 13:16:37 -07 - [Profiling-Timer.Finished-Duplicati.Library.Main.Database.ExtensionMethods-ExecuteNonQuery]: ExecuteNonQuery: INSERT INTO "DeletedBlock" ("Hash", "Size", "VolumeID") SELECT "Hash", "Size", "VolumeID" FROM "Block" WHERE "ID" NOT IN (SELECT DISTINCT "BlockID" AS "BlockID" FROM "BlocksetEntry" UNION SELECT DISTINCT "ID" FROM "Block", "BlocklistHash" WHERE "Block"."Hash" = "BlocklistHash"."Hash") took 0:00:41:31.935
2023-06-14 13:58:37 -07 - [Profiling-Timer.Finished-Duplicati.Library.Main.Database.ExtensionMethods-ExecuteNonQuery]: ExecuteNonQuery: DELETE FROM "Block" WHERE "ID" NOT IN (SELECT DISTINCT "BlockID" FROM "BlocksetEntry" UNION SELECT DISTINCT "ID" FROM "Block", "BlocklistHash" WHERE "Block"."Hash" = "BlocklistHash"."Hash") took 0:00:42:00.328
both of which are awful, but it took a very oversized backup plus FUSE to get there. What are the usual delays?
The counter-argument is that we’re sort of in this one already, so maybe it’s a good time to just knock it out. However, the recipe for getting a pull request accepted is still quite a mystery to me, and I don’t make that call.
If we want to press on regardless, I’d say someone needs to test, preferably in Duplicati, and do a PR. Because @tkohhh has the affected backup, the Duplicati to test probably has to get onto that system.
In the above profiling log, my guess is that the bulk of the time is in the NOT IN subquery that both statements share.
The Block table is nearly the biggest table, though BlocksetEntry might be bigger, depending on how many versions are kept.
Avoiding running essentially the same probably-slow query twice is likely to be a win in this particular case; one rough way to do that is sketched below.
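As a rough sketch (not current Duplicati code; the temp-table name "UnusedBlock" is made up for illustration), the shared subquery could be evaluated once, with both statements then working from the staged result:

-- Sketch only: evaluate the expensive "unused block" subquery once,
-- then drive both the DeletedBlock insert and the Block delete from it.
CREATE TEMP TABLE "UnusedBlock" AS
SELECT "ID" FROM "Block"
WHERE "ID" NOT IN (
  SELECT DISTINCT "BlockID" FROM "BlocksetEntry"
  UNION
  SELECT DISTINCT "ID" FROM "Block", "BlocklistHash"
  WHERE "Block"."Hash" = "BlocklistHash"."Hash");

INSERT INTO "DeletedBlock" ("Hash", "Size", "VolumeID")
SELECT "Hash", "Size", "VolumeID" FROM "Block"
WHERE "ID" IN (SELECT "ID" FROM "UnusedBlock");

DELETE FROM "Block"
WHERE "ID" IN (SELECT "ID" FROM "UnusedBlock");

DROP TABLE "UnusedBlock";

That still pays for the big scan once, so it’s a halving at best rather than a real fix.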
Experimenting with code might be possible for someone who can build a change, or an SQL browser can be used to simulate the steps Duplicati would take. Beware of OS cache effects when timing things.
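For the SQL browser route, EXPLAIN QUERY PLAN is handy because it shows how SQLite intends to run a statement (which indexes and scans it will use) without actually executing it, so caching can’t skew the result:

-- Ask SQLite how it would run the slow DELETE, without executing it.
EXPLAIN QUERY PLAN
DELETE FROM "Block"
WHERE "ID" NOT IN (
  SELECT DISTINCT "BlockID" FROM "BlocksetEntry"
  UNION
  SELECT DISTINCT "ID" FROM "Block", "BlocklistHash"
  WHERE "Block"."Hash" = "BlocklistHash"."Hash");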
I’d previously started to chase the “candidate list” idea that I proposed vaguely at first, then more concretely as time passed. Some of that is just me trying to get better with SQL, because there’s so much SQL in the code already.
Here are some line-by-line timings from the posted database bug report, where FileLookup appears as FixedFile.
This is using a 2012 vintage Core i5 and a hard drive, so don’t compare times with Unraid’s too closely.
DELETE FROM "Fileset" WHERE "ID" = 807
Result: query executed successfully. Took 2ms, 1 rows affected
DELETE FROM "FilesetEntry" WHERE "FilesetID" NOT IN (SELECT DISTINCT "ID" FROM "Fileset")
Result: query executed successfully. Took 890ms, 32303 rows affected
DELETE FROM "ChangeJournalData" WHERE "FilesetID" NOT IN (SELECT DISTINCT "ID" FROM "Fileset")
Result: query executed successfully. Took 0ms, 0 rows affected
DELETE FROM "FixedFile" WHERE "ID" NOT IN (SELECT DISTINCT "FileID" FROM "FilesetEntry")
Result: query executed successfully. Took 122ms, 8 rows affected
DELETE FROM "Metadataset" WHERE "ID" NOT IN (SELECT DISTINCT "MetadataID" FROM "FixedFile")
Result: query executed successfully. Took 251ms, 8 rows affected
DELETE FROM "Blockset" WHERE "ID" NOT IN (SELECT DISTINCT "BlocksetID" FROM "FixedFile" UNION SELECT DISTINCT "BlocksetID" FROM "Metadataset")
Result: query executed successfully. Took 7274ms, 13 rows affected
DELETE FROM "BlocksetEntry" WHERE "BlocksetID" NOT IN (SELECT DISTINCT "ID" FROM "Blockset")
Result: query executed successfully. Took 6386ms, 26 rows affected
DELETE FROM "BlocklistHash" WHERE "BlocksetID" NOT IN (SELECT DISTINCT "ID" FROM "Blockset")
Result: query executed successfully. Took 4687ms, 4 rows affected
(DELETE FROM "DeletedBlock" so I can see new inserts, write changes so I can test from here)
INSERT INTO "DeletedBlock" ("Hash", "Size", "VolumeID") SELECT "Hash", "Size", "VolumeID" FROM "Block" WHERE "ID" NOT IN (SELECT DISTINCT "BlockID" AS "BlockID" FROM "BlocksetEntry" UNION SELECT DISTINCT "ID" FROM "Block", "BlocklistHash" WHERE "Block"."Hash" = "BlocklistHash"."Hash")
Result: query executed successfully. Took 600802ms, 30 rows affected (early run)
Result: query executed successfully. Took 655989ms, 30 rows affected (later run)
DELETE FROM "Block" WHERE "ID" NOT IN (SELECT DISTINCT "BlockID" FROM "BlocksetEntry" UNION SELECT DISTINCT "ID" FROM "Block", "BlocklistHash" WHERE "Block"."Hash" = "BlocklistHash"."Hash")
Result: query executed successfully. Took 120597ms, 30 rows affected (early run)
Result: query executed successfully. Took 100858ms, 30 rows affected (later run)
On those last two statements, one can see the speedup from what I believe is OS file caching.
What’s strange is that if I run the SELECT by itself, the best time I can get seems to be 175 seconds.
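To be clear, “the SELECT by itself” means just the SELECT half of the INSERT above, run standalone:

SELECT "Hash", "Size", "VolumeID" FROM "Block"
WHERE "ID" NOT IN (
  SELECT DISTINCT "BlockID" AS "BlockID" FROM "BlocksetEntry"
  UNION
  SELECT DISTINCT "ID" FROM "Block", "BlocklistHash"
  WHERE "Block"."Hash" = "BlocklistHash"."Hash")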
Still, avoiding running the SELECT over those huge tables twice will probably save quite a bit of time, but not running it at all is even faster. Here’s a proof-of-concept of one way: it takes 1 second on my PC
and finds the same 30 blocks that the huge-table approach finds. The tables it works from are far smaller:
WITH
  -- Files referenced by the fileset being deleted (807)
  "TheseFileIDs" AS (
    SELECT DISTINCT "FileID" FROM "FilesetEntry" WHERE "FilesetID" = 807
  ),
  -- Files still referenced by any other fileset
  "OtherFileIDs" AS (
    SELECT DISTINCT "FileID" FROM "FilesetEntry" WHERE "FilesetID" <> 807
  ),
  -- Files that become unreferenced once fileset 807 is gone
  "MinusFileIDs" AS (
    SELECT * FROM "TheseFileIDs" EXCEPT SELECT * FROM "OtherFileIDs"
  ),
  -- Metadata rows used by those files
  "MinusMetadataIDs" AS (
    SELECT "MetadataID" FROM "FixedFile" WHERE "ID" IN "MinusFileIDs"
  ),
  -- Blocksets referenced by those files or their metadata
  -- ("BlocksetID" > 0 skips the special negative IDs used for folders/symlinks)
  "MinusBlocksetIDs" AS (
    SELECT "BlocksetID" FROM "FixedFile" WHERE "ID" IN "MinusFileIDs" AND "BlocksetID" > 0
    UNION
    SELECT "BlocksetID" FROM "Metadataset" WHERE "ID" IN "MinusMetadataIDs"
  ),
  -- Data blocks belonging to those blocksets
  "MinusBlocksetBlocks" AS (
    SELECT "ID" FROM "Block" JOIN "BlocksetEntry"
      ON "ID" = "BlockID"
    WHERE "BlocksetID" IN "MinusBlocksetIDs"
  ),
  -- Blocklist (hash-of-hashes) blocks for those blocksets
  "MinusBlocklistHashBlocks" AS (
    SELECT "ID" FROM "Block" JOIN "BlocklistHash"
      ON "Block"."Hash" = "BlocklistHash"."Hash"
    WHERE "BlocklistHash"."BlocksetID" IN "MinusBlocksetIDs"
  )
SELECT "Hash", "Size", "VolumeID" FROM "Block"
WHERE "ID" IN "MinusBlocksetBlocks" OR "ID" IN "MinusBlocklistHashBlocks"