Backup Runtime after 2.0.7.1 update

I do… while I can’t claim to wholly understand how Duplicati manages its database, in the case of the DELETE operation it’s pretty obvious to me what’s going on. I’m going to walk you through my thinking, so if I’ve made an error please let me know.

From the profiling logs, these are the SQL statements that are executed for the DELETE operation.

DELETE FROM "Fileset" WHERE "Timestamp" IN (1686226500) 
DELETE FROM "FilesetEntry" WHERE "FilesetID" NOT IN (SELECT DISTINCT "ID" FROM "Fileset")
DELETE FROM "ChangeJournalData" WHERE "FilesetID" NOT IN (SELECT DISTINCT "ID" FROM "Fileset")
DELETE FROM "FileLookup" WHERE "ID" NOT IN (SELECT DISTINCT "FileID" FROM "FilesetEntry") 
DELETE FROM "Metadataset" WHERE "ID" NOT IN (SELECT DISTINCT "MetadataID" FROM "FileLookup") 
DELETE FROM "Blockset" WHERE "ID" NOT IN (SELECT DISTINCT "BlocksetID" FROM "FileLookup" UNION SELECT DISTINCT "BlocksetID" FROM "Metadataset") 
DELETE FROM "BlocksetEntry" WHERE "BlocksetID" NOT IN (SELECT DISTINCT "ID" FROM "Blockset") 
DELETE FROM "BlocklistHash" WHERE "BlocksetID" NOT IN (SELECT DISTINCT "ID" FROM "Blockset") 
INSERT INTO "DeletedBlock" ("Hash", "Size", "VolumeID") SELECT "Hash", "Size", "VolumeID" FROM "Block" WHERE "ID" NOT IN (SELECT DISTINCT "BlockID" AS "BlockID" FROM "BlocksetEntry" UNION SELECT DISTINCT "ID" FROM "Block", "BlocklistHash" WHERE "Block"."Hash" = "BlocklistHash"."Hash") 
DELETE FROM "Block" WHERE "ID" NOT IN (SELECT DISTINCT "BlockID" FROM "BlocksetEntry" UNION SELECT DISTINCT "ID" FROM "Block", "BlocklistHash" WHERE "Block"."Hash" = "BlocklistHash"."Hash") 
UPDATE "RemoteVolume" SET "State" = "Deleting" WHERE "Type" = "Files" AND "State" IN ("Uploaded", "Verified", "Temporary") AND "ID" NOT IN (SELECT "VolumeID" FROM "Fileset") 

From this I can infer the data hierarchy:

-Fileset
    |-FilesetEntry
        |-FileLookup
            |-Metadataset
            |-Blockset
                |-BlocksetEntry
                    |-Block
                |-BlocklistHash
                    |-Block
    |-ChangeJournalData 

Looking back at the SQL statements, the methodology is clear:

  1. Delete the root of the hierarchy (Fileset) for the fileset identified by timestamp 1686226500
  2. Go to the first child of the root (FilesetEntry) and delete all records that do not have a parent.
  3. Go to the second child of the root (ChangeJournalData) and delete all records that do not have a parent.
  4. Go to the first child of FilesetEntry (FileLookup) and delete all records that do not have a parent.
    etc…

Put another way, the methodology is to delete a parent, and then go to the child table to look for all orphans and delete them.

I’m suggesting that it’s more efficient to start at the bottom of the hierarchy and delete all records there with an ancestry from the identified root. Then, keep moving up the hierarchy and do the same thing.

Let’s look closer at the DELETE statement for the Block table:

DELETE
FROM "Block"
WHERE "ID" NOT IN (
		SELECT DISTINCT "BlockID"
		FROM "BlocksetEntry"
		
		UNION
		
		SELECT DISTINCT "ID"
		FROM "Block"
			,"BlocklistHash"
		WHERE "Block"."Hash" = "BlocklistHash"."Hash"
		)

In plain English: delete everything in Block that doesn’t have a parent in either BlocksetEntry or BlocklistHash.

We know what we want it to delete though… there’s no reason to search for all the orphans. Instead we can say: delete records in Block that have a BlocksetEntry or a BlocklistHash belonging to the Fileset with timestamp 1686226500.

In SQL:

DELETE
FROM block
WHERE id IN (
		SELECT blockid
		FROM blocksetentry
		WHERE blocksetid IN (
				SELECT id
				FROM blockset
				WHERE id IN (
						SELECT blocksetid
						FROM filelookup
						WHERE id IN (
								SELECT fileid
								FROM filesetentry
								WHERE filesetid IN (
										SELECT id
										FROM fileset
										WHERE "timestamp" IN (1686226500)
										)
								)
						)
				)
		)
	OR id IN (
		SELECT id
		FROM block
		JOIN blocklisthash ON block.HASH = blocklisthash.HASH
		WHERE blocksetid IN (
				SELECT id
				FROM blockset
				WHERE id IN (
						SELECT blocksetid
						FROM filelookup
						WHERE id IN (
								SELECT fileid
								FROM filesetentry
								WHERE filesetid IN (
										SELECT id
										FROM fileset
										WHERE "timestamp" IN (1686226500)
										)
								)
						)
				)
		)

It’s worth noting here that the existing code is running the same expensive query twice… once to insert the deleted blocks into the DeletedBlock table, and then a second time for the actual delete operation. It would be more efficient to run the query once, and then for each record returned, write it to DeletedBlock and then delete it. In Oracle I would do that with a cursor and a for loop. I don’t know exactly how that would be done in sqlite, but I’m sure it’s possible.

Carrying on with my proposed methodology, after the Blocks are deleted you can go up the hierarchy and delete the BlocklistHash and BlocksetEntry that belong to the Fileset with timestamp 1686226500.
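
For example, the BlocksetEntry step might look something like this (untested sketch, following the same nesting as the Block query above; BlocklistHash would be similar, keyed on its blocksetid):

DELETE
FROM blocksetentry
WHERE blocksetid IN (
		SELECT blocksetid
		FROM filelookup
		WHERE id IN (
				SELECT fileid
				FROM filesetentry
				WHERE filesetid IN (
						SELECT id
						FROM fileset
						WHERE "timestamp" IN (1686226500)
						)
				)
		)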

Once that’s done, you can go up the hierarchy and delete the Blockset that belong to the Fileset with timestamp 1686226500.

And so on, until you’ve made it to the root and delete Fileset with timestamp 1686226500.


That’s my thinking. Of course you are correct that none of this answers my original question as to WHY the update caused my backup job to slow down. However, optimizing the delete operation will likely make my original question irrelevant.

I’m going to do a test of my methodology to see what it looks like performance wise. I’ll reply with the results once it’s done. In the meantime, I’d love to know what you all think.

EDIT: I’m now realizing that the Many-to-One relationship between FilesetEntry and FileLookup foils this plan. I need to think on it more, but at the very least, my suggestion about not running the same query twice is still valid.

BlocklistHash is kind of unusual in several ways. I mentioned that it identifies a block which is a blocklist; however, its BlocksetID refers to an ID in the Blockset table, so perhaps it’s the parent rather than the child.

ChangeJournalData is part of the USN change journal scheme that I don’t know well, so no comment…

Database and destination internals has background and links to some more. Nice job so far without that, although I suppose this series of statements might be the most concise place to try to infer relationships.

Documentation for the local database format also exists, but the relationship arrows don’t show direction.
I have a draft of an attempt to write that all up as text tables, but it got big, so I just did the internals write-up above.

My SQL is not nearly as good as yours, but I would change “either” to “any” for the needed requirement.

I’m looking at only the top part of the SQL (above the OR) now, and thinking it does a great job of finding blocks that the fileset expands into, but I’m not sure I’m seeing where it looks at block usage from others.

Maybe I’m missing something in the remaining textual explanation, but I thought I’d give initial comment.

EDIT:

To give an example of total backup deduplication, set the upload-unchanged-backups option to make a new fileset with the same content as the current last one (if no files change). Delete one. What happens to the other?

You’re right about this… it’s a many-to-one relationship, which I didn’t consider. I need to ponder this a bit more.
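
For instance, something like this (untested sketch) should list file entries that are shared by more than one fileset, which is exactly the case my plan would mishandle:

SELECT fileid, COUNT(*) AS fileset_count
FROM filesetentry
GROUP BY fileid
HAVING COUNT(*) > 1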

However, I am certain that we can cut the time in half on the expensive query. Excerpted from the log above:

2023-06-14 13:16:37 -07 - [Profiling-Timer.Finished-Duplicati.Library.Main.Database.ExtensionMethods-ExecuteNonQuery]: ExecuteNonQuery: INSERT INTO "DeletedBlock" ("Hash", "Size", "VolumeID") SELECT "Hash", "Size", "VolumeID" FROM "Block" WHERE "ID" NOT IN (SELECT DISTINCT "BlockID" AS "BlockID" FROM "BlocksetEntry" UNION SELECT DISTINCT "ID" FROM "Block", "BlocklistHash" WHERE "Block"."Hash" = "BlocklistHash"."Hash")  took 0:00:41:31.935
...
2023-06-14 13:58:37 -07 - [Profiling-Timer.Finished-Duplicati.Library.Main.Database.ExtensionMethods-ExecuteNonQuery]: ExecuteNonQuery: DELETE FROM "Block" WHERE "ID" NOT IN (SELECT DISTINCT "BlockID" FROM "BlocksetEntry" UNION SELECT DISTINCT "ID" FROM "Block", "BlocklistHash" WHERE "Block"."Hash" = "BlocklistHash"."Hash")  took 0:00:42:00.328

Those are the same query… the first time we’re inserting the results into DeletedBlock, and the second time we’re deleting the results. Both instances take roughly 42 minutes to execute in this case. In Oracle, you could do this:

DECLARE

  cursor BLOCKS is
    SELECT "Hash", "Size", "VolumeID", "ID" FROM "Block" WHERE "ID" NOT IN (SELECT DISTINCT "BlockID" AS "BlockID" FROM "BlocksetEntry" UNION SELECT DISTINCT "ID" FROM "Block", "BlocklistHash" WHERE "Block"."Hash" = "BlocklistHash"."Hash";
    
BEGIN
  
  for i in BLOCKS loop
    INSERT INTO "DeletedBlock" ("Hash", "Size", "VolumeID")
    values (i.Hash, i.Size, i.VolumeID);
    
    delete Block
    where ID = i.ID;    
    
  end loop;
  
END;

I don’t know what the equivalent would be in sqlite, or if it’s even possible directly in sqlite. If it’s not possible, then you could certainly do the same thing with the code that’s controlling the whole thing.

SQLite seems to have very little in the way of the usual programmatic structures e.g. looping.
Recursive Query Examples and its neighbors show how one can try to use recursion instead.

Presumably they wanted to keep it small and simple, plus it’s usually driven from a language.

Because I don’t know Oracle, and know just a little SQLite, I can’t really translate that quickly.
Generally when entering a new language of any sort, it takes some time to find equivalences.

I super agree on this point. UNION deduplicates the combined result set, producing only unique records; UNION ALL skips that deduplication effort. Since we’re only using the UNION subquery to check whether a matching record exists, there’s no reason to deduplicate that data.

On a related note, the DISTINCT clause in all of these queries isn’t helping anything. It may or may not be hurting performance, but it’s certainly not helping.
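
For illustration, here’s the Block delete rewritten with those two changes (untested, and I haven’t checked whether SQLite’s query planner already treats both forms the same):

DELETE
FROM "Block"
WHERE "ID" NOT IN (
		SELECT "BlockID"
		FROM "BlocksetEntry"
		
		UNION ALL
		
		SELECT "Block"."ID"
		FROM "Block"
			,"BlocklistHash"
		WHERE "Block"."Hash" = "BlocklistHash"."Hash"
		)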

If I’m reading that correctly, then I don’t think recursive queries are the answer, because they are focused around CTEs, which are temporary views for a single SQL statement. We need something that will use the same data for multiple SQL statements.

So, I suggest that the surrounding code be modified to

  1. Pull the results of the query into an array
  2. Loop through the array
  3. For each loop iteration:
    a. Insert the Hash, Size, and VolumeID to the DeletedBlock table
    b. Delete the record from the Block table

I was having the same wish for a CTE without recursion, but found that a CTE applies to only one statement.

Even if a CTE wouldn’t work, a temporary table might run faster than a loop of single-row statements.
SQLite supports these, and there are lots of examples of Duplicati’s uses that could be drawn from.
INSERT seems to do multiple rows naturally. With DELETE, it looks like an indirect way is possible.
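
Something along these lines is what I had in mind, as a rough sketch only (untested, and “BlockWaste” is just a made-up name for the temporary table):

CREATE TEMP TABLE "BlockWaste" AS
SELECT "ID", "Hash", "Size", "VolumeID"
FROM "Block"
WHERE "ID" NOT IN (
		SELECT "BlockID" FROM "BlocksetEntry"
		UNION
		SELECT "Block"."ID" FROM "Block", "BlocklistHash"
		WHERE "Block"."Hash" = "BlocklistHash"."Hash"
		);

-- the expensive search above runs once; the two statements below only read the temp table
INSERT INTO "DeletedBlock" ("Hash", "Size", "VolumeID")
SELECT "Hash", "Size", "VolumeID" FROM "BlockWaste";

DELETE FROM "Block" WHERE "ID" IN (SELECT "ID" FROM "BlockWaste");

DROP TABLE "BlockWaste";

Whether that actually beats two passes of the current query would need measuring, of course.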

Having said that (and about to say some more), developers are scarce, so things might not happen.
I’d also like to know why things got slow compared to before. I did notice 2.0.6.3’s plan came out as

[image: 2.0.6.3 query plan]

which used an older index at id 12. 2.0.7.1 switched to the new index, but did that help or hurt here?
Some experimentation may still be in order, but I don’t have such a big backup, so can’t test queries.
There might also be cache experiments. You need a workaround, as a new Beta coming soon is unlikely.

The problem with the bottom is that there are lots of blocks. Another option might be to look at the middle.
I spoke earlier of a candidate list, and the idea could apply at higher levels too. For example, a blockset’s blocks won’t turn into waste if another fileset is still using the blockset, and that’s a far smaller computation.

A blockset has to go completely out of use before any of its blocks turn into waste. Can that help here?
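
Roughly the kind of higher-level candidate list I mean, as a sketch only (it ignores blocksets reached through Metadataset, and ? stands for the ID of the fileset being deleted):

SELECT "BlocksetID" FROM "FileLookup"
WHERE "ID" IN (SELECT "FileID" FROM "FilesetEntry" WHERE "FilesetID" = ?)
EXCEPT
SELECT "BlocksetID" FROM "FileLookup"
WHERE "ID" IN (SELECT "FileID" FROM "FilesetEntry" WHERE "FilesetID" <> ?)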

There’s probably also a lot of block-level similarity between versions of files. That’s what allows deduplication during backup; during fileset deletion it might narrow down the candidate waste blocks, but it’s unclear how much work that would actually save. The current somewhat brute-force plan is at least easy.

Regardless of long-term ambitious schemes, I assume we seek a path to improving your 2.0.7.1 speed preferably without changing code, which leaves things like indexes and cache sizes as possible tactics.

@tkohhh

Probably a bug report could be useful. If you’ve never used this function: it’s a long operation (without much feedback) that creates an anonymised (file names removed) database from the selected job db, along with a short text file about the system features.
I expect that even compressed it could be about 3 GB, so having a shared space big enough can be a problem (it will not be possible to upload it to the forum).

We went through that once over here, although that one seems less oversized than the current one.
If trying queries is part of the idea, the conversion to FixedFile has dropped three indexes.
I don’t know how much that matters though, as the main worry here is possibly the blocks…

as the problematic query is

SELECT Hash, Size, VolumeID FROM Block WHERE ID NOT IN (SELECT DISTINCT BlockID AS BlockID FROM BlocksetEntry UNION SELECT DISTINCT ID FROM Block, BlocklistHash WHERE Block.Hash = BlocklistHash.Hash)

so it only involves Block, BlocksetEntry, and BlocklistHash; I don’t think the dropped indexes matter here.
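
To see which indexes a given database actually has on those tables, querying sqlite_master should do it (sketch):

SELECT "name", "tbl_name", "sql"
FROM "sqlite_master"
WHERE "type" = 'index'
AND "tbl_name" IN ('Block', 'BlocksetEntry', 'BlocklistHash')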

If you like, I suppose you could start on the other bug report. It’s over 5 million blocks, so probably not quite as large, but it’s bigger than any I’ve ever had, and opens in the browser very slowly.

Are you referring to the bug report from the other job I was having problems with? If so, that job doesn’t have this issue, so I don’t know if it will be relevant.

I have three daily backup jobs that run in this order:

  1. Back up NAS files (minus media, but with photos) to B2
  2. Back up media files (plus photos) to a local HDD
  3. Back up the duplicati database files to B2

The only one that is running slower now after updating to 2.0.7.1 is number 2.

Yes. So the delete has not slowed down much at all? Being smaller, I’d think it might slow down less, though.

I’m not sure exactly why the bug report was suggested, but regardless we’ve found a working way.

Sadly we don’t know if it’s sheer size (maybe needing more cache) hitting a wall, or something else.

No, not on the main backup. Here’s the log from the delete portion of today’s backup (I turned off profiling for now):

2023-06-16 05:12:05 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-StartCheck]: Start checking if backups can be removed
2023-06-16 05:12:05 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-FramesAndIntervals]: Time frames and intervals pairs: 7.00:00:00 / 1.00:00:00, 28.00:00:00 / 7.00:00:00, 365.00:00:00 / 31.00:00:00, 1826.00:00:00 / 365.00:00:00
2023-06-16 05:12:05 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-BackupList]: Backups to consider: 6/15/2023 5:00:00 AM, 6/14/2023 5:00:00 AM, 6/13/2023 5:00:00 AM, 6/12/2023 5:00:00 AM, 6/11/2023 5:00:00 AM, 6/10/2023 5:00:00 AM, 6/9/2023 5:00:00 AM, 6/3/2023 8:57:35 PM
2023-06-16 05:12:05 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-BackupsToDelete]: Backups outside of all time frames and thus getting deleted: 
2023-06-16 05:12:05 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-AllBackupsToDelete]: All backups to delete: 6/9/2023 5:00:00 AM
2023-06-16 05:12:05 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler-DeleteRemoteFileset]: Deleting 1 remote fileset(s) ...
2023-06-16 05:12:27 -07 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Delete - Started: duplicati-20230609T120000Z.dlist.zip.aes (15.50 MB)
2023-06-16 05:12:28 -07 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Delete - Completed: duplicati-20230609T120000Z.dlist.zip.aes (15.50 MB)
2023-06-16 05:12:28 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler-DeleteResults]: Deleted 1 remote fileset(s)

Meanwhile, the delete portion from the media backup ran slowly again today:

2023-06-16 05:17:23 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-StartCheck]: Start checking if backups can be removed
2023-06-16 05:17:23 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-FramesAndIntervals]: Time frames and intervals pairs: 7.00:00:00 / 1.00:00:00, 181.00:00:00 / 31.00:00:00
2023-06-16 05:17:23 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-BackupList]: Backups to consider: 6/15/2023 5:15:17 AM, 6/14/2023 5:16:10 AM, 6/13/2023 5:15:00 AM, 6/11/2023 5:15:40 AM, 6/9/2023 5:20:16 AM, 5/24/2023 5:22:23 AM, 4/21/2023 5:35:27 AM, 3/20/2023 5:35:47 AM, 2/17/2023 4:26:56 AM, 1/17/2023 4:23:55 AM
2023-06-16 05:17:23 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-BackupsToDelete]: Backups outside of all time frames and thus getting deleted: 
2023-06-16 05:17:23 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-AllBackupsToDelete]: All backups to delete: 6/15/2023 5:15:17 AM
2023-06-16 05:17:23 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler-DeleteRemoteFileset]: Deleting 1 remote fileset(s) ...
2023-06-16 06:41:45 -07 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Delete - Started: duplicati-20230615T121517Z.dlist.zip (3.77 MB)
2023-06-16 06:41:45 -07 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Delete - Completed: duplicati-20230615T121517Z.dlist.zip (3.77 MB)
2023-06-16 06:41:47 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler-DeleteResults]: Deleted 1 remote fileset(s)

I did set the CUSTOMSQLITEOPTIONS_DUPLICATI environment variable to cache_size=-200000 but it does not seem to have had an effect. I couldn’t find any documentation on how to utilize that variable, so I’m not sure if the value I used is correct.
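
For reference, my understanding is that the value is meant to end up as a SQLite pragma on the database connection, i.e. something like the following (per the SQLite docs, a negative number is a size in KiB rather than a page count):

PRAGMA cache_size = -200000;  -- a negative value is a size in KiB, so roughly 195 MiB of page cache
PRAGMA cache_size;            -- shows the value currently in effect on the connection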

Yes, thanks; there are actually 900,000 BlocksetEntry records, so it’s a bit more.
I get 23-25s and 15-21s with and without the new index, respectively. Similar results, a bit slower than with my smaller database, but nothing like 40 minutes.

If you set About → Show log → Live → Verbose (or a similar log file), it would let you know whether it picked it up.

Environment variable setting is OS dependent. On Linux, I think you can put it in /etc/default/duplicati. Using systemctl status duplicati will find its PID. Then ps -p <PID> eww will show its environment.

I know how to set environment variables on my host… I just don’t know the correct usage of this particular environment variable. The value I have set is cache_size=-200000. Is that correct? Does it need a trailing semicolon?

When I restart the Duplicati container, this is the only message I see in the Live Log:
[image: Screenshot 2023-06-16 104114]

From a Google search for site:duplicati.com "cache_size", I’d think yes and no respectively.

I’m surprised it even knew it was in the live log at start. How about if you go look at a job log? That opens the database for the job, and should be a better time to see whether a Verbose message makes it out…

Since you mention a container, what’s the storage on its host? A hard drive versus SSD might matter, although a slow drive seems like it would have been slow before. I’m on a hard drive, and Windows caching seems to make a big difference. A SELECT COUNT(*) FROM "Block" took over 7 minutes, however reopening the database (I’m testing on the other 5-million-block one) did it in under 2 seconds.

I’m on Unraid. The Duplicati application is running on an NVMe. The backup destination for this job is an HDD.

I’m not sure I understand what I should be looking at… the job logs that I can see (expand the job in the UI, click “Show Log” in the “Reporting” section) look like they are only related to backup job operations and have nothing to do with application startup. Am I missing something?