Why the heck CAN'T we change the blocksize?

Yes, based on my rough rule of thumb from DB Recreate times with very small blocksizes.
I use a small blocksize to get a large number of blocks from a backup small enough for my hardware to hold.

The Recreate (which also applies somewhat to a disaster-recovery Direct restore from backup files) gets
slower at a more-than-linear rate. This has been noted before in forum reports. Raw test note:

blocksize   backup      dindex  BlocksetEntry   recreate
1 KB        00:22:18    :10     05:52:20        06:02:54
2 KB        00:20:24    :03     00:43:07        00:46:13
4 KB        00:06:49    :02     00:05:20        00:06:41
8 KB        00:05:17    :01     00:00:59        00:01:48
10 KB       00:03:34    :01     00:00:23        00:01:04

1 KB:
2020-03-24 15:12:47 -04 - [Profiling-Timer.Finished-Duplicati.Library.Main.Database.ExtensionMethods-ExecuteNonQuery]: ExecuteNonQuery: INSERT INTO "BlocksetEntry" ("BlocksetID", "Index", "BlockID") SELECT DISTINCT "H"."BlocksetID", "H"."Index", "H"."BlockID" FROM (SELECT "E"."BlocksetID" AS "BlocksetID", "D"."FullIndex" AS "Index", "F"."ID" AS "BlockID" FROM (  
        SELECT
            "E"."BlocksetID",
            "F"."Index" + ("E"."BlocklistIndex" * 32) AS "FullIndex",
            "F"."BlockHash",
            MIN(1024, "E"."Length" - (("F"."Index" + ("E"."BlocklistIndex" * 32)) * 1024)) AS "BlockSize",
            "E"."Hash",
            "E"."BlocklistSize",
            "E"."BlocklistHash"
        FROM
            (
                    SELECT * FROM
                    (
                        SELECT 
                            "A"."BlocksetID",
                            "A"."Index" AS "BlocklistIndex",
                            MIN(32 * 32, ((("B"."Length" + 1024 - 1) / 1024) - ("A"."Index" * (32))) * 32) AS "BlocklistSize",
                            "A"."Hash" AS "BlocklistHash",
                            "B"."Length"
                        FROM 
                            "BlocklistHash" A,
                            "Blockset" B
                        WHERE 
                            "B"."ID" = "A"."BlocksetID"
                    ) C,
                    "Block" D
                WHERE
                   "C"."BlocklistHash" = "D"."Hash"
                   AND
                   "C"."BlocklistSize" = "D"."Size"
            ) E,
            "TempBlocklist-B89B6D32B7B3CA4BBF9E7E189433275B" F
        WHERE
           "F"."BlocklistHash" = "E"."Hash"
        ORDER BY 
           "E"."BlocksetID",
           "FullIndex"
) D, "BlocklistHash" E, "Block" F, "Block" G WHERE "D"."BlocksetID" = "E"."BlocksetID" AND "D"."BlocklistHash" = "E"."Hash" AND "D"."BlocklistSize" = "G"."Size" AND "D"."BlocklistHash" = "G"."Hash" AND "D"."Blockhash" = "F"."Hash" AND "D"."BlockSize" = "F"."Size"  UNION SELECT "Blockset"."ID" AS "BlocksetID", 0 AS "Index", "Block"."ID" AS "BlockID" FROM "Blockset", "Block", "TempSmalllist-8B3E068157D37942AC2D46A8DF3907D8" S WHERE "Blockset"."Fullhash" = "S"."FileHash" AND "S"."BlockHash" = "Block"."Hash" AND "S"."BlockSize" = "Block"."Size" AND "Blockset"."Length" = "S"."BlockSize" AND "Blockset"."Length" <= 1024 ) H WHERE ("H"."BlocksetID" || ':' || "H"."Index") NOT IN (SELECT ("ExistingBlocksetEntries"."BlocksetID" || ':' || "ExistingBlocksetEntries"."Index") FROM "BlocksetEntry" "ExistingBlocksetEntries" ) took 0:05:52:19.986

I don’t recall the detailed history of the above. I might have looked for when things got slow, then looked for some SQL that seemed to account for a large part of the slowness. At least that’s what the table at the top suggests.

There also seems to be an open question of whether the slowdown comes from the number of blocks at the destination (corresponding roughly to the number of dblock files, given the fixed blocksize; note in the SQL above how the blocksize factors into the query), or from the number and length of source paths. I’m guessing blocks, but it needs testing somewhat better than what I started above (which was probably just a backup at some blocksize, followed by a look at Recreate time). I doubt I worked very hard at crafting special sources.

There are possibly just some better ways to write the SQL. Not my area, and a DB expert could help us.
I know enough to mostly explain the DB layout and its rough usage, but my SQL expertise is not great…

Failing that, there may be some ways to brute-force it by giving SQLite more resources, as I suggested at Backups suddenly taking much longer to complete. So far in testing, @amloessb seems to have found a “dramatic effect on backup duration” (is that a concern for you?), but the etilqs temporary file theory was not supported (at least at first glance). My theory was an effect similar to virtual-memory thrashing, which can cause a sudden drop in performance. At a lower level, cache limitations can really slow down a CPU.
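
To make “more resources” concrete, here is a minimal sketch of SQLite pragmas one could try in a session against a copy of the job database; the exact values are illustrative assumptions, not tested recommendations:

-- Illustrative sketch only: run in a SQLite session on a COPY of the Duplicati database
-- to see whether memory limits contribute to the slowness, then re-run a slow query.
PRAGMA cache_size = -200000;   -- assumed value: roughly 200 MB of page cache (negative means KiB)
PRAGMA temp_store = MEMORY;    -- keep temporary tables and indices (the etilqs files) in RAM
PRAGMA mmap_size = 268435456;  -- assumed value: allow up to 256 MB of memory-mapped I/O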

I don’t know that it needs PMs just to discuss generalities, because this is a very new area. You know better than I do where it hurts for you. You were in the My experience with Database Rebuild / Recreate topic, and wound up at “I went with deleting and recreating”, which I think means the backup, because the DB recreate was running slow.

Is this hardware Windows, Linux, or something else? The OS performance tools will probably be useful in trying to track down why things were slow. At the Duplicati level, the profiling level log gives the SQL times.
Use your favorite large-log viewer (I usually use glogg) to look for a “took” time that’s long, e.g. into minutes.

I forget what sorts of hardware and OS @Sami_Lehtinen has, but Sami was also in that Recreate topic.
Slow DB Recreation On OneDrive had a comment about CPU performance and a single thread maxed out, which is likely what one is going to see (you might need to switch Task Manager to Logical processors) when Duplicati has SQLite busy in some slow single SQL statement. Profiling is the log level where SQLite statements become visible, so it’s the likely way to find slow ones and make them fast, but that needs time and SQL skill (little of either is available right now, AFAIK), so anybody who’s willing, please chip away at what you can.

That’s pretty much all the project can do. It can’t hire, so it needs volunteer contributions for progress.

Thanks for considering helping out in some way. It’s clear from the original post that this is a pain point for you. Though small systems need support too, maybe we can at least get better data on large-backup behavior.

The hardware is a ProxMox server. AMD Ryzen 7 with 16 cores, 64GB RAM and a ZFS file system of about 22TB of which about 7T is in use so far. I can stand up a VM in there of just about any OS we want, and allocate it an arbitrary amount of disk space for short-term use. I also have 500Gbit raw outbound network bandwidth depending on the backup targets we want to test.

I definitely saw that – one vCPU maxed out, all else idle – when I was running that massive DB recreate last month. That was on Debian.

The theory is that large backups without a large blocksize keep SQLite very busy. One contributing factor might be SQLite not having good statistical information for query planning because ANALYZE hasn’t yet run.
Run PRAGMA optimize upon closing database connection #3745 helps regular work. Maybe not recreate?
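
If someone wants to test that theory, the statements involved are plain SQLite and could be tried by hand on a copy of a recreated database; this is a sketch of the idea, not a confirmed fix:

-- Sketch for testing the query-planner theory on a COPY of the database.
ANALYZE;           -- gather table and index statistics for the query planner
PRAGMA optimize;   -- what issue #3745 arranges to run when the connection closes
-- then re-run one of the slow statements (e.g. the BlocksetEntry INSERT above) and compare times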

One thing I’d personally like to see is whether my rough rule of thumb of 1 million blocks holds for genuine large backups, instead of just my artificial tiny-blocksize testing. Whatever other areas get explored, it may be necessary to see whether results found with normal large backups scale down OK to a developer-friendly size.
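
For reference, the block count of an existing backup is easy to read from its job database; a rough sketch, using the table name seen in the queries above:

-- Rough check of how many blocks a backup has; run against a copy of the job database.
SELECT COUNT(*) AS block_count FROM "Block";
-- Source size divided by blocksize gives roughly the same figure,
-- e.g. 100 GB at the default 100 KB blocksize is about 1 million blocks.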

I’ve been asking around for staff OS preference for that VM. No clear preference yet. Based on backups in https://usage-reporter.duplicati.com/, Windows has more. I don’t know which OS has more large backups.

I think this is largely tailored around your large-backup pains, so maybe drive the tests based on where it hurts, also considering which OS you might find more suitable for the needed setup and performance findings.

I’m pretty sure that drive speed is sometimes a bottleneck, though I’m not sure how well tools in a VM can measure it. On my physical Windows system I look at things like Task Manager disk % busy, but after it hits 100% I use Resource Monitor to see the disk queue length. I forget whether Linux has equivalent drive measurements.

So for my first test I can

  1. Clone my largest backup destination files
  2. Time recreations of the DB in identical VMs with blocksizes of 50M, 10M, 5M, 1M, 500K, 100K

That will at least tell us if there’s any “there” there.

I can start tonight after work. What other measures will be of interest? And should I install Windows for this? My go-to these days is Debian 11. If Windows, what flavor? Please, no 11; 10, or a Server edition.


That sounds like a wonderful intent, but the blocksize is already set if you start from destination files.
Maybe I misunderstand the details of the steps; I’d have thought a fresh backup would be involved.
Although the first focus is on recreate, that would be a good chance to look at the other times and sizes too.

I guess ultimately the goal is to find slow spots that are potentially addressable. Broad metrics would include CPU load (maybe both aggregate and per logical processor), drive load (ideally including the “overload” factor that queue length implies), and someday network activity, although a DB recreate that becomes painfully slow is typically processing small files such as dindex, so download isn’t the limiting factor.

In terms of processing steps, a log file at verbose level shows them. A live log is fine if a peek is needed, but a log file has better time resolution (seconds) and is easier to match up with other metrics.

A profiling-level log is good for finding slow SQL. One can (I think) also see some patterns, e.g. where a dindex file is downloaded, then taken apart for block processing. I don’t recall the exact rhythm I once observed, but as an example, a default 50 MB dblock at 100 KB blocksize might have 500 blocks (or more) to process. The recreate reads the dindex file that corresponds to a specific dblock.
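
To get a feel for that ratio on a real database, a rough sketch like the one below could be run against a copy of the job database; the "VolumeID" column on "Block" is an assumption about the local DB layout, taken from the compaction query quoted later in this topic:

-- Sketch: blocks recorded per remote volume, to compare against dblock size divided by blocksize.
SELECT "VolumeID", COUNT(*) AS blocks_in_volume
FROM "Block"
GROUP BY "VolumeID"
ORDER BY blocks_in_volume DESC
LIMIT 10;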

My rough experiment was just looking for the “long pole” in SQL time, maybe using a regular expression to look in the “took” times for a minutes digit that stopped being 0. I’m pretty sure I didn’t script anything to find smaller delays that repeated a lot. The results table in my raw note showed one query soon dominating the time.

I still hope for a volunteer with SQL expertise who can help sanity-check and troubleshoot the slow spot.

Even though SQLite can occupy a logical processor, I suspect it’s also busy working with database files. Possibly it’s also using internal temporary files (named with an etilqs prefix). One can casually look at disk load with a friendly tool such as Task Manager, but Resource Monitor is good at naming names, Performance Monitor is better at logging, and Sysinternals Process Monitor is good for seeing the details. Because that’s a lot to wade through, @amloessb suggested here that ProcessActivityView has a readable summary (and posted some).

Because several people have been looking at performance, I’d encourage community input, not just mine.

I’m also not as familiar with Linux performance tools, although there seem to be a lot of them (varying by distro?). Of the ones that can log, I’ve used sar in the past, and strace can show the low-level details, including which paths are being accessed, e.g. if one wants to see whether etilqs files are part of the slowness.

Windows 10 is common. I have it here, and probably a lot of people do. Does Server provide better ways for studying performance? If so, have at them, but also try to find something that those on 10 can look at.
If there’s no good reason to use Server, probably just run with Windows 10 if you decide to run Windows.

If you go with Debian, I suspect the tools it has are available in other distros (though maybe not by default), so that might be a good fit for whichever developer follows up (provided you can scale the test down).

There are so few developer volunteers that reproducible issues and some problem isolation can help a lot. The steps so far are external, using readily available tools, but (if you like) code-level instrumentation can be done too.

Thanks for volunteering some equipment and time. Every step taken potentially adds to the understanding.

Okay! I have this set up. About 7.6T of raw data to back up, and a suitable destination prepared. Here is my initial backup config:

mono /usr/lib/duplicati/Duplicati.CommandLine.exe backup \
file:///home/david/dupli-target/ /media/david/nas-media/ \
--backup-name=Backup_50M \
--dbpath=/usr/lib/duplicati/data/SJKWAPRIEO.sqlite \
--encryption-module=aes \
--compression-module=zip \
--dblock-size=512MB \
--retention-policy="1W:1D,4W:1W,12M:1M" \
--blocksize=50MB \
--log-file=/home/david/50M_backup.log \
--log-file-log-level=Profiling \
--disable-module=console-password-input

If we need to change anything, now’s the time. I plan to re-execute this unchanged each time, except for the --blocksize parameter, and then I will post an extract from the logs.


I assume there’s a Recreate between the different --blocksize backups. Starting at 50MB should run faster.
You will probably get a smaller database too, although the table with the paths won’t be shrinking any.
Your destination size will probably be larger at 50 MB, so that might be the tradeoff for faster operations…

well, yah. That’s what we’re really testing.

I would also like to chime in with some data. I have a Duplicati task that backs up ~300GB of .VHDX files every day on a 30-day rolling window. The database size is currently about 5GB.

When the backup task was created I had left the blocksize at the default 100KB. For the most part the backup jobs themselves run fast enough and are primarily limited by my 10Mbps upload speed. I run into problems whenever a Compaction is called for. I have never tried a Recreate on this job, but I’m guessing it would take a very long time.

During the last compaction, the job appeared to be stuck so I switched on the Profiling logs to see what it was up to. It was running a SQL query that ended up taking longer than five hours to complete… and as soon as that query finished it launched into another that appeared to be taking just as long.

I ended up killing the task and repairing the database, because at that rate the job might as well never finish… but to satisfy my curiosity I copied the query it was trying to run into a SQL DB browser and ran it from there, and it returned a result in about four seconds. Given that I had interrupted the backup job, I don’t know if the query was pertinent anymore, but I found the execution-speed disparity suspicious.

I tried looking into the Duplicati source to see how it was building the query and what it planned on doing with the result, but my C# has gotten rusty enough that I didn’t make very good progress.

The query, if anyone is interested, was this one:

SELECT 
  "A"."Hash", 
  "C"."Hash" 
FROM 
  (
    SELECT 
      "BlocklistHash"."BlocksetID", 
      "Block"."Hash", 
      * 
    FROM 
      "BlocklistHash", 
      "Block" 
    WHERE 
      "BlocklistHash"."Hash" = "Block"."Hash" 
      AND "Block"."VolumeID" = 28497
  ) A, 
  "BlocksetEntry" B, 
  "Block" C 
WHERE 
  "B"."BlocksetID" = "A"."BlocksetID" 
  AND "B"."Index" >= ("A"."Index" * 3200) 
  AND "B"."Index" < (
    ("A"."Index" + 1) * 3200
  ) 
  AND "C"."ID" = "B"."BlockID" 
ORDER BY 
  "A"."BlocksetID", 
  "B"."Index"

It would be interesting (if you’re willing) to see if you can get a slow run on a copy of your actual database.
DB Browser for SQLite has an “Execute SQL” tab. It even tells you how long the execution took, I believe.

If slow run is reproducible in that environment, then DB experts (if we find volunteers) can try to dissect it.
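
One cheap dissection step (a sketch of the idea, not a promised cure) is to prefix the slow statement with EXPLAIN QUERY PLAN in that same Execute SQL tab; rows showing SCAN on the big tables, rather than SEARCH using an index, would be a hint for whoever digs in. Using the compaction query quoted above:

-- Ask SQLite how it plans to run the slow compaction query quoted above.
EXPLAIN QUERY PLAN
SELECT "A"."Hash", "C"."Hash"
FROM (SELECT "BlocklistHash"."BlocksetID", "Block"."Hash", *
      FROM "BlocklistHash", "Block"
      WHERE "BlocklistHash"."Hash" = "Block"."Hash" AND "Block"."VolumeID" = 28497) A,
     "BlocksetEntry" B, "Block" C
WHERE "B"."BlocksetID" = "A"."BlocksetID"
  AND "B"."Index" >= ("A"."Index" * 3200)
  AND "B"."Index" < (("A"."Index" + 1) * 3200)
  AND "C"."ID" = "B"."BlockID"
ORDER BY "A"."BlocksetID", "B"."Index";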

I usually try to reverse-map SQL queries to the source just by searching for chunks of them, doubling the quotes (that’s how they appear in the C# source).

It might be the below. Someone can try to trace how that method gets used. Maybe the test log will show one.

Ran for 2.5 days and then this:

2021-10-02 08:10:09 -04 - [Error-Duplicati.Library.Main.Operation.BackupHandler-FatalError]: Fatal error
System.Exception: Unexpected number of remote volumes marked as deleted. Found 0 filesets, but 1 volumes
  at Duplicati.Library.Main.Database.LocalDeleteDatabase+<DropFilesetsFromTable>d__5.MoveNext () [0x0027e] in <e60bc008dd1b454d861cfacbdd3760b9>:0 
  at System.Collections.Generic.LargeArrayBuilder`1[T].AddRange (System.Collections.Generic.IEnumerable`1[T] items) [0x00046] in <d22af090bceb4be792f53595cf074724>:0 
  at System.Collections.Generic.EnumerableHelpers.ToArray[T] (System.Collections.Generic.IEnumerable`1[T] source) [0x0003c] in <d22af090bceb4be792f53595cf074724>:0 
  at System.Linq.Enumerable.ToArray[TSource] (System.Collections.Generic.IEnumerable`1[T] source) [0x0002c] in <d22af090bceb4be792f53595cf074724>:0 
  at Duplicati.Library.Main.Operation.DeleteHandler.DoRun (Duplicati.Library.Main.Database.LocalDeleteDatabase db, System.Data.IDbTransaction& transaction, System.Boolean hasVerifiedBackend, System.Boolean forceCompact, Duplicati.Library.Main.BackendManager sharedManager) [0x0018d] in <e60bc008dd1b454d861cfacbdd3760b9>:0 
  at Duplicati.Library.Main.Operation.BackupHandler.CompactIfRequired (Duplicati.Library.Main.BackendManager backend, System.Int64 lastVolumeSize) [0x000a5] in <e60bc008dd1b454d861cfacbdd3760b9>:0 
  at Duplicati.Library.Main.Operation.BackupHandler.RunAsync (System.String[] sources, Duplicati.Library.Utility.IFilter filter, System.Threading.CancellationToken token) [0x00da9] in <e60bc008dd1b454d861cfacbdd3760b9>:0 

The expected 7T of backup files are in place tho.

Thoughts?

A rather rare mystery.

Unexpected number of deleted filesets 1 vs 0 #1583 has a similar report on a new backup, followed by a pointer to some code, and some explanation. It’s comparing two values that “should” give the same number.

This is especially surprising on a new backup, with everything running well until the post-backup deletions.
It’s rather rare that a profiling log catches one of these errors, so lines from around that time might be helpful.

EDIT 1:

You might also want to copy the database in case it’s useful later. Maybe someone can connect it to code.

EDIT 2:

Do you have any log output like below? From other code, it looks like all this “should” be after all uploads:

EDIT 3:

Debian 11 offers sqlitebrowser, if you want to look at the DB yourself, e.g. to see if the Fileset table has one row with a reasonable UNIX epoch date for your backup. If so, the RemoteVolume table should have a Files row with the dlist file for that version. If its State got updated to Deleting by the SQL, the question is why?
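
For reference, the check described above boils down to something like this in sqlitebrowser’s Execute SQL tab; the "Timestamp" and "Name" columns are assumptions about the table layout, while the other identifiers appear in the logged SQL later in this thread:

-- Sketch of the manual check described above; run against a copy of the job database.
-- "Timestamp" and "Name" are assumed column names.
SELECT "ID", "VolumeID", "Timestamp" FROM "Fileset";   -- expect one row with the backup's epoch date
SELECT "ID", "Name", "Type", "State"
FROM "RemoteVolume"
WHERE "Type" = "Files";                                -- the dlist row; its State should not be Deleting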

Here’s the last 200 lines of my log, out of 404,915. If you want more let me know.

Poking around in Explore mode in my DB with a SQL browser seems unlikely to be productive. But if you have any particular queries you want me to try, I will do that.

It’s not SQL queries, and not mere poking. I’m looking for specific data near the error.
Below is a new backup that I did, to use as an example of what I’m talking about here.

[screenshot: the Fileset table in a DB browser, showing its single row]

I’m not using any filtering in the above because I expect only one fileset (backup version).
1633204774 (by https://www.epochconverter.com/) is October 2, 2021 7:59:34 PM.
The ID here is 1, which tells me it’s the first one. VolumeID of 1 points to the table below.
I used a Files filter on the Type column because this is the type of file relevant to your error:

The State here is Verified. I’m not sure what yours is, but it looks like it went to Deleting, which is bad.

2021-10-02 08:10:03 -04 - [Profiling-Timer.Begin-Duplicati.Library.Main.Database.ExtensionMethods-ExecuteNonQuery]: Starting - ExecuteNonQuery: UPDATE "RemoteVolume" SET "State" = "Deleting" WHERE "Type" = "Files" AND "State" IN ("Uploaded", "Verified", "Temporary") AND "ID" NOT IN (SELECT "VolumeID" FROM "Fileset") 
2021-10-02 08:10:09 -04 - [Profiling-Timer.Finished-Duplicati.Library.Main.Database.ExtensionMethods-ExecuteNonQuery]: ExecuteNonQuery: UPDATE "RemoteVolume" SET "State" = "Deleting" WHERE "Type" = "Files" AND "State" IN ("Uploaded", "Verified", "Temporary") AND "ID" NOT IN (SELECT "VolumeID" FROM "Fileset")  took 0:00:00:06.171
2021-10-02 08:10:09 -04 - [Error-Duplicati.Library.Main.Operation.BackupHandler-FatalError]: Fatal error
System.Exception: Unexpected number of remote volumes marked as deleted. Found 0 filesets, but 1 volumes

If you actually got the duplicati-20210929T233834Z.dlist.zip.aes at the destination, it might be possible to continue with the original Recreate test. Sorry for the unplanned interruption, but this was something that wasn’t supposed to happen. Rather than save the database or look at it yourself, you can run Creating a bug report and get a link to it posted or sent somehow, to let somebody else examine it.

Although this issue is rare, it occurs enough that it would be great if somebody could look into it more.

EDIT:

And it might be timing-related, which would be worse, but you can see that the errant UPDATE needs

"ID" NOT IN (SELECT "VolumeID" FROM "Fileset")

which is why the Fileset table is relevant, and also needs

"Type" = "Files" AND "State" IN ("Uploaded", "Verified", "Temporary")

in RemoteVolume, which is why that table is relevant. Thanks for that log too. It seemingly had no versions to delete (thus the 0 on the filesets side of your message), yet changed a remote volume (thus your 1).
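
As a sketch, the same conditions can be turned into a harmless SELECT against a copy of the database, to see which rows would currently be caught; the conditions come straight from the logged UPDATE, while "Name" is an assumed column added for readability:

-- Read-only version of the errant UPDATE's conditions, for inspection on a copy of the DB.
-- "Name" is an assumed column; the other identifiers appear in the logged UPDATE.
SELECT "ID", "Name", "Type", "State"
FROM "RemoteVolume"
WHERE "Type" = "Files"
  AND "State" IN ("Uploaded", "Verified", "Temporary")
  AND "ID" NOT IN (SELECT "VolumeID" FROM "Fileset");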

2021-10-02 08:09:51 -04 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-BackupList]: Backups to consider: 9/29/2021 7:32:19 PM

2021-10-02 08:09:51 -04 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-AllBackupsToDelete]: All backups to delete:

I will get the sqlitebrowser later this morning (EDT) and check this.

FWIW I do have that duplicati-20210929T233834Z.dlist.zip file so I will proceed to try and recreate.

I will also file a bug report.

It took three tries to get the backup going. The first time, the network volume containing the source had gotten unmounted, so we backed up an empty dir. The second time, the /tmp/ volume was too small. I enlarged it and the third time was (almost?) successful.

Here is the bugreport.zip

Thanks. That might be relevant. Maybe this issue only happens in messier situations than a straight run?
We’ll probably have some evidence for or against that on future runs, which likely will have fewer hiccups.

Also thanks for that bug report. I’ll have a look (which may or may not solve it, but there’s actually a theory).

I’ll post some technical notes as I explore. I’m kind of wondering if the Temporary file from the second backup caused your 1 value at the end of the third, and then hid its Deleting State evidence due to transaction rollback.

When a database-using program ends cleanly, it does a commit to write a consistent transaction into the DB.
When it doesn’t (perhaps here), it doesn’t want to keep half-done changes, so the next open rolls them back.

2021-10-02 08:10:09 -04 - [Profiling-Timer.Finished-Duplicati.Library.Main.Database.ExtensionMethods-ExecuteNonQuery]: ExecuteNonQuery: UPDATE "RemoteVolume" SET "State" = "Deleting" WHERE "Type" = "Files" AND "State" IN ("Uploaded", "Verified", "Temporary") AND "ID" NOT IN (SELECT "VolumeID" FROM "Fileset") took 0:00:00:06.171

Being Temporary would pass the first part of the above, and its VolumeID is not in Fileset (which has only the first and third backups), so it passes the second part too, and the number of updated rows winds up as 1, as you saw.


The Temporary state was not always included in that UPDATE; it was added in the “Stop now” fix here.
The general issue goes way back. I’m not saying this is the only case, but we might have a lead on one.
Possibly your log will come in handy again, if we need to figure out how things failed when /tmp filled up.

Thanks for the info!

I guess my main question right now is: should I start a raw create-from-backup? Or wait?