Why the heck CAN'T we change the blocksize?

ts678 · September 27, 2021, 1:08pm

Yes, based on my rough rule-of-thumb based on DB Recreate time with very small blocksize.
I use small blocksize to make large number of blocks on a backup that my hardware can hold.

The Recreate (also a bit applicable to disaster recovery Direct restore from backup files) gets
slower at a more-than-linear rate. This has been noted before in forum reports. Raw test note:

        backup          dindex  BlocksetEntry   recreate
1 KB    00:22:18        :10     05:52:20        06:02:54
2 KB    00:20:24        :03     00:43:07        00:46:13
4 KB    00:06:49        :02     00:05:20        00:06:41
8 KB    00:05:17        :01     00:00:59        00:01:48
10 KB   00:03:34        :01     00:00:23        00:01:04

1 KB:
2020-03-24 15:12:47 -04 - [Profiling-Timer.Finished-Duplicati.Library.Main.Database.ExtensionMethods-ExecuteNonQuery]: ExecuteNonQuery: INSERT INTO "BlocksetEntry" ("BlocksetID", "Index", "BlockID") SELECT DISTINCT "H"."BlocksetID", "H"."Index", "H"."BlockID" FROM (SELECT "E"."BlocksetID" AS "BlocksetID", "D"."FullIndex" AS "Index", "F"."ID" AS "BlockID" FROM (  
        SELECT
            "E"."BlocksetID",
            "F"."Index" + ("E"."BlocklistIndex" * 32) AS "FullIndex",
            "F"."BlockHash",
            MIN(1024, "E"."Length" - (("F"."Index" + ("E"."BlocklistIndex" * 32)) * 1024)) AS "BlockSize",
            "E"."Hash",
            "E"."BlocklistSize",
            "E"."BlocklistHash"
        FROM
            (
                    SELECT * FROM
                    (
                        SELECT 
                            "A"."BlocksetID",
                            "A"."Index" AS "BlocklistIndex",
                            MIN(32 * 32, ((("B"."Length" + 1024 - 1) / 1024) - ("A"."Index" * (32))) * 32) AS "BlocklistSize",
                            "A"."Hash" AS "BlocklistHash",
                            "B"."Length"
                        FROM 
                            "BlocklistHash" A,
                            "Blockset" B
                        WHERE 
                            "B"."ID" = "A"."BlocksetID"
                    ) C,
                    "Block" D
                WHERE
                   "C"."BlocklistHash" = "D"."Hash"
                   AND
                   "C"."BlocklistSize" = "D"."Size"
            ) E,
            "TempBlocklist-B89B6D32B7B3CA4BBF9E7E189433275B" F
        WHERE
           "F"."BlocklistHash" = "E"."Hash"
        ORDER BY 
           "E"."BlocksetID",
           "FullIndex"
) D, "BlocklistHash" E, "Block" F, "Block" G WHERE "D"."BlocksetID" = "E"."BlocksetID" AND "D"."BlocklistHash" = "E"."Hash" AND "D"."BlocklistSize" = "G"."Size" AND "D"."BlocklistHash" = "G"."Hash" AND "D"."Blockhash" = "F"."Hash" AND "D"."BlockSize" = "F"."Size"  UNION SELECT "Blockset"."ID" AS "BlocksetID", 0 AS "Index", "Block"."ID" AS "BlockID" FROM "Blockset", "Block", "TempSmalllist-8B3E068157D37942AC2D46A8DF3907D8" S WHERE "Blockset"."Fullhash" = "S"."FileHash" AND "S"."BlockHash" = "Block"."Hash" AND "S"."BlockSize" = "Block"."Size" AND "Blockset"."Length" = "S"."BlockSize" AND "Blockset"."Length" <= 1024 ) H WHERE ("H"."BlocksetID" || ':' || "H"."Index") NOT IN (SELECT ("ExistingBlocksetEntries"."BlocksetID" || ':' || "ExistingBlocksetEntries"."Index") FROM "BlocksetEntry" "ExistingBlocksetEntries" ) took 0:05:52:19.986

I don’t recall detailed history of above. I might have looked for when things got slow, then looked for some SQL that seemed to account for a large part of the slowness. At least that’s how the table at top appears.

There seems to also be an open question of whether it’s blocks in destination (corresponding roughly to dblock files given the fixed blocksize (and BTW, note in SQL above how that factors into the SQL query) that slows things down, or number and length of source paths. I’m guessing blocks, but it needs testing somewhat better than what I started down above (which was probably just a backup at some blocksize, followed by looking somewhat at Recreate time). I doubt I worked really hard at crafting special sources.

There are possibly just some better ways to write the SQL. Not my area, and a DB expert could help us.
I know enough to mostly explain the DB layout and its rough usage, but my SQL expertise is not great…

Failing that, there may be some ways to brute-force it by giving SQLite more resource, as I suggested at Backups suddenly taking much longer to complete. So far in testing, @amloessb seems to have found “dramatic effect on backup duration” (is that a concern to you?) but the etilqs temporary file theory was not supported (at least at first glance). I had as a theory an effect similar to virtual memory thrashing that can cause a sudden drop in performance. At a lower level, cache limitations can really slow down a CPU.

I don’t know if it needs PMs to give generalities because this is a very new area. You know better than I do where it hurts for you. You were in the My experience with Database Rebuild / Recreate topic, and wound up at “I went with deleting and recreating” which I think means the backup. DB recreate was running slow.

Is this hardware Windows, Linux, or something else? The OS performance tools will probably be useful in trying to track down why things were slow. At the Duplicati level, the profiling level log gives the SQL times.
Use your favorite large-log viewer (I usually use glogg) to look for a took time that’s long, e.g. into minutes.

I forget what sorts of hardware and OS @Sami_Lehtinen has, but Sami was also in that Recreate topic.
Slow DB Recreation On OneDrive had a comment about CPU performance and single thread maxed out, which is likely what one is going to see (might need to switch to Logical processors for Task Manager) when Duplicati has SQLite being busy in some slow single SQL statement. Profiling being the level where SQLite statements become visible was very likely to find slow ones and make them fast, but it needs time and SQL skill (little of either right now AFAIK), so anybody who’s willing, please chip away at what you can.

That’s pretty much all the entire project can do. It can’t hire, so needs volunteer contributions for progress.

Thanks for considering helping out in some way. It’s clear from original post that this is a pain point for you. Though small systems need support too, maybe we can at least get better data on large backup behavior.