Why the heck CAN'T we change the blocksize?

At a 1 KB blocksize (the minimum), I guess that means about that many GB. I don’t have that much lying around as installed programs (less likely to have personal paths), so maybe I can ransack my collection of Windows ISOs (all obsolete now…) and copy those, and hope that having only a few files still shows the issue.

Before trying that, I guess I should ask @rsevero if some similar approach is possible on local systems, because transferring the database might be challenging. I might try WeTransfer, which I think allows 2 GB.

@ts678 if you could create the database, it would be great as I’m a complete Duplicati newbie. I believe WeTransfer should deal well with the resulting database.

Isn’t it possible for you to just scramble the filenames? I believe the actual filenames aren’t really important for these tests.

The unit tests are here:

Running now, but the poor mechanical hard drive in this desktop is very busy (queue length is about 5).

This is just an unencrypted backup, with a 1 KB blocksize, of four folders checked in the GUI, going to a local folder.
The only slightly tricky part should be that blocksize is on the Options screen (screen 5) under Advanced options.

I can certainly help with any usage questions, and to some extent with the DB layout and how it’s used.
How the backup process works is helpful too, although it doesn’t really get down to the SQL table level.

Doing this on a known-good backup Database, the new query generates output implying wrong lengths.

As for my testing to build a big database, I got one, but it’s not slow. I wonder if more files would slow it down?
I’m not too keen on going wholly custom on data building. I wonder if throwing in small files might slow it?

Row statistics comparison:

                Kahomono        ts678
Block           14973745        16230200
BlocksetEntry   14943985        15980592
FixedFile/File  39489           341
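
For anyone collecting the same statistics, the counts above are just per-table row counts, which any SQLite tool can reproduce. A minimal sketch (use FixedFile where a sanitized bug report has replaced the File view):

    SELECT COUNT(*) FROM "Block";
    SELECT COUNT(*) FROM "BlocksetEntry";
    SELECT COUNT(*) FROM "File";   -- or "FixedFile" on a sanitized bug-report database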

The Kahomono DB ran the new query (after editing File to FixedFile) for 11 hours before I stopped it.
Process Monitor showed it reading the database heavily. Process Explorer suggested 250 MB/sec, however almost none of it got through Windows to the disk, which Task Manager often showed as idle.

DB Browser for SQLite results of the query on my database (not using a dump, just directly with it):

            old     new
rows        0       2970 (760 of them good), meaning the functionality of the query changed
time (sec)  20      325
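
As a side note for anyone reproducing the timings: DB Browser shows them in its UI, and the sqlite3 command-line shell can do the same with its built-in timer. A sketch (not what produced the numbers above):

    .timer on
    -- then run the statement being measured, for example:
    SELECT COUNT(*) FROM "File";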

I really need access to a complete database that shows the slowness in order to work further on this issue.

One approximation would be to use the FixedFile view. Admittedly, performance behavior may differ from the real thing, because there’s no longer a File view that has to go to other tables to return the result to Duplicati.

    CREATE VIEW "File" AS
      SELECT "A"."ID" AS "ID",
             "B"."Prefix" || "A"."Path" AS "Path",
             "A"."BlocksetID" AS "BlocksetID",
             "A"."MetadataID" AS "MetadataID"
        FROM "FileLookup" "A", "PathPrefix" "B"
       WHERE "A"."PrefixID" = "B"."ID"
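
A quick way to compare the two access paths, assuming the FixedFile copy exposes the same columns as the File view (which the edit-File-to-FixedFile step above suggests), is to ask SQLite for its query plan in each case, for example:

    -- '/example/path' is just a placeholder; "FixedFile" columns are an assumption.
    EXPLAIN QUERY PLAN
    SELECT "ID", "BlocksetID", "MetadataID" FROM "File" WHERE "Path" = '/example/path';

    EXPLAIN QUERY PLAN
    SELECT "ID", "BlocksetID", "MetadataID" FROM "FixedFile" WHERE "Path" = '/example/path';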

Do you have an opinion on how to slow down the query, after looking at the statistics differences between the slow-but-private DB and my not-so-slow-but-not-so-private DB? Is the File count a plausible guess for the cause?

@Kahomono do you still have equipment (and can you do scripting) to write dummy data to try to match the problem seen on your actual data? Guidance from the first paragraph might help figure out the recipe for “slow”.

Probably possible given enough tools and coding. A fixed value might be easier, but will that defeat tuning?
What I’m talking about is scrambling in the DB (we’re not DB experts), assuming that is what was meant. Duplicati’s dump does a nice job, but it also replaces the view with a table (where performance may vary).

The options might be:

  1. Try to figure out how to make my non-private backup slow.
  2. Try to figure out how to dummy up data to make a slow backup.
  3. Try to figure out how to sanitize the DB without hurting tunings (see the SQL sketch after this list).
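
For option 3, here is a sketch of the sort of in-place path scrambling I mean, using only built-in SQLite functions. The column names come from the view definition quoted earlier; whether length-preserving random text keeps the performance characteristics realistic is exactly the open question:

    -- Run on a copy of the database, never the live one. Lengths stay roughly
    -- similar, and the row "ID" is prepended so values stay unique if an index needs it.
    UPDATE "FileLookup"
       SET "Path" = "ID" || '-' || hex(randomblob(max(1, length("Path") / 2)));
    UPDATE "PathPrefix"
       SET "Prefix" = "ID" || '-' || hex(randomblob(max(1, length("Prefix") / 2)));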

EDIT 1:

Option 1 is not going well by simply adding more files. I added 40,000 empty files and it didn’t slow down, and the odd thing is that the File view showed only the folder, not its files. I made the files non-empty. Still no change.
What finally got the File view to update (the Refresh button wasn’t enough) was closing and reopening the browser, however the original query is still pretty fast (22 seconds). One interesting thing is that the same query running in Duplicati is hitting the disk hard (unlike the query in the browser), and that is making it slow, but not ridiculously so.

Still looking for a good recipe for a complete database that shows slowness…

Just wanted to add that this is my experience too, both with Duplicati and with other apps that use SQLite. There is a certain point beyond which performance drops off a cliff.

My interim report on this slow query is at “Query during backup takes too long, does not finish”, in the hope of seeing whether its other slow query shows the same symptoms under lots of general-purpose performance tools.

The summary is that SQLite is reading at a crazy rate (maybe about 200 MB/sec, but not actually hitting the drive). Writing doesn’t seem to be happening much, which makes me wonder whether the 17-day run made any headway.

The architecture of SQLite makes me wonder if the super-heavy read sequences are from the Pager using up all its cached pages (the cache is indirectly configurable, so an experiment is possible), and so thrashing I/O instead.
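
If anyone wants to experiment with that theory, one rough test (a sketch, not something Duplicati does on its own; within Duplicati the cache is only reachable indirectly) is to enlarge the page cache for a session and re-run the query:

    -- A negative cache_size is interpreted as KiB, so this asks for roughly
    -- 200 MB of page cache on the current connection only; it is not persistent.
    PRAGMA cache_size = -200000;
    -- Re-run the slow query afterwards and compare elapsed time and read volume.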

The other issue at least has some hope of producing a database bug report with the relevant tables intact, allowing a performance analysis (and maybe an SQL revamp), because it doesn’t use the File/FixedFile table.

It also had a query that was semi-slow before things fell off the cliff. Maybe that one could be tuned instead.

Ick. I’ve been sort of wondering if there are any SQLite issues in here, but it’s hard to prove. Have you seen any well-explored-and-documented writeups? If any workarounds were identified, that would be even better.

Sorry, this was a fair way back, maybe 7-8 years ago, and the only solution I found was to avoid the database getting too large by splitting it up.

Hopefully I will have a better internet connection soon, in which case I might start over with one of those large backups and see how it performs with a much larger block size.

I skimmed this thread and appreciate all the research you’ve done. It seems I would benefit from increasing blocksize from 0.5MB to 5MB. Is a blocksize conversion feature likely to be added at some point? It sure would be helpful so we could keep our backup history without doubling the backup size.

Looks unlikely at the current level of community support. All progress depends on community volunteers.
Volunteers are a bit unpredictable. Sometimes they go. Sometimes new ones arrive. It’s hard to plan…

Please read further here on why, IMO, this is more than just a simple feature to add. I’ll highlight the concerns, but also offer good news.

DB Query Performance Testing, Fixes, and Maintainability had many great thoughts on possibilities; however, all of this could probably use some community help. If nothing else, see if the fix below helped.

Please change default blocksize to at least 1MB #4629

I am fairly sure I fixed this issue with an index in this pull request: #4687

I had this exact issue but now I don’t after applying the indexes.

I would still like to validate that though on someone else’s dataset.
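
(For readers unfamiliar with what “an index” buys here: it lets SQLite look up matching rows instead of scanning a huge table. The statement below is a purely hypothetical illustration of the form such a fix takes, not the actual change; the real indexes are in the pull request itself.)

    -- Hypothetical illustration only; see PR #4687 for the actual indexes added.
    CREATE INDEX IF NOT EXISTS "ExampleBlocksetEntryIdx"
        ON "BlocksetEntry" ("BlocksetID", "BlockID");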

It’s unclear to me if sanitized database bug reports help, but maybe they do. Thoughts @tarianjed?

2.0.6.101_canary_2022-03-13 is a not-publicized not-autoupdating preliminary Canary release with:

Improved database query performance, thanks @jedthe3rd

so in theory you can evaluate it on your own; however, there are other untested changes in there too.
The usual Canary advice applies: be careful, because it contains new fixes and potentially new bugs.

This one is unusually rough in packaging (you can install from the .zip file) due to signing issues.
It was mainly a quick attempt to solve a Dropbox problem, but it does exist, and I’ll be running it soon.

I don’t actually have a working backup job yet to validate with, but the sanitized database will probably be OK as something to look at. Since it doesn’t come with a backup that we could run operations against, all it will be useful for is seeing the sizes of tables and running one-off queries. So it would be good for troubleshooting one person’s issues, but not for broad performance troubleshooting.
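
(As a small illustration of those one-off queries: even a sanitized database answers schema and size questions with plain SQLite, for example:)

    -- List the schema objects present, then size up any individual table.
    SELECT "type", "name" FROM "sqlite_master" ORDER BY "type", "name";
    SELECT COUNT(*) FROM "BlocksetEntry";   -- any table name from the list above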