Processing time for Compacting

Jojo-1000 · July 7, 2023, 8:41pm

Yes, right now both conditions have to be satisfied. I think putting a percentage threshold on the total wasted space cannot scale properly as the backup grows over time.

You don’t even need a low threshold for this. A high threshold is even worse, because it delays compacting until a significant portion of the entire backup needs to be compacted.

After every backup operation there is a chance of a small leftover, so even if you were to compact after every backup this would not change much overall.

Maybe there could also be a smarter selection process, where enough volumes are picked so that the result fits in a whole number of volumes. For example, assuming there are 5 volumes with 25% waste, you could compact 4 of them into 3 new ones and leave the 5th for next time. But that would require a bit more design and testing, as this probably won’t work out so evenly in real cases.
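To make the idea concrete, here is a rough Python sketch. It assumes each volume is described only by the fraction of it that is still needed (0.75 for 25% waste). The function name and the simple "take a leading run of volumes" strategy are made up for illustration and are not how Duplicati selects volumes.

```python
from fractions import Fraction
from math import ceil

def pick_whole_volume_prefix(used_fractions, tolerance=Fraction(0)):
    """Return how many leading volumes to compact so that the kept data
    fills a whole number of new volumes (within `tolerance`), or 0 if
    no such selection saves at least one volume.

    used_fractions: per-volume share of still-needed data,
                    e.g. Fraction(3, 4) for a volume with 25% waste.
    """
    best = 0
    total = Fraction(0)
    for count, used in enumerate(used_fractions, start=1):
        total += Fraction(used)
        # Empty space that would remain in the last output volume
        # if we stopped after `count` input volumes.
        leftover = (-total) % 1
        # Only accept selections that actually reduce the volume count.
        if leftover <= tolerance and ceil(total) < count:
            best = count
    return best

# Five volumes with 25% waste each: the first four fit exactly into three.
print(pick_whole_volume_prefix([Fraction(3, 4)] * 5))
```

With real data the fractions will rarely add up exactly, which is what the `tolerance` parameter is meant to absorb.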

Sami_Lehtinen · July 10, 2023, 7:28am

Also, having a low threshold as @ts678 suggested simply creates an insane amount of backup volume churn. I wouldn’t call that “improving efficiency”. As far as I can see, it just makes the whole process a lot less efficient, wasting every resource except destination storage.

I’ve also asked for a compaction time limit option. Currently the compaction can run for several days. Of course, you can already run the compact with a timeout, which simply kills the process after N seconds of compacting.

ts678 · July 10, 2023, 11:38am

As in this?

Is that done externally, e.g. with Python subprocess (which is what I used for my timeout kill tester)?
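For reference, an external timeout of that kind could look roughly like the Python sketch below. The helper name is made up, and the command to run is left to the caller. Nothing here is Duplicati-specific, and like any hard kill it stops the process wherever it happens to be.

```python
import subprocess
import sys

def run_with_timeout(command, timeout_seconds):
    """Run `command` and kill it if it exceeds `timeout_seconds`.

    Returns the exit code, or None if the process was killed on timeout.
    """
    try:
        completed = subprocess.run(command, timeout=timeout_seconds)
        return completed.returncode
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child at this point.
        return None

# Stand-in command for demonstration only; a real call would pass the
# compact command line instead.
demo_command = [sys.executable, "-c", "import time; time.sleep(5)"]
# run_with_timeout(demo_command, 1) returns None because the child is killed.
```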

Feature Request: Time Limit for Compaction also asks for this. I also found one GitHub issue asking:

[Feature request] Show progress during compaction #3397

Did you see any other change requests or useful discussions beyond my cited 2 in forum and 1 issue?

If low threshold causes churn and high threshold wastes storage, maybe an efficient way to compact under a time limit using the current algorithm (which applies the threshold to both total and volumes) would be to compact volumes in something like descending order of space wasted in specific volume.

Under a changed algorithm which looks only at volume waste, this would be sort of a natural outcome.
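A minimal sketch of that ordering in Python, assuming the wasted bytes per volume are already known. `compact_one` is a hypothetical callback standing in for whatever actually rewrites a volume; none of this is Duplicati code.

```python
import time

def compact_most_wasteful_first(volumes, compact_one, time_limit_seconds,
                                clock=time.monotonic):
    """Compact volumes in descending order of wasted bytes until time runs out.

    volumes: iterable of (name, wasted_bytes) pairs.
    compact_one: callable that compacts a single volume by name (hypothetical).
    Returns the names that were compacted, in the order they were handled.
    """
    deadline = clock() + time_limit_seconds
    done = []
    for name, _wasted in sorted(volumes, key=lambda v: v[1], reverse=True):
        if clock() >= deadline:
            break  # out of time; the remaining volumes wait for the next run
        compact_one(name)
        done.append(name)
    return done
```

The limit is only checked between volumes, so a single very large volume can still overrun it. The benefit is that whatever time is available goes to the volumes that reclaim the most space.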

onurbi · December 29, 2024, 1:04pm

Update observation under Version 2.1.0.2:
I noticed that the waiting time from start until the first line is logged has disappeared! With the former version it took approx. 1 minute until the first action occurred. This means the pure backup runs much faster now!

In the backup batch, I switched off compacting to prevent a long, surprising wait. Therefore today I used the separate compact command from the GUI. The threshold parameter is a global one, I assume. Next time I’ll use the compact command on the CLI and play around with the threshold parameter.

The logfile documents the course of compacting: Duplicati-2.1.0.2-compacting.zip (3.2 KB)

It shows no log entry between 12:03:02 and 12:25:30. Perhaps the verbosity level is not high enough. During this time I saw in the resource monitor that a 12G ZIP file (without an extension) in the TMP folder was being processed and copied.

Between 12:34:29 and 13:12:28 the main activity was operating with the SQLite-DB. 45 minutes are really long for “only” DB-processing.
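Silent stretches like these can be found mechanically. Below is a small Python sketch that assumes every log line starts with a `YYYY-MM-DD HH:MM:SS` timestamp, as in lines like `2024-12-29 12:00:05 +01 - [Information-...]`. The UTC offset is ignored on the assumption that it stays constant within one log, and the function name is made up.

```python
from datetime import datetime

def find_log_gaps(lines, min_gap_seconds):
    """Return (previous_time, next_time, gap_seconds) for each silent stretch.

    Lines that do not start with a 'YYYY-MM-DD HH:MM:SS' timestamp
    (e.g. continuation lines) are skipped.
    """
    gaps = []
    previous = None
    for line in lines:
        try:
            stamp = datetime.strptime(line[:19], "%Y-%m-%d %H:%M:%S")
        except ValueError:
            continue  # no timestamp at the start of this line
        if previous is not None:
            gap = (stamp - previous).total_seconds()
            if gap >= min_gap_seconds:
                gaps.append((previous, stamp, gap))
        previous = stamp
    return gaps

# Usage idea: report every pause of ten minutes or more.
# with open("compact.log", encoding="utf-8") as f:  # hypothetical file name
#     for start, end, seconds in find_log_gaps(f, 600):
#         print(start, "->", end, f"{seconds / 60:.1f} min")
```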

This compacting operation was the longest since I started using Duplicati: 1:12h!

I had hoped a little that the new version would bring performance enhancements for the compacting process.
Onurbi

ts678 · December 29, 2024, 2:28pm

That seems odd. Do you mean the first line which seems like just announcing a start?

2024-12-29 12:00:02 +01 - [Information-Duplicati.Library.Main.Controller-StartingOperation]: Die Operation Compact wurde gestartet [English: The operation Compact has started]

If you mean the first download, some performance work might have reduced the delay.

Blog post: Speeding up inner workings of DoCompact() by up to 1000x

The linked pull request says:

the preliminary work of determining which files to download and compact took up a considerably amount of time

This backup might be misconfigured. What’s the Options screen Remote volume size?
Refer to help link for that in the GUI. Advanced option dblock-size has a similar effect.

2024-12-29 12:00:05 +01 - [Information-Duplicati.Library.Main.Database.LocalDeleteDatabase-CompactReason]: Compacting because there are 29 small volumes and the maximum is 20
2024-12-29 12:00:05 +01 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: List - Started:  ()
2024-12-29 12:00:05 +01 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: List - Completed:  (69 Bytes)
2024-12-29 12:00:05 +01 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Get - Started: duplicati-b0b33806d8450427693296219a206d4b9.dblock.zip (12.56 GB)
2024-12-29 12:03:00 +01 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Get - Completed: duplicati-b0b33806d8450427693296219a206d4b9.dblock.zip (12.56 GB)

means that at some past time, you had remote volume size set to about 12 GB or higher.

2024-12-29 12:31:55 +01 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-ba1290403dcca401a85dee0463024488f.dblock.zip (12.34 GB)

says you still do. We discussed blocksize above. Is yours still 100 KB? That’s a lot of blocks, probably stressing the SQL and its cache (discussed too – are you now using a higher value?).

onurbi · December 29, 2024, 2:57pm

To be more precise: I mean the time from pressing Enter to start the backup batch script (compacting disabled) to the first line Duplicati writes out.

I set dblock-size=100GB. Perhaps too much. My intention was to increase the default value considerably.

ts678 · December 29, 2024, 4:12pm

I would say so. That’s 2000 times the default, and maybe causes long infrequent compacts.

I was wondering if the 12 GB file was in the 29 small files downloaded. Now more plausible:

C:\Duplicati\duplicati-2.1.0.2_beta_2024-11-29-win-x64-gui>Duplicati.CommandLine help small-file-size
  --small-file-size (Size): Volume size threshold
    When examining the size of a volume in consideration for compacting, a small tolerance value is used, by default 20 percent of the volume size. This ensures that large volumes which may have a few bytes
    wasted space are not downloaded and rewritten.

C:\Duplicati\duplicati-2.1.0.2_beta_2024-11-29-win-x64-gui>Duplicati.CommandLine help small-file-max-count
  --small-file-max-count (Integer): Maximum number of small volumes
    To avoid filling the remote storage with small files, this value can force grouping small files. The small volumes will always be combined when they can fill an entire volume.
    * default value: 20

With 100 GB remote volume size, 20 GB is where a file is considered small, per above help.
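The arithmetic can be sketched in Python like this. The defaults mirror the help text quoted above (20 percent, maximum count 20). The function itself is hypothetical, sizes are in whatever unit you pass in, and whether the real comparison is strict or inclusive at the threshold is an assumption here.

```python
def small_volume_report(volume_sizes, dblock_size, small_file_percent=20,
                        small_file_max_count=20):
    """Return (threshold, small_count, count_triggers_compact).

    A volume is treated as small when it is below `small_file_percent`
    of `dblock_size` (strict comparison assumed for this sketch).
    """
    threshold = dblock_size * small_file_percent / 100
    small_count = sum(1 for size in volume_sizes if size < threshold)
    return threshold, small_count, small_count > small_file_max_count

# With a 100 GB remote volume size the threshold is 20 GB, so a
# 12.56 GB volume counts as small; at 0.5 GB it would not.
print(small_volume_report([12.56], 100))
print(small_volume_report([12.56], 0.5))
```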

Remote volume size in the new manual warns about the impact on restore. The old manual has:

Remote Volume Size

onurbi · December 29, 2024, 4:47pm

I’ll reduce to 500M. The next compact job after the next 5 backups will show, if there is a difference.

ts678 · December 29, 2024, 5:26pm

Work will probably at least be more spread out. Sometimes SQL speed also degrades more than linearly with size, due both to algorithms and to overflowing its memory cache as I had described.

A full analysis would need much heavier logging, and possibly detailed analysis as in the linked article. A developer would probably have to lead you through what’s needed, if they wish to pursue this further.

EDIT:

SQL is always handled by a C library, so the .NET 8 speedup may be minimal. The usual way to find slow SQL is to log at profiling level to see if there are any individual queries that can be sped up, e.g. with indexes.
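As a generic illustration of what an index changes, here is a small sketch using Python's built-in sqlite3 module and SQLite's EXPLAIN QUERY PLAN. The table and index names are invented for the example and are not Duplicati's real schema.

```python
import sqlite3

def query_plan(conn, sql, params=()):
    """Return the plan detail strings SQLite reports for `sql`."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return [row[-1] for row in rows]  # last column is the readable detail

conn = sqlite3.connect(":memory:")
# Hypothetical table for illustration only.
conn.execute(
    "CREATE TABLE demo_block (id INTEGER PRIMARY KEY, hash TEXT, size INTEGER)")
sql = "SELECT size FROM demo_block WHERE hash = ?"

before = query_plan(conn, sql, ("abc",))  # full table scan expected
conn.execute("CREATE INDEX demo_block_hash ON demo_block (hash)")
after = query_plan(conn, sql, ("abc",))   # lookup via the new index expected

print(before)
print(after)
```

The exact wording of the plan text differs between SQLite versions, but the first plan should report a scan of the table and the second a search that names the index.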

What’s more exotic is to look at the reason why queries are slow. Sometimes one can count program file (database, database rollback log, database etilqs temporary file) uses in Process Explorer, or detail them in Process Monitor. Looking at drive-level activity is not enough, due to Windows caching.