Processing time for Compacting

Hi,

When compacting is necessary, processing can take a relatively long time. There are several posts concerning this issue.

OK, I didn’t read anything about it in the release notes, but is there perhaps a chance that compacting will work faster with 2.0.7.1_beta_2023-05-25?

Onurbi

I would not expect it to run faster.

By setting the env variable
CUSTOMSQLITEOPTIONS_DUPLICATI
to something like
cache_size=-200000
before starting Duplicati, you would make some SQLite queries run a bit faster, but I have no idea about the specific impact on compacting.

The index additions in 2.0.6.102_canary_2022-04-06 offer a chance, but no guarantee:

Improved database query performance, thanks @jedthe3rd

Add indexes to improve backup query performance #4687 was the code change.

Identified another slow query during backup was related discussion in the forum.

You can see the focus was backup, but the indexes might speed up SQL in other ways too.
The cache_size change is likewise a generic tool, not something aimed specifically at compact.

Ok, I’ll give it a try!

I’m looking forward to the next time compacting is triggered!

Compacting involves lots of downloading in order to do a smaller amount of uploading, in addition to using SQL. Depending on your system and network speed, one or the other of those items might be the big one.

The COMPACT command has a few options, e.g. how sensitive it is to wasted space. You can make compact run more often by setting that lower, but it might just churn more. You can certainly try things.


I’ll have a deeper look at the options.

Depending on your system and network speed, one or the other of those items might be the big one.

I see.
During a running compacting process I observed that 95% of the time the local files in the tmp folder are read and written. They are located on my M.2 SSD, on an i7 8th gen with 16 GB RAM. Besides that, some SQLite DB operations are done, of course.

It seems that the tmp file operations could be done more efficiently. That is of course a very subjective impression. :-)

What files? Ones that start with dup- are Duplicati files, e.g. downloaded dblock files. Look at size.
Ones with etilqs are Temporary Files Used By SQLite based on Duplicati SQL queries of course.

Do you do either Duplicati C# or SQL? You seem to be pretty good at watching activity. That’s good.
For SQL queries, one good approach is to log at profiling level, which says how long each query takes.

You also make the good point that drive speed is another performance factor in how fast things go…
Generally I would consider an M.2 SSD quite good, although some people have used a RAM disk.

Logging even at information level will show download speeds. Compact downloads dblocks, fills a new dblock volume up to whatever remote volume size you used (default 50 MB), uploads it, and deletes the old ones.

If you are seeing a big number of reads from etilqs files, that might be an indication that the default SQLite cache size is insufficient for the backup size. How big is it, and did you change the blocksize?

The above environment variable can raise cache size if that’s slowing things. RAM is faster than SSD.

First step is to figure out which type of tmp file does that. If SQL, see previous reply.
If Duplicati, then the challenge is how to copy parts of files without reads and writes.
If the destination is remote, then possibly the .zip files involved are also encrypted.

Do you see gaps in the reads and writes? I’m not sure how well downloads/uploads
overlap with the file-to-file copy. If badly, that might be a possible improvement area.

I/O performance to destination hasn’t been described at all, but could be a factor too.

For any processing time question, you have to try to find out what the limiting factor is in your case.

Another generic improvement to many time issues is to use a larger blocksize, as for
large backups (over 100 GB), the default 100 KB makes millions of fairly small blocks,
which slows down the SQL but also slows block copying in compacts (your concern).
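
To put rough numbers on that (my arithmetic, not a measurement): 100 GB at the default 100 KB blocksize is roughly a million blocks to track, and 1 TB is roughly ten million. Moving to a 1 MB blocksize cuts those counts by a factor of ten, which shrinks the Block table that both the backup and the compact queries have to work through.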

Depending on your source file size and your destination, an extreme form (which will
probably degrade deduplication hugely unless files are identical – you may not care)
would be to set blocksize to the remote volume size, so that there is less partial file copying.

A remote file containing only a wasted block is just deleted; there is no repacking of a remainder.

You can’t change blocksize on an existing backup though, as it’s central to everything.

I already mentioned how one can make it run more often (reduce threshold), but it’s
unclear how that would change the time per run, as the threshold applies to each volume too.
There was talk on the forum of trying to separate the two thresholds. It’s not done yet.
There are far too few qualified volunteers.

Compact - Limited / Partial

What files?

Yes, those are Duplicati files.

Do you do either Duplicati C# or SQL?

How do you mean?

although some people have used a RAM disk.

I am convinced that my 16 GB is unfortunately too little to build a RAM disk of a helpful size. But it would be great to have!

How big is it, and did you change the blocksize?

For the next run at the weekend I’ll define it. gpatel-fr wrote an example value for CUSTOMSQLITEOPTIONS_DUPLICATI of “cache_size=-200000”. Is there really a “-” in the value string? I could not find a description in the Duplicati documentation.

Do you see gaps in the reads and writes?

Do you mean time gaps? No, not really. But this is hard to observe.

to use a larger blocksize

You can’t change blocksize on an existing backup.

Good to know! That means “--blocksize=50MB” will be ignored when I add it to my batch file.

make it run more often.

I already had that feeling! More often than once a week will not work organizationally.

reduce threshold

I’ll try switching it to 10 now for a test.

In the next days I will play around with the COMPACT command. With this, I should see whether any of the parameters change the processing time.

Yes, that’s an SQLite pragma, and the “-” really belongs in the value: a negative cache_size tells SQLite to size its page cache in KiB rather than in pages, so -200000 asks for roughly 200 MB of cache.

https://www.sqlite.org/pragma.html#pragma_cache_size
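
In case it helps to see the option in isolation, here is a minimal sketch of what that pragma amounts to. This uses Microsoft.Data.Sqlite purely for illustration (it is not how Duplicati itself applies it), and the database path is made up:

  using Microsoft.Data.Sqlite;

  // Hypothetical database path, only for the example.
  using var conn = new SqliteConnection("Data Source=example.sqlite");
  conn.Open();

  using var cmd = conn.CreateCommand();
  // Yes, the "-" belongs there: a negative cache_size tells SQLite to size
  // the page cache in KiB instead of pages, so -200000 asks for ~200 MB.
  cmd.CommandText = "PRAGMA cache_size=-200000;";
  cmd.ExecuteNonQuery();

As I understand it, the environment variable just hands pragma strings like that to the database connections Duplicati opens, so the value goes in exactly as written, minus sign included.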

Specifically, that is saying that the file names begin with dup- and then have semi-random characters.
They would be about 50 MB if that’s the remote volume size, and Duplicati copies between them.

You said it could be done more efficiently. Is there a technical basis? Do you work with such code?

If this is talking about the radical idea of a huge blocksize to ease compacting, any attempt to change the blocksize of an existing backup will be rejected as invalid; it won’t be ignored. This is an experiment anyway, and the best value might need a bit of looking. It works better with large source files. Small source files might each create an entire small dblock file, which may hurt, depending on the destination.

I can’t quite read that; I think you mean the organization will not tolerate frequent compacts. Why so?
Are these systems trying to run specific backup schedules, e.g. on weekdays? Are weekends more idle?
Although there may be a risk of interference, it’s possible to compact at times when backups aren’t run.

Do you work with such code?
I developed software, yes. Perl, Python, PHP and C. I have SQL knowledge too.

Observing that one file is copied to another creates the impression that it should work faster.
But probably this is done in small portions, which takes a long time. Of course I don’t know the exact steps that are necessary for compacting.

I think you mean the organization will not tolerate frequent compacts.

That was my fault for not choosing the right words. I wanted to say that I am not able to manage a more frequent backup in general. Due to the possibility of catching ransomware, it’s more secure not to connect the external USB HDD to the laptop permanently. So I connect it once every 7 days at my work desk with the docking station to run Duplicati. Normally I use the laptop somewhere else.

The transfer of the ZIP files after compacting takes time at the expected speed.

https://github.com/duplicati/duplicati/blob/master/Duplicati/Library/Main/Operation/CompactHandler.cs

if you care to look. It looks to me like it’s reading through the blocks in the old file, running a UseBlock query against the Block table to see if each one is still in use. If so, it puts it into the new dblock .zip file, compressing the block.

The SQL in general does lots of small queries, which probably hurts performance; larger ones might be faster. There’s probably a fair amount of speed-up possible, but it takes volunteers, and those are all too few.
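
To make the shape of that loop concrete, here is a toy, in-memory sketch of the repacking idea. It is not Duplicati’s actual code: the volumes, the in-use set, and the size limit are made-up stand-ins, and in the real thing the membership test is the per-block UseBlock SQL query mentioned above.

  using System;
  using System.Collections.Generic;
  using System.Linq;

  class CompactSketch
  {
      // A block as it sits inside a downloaded dblock volume.
      record Block(string Hash, int Size);

      static void Main()
      {
          // Made-up stand-ins: two downloaded volumes, and the set of
          // block hashes the local database still considers in use.
          var oldVolumes = new List<List<Block>>
          {
              new() { new("a", 20), new("b", 30), new("c", 25) },
              new() { new("d", 40), new("e", 10) },
          };
          var inUse = new HashSet<string> { "a", "c", "d" };
          const int volumeLimit = 50;   // stand-in for the remote volume size

          var newVolumes = new List<List<Block>>();
          var current = new List<Block>();
          var currentSize = 0;

          foreach (var block in oldVolumes.SelectMany(v => v))
          {
              // In Duplicati this test is a small per-block SQL query,
              // which is where many of the small round-trips come from.
              if (!inUse.Contains(block.Hash))
                  continue;               // wasted block: simply not copied over

              if (currentSize + block.Size > volumeLimit)
              {
                  newVolumes.Add(current);    // a filled volume would be uploaded here
                  current = new List<Block>();
                  currentSize = 0;
              }
              current.Add(block);
              currentSize += block.Size;
          }
          if (current.Count > 0)
              newVolumes.Add(current);        // final, possibly small, volume

          Console.WriteLine($"Repacked {oldVolumes.Count} old volume(s) into {newVolumes.Count} new one(s).");
      }
  }

In the real operation the new volume is of course also compressed (and encrypted, if the backup is) before upload, and the old remote files are deleted afterwards.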

100 KB blocks is the default. Those who like speed at the cost of space (less deduplication) can increase it. Possibly compression will improve a bit, which might offset the reduction in deduplication. I don’t know.

I’m not talking about the radical idea of having the blocksize approach the volume size, but increasing it somewhat, e.g. to 1 MB (or more if you have more than 1 TB), might work generally faster…

I’m still not quite understanding the problem with compacting time, but as a guess, sometimes you want a rather quick backup and then go elsewhere? If you can predict a less busy time, turn on no-auto-compact (you’re not tight on external drive space, right?), and use the Compact button when a suitable time comes.

There’s not really a pressing need to compact if space is never an issue, but things will likely slow down a bit.

Oh yes, I’ll have a look. Could indeed be interesting.

I can imagine trying it in autumn when there is more time.

No, the USB drive is big enough.

turn on no-auto-compact

It is perhaps not a bad idea to switch it off generally and run compact manually when enough time for that can be calculated.

Just as an example, Restic runs compacting more often, splitting it up into smaller compact runs. Duplicati seems to maximize compaction efficiency, which also makes the compact as slow as it can be. I already wrote about this topic, so I’m not going to repeat the obvious things.

But running compaction more often is one way to reduce its duration. Yet, AFAIK, Duplicati doesn’t offer that option.

The COMPACT command

--threshold=<percent_value>
The amount of old data that a dblock file can contain before it is considered to be replaced.

I think one can turn this down so low that it runs compact continuously (oops) so don’t do that.

Command line help says more:

  --threshold (Integer): The maximum wasted space in percent
    As files are changed, some data stored at the remote destination may not
    be required. This option controls how much wasted space the destination
    can contain before being reclaimed. This value is a percentage used on
    each volume and the total storage.
    * default value: 25

and the last sentence is something that has been suggested as maybe being a bad thing, as a low threshold on the total storage seems to be the trigger. I think others know the code better than I do. Individual volumes are then tested, so if that’s true, a low value for total storage leads to lots of churning of volumes, which then drives run time back up (and possibly adds expense, depending on the storage provider’s charging algorithm). Separating these two usages might be one basic improvement; however, more complex schemes are possible. Some people even want to be able to set time limits, but that worries me if they set it so low that storage use grows forever due to lack of compacts.

I’m disagreeing that there’s no way to run compacts more often, but agreeing things need improving.

Triggers for compact

I looked at the code to find the exact logic. The triggers are evaluated purely based on the local database, so nothing needs to be downloaded to determine what needs to be compacted.

These are written as info messages to the log (I think only the live log), so you should see which one applies:

  • There are volumes containing only deleted data (this on its own only deletes those volumes and does not run a full compact).

  • The wasted space is over the threshold and there are at least two volumes over the threshold.
    The total wasted space is calculated by adding up only the wasted space of the volumes over the threshold, and it is compared to the full backup size. So this means both the volumes and the total storage are over the threshold (a simplified sketch of this combined check follows the list). This is probably why it triggers so late: if new data keeps getting added, the total size grows, which makes it more difficult to get over the threshold.
    I think there need to be at least two because otherwise there is nothing to merge and it would just shrink the volume.

  • There are small volumes (by default up to 20% of the dblock size) which add up to a full dblock size.
    The help for --small-file-size is wrong in that it does not specify a percentage but an absolute size (in MB by default). Only the default value is in percent.

  • The number of small volumes is too large (default over 20).
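
To make the combined waste check in the second bullet easier to follow, here is a simplified sketch of how I read it. This is my own reconstruction, not the actual code; the 25% default, the hard-coded minimum of two candidate volumes, and the comparison against the full backup size are taken from the description above, and I am assuming “full backup size” can be approximated by the sum of the remote volume sizes.

  using System;
  using System.Collections.Generic;
  using System.Linq;

  class WasteTriggerSketch
  {
      // Wasted and total bytes per remote dblock volume.
      record VolumeInfo(long Wasted, long Size);

      // Reconstruction of the waste-based trigger: a volume is a candidate when
      // its own wasted fraction exceeds the threshold, and compact fires only if
      // there are at least two candidates AND their combined waste is also over
      // the threshold relative to the whole backup.
      static bool WasteTriggered(IReadOnlyList<VolumeInfo> volumes, double threshold)
      {
          long fullBackupSize = volumes.Sum(v => v.Size);   // approximation, see above
          var candidates = volumes
              .Where(v => v.Size > 0 && (double)v.Wasted / v.Size > threshold)
              .ToList();
          long candidateWaste = candidates.Sum(v => v.Wasted);
          return candidates.Count >= 2
              && (double)candidateWaste / fullBackupSize > threshold;
      }

      static void Main()
      {
          // Made-up numbers: two volumes are each well over 25% waste, but a lot
          // of fresh data dilutes the total, so nothing triggers yet.
          var volumes = new List<VolumeInfo>
          {
              new(Wasted: 20, Size: 50),
              new(Wasted: 18, Size: 50),
              new(Wasted: 0, Size: 500),
          };
          Console.WriteLine(WasteTriggered(volumes, threshold: 0.25));   // False
      }
  }

This also shows why compact seems to trigger so late on a growing backup: the two small volumes stay wasteful forever, but the denominator keeps growing.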

Special behavior for auto-compact after backup or delete

Auto-compact only tests the above triggers if

  • at least one backup version was deleted, or
  • the last volume added in the current backup is small

So there might be no auto-compact if it was disabled for a previous run (one that deleted versions or added a small volume) and is enabled now, but the conditions no longer apply. Otherwise it should always trigger a compact as soon as required.

Compact process

  • Delete all volumes which only contain unused data
  • Download volumes which are too small or over threshold
  • Combine used blocks in new volumes
  • Upload new volumes once they become full
  • This is all single-threaded, unlike backup which runs parallel uploads and processing

Why is it so slow?

  • The compact operation does not utilize multiple threads for processing (although it is probably I/O bound anyway). At least networking seems to be somewhat parallel.
  • The very strict waste threshold means that, if a compact is triggered only via that condition (not intermittently by small-file compacting), it needs to download at least 25% of the entire backup.
    I don’t know why the requirement on total wasted space exists. It probably makes sense from a theoretical point of view (no need to bother compacting for a few percent of space). In practice I think that is an excessive amount of data to process in a single run. It would be better to spread it across many backups; then we would also see fewer interrupted compact operations, which can potentially corrupt the backup.
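
To put rough numbers on that last point (my arithmetic, not a measurement): with a 1 TB backup and the default threshold of 25, a compact triggered purely by the waste condition would have to download on the order of 250 GB in one run, and then repack and re-upload whatever of that is still in use, whereas a per-volume rule would let it deal with a handful of 50 MB volumes at a time.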

I would suggest changing the --threshold flag and default behavior to only look at compacted volumes. 25% seems like a reasonable number on an individual file level. If it is desired to run compact less often, an --over-threshold-max-count flag could specify that at least n volumes need to be over the threshold to trigger a compact (the default right now is hard-coded to 2).
If someone wants the current behavior back, a --threshold-total flag could be added, but maybe with a lower value than for individual volumes (and maybe as an absolute size, so it does not keep growing with the backup size).

It would be interesting to compare their logic to this one (if you know more about it); maybe there are some other things to consider. I think removing the total-space threshold requirement would mostly fix the long compact times.

Only volumes over the threshold are touched at all, so I would not say this is the case. You could set the threshold very low to maximize compaction efficiency.

Do you mean individual volumes instead of trying to work them into total waste? Thanks for looking. Changing it that way (although possibly a little disruptive) might help avoid the avalanche effect of a low threshold compounding the action by doing lots of individual volumes separately.

The counter-argument is that doing only a few volumes increases the chance of ending up with a small leftover; however, your proposed new option could guard against that. Either way, it seems like it should lead to compact running more often, with fewer volumes involved, than the current design is probably doing.
