Compact - Limited / Partial

For now the best you could do is increase the default value for --threshold, but that’s just the start trigger - there is no “only compact until 15% or less is wasted” end trigger.

--threshold
As files are changed, some data stored at the remote destination may not be required. This option controls how much wasted space the destination can contain before being reclaimed. This value is a percentage used on each volume and the total storage.
Default value: “25”

Yes, I think --threshold should do what you want.

The threshold parameter is used to determine if a volume is considered “wasteful”, so if you set it high, like 95, it should not download the dblock unless it holds 95% unused data.

If --threshold=100 is set, does that effectively disable compacting - even if a manual compact is run?

Ok, threshold=100 practically disables compacting for non-fully-deletable volumes. With that setting, deleting fully deletable volumes is still quick, but it disables threshold-based partial compacting, which then needs to be run separately.
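
For reference, a manual compact run from the command line looks roughly like this (the storage URL and database path are placeholders for your own job settings):

    Duplicati.CommandLine.exe compact <storage-URL> --dbpath=<path-to-job-database> --threshold=25

The same --threshold option can also be added to the backup job itself if you only want to change when auto-compact kicks in.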

How does Duplicati behave if I run compact separately and just kill it based on a timer? If the application works safely in segmented transactions, that should be fine. If not, something will break and create corruption or a situation which requires manual cleanup (sigh).

Actually I could test run this case: start and kill compaction on a large data set with random kill times to see if it breaks. Then again, if it’s known to work well, there’s no need for testing. Who let the chaos monkeys loose?

The worst that SHOULD happen is you have some leftover local temp files that may not have gotten cleaned up correctly.

The compacting process basically:

  • downloads the necessary remote files
  • uses the data to be kept to create fewer new compressed files (with new file names) and stores their sizes in the local database (I think it’s at this point that the downloaded local files are deleted)
  • uploads the new files
  • verifies the uploaded file sizes against the local database
  • if the sizes match then the old files are flagged for deletion
  • as part of the current (or a future) compacting / cleanup the files flagged for deletion will actually be deleted

I think (but haven’t verified) that the uploaded size verification and database flagging of deletable files is transactional (so it all succeeds or all fails as a set) but if I’m wrong then it’s possible you could kill Duplicati at just the right moment between those steps to make Duplicati not realize those old files can be deleted…


Well, the next run should clean up the temp files. Of course there’s no way to deal with temp files at the moment the process is killed, but the next run should take care of the leftovers, if the program is working correctly.

But back to the compaction process. I just tested it, and it seems that auto-compact also compacts ALL files that could be compacted, even a little bit. I would actually prefer a model where only files above the compaction limit are compacted. This would make the compaction process more efficient, because compacting files with very little to reclaim is highly inefficient. As far as I can see, there’s little to gain from doing “perfect” compacting compared to compacting only the worst offenders.

If there were two separate parameters, “compact when overall expired data is above limit %” and “compact only files with more than % wasted”, it would even allow tuning the thresholds independently. This is also one way of limiting the time compaction takes, because as stated, compacting files with very little to reclaim is the most time- and bandwidth-intensive and most wasteful part of the compaction process.
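
As a purely hypothetical sketch of what I mean (the --volume-threshold name does not exist in Duplicati today, it just illustrates the idea):

    Duplicati.CommandLine.exe compact <storage-URL> --dbpath=<path-to-job-database> --threshold=25 --volume-threshold=50

Here --threshold would stay as the overall “start compacting” trigger, and the hypothetical --volume-threshold would decide which individual dblock files are wasteful enough to be worth rewriting.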

Anyone, any thoughts about this?

I always assumed it only compacted files that met the wasted space threshold, resulting in (eventually) enough small files for them to be re-compacted into fewer files.

If you still recall after some time has passed, could you please describe the test method that indicated this? Testing 2.0.3.13 canary here only compacted dblock files that exceeded the threshold setting (default 25%), without any change to the file where only about 9% was wasted. This test used zip files I had lying around, backing up 1 and 10 MB, then 2 and 20 MB, then 3 and 15 MB. I then deleted 1, 20, and 15 MB, set the retention to 1 version, and backed up. This produced waste of 1/11, 20/22 and 15/18 for the three dblocks; however, only the two that were above the default 25% threshold were compacted. Duplicati messages said:

2018-11-06 21:43:12 -05 - [Verbose-Duplicati.Library.Main.Database.LocalDeleteDatabase-FullyDeletableCount]: Found 0 fully deletable volume(s)
2018-11-06 21:43:12 -05 - [Verbose-Duplicati.Library.Main.Database.LocalDeleteDatabase-SmallVolumeCount]: Found 1 small volumes(s) with a total size of 511 bytes
2018-11-06 21:43:12 -05 - [Verbose-Duplicati.Library.Main.Database.LocalDeleteDatabase-WastedSpaceVolumes]: Found 2 volume(s) with a total of 68.01% wasted space (34.91 MB of 51.33 MB)
2018-11-06 21:43:12 -05 - [Information-Duplicati.Library.Main.Database.LocalDeleteDatabase-CompactReason]: Compacting because there is 68.01% wasted space and the limit is 25%

A quick peek at the source seemed to say that first a list of dblock files exceeding the waste threshold is made, then the waste is computed as the sum of their waste (fixable by compact), and then waste/total is compared to the threshold. This seems supported by the rough numbers above, and by the --threshold documentation stating its dual use.
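
Working through the numbers from that test: per-volume waste is roughly 1/11 ≈ 9%, 20/22 ≈ 91% and 15/18 ≈ 83%, so only the last two dblocks exceed the 25% per-volume threshold. Their combined waste of about 20 MB + 15 MB ≈ 35 MB, measured against the roughly 51 MB total, is about 68%, which matches the “68.01% wasted space (34.91 MB of 51.33 MB)” log line. So the logged wasted-space percentage does appear to be computed only from volumes that individually exceed the threshold, divided by the total storage.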

Having a small threshold means frequent compaction, but it might not gain back a whole lot from a given file. Having a large threshold increases the chance of good per-file gains, but compact gets huge when it fires…
Or so I speculate. I haven’t worked with this much. This post is mainly to question the compact-all-files claim.

I didn’t make any specific test set, I just observed operation. But when I encountered that other thread it made me think of the dblock size thread, and it could have been related to that. It’s therefore likely that I had changed the dblock size (quite a while before the auto-compact triggered) and then simply observed the results of that much later, making me wonder why all of the files of the backup set had been rewritten.

As stated, I’ve got a three-digit number of different backup sets, and after observing those it seems that most of the old blocks are still left after several compaction rounds, which is exactly the expected result and negates the “full auto compact” issue I was describing.

Actually I checked around 20 daily backup storage sets that are several months old to confirm this: a large blob of old files, then a few medium files tracking changes and additions to the original data set, and then the “revolving set” of fresh, small files. → Works just as intended, as far as I can tell.

Yes, this is exactly the reason why I’m looking for option to limit / run partial compaction.

@JonMikelV

If I can go back to what you initially misunderstood in this discussion, I actually have a question related to database compacting:

I have a daily backup of our company storage (about 1M files) running 6 days / week with 1 year retention.
The sqlite database has grown over time to about 24 GB and it seems to be growing still.
I may be wrong but I believe this size might be reduced by compacting the database.

How should I go about compacting the database?
Do you expect this can affect the backup speed? This is currently a bit of an issue since, even after enabling check-filetime-only, it’s taking 5 hours every night.
I believe compacting the database might also benefit the backup time but this is just a guess and I would like your opinion here as well.

Thanks in advance for your help.

Ah, I forgot some other info that may help you address the point.
In order to speed-up the backup process I have already:

  • moved the sqlite database to a 512 GB SSD
  • set the tempdir on the same SSD (should help with sqlite as far as I can understand)
  • disabled encryption
  • set zip compression level to 3
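
If it helps, the last three correspond roughly to these advanced options (as far as I understand the option names):

    --tempdir=<path-on-SSD> --no-encryption=true --zip-compression-level=3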

The destination storage is on a NAS (accessed via SFTP) on the local network. Based on my tests, transfer rate does not seem to be an issue, and in any case only about 100-200 MB of data (in 50 MB dblocks) is written to the destination storage each time.

It seems to me that the bottleneck here might be either CPU usage (at 100% for the first 2-3 hours of each backup) or database access (hence the question about database compacting).


Have you tried vacuuming the database?
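
From the command line it would look something like this (the storage URL and database path are placeholders for your own job):

    Duplicati.CommandLine.exe vacuum <storage-URL> --dbpath=<path-to-job-database>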

@mikaelmello

I did not realize until half an hour ago that there was such an option; then, looking at SQLite commands to compact a database, I learned about VACUUM, and with this hint I realized that the option was available in Duplicati as well.

I started the vacuum command (running Duplicati.CommandLine.exe vacuum URL with the --dbpath option specified; I hope this is the correct procedure) exactly 11 minutes ago. Given the db size, I expect this to take some 3 hours to complete.

I’ll let you know how this goes…

Thanks for the pointer!


Hope it all goes well, keep us posted!

Back again

Surprisingly, it took just 21 minutes to complete. The db size didn’t change much: from 24 GB to 21 GB.

I’ll let you know tomorrow how this affected the next backup, scheduled to start at 20:00 this evening.
Usually it takes about 5 hours (it used to be 16 hours before moving the db to an SSD and setting check-filetime-only).

Thanks again!

The backup process tonight took 4 hours 20 minutes, not a sizeable improvement compared to the 5 hours it took before.
In any case, thanks for helping with the vacuum command. I guess I’ll have to look elsewhere to find the reason for the long backup times.
What I don’t understand is that the backup starts at 20:00 and I don’t know what it’s doing for the first 2 hours:

    2019-01-08 20:00:00 +01 - [Information-Duplicati.Library.Main.Controller-StartingOperation]: The operation Backup has started,
    2019-01-08 22:04:47 +01 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: List - Started:  (),
    2019-01-08 22:05:04 +01 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: List - Completed:  (12.55 KB),
    2019-01-08 23:48:02 +01 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-ba1332b4229164271a20b379c1dd4a401.dblock.zip (49.95 MB),
    2019-01-08 23:48:22 +01 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-ba1332b4229164271a20b379c1dd4a401.dblock.zip (49.95 MB),

Sorry I didn’t get back to you sooner, but it wouldn’t have mattered since @mikaelmello already covered what I would have suggested.

The question of what’s going on for that first hour is valid - but my guess is it’s doing something like backend validation but not logging it at the normal log level.

I’d recommend either watching the ‘lastPgEvent’ block (bottom of the About -> “System info” tab) during that period or adding --log-file and --log-level=profiling to the job just long enough to see what’s going on then.
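
For the logging route, the extra job options would look something like this (the log file path is just an example):

    --log-file=C:\duplicati\profiling.log --log-level=profiling

Just remember to remove them (or lower the level) again afterwards, since profiling logs grow quickly.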

(BTW - I edited your previous post by putting “~~~” around the logs to make them easier to read.)

@JonMikelV Sorry for the delay in my reply, I was on a trip. And thanks for editing my previous post: it really was unreadable!
I can’t watch the log while it’s running, so I’ll set up the logging and report back.

@JonMikelV

So I had the log running overnight and here come the results:

  • From 20:00 to 21:30 it basically listed all the files in the folders to be backed up, saying for each one of them:
[Verbose-Duplicati.Library.Main.Operation.Backup.FileEnumerationProcess-IncludingPath]: Including path as no filters matched: /path/filename
  • Then from 21:30 to 22:03 many SQL queries as follows (just 1 example):
[Profiling-Timer.Finished-Duplicati.Library.Main.Database.ExtensionMethods-ExecuteScalarInt64]: ExecuteScalarInt64: SELECT COUNT(*) FROM (SELECT DISTINCT "Path" FROM (
  • And then the “real” backup started, skipping files that had not changed, and so on

All the rest, I believe, is pretty much straightforward…

I just don’t know if there’s anything I’m doing wrong that causes the first hour and a half to be spent doing something I’m not totally sure is necessary.

In any case thanks for your help.

Regards!