Compact - Limited / Partial

Sami_Lehtinen · May 13, 2018, 6:00am

Is there a way to run partial / limited compact?

If not, what’s the recommended way of dealing with huge backup sets where standard compaction takes ages?

My suggestion is to separate the compaction filter level (what gets compacted) from the compaction trigger (when compaction starts).

This would allow running compaction a bit more often, without compacting files that contain only a small amount of deletable data.

JonMikelV · May 17, 2018, 3:29pm

As far as I know there’s just compact - no “levels” of compacting are available.

What version of Duplicati are you running?
When is the compacting happening (automatic, manual, etc.)?
How big is your .sqlite file?
How big and how many files are your Source folders?

Sami_Lehtinen · May 18, 2018, 11:47am

2.0.2.1_beta_2017-08-01
Automatic (25%, default)
Around 500 megs
~80 gigs and ~100k files

Most of the bytes are in a few large files which are slightly modified daily. There’s also a bunch of ‘static’ files, and then of course logs and similar files which are rotated; those make up a large part of the small files, and they come and go. I guess it’s quite a typical scenario, except that the large files are (slightly) modified daily.

Edit: Continued…

Most of the time spent on compaction goes to downloading the b files, which are then largely ignored; only a (very) small amount of remaining data is migrated to the new archive. With a long retention history and four backups daily, the amount of data stored before compaction triggers is quite large. That’s the root cause. If I lower the threshold there will be more upload, even if it runs more often. But if there were a minimum and a maximum threshold, they could be used to make the process more efficient: it triggers at 25%, starts from the oldest blocks with over 25% waste, and stops when a 15% average / total is reached. -> Compaction would run more often.
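To make the start/stop idea concrete, here is a minimal Python sketch of such a selection. It is not Duplicati code: the `Volume` fields, the `age_rank` ordering key and the function name are made up for illustration, and the 25% / 15% defaults are just the numbers from this post.

```python
from dataclasses import dataclass


@dataclass
class Volume:
    name: str
    age_rank: int     # 0 = oldest (hypothetical ordering key)
    data_size: int    # bytes stored in the volume
    wasted_size: int  # bytes no longer needed by any backup version


def select_for_partial_compact(volumes, start_pct=25.0, stop_pct=15.0):
    """Pick wasteful volumes oldest-first until overall waste drops to stop_pct."""
    total = sum(v.data_size for v in volumes)
    wasted = sum(v.wasted_size for v in volumes)
    if total == 0 or wasted / total * 100 < start_pct:
        return []  # start trigger not reached, nothing to do

    # Only volumes that are individually above the start threshold qualify.
    candidates = sorted(
        (v for v in volumes
         if v.data_size and v.wasted_size / v.data_size * 100 >= start_pct),
        key=lambda v: v.age_rank,
    )

    selected = []
    for v in candidates:
        if total and wasted / total * 100 <= stop_pct:
            break  # stop trigger reached, leave the rest for a later run
        selected.append(v)
        # Rewriting a volume removes its wasted bytes from the destination.
        total -= v.wasted_size
        wasted -= v.wasted_size
    return selected
```

With four 100-byte volumes wasting 90, 80, 30 and 0 bytes (oldest first), overall waste is 50%, so the trigger fires; after the first two volumes are selected the projected waste is about 13%, and the third volume is left alone.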

JonMikelV · May 20, 2018, 2:27am

I know there are backups out there with much larger sqlite files and source folders so Duplicati is certainly able to handle such things, but I’m not sure why it’s giving you problems.

A number of performance changes were made between 2.0.2.1 beta and 2.0.3.3 beta - would you consider updating to 2.0.3.3 to see if that improves things for you?

Sami_Lehtinen · May 22, 2018, 11:51am

I think the compact performance isn’t database related. It’s file transfer related.

I’ve got a feeling you’re missing the actual reason for the issue I’m experiencing. Let’s rephrase it into simple FAQ format:

Q: “How do I efficiently limit the (amount of data being downloaded during | duration) of a single compaction session?”

Thanks

JonMikelV · May 25, 2018, 5:15pm

Yep - you’re right. I was thinking database compact when you meant destination archive compacting (as in the job menu “Compact now” link). Sorry!

I don’t know that you can do much to reduce the overhead associated with archive compacting - all the small or sparse files have to be downloaded so they can be re-compressed into fewer large or fully utilized files.

About the best you can do is reduce its frequency by increasing the --small-file-max-count or decreasing the --small-file-size setting. Of course this likely won’t make it run faster when the compacting does happen - but it should happen less frequently. (Does this count as the “compaction trigger” you mentioned in your first post?)

--small-file-max-count
To avoid filling the remote storage with small files, this value can force grouping small files. The small volumes will always be combined when they can fill an entire volume.
Default value: “20”

--small-file-size
When examining the size of a volume in consideration for compacting, a small tolerance value is used, by default 20 percent of the volume size. This ensures that large volumes which may have a few bytes wasted space are not downloaded and rewritten.
Default value: “”
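As a rough illustration of how those two settings interact, here is a small Python sketch based only on my reading of the option descriptions quoted above. It is not the actual Duplicati logic: the function name is invented, the 50 MiB volume size is an assumed value, and whether the "can fill an entire volume" comparison is inclusive is also an assumption.

```python
def small_volume_check(volume_sizes, volume_size=50 * 1024**2,
                       small_file_size=None, small_file_max_count=20):
    """Decide whether small volumes alone would justify a compact (sketch)."""
    if small_file_size is None:
        # Per the option text: by default 20 percent of the volume size.
        small_file_size = volume_size * 20 // 100

    small = [s for s in volume_sizes if s <= small_file_size]
    too_many = len(small) > small_file_max_count   # --small-file-max-count
    can_fill = sum(small) >= volume_size           # enough to fill a volume
    return {
        "small_count": len(small),
        "small_bytes": sum(small),
        "should_compact": too_many or can_fill,
    }
```

In this sketch, raising `small_file_max_count` or lowering `small_file_size` both make `should_compact` true less often, which is the "less frequently" effect described above.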

Just for reference for anybody else reading through this, here’s the docs for command line compact and here’s a little blurb about automated compacting.

Sami_Lehtinen · June 1, 2018, 8:46am

Small file count and size are fine for us. That’s not the root cause.

The reason for the long (compaction) execution time is the large number of large block files which contain only a “small” amount of data that gets migrated into new blocks when pruning. Because our backups are database backups, it’s highly unlikely that a whole block file becomes obsolete and gets deleted. But most of the data in that block file does become obsolete. This leads to a situation where, when compaction is triggered, there’s a large number of large files to be downloaded for processing, yet the amount of data being uploaded back in new blocks is relatively small.

I guess this is different for systems where files are usually completely rewritten, which allows block files to be deleted as soon as they expire.

That’s why I would love to have a way to trigger “partial” compaction, which would be limited in some way: by time, number of blocks, or target threshold. Start compaction at 25% and stop when 15% is reached, or so.
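A "limited" run of this kind could be sketched as a simple budget on the candidate list. The Python below is hypothetical and not an existing Duplicati feature: `pick_within_budget` and its parameters are invented names, and a download-byte budget stands in for a time limit, assuming transfer time dominates.

```python
def pick_within_budget(volumes, max_volumes=None, max_download_bytes=None):
    """volumes: iterable of (name, size_bytes, wasted_bytes) tuples.

    Returns (picked_names, bytes_to_download), worst offenders first.
    """
    # Rank by wasted fraction so each downloaded byte reclaims as much as possible.
    ranked = sorted(
        volumes,
        key=lambda v: v[2] / v[1] if v[1] else 1.0,
        reverse=True,
    )
    picked, downloaded = [], 0
    for name, size, _wasted in ranked:
        if max_volumes is not None and len(picked) >= max_volumes:
            break  # block-count budget used up
        if max_download_bytes is not None and downloaded + size > max_download_bytes:
            break  # transfer budget used up
        picked.append(name)
        downloaded += size
    return picked, downloaded
```

Whatever is not picked simply stays a candidate for the next session, so several short runs would eventually cover the same volumes as one long run.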

JonMikelV · June 4, 2018, 2:28am

For now the best you could do is increase the default value for --threshold, but that’s just the start trigger - there is no “only compact until 15% or less is wasted” end trigger.

--threshold
As files are changed, some data stored at the remote destination may not be required. This option controls how much wasted space the destination can contain before being reclaimed. This value is a percentage used on each volume and the total storage.
Default value: “25”

kenkendk · June 8, 2018, 9:40am

Yes, I think --threshold should do what you want.

The threshold parameter is used to determine if a volume is considered “wasteful”, so if you set it high, like 95, it should not download the dblock unless it holds 95% unused data:

github.com/duplicati/duplicati/blob/master/Duplicati/Library/Main/Database/LocalDeleteDatabase.cs#L193

private readonly long m_wastethreshold;
private readonly long m_volsize;
private readonly long m_maxsmallfilecount;

public CompactReport(long volsize, long wastethreshold, long smallfilesize, long maxsmallfilecount, IEnumerable<VolumeUsage> report)
{
    m_report = report;

    m_cleandelete = (from n in m_report where n.DataSize <= n.WastedSize select n).ToArray();
    m_wastevolumes = from n in m_report where ((((n.WastedSize / (float)n.DataSize) * 100) >= wastethreshold) || (((n.WastedSize / (float)volsize) * 100) >= wastethreshold)) && !m_cleandelete.Contains(n) select n;
    m_smallvolumes = from n in m_report where n.CompressedSize <= smallfilesize && !m_cleandelete.Contains(n) select n;

    m_wastethreshold = wastethreshold;
    m_volsize = volsize;
    m_maxsmallfilecount = maxsmallfilecount;

    m_deletablevolumes = m_cleandelete.Count();
    m_fullsize = report.Select(x => x.DataSize).Sum();

    m_wastedspace = m_wastevolumes.Select(x => x.WastedSize).Sum();
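For readers who don't read LINQ, the three queries in the excerpt above can be paraphrased in Python roughly as follows. This is only a restatement of the quoted lines for experimenting with thresholds; the `VolumeUsage` tuple and the `classify` function are stand-ins invented here, and anything outside the excerpt is not covered.

```python
from collections import namedtuple

# Stand-in for the VolumeUsage type used in the excerpt (field names adapted).
VolumeUsage = namedtuple("VolumeUsage", "name data_size wasted_size compressed_size")


def classify(report, volsize, wastethreshold, smallfilesize):
    """Return (clean_delete, waste_volumes, small_volumes) like the quoted queries."""
    # Everything in the volume is wasted: it can be deleted without rewriting.
    clean_delete = [n for n in report if n.data_size <= n.wasted_size]

    # Wasteful relative to its own data OR relative to the configured volume size.
    # The clean_delete test runs first, so data_size is never zero in the division.
    waste = [
        n for n in report
        if n not in clean_delete
        and (n.wasted_size / n.data_size * 100 >= wastethreshold
             or n.wasted_size / volsize * 100 >= wastethreshold)
    ]

    small = [
        n for n in report
        if n.compressed_size <= smallfilesize and n not in clean_delete
    ]
    return clean_delete, waste, small
```

In this paraphrase, a threshold of 100 leaves the waste list empty for any volume that is no larger than `volsize`, because a volume whose waste equals its data already falls into `clean_delete`.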

JonMikelV · June 9, 2018, 6:15am

If --threshold=100 is set, does that effectively disable compacting - even if a manual compact is run?

Sami_Lehtinen · August 14, 2018, 11:02am

OK, --threshold=100 practically disables compacting for volumes that are not fully deletable. After this, deleting fully deletable volumes is quick. But it also disables threshold-based partial compacting, which then needs to be run separately.

How does Duplicati behave if I run compact separately and just kill it based on a timer? If the application works safely in segmented transactions, that should work. If not, something will break and cause corruption or a situation which requires manual cleanup (sigh).

Actually, I could test this case myself: start compaction on a large data set and kill it at a random time to see if it breaks. Yet if it’s known to work well, there’s no need for testing. Who let the chaos monkeys loose?
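A kill-on-timer test harness along those lines might look like the Python sketch below. It only shows the generic "run a command, kill it after a time limit" part; `run_with_time_limit` is an invented helper, `SLEEPER` is a harmless stand-in command, and whether killing a real compact is safe is exactly the open question here, so this should only be pointed at a disposable test backup.

```python
import subprocess
import sys

# Harmless stand-in for a long-running job, used to try the helper out.
SLEEPER = [sys.executable, "-c", "import time; time.sleep(30)"]


def run_with_time_limit(cmd, limit_seconds):
    """Run cmd and hard-kill it if it runs longer than limit_seconds.

    Returns (finished_in_time, return_code).
    """
    proc = subprocess.Popen(cmd)
    try:
        proc.wait(timeout=limit_seconds)
        return True, proc.returncode
    except subprocess.TimeoutExpired:
        proc.kill()  # abrupt termination, like an external watchdog
        proc.wait()  # reap the process so it does not linger
        return False, proc.returncode
```

For the chaos-monkey variant, the limit could be drawn with `random.uniform(...)` on each round and the command replaced by the real compact invocation for the test job, followed by a normal backup and verification run to see whether anything broke.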

JonMikelV · August 16, 2018, 3:14pm

The worst that SHOULD happen is you have some leftover local temp files that may not have gotten cleaned up correctly.

The compacting process basically:

1. Downloads the necessary remote files.
2. Uses the data to be kept to create fewer new compressed files (with new file names) and stores their sizes in the local database (I think it’s at this point that the downloaded local files are deleted).
3. Uploads the new files.
4. Verifies the uploaded file sizes against the local database.
5. If the sizes match, flags the old files for deletion.
6. As part of the current (or a future) compacting / cleanup, actually deletes the files flagged for deletion.

I think (but haven’t verified) that the uploaded size verification and database flagging of deletable files is transactional (so it all succeeds or all fails as a set) but if I’m wrong then it’s possible you could kill Duplicati at just the right moment between those steps to make Duplicati not realize those old files can be deleted…
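To show why an interruption should at worst leave extra files behind, here is a toy in-memory model of the sequence described above, written in Python. It is a sketch of my description, not of Duplicati's implementation: the dictionaries standing in for the remote store and the local database, the function names, and the single-output-volume simplification are all assumptions.

```python
def compact(remote, old_names, live_blocks, db, new_name):
    """remote: {volume_name: {block_id: bytes}}; db: {"sizes": {}, "deletable": set()}."""
    # Download the volumes selected for compacting.
    downloaded = {name: dict(remote[name]) for name in old_names}

    # Keep only blocks that are still needed and record the new volume's size locally.
    kept = {
        block_id: data
        for volume in downloaded.values()
        for block_id, data in volume.items()
        if block_id in live_blocks
    }
    db["sizes"][new_name] = sum(len(data) for data in kept.values())

    # Upload the new volume under a new name; the old volumes are still present.
    remote[new_name] = kept

    # Verify the uploaded size against the local record before touching old files.
    uploaded_size = sum(len(data) for data in remote[new_name].values())
    if uploaded_size != db["sizes"][new_name]:
        raise RuntimeError("size mismatch, old volumes are kept")

    # Only now flag the old volumes; nothing has been deleted yet.
    db["deletable"].update(old_names)


def cleanup(remote, db):
    """Delete volumes that an earlier step flagged as deletable."""
    for name in sorted(db["deletable"]):
        remote.pop(name, None)
    db["deletable"].clear()
```

In this model, stopping anywhere before the flagging step leaves every old volume in place, and stopping between flagging and `cleanup` leaves both old and new volumes on the remote until a later cleanup runs.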

JonMikelV · October 16, 2018, 12:50pm

3 posts were split to a new topic: Bandwidth limits for compact process

Sami_Lehtinen · October 20, 2018, 6:26am

Well, the next run should clean up the temp files. Of course there’s no way to deal with temp files if the process is killed, but the next run should take care of leftovers if the program is working correctly.

But back to the compaction process. I just tested it, and it seems that auto-compact also compacts ALL files that could be compacted, even a little bit. I would actually prefer a model where only files above the compaction limit are compacted. This would make the compaction process more efficient, because compacting files with very little to compact is highly inefficient. As far as I can see, there’s little to gain from “perfect” compacting as opposed to compacting only the worst offenders.

There could be two separate parameters: one for “compact when overall expired data is above limit %” and one for “compact only files with more than % wasted”. That would allow tuning the two thresholds independently. This is also one way of limiting the time compaction takes because, as stated, compacting files with very little to compact is the most time- and bandwidth-intensive and most wasteful step of the compaction process.
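The inefficiency can be put into numbers with a bit of arithmetic, assuming (as discussed earlier in the thread) that a volume has to be downloaded in full and its remaining live data uploaded again. The function names below are just for this illustration.

```python
def download_per_reclaimed_byte(wasted_pct):
    """Bytes downloaded for every byte of space reclaimed from one volume."""
    if not 0 < wasted_pct <= 100:
        raise ValueError("wasted_pct must be in (0, 100]")
    return 100.0 / wasted_pct


def upload_per_reclaimed_byte(wasted_pct):
    """Bytes re-uploaded (the live data) for every byte reclaimed."""
    if not 0 < wasted_pct <= 100:
        raise ValueError("wasted_pct must be in (0, 100]")
    return (100.0 - wasted_pct) / wasted_pct
```

A volume with 5% waste costs 20 bytes of download and 19 bytes of upload per reclaimed byte, while one with 50% waste costs 2 and 1, so the per-file filter matters far more for transfer volume than the overall trigger does.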

Anyone, any thoughts about this?

JonMikelV · November 6, 2018, 1:43pm

I always assumed it only compacted files that met the wasted space threshold, resulting in (eventually) enough small files for them to be re-compacted into fewer files.

ts678 · November 7, 2018, 4:20am

If you still recall after some time has passed, could you please describe the test method that indicated this? Testing 2.0.3.13 canary here only compacted dblock files that exceeded the threshold setting (default 25%), without any change to the file where only about 9% was wasted. This test used zip files I had lying around, backing up 1 and 10 MB, then 2 and 20 MB, then 3 and 15 MB. I then deleted 1, 20, and 15 MB and set the retention to 1 version, then backed up. This produced waste of 1/11, 20/22 and 15/18 for the three dblocks, however only the two that were above the default 25% threshold were compacted. Duplicati messages said:

2018-11-06 21:43:12 -05 - [Verbose-Duplicati.Library.Main.Database.LocalDeleteDatabase-FullyDeletableCount]: Found 0 fully deletable volume(s)
2018-11-06 21:43:12 -05 - [Verbose-Duplicati.Library.Main.Database.LocalDeleteDatabase-SmallVolumeCount]: Found 1 small volumes(s) with a total size of 511 bytes
2018-11-06 21:43:12 -05 - [Verbose-Duplicati.Library.Main.Database.LocalDeleteDatabase-WastedSpaceVolumes]: Found 2 volume(s) with a total of 68.01% wasted space (34.91 MB of 51.33 MB)
2018-11-06 21:43:12 -05 - [Information-Duplicati.Library.Main.Database.LocalDeleteDatabase-CompactReason]: Compacting because there is 68.01% wasted space and the limit is 25%

A quick peek at the source seemed to say that first a list of dblock files exceeding the waste threshold is made, then the waste is computed as the sum of their waste (fixable by compact), then waste/total is compared to the threshold. This seems supported by the rough numbers above, and by the --threshold documentation stating dual use.
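The rough numbers can be checked with a few lines of Python. The sizes are the approximate MB figures from the test description, not exact file sizes, and the function is only a restatement of the reading of the source given above.

```python
def overall_waste(volumes_mb, threshold_pct=25):
    """volumes_mb: {name: (total_mb, wasted_mb)}.

    Returns (names over the per-volume threshold, overall wasted percentage).
    """
    over = [
        name for name, (total, wasted) in volumes_mb.items()
        if wasted / total * 100 >= threshold_pct
    ]
    # Only waste in over-threshold volumes counts, measured against all data.
    wasted_sum = sum(volumes_mb[name][1] for name in over)
    total_sum = sum(total for total, _ in volumes_mb.values())
    return over, wasted_sum / total_sum * 100


# Approximate figures from the test: waste of 1/11, 20/22 and 15/18.
rough = {"dblock-1": (11, 1), "dblock-2": (22, 20), "dblock-3": (18, 15)}
over, pct = overall_waste(rough)
```

Here `over` holds the second and third dblock, and `pct` comes out a little under 69%, close to the 68.01% (34.91 MB of 51.33 MB) that the log reports from the exact sizes.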

Having a small threshold means frequent compaction, but it might not gain back a whole lot from a given file. Having a large threshold increases the chance of good per-file gains, but compact gets huge when it fires…
Or so I speculate. I haven’t worked with this much. This post is mainly to question the compact-all-files claim.

Sami_Lehtinen · November 27, 2018, 9:21am

I didn’t make any specific test set, I just observed the operation. But when I encountered that other thread, it made me think about the dblock size thread; it could have been related to it. It’s likely that I had changed the dblock size (quite a while before the auto compact triggered) and then just observed the results much later, making me wonder why all of the files of the backup set had been rewritten.

As stated, I’ve got a three-digit number of different backup sets, and after observing those, it seems that most of the old blocks are still left after several compaction rounds, which is exactly the expected result and negates the “full auto compact” issue I was describing.

Actually, I checked around 20 daily backup storage sets that are several months old to confirm this: a large blob of old files, then a few medium files tracking changes and additions to the original data set, and then the “revolving set” of fresh and small files. → Works just as intended afaik.

Yes, this is exactly the reason why I’m looking for option to limit / run partial compaction.

Amanz · January 8, 2019, 8:53am

@JonMikelV

If I can go back to what you initially misunderstood in this discussion, I actually have a question related to database compacting:

I have a daily backup of our company storage (about 1M files) running 6 days / week with 1 year retention.
The sqlite database has grown over time to about 24 GB and it seems to be growing still.
I may be wrong but I believe this size might be reduced by compacting the database.

How should I go about compacting the database?
Do you expect this can affect the backup speed? This is currently a bit of an issue since, even after enabling check-filetime-only, it’s taking 5 hours every night.
I believe compacting the database might also benefit the backup time but this is just a guess and I would like your opinion here as well.

Thanks in advance for your help.

Amanz · January 8, 2019, 9:38am

Ah, I forgot about other info that may help you addressing the point.
In order to speed-up the backup process I have already:

moved the sqlite database to a 512 GB SSD
set the tempdir on the same SSD (should help with sqlite as far as I can understand)
disabled encryption
set zip compression level to 3

The destination storage is on a NAS (with SFTP) on the local network. Based on my tests the transfer rate seems not to be an issue, and anyway just about 100-200 MB of data (in 50 MB dblocks) seems to be written to the destination storage every time.

It seems to me that the bottleneck here might be either the CPU usage (at 100% for the first 2-3 hours during each backup) or the database access (hence the question about database compacting).

mikaelmello · January 8, 2019, 11:31am

Have you tried vacuuming the database?
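In case it helps, vacuuming is a standard SQLite operation, and a minimal Python sketch of it is below. This is generic SQLite usage rather than a Duplicati feature: `vacuum_sqlite` is a name made up here, the path you pass in is up to you, and it assumes Duplicati is not running against the file (trying it on a copy first is the cautious route). SQLite may need free disk space of up to about twice the database size while VACUUM runs.

```python
import os
import sqlite3


def vacuum_sqlite(path):
    """Run VACUUM on the SQLite file at `path`.

    Returns (size_before, size_after) in bytes.
    """
    before = os.path.getsize(path)
    # isolation_level=None means autocommit; VACUUM cannot run inside a transaction.
    conn = sqlite3.connect(path, isolation_level=None)
    try:
        conn.execute("VACUUM")
    finally:
        conn.close()
    return before, os.path.getsize(path)
```

Comparing the two returned sizes shows how much space was only free pages; whether a smaller file also shortens the nightly backup is something only a test run can tell.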