Best compression

Thanks for the additional info.

The research paper I cited tested the Silesia Compression Corpus, which includes the Samba source code, so it might be similar to your situation, except that you have short files instead of a tar file. It's hard to find better reports that cover all the compressors Duplicati offers (I didn't try searching for subsets) and that also test small files.

A random question in what appears to be a data compression forum drew this answer:

Yes, bzip2 can be better than lzma for small text files (<=900kb).
In these cases, ppmd would be likely even better.

In the paper I cited, the Samba source tar file compressed a tiny bit better with PPMd than with LZMA; Bzip2 was quite a bit worse, though still a bit better than Deflate. There were speed differences, but you say speed doesn't matter.

Regarding deduplication, Duplicati deduplicates on fixed blocks, with a default 100 KB blocksize that is also the compression block size. This means the compressor gets small files, one per block (unless you increase the blocksize). I think I once looked at whether such short blocks hurt compression, but didn't come to a firm conclusion.
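In case it helps to picture it, here's a minimal Python sketch of fixed-block dedup under those assumptions (100 KB blocks, one hash per block). It's my own illustration, not Duplicati's actual code, and the SHA-256 choice is just for the example:

```python
# Rough sketch of fixed-block dedup; not Duplicati's actual code.
# Each file is cut into fixed-size blocks; only blocks whose hash hasn't
# been seen before get stored (and compressed) again.
import hashlib

BLOCKSIZE = 100 * 1024  # Duplicati's default blocksize

def dedup_blocks(paths, seen=None):
    seen = set() if seen is None else seen
    new_blocks = []
    for path in paths:
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCKSIZE)
                if not block:
                    break
                digest = hashlib.sha256(block).digest()
                if digest not in seen:        # only unseen blocks are kept
                    seen.add(digest)
                    new_blocks.append(block)
    return new_blocks
```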

Choosing sizes in Duplicati talks about the topic. As a rule of thumb, I try to stay below 1 million blocks in a backup (otherwise the database slows down). Larger blocks seem like they might compress better but deduplicate worse. Larger backups might deduplicate better, but if something breaks, it's a bigger repair job or a bigger loss.
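To put numbers on that rule of thumb (my own back-of-envelope arithmetic, nothing official): at the 100 KB default, about 100 GB of unique source data already reaches roughly 1 million blocks, so bigger backups tend to need a bigger blocksize.

```python
# Back-of-envelope block counts; my estimates, not official guidance.
def blocks_needed(source_bytes, blocksize=100 * 1024):
    return -(-source_bytes // blocksize)   # ceiling division

print(blocks_needed(100 * 1024**3))        # 100 GB at 100 KB -> ~1,048,576 blocks
print(blocks_needed(1024**4, 1024**2))     # 1 TB at 1 MB     -> ~1,048,576 blocks
```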

Totally identical files will deduplicate perfectly, and files that only get appends will do very well. Changes in the middle will shift the fixed block boundaries and hurt deduplication; some other backup tools vary the block boundaries (content-defined chunking) to avoid that.
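Here's a toy demonstration of why appends dedupe well and mid-file edits don't, using an unrealistically small 8-byte block just to make the shift visible (my own demo, not Duplicati code):

```python
# Toy demo: a mid-file insertion shifts every later fixed block.
BLOCK = 8  # tiny block size purely for demonstration

def blocks(data: bytes):
    return [data[i:i + BLOCK] for i in range(0, len(data), BLOCK)]

old = b"ABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
appended = old + b"-extra"            # append: earlier blocks unchanged
edited = old[:10] + b"!" + old[10:]   # one byte inserted in the middle

def shared(a, b):
    return sum(1 for x, y in zip(blocks(a), blocks(b)) if x == y)

print("append keeps", shared(old, appended), "of", len(blocks(old)), "blocks")
print("mid-insert keeps", shared(old, edited), "of", len(blocks(old)), "blocks")
```

With fixed boundaries, the append keeps 4 of 5 blocks, while the one-byte insertion keeps only the first block; content-defined chunking picks boundaries from the data itself, so an insertion only disturbs the blocks around the edit.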

Purging files from all versions is possible, but it takes quite a bit of caution, and the UI for it seems pretty poor.
Will Duplicati Delete Newly-Ignored Directories? has some links and comments to give you a feel for that.

You might care about decompression performance for a total-loss restore, but for a smaller revert, Duplicati will scan the original source path for blocks it can reuse, and will scan the restore target path for blocks that are already in place (which is likely most of them), so there might not be much fetching of backup files to extract blocks.
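As a rough mental model of that restore behavior (hypothetical names, not Duplicati's actual code), each needed block is looked for locally before anything is downloaded:

```python
# Hypothetical sketch of block reuse during a restore; not Duplicati's code.
import hashlib

def get_block(needed_hash, local_candidates, fetch_remote):
    """Return the block's bytes, preferring local copies over remote fetches.

    local_candidates: byte strings read from the restore target and the
    original source path; fetch_remote: callback that downloads from backup.
    """
    for data in local_candidates:
        if data is not None and hashlib.sha256(data).digest() == needed_hash:
            return data                    # block already available locally
    return fetch_remote(needed_hash)       # only fetch what can't be found locally
```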

The current Beta is likely reliable enough for “not really critical backups”, but I always tell people that archive-then-delete is not a great idea with Duplicati. Here’s where I started down that path yesterday:

Duplicati have possibilities for deleting folders&files after complite archivation?

However, your use sounds like Duplicati would not be the long-term archival store; you just want to get files out of the Duplicati “best-effort” backup someday, rather than relying on it as a permanent store that offers simple retrieval of files that are long gone (maybe deleted at different times), which would call for special reliability and UIs.

Back to the original question: you probably need to benchmark a bit (or try some further web searching).
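If you want a quick local benchmark on your own small files, something like this works with just the Python standard library (Deflate, bzip2, and LZMA at maximum settings; PPMd isn't in the standard library, so that one would need 7-Zip or a third-party module). The file names on the command line are whatever you pick:

```python
# Compare compressed sizes of the first 100 KB of each file given on the
# command line, roughly mimicking Duplicati's per-block compression input.
import bz2, lzma, pathlib, sys, zlib

def sizes(data: bytes) -> dict:
    return {
        "deflate": len(zlib.compress(data, 9)),
        "bzip2":   len(bz2.compress(data, 9)),
        "lzma":    len(lzma.compress(data, preset=9)),
    }

for name in sys.argv[1:]:
    data = pathlib.Path(name).read_bytes()[:100 * 1024]   # one 100 KB block
    print(name, len(data), sizes(data))
```

Run it as `python compare.py file1 file2 …` over a decent sample of your real files; that should answer the bzip2-vs-LZMA question for your data better than any generic corpus.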