Deduplication - why is Duplicati performing so poorly

Disclaimer -
not here to start flame war on which is better / worse; just want to curious on technical reasons.

What is happening -
Duplicati seems to not be really deduplicating files which are theorectically 100% deduplicatable (is that a word?)

Background -
Doing some benchmarking / comparison work as documented here (https://forum.duplicati.com/t/big-comparison-borg-vs-restic-vs-arq-5-vs-duplicacy-vs-duplicati)

When researching found the following thread on Restic board which I thought would be useful to add to test suite (https://forum.restic.net/t/dedup-only-0-3-efficient-on-100-duplicate-data)

Observation / Steps to Replicate -

As per thread linked above - it is a back-to-back backup of component files and then a combined file comprised of urandom files of the following sizes:

  1. 100 * 1MB
  2. 16 * 8MB
  3. 4 * 128MB

The backup size was compared after each run. As the combined file is simply a concatenation of the components, a perfect dedup algorithm could attain 100% efficiency (and no changes in the backup size sans small metadata impacts)

Duplicati seems to be having real difficulty. For the smaller sizes; all programs struggle but Duplicati clearly still behind. For the last, the file sizes are big enough that dedup kicks in for others to a virtually full extent but Duplicati is still struggle. Given that last is 128MB files - it can’t be because of chunks exceeding the file size right?

Results in the below. I ran Duplicati twice coz the result seemed out of line but it turns out to be consistent

https://f000.backblazeb2.com/file/backblaze-b2-public/Backup_Tool_Comparison.xlsx

1 Like

Duplicati uses a fixed block size for deduplication. Default block size is 100KB, but can be adjusted before the initial backup is made.
Each file will be separated in 100KB chunks of raw data (the last block is probably smaller and contains the last part of the source file). Block data must be identical for deduplication to apply.

If you merge multiple files into one file, the individual block data will be different, unless the file sizes are exactly a multiple of the block size. In all other situations, data in each block is shifted and thus will be different from existing remote blocks.

Reason for using fixed block size is that the algorithm for dynamic block sizes is much more difficult and probably doesn’t save much space. In common situations, data is added to the end of an existing file instead of inserted. For situations where insertion of data is common (text files), there is not much space to save, these filetypes are often very small.

Try the deduplication again by renaming, copying and moving files and you will see that the remote data (.DBLOCK files) will not increase.

I did some further testing and it does sound consistent with above —

  1. If I change around to use n files of size m - the deduplication rate is always around 1/n — so basically the first segment of file is dedup but then the block boundaries dont match the subsequent files.

  2. If I do a bunch of files exactly 1000KB then the dedup ratio is high and close to 100% - basically for that file size the block boundaries land on the file boundaries.

Yep, inserting a single byte at the start of the file will defeat the simplistic deduplication in Duplicati.

More robust deduplication mechanisms can detect such insertions/deletions, or use variable block sizes - presumably at the expense of more processing time.

1 Like

Yeh a bit a bummer since seems like most other solutions do use some rolling hash approach. Duplicati ticks most of the feature boxes and no big downsidesand this is a notable omission.

It could be more robust, but I don’t think it’s that big of a deal personally. Duplicati has so many strong points that this one didn’t bother me.

Guaranteed shortest differential, is computationally extremely intensive task. It’s quite simple to implement, at least in naive way. Yet I assume nobody would ever use it. Works well, if the data set is in range of few megabytes, after that, is starts getting way crazy slow. We tried that once for differential updates. Basically it’s not too different from situation where whole data set is being used as compression dictionary. Everything needs to be compared with everything. And it becomes exponentially slower when amount of data grows. See: Delta Compression

1 Like

It isn’t so horrible on performance; the other backup solutions I have been testing are posting timing results that are not much different from duplicati despite being able to dedup on arbitrary boundaries.