Deduplication vs compression

I’m completely new to Duplicati, so I’ll start off by making sure I understand how deduplication works in Duplicati before I get to my main question.

As I understand it, Duplicati breaks files into blocks, and identical blocks that occur across multiple files are stored only once.
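
To check that mental model, here’s roughly what I picture happening (plain Python pseudocode of my own understanding, definitely not Duplicati’s actual code; the block size here is arbitrary):

```python
# My rough mental model of block-level deduplication (not Duplicati's
# actual code; the block size is arbitrary, just for illustration).
import hashlib

BLOCK_SIZE = 4096

def store_file(data: bytes, block_store: dict) -> list:
    """Split a file into blocks; store each distinct block only once."""
    hashes = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        block_store.setdefault(digest, block)  # keep the first copy only
        hashes.append(digest)  # the file is recorded as a list of hashes
    return hashes
```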

My question has to do with the order that Duplicati performs deduplication versus compression.

In brief: are files first deduplicated and then compression is applied to the stored blocks? Or are whole files first compressed and then the compressed files are deduplicated?

I’m guessing the former, since deduplicating first would mean an insertion in a file doesn’t break deduplication of everything from that point forward, but I’d rather know than guess.

On a related note, how well does Duplicati deal with edits or insertions in the middle of a file? In particular, does it avoid re-uploading the unmodified parts of the file that follow a mid-file change?

Thanks in advance for humoring a newbie’s questions!

Yes, files are split into blocks, then each block is compressed and put into a zip archive :slight_smile:

So an insertion into a file that fits within a single block will only require a single new block to be added and compressed.
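
If it helps, the order of operations looks roughly like this (a simplified sketch, not Duplicati’s real implementation; real volumes also hold metadata and can be encrypted):

```python
# Simplified sketch of the order of operations: deduplicate first,
# then compress each previously unseen block into a zip volume.
# Not Duplicati's real code; names here are illustrative only.
import hashlib
import zipfile

BLOCK_SIZE = 100 * 1024  # Duplicati's default --blocksize is 100KB

def backup(data: bytes, seen: set, volume: zipfile.ZipFile) -> list:
    """Store each previously unseen block, compressed, in the volume."""
    hashes = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in seen:  # deduplication happens before compression
            seen.add(digest)
            volume.writestr(digest, block, compress_type=zipfile.ZIP_DEFLATED)
        hashes.append(digest)
    return hashes
```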

Hi Pectojin,
I’m also new to Duplicati :wink:
Does your answer mean that a flexible block size is used?

Thanks and Regards
Michael

It’s good to see so many new people :slight_smile:

Yes, block sizes are flexible as long as they fit within the defined block size. There’s a great article on all the details here: How the backup process works • Duplicati

It gives an example where a 4KB file is uploaded as a single block: a block can be up to 100KB, but in this case the 4KB of file contents are simply added as one (smaller) block :slight_smile:

Unfortunately not. Duplicati uses a fixed block size, which can be set using the advanced option --blocksize (default 100KB).
So each file is split up into blocks of exactly 100KB; only the last part of each file is stored in a smaller block (unless the file size is an exact multiple of 100KB). For example, a 250KB file is stored as two 100KB blocks plus one 50KB block.

When data is inserted in the middle of a file, everything after the insertion is shifted by the number of bytes inserted. Unless that shift is an exact multiple of the block size, every block after the insertion will be different, causing deduplication to fail from that position onward.
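
A toy example makes this visible (using tiny 4-byte blocks so the shift is easy to see; the principle is identical at 100KB):

```python
# Toy demonstration: an insertion that isn't a multiple of the block
# size shifts every following block, so all their hashes change.
import hashlib

BLOCK_SIZE = 4  # tiny block size so the effect is easy to see

def block_hashes(data: bytes) -> list:
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()[:8]
            for i in range(0, len(data), BLOCK_SIZE)]

original = b"AAAABBBBCCCCDDDD"
edited   = b"AAAAXBBBBCCCCDDDD"  # one byte inserted after the first block

print(block_hashes(original))
print(block_hashes(edited))
# Only the first block's hash matches; every block after the insertion
# is shifted, so it counts as new data and gets uploaded again.
```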

Ah, good point. I had glossed over that in my head.

Hi Pectojin, Hi kees-z,

thanks a lot for your help and the detailed explanation.

Regards
Michael