Deduplication for large files?

How does deduplication work for large files?

If I have a 100 MB file and only 1 byte is changed, how big will the deduplicated incremental backup be for this change?

Duplicati uses fixed-size chunking (the default block size is 100 KiB) for deduplication. If a single byte is modified in place in a file, then it should only require one additional block to be stored on the back end.

However, if a byte is inserted or deleted, it shifts all subsequent bytes accordingly. Duplicati’s fixed chunking doesn’t deal with this, so it may require many new blocks to be stored on the back end. How many depends on the size of the file and where the data was inserted or deleted within the file.
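
A minimal sketch of why this happens (Python for illustration, not Duplicati’s actual code): hash fixed-size 100 KiB blocks and count how many block hashes change after an in-place overwrite versus a one-byte insertion. The 10 MB pseudo-random buffer and SHA-256 are just assumptions for the example.

```python
# Illustration only: fixed-size chunking handles an in-place edit well,
# but a one-byte insertion shifts every block after the insertion point.
import hashlib
import random

BLOCK_SIZE = 100 * 1024                      # Duplicati's default block size (100 KiB)

def block_hashes(data: bytes) -> list[str]:
    """Split data into fixed-size blocks and hash each block."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

original = random.Random(0).randbytes(10 * 1024 * 1024)   # a 10 MB pseudo-random "file"

modified = bytearray(original)
modified[5_000_000] ^= 0xFF                  # overwrite a single byte in place
inserted = original[:5_000_000] + b"\xFF" + original[5_000_000:]   # insert a single byte

base = block_hashes(original)
print(sum(a != b for a, b in zip(base, block_hashes(bytes(modified)))))  # 1 block differs
print(sum(a != b for a, b in zip(base, block_hashes(inserted))))         # every block from the insertion point on differs
```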


Great! Thanks!

Usually the atomic unit of incremental backup software is a single file, no matter how big it is. So if a big file is changed even slightly, the whole file is backed up again.

Glad to hear that Duplicati’s deduplication is much better, because it works at the level of 100 KB chunks. So incremental backup with 100 KB chunk-based deduplication means that a changed file is not stored again in full, just the changed 100 KB chunks.

So if I change the header of 5-10 MB JPEG files, they are not stored again in full, just the changed 100 KB chunks.

Is that correct?

Yep, that’s correct, assuming only the first 100 KiB of the file changed and the rest of the JPEG is exactly the same.
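
As a rough worked example (the 8 MB file size is just an assumption within the 5-10 MB range mentioned above):

```python
# Back-of-the-envelope arithmetic for the JPEG header example (illustrative only).
BLOCK_SIZE = 100 * 1024              # 100 KiB default block size
file_size = 8 * 1024 * 1024          # an assumed 8 MB JPEG

total_blocks = -(-file_size // BLOCK_SIZE)   # ceiling division -> 82 blocks
changed_blocks = 1                           # header edit confined to the first block

print(f"{total_blocks} blocks; ~{changed_blocks * BLOCK_SIZE // 1024} KiB uploaded instead of {file_size // 2**20} MB")
```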

What it boils down to is that Duplicati will not store the exact same chunk more than once, regardless of whether the chunk is part of the same file or a different file.
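
A minimal sketch of what that means in practice, assuming a simple hash-keyed block store (not Duplicati’s actual storage format):

```python
# Illustration only: a deduplicating block store keyed by content hash.
import hashlib

BLOCK_SIZE = 100 * 1024
block_store: dict[str, bytes] = {}       # block hash -> block contents, stored once
file_index: dict[str, list[str]] = {}    # file name  -> ordered list of block hashes

def backup(name: str, data: bytes) -> int:
    """Record a file as a list of block hashes; return how many new blocks were stored."""
    new_blocks = 0
    hashes = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in block_store:    # the same chunk is stored at most once,
            block_store[digest] = block  # whether it comes from this file or any other
            new_blocks += 1
        hashes.append(digest)
    file_index[name] = hashes
    return new_blocks

data = bytes(10 * BLOCK_SIZE)            # ten identical all-zero blocks
print(backup("a.bin", data))             # 1 -> ten blocks, but only one distinct chunk stored
print(backup("b.bin", data))             # 0 -> a second copy of the file adds nothing new
```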


Compressed file types are excluded from de-duplication by default. So in your JPEG example no de-duplication will occur at all and the entire file will be backed up. If you are only changing the metadata of compressed file types rather than the compressed content of the file, then you’ll need to remove the extension from the “default_compressed_extensions.txt” file or its equivalent on your operating platform.
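
A hedged sketch of how an extension list like default_compressed_extensions.txt can drive the compression decision (the file format, matching rules, and sample extensions here are assumptions, not Duplicati internals):

```python
# Illustration only: skip re-compression for file types assumed to be already compressed.
from pathlib import Path

def load_skip_list(path: str = "default_compressed_extensions.txt") -> set[str]:
    """Assumed format: one extension per line; blanks and '#' comments ignored."""
    return {line.strip().lower() for line in Path(path).read_text().splitlines()
            if line.strip() and not line.startswith("#")}

def should_compress(filename: str, skip_extensions: set[str]) -> bool:
    """Return False for extensions in the skip list; those blocks are stored as-is."""
    return Path(filename).suffix.lower().lstrip(".") not in skip_extensions

exts = {"jpg", "zip", "mp4"}                 # stand-in for the real list
print(should_compress("photo.jpg", exts))    # False -> stored without compression
print(should_compress("notes.txt", exts))    # True  -> compressed before upload
```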

This is awesome.
Why is this not described in the feature list of Duplicati?

Because it’s not really a “feature”; it’s there to save CPU/compression time. There’s no point in de-duplicating data that will always change or in compressing already compressed data, so to improve backup speed these files are taken as a whole. It’s documented in a number of places, including the user guide.

There is no way to tell how strongly a compressed file was compressed.
Therefore, a compressed file should be added without compression.

They may not be compressed but they are still chunked and deduplicated.

The statement “There is no way to tell how strongly a compressed file was compressed” plus “a compressed file should be added without compression” leaves me confused. The second line is the default. If it was meant to say “with compression” rather than “without”, one can override the default (which I think helps most people) with a custom --compression-extension-file.

I suspect most pre-compressed files already compress better on their own than Duplicati could manage, because Duplicati first breaks the file into relatively tiny (100 KB) blocks by default (see Choosing sizes in Duplicati if you don’t like the default value, which in my view is on the small side for large files and backups, as the rough numbers below suggest), and looks to see if the block exists in the backup already. If so, deduplication happens, and that block is referenced and not uploaded again. Finding a new 100 KB block means it must be uploaded, but compression before uploading is optional.

How the backup process works
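
As a rough illustration of the block-size point above (the 500 GiB source size is just an assumed example):

```python
# How the block count scales with block size for a given source size (illustrative only).
source_size = 500 * 1024**3                             # assume a 500 GiB backup source
for block_size in (100 * 1024, 1024**2, 10 * 1024**2):  # 100 KiB, 1 MiB, 10 MiB
    blocks = -(-source_size // block_size)              # ceiling division
    print(f"block size {block_size // 1024:>6} KiB -> {blocks:,} blocks to hash, track, and look up")
```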