Deduplication for large files?

How does deduplication work for large files?

If I have a 100 MB file and only 1 byte is changed, how big will the deduplicated incremental backup be for this change?

Duplicati uses fixed-size chunking (the default block size is 100 KiB) for deduplication. If a single byte is modified in a file, it should only require one additional block to be stored on the back end.

However, if a byte is inserted or deleted, it shifts all subsequent bytes accordingly. Duplicati’s fixed chunking doesn’t deal with this, so it may require many new blocks to be stored on the back end. How many depends on the size of the file and where the data was inserted or deleted within it.
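To make that concrete, here is a minimal sketch in plain Python (not Duplicati’s actual code; the `BLOCK_SIZE` constant and `block_hashes` helper are just for illustration) showing how fixed-size chunking reacts to the two cases:

```python
import hashlib
import os

BLOCK_SIZE = 100 * 1024  # 100 KiB, mirroring Duplicati's default --blocksize

def block_hashes(data: bytes) -> list[str]:
    """Split data into fixed-size blocks and hash each block."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

original = os.urandom(10 * 1024 * 1024)             # 10 MiB stand-in for a large file
before = block_hashes(original)

modified = bytearray(original)
modified[50] ^= 0xFF                                # overwrite one byte in place
print(sum(a != b for a, b in zip(before, block_hashes(bytes(modified)))))  # 1 changed block

inserted = original[:50] + b"\x00" + original[50:]  # insert one byte, shifting the rest
print(sum(a != b for a, b in zip(before, block_hashes(inserted))))  # in practice, every block from the insertion point on changes
```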


Great! Thanks!

Usually the atomic unit of incremental backup software is a single file, no matter how big it is. So if a big file is changed even slightly, the whole file is backed up again.

Glad to hear that Duplicati’s deduplication is much better, because it uses 100 KB chunks as its atomic unit. So incremental backup with 100 KB chunk-based deduplication means that a changed file is not stored again in full, just the changed 100 KB chunks.

So if I change the header of 5-10 MB JPEG files, they are not stored again in full, just the changed 100 KB chunks.

Is that correct?

Yep, that’s correct, assuming only the first 100 KiB of the file changed and the rest of the JPEG is exactly the same.

What it boils down to is that Duplicati will not store the exact same chunk more than once, regardless of whether the chunk is part of the same file or a different file.
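For anyone curious what that looks like in code, here is a toy content-addressed store (illustrative Python only; the `BlockStore` class is made up and is not Duplicati’s real data structure) where a chunk costs storage only the first time its hash is seen, no matter which file it came from:

```python
import hashlib
import os

BLOCK_SIZE = 100 * 1024

class BlockStore:
    """Toy content-addressed store: each unique block is kept exactly once."""
    def __init__(self):
        self.blocks = {}   # hash -> block bytes (would live on the backend)
        self.files = {}    # filename -> ordered list of block hashes

    def add_file(self, name: str, data: bytes) -> int:
        """Record a file and return how many previously unseen blocks it added."""
        new_blocks = 0
        hashes = []
        for i in range(0, len(data), BLOCK_SIZE):
            block = data[i:i + BLOCK_SIZE]
            h = hashlib.sha256(block).hexdigest()
            if h not in self.blocks:      # only unseen chunks cost storage
                self.blocks[h] = block
                new_blocks += 1
            hashes.append(h)
        self.files[name] = hashes
        return new_blocks

store = BlockStore()
payload = os.urandom(5 * 1024 * 1024)              # 5 MiB example payload
print(store.add_file("a.bin", payload))            # 52 blocks, all new
print(store.add_file("copy_of_a.bin", payload))    # 0: identical content is fully deduplicated
```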


Compressed file types are excluded from de-duplication by default. So in your JPEG example no de-duplication will occur at all and the entire file will be backed up. If you are only changing the metadata of compressed file types rather than the compressed content, then you’ll need to remove that extension from the “default_compressed_extensions.txt” file or its equivalent on your operating platform.

This is awesome.
Why is this not described in the feature list of Duplicati?

Because it’s not really a “feature”, it’s there to save CPU/compression time. There’s no point in de-duplicating data that will always change, or in compressing already-compressed data. So to improve backup speed these files are taken as a whole. It’s documented in a number of places, including the user guide.

There is no way of knowing how strong the compression used in an already-compressed file was.
Therefore a compressed file should be added without compression.

They may not be compressed but they are still chunked and deduplicated.

Quoting “there is no way of knowing how strong the compression was” plus “a compressed file should be added without compression” leaves me confused. The second line is the default. If it meant to say “with compression” not “without”,
one can override the default (which I think helps most people) with a custom --compression-extension-file.

I suspect most pre-compressed files are compressed better than Duplicati would manage, because Duplicati first breaks the file into relatively tiny blocks (100 KB by default; see Choosing sizes in Duplicati if you don’t like that value, which in my view is on the small side for large files and backups) and then looks to see whether each block already exists in the backup. If so, deduplication happens, and that block is referenced rather than uploaded again. Finding a new 100 KB block means it must be uploaded, but compression before uploading is optional.

How the backup process works
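Boiled down to rough Python (hypothetical names, not Duplicati’s code; zlib stands in for whatever compressor is configured), the per-block decision described above looks something like this: a known hash becomes a reference, an unknown hash is uploaded, and compressing the new block first is a separate, optional step.

```python
import hashlib
import zlib

BLOCK_SIZE = 100 * 1024  # mirrors the default --blocksize

def backup_file(data: bytes, known_hashes: set[str], compress: bool = True):
    """Return (references, uploads): blocks reused from the backup vs. new blocks to send."""
    references, uploads = [], []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        h = hashlib.sha256(block).hexdigest()
        if h in known_hashes:
            references.append(h)                      # deduplicated: nothing is uploaded
        else:
            payload = zlib.compress(block) if compress else block
            uploads.append((h, payload))              # new block, must be uploaded
            known_hashes.add(h)
    return references, uploads
```

For an already-compressed JPEG nearly every block will be new and the compressor will barely shrink it, which is why skipping compression for those extensions mostly just saves CPU time rather than storage.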

Sorry about the late reply. I believe they are not chunked in the normal sense; I mean only the hash of the file itself is kept, but no hashes for the chunks. I remember reading that there was no point in chunking the file, as the entire file will always change due to compression. Always happy to be proven wrong. Cheers.

I’m not sure I understand your reply. The idea is that most compression algorithms will change the entire file, so backing it up in pieces makes no sense. In addition, there are many efficient compression algorithms, not just general-purpose ones but ones for particular file types: for example MP3/MP4/Opus for audio files and MKV/WEBM/MP4V for video files. Say I’m working on an audio file compressed with FLAC and it changes every day: Duplicati would not be able to capture block-level changes, as the compression algorithm would change the entire file.

Although I’m not the ultimate authority, I’m siding with the idea that large files are chunked based on --blocksize, which defaults to 100 KB. You can just pick a compressed file, back it up without encryption, and observe the result with a zip program that shows compressed size. 7-Zip or Windows File Explorer both work.

Below is where the core processing loop outputs blocks, and it passes along the file’s compression hint. There are times near the end of a backup where hints are unavailable, so expect some compressed blocks.

It makes sense in that, whatever the original file size, the output volume size is limited to the usual --dblock-size, which defaults to 50 MB, and the normal backup and restore code just works without special handling.

Choosing sizes in Duplicati talks about these two sizes, but IMHO some of the text is somewhat wrong about where compression is done. I’m still siding with compression being done on deduplication blocks, which are so small that I don’t think recompression makes any sense. Of course, deduplication is also pretty much inapplicable to most compressed-file situations. Mostly, I think you just get chunk-and-ship, with chunks packed into dblock files of the default 50 MB size. This solves both the small-file and the huge-file problems.
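Sketching that “chunk-and-ship” packing (again just illustrative Python, with a made-up `pack_into_volumes` function; the 50 MB cap mirrors the default --dblock-size), blocks from any number of files get grouped into volumes, so tiny files share a volume and huge files simply span several:

```python
def pack_into_volumes(blocks, volume_limit=50 * 1024 * 1024):
    """Group (hash, payload) pairs into volumes no larger than volume_limit bytes."""
    volumes, current, current_size = [], [], 0
    for h, payload in blocks:
        if current and current_size + len(payload) > volume_limit:
            volumes.append(current)      # this volume is full: ship it, start a new one
            current, current_size = [], 0
        current.append((h, payload))
        current_size += len(payload)
    if current:
        volumes.append(current)
    return volumes
```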

You can prove it to yourself. Set up a new test backup without encryption. Save to a local folder. Select a single large compressed file to back up and run the backup. After it’s complete, you will see several dblock zip files (default size 50 MiB) in the destination folder. Open one of those zip files and you will see several chunks within it (default size 100 KiB).
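If you’d rather script that inspection than click through 7-Zip, something like this works on an unencrypted local destination using Python’s standard zipfile module (the folder path is a placeholder you’d need to change):

```python
import glob
import zipfile

# Placeholder path: point it at your test backup's destination folder.
for path in glob.glob("/path/to/destination/*.dblock.zip"):
    with zipfile.ZipFile(path) as z:
        for info in z.infolist():
            # Compare each chunk's original size to its stored (compressed) size.
            ratio = info.compress_size / info.file_size if info.file_size else 1.0
            print(f"{info.filename[:16]}  {info.file_size} -> {info.compress_size} bytes ({ratio:.0%})")
```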

I get what you’re saying - compressed files, when modified, usually change significantly or maybe even entirely. So they do not deduplicate well at all. But this isn’t true 100% of the time. Think of cases where you may change just the metadata in some media file. Simple example: correcting a character in the ID3 tag of some MP3 file. Only the metadata header is rewritten, possibly only a single byte has changed.

Or think of situations where someone has multiple copies of the same, large compressed file. Deduplication works fantastically for this.

Or what about when people reorganize files on their system? Deduplication doesn’t really care about file location, so it will handle this just fine. Other backup programs that don’t use deduplication will see all the moved data as “new”, resulting in a lengthy backup operation with significant data transfer.

Duplicati has no way of knowing how well any particular file may deduplicate in advance - so all files are run through the process.
