Setup recommendation - 4-5 uncompressed files to remote, around 50GB each

Hi,

Our setup is simple: we want to back up a set of 5 VM image backups to WebDAV.

backup1.img
backup2.img
backup3.img
backup4.img
backup5.img

The images are uncompressed disks (NTFS, ext4) - each around 50GB.

Now the strange thing:

After changing just a few bytes in a virtual machine, dumping a new backup1.img, and transferring that file to WebDAV, I would expect only a small amount of data to be transferred to the remote.

However, usually around 20-30% of the total VM size gets transferred again. Am I doing something wrong?

I’m sticking to the defaults and haven’t touched any options so far.

Thank you.

Probably not. Duplicati takes the file and breaks it up into 100KiB chunks. If a single byte within a chunk is different, it is considered a new block and the entire chunk needs to be re-uploaded.

Say you have a 1GB file and a few bytes are INSERTED near the beginning of the file. The 100KiB block that contains those few bytes will be new, as will ALL blocks after it. This is because the data is shifted and the 100KiB boundaries will be in different spots.
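Here's a toy sketch of fixed-size chunking in Python to make that concrete (purely illustrative, not Duplicati's actual code; the 100 KiB value just mirrors its default block size):

```python
import hashlib
import random

BLOCK_SIZE = 100 * 1024  # mirrors Duplicati's default 100 KiB block size

def block_hashes(data: bytes) -> list[str]:
    """Split data into fixed-size blocks and hash each block."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

random.seed(42)
original = random.randbytes(10 * 1024 * 1024)   # stand-in for a disk image

# Insert 4 bytes near the start, shifting everything after them
inserted = original[:5000] + b"XXXX" + original[5000:]
# Overwrite 4 bytes in place, shifting nothing
overwritten = original[:5000] + b"XXXX" + original[5004:]

old = block_hashes(original)
print(sum(a != b for a, b in zip(old, block_hashes(inserted))),
      "blocks changed after the insert")      # every block from the insert onward
print(sum(a != b for a, b in zip(old, block_hashes(overwritten))),
      "block changed after the overwrite")    # just the one containing the change
```

An in-place overwrite (no shift) only invalidates the block it lands in, which is why deduplication on raw disk copies can still work reasonably well when data inside the image changes but doesn't move.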

What are you doing to create these VM “dumps”? If they are just copies of the VHD/VMDK files, I would think the deduplication would be pretty decent, if only a small amount of data is changed within those disks. But if you are doing something more complex than VHD/VMDK copying, that may not be the case.

It’s just a vzdump with Proxmox. The resulting files are the raw disks - containing 1-3 partitions - without compression.
The changes are indeed small, but as you mentioned, due to the shifting, even a tiny change will shift all blocks until the end of the file.

Just tested again: changing a 1MB file in a 5.5GB image resulted in around 1.4GB of new data at the destination.

I’m not familiar with vzdump, but if it generates dumps by talking to the hypervisor while the VMs are running, then yeah, I’m guessing it creates dumps that are different enough from the original VM disks that Duplicati's deduplication doesn't work so well. Also, if the dumps are compressed, that’s almost guaranteed to cause poor(er) deduplication. You MAY want to experiment with using vzdump with no compression (if that’s possible). Edit to add: sorry, just re-read that these are uncompressed files.

Is deduplication efficiency your top priority with backups? Some other backup programs have more robust deduplication, where the dedupe chunk size is variable (boundaries follow the content rather than fixed offsets) to help catch things like this. I don’t know if it would work better in your case or not, but it might.
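To illustrate what "variable chunk size" means in practice, here's a rough content-defined chunking sketch using a gear-style rolling hash (purely illustrative, not how any particular tool implements it). Because boundaries are derived from the data itself, they re-synchronize shortly after an insert instead of shifting all the way to the end:

```python
import hashlib
import random

AVG_MASK = (1 << 13) - 1            # cut when hash & mask == 0: ~8 KiB average chunk
MIN_CHUNK, MAX_CHUNK = 2 * 1024, 64 * 1024

random.seed(0)
GEAR = [random.getrandbits(64) for _ in range(256)]  # per-byte random table

def cdc_hashes(data: bytes) -> list[str]:
    """Content-defined chunking: cut where a rolling hash over recent
    bytes hits a pattern, so boundaries move with the content."""
    hashes, start, h = [], 0, 0
    for i, byte in enumerate(data):
        h = ((h << 1) + GEAR[byte]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & AVG_MASK) == 0) or length >= MAX_CHUNK:
            hashes.append(hashlib.sha256(data[start:i + 1]).hexdigest())
            start, h = i + 1, 0
    if start < len(data):
        hashes.append(hashlib.sha256(data[start:]).hexdigest())
    return hashes

random.seed(42)
original = random.randbytes(2 * 1024 * 1024)   # small buffer, pure Python is slow
modified = original[:5000] + b"XXXX" + original[5000:]

old, new = set(cdc_hashes(original)), set(cdc_hashes(modified))
print(f"{len(new - old)} of {len(new)} chunks are new after the insert")
# Only the chunk(s) around the insertion change; boundaries downstream
# re-align with the content, so the rest still deduplicate.
```

Duplicati itself uses fixed-size blocks, so this is only meant to show the difference in behavior, not something you can switch on there.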