Deduplication: How many versions back is possible?

I was wondering about a possible scenario where deduplication may fail:

Let’s say you have a backup with four versions, each done a day apart:
  • Day 1 has Folders A, B and C
  • Day 2 has Folders A and C
  • Day 3 has Folders B and C
  • Day 4 has Folders A, B and C
All three folders’ files are unchanged; Days 2 and 3 just have said folders omitted for some reason.

Will deduplication still occur for all three folders in these four days, or just Folder A for Day 2 and Folder C for all 4 days?

Thanks!

Depends on your retention settings. Assuming you are retaining your backups for longer than 2 days, dedupe will work fine.

On Day 3 folder B reappears. Duplicati will process it as if it were new data, but it will realize that the blocks match what already exists on the “back end” and not need to re-upload anything.

Same would happen on Day 4 when Folder A reappears.

Wanted to add that even if you had something odd like 1 day retention, then worst case is that the blocks of data for the folder that disappeared would be removed from the back end storage. When the folder reappears, Duplicati would back it up again and re-upload blocks to the back end storage.

Thanks, so deduplication is effective as long as my retention policy is long enough to cover all affected periods?

Maybe someone else can explain it more clearly. But does it make sense to you that if you delete something off your PC and the retention period has passed, the data would no longer be recoverable? That’s because the data on the back end is removed.

If your scenario is something like:

  • Backup 1: File exists
  • Backup 2: File does not exist
  • Backup 3: File exists again (same contents as before)
  • Backup 4: File does not exist again
  • Backup 5: File exists again (still same contents as before)

Then yes - as long as any version is in the retention period the deduplication will happen (so in the above example only one ‘copy’ of the file will be backed up but will ‘belong’ to 3 different backup sets).

Even if there were a Backup 6 where the file was gone again, the data itself would be kept in the destination because it belongs to other backups that still need it.
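If it helps, here’s a rough Python sketch of that idea. This is purely illustrative and not Duplicati’s actual code; the block size, hashing scheme, and data structures are all made up for the example:

```python
import hashlib

# Hypothetical content-addressed block store: blocks are keyed by their
# SHA-256 hash, so identical data is ever stored only once.
block_store = {}   # hash -> block bytes (the "back end")
backup_sets = {}   # backup name -> list of block hashes it references

def backup(name, files, block_size=4):
    """Split each file into fixed-size blocks; 'upload' only unseen blocks."""
    hashes = []
    for data in files.values():
        for i in range(0, len(data), block_size):
            block = data[i:i + block_size]
            h = hashlib.sha256(block).hexdigest()
            if h not in block_store:   # new data: upload it
                block_store[h] = block
            hashes.append(h)           # existing data: just reference it
    backup_sets[name] = hashes

# File exists, disappears, then reappears with the same contents.
backup("Backup 1", {"file.txt": b"same contents"})
backup("Backup 2", {})
backup("Backup 3", {"file.txt": b"same contents"})

# Backups 1 and 3 reference the exact same blocks; nothing was stored twice.
print(backup_sets["Backup 1"] == backup_sets["Backup 3"])  # → True
print(len(block_store))  # → 4 unique blocks, shared by both backups
```

As long as any backup set still references a block, the block stays on the back end; only when the last referencing version falls out of retention can it be deleted.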

This part I understand; I just need some confirmation that I understood correctly how retention policy and deduplication work together.

Ah thanks, sounds like I should keep my retention policy long enough for at least one of these backups

Yep! In fact with deduplication there really isn’t a reason that you can’t have a retention period measured in years.

I back up every 4 hours and use a graduated retention policy: all versions are kept for the first week, then only daily backups for 3 months, then weekly backups for 2 years, and beyond that only monthly backups.

The option looks like this:

--retention-policy=7D:0s,3M:1D,2Y:1W,99Y:1M
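In case the format isn’t obvious: each comma-separated entry is “timeframe:interval”, read as “for backups up to this old, keep at most one per interval” (with 0s meaning keep everything). A quick parser sketch, just to illustrate the structure (not Duplicati’s code):

```python
policy = "7D:0s,3M:1D,2Y:1W,99Y:1M"

units = {"s": "seconds", "D": "days", "W": "weeks", "M": "months", "Y": "years"}

def describe(spec):
    # "7D" -> "7 days", "1W" -> "1 weeks", etc.
    amount, unit = spec[:-1], spec[-1]
    return f"{amount} {units[unit]}"

for rule in policy.split(","):
    timeframe, interval = rule.split(":")
    if interval == "0s":
        print(f"Backups up to {describe(timeframe)} old: keep all")
    else:
        print(f"Backups up to {describe(timeframe)} old: keep one per {describe(interval)}")
```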

Thanks! I can consider that, but first we’ll have to see how large my backups are in total; right now I am just uploading each rsnapshot dataset I made (which was not done periodically) as a “daily” backup. This is why I am asking about deduplication (and about changing to Duplicati!), since rsnapshot can’t deduplicate.

Speaking about deduplication, does it do so over folders and even disk drives too? E.g.:
  • Disk A: Folder A/sometext.txt
  • Disk B: Folder B/sometext.txt
=> sometext.txt will be saved only once instead of twice

Asking this because I can also foresee myself merging the data from two or more smaller disks into a bigger one in the long run.

Yes, deduplication works for all files included in the backup no matter their location, but unlike Duplicacy it does not work across DIFFERENT backup jobs.

Note that when moving a file, Duplicati will consider the file at the old location deleted and the (same) file at the new location as a new file. This means that the history of the old file location will end and the new location will start a new history.

But through all of that, deduplication will still work, and only one copy of the file will have been uploaded.
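Here’s a tiny sketch of that content-addressed idea. Illustrative only: the paths and the store are made up, and real backup tools track blocks rather than whole files:

```python
import hashlib

# Hypothetical content-addressed store: content hash -> bytes.
store = {}

def snapshot(files):
    """Record a path -> content-hash listing; store each unique content once."""
    listing = {}
    for path, data in files.items():
        h = hashlib.sha256(data).hexdigest()
        store.setdefault(h, data)   # identical content is stored only once
        listing[path] = h
    return listing

text = b"some text"

# The same file contents on two different "disks"/paths:
snap1 = snapshot({
    "DiskA/FolderA/sometext.txt": text,
    "DiskB/FolderB/sometext.txt": text,
})
print(len(store))  # → 1: one stored copy serves both paths

# After merging the disks, the old paths' histories end and the new path
# starts a fresh one, but the content itself is reused, not re-uploaded.
snap2 = snapshot({"DiskC/merged/sometext.txt": text})
print(len(store))  # → still 1
```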

Ah I see. I actually think Duplicati’s approach of deduplicating a file only within its own backup job rather than across all jobs makes more sense; it makes recovering from a defined set of backed-up files a lot easier, I suppose. I can, however, also see some use cases where deduplicating across all backup sets would be more useful.

This is quite a relief to hear, seeing that I will be reorganising stuff within the dataset once I move all my rsnapshots to the cloud; being able to revert to previous arrangements allows me to be a lot bolder in the reorganisation.