Can someone explain this to me like I’m 5? How does Duplicati know which files to remove from a backup set?
I’m running Duplicati in a VM and testing backups to my NAS. I told Duplicati to keep 5 backups. I’m backing up Duplicati’s own database file, since I know that file will change with each run and Duplicati will treat it as a changed file. (I’m assuming it’s using the archive bit to track changes in the file?)
Looking at the files Duplicati creates for the backups, I’m noticing the file set continues to grow by about 2MB on each manual run (the 2MB itself is irrelevant). Then, after it gets to about 17 files (zip.aes etc.), it trims them down to 7 files: it compacts all the incremental backups into one file, then deletes the incremental files. The cycle repeats once it reaches my target of 5 backups.
How does it know which files to remove? I need to come up with some sort of retention period, as I’m using Wasabi as my S3 endpoint. Wasabi requires you to keep files on their service for a minimum of 90 days, and I don’t want Duplicati trimming the fat at 85 days, if you catch my drift. Is there an explanation of this somewhere? I’d read the source code, but I’m not a programmer.
I’ll try to give you a short and simple answer.
When you run a backup, Duplicati processes your files in chunks (100KB by default). Any chunk (block) of data it hasn’t seen before (hasn’t already stored on the back end) will be packaged into dblock volumes for upload. (Duplicati never stores the exact same chunk more than once on the back end - this is how the deduplication engine works.)
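Here’s a minimal Python sketch of that idea. The `upload` callback and the in-memory hash set are stand-ins for Duplicati’s real dblock packaging and local database; the names are hypothetical, but the flow (split into fixed-size blocks, hash each one, skip blocks already stored) is the deduplication described above.

```python
import hashlib

BLOCK_SIZE = 100 * 1024  # Duplicati's default block size (100KB)

def backup_file(path, stored_hashes, upload):
    """Split a file into fixed-size blocks and hand each previously
    unseen block to `upload`; blocks already on the back end are skipped."""
    with open(path, "rb") as f:
        while block := f.read(BLOCK_SIZE):
            digest = hashlib.sha256(block).hexdigest()
            if digest not in stored_hashes:  # already stored? then skip it
                upload(digest, block)
                stored_hashes.add(digest)
```

Running this twice over the same unchanged file uploads nothing the second time, because every block hash is already in the stored set.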
Duplicati records all blocks referenced by each backup version. In this way it knows which chunks in which dblock volumes on the back end are referenced by backup versions. After some backup versions are pruned by your retention settings, it can figure out which chunks are no longer referenced by any backup version and are therefore candidates for deletion.
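In spirit, finding deletion candidates is just set arithmetic. This toy example (made-up version names and block hashes, not Duplicati’s actual database schema) shows how pruning a version can leave a block unreferenced:

```python
def unreferenced_blocks(versions, all_blocks):
    """Return blocks present on the back end but referenced by no remaining version."""
    referenced = set().union(*versions.values()) if versions else set()
    return all_blocks - referenced

# Hypothetical picture of the local database: version -> referenced block hashes.
versions = {
    "2024-01-01": {"b1", "b2", "b3"},  # oldest version
    "2024-01-02": {"b1", "b2", "b4"},
    "2024-01-03": {"b1", "b4", "b5"},
}
all_blocks = {"b1", "b2", "b3", "b4", "b5"}

del versions["2024-01-01"]  # retention policy prunes the oldest version
# "b3" was referenced only by the pruned version -> candidate for deletion
```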
Because chunks are combined into larger dblock volumes (50MB by default), Duplicati waits until a threshold of wasted space is reached before reclaiming the space held by unreferenced chunks. It may either simply delete entire dblock volumes (if no chunks inside the volume are still referenced), or it may download 2 or more dblock volumes, repackage them (without the unreferenced chunks) into fewer volumes, upload the new volumes, and finally delete the old ones. The goal of this compaction process is to reduce the data usage and the number of files on the back end.
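A rough sketch of that decision, assuming a 25% waste threshold (the actual threshold in Duplicati is configurable; the function and data shapes here are illustrative, not the real implementation):

```python
def plan_compaction(volumes, referenced, waste_threshold=0.25):
    """Decide per dblock volume: delete outright, repack, or leave alone.
    `volumes` maps volume name -> {block hash: block size in bytes}."""
    delete, repack = [], []
    for name, blocks in volumes.items():
        total = sum(blocks.values())
        live = sum(size for h, size in blocks.items() if h in referenced)
        if live == 0:
            delete.append(name)   # every block is unreferenced: just delete
        elif (total - live) / total > waste_threshold:
            repack.append(name)   # enough waste to justify download/repack/re-upload
    return delete, repack
```

For example, a volume whose blocks are all still referenced is left alone, a volume with no referenced blocks is deleted, and a half-dead volume becomes a repack candidate.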
Technically there’s nothing wrong with deleting early. It’s just that Wasabi will charge you as if the object were still present for the full 90 days.
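The billing effect is simple to state in code (a sketch of the policy as I understand it, not an official Wasabi calculator):

```python
def wasabi_billed_days(days_stored, minimum_days=90):
    """Wasabi's minimum storage duration: an object deleted before
    `minimum_days` is still billed as if stored for that long."""
    return max(days_stored, minimum_days)
```

So deleting at 85 days costs the same as keeping the object the full 90.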
The backup process explained article might go along well with the above (it’s the one I prefer, but it’s more technical).
Your question about the archive bit suggests you’re at least somewhat technical, and the answer to that is no. Duplicati uses timestamps, then reads the contents of candidate files to figure out which blocks changed. The job log counts what was opened.
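The timestamp check can be sketched like this (hypothetical function and snapshot shape; Duplicati’s real database records more than this):

```python
import os

def files_to_scan(paths, last_seen):
    """Timestamp-based change detection: only files whose modification time
    or size differs from what the previous backup recorded get opened and
    scanned for changed blocks. `last_seen` maps path -> (mtime, size)."""
    to_scan = []
    for p in paths:
        st = os.stat(p)
        if last_seen.get(p) != (st.st_mtime, st.st_size):
            to_scan.append(p)
    return to_scan
```

Files that pass the timestamp check untouched are never opened at all, which is why a backup run is fast when little has changed.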
The Cost optimisation on Wasabi topic gets into that issue further. If you’re an expert on their charging, you can post there.