S3 buckets, "smart" retention, and old data files

I was poking around the S3 buckets that duplicati sends backups to, and was a little puzzled by the age of some of the files. On the backup job I have it set to use smart retention, so I was quite surprised to find dindex and dblock files going back 4 years. I checked the logs, and do see it reporting that retention rules are being processed, for example:

"2022-06-20 23:13:45 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-StartCheck]: Start checking if backups can be removed",
"2022-06-20 23:13:45 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-FramesAndIntervals]: Time frames and intervals pairs: 7.00:00:00 / 1.00:00:00, 28.00:00:00 / 7.00:00:00, 365.00:00:00 / 31.00:00:00",
"2022-06-20 23:13:45 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-BackupList]: Backups to consider: 6/19/2022 11:00:00 PM, 6/18/2022 11:00:00 PM, 6/17/2022 11:00:00 PM, 6/16/2022 11:00:00 PM, 6/15/2022 11:00:00 PM, 6/14/2022 11:00:00 PM, 6/13/2022 11:00:00 PM, 6/7/2022 11:00:00 PM, 5/31/2022 11:00:00 PM, 5/24/2022 11:00:00 PM, 4/19/2022 11:00:00 PM, 3/15/2022 11:00:00 PM, 2/8/2022 10:00:00 PM, 1/4/2022 10:00:00 PM, 11/29/2021 10:00:00 PM, 10/24/2021 11:00:00 PM, 9/19/2021 11:00:00 PM, 8/15/2021 11:00:00 PM, 7/11/2021 11:00:00 PM",
"2022-06-20 23:13:45 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-BackupsToDelete]: Backups outside of all time frames and thus getting deleted: ",
"2022-06-20 23:13:45 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler:RetentionPolicy-AllBackupsToDelete]: All backups to delete: 6/13/2022 11:00:00 PM",
"2022-06-20 23:13:45 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler-DeleteRemoteFileset]: Deleting 1 remote fileset(s) …",
"2022-06-20 23:15:03 -07 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Delete - Started: duplicati-20220614T060000Z.dlist.zip.aes (97.79 MB)",
"2022-06-20 23:15:03 -07 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Delete - Completed: duplicati-20220614T060000Z.dlist.zip.aes (97.79 MB)",
"2022-06-20 23:15:04 -07 - [Information-Duplicati.Library.Main.Operation.DeleteHandler-DeleteResults]: Deleted 1 remote fileset(s)",
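If I'm reading the "time frames and intervals pairs" line correctly, the thinning works roughly like the sketch below (my own approximation in Python, not Duplicati's actual code):

```python
from datetime import datetime, timedelta

# My rough reading of smart retention: within 7 days keep 1 per day, within
# 28 days keep 1 per week, within 365 days keep 1 per month. Illustration only.
FRAMES = [
    (timedelta(days=7),   timedelta(days=1)),
    (timedelta(days=28),  timedelta(days=7)),
    (timedelta(days=365), timedelta(days=31)),
]

def thin(backups, now, frames=FRAMES):
    """backups: datetimes of existing versions. Returns (keep, delete)."""
    keep, delete, last_kept = [], [], None
    for ts in sorted(backups, reverse=True):               # newest first
        age = now - ts
        interval = next((i for frame, i in frames if age <= frame), None)
        if interval is None:
            delete.append(ts)                              # older than every frame
        elif last_kept is None or last_kept - ts >= interval:
            keep.append(ts)                                # first backup in this slot
            last_kept = ts
        else:
            delete.append(ts)                              # another backup in the same slot
    return keep, delete

# With a few of the versions from the log above, only the 6/13 backup lands in
# "delete", matching the "All backups to delete: 6/13/2022 11:00:00 PM" line.
now = datetime(2022, 6, 20, 23, 13)
versions = [datetime(2022, 6, 19, 23), datetime(2022, 6, 14, 23), datetime(2022, 6, 13, 23)]
print(thin(versions, now)[1])   # -> only the 6/13/2022 11:00 PM version
```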

However, the log looks like it’s only cleaning up the dlist files, not the dblock or dindex files. Is that intentional? Why would it not clean up the actual data files as well as the lists?

I'm assuming that, because I no longer have the dlist files for them, I can safely delete anything older than the smart retention window (1 year)? Or I could leave them, in which case I'd have 4 years of un-indexed backups that I could still restore from using the restore option, as long as it can read the dblock or dindex files. Is that correct?

Because you probably have a lot of data blocks (pieces of files you’re protecting) that have never changed since you started using Duplicati.

This is Duplicati's deduplication engine at work. It makes your backup process faster and uses less bandwidth, so it is working as intended.

Your recoverable file versions are determined by your retention settings (1 year maximum in your case).

Certainly not… you will destroy much of your backup data, even for backups that are less than 1 year old.


This will destroy your backup, as noted. Don’t do it.

Features

Incremental backups
Duplicati performs a full backup initially. Afterwards, Duplicati updates the initial backup by adding the changed data only. That means, if only tiny parts of a huge file have changed, only those tiny parts are added to the backup. This saves time and space and the backup size usually grows slowly.

This means that those old files still contain substantial amounts of old data, possibly from your original backup, which later backups build on as a base. Wasted space is eventually cleaned up by compacting (unless you have disabled it).
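A toy example of that dependency (block hashes and volume names are made up, just to illustrate the idea):

```python
# Hypothetical mapping: which block hashes live in which remote dblock volume.
dblock_contents = {
    "duplicati-b1.dblock.zip.aes (uploaded 2018, initial backup)": {"h1", "h2", "h3"},
    "duplicati-b9.dblock.zip.aes (uploaded 2022, recent change)":  {"h4"},
}

# The newest backup version records each file as a list of block hashes.
newest_version = {"bigfile.bin": ["h1", "h2", "h4"]}

needed = {h for blocks in newest_version.values() for h in blocks}
for volume, blocks in dblock_contents.items():
    if blocks & needed:
        print(volume, "is still required by the newest version")
# Both volumes print, so the 4-year-old dblock file cannot simply be deleted
# without breaking restores of backups that are well inside retention.
```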

Compacting files at the backend

Upload volumes (files at the backend) likely contain blocks that do belong to old backups only, as well as blocks that are used by newer backups. Because the contents of these volumes are partly needed, they cannot be deleted, resulting in unnecessary allocated storage capacity.

The compacting process takes care of this. When a predefined percentage of a volume is used by obsolete backups, the volume is downloaded, old blocks are removed and blocks that are still in use are recompressed and re-encrypted. The smaller volume without obsolete contents is uploaded and the original volume is deleted, freeing up storage capacity at the backend.
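As a rough sketch of that decision (the volume sizes are made up, and the 25% figure is my assumption of the default wasted-space threshold, which I believe the --threshold option controls):

```python
# Made-up numbers to illustrate the compact trigger described above;
# not Duplicati's actual implementation.
THRESHOLD = 0.25   # assumed: compact a volume once ~25% of it is wasted space

volumes = {
    # remote dblock volume -> (bytes still referenced, bytes only used by deleted versions)
    "duplicati-b1.dblock.zip.aes": (70_000_000, 30_000_000),
    "duplicati-b2.dblock.zip.aes": (98_000_000,  2_000_000),
}

for name, (live, waste) in volumes.items():
    wasted = waste / (live + waste)
    if wasted >= THRESHOLD:
        # download, drop obsolete blocks, recompress/re-encrypt what is left,
        # upload the smaller replacement volume, then delete the original
        print(f"{name}: {wasted:.0%} wasted -> compact")
    else:
        print(f"{name}: {wasted:.0%} wasted -> leave alone for now")
```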

EDIT:

Yes. A version deletion deletes the dlist file. Compact removes wasted space, changing dblock and dindex files.

Because it can be very time- and bandwidth-consuming to try to compact the entire backup every time.
The COMPACT command does give you some controls if you really dislike the default compact levels.

How the backup process works gets technical, but it covers some of the details of the deduplication method.

Finally, the file C:\data\extra\samevideo.mp4 is processed. Duplicati will treat each block individually, but figure out that it has already made a backup of this block and not emit it to the dblock file. After all 3 blocks are computed, it will then create a new block to store these 3 hashes, but also finds that such a block is already stored as well.

This approach is also known as deduplication, ensuring that each “chunk” of data is stored only once. With this approach, duplicate files are detected regardless of their names or locations.

Here, “already made a backup of this block” refers to it being in a previous (maybe very old) dblock file.
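A minimal sketch of that block-level deduplication (made-up data; I'm assuming the 100 KB default block size and SHA-256 block hashes, so treat the specifics as assumptions rather than Duplicati's code):

```python
import hashlib, os

BLOCK_SIZE = 100 * 1024          # assumed default block size
stored_blocks = {}               # block hash -> block already in some dblock volume

def backup_file(data: bytes):
    hashes, new_blocks = [], 0
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        h = hashlib.sha256(block).hexdigest()
        if h not in stored_blocks:       # only unseen blocks go into a new dblock
            stored_blocks[h] = block
            new_blocks += 1
        hashes.append(h)                 # the file itself is recorded as a list of hashes
    return hashes, new_blocks

video = os.urandom(300 * 1024)           # pretend 3-block video file
print(backup_file(video)[1])             # first copy: 3 new blocks stored
print(backup_file(video)[1])             # samevideo.mp4: 0 new blocks, nothing uploaded
```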


Thank you both for the responses, lots of good information. I've been doing backups on and off in enterprise environments for who knows how long now, and it baffled me that files were still hanging about. Pointing out that Duplicati does one full backup and everything after is incremental clarifies it all for me. No more grandfather-father-son :wink:
