New retention policy deletes old backups in a smart way

You bet - glad it helped!

Apologies - newbie alert. :slight_smile:

Is there a way to configure Duplicati (with or without Smart Retention) in such a way that a file that has been deleted from the source will always be available in the backup? (For instance, even after 13 months, when using Smart Retention.)

Thanks!

If you do “keep all versions” that will of course keep even deleted files.

Beyond that, sort of.

If using a retention policy, you should use Custom and be sure to include a final rule like @drwtsn32’s 99y:1y, which will keep a copy (even of deleted files) for up to 99 years BUT… only if the file is in the ONE backup chosen to be kept for the year.
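For illustration only (the specific timeframes here are just an example, not a recommendation), such a Custom policy string could look like:

    --retention-policy=7D:U,4W:1D,12M:1W,99Y:1Y

read as: for 7 days keep every version (U = unlimited), for 4 weeks keep one per day, for 12 months keep one per week, and for 99 years keep one per year.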

It’s not perfect - in fact it’s easy to “lose” files in the time gaps but it’s the best we’ve got for now.

Personally, I’d love a secondary retention policy for deleted files - including never remove.

Thank you!

I wasn’t sure about “keep all versions”, in the scenario where the local file got deleted. Good to know how that works. This really helps with designing my backup strategy.

FWIW, I like the way Bvckup handles deleted files. There’s an option to ‘archive’ deleted files - when Bvckup notices a file got deleted, it moves that file (on the backup side) to an ‘Archive’ directory.

I was looking for a solution and found this thread discussing the same problem. But I still haven’t found a proper solution. What I want is a retention policy that keeps all versions for a short while, say, a month, and then keeps only one version for up to three years. Based on the discussion here, I tried “--retention-policy=1M:U,3Y:3Y”, but that gave me an error message. I figured it was the “3Y:3Y” part that was not accepted.

So I have to ask what is the recommended way to do it?

For now, I have “--retention-policy=1M:U,2M:1M,3Y:2Y”, which in my understanding should do what I’m trying to achieve, but surely this cannot be the recommended way?

When you say “one version” for up to three years, do you mean one file version? Because that’s not how Duplicati backup retention works. Retention is by entire backup set versions - it does not support retention at the file version level.

So your 1M:U,3Y:3Y (if it were accepted) would cause all but one backup version to be deleted after one month. And after 3 years you’d have zero backup versions retained.
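To spell that out (my reading of that policy, assuming regular backups): everything younger than one month is kept by the 1M:U part; as backups age past one month they fall under 3Y:3Y, which allows only one backup per three-year interval, so all but one of them get purged; and once a backup is older than three years, no rule covers it at all, so it gets deleted too.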

I had not grasped that yet. Maybe I have too long a history with version-control systems that build up the history of a configuration from the history of its files (unlike Git, which is revolutionary in that regard).

I will have to rethink the retention policy for all of my backups now.

I may be missing the reference given, but if you consider a configuration to be conceptually like a backup version, Duplicati builds each version as a set of files, where the history of any given file is represented by the blocks it contains, whether those blocks are new or already stored. That’s one way block-based deduplication happens, so the space use of a slightly changed version of a backup is a small upload of changes, plus a lot of references.
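To make the blocks-plus-references idea concrete, here is a rough Python sketch of the general principle. It is not Duplicati’s actual code or file format; the block size, hash choice, and the stored_blocks/upload plumbing are all assumptions for illustration.

    import hashlib

    BLOCK_SIZE = 100 * 1024  # assumed fixed block size, purely for illustration

    def backup_file(path, stored_blocks, upload):
        """Split a file into blocks and upload only blocks not yet stored.
        Returns the list of block hashes that represents this file version."""
        hashes = []
        with open(path, "rb") as f:
            while True:
                block = f.read(BLOCK_SIZE)
                if not block:
                    break
                h = hashlib.sha256(block).hexdigest()
                if h not in stored_blocks:   # new data: upload it once
                    upload(h, block)
                    stored_blocks.add(h)
                hashes.append(h)             # new or old, the version references it
        return hashes

A slightly changed file then costs only the few new blocks; everything else is just another reference to blocks already at the destination.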

Block-based storage engine

How the backup process works

Someone with many frequently changing files will face space issues, and deleting versions will help; however, any file that arrived and disappeared between the surviving versions will not be in any version at all.

I’m not clear on the reason for “only one version for up to three years”, but I wanted to mention the design.

A backup system is not a VCS, and I am not trying to equate them in any way. I am just explaining where I am coming from; I have a long history with coding and VCS tools.

I hope this helps.

The difference between Git and all its predecessors is that the traditional VCS tools hold file versions as the primary objects and derive the history of a configuration from file versions. This derivation order has been traditionally motivated by performance aspects, but is often incomplete or otherwise flaky. This is where Linus Torvalds turned the tables quite completely. In Git, the version of a configuration is the primary object, and if one needs at all to see the version history of a file (as a tree or some such presentation), it can be derived from the history of the configuration that contains the file, but the derivation process is not always unambiguous.

So which is more important, the history of the whole configuration or the history of its individual files? In a VCS, the configuration is always more important.

When it comes to backup sets, the individual files are usually more important than the state of the whole set, but not always. So now I will have to figure out what I actually need.

Phew. That’s a lot, so before heading into small corrections, let me hit what I think is the major worry.

A backup version is the set of files as they were seen at backup time. Compared to the previous backup, a file might be the same or might have changed.

Files that are totally or partially the same use deduplication, so that only new blocks need to be uploaded.

Blocks are not deleted until no version uses them. Deleting the first version means you lose things (perhaps a file version) only available in that version. Anything that continues to exist in later versions remains; otherwise the idea of versioned backups wouldn’t make much sense. File presence (or absence) gets versioned in the same way.
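Continuing the sketch from earlier in the thread (again just an illustration, not Duplicati internals): a block only becomes waste when no remaining backup version references it.

    def unreferenced_blocks(all_blocks, versions):
        """versions: list of backup versions, each a dict {file_path: [block_hashes]}.
        A block is deletable only when no remaining version references it."""
        still_used = set()
        for version in versions:
            for block_hashes in version.values():
                still_used.update(block_hashes)
        return all_blocks - still_used

    # Deleting version v1 only frees the blocks no other version uses:
    v1 = {"a.txt": ["h1", "h2"]}
    v2 = {"a.txt": ["h1", "h3"]}   # a.txt changed partially in the next backup
    print(unreferenced_blocks({"h1", "h2", "h3"}, [v2]))   # {'h2'} once v1 is gone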

You might have seen caveats in forum posts that thinning means a short-lived file, or a file version present in no other backup version, will no longer be available once the whole-backup version it’s in gets deleted.

Having things fall off the end, or get trimmed out of the middle, only matters when you need to recover the file as it was back then. Setting a maximum age, for example, will give you that long after a deletion to realize you want the file back. Trimming in the middle bets that a new file you created and then deleted was not really all that important, or that for a really old file, a surviving version close enough in time will do. It’s your decision.


Thank you, I did not catch this point while reading the docs and browsing the forum. This is key to understanding for people like me who are not familiar with versioning systems or are changing over from traditional backup software.

One additional note on that: blocks are not deleted instantly, but periodically, once enough of them build up. However, this is just how wasted-space cleanup is done and doesn’t relate to restoring what’s deleted from the backup. The backup protects against source deletions. You decide how long that protection lasts.

Compacting files at the backend

When a predefined percentage of a volume is used by obsolete backups, the volume is downloaded, old blocks are removed and blocks that are still in use are recompressed and re-encrypted. The smaller volume without obsolete contents is uploaded and the original volume is deleted, freeing up storage capacity at the backend.

The backup process explained tries to describe the above wasted-space removal in less technical terms:

From time to time, Duplicati will notice that there are a few bags that contain bricks it does not need anymore. It grabs those bags, sorts the bricks. It throws away the bricks that are not needed anymore, then it puts the required bricks into new bags and puts them back into the box.
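A minimal sketch of that compaction decision, under assumed numbers (the real threshold option, volume handling, and re-encryption in Duplicati are more involved):

    def needs_compacting(volume_blocks, still_used, threshold=0.25):
        """volume_blocks: {block_hash: size} for one remote volume.
        threshold: assumed fraction of wasted space that triggers a rewrite."""
        wasted = sum(size for h, size in volume_blocks.items() if h not in still_used)
        total = sum(volume_blocks.values())
        return total > 0 and wasted / total >= threshold

    def compact(volume_blocks, still_used):
        """Keep only the blocks some backup version still references;
        these would be repacked into a new, smaller volume and re-uploaded."""
        return {h: size for h, size in volume_blocks.items() if h in still_used}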

Thank you, with your help I feel I have a clearer view of Duplicati now. Let me reflect in my own words, trying to hit a level of detail that covers many newbie comprehension questions in the forum, including mine. I deliberately use “version” and “backup” as distinct terms, not generalised, and I get along without “dblocks”, “dlists”, “database” and the cleanup procedure, which are beyond the basics. Please correct me if I have something wrong, as this or that might still be hypothetical:

Duplicati is a backup and version-tracking tool at the file-system level. The smallest distinct objects are file chunks of a specified size. File chunks and their directory paths are referenced and indexed by their unique hashes, together with the hash and the timestamp of the file they belong to.

At every scan of the source, Duplicati looks for changed files or folders, which is recognized solely by comparing the full paths and timestamps of files and folders against the previous index. Only file versions that are present at scan time get indexed in the versioning system, which makes an indexed “version”. The scans are triggered by a scheduler or manually; each scan is also called a run.

At every run Duplicati breaks down new or changed files into chunks and records their hashes in the index of the current version, while the signatures of unchanged files are copied from the previous version’s index.
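To check my own understanding, a small Python sketch of that scan step, reusing the backup_file() sketch from earlier in this thread. This mirrors my description above, not Duplicati’s actual change detection (which has more options and checks); the data shapes are assumptions.

    def scan(source_files, previous_index, stored_blocks, upload):
        """source_files: {path: timestamp}; previous_index: {path: (timestamp, [block_hashes])}.
        Unchanged files (same path and timestamp) just reuse their hash list from the previous
        version; new or changed files are re-chunked with backup_file()."""
        new_index = {}
        for path, timestamp in source_files.items():
            prev = previous_index.get(path)
            if prev and prev[0] == timestamp:        # unchanged: copy the old signature
                new_index[path] = prev
            else:                                    # new or changed: chunk and upload new blocks
                new_index[path] = (timestamp, backup_file(path, stored_blocks, upload))
        return new_index                             # this becomes the new indexed "version"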

The purpose of the index is to manage a backup copy of all versions of the source. Duplicati keeps and maintains a single copy of each unique chunk at the backup destination. Chunks which appear at the source in multiple copies of a file (duplicates), in multiple file versions or in multiple indexed versions are kept only once at the backup destination. This method saves backup space and traffic and still allows restoring every indexed version, or parts of it. When a file gets restored, only chunks whose hashes do not match the index need to be replaced by chunks downloaded from the backup destination.

The volume at the destination grows with every new chunk. To limit growth, retention rules can be applied to the index collection to remove dispensable versions. Chunks which are no longer referenced by any indexed version get purged at the backup destination. Duplicati has three types of rules for scheduled retention, that is
--keep-versions, which erases the oldest indices once their number is exceeded,
--keep-time, which erases indices that are older than specified, and
--retention-policy, which erases indices (except for the newest) that do not fit into a schema of timeframes and intervals:
The customizable schema comprises one or more overlapping timeframes that all start at the present and should have different lengths. For each timeframe a certain number of backup versions is kept at a specified regular interval, e.g. “during the next 4 weeks keep 1 version a day” and “during the next 6 months keep 2 versions per month”. By design, the rules overlap, but shorter timeframes have priority and effectively cut off time from the longer rules. As indices age through the schema, they are passed from one rule to the next. Every rule checks whether an entering index fits into the interval, and if the distance to the preceding (older) index is too short, the newer index gets purged. The most current index is excluded, so whenever you run a backup manually it is kept until the next run; then the rules apply.
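To test my reading of that schema, here is a small Python sketch of the thinning logic as I described it above. It is purely illustrative, not Duplicati’s actual implementation; ordering and edge cases of the real code may differ.

    from datetime import datetime, timedelta

    def thin(backup_times, rules, now=None):
        """Simplified retention-policy thinning as described above (illustration only).
        backup_times: datetimes of existing backup versions.
        rules: list of (timeframe, interval) timedeltas; interval=None means keep all ('U')."""
        now = now or datetime.now()
        backups = sorted(backup_times)                    # oldest first
        newest = backups[-1]                              # the most recent backup is always kept
        keep = {newest}
        frame_start = timedelta(0)
        for timeframe, interval in sorted(rules, key=lambda r: r[0]):   # shorter frames first
            in_frame = [b for b in backups
                        if b != newest and frame_start <= now - b < timeframe]
            last_kept = None
            for b in in_frame:                            # oldest first within the frame
                if interval is None or last_kept is None or b - last_kept >= interval:
                    keep.add(b)
                    last_kept = b
                # otherwise b is too close to the last kept (older) one and gets purged
            frame_start = timeframe                       # shorter rules cut time off the longer ones
        return sorted(keep)                               # anything older than the longest frame is gone

    # e.g. "--retention-policy=7D:U,3M:1W" would become:
    rules = [(timedelta(days=7), None), (timedelta(days=90), timedelta(weeks=1))]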

(Post edited after being discussed)

I think you’re in deeper than most newbies, but the basic sketch here seems mostly right, except:

The most recent backup is exempt from the interval rule. It makes no sense to delete this one.
There are ways to force that, but I don’t know why you’d want to. For details, the code is here.

When some version ages into a longer timeframe, everything there is already spaced properly.
The interval violation would be by the newer version. What else can you purge to resolve that?
If you say purge the one just older, then do that each time, wouldn’t that build into a giant gap?

I’m commenting on user-visible functionality. For internals, maybe move to the Developer category.

Found it: the code removes the current index (0) from the working list so it cannot be added to the purge list. Only in the next run, when a new index has been added, is it no longer #0 and treated as normal (checked against the interval).

I’m totally with you, just wanted to point it out.

Thank you for your review and clarifications. I will edit my post next, if that is still possible, so as not to confuse other readers with wrong hypotheses.

There are lots of things that could be explained better, preferably in the manual, not buried in a forum post.
Sometimes it’s helpful to explain things at several different levels (as the manual does for backup design); however, it stops at a depth that is needlessly deep (e.g. deep internals) for the average reader.

GUI smart and custom backup retention aren’t covered #83 already covers some things that might be useful.
If you can get back to standard terminology without inventing too many new words, your writeups may
be helpful to cover what I mentioned there. If you do pictures or GitHub PRs, that’d be even better. Or
make a new forum topic as a drafting area, including the formatting. That’s the next best thing to a PR.

I finally made friends with ‘versions’ and wrote my first pull request on GitHub where I proposed a help text for --retention-policy. Thx for help!


The new help text for --retention-policy is online now. It incorporates all the clarifying and still-valid statements found in the forum.
