New retention policy deletes old backups in a smart way

Or just 1W:1D,200Y:1W…

I know it’s practically the same, but I think it’s still a valid point to have a semantically logical expression.

If you just want to keep them forever it’s weird that you have to make up some arbitrary number. I mean, I literally don’t care and the expression still made me pick an arbitrary number, that number being 10 years. And more paranoid people, like @JonMikelV, will tell you they picked 100 - but secretly they picked 9999 just to be safe :wink:

That’s not how it works, at least, I hope so.
Backup retention is not about file versions, it thins out backup versions (or restore points, whatever you want to name it).

Every backup version is a representation of everything included in your source selection. If you delete a file and start a backup job, that file will not be part of the most recent backup, but it is still part of all other backup versions (provided that the oldest backup is more recent than the file creation date).

When a file exists on your system, it will be part of the most recent backup. You can restore it from every backup version of the last year. Only backup versions that are older than a year are deleted. So if you modified a file more than a year ago and didn’t modify it since then, only the current version is in your backup.

If you accidentally deleted or corrupted that file, you can successfully restore it from any of the backup versions.

That would be great!
The U value could also be used to replace --keep-versions.

  • --keep-versions=30 could be specified as --retention-policy=U:30.
    --retention-policy=1Y:1W;U:1M:30 would mean: keep 1 weekly backup for the last year, from then 1 monthly backup, but never delete backups if there are fewer than 30.

Discussed at Github:

Somebody has his backup job set to run every day and has set --retention-policy=1M:1D. Usually that would mean 30 backups are kept, if the PC is turned on all the time and created a new backup daily.
Now that person goes on holidays for a month. Upon return Duplicati will then delete all backups except the newest one, since they are all older than a month.

Not seen as a problem in that discussion, but I guess for most people it’s unintended behaviour in this scenario. Adding a minimum number of backup versions would resolve that.
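
To make the holiday scenario concrete, here is a minimal Python sketch of time-frame thinning. It is a simplified model and not Duplicati's actual implementation: the function name thin, the 30-day stand-in for 1M, and the assumption that the newest backup always survives are all illustrative.

```python
from datetime import datetime, timedelta


def thin(backups, now, rules):
    """Return the backups a (time frame, interval) policy would keep.

    Simplified model: each backup belongs to the shortest time frame that
    still covers its age; inside a frame a backup is kept only if it is at
    least one interval newer than the previously kept backup of that frame.
    """
    ordered = sorted(backups)
    kept = set(ordered[-1:])      # assumption: the newest backup always survives
    rules = sorted(rules)         # shortest time frame first
    last_kept = {}                # per frame: time of the last backup we kept
    for backup in ordered:        # walk from oldest to newest
        age = now - backup
        frame = next((r for r in rules if age <= r[0]), None)
        if frame is None:
            continue              # outside every time frame: gets deleted
        previous = last_kept.get(frame)
        if previous is None or backup - previous >= frame[1]:
            kept.add(backup)
            last_kept[frame] = backup
    return sorted(kept)


policy = [(timedelta(days=30), timedelta(days=1))]   # rough stand-in for 1M:1D
daily = [datetime(2018, 1, 1) + timedelta(days=i) for i in range(30)]

# PC running every day: all 30 daily backups are inside the time frame.
before = thin(daily, now=daily[-1], rules=policy)

# One month of holiday, then a single new backup on return.
after_holiday = daily + [daily[-1] + timedelta(days=31)]
after = thin(after_holiday, now=after_holiday[-1], rules=policy)

print(len(before), len(after))
```

In this model the 30 older backups all fall outside the time frame once the new backup is made, so only the newest one remains.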

I hadn’t thought about U:number, only U:date. I guess it reads as “unlimited keep 30”, but I initially misread it as “keep one backup for every 30 days forever” because I missed that there was no letter.
I’m not sure if it’s more user friendly or if it would lead to confusion. I think in that Github thread we ended up deciding that --keep-versions should not become legacy exactly because it’s so much easier to understand than custom expressions.

Also would 1M:30,1Y:1M be valid if 1M:30 is valid? It then reads “Keep no more than 30 versions (deleting the oldest) for 1 month, then keep 1 per month for 1 year”.
If one is valid and the other isn’t, then I think it’s confusing because it’s inconsistent. At the same time it seems weird if you make 30 backups in a row one day, and then there are no backups left from the previous 29 days of the month that can be turned into 1Y:1M.

In the case of U for example I’d say 1M:U,1Y:1M is valid because that then replaces 1M:0s,1Y:1M - Which by the way I always thought was poor semantic form to say “unlimited” by saying 0s and I think it even works with 0D, 0M, or 0Y, which all looks pretty confusing to me.

I think this might be a complicated way of writing 1Y:1W,30M:1M. Although I guess technically, if you missed a full month of backups, then you’d only have 29 backups after two and a half years of backups, instead of 30 (then going back 31 months). But how good a use case is that? Would it make sense?

My main thoughts/concerns are:

  • Specifying a number should always mean “I want at least this number of versions” and not “I want no more than this number of versions”. So 1M:30,1Y:1M should not mean “Delete all versions older than the first 30 versions”, but “Stop deleting when there are 30 remaining versions”. This will prevent deleting more versions than intended.
  • I was thinking about some way to define the total number of versions, not the versions in a specific time frame. I have not thought of the exact syntax, but somehow a trailing number could be supplied, indicating the minimum number of versions to keep. You could add an optional K: to supply the minimum total number of versions.
    Alternatively, you could add an optional third value for each time frame, so 1M:1W would mean “one backup for every week last month”, and 1M:1W:3 would be interpreted as “one backup for every week last month, but stop deleting backups this month if there are 3 remaining backups in this time frame”.
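
The “stop deleting when N versions remain” reading can be sketched as a small guard applied to whatever the retention policy wants to delete. This is only an illustration of the proposal: the function name apply_minimum and the choice to delete the oldest candidates first (sparing the newest) are assumptions, not existing Duplicati behaviour.

```python
from datetime import datetime, timedelta


def apply_minimum(all_backups, to_delete, minimum):
    """Trim a deletion list so that at least `minimum` backups remain.

    Assumption: when the guard kicks in, the oldest candidates are still
    deleted and the newest candidates are the ones that get spared.
    """
    allowed = max(0, len(all_backups) - minimum)
    return sorted(to_delete)[:allowed]


# Holiday scenario: 30 old daily backups plus 1 new one after the break.
backups = [datetime(2018, 1, 1) + timedelta(days=i) for i in range(30)]
backups.append(datetime(2018, 3, 2))
wanted = backups[:30]            # the policy wants to drop all 30 old backups

deleted = apply_minimum(backups, wanted, minimum=20)
remaining = [b for b in backups if b not in deleted]
print(len(deleted), len(remaining))
```

With a minimum of 20, only the 11 oldest backups are removed and 20 versions stay available for restore.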

My main concern is a common use case like 1M:1D. If I choose this policy, my intention is that I have (around) 30 versions to choose from when doing a restore operation.
If my computer is switched off for 1 month (holiday), all backup versions are deleted except 1, which destroys all older file versions without me doing anything. I see this as a potential problem.
Combining with --keep-versions is not supported, so there is no way to keep my retention policy (if I backup twice a day, 50% can be deleted) and keep at least 20 versions. Something like 1M:1D;K:20 would resolve that.

Aaaaah, now that makes sense! I didn’t get that part from the example, but looking back I see what you meant now :slight_smile:

Then something like 1M:30 makes sense and it also answers my concerns about it working in any order. Although it cannot be 30:1M, that makes no sense, right?

I think a letter definition will be good. 1M:K30 cannot possibly be a typo where 1M:30D was accidentally turned into 1M:30 by deleting the letter.

It all makes a lot of sense to me now, though.

Unlike U, which is semantical, K actually supports purposes that were possible by combining retention with keep policy. Except this is more flexible than retention+keep policy.

That’s exactly what I tried to point out. Sorry if I was unclear about that, but this is quite complex stuff. A small change can have all kinds of unwanted side effects.
Didn’t think too much about how to resolve it in detail, but I’m concerned a bit about the number of versions that could be deleted unintentionally.
I guess the syntax can be improved. For example: when choosing a letter for versions to keep, you could choose K (keep), V (versions), N (number of versions) and so on. It’s just a thought.

It is. It’s the classical problem when writing software for arbitrary inputs. There are millions of potential inputs and they all have to work consistently, but they should also cover every type of use case. It’s good to be able to discuss it :slight_smile:

I think that’s a good point.

I think it’s best to settle on one. The syntax is already a bit overwhelming and I’ll admit I spent a good 5-10 minutes designing my first retention policy before entirely understanding what it would mean for my backup.

Running 2.0.2.19_canary_2018-02-12

In order to get the retention policy runs listed in the internal log and via the e-mail reports, I must enable this in the settings:

--log-level=Information

Is this the way it is supposed to be?

What do you mean by “retention policy runs”? Is this any run using the --retention-policy parameter or runs that only do cleanup (because no file changes need backing up)?

With “retention policy runs”, I mean when the configured retention policy is executed - after each backup.

Example excerpt from the log:

Messages: [
[Retention Policy]: Start checking if backups can be removed,
[Retention Policy]: Time frames and intervals pairs: ..

[Retention Policy]: Backups to consider: 2018-02-12..

But if --log-level=Information is not configured, there will be no report in the log. No information at all about the retention policy result.

Great clarification, thanks!

I suspect this is by design due to the potential for a LOT of messages to come out of retention processing, but I should probably step back and let the actual developer of this functionality answer for sure. :slight_smile:

I find that the “result” log for the individual backup job itself does reflect retention policy settings - can you check that also?

  1. [Click Backup Job on Home screen]
  2. “Reporting”
  3. “Show Log”
  4. Select any entry ending with “Result”
  5. “Messages” section near the bottom of the log output

This is exactly what I’m talking about. Here is more excerpt from the “result” log:


Messages: [
[Retention Policy]: Start checking if backups can be removed,
[Retention Policy]: Time frames and intervals pairs: 7.00:00:00 / 00:00:00, 31.00:00:00 / 1.00:00:00, 181.00:00:00 / 7.00:00:00, 3652.00:00:00 / 31.00:00:00,
[Retention Policy]: Backups to consider: 2018-02-12 00:24:23, 2018-02-11 00:06:49, 2018-02-06 23:20:03, 2018-02-01 22:49:28, 2018-01-29 23:11:58, 2018-01-26 23:39:25, 2018-01-24 23:41:15, 2018-01-23 22:08:36, 2018-01-21 23:14:47, 2017-12-27 02:08:44, 2017-12-13 00:54:57, 2017-12-03 23:25:30, 2017-11-23 01:12:00, 2017-11-15 21:59:49, 2017-11-06 00:35:11, 2017-10-29 23:48:54, 2017-10-17 23:44:29, 2017-10-09 00:40:41, 2017-09-10 23:37:07, 2017-07-25 00:14:59, 2017-06-13 23:44:50, 2017-05-11 00:32:47, 2017-03-29 22:49:06,
[Retention Policy]: Backups outside of all time frames and thus getting deleted: ,
[Retention Policy]: All backups to delete: ,

]
Warnings: []
Errors: []


Again, no information about the retention policy check and its result is included if --log-level=Information is not configured.
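
For anyone who wants to check such a log line against the configured policy, the “Time frames and intervals pairs” entry can be read back into durations. The sketch below assumes the spans follow a d.hh:mm:ss layout (with the day part optional), which is what the excerpt above appears to use; parse_span and parse_pairs are hypothetical helper names.

```python
import re
from datetime import timedelta

# Assumed layout: optional "<days>." prefix followed by hh:mm:ss
_SPAN = re.compile(r"(?:(\d+)\.)?(\d{2}):(\d{2}):(\d{2})")


def parse_span(text):
    """Turn '31.00:00:00' or '00:00:00' into a timedelta."""
    match = _SPAN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognised span: {text!r}")
    days, hours, minutes, seconds = match.groups()
    return timedelta(days=int(days or 0), hours=int(hours),
                     minutes=int(minutes), seconds=int(seconds))


def parse_pairs(line):
    """Split 'frame / interval, frame / interval, ...' into timedelta pairs."""
    pairs = []
    for chunk in line.split(","):
        if not chunk.strip():
            continue                      # skip the empty piece after the trailing comma
        frame, interval = chunk.split("/")
        pairs.append((parse_span(frame), parse_span(interval)))
    return pairs


line = ("7.00:00:00 / 00:00:00, 31.00:00:00 / 1.00:00:00, "
        "181.00:00:00 / 7.00:00:00, 3652.00:00:00 / 31.00:00:00,")
pairs = parse_pairs(line)
for frame, interval in pairs:
    print(frame.days, "days, keep one per", interval)
```

An interval of zero corresponds to “keep everything” inside that time frame.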

So are you saying all that stuff doesn’t appear in the Result log if you don’t have that setting set? Because it shows up in my logs seemingly regularly.

Yes, no stuff at all about the retention policy appears if I don’t have that setting set. Same thing on all 4 computers which I have Duplicati installed on.

Weird, since I’ve never set that setting AFAIK and yet the retention policy stuff shows up in mine.

Someone else has the problem on the latest Canary, and so do I. I have backups with the retention option “Keep all backups”. When I change to the new retention options, save, and then re-edit the config, it shows “Keep all backups”.
But when I look at the export command line view, it shows that it has saved --retention-policy="1W:1D,4W:1W,12M:1M".
The UI is highly confused.

Sorry for not responding earlier to this. Have been a bit busy this week.

Unlimited timeframe and interval
@Pectojin and @kees-z

I’ve read through your discussion and thought about it for quite a while.

I quite like your idea of having U for „Unlimited“ for both the timeframe and the interval. So you could for example have 1W:U,1Y:1W,U:1M:

  • 1W:U would mean “For one week keep all (=unlimited) versions”. Internally it’ll work the same as specifying 0s for the interval, and 0s will still be valid as an interval for this functionality.
  • 1Y:1W still means “For one year keep a version at the interval of one week”. No change here in how it works.
  • U:1M would mean “For unlimited time keep a version at the interval of one month”. This could replace the need for having to specify a very long timeframe like 99Y. It’ll always be applied after the last rule, so for example if you also have a 99Y:2W timeframe, then that one will be used before the U:1M timeframe.
    (Internally the unlimited timeframe will still work with a specific date of January 1st, 0001, see DateTime.MinValue)
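
A rough Python sketch of how the proposed U values could be interpreted when parsing a policy string. This illustrates the proposal only and is not Duplicati code: parse_policy is a hypothetical name, only the unit letters that appear in this thread are handled, months and years are approximated as 31 and 365 days, and None stands in for the unlimited time frame instead of a fixed minimum date.

```python
import re
from datetime import timedelta

# Approximate unit lengths in seconds (assumption: M = 31 days, Y = 365 days)
UNIT_SECONDS = {"s": 1, "D": 86400, "W": 7 * 86400, "M": 31 * 86400, "Y": 365 * 86400}


def parse_value(text):
    """Return a timedelta, or None when the value is the proposed 'U'."""
    if text == "U":
        return None
    match = re.fullmatch(r"(\d+)([sDWMY])", text)
    if match is None:
        raise ValueError(f"unrecognised value: {text!r}")
    return timedelta(seconds=int(match.group(1)) * UNIT_SECONDS[match.group(2)])


def parse_policy(policy):
    """Parse 'frame:interval,...' into (frame, interval) pairs, unlimited frame last."""
    rules = []
    for part in policy.split(","):
        frame_text, interval_text = part.split(":")
        frame = parse_value(frame_text)          # None means "for unlimited time"
        interval = parse_value(interval_text)
        if interval is None:
            interval = timedelta(0)              # U as interval behaves like 0s
        rules.append((frame, interval))
    # Finite frames in ascending order; the unlimited frame is always applied last.
    rules.sort(key=lambda rule: (rule[0] is None, rule[0] or timedelta(0)))
    return rules


rules = parse_policy("U:1M,1W:U,1Y:1W")
for frame, interval in rules:
    print("unlimited" if frame is None else frame, "->", interval)
```

Sorting the unlimited frame to the end matches the idea that a finite rule such as 99Y:2W is used before U:1M.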

Minimum number of backups to keep
As for the idea to keep a minimum amount of versions, I’m still quite sceptical, especially since it’s a bit hard for me to understand why having a certain amount of backups supposedly results in a “safer” situation. IMHO this somewhat goes against what the retention policy is supposed to do, that is, deleting outdated backups and keeping the backup list short-ish, while still keeping frequent backups of recent changes.

But ok, that is only my humble opinion. I tried to show some pitfalls in the following example:

Here the two timeframes now have the proposed new K:X rule for keeping a minimum number of backups per timeframe (see row 1). But with these “Keep” rules per timeframe Duplicati would:

  • still keep fewer backups if no backups have been created for a while, e.g. the user is gone or the source data wasn’t modified so Duplicati didn’t create backups (see row 3 and 6)
  • suddenly violate the retention policy and stop deleting backups even though there are already quite a lot more recent backups (see row 12 and onwards).

What might work slightly better is to define a global, timeframe-independent “Keep” rule which overrules some of the decisions the retention policy made. But this then conflicts with the new “Unlimited” timeframe rule from above. For example a user might specify K:10 and U:1M. Then, after 10 months, he will have accumulated 10 backups and since these don’t get deleted, he will always have 10 or more backups from now on, rendering the K:10 useless for protecting newer backups after long absence.
Also this global “Keep” rule might be confusing as to how it differs from the --keep-versions option.
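
The interaction between a global K:10 and U:1M comes down to simple arithmetic, sketched here in Python. The model is idealised (exactly one monthly backup retained per month of history) and the function name is made up for illustration.

```python
def extra_backups_protected(months_of_history, minimum_total=10):
    """How many additional backups a global keep rule would spare from deletion.

    Idealised assumption: U:1M retains exactly one backup per month forever,
    so the long-term backups alone count toward the minimum.
    """
    long_term = months_of_history
    return max(0, minimum_total - long_term)


for months in (1, 5, 10, 24):
    print(months, "months of history ->", extra_backups_protected(months), "extra backups protected")
```

Once the monthly backups alone reach the minimum, the keep rule no longer shields any of the recent backups.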

Logging with level Information

That’s because the log messages are logged with the priority Information in the code. If you’d set --log-level=Profiling and a log file via --log-file=... then you’d get even more messages from the retention policy run (among other things). These additional messages won’t show up in the Messages: [...] output though. I think Duplicati in general limits messages in this area to the level Information and above, maybe to not clutter the output there too much. But I’m not 100% certain about that.
As to why @drakar2007 doesn’t have to specify the level in the options of the backup job: Maybe he has it set to Information globally in the Default options sections of the Settings, so there isn’t any need to set it on a per job basis?
