Unexpected results with deleted files when using Custom and Smart Retention settings

Jochem · June 9, 2022, 12:53pm

In this post I’ll discuss the effect of Custom and Smart Retention settings on deleted files.

In short: deleted files may be gone long before you expect them to be.

Please note the following:

Duplicati works as designed
According to a long read of forum posts: many users will not expect this (but some already know)
I could not find this in the manual pages
Maybe I do not correctly understand the Duplicati backup model; it would be nice if an expert could comment to confirm or deny the facts stated in this post

In order to understand the behavior, first we must know:
How does Duplicati make backups?

Your source is one or more directory trees, for which you want to maintain backups
With each full and uninterrupted backup, Duplicati creates a snapshot of your complete source. This snapshot is also called backup, backup-set or version
The snapshot essentially is a complete listing of your source, including (a reference to) its contents
The content of the files (split into blocks) is maintained separately by Duplicati in dblock volume files
The snapshot refers to this content (by block hashes), and unchanged content is saved only once

Please note that a snapshot essentially is a point-in-time listing/backup of your source. A snapshot only has knowledge of objects that exist in the source at that moment.
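To make the model above concrete, here is a minimal Python sketch. It is not Duplicati code: the names `backup` and `store`, the plain dictionaries, and the tiny block size are all assumptions chosen for illustration. It only shows the idea that a snapshot is a listing of paths with references to content, and that content already in the store is not saved again.

```python
import hashlib

BLOCK_SIZE = 4  # deliberately tiny, hypothetical value so the example stays readable

def backup(source, store):
    """Create a snapshot of `source` (dict: path -> bytes).

    `store` (dict: hash -> block) is shared between backups, so a block
    that is already present is not stored a second time.
    Returns the snapshot: dict of path -> list of block hashes.
    """
    snapshot = {}
    for path, data in source.items():
        hashes = []
        for start in range(0, len(data), BLOCK_SIZE):
            block = data[start:start + BLOCK_SIZE]
            digest = hashlib.sha256(block).hexdigest()
            store.setdefault(digest, block)  # save content only if it is new
            hashes.append(digest)
        snapshot[path] = hashes
    return snapshot

store = {}
first = backup({"a.txt": b"hello world", "b.txt": b"hello"}, store)
blocks_after_first = len(store)

# Second run: b.txt was deleted from the source, a.txt is unchanged.
second = backup({"a.txt": b"hello world"}, store)

print(blocks_after_first, len(store))  # 4 4
print("b.txt" in first, "b.txt" in second)  # True False
```

The second snapshot says nothing about `b.txt` at all. It does not record a deletion; the file is just not listed.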

Ok, what does all this mean?

For the first backup, Duplicati creates a snapshot (listing of all source objects) and saves all contents
For an unchanged source, each Duplicati backup creates a snapshot (listing), but does not save any content, because this is not needed as it is unchanged
For a source with changed content, only the changes are propagated. The old content is kept as long as other, older, snapshots refer to it
For a source with deleted files, nothing special is done. Deletions are not recorded in the snapshot; the deleted files are simply absent from it. The contents of deleted files are kept as long as other, older, snapshots refer to them
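The last point, content being kept only while some snapshot still refers to it, can be sketched as follows. This is an illustration under assumptions, not Duplicati's implementation: snapshots are plain dictionaries, and `h1`, `h2`, `h3` are placeholder hashes.

```python
def referenced(snapshots):
    """Return the set of block hashes that at least one snapshot refers to."""
    live = set()
    for listing in snapshots.values():
        for hashes in listing.values():
            live.update(hashes)
    return live

# Snapshot number -> {path: [block hashes]}; all values are made up.
snapshots = {
    0: {"keep.txt": ["h1"], "short.txt": ["h2"]},  # short.txt exists only here
    1: {"keep.txt": ["h1"]},                       # short.txt was deleted
    2: {"keep.txt": ["h1", "h3"]},                 # keep.txt grew
}

before = referenced(snapshots)
del snapshots[0]                 # retention removes the oldest snapshot
after = referenced(snapshots)

reclaimable = before - after
print(reclaimable)  # {'h2'}
```

Once snapshot 0 is gone, nothing refers to `h2` any more, so the content of `short.txt` can be reclaimed and the file is no longer restorable.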

Suppose you make a backup every day for an entire year. You will have 365 snapshots, or backup-sets or versions. That can be a lot of data to back up.
Suppose you are happy with backup thinning and you only need one backup a day for the first 7 days, one per week for the next 4 weeks and one per month for a year after that.

This can be done with a Custom Retention setting: 7D:1D,4W:1W,12M:1M (a Smart Retention is just a preset Retention Setting of this kind).
The effect is that the number of snapshots to keep will be reduced to 7 + 4 + 12 = 23, and the oldest one will be about 12 months plus 4 weeks plus 7 days old.
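The arithmetic above can be reproduced with a small parser for the rule string. This is a sketch under assumptions: the helper names are made up, only the units D, W and M are handled, and a month is approximated as 30 days. It only repeats the count from this post; whether the time frames stack end to end or overlap is not settled here.

```python
# Approximate day counts per unit; the 30-day month is an assumption of this sketch.
UNIT_DAYS = {"D": 1, "W": 7, "M": 30}

def to_days(span):
    """Convert a span such as '4W' into a number of days."""
    return int(span[:-1]) * UNIT_DAYS[span[-1]]

def parse_policy(text):
    """Turn '7D:1D,4W:1W' into [(7, 1), (28, 7)], i.e. (time frame, interval) pairs."""
    rules = []
    for pair in text.split(","):
        frame, interval = pair.split(":")
        rules.append((to_days(frame), to_days(interval)))
    return rules

rules = parse_policy("7D:1D,4W:1W,12M:1M")
per_rule = [frame // interval for frame, interval in rules]

print(rules)          # [(7, 1), (28, 7), (360, 30)]
print(sum(per_rule))  # 23
```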

For source objects that stay present in your source, all works as expected. Nothing to tell here. For source objects that are deleted, this does not work as I expected (and, I fear, as many others expect). Why is that?

Somehow, I expect that deleted files follow the Custom Retention rule as set up and that the deleted file will be retrievable for the total time, a bit over a year in the example. But this is not how it works. Depending on how long a file existed before it was deleted, it can be retrievable as expected, or be gone after a week!

Worst case:

One backup each day and a Custom Retention rule of 7D:1D,4W:1W,12M:1M
Create a file and run a daily backup. The file now exists in one snapshot/backup-set/version (version 0*, where * means it contains the file)
Next day, delete the file and back up. All backups from this day on do not contain the deleted file. We now have two snapshots (0,1*)
After 6 more days, we have 8 snapshots and backup thinning kicks in. But there is only one snapshot in the week bin, so I think all 8 are still kept (0,1,2,3,4,5,6,7*)
After another day, we have 9 snapshots: 7 for the first 7 days and 2 for the week after that (0,1,2,3,4,5,6,7,8*). I think the very oldest snapshot is now deleted.

Conclusion: after nine days your deleted file is lost forever.
Please note that I do not fully understand snapshot selection during thinning. It could be that in this example the oldest snapshot is kept. If so, add an extra day and backup before step 2.
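The worst case can be tried out with a small simulation. This is a simplified model and not Duplicati's code. It assumes that every time frame is measured back from today, that within a frame the snapshots are walked from oldest to newest and one is kept only when it is at least one interval later than the previously kept one, that anything older than the longest frame is dropped, and that a month is 30 days. The function names are made up for this sketch.

```python
POLICY = [(7, 1), (28, 7), (360, 30)]  # (time frame, interval) in days, for 7D:1D,4W:1W,12M:1M

def thin(snapshots, today, policy=POLICY):
    """Return the snapshot days that survive one thinning pass (assumed rule, see above)."""
    remaining = sorted(snapshots, reverse=True)  # newest first
    kept = []
    for frame, interval in sorted(policy):       # shortest time frame first
        in_frame = []
        while remaining and remaining[0] >= today - frame:
            in_frame.append(remaining.pop(0))
        last = None
        for day in reversed(in_frame):           # oldest first within the frame
            if last is None or day - last >= interval:
                kept.append(day)
                last = day
    return sorted(kept)                          # whatever is left in `remaining` is dropped

def day_file_is_lost(file_days, policy=POLICY, horizon=400):
    """Run one backup plus thinning per day; return the first day on which no kept
    snapshot contains the file. `file_days` are the backup days that saw the file."""
    snapshots = []
    for today in range(horizon):
        snapshots = thin(snapshots + [today], today, policy)
        if today > max(file_days) and not file_days & set(snapshots):
            return today
    return None

# Day 0: a backup without the file. Day 1: file exists. Day 2: file deleted.
print(day_file_is_lost({1}))  # 9
# The file is in the very first snapshot ever made.
print(day_file_is_lost({0}))  # 361
```

Under these assumptions the very first snapshot is always the oldest in its frame and survives, which is the case raised in the note above. With one extra backup before the file is created, the only snapshot holding the file is thinned away on day 9.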

The issue here is that short-lived files can be lost before you know it. Short-lived means shorter than the largest thinning interval, in this example one month. The reason is that thinning only looks at snapshot dates, nothing else. Again: Duplicati works as designed.

For me, retrieval of short-lived files must be possible in my total backup retrieval period, say one year. Deleted files must be retrievable like all other files. For this use case, using a Smart or Custom Retention rule is not possible. The solution simply is to use another rule (Delete backups that are older than).
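For comparison, a rule that only looks at age can be sketched in a few lines. This is an illustration, not Duplicati's implementation: `keep_by_age` is a made-up name, days are plain integers, and the exact boundary (whether a snapshot of exactly the maximum age is kept) is an assumption.

```python
def keep_by_age(snapshot_days, today, max_age_days):
    """Keep every snapshot that is at most max_age_days old; nothing in between is thinned."""
    return [day for day in snapshot_days if today - day <= max_age_days]

file_day = 1  # the only snapshot that contains the short-lived file

for today in (9, 40, 366, 367):
    kept = keep_by_age(range(today + 1), today, max_age_days=365)
    print(today, file_day in kept)
# 9 True
# 40 True
# 366 True
# 367 False
```

Because no snapshot is removed before it reaches the age limit, a file that was seen by only one backup stays retrievable for the whole period, at the price of keeping every snapshot.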

I wonder if this aspect was on the radar when custom retention rules were designed. I feel they are only usable for very advanced users.

So please be aware of the effect of Smart or Custom Retention rules if you care about deleted files!

Xavron · June 9, 2022, 2:10pm

So what you’re saying is that if you use e.g. smart backup and delete a file locally, it then gets fully deleted from Duplicati backups within some days?

That definitely doesn’t protect from accidental deletions that show up months later. That should not be an advanced user thing for sure. I would think it’s just an oversight.

According to the following then this would indeed be an incorrect situation or bug as smart backup says it will have a limit for 12 months (There will remain one backup for… each of the last 12 months:)

“any file deleted from the source will subsequently be deleted from the backup when it reaches the retention policy limit.”

Jochem · June 9, 2022, 2:56pm

Depending on the date the now deleted file was created and depending on the Custom/Smart Retention rule: yes, this could happen.

I cannot immediately find the source of your citation, so I cannot comment on it due to missing context. But it seems incorrect indeed.

The issue here is that during snapshot/backup thinning (that is what Custom/Smart Retention does), Duplicati cannot manage anything regarding deleted files; it has no knowledge of deletions. It is not part of the conceptual backup model. The result is that snapshots may be removed that contain the only copy/copies of the deleted files. This can be anywhere in the backup time frame, i.e. before the retention policy limit, depending on how long the file has lived.

Xavron · June 9, 2022, 2:57pm

The source is linked in my post. Not sure why you’re not seeing it.

Ah, so you need to have snapshots enabled for this problem then? It’s disabled by default. Might be a byproduct of that use. I don’t use snapshots and haven’t looked into that, so I don’t know how it stores data there.

From what I can see, it should just use the snapshot to gain the data and adds it to the backup. Maybe there’s some other complex situation though. I’d have to fully re-read everything. Maybe your use is too complex here. Someone else might see what’s going on here.

Jochem · June 9, 2022, 3:06pm

Missed it. Indeed the statement in JonMikeIV’s post is generally incorrect. It is only true when the now deleted file has existed longer than the retention limit. Only then does it behave as I (and I think many others) expected.

No, my use of snapshot is a descriptive term to explain the way Duplicati works. The --snapshot-policy is something different and has nothing to do with my post. Sorry for the confusion.

Xavron · June 9, 2022, 3:12pm

If I understand correctly, it could be that as it deletes backups, files in those are lost, and it incorrectly loses files. One would have to look at the code and see how it’s dealing with backup deletions for the smart feature. It should not delete a backup if it holds a file whose timestamp is not beyond the limit and which is not found elsewhere, or something like that.

At least that’s my quick guess here as to why it could be failing to be correct.

Jochem · June 9, 2022, 3:20pm

We don’t have to look at the code. Due to the Duplicati conceptual model, i.e. the way it works, it has no knowledge of deleted files. It cannot manage deleted files. Normally this is not a problem. It’s only with Smart or Custom retention rules that things can happen with (short-lived) deleted files. They may disappear. Maybe it’s an oversight, and I think it cannot and should not be repaired. It’s just how it works. (In order to get Duplicati to manage deleted files, it has to gain knowledge about this, e.g. do comparisons between snapshots/backup-sets/versions. I guess this is a bridge way too far.)
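What such a comparison between snapshots could look like is sketched below. It is purely illustrative and not something Duplicati does: the function names are hypothetical and a snapshot is reduced to a label plus a set of paths.

```python
def holders_of_deleted(snapshots):
    """snapshots: list of (label, set of paths), ordered oldest to newest.

    Returns {path: [labels]} for every path that is missing from the newest
    snapshot, i.e. every file that has been deleted from the source."""
    newest_paths = snapshots[-1][1]
    holders = {}
    for label, paths in snapshots:
        for path in paths - newest_paths:
            holders.setdefault(path, []).append(label)
    return holders

def sole_holders(snapshots):
    """Labels of snapshots that hold the only remaining copy of some deleted file."""
    return {labels[0] for labels in holders_of_deleted(snapshots).values() if len(labels) == 1}

snapshots = [
    ("day0", {"a.txt"}),
    ("day1", {"a.txt", "tmp.txt"}),  # tmp.txt lived for one backup only
    ("day2", {"a.txt"}),
]

print(holders_of_deleted(snapshots))  # {'tmp.txt': ['day1']}
print(sole_holders(snapshots))        # {'day1'}
```

A thinning step that knew about `sole_holders` could refuse to remove `day1`. Thinning by snapshot date alone has no such information.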

Just use another rule (Delete backups that are older than) and you are safe.

Xavron · June 9, 2022, 3:39pm

Oh, my reference to deleted wasn’t local. It was referring to the backup data.

Either way, sure, if it doesn’t find a local file that is in the DB, it’s been deleted or moved (moved is half the same as deleted), so it can know anyway. That’s generally how it works in programming unless code can check in trash.

ts678 · June 10, 2022, 1:18pm

Welcome to the forum @Jochem

It probably depends on what parts you read. Some seems fine. Comparison to CrashPlan seems off…
I’m not going to run a word count to see if “generally incorrect” fits, but I’ll do a deeper dive on it below.

The first bullet covers files that still exist, so doesn’t apply. The second bullet covers deleted file losses from their backup aging past the longest time frame as was discussed in the posts just above that one

It doesn’t, however, cover the point that deleting a backup makes things as if the backup didn’t happen. Some people might think that’s obvious, however others would miss it, so some support volunteers may specially point out the risk you cite. It would be nice if the manual could point it out. Care to try to write?

GUI smart and custom backup retention aren’t covered #83 (seeking help on explaining a messy topic)

Part below is your worry, right?

People who actually think this through thoroughly may realize (or may be told in the forum) that a file which exists (between creation and deletion) for a time less than current minimum interval (which can increase with time) may disappear completely. This may matter to some people, but there’s a limited amount that can be said. Would it be better to have an “advanced” section for those wanting it?

CrashPlan has special features for deleted files. Are you familiar with those, or other backup programs?
Aside from improving documentation (help wanted, as with everything), what change might you advise?

Migrating from Crashplan - Retention & other general questions covers the difference and your warning.

looks to me to give a very concise summary, and “will remain one backup” is a clue. Certainly, it could expand a bit (if someone volunteers), but I wouldn’t want it to be a book. That’s what the manual is for.

One also has to understand what you describe, which is that a backup is a point in time of the system.
Bringing up CrashPlan again, it blurs the line between system and files. A volunteer plans work below:

Implementing the feature to restore any version of a single file (features can happen given volunteers)

Great advice but I wonder if publicity of this whole area can improve? Documentation can possibly find volunteers, but not often. Actual code (not just message changes) is even more seriously understaffed.

Duplicati relies on volunteers from the community, and that’s what limits the progress that can be made.

Jochem · June 13, 2022, 8:04pm

Thank you ts678, for your detailed response. I’d like to respond in a few separate replies.

First, please, let us discuss a fairly standard use case including a reasonable expectation from a user.

The use case is this:

a user creates a daily backup
the user understands that a file created and deleted before any backup has run, is lost
the user uses a smart or custom retention rule, let’s say 7D:1D,4W:1W,12M:1M
the user creates a file and accidentally deletes this file a few days later

The reasonable expectation of the user could be that the deleted file will be retrievable up to the end of the retention period, i.e. a little over a year in this example.

Do we agree on all of this? For the moment I assume so.

Now, we know that the expectation will not be fulfilled. The deleted file (in most cases) will be lost forever if the user tries to retrieve it, say, a month later. The user will be very disappointed.

From the manual this is not clear at all. Even an informed user with generally excellent information incorrectly writes:

The issue with deleted files is pointed out in the forum at only a very few places. Your description in Github is indeed the most complete:

Yes, this is my worry. It makes clear that people need to think [about Smart or Custom Retention policy, Jochem] thoroughly, to only then realize that deleted files can and will be lost within (the smart or custom) retention policy period.

Please note that ts678 describes the same issue as I did, and in depth too, but maybe a bit abstractly. The intervals refer to the interval times from the retention policies, while I explained the issue via the “thinning process”. It’s all the same.

So my question is:
Should Duplicati expect users to think so hard to get to the conclusion about deleted files?
Or, maybe, should Duplicati protect users against certain predictable and unfortunate use cases?

Jochem · June 13, 2022, 8:30pm

Yep, and my all-time favorite was the conceptual model of ADSM/TSM/Spectrum Protect.

I guess I’ll make a suggestion for the documentation later on; I have to figure out how to do that first.

I strongly disagree on this. There is no clue whatsoever regarding irretrievable deleted files due to snapshot thinning. I would suggest to start the Smart and the Custom description with something like: “For advanced users only. Not all deleted files will be available for retrieval.” and then continue with the current description. (And of course there must follow a good explanation in the manual.)

Note: the deleted file issue only can occur with Smart or Custom retention policies, not with:

ts678 · June 13, 2022, 9:40pm

As explained, the manual does not cover these options, so of course the missing section is not clear.

The retention can be set in 3 ways

is where it ends, but I posted a link to the issue asking that the manual cover this – and your concern.

What you quote seems correct, is written for context I’ll quote below, but isn’t addressing your context:

Notice how that flows to:

any file deleted from the source will subsequently be deleted from the backup when it reaches the retention policy limit.

(and now that I read it again, this is translating backup versions into file terms – maybe a tip for us…)

What I see as the step-too-far, possibly inspired by the context it’s in though:

It’s talking of fell-off-the-end, not deleted-from-middle, so I agree with your point more than your quote.

I think pointing it out would be better. Manual needs writing, and GUI text could expand. Any thoughts?
Either way, this needs someone who knows the mechanics of how to do it (and volunteers to do so…).

Is that Yep to CrashPlan? Regardless, how do any products that you know deal with version thinning?
Can anybody thin versions in a way that doesn’t lose deleted files into holes created by the deletions?

Care to explain, especially if on-topic and maybe even if it’s not? Duplicati can’t bend its model totally.
I thought about maybe (if developers help) moving last version of a deleted file into a nearby survivor.
That’s kind of ugly though, to have things show up in a version that weren’t originally backed up there.
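The idea of moving the last version of a deleted file into a nearby survivor could be sketched like this. It is a thought experiment only, not an existing or planned Duplicati feature: the function name, the dictionary layout and the choice of "nearest by day number" are all assumptions.

```python
def drop_with_rescue(snapshots, drop):
    """snapshots: {day: {path: content_ref}}; drop: the day whose snapshot is removed.

    Any path that would have no remaining copy after the removal is copied
    into the surviving snapshot closest in time to the dropped one."""
    survivors = {day: dict(files) for day, files in snapshots.items() if day != drop}
    if not survivors:
        return survivors
    still_present = set().union(*survivors.values())  # all paths listed by any survivor
    orphans = {path: ref for path, ref in snapshots[drop].items() if path not in still_present}
    nearest = min(survivors, key=lambda day: abs(day - drop))
    survivors[nearest].update(orphans)
    return survivors

snapshots = {
    0: {"a.txt": "v1"},
    1: {"a.txt": "v1", "tmp.txt": "t1"},  # only snapshot that holds tmp.txt
    5: {"a.txt": "v2"},
}

result = drop_with_rescue(snapshots, drop=1)
print(result[0])  # {'a.txt': 'v1', 'tmp.txt': 't1'}
```

The output also shows the ugliness mentioned above: snapshot 0 now lists `tmp.txt`, although that file did not exist when snapshot 0 was taken.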

I think the motivation for thinning was to get rid of intermediate file versions that are no longer useful.

OK, so you think “one backup” per year is no clue whatsoever that there’s a gap? I guess we disagree.
Regardless, “concise” meant it was brief, and brief leaves the impact for user to figure out (clue or not).
It’s definitely possible to be too brief, and this one could use expansion, though expansion has limits…

You’re listing the three the manual covers now, so there’s definitely a blank canvas to try to explain this.

drwtsn32 · June 14, 2022, 3:43am

I guess I didn’t have this expectation. I was a CrashPlan Home user back in the day, and I did appreciate its special handling of deleted files. It seemed like a pretty unique feature for the backup programs I was familiar with.

When I first started experimenting with Duplicati in 2017, I didn’t see any such special option. With a bit of research I understood that it pruned backups at an entire snapshot level, never deeper down making individual decisions on particular files. I adapted a custom retention option to balance that risk with how many versions I wanted to retain.

Your concern and input are very valuable - they show that the documentation is inadequate! If you could volunteer to adjust the documentation, it would be appreciated. This project could use additional volunteers.