Prevent data loss when repairing outdated local database

Jojo-1000 · June 15, 2023, 9:23pm

Hi,
I saw that some users unintentionally deleted new versions of their backups when trying to repair a database that was restored from somewhere else (e.g. this forum post or this issue). I think this can happen pretty quickly when using different backup sources, such as an older full-disk backup and a more up-to-date duplicati backup.

So, I would like to work on a fix that prevents this kind of unintentional deletion, and I need some input on how this should be done.

Currently, there are two modes for repair:

1. No local database: Run database recreate
2. Local database exists: Run repair of remote files (assumes the local database is perfect)

The second case is the problem, because any versions and blocks newer than the database state are assumed to be corrupted and deleted. I can think of a few ways to remedy this:

1. Move files instead of delete:
   Either move to a sub-folder or to a new prefix, so that no data is lost. The backend architecture does not allow moving as far as I can tell, so the files would need to be downloaded and re-uploaded. There could also be cases with corrupt volumes that should be deleted, but would then still take up space.
2. Check for new dlist files and refuse to repair:
   If there are any dlists newer than the last backup in the database, abort with an error and tell the user to delete the local database and run a recreate.
   I think this would prevent the common case above from happening, while still actually deleting corrupted data from the remotes.
3. (Ideally) run a partial recreate when new versions are detected:
   This would merge the new versions into the local database without needing to run a full recreate, which is pretty slow. However, this would require quite some work and could cause some difficult-to-catch bugs.

In my opinion, checking for new dlist files and returning an error is simple, low-risk, and does not change the behavior of the repair operation (as moving would), although I can also understand the point of view that no deletions should happen at all without further user action.
In the future it would be nice to have partial recreates, but I think that is not absolutely necessary at this stage when a full recreate also works.

Can anyone think of edge-cases where just looking at dlist files would not be sufficient? My thoughts about this:

- Intermediate versions which were deleted due to the retention policy should still be deleted on the remote, so only look for newer dates, not any unknown dlist files.
- I think partial backups missing a dlist file cannot be recovered anyway without the correct database, so they might as well be deleted.
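
A minimal sketch of the dlist check, in Python purely for illustration (it is not Duplicati's code). It assumes dlist files are named like `duplicati-20230615T212300Z.dlist.zip.aes`, with a UTC timestamp between the prefix and `.dlist`; the names `parse_dlist_time` and `find_newer_dlists` are hypothetical. Only dlists newer than the newest backup in the database are reported, so unknown intermediate versions are ignored.

```python
import re
from datetime import datetime, timezone

# Assumed filename layout: <prefix>-<UTC timestamp>.dlist.<compression>[.<encryption>]
DLIST_RE = re.compile(r"^(?P<prefix>.+)-(?P<ts>\d{8}T\d{6}Z)\.dlist\.")


def parse_dlist_time(filename, prefix="duplicati"):
    """Return the UTC time encoded in a dlist filename, or None if it is not a dlist."""
    m = DLIST_RE.match(filename)
    if m is None or m.group("prefix") != prefix:
        return None
    return datetime.strptime(m.group("ts"), "%Y%m%dT%H%M%SZ").replace(tzinfo=timezone.utc)


def find_newer_dlists(remote_files, known_backup_times, prefix="duplicati"):
    """List remote dlist files that are newer than every backup in the local database."""
    if not known_backup_times:
        return []  # empty database: treated here as a recreate case, not a repair
    newest_known = max(known_backup_times)
    newer = []
    for name in remote_files:
        t = parse_dlist_time(name, prefix)
        if t is not None and t > newest_known:
            newer.append(name)
    return sorted(newer)


remote = [
    "duplicati-20230601T120000Z.dlist.zip.aes",  # known to the database
    "duplicati-20230610T120000Z.dlist.zip.aes",  # unknown, but older than the newest known
    "duplicati-20230620T120000Z.dlist.zip.aes",  # newer than anything in the database
    "duplicati-b0123.dblock.zip.aes",            # not a dlist
]
known = [
    datetime(2023, 6, 1, 12, tzinfo=timezone.utc),
    datetime(2023, 6, 15, 12, tzinfo=timezone.utc),
]
print(find_newer_dlists(remote, known))
```

With this listing, only the June 20 dlist is reported, which is the condition under which repair would refuse to run.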

ts678 · June 15, 2023, 10:32pm

That was basically my proposal in your linked issue. It seemed an easy safeguard check.

I forget the dialog. Does the popup currently offer a Repair button? If so, that may need to change.

Ordinarily I think (though a look at the code would be worthwhile) retention deletes happen after the backup. Possibly a backup that finds nothing changed applies retention too. I’m not sure, but putting in the old database would then find missing files rather than extra ones it would want to delete, so I’m confused.

One thing that could conceivably go wrong with any time-based plan is wrong clocks.

Coises · June 15, 2023, 10:48pm

The first problem is the conflation of the two modes in one button. The UI leads users to think that Repair works on the local database. Repairing the local database and deleting remote data are very, very different things!

One usually winds up doing these sorts of things because something has gone wrong. Typically a user is anxious, upset… Though this is a time when reading documentation, searching forums and double-checking everything should prevail, in practice it doesn’t always work out that way. (My experience here.)

In my opinion that’s a primary need. Make the user interface clear, so users know they will be deleting files from remote storage (and can choose to back up the backup first, if appropriate).

Once that’s done: If files can’t be moved on the remote, can they at least be renamed (to something Duplicati will “know” should be ignored), and only deleted as a second, separate step once success of the repair operation has been confirmed? If that is impossible too, what about creating dummy “flag” files with the same name and a different extension to tell Duplicati which files are to be ignored?
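
The flag-file variant could look roughly like this. It is a sketch only: the `.ignore` suffix (appended to the full filename) and the function name `split_ignored` are made up for illustration, and nothing like this exists in Duplicati today.

```python
IGNORE_SUFFIX = ".ignore"  # hypothetical marker extension, not a Duplicati convention


def split_ignored(remote_files):
    """Split a remote listing into (active, ignored) using dummy flag files.

    A file is ignored when a sibling "<name>.ignore" flag exists next to it.
    The flag files themselves are never reported as backup data.
    """
    names = set(remote_files)
    flags = {n for n in names if n.endswith(IGNORE_SUFFIX)}
    ignored = {n[: -len(IGNORE_SUFFIX)] for n in flags} & names
    active = names - flags - ignored
    return sorted(active), sorted(ignored)


listing = ["a.dblock.zip", "b.dblock.zip", "b.dblock.zip.ignore"]
active, ignored = split_ignored(listing)
```

Here `active` contains only `a.dblock.zip`, and `b.dblock.zip` ends up in `ignored`, where a separate, later step could delete it once the repair has been confirmed.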

Jojo-1000 · June 15, 2023, 10:52pm

Are you talking about this one (from the issue)?

Fatal error => Found 10 remote files that are not recorded in local storage, please run repair

I do think a one-click repair button is too dangerous in that situation. The message should probably also state that repair will delete the additional files.

I was just thinking of other situations where one might run repair. Maybe the backend failed to delete these backups. This would trigger the exact same message as above, and in this case it would be desired to delete the extra files.

At least with an outdated database, the timestamps in the database would be exactly equal to the destination filenames.
Maybe there could be some problem if multiple backups point to the same destination, but I don’t know how you would distinguish that scenario from a broken remote.
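
As a rough sketch of that observation: if every dlist the database knows about is still present on the destination, the database is probably just outdated; if nothing matches, the destination may belong to a different backup. The function name `classify_database` and its labels are hypothetical, and the result is only a hint, since retention may also have removed files the old database still lists.

```python
def classify_database(known_dlists, remote_dlists):
    """Compare dlist names recorded in the database with those on the remote."""
    known, remote = set(known_dlists), set(remote_dlists)
    if not known & remote:
        return "unrelated"      # no name in common: possibly another backup's destination
    if known <= remote:
        # Everything the database knows is still there; extras mean it is behind.
        return "in-sync" if known == remote else "outdated"
    return "partial-match"      # some known files are gone (retention, or a broken remote)
```

For example, a database that knows `a` and `b` against a remote holding `a`, `b` and `c` is classified as `"outdated"`.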

Jojo-1000 · June 15, 2023, 11:00pm

I agree it is dangerous that the repair command does two completely different things (and one option is already covered by recreate), but I doubt it could be changed easily if many people rely on the behavior in scripts. The UI can be reworked of course.

Renaming is moving the file; it is no different from moving it to a different directory. This would need to be implemented for every single backend, and maybe some just fundamentally can’t support it.
With the current design of the CLI, a two step process would be difficult to implement.

That could work, or the ignored files could be saved in a new table in the database (although that would make it difficult for a user to see which files need to be cleaned up manually).

Coises · June 15, 2023, 11:26pm

Honestly, who would read that and not think, “It’s going to update my local database so that it matches the remote backup”? I’m trying to think of a plausible real-world scenario in which Repair is what a user would want to do after receiving this message.

ts678 · June 15, 2023, 11:44pm

Then it would be wrong if Repair is changed so it won’t do that. It will now, which is bad.
What’s worse is it’s hard to find out what the files are. You can look at live log if need be.

There’s a certain amount of reconciling at the start of the backup while checking file listing.
A file meant to be deleted had been set to Deleting state in the database. If backend didn’t
actually delete before, it will get another chance then.
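
Roughly, that reconciliation can be pictured as below. This is a simplified sketch, not Duplicati's actual code: the database is modelled as a plain dict from filename to a state string, and `files_to_retry_delete` is a hypothetical helper.

```python
def files_to_retry_delete(db_states, remote_files):
    """Return remote files the database has already marked for deletion.

    db_states maps filename -> state string (only "Deleting" matters here).
    Files still on the remote in that state were not actually removed by the
    backend earlier, so they can be deleted again.
    """
    present = set(remote_files)
    return sorted(name for name, state in db_states.items()
                  if state == "Deleting" and name in present)


db = {"a": "Verified", "b": "Deleting", "c": "Deleting"}
retry = files_to_retry_delete(db, ["a", "b", "x"])
```

Only `b` is retried: `c` is already gone from the remote, and `x` is unknown to the database, which is the separate "extra file" case that repair handles.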

Changed Computer Clock For A Few Minutes and Broke Duplicati
is one report of how Duplicati deals with bad clocks. I don’t think it happens much though…

There’s probably some discussion on that around. It happens. Possibly a similar sanity test
could avoid problem from that, e.g. if someone decides to Repair. Chances are the dlist file
times are wildly different from the Remotevolume records, and that would be a good clue…

Putting Repair button under Database → Maintenance could certainly give that impression.
Arguably some cases use database information to “fix” remote (see below), but who knew?

It’s not just a button problem. The REPAIR command describes how it does different things.
In addition to deleting “extra” files, a repair can sometimes replace lost dlist and dindex files.
And possibly it also fixes internal database problems, but details on what it does are scarce.
Even ignoring compatibility pains, I don’t think there’s much chance of getting repair redone.

Message changes are likely easier, but bear in mind that non-English versions take a while…
I also think the original proposal was to rely less on the user, and more on looking the situation over.
The challenge may be balancing preventing accidents against impeding legitimate activities.

gpatel-fr · June 16, 2023, 8:51am

Hello

I strongly dislike renaming or copying files on the remote backend:

- it’s not a reliable way to handle a recovery - what if there is not enough space?
- what happens if a user does a parallel install on another computer “just” to test restore?

I don’t think that the partial recreate is realistic either; it’s way too complicated IMO.
Duplicati has enough intricate stuff as it is.

Agree that having 2 modes for repair is not intuitive.

Ideally I’d have time to give more thought to the dlist file time option.
When I was thinking about this (casually, I’d admit), I was thinking more of a host ID saved on the backend, but dlist file time does seem an easier option.

Jojo-1000 · June 16, 2023, 1:47pm

Okay, so based on your feedback I am going to implement a simple check for new dlist files on the remote before repair is run. If there are any new files, it aborts with an error before deleting any unknown files.
This does not solve all of the issues with repair, but at least it prevents the easiest accidental data loss.
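
The intended ordering, check first and delete only afterwards, might be sketched like this. It is an illustration under assumptions, not the planned implementation: `repair_remote`, `NewerBackupsFoundError`, the dict-based inputs and the error text are all hypothetical.

```python
class NewerBackupsFoundError(Exception):
    """Raised when the remote holds backups newer than the local database."""


def repair_remote(remote, known, delete):
    """Delete unknown remote files, unless any unknown dlist is newer than the database.

    remote: dict of filename -> backup time for dlist files, None for other volumes
    known:  dict with the same shape, for files recorded in the local database
    delete: callback invoked once per file that is actually removed
    """
    extra = {name: t for name, t in remote.items() if name not in known}
    known_times = [t for t in known.values() if t is not None]
    newest_known = max(known_times, default=None)

    # Check first, so that an abort leaves the destination untouched.
    newer = sorted(n for n, t in extra.items()
                   if t is not None and newest_known is not None and t > newest_known)
    if newer:
        raise NewerBackupsFoundError(
            f"{len(newer)} remote dlist file(s) are newer than the local database; "
            "delete the local database and run a recreate instead of repair")

    for name in sorted(extra):
        delete(name)
```

Unknown files that are not newer dlists (for example leftover volumes, or intermediate versions removed by retention) are still deleted as before; only the presence of a newer dlist stops the run.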