Error on purge with 2.1.0.3_beta_2025-01-22

If it would help, I’d be happy to meet virtually on Google Meet, Zoom, or Teams, where I could share my screen and we could look at what I’m seeing here in real time.

That indicates that the database is somehow “empty” in that it has no filesets. This should not be possible. From reading the rest of the post, it seems that this is indeed how it looks both locally and remotely.

I think this part is the problem:

What it does is record the current fileset as deleted, and all other volumes as deleted, and then commit the database. If the process is terminated or fails after line 210, but before the upload completes on line 211, it creates an unwanted scenario.

In this case, the new fileset is marked as temporary (because the upload has not started) and the previous one is marked as deleted. The next run will delete both of these files, causing the loss of that version.

If that version is also the only version remaining, the backup would end up with no versions.
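
To illustrate the window, here is a minimal stand-alone sketch of the ordering hazard. All names are made up for illustration; this is not the actual Duplicati source, and the "line 210/211" markers are only approximate references to the snippet discussed above.

```csharp
// Hypothetical sketch of the commit-before-upload ordering hazard.
// LocalDbSketch and FilesetState are illustrative stand-ins, not Duplicati classes.
using System;
using System.Collections.Generic;

enum FilesetState { Temporary, Uploaded, Deleted }

class LocalDbSketch
{
    public Dictionary<string, FilesetState> Filesets { get; } = new Dictionary<string, FilesetState>();
    public void Commit() => Console.WriteLine("database committed");
}

static class PurgeOrderingSketch
{
    static void Main()
    {
        var db = new LocalDbSketch();
        db.Filesets["previous-fileset"] = FilesetState.Uploaded;
        db.Filesets["replacement-fileset"] = FilesetState.Temporary; // upload not started yet

        // Step 1 (roughly "line 210"): mark the previous fileset as deleted and commit.
        db.Filesets["previous-fileset"] = FilesetState.Deleted;
        db.Commit();

        // Step 2 (roughly "line 211"): upload the replacement fileset and mark it Uploaded.
        // If the process dies or the upload fails between step 1 and step 2, the database
        // holds one Deleted and one Temporary fileset; the next run's cleanup removes both,
        // and that version is lost.
        db.Filesets["replacement-fileset"] = FilesetState.Uploaded;
        Console.WriteLine("replacement fileset uploaded");
    }
}
```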

@yu210148 I assume you had multiple versions prior to this issue?

Thanks for that offer, I will PM you and we can set it up.

I agree. This could be the timeout issue rearing its head again. You can set --read-write-timeout=5m to increase it to 5 min. The timeout only happens on “large” buckets, where the listing of the bucket can sometimes exceed the default 30s.
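
For reference, the option can be added either in the GUI under the job’s advanced options, or on the command line. A rough sketch, assuming the duplicati-cli entry point from the 2.1 packages, with a placeholder storage URL and source path (credentials and other options omitted):

```
duplicati-cli backup "s3://example-bucket/folder" /path/to/source --read-write-timeout=5m
```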

Hi kenkendk, I’ll sort out a time and reply to your message shortly, but to respond to this post:

Yes, the first time I ran into this I did have multiple versions. What I ended up doing, though, was deleting the old backup job when this first came up; then I re-created it from scratch in the GUI and did an initial backup with the new job. I’m pretty sure I did the purge after the initial backup completed but before the job ran a second time. The backup job has run several times since then.

I agree, 24 RetryAttempts seems odd to me as well. Where should the --read-write-timeout=5m be set? It is true that this is a fairly large bucket. I have 2 other backup jobs that also go to this bucket (albeit in different folders), so it’s just shy of 2 TB in size. With the default 50 MB block size, this does mean there are a lot of files in there, and listing could very well take a bit of time. When I do it manually it doesn’t take 30 s, but there could be something else going on that causes that threshold to be exceeded.

Thanks for taking a look at this @kenkendk

kev.

As Kenneth and I discussed over Zoom, I’m going to remove this unrecoverable backup job and start a fresh one. We were able to generate a bug report, and he’s got all the information he can from this.

I’m going to make two changes to this new one, though: 1) change the block size from the default 50 MB to 100 MB, and 2) add the --read-write-timeout=5m flag as he suggested above. It may be a while before I attempt another purge, but I figured I would document what I’m doing.
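
For the record, here is roughly how those two changes might look if the job were run from the command line. This assumes the “block size” being raised is the remote volume size, which the CLI exposes as --dblock-size; the storage URL and source path are placeholders and credentials are omitted:

```
duplicati-cli backup "s3://example-bucket/folder" /path/to/source --dblock-size=100MB --read-write-timeout=5m
```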

Thanks again.

kev.

Thanks again for taking the time to debug this with me. I have spent some time with the bug report database and have found an explanation: the timeouts.

It looks like the purge call was setting it up for failure, but the real failure was the backup operation. It attempted to fix some errors, and due to a bug, it ended up removing all filelists/filesets.

I have now tracked down the problem: it is caused by an unfortunate interplay between the timeout and some old abort-handling code. More than six years ago, a method was added to stop a running backup, with the desire to make the stop graceful. This was signaled by code throwing an OperationCanceledException, with the uploader exiting gracefully once that happened.

The timeout handler will in some cases throw an OperationCanceledException instead of the correct TimeoutException. Additionally, a thrown OperationCanceledException will sometimes be converted to a TaskCanceledException, marking the task as cancelled.

All of these factors combine into a case where the upload can fail with an OperationCanceledException, and this failure passes all checks, making Duplicati think it has completed correctly.
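
To make the failure mode concrete, here is a small stand-alone C# sketch (not Duplicati’s actual code; UploadVolumeAsync is a hypothetical placeholder for a slow remote operation) showing how a cancellation-based timeout surfaces as OperationCanceledException/TaskCanceledException, which a “graceful stop” handler then silently swallows:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class TimeoutVsCancelSketch
{
    static async Task Main()
    {
        using var cts = new CancellationTokenSource();
        cts.CancelAfter(TimeSpan.FromSeconds(1)); // stands in for the 30 s read-write timeout

        try
        {
            await UploadVolumeAsync(cts.Token);
            Console.WriteLine("upload completed");
        }
        catch (OperationCanceledException)
        {
            // Intended to catch only a user-requested graceful stop, but the timeout
            // raises TaskCanceledException (a subclass of OperationCanceledException),
            // so a timed-out upload is treated as a clean exit instead of a failure.
            Console.WriteLine("upload 'stopped gracefully' -- the failure is swallowed");
        }
    }

    // Hypothetical stand-in for a remote operation that takes longer than the timeout allows.
    static async Task UploadVolumeAsync(CancellationToken token)
    {
        await Task.Delay(TimeSpan.FromSeconds(10), token); // throws when the token fires
    }
}
```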

This should be caught by the post-backup verification, but since the files were marked as “Uploading”, the logic considered them incomplete and conveniently “fixed” that by removing them.

In total, this makes it possible to have a backup that appears to succeed but actually has failed/missing uploads. Since the cleanup is quite robust, no validation checks are triggered after it, giving the appearance that all is working well when it is in fact not.

In your case, Duplicati found some failing uploads and corrected these, uploading a total of 4 new filelists. Due to the bug, these were silently failing, without Duplicati knowing it. But seeing that new filelists were “successfully” uploaded, it proceeded to delete the two older filesets, leaving you without any filelists/filesets.

We will be releasing a new stable release shortly that fixes this issue, as it is quite severe. In the meantime, you can use canary 2.1.0.108 or later, or simply set the timeout to a high value (or to 0 to disable it).

Note that setting a high timeout value only fixes the timeout-related case; other code may also throw an OperationCanceledException, which would result in similarly lost files, though this is expected to be less frequent.

Impressive hunt for the bug (which apparently hit me once too), enabled by working together.

Thanks all!

Thanks again Kenneth,

I’ll keep an eye out for a new release in the beta channel that I’m on too.

kev.

As a final follow-up to this: I just attempted the same thing with version 2.1.0.5_stable_2025-03-04 and it worked as expected. Thanks again for all your help here!

kev.

Incidentally, I may have an explanation for the excessive timeouts: the source data is on an external WD Green drive that has some pretty aggressive power-saving settings built into its firmware. I’m not usually present when Duplicati is doing its backup job, but when I was for the last backup, I noticed the drive spinning up and down repeatedly, so there may have been some delay getting the data from it. That may be the cause of the timeouts.