StopNow(), Backup and cancellation tokens

I’ve spent more time thinking about this and playing with it. The more I understand how the database is linked into the backup process, the less confident I am that it has the correct results unless the entire backup completes successfully.

I know that all queries are run on the main thread. I know there are transactions wrapping many of the calls. I am unsure of the size and scope of those transactions, and of how a possible rollback of one of them via a stopNow() action will impact the output. I have not been able to construct a transaction model in my head that creates consistency between the resulting backup and the local database when the two have different definitions of complete/committed. The multi-threaded nature of the backup does not lend itself to doing that in a helpful way.

e.g.:
Let's take 10 files, 100 MB each, all with the same content.
The backup will split those across threads, produce the same blocks from different files, and write that into the local database. That all happens before the data is in a volume and that volume is stored on the remote. I can see race conditions in that scenario, even before a stop.
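The race in that scenario can be sketched outside of Duplicati. This is an illustrative Python model, not the actual C# code: two threads both check a shared block table, both conclude the block is new, and both record it.

```python
import threading

# Two worker threads process the same content block. Each does a
# check-then-insert against a shared "Block" table with no coordination
# between the check and the insert, so both decide the block is new.
blocks = {}                       # simulated Block table: hash -> writer
decisions = []                    # threads that decided to store the block
barrier = threading.Barrier(2)    # widens the race window deterministically

def process_block(thread_name, block_hash):
    is_new = block_hash not in blocks   # check phase: table looks empty
    barrier.wait()                      # both threads pass the check first
    if is_new:
        blocks[block_hash] = thread_name    # insert phase: duplicate work
        decisions.append(thread_name)

t1 = threading.Thread(target=process_block, args=("t1", "abc123"))
t2 = threading.Thread(target=process_block, args=("t2", "abc123"))
t1.start(); t2.start(); t1.join(); t2.join()
print(len(decisions))   # 2: both threads produced and recorded the block
```

The database row for "abc123" then points at whichever writer finished last, independent of which volume actually reaches the remote first.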

When we stop in the middle, the database may commit a record saying we have backed up that block, but I see no guarantee that we actually have. Alternatively, we may back up the block, but because of the transactions in the local database and the exit point, the transaction is rolled back, and we now have backed-up blocks and volumes that the local database thinks are not backed up.

My opinion continues to grow that the local database, the local data management model, and the threading process need to be revisited to determine how they can work together to produce a reasonably transactional model.
My preference would be for the messages sent between threads to carry the backup block information for any backup in progress, and to update and check the database at the last possible point before deciding whether we are actually collecting the block data to put into a volume. There are edge cases in that as well for which I have not devised a simple plan.

There are possible performance implications of these choices. But for a backup tool, I consistently hear that correctness is the most important thing. It's my view that changes should be made in that direction to increase stability and correctness. I believe in the short term that may have some performance impact. However, I also believe the structure of the local database and its use are a significant performance problem, so we may even see improvement.

I don’t think any of these issues can be resolved in the current beta as all require changes that are too risky.

This supports my opinion that larger changes need to be possible in order to allow longer-term benefits. That is not viewed as the best direction for the project with the resources available. Which in turn pushes me back to wondering how I can help, if all the things I have to offer are not suitable at this time. Maybe I should try again in a year.

I definitely do not need to have to take in consideration design changes at this time. I am already distracted enough as it is.

Thanks for taking the lead on gaining understanding. Do you think you’ll be able to share someday?
Understanding the existing mechanisms (which do extend beyond just the database) will be helpful:

My proposal is that a visit should precede a revisit. No currently active person knows the early plans. Should you prefer to call that a revisit (by a new team), I think it’s still good not to “clean-sheet” code. Proposing an ideal design should be safe enough, but ripping everything out will set things way back. Incremental change starting from the current code should (if it can get us there) be a whole lot better.

Some items that are not strictly database transactions, but are possibly helped by committing at the right time:

The Remotevolume table State field gives clues on a backup restart about how far uploading had gotten. Mechanisms must exist for blocks that didn't make it, or, for that matter, made it but whose dblock then vanished.

A vanished dblock gets fixed by code in purge-broken-files by purging files that existed before the vanish.
Potentially similar mechanism could be applied to the temporary block files that are written in parallel.
Unlike the vanished dblock case, this loss is of a proto-dblock that probably won’t impact any old files.

There is a synthetic filelist mechanism that may run before a backup to upload a dlist if the prior run couldn't do it.
This can happen due to an interrupt, and is a means of recording files to the extent that the upload finished. Instead of running it at the start of the backup after an interrupt, maybe do it at the end, after a fast stop effort.

Reliable cleanup from a crash or interrupt is needed anyway. Should a stop now go down super-hard?
How far off is it currently from stopping all backup progress and doing what the synthetic filelist will do? That’s probably the minimum possible time for a clean stop, e.g. destination is consistent, with its dlist.

There are some similarities between SpillCollector which runs when proto-dblock parallel work is done, and Compact. They both take partially filled block files and emit filled ones, and maybe some leftovers.

To make the whole analysis less of an overwhelmingly large chunk, one can actually still request a synchronous upload, allowing focus on file preparation. That can then be de-parallelized using options. Looking at logs from a simplified version may be one way to start. Do we log commits? If not, add those.

Can you continue developing an understanding of the current plan and produce a short writeup of that? There have been several attempts made at describing the processing. Add in the database processing.
You might have thoughts about right-sizing of queries, EXPLAIN QUERY PLAN (don’t scan), and so on…

There’s also the new PRAGMA adder to play with. That was well-received by one database-savvy user.
I’m not sure who’s going to actually go explore though, but then the explorer can share what they found.
This same user has a great site for hitting transaction problems too. I wish they’d work with us on those.

A reply to you going into database and destination internals testing would probably do a lot to support their situation.
Want some testing scripts to go break Duplicati, and you can study and fix, and go break it some more?

Or look at some existing backup corruption that has good steps to reproduce. There are several around.

There's an abundance of places to help, IMO. If you care to, describe what you're good at, and what not. The above suggestions were sort of DB-flavored but need some C# reading ability too. What about Python?

You are quite right, there should be documents describing this. While I did try to document the process, it was more attractive to write the code than document the “obvious” details, which are now not so obvious.

As stated, there is no real documentation for how this is supposed to work, but I can give my idea for the process. A caveat is that this is straight from memory, so I might have details wrong, but the overall idea should be the same.

Anything that is not transmitted to a remote resource (i.e. an upload) is not visible so these changes should never be permanent. This leads to the idea that an upload triggers a transaction commit. I seem to recall that the volume producer (collecting the blocks) will issue the commit when a volume is completed. That volume will be in the “temporary” state as we do not know if it has been uploaded. Once the upload has completed, it will be marked as “uploaded” and the database will commit again. The next backup will mark the volume as “verified” so we know the file is truly uploaded.
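The commit points described above can be modeled in a few lines. This is a hedged Python sketch with illustrative table, column, and state names (the actual schema may differ): commit when a volume is filled ("Temporary"), again after upload ("Uploaded"), and the next backup promotes it to "Verified".

```python
import sqlite3

# Minimal model of the state progression described above. Table and
# column names are illustrative only, not Duplicati's actual schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE remotevolume (name TEXT PRIMARY KEY, state TEXT)")

def volume_completed(name):
    db.execute("INSERT INTO remotevolume VALUES (?, 'Temporary')", (name,))
    db.commit()   # commit point: blocks recorded, upload outcome unknown

def upload_completed(name):
    db.execute("UPDATE remotevolume SET state='Uploaded' WHERE name=?", (name,))
    db.commit()   # commit point: the file is on the remote

def next_backup_verifies(name):
    db.execute("UPDATE remotevolume SET state='Verified' WHERE name=?", (name,))
    db.commit()   # the remote listing confirmed the file is truly there

volume_completed("duplicati-b1.dblock.zip")
upload_completed("duplicati-b1.dblock.zip")
next_backup_verifies("duplicati-b1.dblock.zip")
print(db.execute("SELECT state FROM remotevolume").fetchone()[0])   # Verified
```

An interrupt between the first and second commit leaves the row in "Temporary", which matches the purge-on-next-run behavior described below.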

At least that is the “golden path”.

If the backup is interrupted, there will be volumes marked as “temporary”. These will be purged on the next run. Ideally they should not even be in the database, but that would require another storage mechanism, which I deemed too complex at the time. Today, I think a B+Tree would be a good optimization to avoid repeated database queries (yes, SQLite uses B+ trees as well).

This “temporary” state allows recording of data into the database, in a consistent transactionally safe manner, while also supporting CTRL+C style kills.

One problem is that Duplicati relies on the filelist to figure out “what goes where” and for an interrupted backup, this file is missing, as it is the last file to be uploaded. If this scenario is detected, a “synthetic” filelist is created, based on the last uploaded state and whatever changes managed to get uploaded (i.e. what blocks can be used). This part is quite complex and has been the source of some bugs.
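A heavily simplified sketch of the synthetic filelist idea, in Python with invented names, just to illustrate the "last uploaded state plus fully uploaded changes" rule (the real logic is far more involved):

```python
# Rebuild a file list from the last uploaded dlist, overlaid with any
# new entry whose blocks all made it into volumes that finished
# uploading. Illustrative only; not Duplicati's actual algorithm.
def synthetic_filelist(last_filelist, new_entries, uploaded_blocks):
    result = dict(last_filelist)
    for path, blocks in new_entries.items():
        if all(b in uploaded_blocks for b in blocks):
            result[path] = blocks   # fully uploaded: safe to record
        # otherwise keep the previous version; a brand-new file whose
        # blocks are incomplete is simply left out
    return result

last = {"a.txt": ["h1"], "b.txt": ["h2"]}       # last uploaded dlist
new = {"b.txt": ["h3"], "c.txt": ["h4", "h5"]}  # changes this run
uploaded = {"h1", "h2", "h3", "h4"}             # h5's volume never made it
result = synthetic_filelist(last, new, uploaded)
print(result)   # {'a.txt': ['h1'], 'b.txt': ['h3']}
```

The complexity (and bug surface) in the real code comes from deciding which blocks are actually usable after an interrupt, which this sketch takes as a given input.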

Yes, I left out all the metadata handling to keep the concept simple. If I was to implement it again, I would use a naming scheme to be able to store multiple metadata streams, and the streams would be treated 100% like the main data stream.

However, when I added metadata, I was naive and thought that metadata was less than 1kb, so it could be stored differently. The current handler shows some of this.

Due to this simplistic thinking, the current handling of metadata only supports up to one block, and all the code that handles it is using special handling. If it was rewritten to just have a special filename for metadata, almost all of the special handling could be deleted.

My thinking here was to reduce the memory usage, and let the database be the “source of truth”, such that multiple volumes are started, allowing multithreaded compression, but the single database is updated and queried to see if a block already exists. With the “temporary” fix explained above, this takes advantage of the transactions while supporting multi-threading. But there might be cleaner ways to handle this.

If it is helpful that I explain the underlying thinking, just tag me in a question, or PM me.


I tried looking at this a little. I wish there was more in the way of logging available, however what’s there looks like it’s mostly on what’s run. I think a lot of what’s seen in the code is not actually used in backup.

TemporaryTransactionWrapper comes in with the transaction the backup began, so a commit is ignored. When an outer transaction is not passed in, the local transaction takes care of keeping transaction-safe. This reminds me a bit of what I guess a stored procedure might do on a DB having nested transactions.
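The ignore-inner-commit behavior can be sketched as follows. This is a minimal Python model with illustrative names, not Duplicati's actual implementation: when an outer transaction is supplied, commit() is a no-op and the outer owner decides; otherwise the wrapper owns the commit.

```python
import sqlite3

db = sqlite3.connect(":memory:", isolation_level=None)  # manual txn control
db.execute("CREATE TABLE t (v INTEGER)")

class TemporaryTransactionWrapper:
    def __init__(self, conn, outer=None):
        self.conn = conn
        self.owns = outer is None   # only the outermost wrapper owns it
        if self.owns:
            conn.execute("BEGIN")

    def commit(self):
        if self.owns:
            self.conn.execute("COMMIT")
        # else: no-op, so the enclosing operation stays atomic as a whole

# Inner commit is ignored, so a backup-level rollback undoes the insert.
outer = TemporaryTransactionWrapper(db)
inner = TemporaryTransactionWrapper(db, outer=outer)
db.execute("INSERT INTO t VALUES (1)")
inner.commit()                  # ignored: outer transaction still open
db.execute("ROLLBACK")          # the backup-level transaction rolls back
print(db.execute("SELECT COUNT(*) FROM t").fetchone()[0])   # 0
```

This matches the nested-transaction feel described above: inner "commits" only take effect when the outermost owner commits.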

One simple way to ballpark it is to watch the rollback journal size in File Explorer, sorted by Date modified. Watching file uploads finds it shrinking occasionally around when a file upload occurs, as this also needs to set the Remotevolume row State field (and commit it) so there are some historical records for restart fixes, possibly including the synthetic filelist. I would liken this to an application checkpoint to enhance restarts, which allows something better than a start from scratch. This is nice when work might go on for a long time.

For finer grained view, I used three methods which seemed to line up somewhat well against each other.

  1. Sysinternals Process Monitor: include FlushBuffersFile and CloseFile, exclude anything that is not the journal file.

  2. Rapid polling of the journal size, reporting when the size decreases or the journal has been deleted.

  3. The profiling log in the glogg viewer, looking for commits, typically CommitTransactionAsync in many types.
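Method 2 can be scripted. A rough Python sketch; the journal path is an assumption and should point at the real <dbname>.sqlite-journal next to Duplicati's local database:

```python
import os
import time

# Poll the SQLite rollback journal size and record when it shrinks or
# the file disappears, which roughly marks commit points.
def watch_journal(path, interval=0.05, duration=2.0):
    events, last = [], None
    deadline = time.monotonic() + duration
    while time.monotonic() < deadline:
        size = os.path.getsize(path) if os.path.exists(path) else None
        if last is not None and (size is None or size < last):
            events.append((time.monotonic(),
                           "deleted" if size is None else f"shrank to {size}"))
        last = size
        time.sleep(interval)
    return events
```

Run it alongside a backup and line the event timestamps up against the profiling log's commits.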

One concern I had before was whether asynchronous work such as uploads might decide to commit at a time that was wrong for some other thread's liking. It now looks like work generally gets done in the *Async version of the database method, and having them run on the main thread prevents incorrect interleaving. Performance probably suffers though. I'm thinking of one user with a 20 core CPU going underutilized…

I don’t know how this relates to the StopNow issue because I haven’t looked at the thread design much. Understanding the issue of transactions seemed worthwhile, because of the proven backup corruptions. Shedding light on some areas should give more confidence that fixes can be done safely, as is the goal.

EDIT:

By proven backup corruptions, I mean ones such as are currently filed with reproducible test cases, even when it takes an interrupt tester like I described to induce them. Aside from those, it holds up rather well, meaning it's far from clear to me that it can't possibly work reliably (even if slowly) and so needs a new design.

Well, as it has been less than 30 days since the last post, I guess there is still time to add more confusion about cancellation tokens in Duplicati, if anyone is interested. Maybe this has been documented somewhere, but I have not yet found it, so I'll post my findings in case this interests anyone.

There are 3 cancellation tokens in Duplicati backups. The sad thing is that these 3 tokens are often passed to functions under names such as 'token' or 'cancellationToken'. That does not make things easier when trying to understand what is happening :slight_smile:

The main one is used in the Web user interface when clicking on the interrupt button.
It is created in Duplicati/Library/Main/Operation/BackupHandler.cs (RunAsync).

A secondary one (not very important) is the counterToken, used to cancel file enumeration when it's no longer needed. This one actually has a good name.

The uploader token is the most interesting, IMO. It is created in Duplicati/Library/Main/Operation/Backup/BackendUploader.cs, passed from there to the sub-tasks uploading the different kinds of files generated by Duplicati (blocks, index…), from there to the Duplicati backends, and from there to the (external) driver code. It is cancelled when an exception is raised that is not caught at an upper level.

Typically it happens when something bad occurs in a backend (a bug, or most likely a 'network' problem, in a very wide sense). In this case the uploader token cancels all the other uploads. This is a self-destruct token, but limited to the upload. Quite often when it triggers, it happens on a block upload (because these are big uploads, 50 MB). At this time the file list will already have been uploaded to the backend. At that point we have a backup from the point of view of a recovery (when we don't have a file list, all there is is a bunch of useless data files), so it's something a recovery would consider worth using. However, since all pending block uploads have been destroyed, it's not an easy recovery.

Note that at no point in the upload process is the main token taken into account; it's not even known there. That's why users can click desperately on the cancel button to no effect.

First design strangeness: why oh why not pass the main token to the upload and check it from time to time? The problem is that when aborting the upload, there is the possibility of creating a very rotten backup, as seen above: one that officially exists but is very much not easy to use.

Second design strangeness: why is the file list upload not delayed until the very end of all uploads? It would turn half-baked backups into just a bunch of useless data files that could be gotten rid of without problem, and most importantly would not generate huge delays when restoring from scratch after a disk disaster. These data and index files would just be ignored when restoring from scratch or recreating the database, even if not already cleaned up by regular repairs. If it were done (certainly not easy), it would also enable an easy workaround for the first design strangeness.

However at the point where I am currently in my muddled thinking, I rate this as ‘priority’ for the next canary.

I’m not following this thought. Isn’t below when it’s uploaded – at the end?

@ts678

This is the result of actual testing. I slowed the network link (done with 'tc' under Linux). I consistently got backups where the only file that was part of the backup was the fileset. So when trying to restore, I got… an empty backup!

The reason is that the file is created after the last dblocks are created, yes; however, while it is being created, the last dblocks are still uploading. Remember that dblock creation produces a temporary file on disk, which is then queued for uploading.
In the part you quoted, line 530, you have the upload of the file list. At this point the last dblocks are not yet completely uploaded. The filelist is small; it may start last, but it goes faster, so it finishes uploading before the last dblocks. If there is a problem at this point, or if for some reason the big dblocks have a problem, you get a dlist saved on the backend with missing dblocks.
In my tests the backup was small, so with the buffering (multithreaded), usually no dblock at all had been uploaded when the code you quoted was running, hence an empty backup, because the uploading of dblocks was crashing. One crash is enough to abort the rest.

That sounds like issue:

Interrupting backup can leave hidden damage so Recreate will download all dblocks then fail from missing blocks #4341

Parallel uploads allow the dlist to upload faster than the larger dblocks. An interruption leaves a backup with missing dblocks.

which I thought got fixed:

Upload filesets last during a backup #4354

If parallel uploads are enabled it is possible for the fileset to be uploaded before all dblocks are uploaded.
When the fileset is placed in the queue to be uploaded place it in a separate list and only upload it after all dblocks have been uploaded.
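The hold-back described in #4354 can be sketched as a small queue model (illustrative names, not the actual BackendUploader code): dlist uploads are parked in a separate list and only submitted once every queued dblock has finished.

```python
# Park fileset (dlist) uploads until all dblock uploads have completed.
class UploadQueue:
    def __init__(self):
        self.pending_dblocks = 0
        self.held_filesets = []   # the separate list from the PR
        self.uploaded = []        # order in which files go to the remote

    def put(self, name):
        if ".dlist." in name:
            self.held_filesets.append(name)   # park it, don't upload yet
            return
        if ".dblock." in name:
            self.pending_dblocks += 1
        self.uploaded.append(name)

    def dblock_done(self):
        self.pending_dblocks -= 1
        if self.pending_dblocks == 0:
            self.uploaded.extend(self.held_filesets)  # filesets go last
            self.held_filesets.clear()

q = UploadQueue()
q.put("duplicati-b1.dblock.zip")
q.put("duplicati-1.dlist.zip")    # arrives while dblocks still pending
q.put("duplicati-b2.dblock.zip")
q.dblock_done()
q.dblock_done()
print(q.uploaded[-1])   # duplicati-1.dlist.zip
```

With this ordering, an interruption can only lose dblocks that no uploaded dlist refers to yet.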

I must say I’m no BackendUploader expert. Description sounds like what we want. Unsure of code.

EDIT:

Possibly context matters too. Does issue happen on a normal backup, or does it need a Stop now?

The backup stopped all by itself; I saw to it that it crashed by slowing the network to 1 Mbit/s.
There is in fact no way to use Stop now while the final buffer flush is happening; at that point all the files to back up have been gathered and all dblock and dindex files have been generated as temporary files, so there is no checking of the main cancellation token. It's too late at that point.

I tested with an upload throttle of 1 MByte/second and a file making about 40 MB of upload. I'm not sure why the dindex didn't upload in parallel with the dblock (maybe that's the design?), and admittedly there's only one dblock/dindex pair before the dlist, but I'm not seeing an early dlist upload attempt, and neither does the FTP log.

2023-06-13 12:28:18 -04 - [Information-Duplicati.Library.Main.Controller-StartingOperation]: The operation Backup has started
2023-06-13 12:28:21 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: List - Started:  ()
2023-06-13 12:28:21 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: List - Completed:  ()
2023-06-13 12:28:24 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-b49bdd7f884704881b4c729f5b0eb4852.dblock.zip (39.27 MB)
2023-06-13 12:29:05 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-b49bdd7f884704881b4c729f5b0eb4852.dblock.zip (39.27 MB)
2023-06-13 12:29:05 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-i047cc9197f7f408f9443eb35ea285d41.dindex.zip (26.90 KB)
2023-06-13 12:29:05 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-i047cc9197f7f408f9443eb35ea285d41.dindex.zip (26.90 KB)
2023-06-13 12:29:05 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-20230613T162821Z.dlist.zip (722 bytes)
2023-06-13 12:29:05 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-20230613T162821Z.dlist.zip (722 bytes)
2023-06-13 12:29:05 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: List - Started:  ()
2023-06-13 12:29:05 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: List - Completed:  (3 bytes)

This is an ordinary initial backup with a throttle to try to slow down the dblock. Could it be you crashed it?

Was that forcing a timeout or something? Do you see the problem when the backup completes normally?

FTP is very different. See this post:

I think that you can run safely backups with Ftp or Sftp with Duplicati on slow links. Not so with Http based protocols.

Edit: with Http based backups, you can also use smaller dblocks, or the protocol could be handled at the Duplicati level and a bigger timeout could be set; most of the time with the Http backends, the Http client is created in the backend driver, and Duplicati can't set a bigger timeout anyway.

I don't know if it's different except in timeout behavior. BTW, regarding 100 seconds: this hits OneDrive and similar MS things all the time, requiring an adjustment to http-operation-timeout. Anyway, on FTP:

2023-06-13 12:46:18 -04 - [Information-Duplicati.Library.Main.Controller-StartingOperation]: The operation Backup has started
2023-06-13 12:46:21 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: List - Started:  ()
2023-06-13 12:46:21 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: List - Completed:  ()
2023-06-13 12:46:28 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-bb6cfa0c2b266495591fa2ed9077a087f.dblock.zip (49.90 MB)
2023-06-13 12:46:28 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-b6ab6c249be3a4fbcb23e334acd5e7e26.dblock.zip (49.97 MB)
2023-06-13 12:46:33 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-bcd6bf5c4bddd4756927a39c432242111.dblock.zip (49.92 MB)
2023-06-13 12:46:34 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-ba804000e198346609e4721b8d0255599.dblock.zip (49.94 MB)
2023-06-13 12:46:39 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-bcd6bf5c4bddd4756927a39c432242111.dblock.zip (49.92 MB)
2023-06-13 12:46:40 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-b6ab6c249be3a4fbcb23e334acd5e7e26.dblock.zip (49.97 MB)
2023-06-13 12:46:41 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-i1905e2183dc84da0ac7472f3e256a63a.dindex.zip (17.66 KB)
2023-06-13 12:46:41 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-i6f96c1571d934267828a22b03ae4b200.dindex.zip (18.66 KB)
2023-06-13 12:46:41 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-i1905e2183dc84da0ac7472f3e256a63a.dindex.zip (17.66 KB)
2023-06-13 12:46:41 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-i6f96c1571d934267828a22b03ae4b200.dindex.zip (18.66 KB)
2023-06-13 12:46:42 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-bf78a95701bad4b77aa38e334e09907e6.dblock.zip (49.94 MB)
2023-06-13 12:46:43 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-b0ce86b1158c14464bb322e9fc2c6a444.dblock.zip (49.96 MB)
2023-06-13 12:47:26 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-bb6cfa0c2b266495591fa2ed9077a087f.dblock.zip (49.90 MB)
2023-06-13 12:47:26 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-i4c145a6211b04c49ab55f2fac6898847.dindex.zip (18.59 KB)
2023-06-13 12:47:26 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-i4c145a6211b04c49ab55f2fac6898847.dindex.zip (18.59 KB)
2023-06-13 12:47:26 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-ba1235e9399b04fffaaa879423ec46bba.dblock.zip (25.05 MB)
2023-06-13 12:47:33 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-ba1235e9399b04fffaaa879423ec46bba.dblock.zip (25.05 MB)
2023-06-13 12:47:33 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-i3836723569d447ab8c30a85d1d538323.dindex.zip (21.92 KB)
2023-06-13 12:47:33 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-i3836723569d447ab8c30a85d1d538323.dindex.zip (21.92 KB)
2023-06-13 12:47:36 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-ba804000e198346609e4721b8d0255599.dblock.zip (49.94 MB)
2023-06-13 12:47:36 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-i72a09e884b7e42358051723ded3ebf1b.dindex.zip (17.65 KB)
2023-06-13 12:47:36 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-i72a09e884b7e42358051723ded3ebf1b.dindex.zip (17.65 KB)
2023-06-13 12:48:05 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-b0ce86b1158c14464bb322e9fc2c6a444.dblock.zip (49.96 MB)
2023-06-13 12:48:05 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-i17166c6e3bd844ac8b0a0c82d782e6fa.dindex.zip (111.48 KB)
2023-06-13 12:48:05 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-i17166c6e3bd844ac8b0a0c82d782e6fa.dindex.zip (111.48 KB)
2023-06-13 12:48:16 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-bf78a95701bad4b77aa38e334e09907e6.dblock.zip (49.94 MB)
2023-06-13 12:48:16 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-i5d2d7a4f3ac8430cab86ea91a0f7243c.dindex.zip (17.84 KB)
2023-06-13 12:48:16 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-i5d2d7a4f3ac8430cab86ea91a0f7243c.dindex.zip (17.84 KB)
2023-06-13 12:48:16 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-20230613T164621Z.dlist.zip (888 bytes)
2023-06-13 12:48:16 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Completed: duplicati-20230613T164621Z.dlist.zip (888 bytes)
2023-06-13 12:48:16 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: List - Started:  ()
2023-06-13 12:48:16 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: List - Completed:  (15 bytes)

shows the concurrent uploads of dblock files (I added a second file of about 300 MB). Still no problem, possibly meaning that the forced HTTP error is causing things to run unusually (maybe reducing the issue).

EDIT:

If you mean 100 seconds is hard to fix when it’s not Duplicati code doing it (e.g. third-party drivers), yes. Possibly sometimes they give a knob we’re not using, other times maybe need workaround like you say.

Exactly. I have browsed Minio-dotnet code and the timeout is at the default and there is no way for the client (Duplicati in our case) to change it.

This is just a summary about how the cancellation is working right now so I won’t forget:

Cancellation process

  • All operations have a result which can control the operation (pause, resume, stop (now / after file), abort)
  • The backup handler is passed a cancellation token
  • Controller.Stop() cancels that token and calls BasicResult.Stop()
  • Controller.Abort() calls BasicResults.Abort() and Thread.Abort() on the task thread (does not cancel the token)

Part A:

  • BasicResult.Stop() sets m_pauseEvent to unpaused and m_controlState to stopped (aborted for Abort()) on the result. These are used by result.TaskControlRendevouz():
    • The method throws an exception if aborted
    • If paused it blocks, otherwise it returns the current state
    • BackendManager uses this to pause and abort transfers between operations and in the progress handler
    • BackupHandler: pause and abort before post backup verification
    • CompactHandler, DeleteHandler, RecreateDatabaseHandler, RepairHandler, RestoreControlFilesHandler, RestoreHandler, TestHandler:
      pause/abort before each file, stop: complete backend transfer and finish transaction before stopping
    • ListChangesHandler: pause/stop/abort at a few predetermined spots
    • ListFilesHandler: pause/stop/abort before each fileset

Part B:

  • In addition, BasicResult.Stop() calls stop on a m_taskController, but only for stop now. There is supposed to be a distinction between stop now and stop after current file in the task controller, but it is not used.

  • For BackupHandler this is passed in as result.TaskReader to

    • BackendUploader
    • DataBlockProcessor
    • FileBlockProcessor
    • StreamBlockSplitter
    • FileEnumerationProcess
    • SpillCollectorProcess

    and is checked at some other places, but not to:

    • FilePreFilterProcess
    • MetadataPreProcess (gets the cancellation token though, uses it to ignore cancellation exceptions)
    • ProgressHandler
  • In each of those with task reader:

    • await taskreader.ProgressAsync is called repeatedly
    • Running: this returns true without blocking
    • Paused: this await blocks until resumed
    • Stopped: this returns false without blocking, allowing cleanup after a file/block is complete
    • Terminated (only called in Dispose()): this cancels the underlying task, throwing a TaskCancelledException once awaited. Some processors rely on this to finish after all other tasks.
  • Otherwise the task reader is only used for FileEnumerationProcess in TestFilterHandler
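The ProgressAsync contract described in Part B can be modeled with asyncio. This is an illustrative Python analogue, not Duplicati's C# ITaskReader: running returns true without blocking, paused blocks the await, stopped returns false so a worker can finish the current item and exit cleanly.

```python
import asyncio

class TaskReader:
    def __init__(self):
        self._resume = asyncio.Event()
        self._resume.set()
        self._stopped = False

    def pause(self):
        self._resume.clear()

    def stop(self):
        self._stopped = True
        self._resume.set()          # also releases a paused worker

    async def progress(self):
        await self._resume.wait()   # blocks while paused
        return not self._stopped    # False once stopped

async def worker(reader, items, done):
    for item in items:
        if not await reader.progress():
            break                   # stop: clean exit between items
        done.append(item)

async def main():
    # Paused worker never starts an item; stop() releases it, it exits.
    reader, done = TaskReader(), []
    reader.pause()
    task = asyncio.create_task(worker(reader, ["a", "b", "c"], done))
    await asyncio.sleep(0)          # worker is now blocked in progress()
    reader.stop()
    await task

    # A running reader lets the worker process everything.
    reader2, done2 = TaskReader(), []
    await worker(reader2, ["a", "b"], done2)
    return done, done2

result = asyncio.run(main())
print(result)   # ([], ['a', 'b'])
```

Termination (the Dispose path above) would instead cancel the underlying task, surfacing as a cancellation exception at the await rather than a false return.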

Part C

  • The cancellation token is used in
    • FileBlockProcessor
    • FileEnumerationProcess
    • BackupHandler.RunMainOperation() to determine if the backup is partial
  • It seems to be checked at least as often as ITaskReader.ProgressAsync, or more often. I think this is to cancel faster in sections where a pause would not make sense.

Conclusion

  • There seem to be two parallel ways to communicate the end of the operation (three if you count the cancellation token)
  • BasicResult.TaskControlRendevouz():
    • is a blocking operation to pause the progress
    • throws when aborted
    • can return Run or Stop (Pause will always block until resumed and Abort will always throw)
  • ITaskReader.ProgressAsync:
    • is async await compatible to pause the progress (Pause() is not called anywhere, should be in BasicResults.Pause())
    • throws when terminated (in Dispose, which does not seem to be called ever, it is also not set in BasicResults.Abort())
    • returns true to continue, false to stop

I think it was intended to move from the blocking TaskControlRendevouz() to the async ProgressAsync, but this transition is very incomplete at the moment. In the commit history, the latter was added in a3f2b39d9 in 2016, seemingly after the cancellation token was already in use:

Implemented handling of pause/stop/abort in the concurrent code.
Implemented the dry-run feature for backups.

TaskControlRendevouz() was added in bd53090a in 2014:

Implemented the pause/resume/stop/start methods throughout the calls to allow for interactive control over the tasks

Right now this whole logic is very confusing, maybe @kenkendk knows more about how this was intended to be.

Thanks for sharing your analysis.

All I can add is that I have found the backup cancellation token a bit confusing in its naming. In actual fact, there are 2 main levels of cancellation tokens. The main one is created in Duplicati/Library/Main/Operation/BackupHandler.cs; it is only used in RunMainOperation, that is, the thing that is counting files, creating temporary files (zip and aes), and passing them to the backend level. This token is never used by the backend level. There is another cancellation token created at the backend level (Duplicati/Library/Main/Operation/Backup/BackendUploader.cs), and this one never takes the user action (clicking cancel in the UI) into account. It is only used to self-destroy the uploading when something goes badly wrong. But in the code there is no different naming for these 2 levels, so it's not obvious while browsing the code.

In short, the Duplicati approach is to never cancel pending uploads. I think the idea is that doing so could create states that would be very difficult to recover from. That means that if the upload goes wrong without crashing, it gives frustrating results for users, because the backup seems to never stop (network failures plus retrying take a very long time to finally abort).

So cancelling works fine (if a bit slowly if the upload link itself is slow) if the backend works reliably. All that is necessary is to wait patiently for the upload to terminate (it never cancels itself).

That's why I don't see this as a major priority: if the backend doesn't work (the target itself is bad, something that can happen when using homegrown servers or consumer-grade NAS), or the driver is buggy or outdated, you have bigger problems than trouble cancelling backups. The backups would not work anyway. I have toyed with the idea of transmitting the cancellation request from the main cancellation token to the backend level, but I have given up for now because it's not a really urgent problem IMO.

The reason why I went so deep is because for the .NET 5 migration, all of the backends receive a cancellation token for their operations. I am just not sure which one I should use, but probably the existing BackendUploader token. The BackendUploader should probably also use another token to cancel any retries, otherwise stopping will always be slow. I don’t think that stopping after a failed upload will do any damage to the backup consistency.

There is also BackendManager which is used for tasks other than uploading. That also needs to forward the cancellation, although it runs everything on a separate thread.

I feel like stop now should also abort running uploads (if possible on the backend level), so the process stops immediately. Stop after current file should wait for running uploads to finish, but maybe not do any retries.