StopNow(), Backup and cancellation tokens

I’ve spent more time thinking about this and playing with it. The more I understand how the database is linked into the backup process, the less confident I am that it produces correct results unless the entire backup completes successfully.

I know that all queries are run on the main thread. I know there are transactions wrapping many of the calls. I am unsure of the size and scale of those transactions, and of how a possible rollback of one of them via a stopNow() action will impact the output. I’ve not been able to build a transaction model in my head that creates consistency between the resulting backup and the local database, when the two have different definitions of complete/committed. The multi-threaded nature of the backup does not lend itself to doing that in a helpful way.

For example, take 10 files of 100 MB each, all with the same content.
The backup will split those across threads, produce the same blocks in different files, and write that into the local database. All of that happens before the data is in a volume and that volume is stored on the remote. I can see race conditions in that scenario, even before a stop.
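To make that race concrete, here is a hypothetical sketch (not actual Duplicati code; the interfaces and names are invented for illustration) of the check-then-insert pattern that two block hashers could hit at the same time:

```csharp
using System.Threading.Tasks;

// Hypothetical stand-ins for the local database and a proto-dblock writer.
interface IBlockDb
{
    Task<bool> BlockExistsAsync(string hash, long size);
    Task AddBlockAsync(string hash, long size, long volumeId);
}

interface IVolumeWriter
{
    long VolumeId { get; }
    void AddBlock(string hash, byte[] data);
}

static class BlockRaceSketch
{
    // Two block hashers working on identical files can both observe
    // "block not present" and both write the block, one per proto-dblock.
    public static async Task ProcessBlockAsync(IBlockDb db, IVolumeWriter volume, string hash, byte[] data)
    {
        if (!await db.BlockExistsAsync(hash, data.LongLength))      // check...
        {
            volume.AddBlock(hash, data);                            // ...then act: not atomic
            await db.AddBlockAsync(hash, data.LongLength, volume.VolumeId);
        }
        // A stop between the database insert and the volume upload leaves the
        // database claiming a block the destination never received.
    }
}
```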

When we stop in the middle, the database may commit to say we have backed up a block, but I see no guarantee that we actually have. Alternatively, we may back up the block, but because of the transactions in the local database and the exit point, the transaction is rolled back and we now have backed-up blocks and volumes that the local database thinks are not backed up.

It is increasingly my opinion that the local database, the local data management model, and the threading process need to be revisited to determine how they can work together to produce a relatively transactional model.
My preference would be for the messages sent between threads to carry the backup block information for any backup in progress, and for the database to be updated and checked at the last possible point before deciding whether we are actually collecting the block data to put into a volume. There are edge cases in that as well for which I’ve not devised a simple plan.

There are possible performance implications of these choices. But for a backup tool, I consistently hear that correctness is the most important thing. It’s my view that changes should be made in that direction, to increase stability and correctness. I believe that in the short term this may have some performance impact. However, I also believe the structure of the local database and its use are a significant performance problem, so we may see improvement.

I don’t think any of these issues can be resolved in the current beta as all require changes that are too risky.

This supports my opinion that larger changes need to be possible in order to allow longer-term benefits. That is not viewed as the best direction for the project with the resources available, which in turn pushes me back to wondering how I can help if all the things I have to offer are not suitable at this time. Maybe I should try again in a year.

I definitely do not need to take design changes into consideration at this time. I am already distracted enough as it is.

Thanks for taking the lead on gaining understanding. Do you think you’ll be able to share someday?
Understanding the existing mechanisms (which do extend beyond just the database) will be helpful:

My proposal is that a visit should precede a revisit. No currently active person knows the early plans. Should you prefer to call that a revisit (by a new team), I think it’s still good not to “clean-sheet” code. Proposing an ideal design should be safe enough, but ripping everything out will set things way back. Incremental change starting from the current code should (if it can get us there) be a whole lot better.

Some items that are not strictly database transactions, but are possibly helped by committing at the right time:

The Remotevolume table State field gives clues, on a backup restart, about how far uploading had gotten. Mechanisms must exist for blocks that didn’t make it, or, for that matter, made it and then the dblock vanished.
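As a hedged illustration, a restart could interrogate that State column roughly like this; the table, column, and state names are from my recollection of the Duplicati schema and may not match exactly:

```csharp
using System;
using System.Data.SQLite;   // Duplicati uses System.Data.SQLite; adjust the provider as needed

// Hedged sketch: list remote volumes whose State suggests an interrupted upload.
// Column and state names are recalled from the Duplicati schema and may differ slightly.
static class RestartInspection
{
    public static void ListUnfinishedVolumes(string databasePath)
    {
        using (var conn = new SQLiteConnection($"Data Source={databasePath}"))
        {
            conn.Open();
            using (var cmd = conn.CreateCommand())
            {
                cmd.CommandText =
                    @"SELECT Name, Type, State
                        FROM Remotevolume
                       WHERE State IN ('Temporary', 'Uploading', 'Uploaded')";
                using (var reader = cmd.ExecuteReader())
                    while (reader.Read())
                        Console.WriteLine($"{reader["Name"]} ({reader["Type"]}): {reader["State"]}");
            }
        }
    }
}
```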

A vanished dblock gets fixed by code in purge-broken-files, by purging files that existed before the vanish.
A potentially similar mechanism could be applied to the temporary block files that are written in parallel.
Unlike the vanished-dblock case, this loss is of a proto-dblock that probably won’t impact any old files.

There is a synthetic filelist mechanism that may run before a backup, to upload a dlist if the prior backup couldn’t do it.
That situation can arise from an interrupt, and the mechanism is a means to record files to the extent that the upload finished. Instead of running it at the start of the backup after an interrupt, maybe do it at the end, after a fast stop effort.

Reliable cleanup from a crash or interrupt is needed anyway. Should a stop now go down super-hard?
How far off is it currently from stopping all backup progress and doing what the synthetic filelist will do? That’s probably the minimum possible time for a clean stop, e.g. destination is consistent, with its dlist.

There are some similarities between SpillCollector, which runs when proto-dblock parallel work is done, and Compact. They both take partially filled block files and emit filled ones, and maybe some leftovers.
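To illustrate the shared idea only (this is not either component’s actual algorithm), a hedged sketch of repacking partially filled block files into filled ones plus a leftover:

```csharp
using System.Collections.Generic;
using System.Linq;

// Hedged sketch of the common idea behind SpillCollector and Compact:
// repack partially filled block files into full ones, possibly leaving one partial leftover.
// Blocks are modeled only by their sizes; this is an illustration of the concept.
static class RepackSketch
{
    public static (List<List<long>> FullVolumes, List<long> Leftover) Repack(
        IEnumerable<long> blockSizes, long volumeSizeLimit)
    {
        var fullVolumes = new List<List<long>>();
        var current = new List<long>();
        long currentSize = 0;

        foreach (var size in blockSizes.OrderByDescending(s => s))
        {
            if (currentSize + size > volumeSizeLimit && current.Count > 0)
            {
                fullVolumes.Add(current);          // emit a filled volume
                current = new List<long>();
                currentSize = 0;
            }
            current.Add(size);
            currentSize += size;
        }
        return (fullVolumes, current);             // 'current' is the partial leftover, if any
    }
}
```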

To make the whole analysis less of an overwhelmingly large chunk, one can still request a synchronous upload, allowing focus on file preparation. That can then be de-parallelized using options. Looking at logs from a simplified version may be one way to start. Do we log commits? If not, add that.

Can you continue developing an understanding of the current plan and produce a short writeup of that? There have been several attempts made at describing the processing. Add in the database processing.
You might have thoughts about right-sizing of queries, EXPLAIN QUERY PLAN (don’t scan), and so on…
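For instance, a hedged sketch of checking whether a typical block lookup uses an index rather than a scan; the Block table and column names are from memory and might be slightly off:

```csharp
using System;
using System.Data.SQLite;

// Hedged sketch: run EXPLAIN QUERY PLAN on a typical block-lookup query to confirm
// it reports SEARCH ... USING INDEX rather than SCAN.
static class QueryPlanCheck
{
    public static void Explain(string databasePath)
    {
        using (var conn = new SQLiteConnection($"Data Source={databasePath}"))
        {
            conn.Open();
            using (var cmd = conn.CreateCommand())
            {
                cmd.CommandText =
                    "EXPLAIN QUERY PLAN SELECT ID FROM Block WHERE Hash = @hash AND Size = @size";
                cmd.Parameters.AddWithValue("@hash", "dummy");
                cmd.Parameters.AddWithValue("@size", 102400);
                using (var reader = cmd.ExecuteReader())
                    while (reader.Read())
                        Console.WriteLine(reader["detail"]);   // e.g. "SEARCH Block USING INDEX ..."
            }
        }
    }
}
```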

There’s also the new PRAGMA adder to play with. That was well-received by one database-savvy user.
I’m not sure who’s going to actually go explore though, but then the explorer can share what they found.
This same user has a great site for hitting transaction problems too. I wish they’d work with us on those.

A reply to you on database and destination internals testing would probably do a lot to support their situation.
Want some testing scripts to go break Duplicati, so you can study and fix it, and then go break it some more?

Or look at some existing backup corruption that has good steps to reproduce. There are several around.

There’s an abundance of places to help, IMO. If you care to, describe what you’re good at, and what you’re not. The above suggestions were sort of DB-flavored but need some C# reading ability too. What about Python?

You are quite right, there should be documents describing this. While I did try to document the process, it was more attractive to write the code than document the “obvious” details, which are now not so obvious.

As stated, there is no real documentation for how this is supposed to work, but I can give my idea for the process. A caveat is that this is straight from memory, so I might have details wrong, but the overall idea should be the same.

Anything that is not transmitted to a remote resource (i.e. an upload) is not visible, so such changes should never be permanent. This leads to the idea that an upload triggers a transaction commit. I seem to recall that the volume producer (collecting the blocks) will issue the commit when a volume is completed. That volume will be in the “temporary” state, as we do not know if it has been uploaded. Once the upload has completed, it will be marked as “uploaded” and the database will commit again. The next backup will mark the volume as “verified”, so we know the file is truly uploaded.

At least that is the “golden path”.
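A minimal sketch of that golden path as described above, with invented interface names and state names as I recall them; the commit points are the part being illustrated, not real code:

```csharp
using System;
using System.Threading.Tasks;

// Hedged sketch of the "golden path": a volume is recorded as Temporary when filled
// (commit), moved to Uploaded after the upload succeeds (commit again), and a later
// backup's verification pass promotes it to Verified.
enum VolumeState { Temporary, Uploading, Uploaded, Verified }

interface IVolumeDb
{
    Task SetStateAsync(long volumeId, VolumeState state);
    Task CommitAsync();   // each upload-related milestone is a natural commit point
}

static class GoldenPathSketch
{
    public static async Task UploadVolumeAsync(IVolumeDb db, long volumeId, Func<Task> upload)
    {
        await db.SetStateAsync(volumeId, VolumeState.Temporary);
        await db.CommitAsync();            // commit: volume exists locally, not yet remote

        await db.SetStateAsync(volumeId, VolumeState.Uploading);
        await upload();                    // transfer the dblock to the destination

        await db.SetStateAsync(volumeId, VolumeState.Uploaded);
        await db.CommitAsync();            // commit: the remote now holds the data

        // The *next* backup's verification step promotes Uploaded -> Verified.
    }
}
```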

If the backup is interrupted, there will be volumes marked as “temporary”. These will be purged on the next run. Ideally they should not even be in the database, but that would require another storage mechanism, which I deemed too complex at the time. Today, I think a B+Tree would be a good optimization to avoid repeated database queries (yes, SQLite uses B+ trees as well).
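To illustrate the suggested optimization (purely a sketch; a plain dictionary stands in for a real B+Tree, and none of this is existing Duplicati code), known block hashes could be cached in memory so repeated existence checks skip the database:

```csharp
using System.Collections.Generic;
using System.Threading.Tasks;

// Hedged sketch: keep an in-memory index of known block hashes so repeated
// "does this block exist?" checks avoid a database round trip.
class BlockLookupCache
{
    private readonly Dictionary<string, long> _knownBlocks = new Dictionary<string, long>();
    private readonly System.Func<string, long, Task<bool>> _queryDatabase;

    public BlockLookupCache(System.Func<string, long, Task<bool>> queryDatabase)
        => _queryDatabase = queryDatabase;

    public async Task<bool> ExistsAsync(string hash, long size)
    {
        if (_knownBlocks.ContainsKey(hash))
            return true;                               // hit: no database query needed

        var exists = await _queryDatabase(hash, size); // miss: fall back to the database
        if (exists)
            _knownBlocks[hash] = size;
        return exists;
    }
}
```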

This “temporary” state allows recording of data into the database, in a consistent transactionally safe manner, while also supporting CTRL+C style kills.

One problem is that Duplicati relies on the filelist to figure out “what goes where” and for an interrupted backup, this file is missing, as it is the last file to be uploaded. If this scenario is detected, a “synthetic” filelist is created, based on the last uploaded state and whatever changes managed to get uploaded (i.e. what blocks can be used). This part is quite complex and has been the source of some bugs.

Yes, I left out all the metadata handling to keep the concept simple. If I were to implement it again, I would use a naming scheme to be able to store multiple metadata streams, and the streams would be treated 100% like the main data stream.

However, when I added metadata, I was naive and thought that metadata was less than 1 KB, so it could be stored differently. The current handler shows some of this.

Due to this simplistic thinking, the current handling of metadata only supports up to one block, and all the code that handles it uses special-case logic. If it were rewritten to just use a special filename for metadata, almost all of the special handling could be deleted.
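A hedged sketch of that naming-scheme idea (entirely hypothetical, including the suffix convention): metadata becomes just another stream, distinguished only by a derived name, and flows through the same pipeline as file data:

```csharp
using System.Collections.Generic;
using System.IO;

// Hypothetical sketch of the "naming scheme" idea: every stream (data or metadata)
// is emitted under a derived name and handled identically downstream.
static class StreamNamingSketch
{
    private const string MetadataSuffix = ":meta";   // invented marker, not a real Duplicati convention

    public static IEnumerable<(string StreamName, Stream Content)> EnumerateStreams(
        string path, Stream data, Stream metadata)
    {
        yield return (path, data);                          // main data stream
        yield return (path + MetadataSuffix, metadata);     // metadata, treated exactly like data
        // Additional metadata streams (e.g. ACLs, extended attributes) would just add more names.
    }
}
```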

My thinking here was to reduce the memory usage, and let the database be the “source of truth”, such that multiple volumes are started, allowing multithreaded compression, but the single database is updated and queried to see if a block already exists. With the “temporary” fix explained above, this takes advantage of the transactions while supporting multi-threading. But there might be cleaner ways to handle this.
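A hedged sketch of that shape, using System.Threading.Channels purely for illustration (the real code uses its own channel library, as I recall): several compressors run in parallel, but every block existence check and registration funnels through a single database consumer:

```csharp
using System.Threading.Channels;
using System.Threading.Tasks;

// Hedged sketch (not Duplicati code): parallel compressors, one database consumer.
// All "does this block exist? if not, record it" traffic is serialized through a channel,
// mirroring the idea that the single database is the source of truth.
record BlockRequest(string Hash, long Size, TaskCompletionSource<bool> IsNew);

static class SingleDbThreadSketch
{
    public static async Task RunDatabaseConsumerAsync(
        ChannelReader<BlockRequest> requests,
        System.Collections.Generic.HashSet<string> knownHashes /* stand-in for the real database */)
    {
        await foreach (var req in requests.ReadAllAsync())
        {
            // Single consumer: check-and-insert is effectively atomic here.
            var isNew = knownHashes.Add(req.Hash);
            req.IsNew.SetResult(isNew);
        }
    }

    public static async Task<bool> RegisterBlockAsync(ChannelWriter<BlockRequest> writer, string hash, long size)
    {
        var tcs = new TaskCompletionSource<bool>();
        await writer.WriteAsync(new BlockRequest(hash, size, tcs));
        return await tcs.Task;   // true => caller should compress this block into its volume
    }
}
```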

If it is helpful that I explain the underlying thinking, just tag me in a question, or PM me.


I tried looking at this a little. I wish there were more logging available; however, what’s there looks like it mostly covers what’s run. I think a lot of what’s seen in the code is not actually used in backup.

TemporaryTransactionWrapper comes in with the transaction the backup began, so a commit is ignored. When an outer transaction is not passed in, the local transaction takes care of keeping things transaction-safe. This reminds me a bit of what I imagine a stored procedure might do on a DB that has nested transactions.
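My reading of that behavior, as a hedged sketch of the shape I think the wrapper has (this is not the actual source):

```csharp
using System.Data;

// Hedged sketch of how I read TemporaryTransactionWrapper's behavior.
// If an outer transaction is supplied, Commit() is a no-op (the owner commits later);
// if not, a local transaction is created and really committed or rolled back here.
class TemporaryTransactionWrapperSketch : System.IDisposable
{
    private readonly IDbTransaction _transaction;
    private readonly bool _isTemporary;   // true when we created the transaction ourselves

    public TemporaryTransactionWrapperSketch(IDbConnection connection, IDbTransaction outer)
    {
        _isTemporary = outer == null;
        _transaction = outer ?? connection.BeginTransaction();
    }

    public IDbTransaction Parent => _transaction;

    public void Commit()
    {
        if (_isTemporary)
            _transaction.Commit();   // only commit what we own; otherwise the commit is ignored
    }

    public void Dispose()
    {
        if (_isTemporary)
            _transaction.Dispose();  // rolls back if Commit() was never reached
    }
}
```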

One simple way to ballpark it is to watch the rollback journal size in File Explorer, sorted by Date modified. Watching file uploads, the journal shrinks occasionally around when a file upload occurs, as the upload also needs to set (and commit) the Remotevolume row’s State field so there are historical records to do restart fixes with, possibly including the synthetic filelist. I would liken this to an application checkpoint to enhance restarts, which allows something better than starting from scratch. This is nice when work might go on for a long time.

For a finer-grained view, I used three methods, which seemed to line up somewhat well against each other.

  1. Sysinternals Process Monitor: include FlushBuffersFile and CloseFile, exclude anything that is not the journal file.

  2. Rapid polling of the journal size, reporting when the size decreases or the journal has been deleted (see the sketch after this list).

  3. Profiling log in glogg viewer, looking for commits, typically CommitTransactionAsync in many types.
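For the second method, here is roughly what I used, as a hedged sketch. It assumes SQLite’s default rollback-journal mode, where the journal file is named `<database>-journal` and shrinks or disappears around a commit; WAL mode would behave differently.

```csharp
using System;
using System.IO;
using System.Threading;

// Hedged sketch of method 2: poll the SQLite rollback journal's size and report
// when it shrinks or disappears, which roughly marks a commit.
static class JournalWatcher
{
    public static void Watch(string databasePath, int pollMilliseconds = 50)
    {
        var journalPath = databasePath + "-journal";
        long lastSize = -1;

        while (true)
        {
            long size = File.Exists(journalPath) ? new FileInfo(journalPath).Length : -1;

            if (size < lastSize)
                Console.WriteLine($"{DateTime.Now:HH:mm:ss.fff} journal " +
                    (size < 0 ? "deleted" : $"shrank to {size} bytes") + " (likely commit)");

            lastSize = size;
            Thread.Sleep(pollMilliseconds);
        }
    }
}
```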

One concern I had before was whether asynchronous work such as uploads might decide to commit at a time that was wrong for some other thread’s liking. It now looks like work generally gets done in the *Async version of the database method, and having those run on the main thread prevents incorrect interleaving. Performance probably suffers, though. I’m thinking of one user with a 20-core CPU going underutilized…

I don’t know how this relates to the StopNow issue because I haven’t looked at the thread design much. Understanding the issue of transactions seemed worthwhile, because of the proven backup corruptions. Shedding light on some areas should give more confidence that fixes can be done safely, as is the goal.

EDIT:

By proven backup corruptions, I mean ones such as are currently filed with reproducible test cases, even when it takes an interrupt tester like I described to induce them. Aside from those, it holds up rather well, meaning it’s far from clear to me that it can’t possibly work reliably (even if slowly) and therefore needs a new design.