StopNow(), Backup and cancellation tokens

ts678 · May 28, 2023, 6:50pm

I tried looking at this a little. I wish there were more logging available, but what exists seems to cover mostly the code that actually runs. I think a lot of what’s seen in the code is not actually used in backup.

TemporaryTransactionWrapper is handed the transaction that the backup began, so a commit on the wrapper is ignored. When an outer transaction is not passed in, the local transaction takes care of keeping things transaction-safe. This reminds me a bit of what I guess a stored procedure might do on a DB that has nested transactions.
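To make that pattern concrete, here is a minimal Python sketch of the idea using the standard sqlite3 module. It is not Duplicati's code: the TemporaryTransaction class, the set_state helper, the outer_active flag, and the two-column Remotevolume table with its state strings are all hypothetical stand-ins chosen for illustration.

```python
import sqlite3


class TemporaryTransaction:
    """Hypothetical analogue of the wrapper: own a transaction only if no outer one exists."""

    def __init__(self, conn, outer_active):
        self.conn = conn
        self.owns = not outer_active
        if self.owns:
            conn.execute("BEGIN")  # local transaction keeps this unit of work atomic

    def commit(self):
        if self.owns:
            self.conn.execute("COMMIT")
        # Otherwise do nothing: the caller that began the outer transaction decides.


def set_state(conn, name, state, outer_active=False):
    """Update one row, committing only when no outer transaction was passed in."""
    tx = TemporaryTransaction(conn, outer_active)
    conn.execute("UPDATE Remotevolume SET State = ? WHERE Name = ?", (state, name))
    tx.commit()


# isolation_level=None turns off Python's implicit BEGIN, so BEGIN/COMMIT are explicit.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute("CREATE TABLE Remotevolume (Name TEXT, State TEXT)")  # simplified schema (assumption)
conn.execute("INSERT INTO Remotevolume VALUES ('vol1', 'Temporary')")

# Case 1: no outer transaction, so the wrapper begins and commits its own.
set_state(conn, "vol1", "Uploading")
standalone_in_tx = conn.in_transaction

# Case 2: an outer transaction is active, so the wrapper's commit is ignored.
conn.execute("BEGIN")
set_state(conn, "vol1", "Uploaded", outer_active=True)
nested_in_tx = conn.in_transaction  # still inside the outer transaction
conn.execute("ROLLBACK")            # the outer owner can still undo the inner work

final_state = conn.execute("SELECT State FROM Remotevolume").fetchone()[0]
```

In the second case the outer rollback also undoes the inner update, which is the point: whoever began the transaction stays in control of when work becomes durable.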

One simple way to ballpark the commit timing is to watch the rollback journal size in File Explorer, sorted by Date modified. Watching during file uploads, I see it shrink occasionally around the time a file upload occurs, because the upload also needs to set the Remotevolume row’s State field (and commit it). That leaves some historical record for restart fixes, possibly including the synthetic filelist. I would liken this to an application checkpoint that improves restarts by allowing something better than a start from scratch. This is nice when the work might run for a long time.

For a finer-grained view, I used three methods, which lined up reasonably well with each other.

Sysinternals Process Monitor: include FlushBuffersFile and CloseFile, and exclude anything that is not the journal file.
Rapid polling of the journal size, reporting whenever the size decreases or the journal has been deleted.
Profiling log in the glogg viewer, looking for commits, typically CommitTransactionAsync in its many variations.
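The second method (rapid polling) can be sketched roughly as below. This is a Python illustration of the idea, not the script I actually ran; the function names, the 50 ms interval, and the output format are all assumptions. SQLite names the rollback journal after the database file with "-journal" appended.

```python
import os
import time


def journal_size(path):
    """Return the journal size in bytes, or None if the file is missing or unreadable."""
    try:
        return os.path.getsize(path)
    except OSError:
        return None


def describe_change(prev, cur):
    """Classify a size transition; only shrinkage and deletion are interesting here."""
    if prev is not None and cur is None:
        return "deleted"
    if prev is not None and cur is not None and cur < prev:
        return "shrank"
    return None


def watch(path, interval=0.05, max_polls=None):
    """Poll the journal and print a timestamped line when it shrinks or disappears."""
    prev = journal_size(path)
    polls = 0
    while max_polls is None or polls < max_polls:
        time.sleep(interval)
        cur = journal_size(path)
        event = describe_change(prev, cur)
        if event:
            print(f"{time.strftime('%H:%M:%S')} journal {event}: {prev} -> {cur}")
        prev = cur
        polls += 1


# Example call (the path is a placeholder for the job database's journal):
# watch(r"C:\path\to\database.sqlite-journal")
```

Polling can miss a commit that is quickly followed by new writes between two samples, which is one reason to cross-check it against Process Monitor and the profiling log.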

One concern I had before was whether asynchronous work such as uploads might decide to commit at a time that was wrong for some other thread. It now looks like the work generally gets done in the *Async versions of the database methods, and having them run on the main thread prevents incorrect interleaving. Performance probably suffers though. I’m thinking of one user with a 20 core CPU going underutilized…

I don’t know how this relates to the StopNow issue because I haven’t looked at the thread design much. Understanding the issue of transactions seemed worthwhile, because of the proven backup corruptions. Shedding light on some areas should give more confidence that fixes can be done safely, as is the goal.

EDIT:

By proven backup corruptions, I mean ones that are currently filed with reproducible test cases, even when it takes an interrupt tester like the one I described to induce them. Aside from those, it holds up rather well, so it’s far from clear to me that it can’t possibly work reliably (even if slowly) and therefore needs a new design.