StopNow(), Backup and cancellation tokens

Hi @gpatel-fr,

I’ve been looking at the StopNow() issue, and attempting, like you, to dig into the code.

My threading skills around async tasks are not as strong as I’d like. It seems a very complicated setup with channels and multiple threads. What is confusing is that I can’t understand how a few of the Tasks ever exit:

Backup.DataBlockProcessor.Run(database, options, taskreader, token),
Backup.FileBlockProcessor.Run(snapshot, options, database, stats, taskreader, token),
Backup.StreamBlockSplitter.Run(options, database, taskreader),
Backup.FileEnumerationProcess.Run(sources, snapshot, journalService,
      options.FileAttributeFilter, sourcefilter, filter, options.SymlinkPolicy,
      options.HardlinkPolicy, options.ExcludeEmptyFolders, options.IgnoreFilenames,
      options.ChangedFilelist, taskreader, token),
Backup.FilePreFilterProcess.Run(snapshot, options, stats, database),
Backup.MetadataPreProcess.Run(snapshot, options, database, lastfilesetid, token),
Backup.SpillCollectorProcess.Run(options, database, taskreader),
Backup.ProgressHandler.Run(result)

Some accept cancellation tokens and take action on them.
Some use the taskreader to await in the while loop.
Some use while(true), and aside from a thrown exception, I can’t see how they exit. And in some cases, all exceptions are caught and the loop just continues.

Now, when the main thread ends, these background tasks/threads can (and do) keep running; I’ve seen that in my testing on my computer. I don’t know how that’s safe for the database, since all queries are supposed to run on the main thread, which has already exited.

The error I get from running that test is unparseable JSON:

JsonTextReader.ParseValue()
JsonReader.ReadForType(JsonContract contract, Boolean hasConverter)
JsonSerializerInternalReader.Deserialize(JsonReader reader, Type objectType, Boolean checkAdditionalContent)
JsonSerializer.DeserializeInternal(JsonReader reader, Type objectType)
JsonSerializer.Deserialize[T](JsonReader reader)
RestoreHandler.ApplyMetadata(String path, Stream stream, Boolean restorePermissions, Boolean restoreSymlinkMetadata, Boolean dryrun) line 514
RestoreHandler.ApplyStoredMetadata(Options options, RestoreHandlerMetadataStorage metadatastorage) line 303
RestoreHandler.DoRun(LocalDatabase dbparent, IFilter filter, RestoreResults result) line 444
RestoreHandler.Run(String[] paths, IFilter filter) line 113
<cut further stack trace>

I have been unable to determine a way this code exits cleanly yet. Have you gotten any further?


Well, you have gotten farther than me. At some point it will be necessary to play with CoCoL to understand the backup code. However, it hangs for you in restore code, and CoCoL is not referenced in restorehandler.cs. From git blame, it seems that the code in RestoreHandler is mostly older than in BackupHandler; maybe it has not been updated to more recent features of Duplicati. That’s me speculating wildly, of course.
It would be interesting to go deeper, to see whether it always hangs in restore or sometimes hangs in backup too.
While understanding CoCoL and multitasking is certainly necessary in the medium term, it’s something I have postponed (in fact that’s mostly what I meant by ‘understanding the network code’ in my post) because there are so many urgent things, among them getting a beta out. So if you feel the same, I understand completely; maybe you could advance something else, because this particular enterprise will require quite some free and mostly uninterrupted time.

Do you have any exception text? All I see is a ParseValue at the top.

What code exactly? This is restore, and earlier you pointed to backup.
I can offer general advice, but I don’t want to look at every bit of code.

As a general guess for those, look at the CoCoL Hello World example.
There’s usually a way to recognize an end-of-data situation, e.g. Linux
read returns 0 on end of file. I think CoCoL throws a RetiredException.

So look at where the exception catch is, relative to the channel reads.
Any catch has a scope to it, doesn’t it? Are they at high or low levels?
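
To make the catch-scope point concrete, here is a minimal sketch (not Duplicati code; channel and Process() are made-up names, and I’m going by the CoCoL docs for RetiredException):

    // Exits: the catch is outside the loop, so a retired-and-drained channel ends the task.
    try
    {
        while (true)
            Process(await channel.ReadAsync());
    }
    catch (RetiredException)
    {
        // normal shutdown: the channel was retired and has been emptied
    }

    // Never exits: the catch is inside the loop, so the exception is swallowed
    // and the loop keeps spinning against a dead channel.
    while (true)
    {
        try { Process(await channel.ReadAsync()); }
        catch (Exception) { /* logged and ignored */ }
    }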

https://github.com/duplicati/duplicati/blob/master/Duplicati/Library/Main/Operation/Backup/Channels.cs
is what you will probably see backup use. That’s the only Channels.cs.

Following the channels in Channels.cs into their uses led to a description attempt here:

Channel Pipeline, which tries to also match the kenkendk writeup here.

The big concurrency addition in 2.0.3.6 looks (if you diff against 2.0.3.5) like BackupHandler.cs,
and RestoreHandler.cs got Fixed empty block handling in concurrent_processing branch #2637
and some interesting commentary on SQL readability, and knowing whether a rewrite is correct.
IMO this is an important question, because without it, code will get less stable rather than more.

Exception text

Newtonsoft.Json.JsonReaderException : Unexpected character encountered while parsing value: �. Path '', line 0, position 0.

The code I’m referring to is the BackupHandler and all the subtasks I posted. Each has its own quirks, but many are while (true). Let’s take Duplicati/Library/Main/Operation/Backup/MetadataPreProcess.cs and the Run() method: explain how that can exit. My guess, from what you’ve said, is that ReadAsync() throws a RetiredException. That’s caught in “some” cases, and it continues. I would say that, readability-wise, ReadAsync() being the exit mechanism is not clear. And there are no comments to assist in understanding the flow of the while(true) loops.

Now, we may be able to exit if a RetiredException is thrown. We can also get a ThreadAbortException at any point in the code, which is less than ideal for handling clean exits. And based on the intermittent failure of the test, I currently suspect we are sometimes processing the ThreadAbortException in ways that produce corrupt data. @tsuckow agrees that cancellation tokens are better, and I’m not sure why we have them but don’t pass them into each operation of a backup. That way we could finish the process in an acceptable way, just as we only act on a RetiredException when we are reading new data, not when processing existing data.
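
A sketch of what passing the token in could look like (illustrative only; WorkItem and ProcessAsync() are placeholders, with CoCoL’s IChannel<T> used for the input):

    using System.Threading;
    using System.Threading.Tasks;
    using CoCoL; // for IChannel<T>

    async Task RunAsync(IChannel<WorkItem> input, CancellationToken token)
    {
        while (true)
        {
            // Check at a safe point, before taking on new work, rather than
            // being aborted at an arbitrary instruction as ThreadAbort allows.
            token.ThrowIfCancellationRequested();
            var item = await input.ReadAsync();
            await ProcessAsync(item, token); // the token flows into every sub-operation
        }
    }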

I’m still struggling to work out where the JSON is written to disk. My estimation is that it is put into a block, and that block is written to the DB and to a volume. But with ThreadAbort, I can’t see how we guarantee that the metadata block makes it to disk in a transactional way, such that all threads either consider a block/file “processed” or not. It is my view that, for backup, we must be able to provide consistency between DB writing, file inclusion, block writing, and volumes. Otherwise you easily end up with corruption if you interrupt the process at any point.

Speaking of CoCoL, why use that over the standard .NET Channel producer/consumer logic?

Speaking of CoCoL, why use that over the standard .NET Channel producer/consumer logic?

If I’m not mistaken, channels aren’t supported in .NET Framework.

That was probably true 5+ years ago, and would have been the driver of a custom library then.

NuGet Gallery | System.Threading.Channels 8.0.0 and Channels - .NET | Microsoft Learn seem to indicate it’s available as long as you use .NET Framework 4.6.2 or .NET Core 2.1. Those aren’t unreasonable. But we have what we have at the moment.

As mentioned by @gpatel-fr, the restore code was supposed to be updated the same way as the backup process, using the channel model, but other things were prioritized, as restore is usually done less frequently than backups.

Yes, CoCoL is based on the concepts of Communicating Sequential Processes (CSP), which uses channels as the exclusive communication pattern and a “poison” that closes the channels. The language around CSP is a bit dated, so I opted to use newer terms, where you “retire” a channel, meaning that no further communication can be done. This causes any reader/writer to get a RetiredException when attempting to read or write.

CoCoL was written before Channels were introduced into C#. Today I would probably use the built-in channels, even though they are lacking some of the more advanced multi-reader/writer features.
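
For comparison, a rough sketch of the same producer/consumer idea with the built-in channels (illustrative, not Duplicati code):

    using System;
    using System.Threading.Channels;
    using System.Threading.Tasks;

    var channel = Channel.CreateUnbounded<string>();

    // Producer: Complete() plays roughly the role that "retire" has in CoCoL.
    var producer = Task.Run(async () =>
    {
        foreach (var p in new[] { "a.txt", "b.txt" })
            await channel.Writer.WriteAsync(p);
        channel.Writer.Complete();
    });

    // Consumer: the loop ends normally once the channel is completed and
    // drained, instead of surfacing a RetiredException.
    var consumer = Task.Run(async () =>
    {
        await foreach (var p in channel.Reader.ReadAllAsync())
            Console.WriteLine(p);
    });

    await Task.WhenAll(producer, consumer);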

The logic here is that the task (aka CSP process) is running in a while(true) loop, such that it runs “forever”. Some processes will not run forever, as they are the initial producers (such as the file system enumeration process). When a non-infinite process is completed, it will dispose() the channels that it is using. This makes the channels accept no more input, but allows them to be emptied (aka read). Once the channel is empty, it will throw the RetiredException, which will cause the infinitely-running task to stop. When it stops, it disposes its own channels. This causes a cascading effect that brings down all the processes. The reason for this design is that it reduces the chance of race conditions in the shutdown process.

In other words: each of the while(true) tasks runs until an exception is thrown. The RetiredException is filtered by RunProtected() and not re-thrown. Any other exception will be rethrown, so the error is reported to the task that is awaiting the process.
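
Compressed into a sketch (illustrative stage names; treat it as pseudocode rather than the actual Duplicati code):

    using System.Threading.Tasks;
    using CoCoL;

    // Finite producer: writes everything, then retires its output channel.
    async Task ProducerAsync(IChannel<string> files)
    {
        foreach (var f in EnumerateSources())
            await files.WriteAsync(f);
        await files.RetireAsync(); // no new writes accepted; readers may still drain it
    }

    // "Infinite" stage: exits only via RetiredException, once the input is drained.
    async Task StageAsync(IChannel<string> input, IChannel<string> output)
    {
        try
        {
            while (true)
                await output.WriteAsync(Transform(await input.ReadAsync()));
        }
        catch (RetiredException)
        {
            // filtered, as RunProtected() does: this is shutdown, not an error
        }
        finally
        {
            await output.RetireAsync(); // cascade the shutdown to the next stage
        }
    }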

Yes, the use of ThreadAbort was an earlier attempt at introducing a cancellation mechanism. I attempted to avoid passing a cancellation token around to all methods, and instead use the thread-abort feature. This turned out to be a poor solution, because all code now needs to handle exceptions at arbitrary code points.

The problem with ThreadAbort is that it does not allow any guarantees, which is why it should not be used at all.

The hard part is then to use the cancellation token, and make sure any long-running tasks, such as DB queries, are run in a task, such that the caller can quit while a query is running. The query code then needs to handle the cancellation once the query completes.
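
For example, something like this sketch (connection stands in for the single database connection; the query is illustrative):

    // Run the blocking query on a worker task so the caller can stop waiting.
    var queryTask = Task.Run(() =>
    {
        using (var cmd = connection.CreateCommand())
        {
            cmd.CommandText = @"SELECT COUNT(*) FROM ""Block""";
            return (long)cmd.ExecuteScalar();
        }
    });

    // The caller abandons the wait when the token fires; the query itself still
    // runs to completion on the worker and must clean up after itself.
    var completed = await Task.WhenAny(queryTask, Task.Delay(Timeout.Infinite, token));
    if (completed != queryTask)
        throw new OperationCanceledException(token);
    var blockCount = await queryTask;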

The error line does not seem to be related to writing JSON; it is an error that occurs when reading previously stored serialized metadata. The question is then whether the JSON in the zip file is corrupted, or the reader somehow messes up.

I would put a try/catch around that line and see what the stream contains; maybe that will reveal the issue. But it does sound like a data corruption issue where the contents of the stream are overwritten. The most likely cause is some code that reuses a buffer array to avoid allocations, which then breaks with multiple threads.
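
Something along these lines (a debugging sketch only; I am assuming from the stack trace that the target type is a Dictionary<string, string>):

    System.Collections.Generic.Dictionary<string, string> metadata;

    // Buffer the stream first, so its raw bytes can be dumped on failure.
    using (var ms = new System.IO.MemoryStream())
    {
        stream.CopyTo(ms);
        ms.Position = 0;
        try
        {
            using (var sr = new System.IO.StreamReader(ms, System.Text.Encoding.UTF8, false, 1024, leaveOpen: true))
            using (var jr = new Newtonsoft.Json.JsonTextReader(sr))
                metadata = new Newtonsoft.Json.JsonSerializer()
                    .Deserialize<System.Collections.Generic.Dictionary<string, string>>(jr);
        }
        catch (Newtonsoft.Json.JsonReaderException)
        {
            // Garbage here implicates the stored bytes (or a reused buffer);
            // clean JSON would implicate the reader path instead.
            Console.WriteLine(Convert.ToBase64String(ms.ToArray()));
            throw;
        }
    }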


That’s about it, but there are some finer points. I guess I’ll have to add to my list of tasks compiling my pages of notes, translating them to English, and getting the result into a generally readable form, covering the very intricate process that turns the on-disk structure into a database, and the reverse process. And I may not have worked out 100% of the funny stuff. It can’t be explained in a few lines.

This is supposed to be exposed in the Duplicati white papers, but those sources are not up to date (I guess Duplicati was not handling metadata when they were written, or metadata handling was left out for the sake of simplicity), and there are a lot of small things left out. There are also some posts, mostly by @ts678, on the subject.

While I’m talking to you, I’ll take the opportunity to point out that I may not have answered all your questions these last weeks; sorry about that, I have limited capacity, time, and energy.

That’s okay, I don’t need all my questions answered. Your comment about the plan and how to actually help was really helpful. Of the other things, I’m still most interested in how to get code feedback and how something might get merged. I’ll persist on this issue for the moment.

These are the difficult things: you have to choose fast exit vs. consistency. This is a backup tool, and we have talked a lot about stability; I’d pick that.

If there are long-running tasks, they need to be looked at when they cause problems. The SQL is… well, one of the things that I would like to give some attention to.

I agree, however there’s also the legitimate use case of a system shutdown or restart at some point. This leaves the destination in a potentially untidy state. The Remotevolume table and its State field hold the info, and Duplicati has to try to put things back in order on the next backup. The calls are before RunMainOperation.

A carefully orchestrated shutdown may leave things tidy, but cleaning up at start may also be an option. From a user perspective, I’d note that the stop flavors of “after current file” and “now” confuse the users. Those wanting “right now” don’t notice that “now” has some fine print on it about work done before then.

There was once a discussion of whether both speeds of stop were needed. IIRC there was a conceived use case for both, but if “stop now” gets too tough, there’s probably an option to redefine or remove it…

Getting back to stability…

Relying on the database for cleanup information is bad when the database is bad. Aside from user error (which should still be mitigated, when possible), my main code question is whether all these concurrent processes modifying the database are breaking each other’s transaction plans by local commit decisions.

Or maybe there’s a design plan behind keeping the entire database transactionally consistent at startup after sudden death, or maybe the inconsistencies are known and the mess cleanup logic handles them?

Wondering about the plan leads to hesitation about fixing what might be simple database oversight, e.g.

and properly shutting work down on errors and on request seem similar, so might be looked at together.


I’m not sure what’s best to say that’s helpful and actionable.

The DB is not aligned to operations, is not updated in ways that show progress through the threaded producer/consumer pipeline, uses a single connection, does not allow concurrent reads, does not store data in smaller binary form (hashes), is not optimized for lookups, and uses transactions in ways I’ve not understood and that appear confusing from my 20 years of experience with large DBs. (They don’t usually get interesting until a few hundred GB.)

Architecture-wise, it needs small transactions that include state information, updated when the operation in question has successfully completed. Volumes aren’t “done” or backed up until they are on the remote. This then flows back to files and blocks. If you know the state information, you can easily clean up on startup of another operation, since you know what was successful. You can then have a partial backup where you know what’s complete, which data is just extra garbage, and what cannot be deduplicated because it never really got backed up.
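
As a sketch of what I mean (illustrative SQL with a hypothetical Execute() helper, reusing state names along the lines of the existing RemoteVolumeState values):

    // Each step is its own small transaction, so a startup cleanup can tell
    // exactly how far any volume got.
    Execute(@"UPDATE ""Remotevolume"" SET ""State"" = 'Uploading' WHERE ""ID"" = ?", volumeId);

    UploadVolume(volumeId); // only after this succeeds...

    // ...is the volume (and the files/blocks that reference it) considered done.
    Execute(@"UPDATE ""Remotevolume"" SET ""State"" = 'Uploaded' WHERE ""ID"" = ?", volumeId);

    // Startup cleanup: anything still 'Uploading' never made it to the remote, so
    // its blocks are garbage and must not satisfy deduplication lookups.
    Execute(@"DELETE FROM ""Block"" WHERE ""VolumeID"" IN
              (SELECT ""ID"" FROM ""Remotevolume"" WHERE ""State"" = 'Uploading')");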

At the moment, a block is added to the database, and I’ve not seen a guarantee that the block is actually backed up. My reading of your issue comment appears to support my view.

Either way, I find this all irrelevant and academic until there is a way to make that kind of change that would then be reviewed and merged. (Yes, this is the hobby horse I will not be able to get off.)

This is getting into the area of those design documents I wish we had, however the design does have concepts such as asynchronous-upload-limit which allows files to queue while waiting for their upload.

The term “actually backed up” begs the question of how far, but generally it takes the destination at its word until a file list after backup or on next backup finds otherwise, in which case records are reverted.

Here’s a proposed enhanced stack-trace-like theory on how the rollback of the Block table might get done:

DELETE FROM ""Block"" WHERE ""VolumeID"" IN ({0})
RemoveRemoteVolumes()   LocalDatabase.cs
cleanupRemovedRemoteVolumes.Add(i.Name);
case RemoteVolumeState.Uploading
RemoteListAnalysis()    FilelistProcessor.cs
VerifyRemoteList()      FilelistProcessor.cs
PreBackupVerify()       BackupHandler.cs

The use of the State field is probably necessary because the destination doesn’t do transactions directly.

That’s the Verified transition in RemoteVolumeState, unless you don’t feel a need to actually do a list.

Look through the RemoteListAnalysis() cleanup code; that’s exactly what it’s attempting to do.

It is my experience that most of this design should be code comments. For the backup handler block of code this thread started with, I would love to see 500 lines of comments about how it’s supposed to achieve its goal. Separate docs are great, but it’s hard to ensure they are maintained. And it’s an achievable goal to add comments in areas that are understood; it can be done as part of code changes, rather than as a large, difficult document that is also challenging to verify at the beginning.

It’s true the current code does have the remote volume flag. I can’t believe it works properly, as when I use Ctrl+C on the CLI, I’ve been able to get to an invalid DB that won’t repair. So a documented design for the data flow in BackupHandler would assist with that.

Agree that a separate doc can be hard. I’d like to see a small comment block at the top of each file.
A giant comment block in one place might go as neglected as a separate doc. Lines scattered all over are hard, too.

Because the overall architecture is pretty sprawling, and subject to slow change, a design summary possibly is a reasonable separate document, then code comments spread around could add details.

Pretty pictures are nice, but lead to potential difficulty and friction if somebody ever has to change them.

Regardless of form, my goal is to be able to understand the design so we know its problems (or not), giving us some idea of what to change (or maybe dare not change) to accomplish the goals.

I wasn’t sure where some of your database comments were heading. SQLite is missing big-database capabilities, but we’re not taking advantage of everything it has yet, as we try to get to stable and fast.

Agree completely. I wish we had that. Unless we get some aid, is this something you’d like to work on?
The forum post I wrote is a very high-level view, but I’d like to get down a little further, to the database work.

Regarding Control-C, I was previously doing automated interrupt testing using a random kill, then maybe a repair:

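        # Excerpt from inside the kill-test loop (uses datetime, random, re, subprocess).
        # Note: subprocess.run() raises subprocess.TimeoutExpired when the timeout
        # kills the backup mid-run (the intended interrupt); the handling for that
        # is in surrounding code not shown here.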
        timeout = random.randint(1, 30)
        count_started += 1
        print('Running backup at', datetime.datetime.now())
        result = subprocess.run(args, timeout=timeout, capture_output=True)
        exit_code = result.returncode
        print('Exit code', exit_code)
        if exit_code >= 100:
            print('stderr was:')
            error = result.stderr.decode()
            print(error)
            if re.search('ErrorID: MissingRemoteFiles', error):
                count_missing += 1
                repair()
            elif re.search('ErrorID: ExtraRemoteFiles', error):
                count_extra += 1
                repair()
            elif re.search('Unexpected number of remote volumes marked as deleted', error):
                count_marked_deleted += 1
 #              repair()
 #              pass
                break
            else:
                count_unknown += 1
                break

which, from where the code landed when I gave up, looks like it was probably two kinds that repair could fix (and I think they have Issues filed), maybe one that it couldn’t, and then some unclassified type of failure.

Some sample statistics from that random kill test script (plus related commentary) can be found here.

What kind are you getting? I really dislike the ones that can’t be rescued regardless of extreme efforts.

Not really keen; I want to be changing things at the same time to give the best value for the investment, and that’s not a priority.

I would like to change the way the DB is managed, but it’s too big a change for the moment.

That blend might work. It also helps the code reviewer, so helps the end goal of better code shipping.

Thanks to all for your work.
Here I only want to put in a word for human time (you call it fast exit). Sometimes Duplicati hangs for days, and I’m forced to kill it. Or I have to turn off the computer, which is the same. I cannot wait days/weeks/years with no clear expected time to complete the job. In such cases, speaking about stability is not a valid alternative.

There is no doubt that there are problems stopping Duplicati cleanly. At the moment there is no clear plan for moving forward; however, I intend to commit this PR

in a separate branch in a few days, and build binaries based on it for people wanting to help with testing. Hopefully it will get things into better shape. But it’s not for the next beta, that’s for sure.

My understanding of Thread.Interrupt is that it will reduce some errors on exit, but it will only interrupt when a sleep (or other wait) occurs, so exit may be slower. I commented out Thread.Abort and it did not appear to have any effect on the exit of the CLI. I’ve not been using the web UI to see any impact there.
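
That matches the documented behavior, as I understand it: Interrupt() only surfaces while the target thread is blocked (sleeping, waiting, or joining); a busy thread keeps running until it next blocks. A sketch (DoWork() is a placeholder):

    using System;
    using System.Threading;

    var t = new Thread(() =>
    {
        try
        {
            while (true)
            {
                DoWork();          // an Interrupt() posted here is deferred...
                Thread.Sleep(100); // ...and lands here as ThreadInterruptedException
            }
        }
        catch (ThreadInterruptedException)
        {
            // clean exit point; remove the Sleep and this is never reached
        }
    });
    t.Start();
    t.Interrupt();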