StopNow(), Backup and cancellation tokens

You are quite right, there should be documents describing this. While I did try to document the process, it was more attractive to write the code than to document the “obvious” details, which are now not so obvious.

As stated, there is no real documentation for how this is supposed to work, but I can give my idea for the process. A caveat is that this is straight from memory, so I might have details wrong, but the overall idea should be the same.

Anything that has not been transmitted to the remote destination (i.e. uploaded) is not visible there, so the database changes describing it should never become permanent. This leads to the idea that an upload triggers a transaction commit. I seem to recall that the volume producer (which collects the blocks) issues the commit when a volume is completed. That volume will be in the “temporary” state, as we do not yet know whether it has been uploaded. Once the upload has completed, the volume is marked as “uploaded” and the database commits again. The next backup will mark the volume as “verified”, so we know the file is truly uploaded.

At least that is the “golden path”.
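To make the golden path concrete, here is a minimal sketch of the three state transitions, each followed by its own commit. The table layout and function names are made up for illustration and are not Duplicati's actual schema; only the state names come from the description above.

```python
import sqlite3

# Hypothetical schema: one row per remote volume, with a state column.
def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS volume ("
        "name TEXT PRIMARY KEY, state TEXT NOT NULL)"
    )
    db.commit()
    return db

def volume_completed(db, name):
    # The volume is full; record it, but we do not know yet if it is uploaded.
    db.execute("INSERT INTO volume (name, state) VALUES (?, 'temporary')", (name,))
    db.commit()

def upload_completed(db, name):
    # The upload finished, so the volume may now be relied upon.
    db.execute(
        "UPDATE volume SET state = 'uploaded' WHERE name = ? AND state = 'temporary'",
        (name,),
    )
    db.commit()

def verify_on_next_backup(db, remote_names):
    # On the next run, anything seen in the remote listing becomes 'verified'.
    rows = db.execute("SELECT name FROM volume WHERE state = 'uploaded'").fetchall()
    for (name,) in rows:
        if name in remote_names:
            db.execute("UPDATE volume SET state = 'verified' WHERE name = ?", (name,))
    db.commit()

def state_of(db, name):
    row = db.execute("SELECT state FROM volume WHERE name = ?", (name,)).fetchone()
    return row[0] if row else None
```

If the process dies between `volume_completed` and `upload_completed`, the row is left in the “temporary” state, which is exactly what the next run looks for.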

If the backup is interrupted, there will be volumes marked as “temporary”. These will be purged on the next run. Ideally they should not even be in the database, but that would require another storage mechanism, which I deemed too complex at the time. Today, I think a B+ tree would be a good optimization to avoid repeated database queries (yes, SQLite uses B+ trees as well).
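The purge on the next run could look roughly like this. Again, the `volume` and `block` tables are a hypothetical layout chosen for the example, not the real one.

```python
import sqlite3

def make_demo_db():
    # Hypothetical tables: volumes with a state, and blocks stored in a volume.
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE volume (name TEXT PRIMARY KEY, state TEXT NOT NULL)")
    db.execute("CREATE TABLE block (hash TEXT PRIMARY KEY, volume TEXT NOT NULL)")
    db.executemany(
        "INSERT INTO volume VALUES (?, ?)",
        [("vol-1", "verified"), ("vol-2", "temporary")],
    )
    db.executemany(
        "INSERT INTO block VALUES (?, ?)",
        [("h1", "vol-1"), ("h2", "vol-2"), ("h3", "vol-2")],
    )
    db.commit()
    return db

def purge_temporary(db):
    """Remove volumes left in the 'temporary' state and the blocks they held."""
    with db:  # one transaction: commits on success, rolls back on error
        db.execute(
            "DELETE FROM block WHERE volume IN "
            "(SELECT name FROM volume WHERE state = 'temporary')"
        )
        cur = db.execute("DELETE FROM volume WHERE state = 'temporary'")
    return cur.rowcount  # number of volumes purged
```

Doing both deletes in one transaction means a kill during the purge leaves either the old state or the cleaned state, never something in between.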

This “temporary” state allows data to be recorded in the database in a consistent, transactionally safe manner, while also supporting CTRL+C style kills.

One problem is that Duplicati relies on the filelist to figure out “what goes where” and for an interrupted backup, this file is missing, as it is the last file to be uploaded. If this scenario is detected, a “synthetic” filelist is created, based on the last uploaded state and whatever changes managed to get uploaded (i.e. what blocks can be used). This part is quite complex and has been the source of some bugs.
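A much simplified sketch of the synthetic filelist idea follows. It assumes a filelist is just a mapping from path to a list of block hashes, and that a changed file is only taken into the synthetic list if every one of its blocks made it to the remote; the real logic is more involved, and all names here are invented for the example.

```python
def synthetic_filelist(previous, attempted, uploaded_blocks):
    """Build a filelist for an interrupted backup.

    previous:        {path: [block hashes]} from the last uploaded filelist
    attempted:       {path: [block hashes]} the interrupted run wanted to record
    uploaded_blocks: set of hashes known to be in uploaded volumes
    """
    result = {path: list(blocks) for path, blocks in previous.items()}
    for path, blocks in attempted.items():
        # Only use the new version if all of its blocks can actually be used.
        if all(b in uploaded_blocks for b in blocks):
            result[path] = list(blocks)
    return result

previous = {"a.txt": ["h1"], "b.txt": ["h2"]}
attempted = {"a.txt": ["h3"], "b.txt": ["h4", "h5"], "c.txt": ["h6"]}
uploaded = {"h1", "h2", "h3", "h4", "h6"}
demo = synthetic_filelist(previous, attempted, uploaded)
```

In this example `a.txt` and `c.txt` take their new versions, while `b.txt` falls back to the previous version because block `h5` never got uploaded.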

Yes, I left out all the metadata handling to keep the concept simple. If I were to implement it again, I would use a naming scheme to be able to store multiple metadata streams, and the streams would be treated 100% like the main data stream.

However, when I added metadata, I was naive and thought that metadata would always be less than 1 KB, so it could be stored differently. The current handler shows some of this.

Due to this simplistic thinking, the current handling of metadata only supports up to one block, and all the code that touches it is special-cased. If it were rewritten to just use a special filename for metadata, almost all of the special handling could be deleted.
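As an illustration of what “treated like the main data stream” could mean, here is a sketch where every stream of a file, data or metadata, goes through the same block splitting. The `(path, stream)` key is just one possible naming scheme, picked for the example and not something that exists in the code today.

```python
def stream_key(path, stream="data"):
    # Hypothetical naming scheme: a file can have several named streams.
    return (path, stream)

def split_blocks(payload, block_size):
    # The same splitting is used for every stream, so no one-block limit.
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    return [payload[i:i + block_size] for i in range(0, len(payload), block_size)]

def index_streams(path, streams, block_size):
    """Map each named stream of a file to its list of blocks."""
    return {
        stream_key(path, name): split_blocks(payload, block_size)
        for name, payload in streams.items()
    }

demo = index_streams("a.txt", {"data": b"xxxxx", "metadata": b"mmm"}, block_size=2)
```

With this shape, metadata that spans several blocks needs no extra code path.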

My thinking here was to reduce memory usage and let the database be the “source of truth”: multiple volumes are started, which allows multithreaded compression, but a single database is updated and queried to see whether a block already exists. With the “temporary” state explained above, this takes advantage of the transactions while supporting multithreading. But there might be cleaner ways to handle this.
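The “single database, many volumes” idea can be sketched like this. It assumes one shared SQLite connection guarded by a lock, with each worker filling its own volume; the class and function names are invented, and the real code is structured differently.

```python
import hashlib
import sqlite3
import threading
from concurrent.futures import ThreadPoolExecutor

class BlockIndex:
    """Single shared database that answers 'do we already have this block?'."""

    def __init__(self):
        # check_same_thread=False because several workers share the connection;
        # the lock below serializes all access to it.
        self._db = sqlite3.connect(":memory:", check_same_thread=False)
        self._lock = threading.Lock()
        self._db.execute(
            "CREATE TABLE block (hash TEXT PRIMARY KEY, volume TEXT NOT NULL)"
        )
        self._db.commit()

    def add_if_new(self, data, volume):
        digest = hashlib.sha256(data).hexdigest()
        with self._lock:  # check and insert must happen as one step
            found = self._db.execute(
                "SELECT 1 FROM block WHERE hash = ?", (digest,)
            ).fetchone()
            if found:
                return False
            self._db.execute(
                "INSERT INTO block (hash, volume) VALUES (?, ?)", (digest, volume)
            )
            return True

    def count(self):
        with self._lock:
            return self._db.execute("SELECT COUNT(*) FROM block").fetchone()[0]

def fill_volume(index, volume, blocks):
    # Each worker keeps only the blocks nobody else has registered yet.
    return [b for b in blocks if index.add_if_new(b, volume)]

index = BlockIndex()
blocks = [b"alpha", b"beta", b"gamma"]
with ThreadPoolExecutor(max_workers=4) as pool:
    futures = [pool.submit(fill_volume, index, f"vol-{i}", blocks) for i in range(4)]
    kept = [f.result() for f in futures]
```

Which volume ends up holding a given block depends on thread timing, but each block is stored exactly once, and the workers themselves never need to hold the full block list in memory.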

If it would help for me to explain the underlying thinking, just tag me in a question or PM me.