StopNow(), Backup and cancellation tokens

kenkendk · May 25, 2023, 8:13pm

You are quite right, there should be documents describing this. While I did try to document the process, it was more attractive to write the code than to document the “obvious” details, which are now not so obvious.

mr-russ:

I’m not sure what’s best to say that’s helpful and actionable.

DB is not aligned to operations, not updated in ways to show progress through the thread producer/consumer pipeline, uses a single connection, does not allow concurrent reads, does not store data in smaller binary form (hashes), is not optimized for lookups and uses transactions in ways I’ve not understood and that appear confusing from my 20 years of experience with large DBs. They don’t usually get interesting until a few hundred GB.

Architecture wise, it needs small transactions that include state information that’s updated when that operation has been successfully completed. Volumes aren’t “done” or backed up until they are on the remote. This then flows back to files, and blocks. If you know the state information you can easily clean up on startup of another operation as you know what was successful. You can then have a partial backup where you know what’s complete in it, what data is just extra garbage, and what cannot be deduplicated as it never really got backed up.

At the moment, a block is added to the database and I’ve not seen a guarantee that the block is actually backed up. My reading of your issue comment appears to support my view.

As stated, there is no real documentation for how this is supposed to work, but I can give my idea for the process. A caveat is that this is straight from memory, so I might have details wrong, but the overall idea should be the same.

Anything that is not transmitted to a remote resource (i.e. an upload) is not visible so these changes should never be permanent. This leads to the idea that an upload triggers a transaction commit. I seem to recall that the volume producer (collecting the blocks) will issue the commit when a volume is completed. That volume will be in the “temporary” state as we do not know if it has been uploaded. Once the upload has completed, it will be marked as “uploaded” and the database will commit again. The next backup will mark the volume as “verified” so we know the file is truly uploaded.
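The commit points described above can be sketched roughly as follows. This is a minimal illustration in Python with `sqlite3`, not Duplicati's actual code: the `volume` and `block` tables, the function names, and the `upload` callback are all assumptions made for the example.

```python
import sqlite3

# Hypothetical schema and state names for illustration only;
# Duplicati's real tables and columns differ.
def open_db(path=":memory:"):
    db = sqlite3.connect(path)
    db.execute("CREATE TABLE IF NOT EXISTS volume ("
               "name TEXT PRIMARY KEY, state TEXT NOT NULL)")
    db.execute("CREATE TABLE IF NOT EXISTS block ("
               "hash TEXT PRIMARY KEY, volume TEXT NOT NULL)")
    db.commit()
    return db

def finish_volume(db, name, block_hashes):
    """First commit: record the completed volume and its blocks as 'temporary'."""
    with db:  # commits on success, rolls back on exception
        db.execute("INSERT INTO volume (name, state) VALUES (?, 'temporary')",
                   (name,))
        db.executemany("INSERT INTO block (hash, volume) VALUES (?, ?)",
                       [(h, name) for h in block_hashes])

def advance(db, name, expected, new):
    """Move a volume from one state to the next in its own transaction."""
    with db:
        cur = db.execute(
            "UPDATE volume SET state = ? WHERE name = ? AND state = ?",
            (new, name, expected))
        if cur.rowcount != 1:
            raise ValueError(f"{name} is not in state {expected!r}")

def backup_volume(db, name, block_hashes, upload):
    finish_volume(db, name, block_hashes)
    upload(name)  # if this raises, the volume stays 'temporary'
    advance(db, name, "temporary", "uploaded")  # second commit

def state_of(db, name):
    row = db.execute("SELECT state FROM volume WHERE name = ?", (name,)).fetchone()
    return row[0]
```

A later run would call something like `advance(db, name, "uploaded", "verified")` after confirming the file exists on the remote. An upload that fails or is killed leaves the row at “temporary”, which is the signal the next run can act on.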

At least that is the “golden path”.

If the backup is interrupted, there will be volumes marked as “temporary”. These will be purged on the next run. Ideally they should not even be in the database, but that would require another storage mechanism, which I deemed too complex at the time. Today, I think a B+Tree would be a good optimization to avoid repeated database queries (yes, SQLite uses B+ trees as well).

This “temporary” state allows recording of data into the database, in a consistent transactionally safe manner, while also supporting CTRL+C style kills.
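As a rough sketch of the purge step on the next run, assuming a hypothetical two-table layout (`volume` with a `state` column, `block` pointing at its volume) rather than Duplicati's real schema:

```python
import sqlite3

def purge_temporary(db):
    """Drop volumes that never reached 'uploaded', plus the blocks they hold."""
    with db:  # one transaction: either both deletes happen or neither
        db.execute("DELETE FROM block WHERE volume IN "
                   "(SELECT name FROM volume WHERE state = 'temporary')")
        cur = db.execute("DELETE FROM volume WHERE state = 'temporary'")
    return cur.rowcount

# Simulate the database left behind by an interrupted run (made-up data).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE volume (name TEXT PRIMARY KEY, state TEXT NOT NULL)")
db.execute("CREATE TABLE block (hash TEXT PRIMARY KEY, volume TEXT NOT NULL)")
db.executemany("INSERT INTO volume VALUES (?, ?)",
               [("vol-1", "uploaded"), ("vol-2", "temporary")])
db.executemany("INSERT INTO block VALUES (?, ?)",
               [("h1", "vol-1"), ("h2", "vol-2")])
db.commit()

purged = purge_temporary(db)
remaining = [row[0] for row in db.execute("SELECT hash FROM block ORDER BY hash")]
```

After the purge, only blocks that live in an uploaded volume remain, so later deduplication lookups cannot match a block that never reached the remote.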

One problem is that Duplicati relies on the filelist to figure out “what goes where” and for an interrupted backup, this file is missing, as it is the last file to be uploaded. If this scenario is detected, a “synthetic” filelist is created, based on the last uploaded state and whatever changes managed to get uploaded (i.e. what blocks can be used). This part is quite complex and has been the source of some bugs.
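The idea behind the synthetic filelist can be illustrated with a small sketch. The rule used here (accept a changed file only if every one of its blocks is in an uploaded volume, otherwise keep the previous entry) is an assumption made for illustration; the real logic is more involved, and all names below are hypothetical.

```python
def synthetic_filelist(last_filelist, attempted, uploaded_blocks):
    """Build a filelist for an interrupted backup.

    last_filelist:   {path: [block hashes]} from the last completed backup
    attempted:       {path: [block hashes]} for files the interrupted run processed
    uploaded_blocks: set of hashes known to be in uploaded volumes
    """
    result = {path: list(hashes) for path, hashes in last_filelist.items()}
    for path, hashes in attempted.items():
        # Only usable if every block actually made it to the remote.
        if all(h in uploaded_blocks for h in hashes):
            result[path] = list(hashes)
    return result

# Made-up example data.
last = {"a.txt": ["h1"], "b.txt": ["h2"]}
attempted = {"a.txt": ["h3"], "b.txt": ["h4", "h5"], "c.txt": ["h6"]}
uploaded = {"h1", "h2", "h3", "h4"}  # h5 and h6 never reached the remote

filelist = synthetic_filelist(last, attempted, uploaded)
```

Here `a.txt` picks up its new version, `b.txt` falls back to the previous one because `h5` is missing, and `c.txt` is left out because none of its data is restorable.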

Yes, I left out all the metadata handling to keep the concept simple. If I were to implement it again, I would use a naming scheme to be able to store multiple metadata streams, and the streams would be treated 100% like the main data stream.

However, when I added metadata, I was naive and thought that metadata would always be less than 1 KB, so it could be stored differently. The current handler shows some of this.

Due to this simplistic thinking, the current metadata handling supports at most one block, and all the code that touches it relies on special-case handling. If it were rewritten to just use a special filename for metadata, almost all of that special handling could be deleted.
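To make the “special filename” idea concrete, here is a hedged sketch in which metadata is just another byte stream that goes through the same block splitting as file content. The `META_SUFFIX` naming scheme, the tiny block size, and the function names are invented for the example and are not part of Duplicati.

```python
import hashlib

BLOCK_SIZE = 8           # unrealistically small, to force multiple blocks
META_SUFFIX = "\0meta"   # hypothetical naming scheme for the metadata stream

def split_blocks(data, block_size=BLOCK_SIZE):
    """Split a byte string into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]

def stream_entries(path, content, metadata):
    """Return {stream name: [block hashes]} with no metadata-specific code path."""
    streams = {path: content, path + META_SUFFIX: metadata}
    return {
        name: [hashlib.sha256(block).hexdigest() for block in split_blocks(data)]
        for name, data in streams.items()
    }

entries = stream_entries("a.txt", b"hello world", b'{"mode": "0644", "owner": "me"}')
```

Because both streams share one code path, metadata larger than a single block needs no extra handling: it simply produces more hashes.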

My thinking here was to reduce the memory usage, and let the database be the “source of truth”, such that multiple volumes are started, allowing multithreaded compression, but the single database is updated and queried to see if a block already exists. With the “temporary” fix explained above, this takes advantage of the transactions while supporting multi-threading. But there might be cleaner ways to handle this.
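A minimal sketch of the “database as source of truth” idea with several volumes being filled in parallel follows. It assumes a single shared SQLite connection guarded by a lock and an `INSERT OR IGNORE` to decide which worker owns a block; the class and function names are hypothetical, and Duplicati's real pipeline is structured differently.

```python
import hashlib
import sqlite3
import threading
import zlib
from concurrent.futures import ThreadPoolExecutor

class BlockIndex:
    """Single database shared by all workers; answers 'is this block new?'."""

    def __init__(self):
        # check_same_thread=False lets worker threads share the connection;
        # the lock serializes access to it.
        self._db = sqlite3.connect(":memory:", check_same_thread=False)
        self._db.execute(
            "CREATE TABLE block (hash TEXT PRIMARY KEY, volume TEXT NOT NULL)")
        self._lock = threading.Lock()

    def claim(self, block_hash, volume):
        """Return True if this call registered the block, False if already known."""
        with self._lock, self._db:
            cur = self._db.execute(
                "INSERT OR IGNORE INTO block (hash, volume) VALUES (?, ?)",
                (block_hash, volume))
            return cur.rowcount == 1

def fill_volume(index, volume, blocks):
    """Compress only the blocks this volume managed to claim."""
    stored = []
    for data in blocks:
        digest = hashlib.sha256(data).hexdigest()
        if index.claim(digest, volume):
            stored.append(zlib.compress(data))  # compression runs outside the lock
    return stored

index = BlockIndex()
blocks = [b"a" * 1024, b"b" * 1024, b"a" * 1024]
with ThreadPoolExecutor(max_workers=2) as pool:
    results = list(pool.map(lambda v: fill_volume(index, v, blocks),
                            ["vol-1", "vol-2"]))
```

Both workers see the same three blocks, yet only two distinct blocks are stored in total, whichever volume claims them first. The in-memory state per worker stays small because the existence check lives in the database rather than in a shared hash table.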

If it is helpful that I explain the underlying thinking, just tag me in a question, or PM me.