Alternative FTP failure?

kenkendk · October 19, 2019, 8:55pm

The idea with spill is to allow concurrent compression, with libraries that does not support concurrent compression (i.e. zip files).

A large portion of the CPU time is used to compress data (it is inherently compute intensive). A simple fix would be to run multiple compressors in individual threads (as the files are independently compressed in a zip archive), but the zip library does not support it. We could add it, but it is not so easy to do, as the zip files need each compressed stream written sequentially.

To solve this, Duplicati simply compresses to multiple independent zip files. This has the upside that it works with any compression tool. Whenever a new block is added to the output, whichever zip file is ready first, will grab the block and add it to the output. Once the zip file is too large, it is sent to the uploader, and a new zip file is started.

However, once all blocks are uploaded, we have a situation where multiple zip files contains some blocks, but none are filled completely (or they would be uploaded). These partially filled zip volumes are sent to the spill collector process, which simply merges the partial files, producing as few zip files as possible (fill up until the size limit is reached).

tl;dr: The spill process kicks in at the end of the backup, and merges all partially filled compressed volumes.

I assume you mean database transactions. I do not have a document for it, but the strategy is that everything runs in a transaction (particularly useful for --dry-run). Before a volume starts uploading, the transaction is committed, such that the next run will not discover a “new” file on the backend. (This could happen if the upload starts, and Duplicati is hard-killed. The database will then be in the previous state, but the backend will have a partial file which does not exist in the database).

I wrote CoCoL as an academic project, and personally find that it solves concurrency better than any other multi-processing approach I have seen. I realize that it is somewhat foreign to other developers though, so here is a Duplicati-centric description of CoCoL:

The idea is to have a process which is more-or-less a method that runs independently (just a C# async Task). To remove all race-conditions, a process does not share any variables or memory with any other process, and only communicates with other processes through channels.

While a process is nothing special in C#, a channel is more complex in its implementation.

Communication over a channel will only take place when there is both a reader and a writer waiting (aka rendevous-style, or barrier-like communication). A process can request to read or write multiple channels, and exactly one communication will happen, guaranteed to preserve the order of requests.

In other words, each channel has a queue for readers, and a queue for writers. When both queues contain one entry, communication happens.

This kind of communication is intuitive (just a read/write call) but removes non-determinism (including race conditions).

The reason for using AutomationExtensions.RunTask(...) is that this call automatically calls Join and Leave on channels. The effect of this is that the channel closes (throwing exceptions) as soon as the last reader or writer calls Leave (i.e. the channel has zero readers or zero writers, meaning that no further communication can happen).

In Duplicati, this is used to enabled concurrency for hashing and compressing.
Roughly the following happens:

A process traverses the file system and writes paths to a channel
A process reads paths and checks if the metadata has changed; all changed paths are written to another channel
Multiple processes read paths and computes block hashes; new block hashes (and data) are written to a channel
Multiple processes read blocks and compresses blocks; filled volumes are written as upload requests
A process caps the number of active uploads
A process performs the spill pickup (as described above)
A process performs the remote operations (upload, download, list)

Due to the synchronous nature of the channels, there are never races, and no process can flood the system. As soon as the upload limiter is maxed out, it will not read its channel, automatically blocking the writers, rippling all the way back to the process traversing the filesystem. Similarly, once the filesystem traversal process is done, it closes the channel, causing a cascading close of channels and controlled shutdown of the processes without data loss.

This makes is trivial to maximize and balance the system resources as we simply fire up enough processes and they will automatically block the previous processes as needed, no matter if the disk, CPU, memory or network link is the bottleneck.

If anyone is interested in a more academic description of CoCoL, these are the two papers I wrote about it: