A user recently asked for an overview of Duplicati from a security perspective. Because this information isn’t currently documented in one place, I’m sharing a version of that overview here to explain the different components that make up Duplicati, and why they were chosen to provide maximal safeguards.
But first, some context:
How traditional backups usually work
A traditional backup is typically created by making an initial full copy of the files. This first backup takes up a lot of space and is problematic to store over the internet, where bandwidth is limited.
The trouble with incremental backups
To save on storage for subsequent backups, most systems rely on incremental backups after that initial full backup. Rather than making a new, full (bulky) backup, the system captures only what has changed since the last backup. This approach is more efficient, but flawed: the incremental backups gradually drift away from the full one, resulting in either ever-growing incremental backups or long chains of incremental backups. With a long chain, restores are slower, and any breakage in the chain renders the later (i.e. more recent) backups useless.
Also, when it comes to housekeeping, you can only delete backups that are self-contained. That means that all incremental backups must be deleted before you can remove the original full backup they relate to. This can lead to ballooning storage requirements, because retention policies can require multiple full backups.
Duplicati does away with the full+incremental backup scheme in favor of a simpler, block-based approach in which every backup version is equal: there is no distinction between full and incremental backups. This is a big improvement for backups made over the internet, because only the initial backup consumes a large amount of bandwidth. Duplicati stores each block just once, so the storage requirements are significantly lower than with full+incremental backups, and because no version depends on another, you can freely select versions for deletion. Here's how it works, and what the block-based storage approach means for the security and integrity of your backups.
Anatomy of a file: Duplicati’s block-based storage format explained
A file's contents can be seen as a sequence of bytes, but keeping track of individual bytes is a lot of overhead. Instead, Duplicati treats the file as a sequence of fixed-size blocks of those bytes. Managing blocks requires several orders of magnitude less overhead than managing bytes.
To give each block a unique id, Duplicati uses a cryptographic hash function, deriving the block id from the block's contents. Since the id is derived from the contents, Duplicati only has to recalculate the hash value to verify that the block has not changed. This also naturally identifies blocks with the same content (they have the same hash value), enabling deduplication with no additional effort.
To ensure the file itself is not altered, a hash value is also computed for the entire file (we'll go into this in Tamper resistance). The file can then be described as a sequence of block hashes plus a file hash. Repeating the above for multiple files creates a list of file descriptions, and this list represents a backup, or snapshot, of the filesystem at that point in time.
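To make this concrete, here is a minimal sketch, in Python, of how a file can be described as hashed, deduplicated blocks plus a full-file hash. The block size, hash algorithm, and data structures are illustrative assumptions for this post, not Duplicati's actual implementation or on-disk format.

```python
import hashlib

BLOCK_SIZE = 100 * 1024  # hypothetical fixed block size


def describe_file(path, block_store):
    """Split a file into fixed-size blocks, store unseen blocks by hash,
    and return the file description: block hash list plus full-file hash."""
    block_hashes = []
    file_hash = hashlib.sha256()
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            file_hash.update(block)
            block_id = hashlib.sha256(block).hexdigest()
            # Deduplication: a block with a known hash is never stored twice.
            if block_id not in block_store:
                block_store[block_id] = block
            block_hashes.append(block_id)
    return {"path": path, "blocks": block_hashes, "hash": file_hash.hexdigest()}


# A snapshot is simply the list of file descriptions at one point in time.
block_store = {}
snapshot = [describe_file(p, block_store) for p in ["a.txt", "b.txt"]]
```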
Collecting data by volume
Duplicati stores the snapshot, containing the list of files, as one file per snapshot. The blocks are expected to make up the bulk of the data, so they are grouped into volumes. Where other systems simply store individual blocks (assuming a capable filesystem or object store), Duplicati collects the blocks into an archive that can be compressed.
There are two purposes for volume collection:
Reduce the storage requirements and number of network requests
Although not a problem in day-to-day use, there are limits to how many files a filesystem can hold in a single folder. Even when using subfolders to work around the limit, transferring and listing a large number of individual files is cumbersome. By grouping blocks into compressed volumes, the storage system needs to keep track of fewer entries and listing is more efficient.
Minimize the inferable information from the backup
If the blocks were stored individually, perhaps compressed and encrypted, it would still be possible to infer some information from that collection of blocks, such as whether the backup contains many smaller files or a few large files. It could also be possible to detect a previously known dataset by means of fingerprinting the contents.
By grouping the contents into compressed and encrypted volumes, it is only possible to infer how often the backup is running and approximately how much data changes between runs.
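As a rough illustration of the grouping step, the following sketch packs accumulated blocks into compressed zip volumes, with each entry named by its block hash. The target volume size and file naming are hypothetical; Duplicati's real volume handling is more involved.

```python
import zipfile

VOLUME_SIZE = 50 * 1024 * 1024  # hypothetical target volume size


def write_volumes(block_store, prefix="volume"):
    """Group blocks into compressed zip archives, one entry per block,
    with the entry name doubling as the block's hash id."""
    volume, volume_index, current_size = None, 0, 0
    for block_id, data in block_store.items():
        # Start a new volume when none is open or the current one is full.
        if volume is None or current_size >= VOLUME_SIZE:
            if volume is not None:
                volume.close()
            volume_index += 1
            volume = zipfile.ZipFile(f"{prefix}-{volume_index}.zip", "w",
                                     compression=zipfile.ZIP_DEFLATED)
            current_size = 0
        volume.writestr(block_id, data)
        current_size += len(data)
    if volume is not None:
        volume.close()
```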
Modular design
To counter not-invented-here syndrome, Duplicati uses a modular design that offloads responsibilities into modules. The primary path for a backup is to use a compression module, an encryption module, and a storage module:
- The compression module is used as a file container where blocks are placed. The module is not required to actually compress anything; it just needs to be able to collect a number of named files into a single file. The zip file format is most often used for this step.
- The encryption module is optional, but recommended for off-site backups. It has the responsibility of taking a passphrase and a compressed file and producing an encrypted file. This is primarily done with AES Crypt but can also use GPG, both of which support a unique session key for each file and a strong HMAC for ensuring that the file has not been modified.
- Finally, the storage module is responsible for transferring the file to the storage destination. The storage module has no knowledge of the file format; it just supports moving a file to or from the storage.
Using this separation with only file-to-file interfaces makes it possible to perform both backup and restore operations with tools that are commonly available outside the Duplicati project.
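As an illustration of these file-to-file interfaces, the sketch below chains a zip container, symmetric encryption via the widely available gpg command line tool (Duplicati's default is AES Crypt; gpg stands in here simply because it is commonly installed), and a trivial "storage" step that only moves the finished file. The function and paths are hypothetical, and passing the passphrase on the command line is for illustration only.

```python
import shutil
import subprocess


def backup_volume(zip_path, passphrase, destination_dir):
    """File-to-file pipeline: compressed file in, encrypted file out, then store."""
    encrypted_path = zip_path + ".gpg"
    # Encryption step: depending on your GnuPG version you may also need
    # --pinentry-mode loopback for --passphrase to be honored in batch mode.
    subprocess.run(
        ["gpg", "--batch", "--yes", "--passphrase", passphrase,
         "--symmetric", "--output", encrypted_path, zip_path],
        check=True,
    )
    # Storage step: knows nothing about the file's internal format,
    # it only moves a finished file to the destination.
    shutil.copy(encrypted_path, destination_dir)
```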
The separation of compressed archive and encryption also ensures that there is no information leakage from the compressed archive into the encryption, which has been a concern with zip encryption. This enables a full trust-no-one backup setup, where both the communication protocol and storage destination can observe no additional information about the contents.
Keeping tabs on the storage state
Google Cloud accidentally deleted UniSuper’s subscription, which also deleted all data associated with the subscription. UniSuper had set up replication across two regions in Google Cloud to protect from a regional failure, but Google Cloud deleted the replica as well! — The Pragmatic Engineer, May 2024
The UniSuper/Google Cloud incident—a series of unfortunate events—showed that while large cloud providers have an impressive setup for minimizing data loss, they are not infallible. It also raised questions about whether customer data should be immediately and permanently deleted under any circumstances.
On the user side, local disks can also cause storage errors or simply wear out. Aside from the storage issues, there can also be protocol or logic issues where files are suddenly hidden or reappear after being deleted.
To counter these issues, Duplicati verifies files as part of the normal backup process. Since the number of remote volumes is significantly lower than the number of blocks, it is feasible to retrieve a full file listing before starting a backup. Duplicati compares this listing to the expected state of the storage. If there are missing files, additional files, or files with a diverging size, the storage integrity is unknown and the backup will return an error instead of starting.
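A minimal sketch of that comparison might look like the following, assuming the recorded state and the remote listing are both simple name-to-size mappings (Duplicati keeps this state in a local database; the structures here are simplified for illustration).

```python
def verify_remote_state(remote_listing, expected_state):
    """Both arguments map remote file name -> size in bytes."""
    missing = expected_state.keys() - remote_listing.keys()
    extra = remote_listing.keys() - expected_state.keys()
    mismatched = {name for name in expected_state.keys() & remote_listing.keys()
                  if expected_state[name] != remote_listing[name]}
    # Any discrepancy means the storage integrity is unknown: refuse to start.
    if missing or extra or mismatched:
        raise RuntimeError(
            f"Storage state unknown: missing={sorted(missing)}, "
            f"extra={sorted(extra)}, size mismatch={sorted(mismatched)}")
```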
Stopping the backup is not ideal, because then no new backups are made, but the alternative is to add more data on top of an unknown storage state, potentially producing a useless backup. To further protect against errors, at the end of each backup a random sample of volumes is selected for verification. These volumes are downloaded and the hash of each remote file is compared against the hash recorded before the file was uploaded. If a file has been modified, an error is logged. Note that this check is done on the (potentially) encrypted volume, and there is also an HMAC inside the encrypted file.
A separate testing mode goes further, decrypting volumes and verifying that a sample of the blocks inside matches the recorded hashes.
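The two levels of verification could be sketched like this: re-hash a downloaded volume and compare it to the hash recorded at upload time, and, in the deeper test mode, open the decrypted archive and check each block against the hash it is named after. Decryption itself and the actual hash encoding are omitted or simplified here.

```python
import hashlib
import zipfile


def verify_volume(downloaded_path, recorded_hash):
    """Check that the remote volume (still encrypted) is byte-for-byte unchanged."""
    with open(downloaded_path, "rb") as f:
        if hashlib.sha256(f.read()).hexdigest() != recorded_hash:
            raise RuntimeError(f"Remote volume {downloaded_path} was modified")


def verify_blocks(decrypted_zip_path):
    """Deeper test: every block inside the volume must match the hash it is named after."""
    with zipfile.ZipFile(decrypted_zip_path) as volume:
        for name in volume.namelist():
            if hashlib.sha256(volume.read(name)).hexdigest() != name:
                raise RuntimeError(f"Block {name} does not match its recorded hash")
```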
Tamper resistance
Because blocks and files are hashed, Duplicati has built-in corruption detection that catches errors at the bit level and can trace a problem to a single block. Each restore also verifies that the restored file has the correct full-file hash. Let's explore what this means for a would-be attacker:
- They would need to somehow gain access to the remote volumes to tamper with a backed-up file.
- If the volumes are encrypted, they would need to collect the passphrase as well.
- With access to the files, the attacker would need to locate the block they want to modify or craft a new block to include.
- Then they would need to modify the list of blocks for the target file(s) and replace the hash of the file itself.
- They're not done yet, though. They would also need to re-package the modified files using the same module pipeline that Duplicati uses, described above.
- Finally, they would need to replace the existing files with the modified versions.
Due to the storage checks outlined above, there are really only two cases where such a modification would go undetected:
- The modified files are the exact same size as the originals. In this unlikely case, the change will only get picked up when the modified files are randomly selected for file hash verification.
If this is the case, an attacker relies on being able to perform the modification before a hash check is performed, which most likely requires some way of destroying the local storage state.
- The recorded storage state isn’t present, such as when restoring from a different machine.
Again, in this case, an attacker relies on some timing and additional intrusion to ensure that the local state is destroyed before the modification can be detected.
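To tie this back to the restore-time check mentioned at the start of this section, a sketch of reassembling and verifying a file could look like the following, reusing the hypothetical file description from the earlier block sketch. Any tampered or corrupted block, or a manipulated block list, fails the recorded full-file hash comparison.

```python
import hashlib


def restore_file(description, block_store, output_path):
    """description holds the block hash list and full-file hash recorded at backup time."""
    file_hash = hashlib.sha256()
    with open(output_path, "wb") as out:
        for block_id in description["blocks"]:
            block = block_store[block_id]
            # Each block is checked against its own hash before it is used...
            if hashlib.sha256(block).hexdigest() != block_id:
                raise RuntimeError(f"Block {block_id} is corrupted")
            out.write(block)
            file_hash.update(block)
    # ...and the reassembled file must match the recorded full-file hash.
    if file_hash.hexdigest() != description["hash"]:
        raise RuntimeError(f"Restored file {output_path} failed hash verification")
```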
Wrapping up
Duplicati is designed as a space-efficient backup system with multiple layers of safeguards against data leakage and data corruption. The unique, block-based format is fully deduplicating and content-protecting, with hash values used at multiple levels. The checks performed against the storage ensure that there are no surprises when a restore is critically needed. And finally, because encryption happens only on the client, there is no risk of information leakage, enabling a full trust-no-one setup.
You can read more about how Duplicati’s backup process works here.
Credits
Rebecca Dodd is a contributing writer on this article.