This topic tries to tie a number of things together with some detail while still being a concise read.
This is a mostly static view that deals in data items and transforms, without saying how the code does it.
Channel Pipeline talks about the flow in the backup direction. Restore/recreate is less concurrent.
This document was inspired by a wish for test tools to test backups faster or better than Duplicati.
Direct inspection of destination files allows that, provided one can find out where and how to look.
The current script writes an account of what it sees, and checks for duplicated or missing information:

    299184 blocks seen in dindex files
    1576 blocklists seen in dindex files
    1576 blocks used by dlist files blocklists
    126466 blocks used by dlist large files data
    4564 blocks used by dlist small files data
    143309 blocks used by dlist files metadata
    23270 blocks unused
    1575 large blocksets in dlist files
    4564 small blocksets in dlist files
    small file blocksets that are also metadata blocksets: set()
    small file blocksets that are also blocklists: set()

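The unused and missing figures in a report like this come down to set arithmetic on block hashes. The following is a minimal sketch, assuming the hashes have already been collected into Python sets. The function name `tally_blocks`, its arguments, and the sample hash strings are hypothetical and are not taken from the actual script.

```python
def tally_blocks(seen_in_dindex, used_by_dlist):
    """Compare block hashes listed by dindex files with those dlist files need.

    Both arguments are sets of Base64 hash strings (hypothetical inputs).
    """
    unused = seen_in_dindex - used_by_dlist    # stored but never referenced
    missing = used_by_dlist - seen_in_dindex   # referenced but not stored
    return {
        "seen": len(seen_in_dindex),
        "used": len(seen_in_dindex & used_by_dlist),
        "unused": len(unused),
        "missing": sorted(missing),
    }


# Made-up placeholder hashes, only to show the shape of the result.
report = tally_blocks({"aaa=", "bbb=", "ccc="}, {"aaa=", "ddd="})
print(report)
```

A non-empty `missing` list is the kind of finding that could forecast a recreate problem, since a dlist file would then refer to a block that no dindex file accounts for.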
Running a destination checker goes well with automated testing, e.g. to forecast a recreate issue.
Rather than dig into details by uncommenting print statements, a future goal is to make a database.
A script is also an easy prototyping testbed for methods that Duplicati can maybe adopt someday.
Now on with some internals info. This has not been well validated yet, so might change over time.

File in dlist
=============

A dlist file lists its files with a filelist.json file. File objects may have:

type        "File", "Folder" are very typical examples
path        complete, in format fitting the current OS
hash        source file data hash
size        source file data size
time        source file timestamp, 1 second resolution
metahash    source file metadata hash
metasize    source file metadata size
blocklists  one or more blocklist hashes of large file

A blocklist defines its blockset by the concatenation of SHA-256 block hashes.
It is a block itself, so is limited in size and can be referenced by its hash.
The information above should make the Destination column below easier to read.

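As an illustration of the file object fields above, here is a minimal sketch that sorts filelist.json entries into large files (those with `blocklists`), small files (those without), and everything else. The function name `classify_entries` and the sample entries are hypothetical, the hash values are placeholders rather than real hashes, and the sketch assumes filelist.json has already been extracted from the dlist file as text.

```python
import json


def classify_entries(filelist_json):
    """Split filelist.json objects into large files, small files, and others."""
    large, small, other = [], [], []
    for entry in json.loads(filelist_json):
        if entry.get("type") != "File":
            other.append(entry["path"])      # e.g. "Folder" entries
        elif entry.get("blocklists"):
            large.append(entry["path"])      # data reached through blocklists
        else:
            small.append(entry["path"])      # no blocklists listed
    return large, small, other


# Made-up sample with placeholder hashes, not output from a real backup.
sample = json.dumps([
    {"type": "Folder", "path": "/data/", "metahash": "m0=", "metasize": 100},
    {"type": "File", "path": "/data/big.bin", "hash": "h1=", "size": 500000,
     "metahash": "m1=", "metasize": 120, "blocklists": ["b1="]},
    {"type": "File", "path": "/data/note.txt", "hash": "h2=", "size": 42,
     "metahash": "m2=", "metasize": 120},
])

large, small, other = classify_entries(sample)
print("large:", large)
print("small:", small)
print("other:", other)
```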
Rough mapping
=============

This table shows how a few things are represented in database and destination.
Left to right is roughly backup. Right to left is roughly recreate or restore.

What          Database                Destination
----          --------                -----------
source files  Fileset table           destination folder dated dlist files
mapping       FilesetEntry table      dlist "/" filelist.json object array
source file   File View               dlist filelist.json file object
mapping       File BlocksetID         dlist object "hash" hash ref
file data     Blockset table          implied by a "hash" hash ref
mapping       BlocksetEntry table     dindex "list/" blocklist hash ref
block         Block table             dblock "/" file named by hash Base64
(File table metadata is somewhat like data, but table path has an extra hop)
mapping       File MetadataID         dlist object "metahash" hash ref
metadata      Metadataset table       implied by a "metahash" hash ref
mapping       Metadataset BlocksetID  none -- direct ref from file object

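The blocklist row can be made concrete with a short sketch. Since a blocklist is the concatenation of SHA-256 block hashes, it can be cut into 32-byte pieces to recover the block hashes of its blockset, and hashing the blocklist itself gives the hash by which it is referenced. The helper names `split_blocklist` and `block_hash` are hypothetical. Encoding with the standard Base64 alphabet is an assumption here, and the exact variant used for file names inside destination files is not covered.

```python
import base64
import hashlib

HASH_SIZE = 32  # bytes in one SHA-256 digest


def block_hash(data):
    """Return the SHA-256 hash of data as a Base64 string (standard alphabet)."""
    return base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")


def split_blocklist(blocklist_bytes):
    """Cut a blocklist into the Base64 hashes of the blocks it lists, in order."""
    if len(blocklist_bytes) % HASH_SIZE != 0:
        raise ValueError("blocklist length is not a multiple of 32 bytes")
    return [
        base64.b64encode(blocklist_bytes[i:i + HASH_SIZE]).decode("ascii")
        for i in range(0, len(blocklist_bytes), HASH_SIZE)
    ]


# Toy data: build a blocklist from two tiny made-up blocks, then take it apart.
blocks = [b"first block", b"second block"]
blocklist = b"".join(hashlib.sha256(b).digest() for b in blocks)

hashes = split_blocklist(blocklist)
print(hashes == [block_hash(b) for b in blocks])
print("blocklist is referenced by hash", block_hash(blocklist))
```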
Deduplication
=============

Duplicati makes what one might call a deduplicated full backup on each backup.
Such data reduction occurs at many levels, and some of them are internal ones.
Backup looks for added, deleted, and modified files, compared to previous run.
Multiple references to any file, blockset, blocklist, or block item may occur.
References are typically by hash in destination files, and row ID in database,
sometimes needing a mapping table, e.g. FilesetEntry refers to files in the
File view, most of which are typically the "same" file as in the previous
backup version.
These tables generally have unique rows, but this isn't always enforced, for
example by table constraints that would double-check table maintenance logic.
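Block-level deduplication by hash can be sketched in a few lines. This is a toy illustration, not Duplicati's code: the function name `backup_blocks`, the in-memory `store` dictionary, and the 4-byte block size are all hypothetical, and fixed-size blocks are assumed. Each distinct block is stored once under its hash, while every use of it adds only a reference.

```python
import base64
import hashlib


def backup_blocks(data, block_size, store):
    """Split data into fixed-size blocks, storing each distinct block once.

    Returns the ordered list of hash references that describes the data.
    """
    refs = []
    for i in range(0, len(data), block_size):
        block = data[i:i + block_size]
        h = base64.b64encode(hashlib.sha256(block).digest()).decode("ascii")
        store.setdefault(h, block)  # keep the first copy, skip later duplicates
        refs.append(h)
    return refs


store = {}
refs_v1 = backup_blocks(b"AAAABBBBAAAA", 4, store)  # first backup version
refs_v2 = backup_blocks(b"AAAABBBBCCCC", 4, store)  # last block was modified
print(len(refs_v1) + len(refs_v2), "references,", len(store), "stored blocks")
# prints: 6 references, 3 stored blocks
```

The second version adds only one new block to the store even though it is described by three references, which is the sense in which each backup can be a full backup without storing the full data again.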