This topic ties a number of things together in some detail while remaining a concise read.
It is a static view, dealing in data items and transforms, without saying how the code does it.
The Channel Pipeline topic covers the flow in the backup direction. Restore and recreate are less concurrent.
This document was inspired by a wish for test tools that can test backups faster or better than Duplicati.
Direct inspection of destination files allows that, provided one can find out where and how to look.
The current script writes an account of what it sees, and checks for duplicated or missing information:
299184 blocks seen in dindex files
  1576 blocklists seen in dindex files
  1576 blocks used by dlist files blocklists
126466 blocks used by dlist large files data
  4564 blocks used by dlist small files data
143309 blocks used by dlist files metadata
 23270 blocks unused
  1575 large blocksets in dlist files
  4564 small blocksets in dlist files
small file blocksets that are also metadata blocksets: set()
small file blocksets that are also blocklists: set()
Running a destination checker goes well with automated testing, e.g. to forecast a recreate issue.
Rather than digging into details by uncommenting print statements, a future goal is to build a database.
A script is also an easy prototyping testbed for methods that Duplicati could perhaps adopt someday.
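To make that concrete, below is a minimal sketch of such a checker. It assumes an unencrypted
local destination, and that a dindex file is a zip archive whose "vol/<dblock-name>" entries hold
JSON listing that dblock's blocks, while "list/<hash>" entries hold blocklist data; the folder
path is hypothetical, and the layout should be verified against real files before relying on it.

```python
# A minimal destination checker sketch; file layout here is an assumption.
import glob
import json
import zipfile

def check_destination(folder):
    blocks = set()      # block hashes seen in dindex files
    blocklists = set()  # blocklist hashes seen in dindex files
    for name in glob.glob(folder + "/*.dindex.zip"):
        with zipfile.ZipFile(name) as z:
            for entry in z.namelist():
                if entry.startswith("vol/"):
                    vol = json.loads(z.read(entry))
                    for block in vol["blocks"]:
                        if block["hash"] in blocks:
                            print("duplicated block:", block["hash"])
                        blocks.add(block["hash"])
                elif entry.startswith("list/"):
                    # name after "list/" is the blocklist hash (may be URL-safe Base64)
                    blocklists.add(entry[len("list/"):])
    print(len(blocks), "blocks seen in dindex files")
    print(len(blocklists), "blocklists seen in dindex files")

check_destination("/backups/example")  # hypothetical destination folder
```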
Now on with some internals information. This has not been well validated yet, so it might change over time.
File in dlist
=============
A dlist file lists its files in a filelist.json file. File objects may have the fields below; a small reading sketch follows the list.
type "File", "Folder" are very typical examples
path complete, in format fitting the current OS
hash source file data hash
size source file data size
time source file timestamp, 1 second resolution
metahash source file metadata hash
metasize source file metadata size
blocklists one or more blocklist hashes of large file
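As a reading sketch under the same assumptions as above (unencrypted zip, made-up file name),
the fields can be pulled out of filelist.json like this:

```python
# A sketch that reads file objects out of a dlist file; not all objects
# carry every field, so .get() is used for the optional ones.
import json
import zipfile

with zipfile.ZipFile("duplicati-20240101T000000Z.dlist.zip") as z:
    filelist = json.loads(z.read("filelist.json"))

for f in filelist:
    if f.get("type") == "File":
        print(f["path"], f.get("size"), f.get("hash"))
        if "blocklists" in f:  # only large files have blocklist hashes
            print("  blocklists:", f["blocklists"])
```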
A blocklist defines its blockset by the concatenation of SHA-256 block hashes.
It is a block itself, so it is limited in size and can be referenced by its hash.
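Here is a small worked example of that idea: the raw SHA-256 digests of the data blocks are
concatenated, and the blocklist's own hash is the SHA-256 of that concatenation. With 32-byte
digests and, say, a 100 KiB blocksize, one blocklist can reference 102400 / 32 = 3200 data
blocks. Block contents here are stand-ins:

```python
# A worked sketch of blocklist hashing; hashes are shown in Base64,
# as hashes appear in dlist files.
import base64
import hashlib

data_blocks = [b"first block data", b"second block data"]  # stand-ins
blocklist_data = b"".join(hashlib.sha256(b).digest() for b in data_blocks)
blocklist_hash = base64.b64encode(hashlib.sha256(blocklist_data).digest()).decode()
print(blocklist_hash)  # would appear in a dlist file's "blocklists" array
```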
The information above should make the Destination column below easier to read.
Rough mapping
=============
This table shows how a few things are represented in the database and at the destination.
Left to right is roughly backup; right to left is roughly recreate or restore.
A query sketch walking the database side follows the table.
What          Database                Destination
----          --------                -----------
source files  Fileset table           destination folder dated dlist files
mapping       FilesetEntry table      dlist "/" filelist.json object array
source file   File view               dlist filelist.json file object
mapping       File BlocksetID         dlist object "hash" hash ref
file data     Blockset table          implied by a "hash" hash ref
mapping       BlocksetEntry table     dindex "list/" blocklist hash ref
block         Block table             dblock "/" file named by hash Base64
(File table metadata is somewhat like data, but the table path has an extra hop)
mapping       File MetadataID         dlist object "metahash" hash ref
metadata      Metadataset table       implied by a "metahash" hash ref
mapping       Metadataset BlocksetID  none -- direct ref from file object
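As promised above, here is a query sketch that walks the database side of this table left to
right, from the newest fileset down to each file's blocks. Table and column names here are
assumptions drawn from a typical Duplicati database and should be checked against your own copy:

```python
# A sketch walking fileset -> files -> blocksets -> blocks; names assumed.
import sqlite3

con = sqlite3.connect("Duplicati.sqlite")  # hypothetical database path
query = """
    SELECT File.Path, Block.Hash, Block.Size
    FROM Fileset
    JOIN FilesetEntry  ON FilesetEntry.FilesetID = Fileset.ID
    JOIN File          ON File.ID = FilesetEntry.FileID
    JOIN BlocksetEntry ON BlocksetEntry.BlocksetID = File.BlocksetID
    JOIN Block         ON Block.ID = BlocksetEntry.BlockID
    WHERE Fileset.ID = (SELECT MAX(ID) FROM Fileset)
    ORDER BY File.Path, BlocksetEntry."Index"
"""
for path, block_hash, size in con.execute(query):
    print(path, block_hash, size)
```

Reversing these joins is roughly what a recreate must reconstruct from dlist and dindex files.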
Deduplication
=============
Duplicati makes what one might call a deduplicated full backup on each run;
such data reduction, however, occurs at many levels, and some are internal ones.
Backup looks for added, deleted, and modified files, compared to the previous run.
Multiple references to any file, blockset, blocklist, or block item may occur.
References are typically by hash in destination files, and by row ID in the database,
sometimes needing a mapping table, e.g. FilesetEntry refers to files in the File table,
most of which are typically the "same" file as in the previous backup version.
These tables generally have unique rows, however uniqueness is not always guaranteed,
for example by table constraints serving as a double-check on table maintenance logic.
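To make the block-level deduplication concrete, here is a sketch using fixed-size blocks and
SHA-256 hashes; sizes and data are illustrative, not Duplicati's actual implementation:

```python
# A sketch of block-level deduplication: a block whose hash was already
# stored is not stored again, so unchanged data costs nothing on a rerun.
import hashlib

BLOCKSIZE = 100 * 1024   # illustrative; a common Duplicati default
stored = {}              # hash -> block data, standing in for dblock files

def backup(data):
    new_blocks = 0
    for i in range(0, len(data), BLOCKSIZE):
        block = data[i:i + BLOCKSIZE]
        digest = hashlib.sha256(block).hexdigest()
        if digest not in stored:   # only unseen blocks get stored
            stored[digest] = block
            new_blocks += 1
    return new_blocks

print(backup(b"x" * 300000))  # first run: 2 unique blocks stored
print(backup(b"x" * 300000))  # repeat run: 0, everything deduplicated
```

The real logic additionally records which dblock volume holds each block, which is the
Block table's role in the mapping above.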