Database and destination internals

ts678 · May 12, 2023, 2:12am

This topic tries to tie a number of things together with some detail while still being a concise read.
This is sort of a static view dealing in data items and transforms, without saying how code does it.
Channel Pipeline talks about the flow in the backup direction. Restore/recreate is less concurrent.
Database rebuild has some conceptual and table background, e.g. see brief linked summary.

This document was inspired by a wish for test tools to test backups faster or better than Duplicati.
Direct inspection of destination files allows that, provided one can find out where and how to look.
Current script writes an account of what it sees, and checks for duplicated or missing information:

299184 blocks seen in dindex files
1576 blocklists seen in dindex files
1576 blocks used by dlist files blocklists
126466 blocks used by dlist large files data
4564 blocks used by dlist small files data
143309 blocks used by dlist files metadata
23270 blocks unused
1575 large blocksets in dlist files
4564 small blocksets in dlist files
small file blocksets that are also metadata blocksets: set()
small file blocksets that are also blocklists: set()
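
The counts above boil down to set arithmetic on block hashes. Here is a minimal sketch of that idea, not the actual script. The category names, the `check_blocks` helper, and the short strings standing in for Base64 hashes are all made up for illustration.

```python
# Toy data: short strings stand in for Base64 block hashes (all hypothetical).
seen_in_dindex = {"b1", "b2", "b3", "b4", "b5", "b6"}
used_by = {
    "blocklists": {"b1"},
    "large file data": {"b2", "b3"},
    "small file data": {"b4"},
    "metadata": {"b5"},
}

def check_blocks(seen, used_by):
    """Return (unused, missing, shared) for the given sets of hashes."""
    used = set().union(*used_by.values())
    unused = seen - used    # listed in dindex files, referenced by nothing
    missing = used - seen   # referenced, but not listed in any dindex file
    # Hashes that show up under more than one usage category
    names = sorted(used_by)
    shared = {
        (a, b): used_by[a] & used_by[b]
        for i, a in enumerate(names)
        for b in names[i + 1:]
        if used_by[a] & used_by[b]
    }
    return unused, missing, shared

unused, missing, shared = check_blocks(seen_in_dindex, used_by)
print(len(seen_in_dindex), "blocks seen in dindex files")
print(len(unused), "blocks unused")
print("blocks referenced but not seen:", missing)
print("blocks shared between categories:", shared)
```

A block shared between categories is not necessarily an error, since deduplication allows multiple references, but a non-empty "referenced but not seen" set would be worth a closer look.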

Running a destination checker goes well with automated testing, e.g. to forecast a recreate issue.
Rather than dig into details by uncommenting print statements, future goal is to make a database.
A script is also an easy prototyping testbed for methods that Duplicati can maybe adopt someday.

Now on with some internals info. This has not been well validated yet, so might change over time.

File in dlist
=============

A dlist file lists its files with a filelist.json file. File objects may have: 

                type            "File", "Folder" are very typical examples
                path            complete, in format fitting the current OS
                hash            source file data hash
                size            source file data size
                time            source file timestamp, 1 second resolution
                metahash        source file metadata hash
                metasize        source file metadata size
                blocklists      one or more blocklist hashes of large file
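
As a rough illustration of looking at these objects directly, the sketch below reads filelist.json from a dlist archive with the Python standard library. It assumes the dlist is an unencrypted (or already decrypted) zip file with filelist.json at its root. The helper names `read_filelist` and `summarize` and the sample entries are made up for this example.

```python
import io
import json
import zipfile

def read_filelist(dlist_zip):
    """Return the list of file objects from filelist.json in a dlist zip.

    dlist_zip may be a path or a file-like object. Assumes an unencrypted
    (or already decrypted) zip with filelist.json at the archive root.
    """
    with zipfile.ZipFile(dlist_zip) as zf:
        raw = zf.read("filelist.json")
    # utf-8-sig tolerates an optional byte order mark
    return json.loads(raw.decode("utf-8-sig"))

def summarize(entries):
    """Count entries per "type", and how many carry a "blocklists" field."""
    counts = {}
    large = 0
    for entry in entries:
        kind = entry.get("type")
        counts[kind] = counts.get(kind, 0) + 1
        if entry.get("blocklists"):
            large += 1
    return counts, large

# Build a tiny stand-in dlist in memory so the sketch runs anywhere.
# Hash values are placeholders, not real Base64 hashes.
sample = [
    {"type": "Folder", "path": "/home/user/",
     "metahash": "m0", "metasize": 100},
    {"type": "File", "path": "/home/user/small.txt", "hash": "h1", "size": 5,
     "time": "20230512T021200Z", "metahash": "m1", "metasize": 120},
    {"type": "File", "path": "/home/user/large.bin", "hash": "h2",
     "size": 500000, "time": "20230512T021200Z", "metahash": "m2",
     "metasize": 120, "blocklists": ["bl1"]},
]
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("filelist.json", json.dumps(sample))
buf.seek(0)

entries = read_filelist(buf)
counts, large = summarize(entries)
print(counts, "entries by type;", large, "with blocklists")
```

Loading the whole array at once keeps the sketch short. A backup with a very large file list might call for a streaming JSON parser instead.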

A blocklist defines its blockset by the concatenation of SHA-256 block hashes.
It is a block itself, so is limited in size and can be referenced by its hash.
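
To make the concatenation concrete, here is a small sketch that builds a blocklist from two blocks and splits it back into per-block hashes. It assumes the blocklist stores raw 32-byte SHA-256 digests back to back, and that references are the Base64 form of a digest. The helper names `block_ref` and `split_blocklist` are invented for this example.

```python
import base64
import hashlib

HASH_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256

def block_ref(data):
    """Base64 text of the SHA-256 digest of a block of bytes."""
    return base64.b64encode(hashlib.sha256(data).digest()).decode("ascii")

def split_blocklist(blocklist):
    """Split a blocklist into Base64 block hashes, in blockset order.

    Assumes raw digests are simply concatenated, with no separators.
    """
    if len(blocklist) % HASH_SIZE:
        raise ValueError("blocklist length is not a multiple of the hash size")
    return [
        base64.b64encode(blocklist[i:i + HASH_SIZE]).decode("ascii")
        for i in range(0, len(blocklist), HASH_SIZE)
    ]

# Two stand-in data blocks of a "large" file
blocks = [b"first block", b"second block"]
blocklist = b"".join(hashlib.sha256(b).digest() for b in blocks)

hashes = split_blocklist(blocklist)
print(hashes == [block_ref(b) for b in blocks])
# The blocklist is a block too, so it gets its own hash reference
print(block_ref(blocklist))
```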

The information above should make the Destination column below easier to read.

Rough mapping
=============

This table shows how a few things are represented in database and destination.
Left to right is roughly backup. Right to left is roughly recreate or restore.

What            Database                Destination
----            --------                -----------
source files    Fileset table           destination folder dated dlist files
mapping         FilesetEntry table      dlist "/" filelist.json object array
source file     File View               dlist filelist.json file object
mapping         File BlocksetID         dlist object "hash" hash ref
file data       Blockset table          implied by a "hash" hash ref
mapping         BlocksetEntry table     dindex "list/" blocklist hash ref
block           Block table             dblock "/" file named by hash Base64

(File table metadata is somewhat like data, but table path has an extra hop)
mapping         File MetadataID         dlist object "metahash" hash ref
metadata        Metadataset table       implied by a "metahash" hash ref
mapping         Metadataset BlocksetID  none -- direct ref from file object
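
The database side of this mapping can be walked with a couple of joins. The sketch below uses a toy in-memory schema, not the real Duplicati schema. The table names and the BlocksetID and MetadataID columns come from the table above, but the other columns (ID, Path, Hash, Size, "Index", BlockID) are assumptions, and File is a plain table here rather than a view.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Toy schema: only the columns needed for the walk (several are assumed)
conn.executescript("""
CREATE TABLE Block (ID INTEGER PRIMARY KEY, Hash TEXT, Size INTEGER);
CREATE TABLE Blockset (ID INTEGER PRIMARY KEY, Length INTEGER);
CREATE TABLE BlocksetEntry (BlocksetID INTEGER, "Index" INTEGER,
                            BlockID INTEGER);
CREATE TABLE Metadataset (ID INTEGER PRIMARY KEY, BlocksetID INTEGER);
CREATE TABLE File (ID INTEGER PRIMARY KEY, Path TEXT,
                   BlocksetID INTEGER, MetadataID INTEGER);

INSERT INTO Block VALUES (1, 'd1', 4), (2, 'd2', 4), (3, 'm1', 10);
INSERT INTO Blockset VALUES (1, 8), (2, 10);
INSERT INTO BlocksetEntry VALUES (1, 0, 1), (1, 1, 2), (2, 0, 3);
INSERT INTO Metadataset VALUES (1, 2);
INSERT INTO File VALUES (1, '/a.txt', 1, 1);
""")

# File data: File.BlocksetID -> BlocksetEntry -> Block
DATA_SQL = """
SELECT b.Hash FROM File f
JOIN BlocksetEntry be ON be.BlocksetID = f.BlocksetID
JOIN Block b ON b.ID = be.BlockID
WHERE f.Path = ? ORDER BY be."Index"
"""

# Metadata takes the extra hop: File.MetadataID -> Metadataset.BlocksetID
META_SQL = """
SELECT b.Hash FROM File f
JOIN Metadataset m ON m.ID = f.MetadataID
JOIN BlocksetEntry be ON be.BlocksetID = m.BlocksetID
JOIN Block b ON b.ID = be.BlockID
WHERE f.Path = ? ORDER BY be."Index"
"""

data_hashes = [row[0] for row in conn.execute(DATA_SQL, ("/a.txt",))]
meta_hashes = [row[0] for row in conn.execute(META_SQL, ("/a.txt",))]
print("data blocks:", data_hashes)
print("metadata blocks:", meta_hashes)
```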

Deduplication
=============

Duplicati makes what one might call a deduplicated full backup on each backup,
however such data reduction occurs at many levels, and some are internal ones.

Backup looks for added, deleted, and modified files, compared to previous run.
Multiple references to any file, blockset, blocklist, or block item may occur.

References are typically by hash in destination files, and row ID in database,
sometimes needing a mapping table, e.g. FilesetEntry refers to files in Files,
most of which are typically the "same" file as in the previous backup version.

These tables generally have unique rows, however this isn't always guaranteed,
for example by table constraints as a double-check on table maintenance logic.
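
Where a constraint is not there to enforce uniqueness, a checker can look for duplicates itself. This is a generic sketch using Python sqlite3. The `find_duplicates` helper and the toy Block table with Hash and Size columns are assumptions for illustration, not the real schema.

```python
import sqlite3

def find_duplicates(conn, table, columns):
    """Return rows of (column values..., count) that occur more than once.

    Table and column names are interpolated into the SQL text, so only
    pass trusted names here.
    """
    cols = ", ".join(f'"{c}"' for c in columns)
    sql = (
        f'SELECT {cols}, COUNT(*) FROM "{table}" '
        f"GROUP BY {cols} HAVING COUNT(*) > 1"
    )
    return conn.execute(sql).fetchall()

# Toy table with one deliberately duplicated row
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Block (ID INTEGER PRIMARY KEY, Hash TEXT, Size INTEGER)")
conn.executemany(
    "INSERT INTO Block (Hash, Size) VALUES (?, ?)",
    [("h1", 10), ("h2", 20), ("h1", 10)],
)

duplicates = find_duplicates(conn, "Block", ["Hash", "Size"])
print(duplicates)
```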

mr-russ · May 12, 2023, 12:31pm

Can you expand on this comment a little? I have many desires for tests. Can you outline some of the tests and/or test cases you were thinking of or have been told about?

Thanks

ts678 · May 12, 2023, 1:55pm

TL;DR There’s performance, but also reliability. For that, stress it some while logging and monitoring, thereby providing a far better debug environment than when a user hits things. Ideally, fix issues fast.

Maybe more than a little, and it could be its own topic. I can split this up later if that seems beneficial.

This is quite a wide open area that could very much use test staff. Volunteers are very much needed. Sometimes equipment is also needed. Other times people might just run whatever they have around. Sometimes they’re using Duplicati anyway as their backup, and testing that better might be desirable.

Performance testing is its own whole area which could help find slow spots and measure fix attempts, following the goal of doing guided optimizations rather than unguided (a worse return on work + risks).

My main wish is to get a better handle on the reliability issues that keep us from being able to release something to the not-yet-used stable channel. Duplicati has made it past its worst time, when backup would break without any obvious provocation. One of those breaks turned out to be from a specific pattern in data.

Next level of bugs after backup itself is compact that runs occasionally after the backup to do cleanup. Because it runs less often, it takes longer to stumble across bugs. Stumbling is a bad test plan when a user does it, because there is rarely any debug level info there to accurately diagnose/fix the problem.

Better information collection is possible by running test cycles until something looks wrong. Check that, and preferably find a recovery or avoidance, because constant test stops on known bugs get annoying. This is also where a rapid actual fix (instead of bugs lingering for years) can be used to continue.

If things seem to hold together when unprovoked (and they appear to), then provoke with usual things such as network failures (simulated, or set no retries and let real world do its thing), or system reboots probably simulated by using process kills (because actual reboots are slower and too work-disruptive).

I don’t know the unit tests, so I’m talking system command line tests, because a GUI is harder to drive. duplicati-client can bridge that a bit, and is a Python script so is pretty easy to tweak if the need occurs.

StopNow(), Backup and cancellation tokens has a random kill test, and shows what fixable (or not) bugs it finds.
There’s an unshown script before that which can make random changes to a file to keep backup going.
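
For anyone wanting something similar, here is a minimal sketch of a random file changer. It is not the unshown script, just one hypothetical way to do it: overwrite a few random spans in place so the size stays the same while some blocks change. The `mutate_file` name and its parameters are invented here.

```python
import os
import random
import tempfile

def mutate_file(path, changes=3, max_len=16, rng=random):
    """Overwrite a few random spans of a file with random bytes, in place.

    Returns a list of (offset, length) spans that were written. The file
    size never changes. An empty file is left alone.
    """
    size = os.path.getsize(path)
    if size == 0:
        return []
    spans = []
    with open(path, "r+b") as f:
        for _ in range(changes):
            offset = rng.randrange(size)
            # Clip the span so the write never extends the file
            length = min(rng.randint(1, max_len), size - offset)
            f.seek(offset)
            f.write(bytes(rng.getrandbits(8) for _ in range(length)))
            spans.append((offset, length))
    return spans

# Demo on a throwaway file; a seeded generator makes the run repeatable
with tempfile.TemporaryDirectory() as folder:
    path = os.path.join(folder, "source.bin")
    with open(path, "wb") as f:
        f.write(bytes(4096))
    before_size = os.path.getsize(path)
    edits = mutate_file(path, rng=random.Random(0))
    after_size = os.path.getsize(path)

print(edits, before_size, after_size)
```

Running something like this between backups gives each backup some changed data to pick up.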

In my backup on this PC, I’m currently running a profiling log with --profile-all-database-queries, which makes the very frequent block operations visible, and I’m keeping a 30-deep history of the database. That’s history to look at if it breaks. To prove recreate works, I do it. I used to run test all --full-result, which accumulates errors over time (so far they seem to be harmless Extra entries), but I stopped.

I’m running the Python checker script I mentioned at the top, and looking directly into destination files…

There are other ways to do this. As a starter effort using Python sqlite3, I did a simple database read. Fancier checks could be done. An idea of what Duplicati already checks would be a helpful guide too…
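
A simple database read of that kind might look like the sketch below, which lists every table with its row count. It is not my actual script. The `table_row_counts` helper and the toy tables are made up, and a real run would open a copy of the job database rather than an in-memory one.

```python
import sqlite3

def table_row_counts(conn):
    """Return {table name: row count} for every table in the database."""
    names = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        )
    ]
    # Names come from the database's own catalog, so quoting them is enough
    return {
        name: conn.execute(f'SELECT COUNT(*) FROM "{name}"').fetchone()[0]
        for name in names
    }

# Toy database standing in for a real one. For a real file, a read-only
# open avoids accidental changes, e.g.
#   sqlite3.connect("file:copy-of-job.sqlite?mode=ro", uri=True)
# where the file name is a placeholder.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Block (ID INTEGER PRIMARY KEY, Hash TEXT);
CREATE TABLE Fileset (ID INTEGER PRIMARY KEY);
INSERT INTO Block (Hash) VALUES ('h1'), ('h2');
INSERT INTO Fileset DEFAULT VALUES;
""")

row_counts = table_row_counts(conn)
for name, count in row_counts.items():
    print(count, "rows in", name)
```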

EDIT:

You might notice that my type of test is not what one might call a functional test, e.g. can it one time do what the functional spec (there is none) says it should do? Those are easier to just describe in issues, and there are plenty filed awaiting a fix. Functional tests are probably good at catching regressions though.

The unexplainable (e.g. based on what a user has seen) reliability breaks are the harder bugs to isolate, which is why my goal is to find and fix those while avoiding regressions by not fixing what’s not broken.

Thanks for asking about tests.