Why the DB is needed and why it's not stored with the backup files

Hello,
I want to start backing up to my existing backup again, but I no longer have my local DB files. As I understand it, I need to recreate the DB from the backup files. This job is still in progress and has already taken several hours. (There are 60,000 files of 50 MB each.) I run it from a PC and back up my server data to an external HDD.

Because it takes so long, I have some questions:

  • Why is a local DB needed?
  • Why is it local and not saved inside the backup? (If I run the backup from a different host, I need a different DB, which doesn't make sense from my perspective, because it's about the files in the backup, right?)
  • Is it safe to put the DB next to the backup and use it from different hosts (I can back up my server from different hosts)?
  • Why does it take so much time? There are not that many files inside the backup, but they are large, so I don't understand what makes it so slow.
  • Would it be faster with bigger blocks (and therefore fewer block files)?
  • Also, the DB gets huge: 1.2 GB so far and still growing. So I think it's not just a single hash per file, right?

This needs a comment from a developer on the original reasons for it. I see both pros and cons to it.

Please clarify “it's” and “files in the backup”. The database tracks both source and destination files.

Migrating Duplicati to a new machine explains how you can move the db if you have it.
Having an active backup from two computers to one destination should never be done because each computer will get surprised by extra and missing files on the destination.

Since you use an external HDD, if you mean the database can be there too, that's allowed (however it will load the external HDD more than if the database were kept on another drive).

There’s even a --portable-mode option where databases are in a data folder below the Duplicati installation folder. You can also use --dbpath, but note drive letters can vary.

Removable drives (mostly Windows) shows a way to deal with that on destination side.
I haven’t tried using a drive-letterless or relative path for --dbpath, but maybe it works.

How Backup Works in the documentation points to further details, but the key part right there is:

For each file

    Read a block at a time

    Check if the block already exists

        Ignore if it exists

        Add to volume if new

    Process next block

and the default is 100 KB or 1 MB blocks, depending on which Duplicati version the backup first began on.
Increasing the blocksize is advised for backups as big as this, to reduce block tracking.
The Block size documentation is not so specific, but it adds to the idea that you've got lots of blocks.
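To make that quoted loop concrete, here is a minimal Python sketch. It is not Duplicati's actual code (that's C#); the block size, the hash choice, and the known_blocks set standing in for the database's block table are all assumptions for illustration:

    import hashlib

    BLOCK_SIZE = 1024 * 1024  # assuming the newer 1 MB default; older backups use 100 KB

    def backup_file(path, known_blocks, volume):
        # Simplified per-file loop: read a block, skip it if its hash is already known,
        # otherwise remember the hash and add the block to the volume being built.
        with open(path, "rb") as f:
            while block := f.read(BLOCK_SIZE):
                digest = hashlib.sha256(block).digest()
                if digest in known_blocks:
                    continue
                known_blocks.add(digest)
                volume.append((digest, block))

Every unique block hash has to be recorded somewhere, which is what the database ends up doing at scale.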

Things can also go wrong while finding blocks. See About → Show log → Live → Verbose for where things are. If it's reading dindex files, those are a more compact store of block data; however, if it's reading dblock files (generally in the last 30% of the progress bar), the dindex files lack some blocks that are needed, so the big dblock files get read directly, in hope of success.

You’re mixing together two different things, the blocksize and the dblock-size, a.k.a. Remote volume size on the Options screen. The remote volumes are dblock files, and they contain blocks.

Bigger blocks mean faster processing and a smaller database. A bigger remote volume means fewer but bigger files. It might help a little, but has drawbacks at restore time, because many dblock files may be needed to gather the blocks for a given file being restored.

Remote volume size discusses this, and points to information on how the restore works.

Information is stored per-block, which is why big backups and small blocks make a big DB. There is a lot of other data there too. If you had a lot of small files, the paths would need space.
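As rough back-of-envelope arithmetic only (assuming roughly 60,000 × 50 MB of source data and little deduplication), the block count is what drives the number of rows to track:

    total_bytes = 60_000 * 50 * 1024**2                     # ~3.1 TB of source data

    for block_size in (100 * 1024, 1024**2, 5 * 1024**2):   # 100 KB, 1 MB, 5 MB
        blocks = total_bytes // block_size
        print(f"{block_size // 1024:>5} KB blocks -> ~{blocks:,} block hashes to track")

That is roughly 30 million tracked blocks at 100 KB versus about 3 million at 1 MB, which helps explain why the database grows well past what single per-file hashes would need.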

  • Why is it local and not saved inside the backup? (If I run the backup from a different host, I need a different DB, which doesn't make sense from my perspective, because it's about the files in the backup, right?)

If the database is inside the backup and the backup is corrupt… :man_shrugging: Better to just let it rebuild from the data itself. This will take a while depending on ISP speed (if going to an S3 endpoint, etc.), the size of the data, and so on. Also, if the database is corrupt, you're not getting your data back from it alone… better to let it rebuild a new DB from the actual backed-up files.


You’re never getting your data back from the database alone. It’s on Destination.

The local database

But to increase the performance and reduce the number of remote calls required during regular operations, Duplicati relies on a database with some well-structured data.

The database is essentially a compact view of what data is stored at the remote destination, and as such it can always be created from the remote data.

IMO “always” is a stretch because remote data can have damage, hurting recreate.

By “view of”, this means it doesn’t have actual backup data, but knows what’s there.

This lets it spot changed files and upload changes, without constant remote queries.

is kind of vague. You don't want constant access, as mentioned. There are debates on whether or not saving the database after every backup is good. It can save the time of doing a Recreate should a disaster wipe out the local system, when time to restore is critical, e.g. to a business.

Thank you very much for your answers! This helps a lot to get an understanding.
I will check the migration to a new machine.

So, should I then change both: blocksize → 1 MB and dblock-size → ~500 MB?

This I don't understand. If the target is the same (for example a NAS), the files and the relative file structure should be the same. Sure, both systems store their own files like .DS_Store or Thumbs.db, but those will be stored in my backup anyway (without setting some filter), independent of where I run Duplicati.

If so, it also wouldn't help me, because I still need the backup files :wink:

Yes, that sounds good, as long as it doesn't take hours just to create the DB from the backup files. And yes, if the DB needs that many reads and writes, it will really slow down the backup process if it runs at the same time on the same storage.

First, documentation says:

Importantly, since the block size is fixed, it is not possible to change it after running the initial backup. This is because there is no way to compare blocks of different sizes, so it would essentially be a new backup if the size was changed.

but if you want to start fresh, something like 1 MB or even 5 MB might be reasonable, and will make a smaller database. If the backup began on 2.1, it’s already at 1 MB. Earlier was 100 KB, which is too small for this size backup (which is why the default was raised some).

You can probably do 500 MB Remote volume size, but it matters less, and read the advice:

Remote volume size explains how raising that can slow small restores if you ever do those.

raises an obscure point, which is that if your external HDD is ExFAT or FAT32 rather than NTFS, opening files slows with file count, so fewer larger files might process a little faster.
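On the destination file count side, the same kind of rough arithmetic (assuming the 50 MB default Remote volume size and ignoring compression):

    total_bytes = 60_000 * 50 * 1024**2                 # ~3.1 TB of source data

    for volume_mb in (50, 250, 500):                    # default vs. larger remote volumes
        dblocks = total_bytes // (volume_mb * 1024**2)
        print(f"{volume_mb:>3} MB volumes -> ~{dblocks:,} dblock files (each with a dindex)")

So going from 50 MB to 500 MB volumes drops the destination from roughly 60,000 dblock files to about 6,000, which is where a FAT-family filesystem might notice the difference.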

You can change Remote volume size at any time, but raising it gradually is less likely to cause the whole backup to be reorganized by compact.

The source files are similar, but change over time. A backup run later will see a different view. More importantly, the destination files from two independent backups of the same source appear completely different in terms of file names. Look at your drive for an example of it.

The dlist files have dates to the second, so likely won’t match, and the dblock and dindex files are intentionally named to be globally unique, so those won’t match either. Tracking destination files is done by the database associated with a destination. It must match well.

is what I wrote. You’re talking only about source being similar. Similar isn’t identical, and destination files also need to be tracked to know what’s in them without constant reading.
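Purely as an illustration of that naming point (the name shapes below are approximate and the helper function is made up, not Duplicati code), two runs never produce matching names:

    import uuid
    from datetime import datetime, timezone

    def example_destination_names(prefix="duplicati"):
        # Approximate shape of the remote names: a timestamped dlist per backup run,
        # plus randomly named dblock/dindex files.
        stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
        return [
            f"{prefix}-{stamp}.dlist.zip.aes",
            f"{prefix}-b{uuid.uuid4().hex}.dblock.zip.aes",
            f"{prefix}-i{uuid.uuid4().hex}.dindex.zip.aes",
        ]

    # Two independent backups of the same source still get different names:
    print(example_destination_names())
    print(example_destination_names())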

Some other backup programs use a cache to avoid remote reads. Duplicati uses a database.

but they will be stored in completely different destination files. Test two backup jobs of the same single small file to different folders somewhere, and see if the files appear the same. Looking inside, one will be able to see resemblances, but the external file names will differ.

Duplicati uses the database to check for missing and extra files, and if the second backup writes its files to where first backup has its files, first backup will complain about the extras.

  --prefix (String): Remote filename prefix
    A string used to prefix the filenames of the remote volumes, can be used
    to store multiple backups in the same remote folder. The prefix cannot
    contain a hyphen (-), but can contain all other characters allowed by the
    remote storage.
    * default value: duplicati

will let you use a single folder if you absolutely must, and that lets each job know its files.
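A hedged example of that, assuming the Windows command-line client and made-up folder names (the GUI exposes the same --prefix option among the Advanced options):

    REM Two jobs sharing one destination folder, kept apart by their prefixes
    Duplicati.CommandLine.exe backup "file://E:\Backup\shared" "D:\DataA" --prefix=joba
    Duplicati.CommandLine.exe backup "file://E:\Backup\shared" "D:\DataB" --prefix=jobb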

How Backup Works
How Restore Works

You mean if I make changes to my static files, or why should they change?

And that's my question: why not prevent this?

  1. Paths could be relative to the root backup folder.
  2. OS differences like different slashes (Windows "\", Unix "/") could be mapped to a uniform form.

So in the end, the OS and the mount point of the backup target and sources wouldn't matter.
But I get it (for now) that this is not implemented and I need to update my DB.

I now want to start fresh. ChatGPT told me to take 10 MB for blocksize and 1 GB for dblock-size, so yeah, I will take something between those values, but for sure higher than the default :wink:

There is also the question of which file system I should use. Right now it is ExFAT, but as I understand it, that's not very safe to use for backup because it doesn't have a journal implementation.
So I think I have to decide between NTFS, ext4, and Btrfs. What would you guys favor?

Who said static files?

Your server data never changes? If it changes, a later backup will see different data.

Paths of what? Do you mean source or destination paths?
Only destination is in Duplicati control. Source is what it is.

It’s now sounding like you think this is just a tree copy from source to destination. It’s not.

Please read cited documentation on how it’s done. If you prefer trees, rclone can do them. Other sync programs can also do one-to-one (with different roots, and slash conversions).

Comparison of file synchronization software

Duplicati is a backup program intended to allow multiple versions, compactly and securely.
Some file sync programs do versioning (allowing you to roll back from damage or viruses).
Probably none do block-based deduplication, so more versions will chew up space quickly.

I think drives often come formatted that way because exFAT is supported by Windows, Linux, and macOS.

I don’t know what you have, but I’d guess Linux, since you mention btrfs. Supposedly, typical Linux systems have NTFS out-of-the-box, and I’d expect there are a lot of others.

I’m mostly on Windows NTFS, am debating whether to reformat a new exFAT USB drive, and can’t give you any consensus opinion on your choice. You can ask in search, AI, etc.

The remote storage contains compressed (and encrypted) zip files. To figure out what is inside these files, Duplicati would need to download and decrypt+decompress them, which would be time consuming (and expensive if you pay for bandwidth).

The local database is actually just a cache of what is stored remotely, so it saves Duplicati from needing to read the individual files. Instead of having the cache in flat files, which is common in other applications, Duplicati is using an SQLite database for several reasons:

  1. Atomicity: Like most databases, SQLite is resilient to crashes, so it is very unlikely that the database is damaged, even if you terminate the process at random times

  2. Fast query: Since the database is loaded inside the same process, Duplicati can use the B+ tree in SQLite to look up block hashes quickly and figure out if data is already known.

  3. Structured data: With structured data it is possible to save most of the needed information, which makes some operations possible, like showing the difference between two backups.

You can think of the database as merely a cache file and that is the main reason it is not backed up. It is supposed to be possible to recreate it from the remote data.
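As a rough illustration of points 1 and 2, here is a minimal Python/SQLite sketch; the table layout and names are made up for illustration, not Duplicati's real schema:

    import hashlib
    import sqlite3

    con = sqlite3.connect("local-cache.sqlite")   # hypothetical cache file
    con.execute("CREATE TABLE IF NOT EXISTS block (hash BLOB PRIMARY KEY, size INTEGER)")

    def block_is_known(data: bytes) -> bool:
        h = hashlib.sha256(data).digest()
        # PRIMARY KEY lookup goes through SQLite's B-tree index: no remote reads,
        # no full-table scan, just one local index probe.
        row = con.execute("SELECT 1 FROM block WHERE hash = ?", (h,)).fetchone()
        return row is not None

    def remember_block(data: bytes) -> None:
        h = hashlib.sha256(data).digest()
        con.execute("INSERT OR IGNORE INTO block (hash, size) VALUES (?, ?)", (h, len(data)))
        con.commit()   # transactional writes give the crash resilience noted in point 1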

It is not stored inside the backup because it would create significant storage overhead, where only the latest database version would be usable.

There is also a circular dependency, in that the database tracks the contents, but if the database is also part of the content …

Generally yes. The database contains the filenames, which may be sensitive.
Other than that, you can move the database around as you like; it just needs to be “paired” with the remote storage.

We are actively working on speeding up the recreate process, but it should not take more than a few minutes + transfer time.

To recreate the database, Duplicati needs to download the .dlist files and the .dindex files. The .dindex files are explicitly there only to speed up the database recreate process.

In some cases, something has gone wrong over time, and the .dindex files do not contain all needed information. In this case, Duplicati will start downloading each of the .dblock files and hunt for missing information. While this works, it is a very slow process.
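Loosely sketched in Python (a made-up structure, with sets and dicts standing in for the downloaded files, not the real implementation), the recreate order described above is roughly:

    def recreate_block_locations(dlist_blocks, dindex_map, dblock_contents):
        # dlist_blocks:    set of block hashes the backup versions need (from .dlist files)
        # dindex_map:      block hash -> dblock file name (from .dindex files, may be incomplete)
        # dblock_contents: dblock file name -> set of block hashes it holds (read only if needed)
        locations = {h: vol for h, vol in dindex_map.items() if h in dlist_blocks}
        missing = dlist_blocks - set(locations)

        if missing:
            # Slow path: the dindex information was incomplete, so download and
            # search the big dblock files one by one, hunting for the missing blocks.
            for vol, blocks in dblock_contents.items():
                for h in blocks & missing:
                    locations[h] = vol
                missing -= blocks
                if not missing:
                    break

        return locations, missing   # anything still missing points to absent or damaged data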

Can you see if you are hitting the issue where .dblock files are being downloaded?

I really don't know, but it might be as slow as it was. Maybe also because there were so many files.