[Idea] Backup Duplicati database to avoid recreate

You overlook the last point, maybe because I connected it badly. You can back up your own database, but there's no point in backing up the Duplicati database of a running job, because the copy becomes obsolete almost instantly.

You can watch the log at profiling level (About → Show log → Live → Profiling) and see changes going into the database. If Duplicati backs up a job's database as an ordinary file, the copy misses all later changes.

A successful database backup strategy needs to capture the exact, settled database that matches a completed backup. Possibly some other systems use their database differently. If you don't think the Duplicati database changes during a backup, please look at it more closely. I'm not sure how cleanly you can snag a copy mid-run, but note how its size changes.

EDIT:

How the backup process works explains that, and pretty much everything there is tracked in the database.
DB Browser for SQLite can be used to watch the live DB (read-only mode is safest) and see the changes; a rough command-line sketch follows this list.
The local database article is aimed at developers, but adds a small amount of detail. Also see the Local database format.
You do not want to put a stale, incomplete, mismatched database into use. That can cause lots of trouble…
Therefore I stick by the line I put at the top of this post: a backup can't back up its own DB by normal means.
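
As a rough illustration of the watching idea (not from any of those articles), the sqlite3 shell can poll the live database read-only while a backup runs. The path is a placeholder, and the table names are my assumption based on the Local database format article, so adjust for your own job:

rem Hypothetical watch loop: poll row counts in the live job database read-only.
rem Database path is a placeholder; Block and Remotevolume are assumed table names.
rem Stop with Ctrl+C.
:watch
sqlite3 -readonly "C:\Users\me\AppData\Local\Duplicati\ABCDEFGHIJ.sqlite" "SELECT COUNT(*) FROM Block; SELECT COUNT(*) FROM Remotevolume;"
timeout /t 10 >nul
goto watch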

I guess what users want, or at least what sensible users want, is a method to back up one's database to a location other than the computer running the backup, to aid in disaster recovery situations.

I was pointing out that this is an ancient feature of enterprise backup systems that use a database to track what they back up. TSM is the granddaddy of such backup systems.

Now this would certainly require changes to the Duplicati code, but the idea that backup software which uses a database to track what it has backed up, and to where, cannot back up its own database is disproven by counterexample.

You’re citing a case that apparently has a database that’s little-updated or not critical, so having a stale copy is fine. This could probably be done for Duplicati’s Duplicati-server.sqlite database, which mainly holds configuration data. Your counter-example must not need a current database if a stale one works.

The articles I pointed to show the core of Duplicati. It’s not like a little tweak will change the core design.

The database in Spectrum Protect/TSM is absolutely critical to the operation of the backup. The location of the backup copy of every file from every host is stored in that database, and a stale database means losing backup data. My day job has involved running TSM servers for over 16 years now, with many hundreds of TB backed up. I am looking at Duplicati for my home backup and just commenting on what I see so far as a weakness of Duplicati, based on my extensive TSM experience.

The fundamental point is that if you are using a DB in this manner, as TSM and Duplicati do, then protecting that database for disaster recovery is essential. It would be fair to say Duplicati is lacking in this regard, though it's also fair to note you are not forking out thousands of dollars for your backup solution, so don't expect all the same features. I am just pointing out that it is possible.

So how does TSM manage its DB backup? The first thing to note is that while the DB is being backed up, you can't back up anything else; in fact all active backups get paused during the DB backup. TSM, being client/server, can have many clients backing up to the same server at once, all sharing the same database.

The second point to note is that a DB backup is handled separately from a file backup, and the backup data is not intermingled with the file backup data. This is where the notion of changes to the Duplicati code comes in: while the SQLite DB is just a file on the client, a DB backup function would be handled separately from the standard file backup. I have zero experience with .NET development, so unfortunately it would be a very steep learning curve before I could contribute actual code.

In TSM land the DB backup history is nothing more than a simple text file that the TSM server can read to find its DB backups for a restore, from where it can then go on to restore the files. Note that, due to the client/server nature of TSM, this is only needed when recovering the server; if you lose a client it's much simpler.

For Duplicati I guess the most obvious way would be a method of synchronizing the local SQLite DB with a remote copy: do the backup, then optionally sync the SQLite DB to a copy at the remote location.
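
A minimal run-script-after sketch of that idea (not an existing Duplicati feature; the destination share is a placeholder, and I'm assuming the dbpath variable is exported to after-scripts the same way it is to before-scripts):

rem Hypothetical run-script-after: once the backup finishes, push a copy of the
rem job database to an off-machine location. Replace the share with your own.
IF NOT EXIST %DUPLICATI__dbpath% EXIT 0
COPY /Y %DUPLICATI__dbpath% \\nas\duplicati-db-copies\
EXIT 0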

But with Duplicati it isn't critical. It's OK for the local database to be lost, as it can be rebuilt from scratch by reading the files on the back end. Now I'll caveat this by saying Duplicati has had some bugs that made this not work as well as it was supposed to, where it could take a very long time. But in my experience the issues have been resolved in the recent Canary builds. I do a test recreation of my databases every few months and it takes 15 minutes at most.
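
(For reference, a test recreation like that can also be run from the command line. The storage URL, passphrase, and path below are placeholders; the repair command rebuilds the database when the file at --dbpath is absent.)

rem Sketch of a command-line test recreation; URL, passphrase and path are placeholders.
Duplicati.CommandLine.exe repair "b2://my-bucket/home-pc" --passphrase=mysecret --dbpath="D:\temp\recreate-test.sqlite"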

That sounds similar to what Windows Volume Shadow Copy Service (VSS) does, per The VSS Model, except TSM does this to itself. VSS has a freeze/thaw approach: it asks VSS-aware applications to prepare for the backup, flushing I/O and saving state, to allow an application-consistent (not just crash-consistent) backup.

Interrupts and checkpoint/restart is a mainframe practice that sounds similar, providing a stable snapshot.

Duplicati 2.0.5.1 should be able to do a safe “partial backup” using the Stop button and “Stop after current file”; however, it's considered a stopped backup. When the backup runs again, it can just run as usual and back up whatever needs backing up that didn't get backed up by the previous run (or was changed since then).

It sounds like even TSM might lose data that was backed up after its last database backup. I expect it's more reliable than Duplicati, though, so that happens less. Duplicati does have crash recovery mechanisms that attempt to repair damage and also upload a “synthetic filelist”, which gives a record of the last completed backup plus whatever got backed up in a rudely interrupted backup. The synthetic filelist will work in the next Beta.

This is the point I’ve been trying to make (maybe with loose wording). Backing up the Duplicati database can’t be done by simply pointing to it, then backing it up with other files. It needs to have a separate step, whether that’s a secondary job that runs after the primary, or something using Duplicati scripting options.

I use this crude script in run-script-before to keep a history of databases while I invite disasters to debug:

rem Get the --dbpath pathname value without the .sqlite suffix
set DB=%DUPLICATI__dbpath:~0,-7%
rem and use this to maintain history of numbered older DBs
IF EXIST %DB%.4.sqlite MOVE /Y %DB%.4.sqlite %DB%.5.sqlite
IF EXIST %DB%.3.sqlite MOVE /Y %DB%.3.sqlite %DB%.4.sqlite
IF EXIST %DB%.2.sqlite MOVE /Y %DB%.2.sqlite %DB%.3.sqlite
IF EXIST %DB%.1.sqlite MOVE /Y %DB%.1.sqlite %DB%.2.sqlite
IF EXIST %DB%.sqlite COPY /Y %DB%.sqlite %DB%.1.sqlite
EXIT 0

I also run a log at profiling level, which gets big (2 GB on the previous Canary runtime) but shows SQL. Think of it as similar to the flight recorder on an aircraft, to allow some analysis of how things went wrong.
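
(For anyone wanting the same, that's roughly these advanced options on the job; the log path here is a placeholder:)

--log-file=D:\Duplicati\logs\job-profiling.log
--log-file-log-level=Profiling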

My DB backup is a local COPY, because the DB reportedly changes enough that it would be almost fully re-uploaded each time if Duplicati were used to version it. The local copy also runs faster than my Internet uplink can manage…

My use case is not the typical case, but anybody who really wants a backup can certainly set up a backup.

And then there's the restore side, which ideally would be made somewhat automated, or at least get good directions. It's easier to use the Recreate button, but (as mentioned recently) it's not always a nice result; however, it's better than it was before, and my personal wish is to make database and Recreate issues rare.

Unfortunately, chasing somewhat rare bugs based on end user reports is hard because typically users will not want to be running all the debug aids I run. I’ve advocated (and begun) fault insertion and stress testing.

Meanwhile, let's say one has a series of DB backups. Which one is intact, and which one matches the latest backup? There is somewhat more self-checking at the start of a backup. If Duplicati fails a self-check, restoring the database saved after the previous backup won't help, because that is the state in which the new backup just failed during startup.

This means maybe the database from before the previous backup is the intact one, but it needs to be validated. Throwing it in and trying a backup won't work, because the backup's validation will find newer remote files it doesn't know about. Running Repair will synchronize the DB to the backup by removing those new files. It's unclear whether this is fixable.
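
A crude intactness check over a series of copies could look like the sketch below (reusing the numbering scheme from the script above; the path is a placeholder). Note that passing SQLite's integrity check only says the file is readable, not that it matches the latest backup.

rem Hypothetical check: run SQLite's integrity check on each retained copy.
FOR %%F IN (C:\DuplicatiDB\ABCDEFGHIJ*.sqlite) DO (
  ECHO Checking %%F
  sqlite3 -readonly "%%F" "PRAGMA integrity_check;"
)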

Compacting files at the back end can repackage still-in-use blocks after a delete, creating a large mismatch: a stale database will say it can go to some dblock file to get a block for a restore, yet find that dblock missing.

The problem does not exist if the Recreate button is used. A backup is supposed to have everything it requires inside it, which is important in disaster recovery situations where a Direct restore from backup files is done.

The dbpath option can be used to point to the database. I'm not advocating for that right now, but its description is:

Path to the file containing the local cache of the remote file database.
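
(For reference only, since I'm not advocating it: pointing a command-line run at a specific database looks roughly like this, with placeholder URL, source path, passphrase, and dbpath:)

Duplicati.CommandLine.exe backup "b2://my-bucket/home-pc" "C:\Users\me\Documents" --dbpath="D:\DuplicatiDB\home-pc.sqlite" --passphrase=mysecret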

So the proposal becomes that cache backups be done. Phrased that way, does it sound like a standard thing?

I’m not saying it has no value, just that it’s not a simple thing. Very limited development resources could be put to better use, especially since anybody who wants database backups can potentially do it on their own.
