[Idea] Backup Duplicati database to avoid recreate

Yes, I have a second backup set configured whose job is to back up ONLY the database from the first backup set. I use the --run-script-after option on the first backup job and point to a batch file. The batch file checks to make sure the previous backup was successful, and if so it triggers the second backup job.

I use this approach on 2 of the 10 computers I back up, the ones with the largest data sets (over 500GB).

Cloudberry doesn’t do deduplication. The back end has files in their native directory structure. It could rebuild all available files and versions just by doing a directory listing on the back end storage. Duplicati with its deduplication requires a much more involved process.


Isn’t the password for the backup stored in the database?
It must be stored somewhere for the scheduled runs, and I just guessed it might be the local sqlite database, as I'm not aware of any other place. But I'd be happy to stand corrected…

Yes and no. It's in the small server database Duplicati-server.sqlite (along with other secret-but-needed option values such as remote login credentials). The sometimes-huge per-backup databases are safer: they hold only a salt and hash of the password, useful for verifying that the right password is being used, so they're relatively safe. The per-backup databases are also the ones that Recreate works on, and the ones that are sometimes big and slow to rebuild. Their names look like <semi-random-digits-or-letters>.sqlite, and the path for any backup can be seen on its Database screen.
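For anyone hunting for these files, a typical per-user location on Windows is sketched below (just an illustration: a service install keeps them under the service account's profile instead, and the job's Database screen always shows the authoritative path):

rem Typical per-user location of Duplicati-server.sqlite and the per-backup databases (illustrative)
dir "%LOCALAPPDATA%\Duplicati\*.sqlite"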

The claim I'd dispute is that the password stays unencrypted at the remote: it doesn't, if you encrypt the backup. Use a good password, especially if your backup includes Duplicati-server.sqlite (so that you suffer less pain if you lose a drive, need to recreate jobs, and find that you don't have job exports saved anywhere).

One other note on database backup: backing up a database that's actively in use (e.g. by a running job) risks being unable to read the database (or its associated journal file) at just the wrong instant. --snapshot-policy (most feasible on Windows) can avoid the access issues, but an active database will still be obsolete almost instantly after the backup. The database backup for a job should be done after that job's backup has finished. Keeping remote files in sync with the database that stores information about those remote files is important…
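As a rough sketch of that ordering (not from this thread; the destination folder D:\DBBackup is hypothetical, and the script relies on the environment variables Duplicati passes to --run-script-after scripts):

if /i "%DUPLICATI__OPERATIONNAME%" neq "Backup" goto end
if /i "%DUPLICATI__PARSED_RESULT%" neq "Success" goto end
rem The job's own database is now settled, so a plain copy is consistent
copy /Y "%DUPLICATI__dbpath%" "D:\DBBackup\"
:end
exit 0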


I don’t suppose sqlite has a transaction log feature such that a DB snapshot could be done infrequently followed by transaction log backups.

Temporary Files Used By SQLite is where I assume such a transaction log would be, and I don’t see one. Databases are not really in my expertise area though, and what I learn is more for SQLite and Duplicati… Maybe someone with more expertise in backing up large databases can say where this idea would apply.
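For the snapshot half of that idea, the SQLite command-line shell does offer an online backup command that copies a consistent image of a database even while it's open; a minimal sketch (the paths are hypothetical, and this still produces a full copy rather than an incremental log):

sqlite3 "C:\Users\me\AppData\Local\Duplicati\ABCDEFGHIJ.sqlite" ".backup D:\DBBackup\ABCDEFGHIJ.sqlite"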

I did find a report of an SQL Server install that managed to grow a transaction log until it ate the Windows drive (whereas I think all of Duplicati's files are the well-known ones, plus some temporaries that don't last long…).

While arguably one might say that logs SQLite uses for transactions are transaction logs, they’re not likely candidates for what you’re thinking about, which I suppose I would view as a rather raw differential backup.

sqldiff would be a way to do do-it-yourself differential backups, but I don’t know if it would be better than the backup Duplicati would do after its attempt at deduplication (said by someone, maybe you, to not be really effective, at least with default settings, where a 4KB page change can make a 100KB block get uploaded).
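If anyone wants to experiment with that, a do-it-yourself differential along those lines might look roughly like this (a sketch only; sqldiff ships with the SQLite tools, and the file names here are hypothetical):

rem Emit SQL that transforms yesterday's copy into today's database
sqldiff yesterday.sqlite today.sqlite > changes.sql
rem Later, apply that SQL to bring the old copy up to date
sqlite3 yesterday.sqlite ".read changes.sql"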

Hi drwtsn32

I understand your technical point of view, but that does not make the slightest sense when you need your information and the database rebuild can take days.

Imagine a disaster situation, a business stalled because the database needs five or more days to be rebuilt.

I use Duplicati in parallel with other tools, believing that the project will evolve in the near future.

At this point I do not feel safe using it as the only backup tool.

Maybe you misunderstood… I was just commenting on a major technical difference between Cloudberry and Duplicati. I’m not trying to say the database rebuild process in Duplicati shouldn’t be improved. It definitely should be!

In the meantime there are some things that can be done to help mitigate this risk, but I get it if you and others aren't comfortable with those mitigations.


@drwtsn32, could you post the batch file you used with --run-script-after to back up the database? I’d like to get that set up on my machine and my DOS scripting skills are limited…

Sure, here’s the batch file I run. I am using the excellent duplicati-client command line program to trigger the database backup. You’ll need to adjust the number for your setup. This batch file is configured to only run after a successful backup operation, as you can see from the first two tests:

if /i "%DUPLICATI__OPERATIONNAME%" neq "Backup" goto end
if /i "%DUPLICATI__PARSED_RESULT%" neq "Success" goto end

%LOCALAPPDATA%\Duplicati\duplicati_client.exe login
%LOCALAPPDATA%\Duplicati\duplicati_client.exe run 6
%LOCALAPPDATA%\Duplicati\duplicati_client.exe logout

:end

At first glance, I thought that several answers were somewhat beside the point. Then I realized that my viewpoint was more specific than the original question. I had already created a separate job that backs up only the local Duplicati databases, but the original question seems to suggest the databases would be included in other jobs.

I came here to see if someone had already presented the idea of Duplicati providing a template for a separate backup job for its local databases. It really needs to be a template, not a ready-defined job, because there are many details each user may want to tailor to their own needs, such as which target destination to use and what schedule the job should have.

Once we limit the focus to such a separate backup job, many arguments here become moot. For one, the argument about the local databases being too big is moot, because you only ever need one version of each in the destination. Also, since Duplicati does not run jobs in parallel, the only local database that is active during this separate job's run is its own, which need not be included in the set anyway. When a disaster happens and the local databases need to be restored from this set, you do not have access to any database anyway, so you have to let Duplicati rebuild the database for this separate job first. And then, if Duplicati provides the template, you can set it up easily again.

So how about it, could you consider providing this kind of a template?

I was about to create a new post with EXACTLY the same content, but when I started typing I saw this post and thought it was better to revive it than to create a new one.

I have been working with Duplicati for a few years now and I like it a lot, so much so that it is among the best backup tools I have ever used, mainly due to how simple it is to maintain and its always-incremental backups (you don't need to do a new full backup every week, for example), which helps a lot for cloud backups.

However, I have had some problems recently where the computer running Duplicati was lost, due to hardware failure and also ransomware, and the re-creation of the databases takes a long time; in one case it took more than a day, since there was a lot of data in the cloud.

So I wanted to suggest to the developers, perhaps as an additional parameter (just as we already have several, such as the database auto-vacuum), an option to send a backup of the database to the destination.

Even if this functionality is not used by everyone, in some cases it may be better to send 500MB to more than one database than to spend several hours recreating it.

I saw here some options for how to back up this database before the main backup, but I was not able to fully understand how the restoration process works. As I understood it, I have to import the backup configuration, start the process of recreating the database, stop it, download the old sqlite file, and replace it manually, which left me a little lost.

What's the 500 MB in the example? If it's the entire database, you wouldn't send it to a database, but maybe you'd send it to a remote destination for safekeeping. Assuming that was the idea, the question would then be how to restore the database from its backup, which would have to be a secondary backup done after the main job, because a backup can't back up its own database, as that database is still changing at the time…

If you've lost the whole original Duplicati install, you'd probably bootstrap getting back by either setting up the secondary job again, or just doing a Direct restore from backup files, assuming you have the necessary info.

Probably the usual way to reinstall a database is to use the Database management screen to see where the database belongs, and then either put the restored database there or point Duplicati at the new database file.
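As a concrete illustration (the paths and database name here are hypothetical; the real one is shown as the local database path on the job's Database screen):

rem Put the restored copy where the job's Database screen says its database belongs
copy /Y "D:\Restore\ABCDEFGHIJ.sqlite" "C:\Users\me\AppData\Local\Duplicati\ABCDEFGHIJ.sqlite"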

Duplicati's deduplicated storage is space efficient, but it can slow down the restore of a file that undergoes constant change, because different generations of changes may wind up in different backup files, all of which might need to be downloaded.

A database is a great example of a potentially big file that undergoes constant change scattered all around it, making deduplication less effective. People who have tried backing one up have found they upload just about the full database size on every backup. This, plus the more reliable Recreate in 2.0.5.1, makes DB backup less attractive.

If you really want to try, I'd suggest using a low Backup retention to keep the collecting-the-parts issue somewhat under control, but you'll still endure frequent automatic compacting because of the high churn rate…

Given a suitably large blocksize (100 KiB is too small for a big backup to be fast), ideally only the dlist and dindex files would need to download. If you get dblock files, that's a bad sign. Before 2.0.5.1 that was too common.
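For reference, blocksize is an advanced option on the job (or on the command line), and it has to be chosen before the first backup runs, since it can't be changed for an existing backup; a hypothetical setting for a multi-hundred-GB source:

rem Advanced option; pick it before the first backup, as it cannot be changed later
--blocksize=1MB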

If the progress bar gets past 70%, it’s downloading dblock files. After 90%, it’s downloading all the rest…
About → Show log → Live → Verbose will also show you what you’re downloading, and how far you are.

Doing an occasional test of DB Recreate is a good idea to make sure it’s healthy when you really need it. You can copy off the old database for safety, and your new database will also be a lot smaller than it was.
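If you want to keep the old database around while testing (the path and name are hypothetical; the real path is on the Database screen):

rem Archive the current database before pressing Recreate, in case the test goes badly
copy /Y "C:\Users\me\AppData\Local\Duplicati\ABCDEFGHIJ.sqlite" "D:\DBArchive\ABCDEFGHIJ.before-recreate.sqlite"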

Why not backung up latest Database? talks about an exotic DB backup method that eliminates standard Duplicati processing (which, as noted, doesn’t add much for DB backup, and might even make it worse).

I'm not sure how solid the control file code is, but you could pioneer its use if you want to see how it does. You could keep the usual number of versions of the primary backup, and maybe just two of the database backup, because super-stale databases are nearly useless. As dlist files become obsolete, they're deleted rather than compacted.

Hum, I don’t think so. Let me look

Protect: TSMA>BACKUP DB DEVCLASS=db TYPE=full SCRATCH=yes COMPRESS=yes WAIT=no
ANR2280I Full database backup started as process 344.
ANS8003I Process number 344 started.

I can back up the database on my Spectrum Protect (née TSM) server while the server is running. One has been able to do that since forever 🙂

Of course, TSM is an expensive piece of high-end backup software that has been around for decades. However, the point is that it is possible for backup software that uses a database to back up its own database, as TSM demonstrates. Before anyone says that TSM is using IBM's DB2 for its database: you could back the database up in pre-6.0 versions too, when TSM was using IBM's equivalent of the Microsoft JET database engine.

You overlooked the last point, maybe because I connected it badly. You can back up your own database, but there's no point backing up the Duplicati database of a running job, because it becomes instantly obsolete.

You can look at the log at profiling level, e.g. About → Show log → Live → Profiling, and see changes going into the database. If Duplicati backs up the database for a job as an ordinary file, it loses later changes.

A successful database backup strategy needs to back up the exact, settled database matching a backup. Possibly some other systems use a database differently. If you don't think the Duplicati database changes during a backup, please look at it more closely. I'm not sure how well you can snag a copy mid-run, but note the size changes.

EDIT:

How the backup process works explains that, and pretty much everything there is tracked in the database.
DB Browser for SQLite can be used to watch the live DB (read-only mode is safest) and see the changes.
The local database article is aimed at developers, but adds a small amount. Also see the Local database format.
You do not want to put a stale, incomplete, mismatched database into use. That can cause lots of trouble…
Therefore I stick by the line I put at the top of this post: a backup can't back up its own DB via normal means.

I guess what users want, or at least sensible users, is a method to back up one's database to a location that is not the computer running the backup, to aid in disaster recovery situations.

I was pointing out this is an ancient feature of enterprise backup systems that use a database to track what they backup. TSM is the granddaddy of such backup systems.

Now, this would require changes to the Duplicati code to manage it, for sure, but the idea that backup software that uses a database to track what it has backed up, and to where, cannot back up its own database is in fact disproven by counter-example.

You’re citing a case that apparently has a database that’s little-updated or not critical, so having a stale copy is fine. This could probably be done for Duplicati’s Duplicati-server.sqlite database, which mainly holds configuration data. Your counter-example must not need a current database if a stale one works.

The articles I pointed to show the core of Duplicati. It’s not like a little tweak will change the core design.

The database in Spectrum Protect/TSM is absolutely critical to the operation of the backup. The location of the backup copy of every file from every host is stored in that database, and a stale database means losing backup data. My day job has involved running TSM servers for over 16 years now, with many hundreds of TB backed up. I am looking at Duplicati for my home backup and just commenting on what so far I see as a weakness of Duplicati, based on my extensive TSM experience.

The fundamentals are: if you are using a DB in this manner, as TSM and Duplicati do, then protecting that database for disaster recovery is essential. It would be fair to say Duplicati is lacking in this regard, though it's also fair to note you are not forking out thousands of $$$ for your backup solution, so don't expect all the same features. I am just pointing out that it is possible.

So how does TSM manage its DB backup? The first thing to note is that while the DB is being backed up you can't back up anything else at the same time; in fact, all active backups get paused during the DB backup. TSM, being client/server, can have many clients backing up to the same server at once, all sharing the same database.

The second point to note is that a DB backup is handled separately from a file backup, and the backup data is not intermingled with the file backup data. This is where the notion of changes to the Duplicati code comes in: while the SQLite DB is just a file on the client, a DB backup function would be handled separately from the standard file backup. I have zero experience with .NET development, so unfortunately it would be a very steep learning curve before I could contribute actual code.

In TSM land the DB backup history is nothing more than a simple text file that can be read by the TSM server, so it can find its DB backups for a restore, from where it can then go on to restore the files. Note that, due to the client/server nature of TSM, this is only necessary should you lose the server. If you lose the client, it's much simpler.

For Duplicati, I guess the most obvious way would be a method of synchronizing the local SQLite DB with a remote copy: do the backup, then optionally sync the SQLite DB to a copy at the remote location.
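Pending anything built in, something in that spirit can already be approximated with --run-script-after and an external sync tool; a rough sketch (it assumes rclone, which is not part of Duplicati, with a configured remote named offsite and a hypothetical path):

if /i "%DUPLICATI__PARSED_RESULT%" neq "Success" goto end
rem Push the now-settled job database to the remote as a plain copy (no dedup or encryption here)
rclone copy "%DUPLICATI__dbpath%" offsite:duplicati-db-copies
:end
exit 0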

But with Duplicati it isn’t critical. It’s ok for the local database to be lost as it can be rebuilt from scratch by reading the files on the back end. Now I’ll caveat this by saying Duplicati has had some bugs that made this not work as well as it was supposed to, where it could take a very long time. But in my experience the issues have been resolved in the recent Canary builds. I do a test recreation of my databases every few months and it takes 15 minutes at most.

That sounds similar to what the Windows Volume Shadow Copy Service (VSS) does, per The VSS Model, except TSM does this to itself. VSS has a freeze/thaw approach, and it asks VSS-aware applications to prepare for the backup, flushing I/O and saving state to allow an application-consistent (not just crash-consistent) backup.

Interrupts and checkpoint/restart is a mainframe practice that sounds similar, providing a stable snapshot.

Duplicati 2.0.5.1 should be able to do a safe "partial backup" using the Stop button and "Stop after current file"; however, it's considered a stopped backup. When the backup runs again, it can just run as usual and back up whatever didn't get backed up by the previous backup (or was changed since then).

It sounds like even TSM might lose data that's backed up after its last database backup. I expect it's more reliable than Duplicati, though, so that happens less. Duplicati does have crash recovery mechanisms that attempt to repair damage and also upload a "synthetic filelist", which gives a backup of the last completed backup plus whatever got backed up in a rudely interrupted one. The synthetic filelist will work in the next Beta.

This is the point I’ve been trying to make (maybe with loose wording). Backing up the Duplicati database can’t be done by simply pointing to it, then backing it up with other files. It needs to have a separate step, whether that’s a secondary job that runs after the primary, or something using Duplicati scripting options.

I use this crude script in run-script-before to keep a history of databases while I invite disasters to debug:

rem Get the --dbpath pathname value without the 7-character .sqlite suffix
set DB=%DUPLICATI__dbpath:~0,-7%
rem and use this to maintain history of numbered older DBs
IF EXIST %DB%.4.sqlite MOVE /Y %DB%.4.sqlite %DB%.5.sqlite
IF EXIST %DB%.3.sqlite MOVE /Y %DB%.3.sqlite %DB%.4.sqlite
IF EXIST %DB%.2.sqlite MOVE /Y %DB%.2.sqlite %DB%.3.sqlite
IF EXIST %DB%.1.sqlite MOVE /Y %DB%.1.sqlite %DB%.2.sqlite
IF EXIST %DB%.sqlite COPY /Y %DB%.sqlite %DB%.1.sqlite
EXIT 0

I also run a log at profiling level, which gets big (2 GB on the previous Canary runtime) but shows SQL. Think of it as similar to the flight recorder on an aircraft, to allow some analysis of how things went wrong.

My DB backup is a local COPY because the DB reportedly changes enough that it’s almost fully uploaded when using Duplicati for versioning. The local copy also runs faster than my Internet uplink can manage…

My use case is not the typical case, but anybody who really wants a backup can certainly set up a backup.

And then there's the restore side, which ideally would be made somewhat automated, or at least get good directions. It's easier to use the Recreate button, but (as mentioned recently) it's not always a nice result; however, it's better than it was before, and my personal wish is to make database and Recreate issues rare.

Unfortunately, chasing somewhat rare bugs based on end user reports is hard because typically users will not want to be running all the debug aids I run. I’ve advocated (and begun) fault insertion and stress testing.

Meanwhile, let's say one has a series of DB backups. Which one is intact, and which one matches the latest backup? There is somewhat more self-checking at the start of a backup. If Duplicati fails a self-check, restoring the final database from after the previous backup won't help, because that's the state in which the new backup failed during startup.

This means maybe the database from before the previous backup is the intact one, but it needs to be validated. Throwing it in and trying a backup won't work, because the backup's validation will find new remote files it doesn't know about. Running Repair will synchronize the DB to the backup by removing those new files. It's unclear if this is fixable.

Compacting files at the back end can repackage still-in-use blocks after a delete, creating a large mismatch. A stale database will say it can go to some dblock file to get a block to restore, yet find that dblock missing.

The problem does not exist if the Recreate button is used. The backup is supposed to have everything it requires inside it, which is important in disaster recovery situations where a Direct restore from backup files is done.
The dbpath option can be used to point to the database. I'm not advocating for that right now, but its description is:

Path to the file containing the local cache of the remote file database.
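That is, the option only relocates the job's local cache; a hypothetical command-line example (the storage URL, source path, and database path are placeholders, and encryption options are omitted):

rem Run a backup while keeping its local database at a chosen path
Duplicati.CommandLine.exe backup "b2://mybucket/folder" "C:\Data" --dbpath="D:\DuplicatiDB\mybackup.sqlite"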

So the proposal becomes that cache backups be done. Phrased that way, does it sound like a standard thing?

I’m not saying it has no value, just that it’s not a simple thing. Very limited development resources could be put to better use, especially since anybody who wants database backups can potentially do it on their own.
