Improving Reliability with Large Backup Sets

sgeklor · June 16, 2021, 4:38am

I’ve been using Duplicati for 2-3 years now. My working data set on my laptop is large (1.5 TB) and I back it up daily to four destinations: 2 x local folders (on external drives) and 2 x self-managed S3 servers running Minio in different off-site locations.

Over the years Duplicati has always had trouble with the local database becoming corrupted. I always had the impression, but couldn’t confirm it, that it happened when the backup task was interrupted (unexpected shutdown, system crash, or forced shutdown).

However, on the latest version (2.0.6.1_beta_2021-05-03) I have found that it corrupts the database every single time it tries to run a backup while the backup destination is unavailable. For an external drive that means the drive is unmounted, and for the S3 servers it means the computer is offline. The database corruption quickly became a big problem because the repair would fail and the database would have to be recreated. On a number of occasions the recreation would fail and I would have no option but to nuke the backup set and the database and just start over. None of this is really news and the bug reports are full of similar examples.

The reason for my post is to provide some practical advice for how to overcome this issue. What I have discovered recently is that Duplicati already has a feature that can mitigate this issue, and it’s a very elegant and neat solution. What I have done is use the “run-script-before-required” option and point it at a script that checks for the presence of the destination. I have done this on Linux Mint, so the examples below may not work for everyone, but the concept is very sound and I highly recommend it to all Duplicati users that may have experienced similar issues. You can easily write similar scripts for Windows or other clients.

I have been running for over a month like this without any corrupted databases, which previously would happen 2-3 times per week. So I’m very happy now and hope this will bring some joy to other users. No doubt this basic functionality needs to be added to the core routine, and I’ve no doubt the development team are doing the best they can and will get to it when it is possible to do so.

Here are the scripts I’ve been using:

Script to check for the presence of the external drive. Note one small detail: I use a sub-folder on the external drive to hold the backup files, so the path check only returns true when the drive is actually mounted. If you run the path check against the mount point itself, it will not work as expected, since the mount point typically exists before the drive is mounted:

#!/bin/bash

# Does the target path exist?
[ ! -d "/path/to/target/folder" ] && exit 1

# All good
exit 0
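
A variation on the check above, as a minimal sketch: instead of relying only on the sub-folder, it also asks the system whether anything is mounted at the mount point, using the `mountpoint` utility from util-linux (assumed to be installed, as it normally is on Linux Mint). The function name `destination_ready` and the example paths are hypothetical.

```shell
#!/bin/bash
# Sketch: confirm the external drive is really mounted before Duplicati runs.
# Assumes the `mountpoint` utility (util-linux) is available.

# Succeeds only if $1 is an active mount point and $2 is a directory on it.
destination_ready() {
    local mount_point="$1" target_dir="$2"
    mountpoint -q "$mount_point" || return 1   # nothing is mounted there
    [ -d "$target_dir" ] || return 1           # backup sub-folder is missing
    return 0
}
```

To use it as a pre-backup script, finish the file with something like `destination_ready "/path/to/mountpoint" "/path/to/target/folder" || exit 1` followed by `exit 0`.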

Script to check for presence of remote server:

#!/bin/bash

# Can we ping the device? Try twice :)
ping -c 2 targethost.com > /dev/null && exit 0

# No dice!
exit 1
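
A ping only shows that the host answers ICMP, not that the S3 service itself is accepting connections. As a hedged alternative, the sketch below tries to open a TCP connection to the service port using bash's `/dev/tcp` redirection and the coreutils `timeout` command. The function name `port_open` is hypothetical, and port 9000 in the usage note is an assumption (it is a common Minio default, but your server may listen elsewhere).

```shell
#!/bin/bash
# Sketch: test whether a TCP port on the backup server accepts connections.
# Relies on bash's built-in /dev/tcp support and the coreutils `timeout` command.

# Succeeds if a connection to host $1, port $2 opens within 5 seconds.
port_open() {
    local host="$1" port="$2"
    # The child shell opens the socket on file descriptor 3 and then exits,
    # which closes the connection again.
    timeout 5 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

In a pre-backup script this could be used as `port_open targethost.com 9000 && exit 0` followed by `exit 1`, mirroring the ping version.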

sgeklor · June 16, 2021, 5:36am

I also wanted to add some details outlining what others can expect if they end up using such large backup sets:

blocksize: Default (100KB?)
Remote Volume Size: 1 GB for local destinations, 0.5 GB for remote destinations.
Source Files: 785616 (1.71 TB) - mostly “work” such as source code, graphical assets, some video assets, photos, PDFs, executables, etc… a real mix.
Database Size: 5.4 GB for remote and 4.6 GB for local (size difference due to volume size?) - So yes, that’s 20 GB of local storage allocated to the databases, more including any backups that get made.

Some of the other options that I have set are:

auto-cleanup: yes, auto-vacuum: yes, concurrency-compressors: 8 (I have an AMD Ryzen 5 3500U), concurrency-max-threads: 8, number-of-retries: 10, retry-delay: 30, use-block-cache: yes.
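
For anyone who prefers the command line over the GUI, the same settings can be written in `--name=value` form. This is only a sketch: it builds the option list and prints it without invoking Duplicati. Writing "yes" as `true` is an assumption about accepted boolean values, and `retry-delay` is copied as given in the post (Duplicati may expect a unit suffix such as `30s`).

```shell
#!/bin/bash
# Sketch: the advanced options listed above, in --name=value form.
# Nothing here runs Duplicati; the array is only printed.
duplicati_options=(
    --auto-cleanup=true            # "yes" written as true (assumption)
    --auto-vacuum=true
    --concurrency-compressors=8
    --concurrency-max-threads=8
    --number-of-retries=10
    --retry-delay=30               # value as posted; a unit such as 30s may be required
    --use-block-cache=true
)

printf '%s\n' "${duplicati_options[@]}"
```

The array could then be appended to a backup command line as `"${duplicati_options[@]}"`.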

The reason for the large number of retries is that I’m located in Australia and the NBN here is pretty garbage with frequent dropouts. If the backup is running I would rather it kept retrying and eventually got through as the network comes back online than abort and give up entirely.

Anyway, the run time varies a lot based on how many files change in a day, i.e. how much work I do :). What I find is that for my typical workloads the daily run time is 40 minutes for remote and 20 minutes for the local destinations. To perform a full backup (either at the start, or when the database was corrupted and could not be rebuilt) it would take almost 24 hours.

I have performed test restores of key sub-folders from my backup sets with complete success, but I have not attempted to restore the entire thing. I would expect a complete restore to take at least 24 hours, or quite possibly a few days.

Overall I’m still very happy with Duplicati, and much more so now that I have added these “run-script-before-required” scripts.

drwtsn32 · June 17, 2021, 12:59am

You can’t change the blocksize once you’ve started backups, but if you ever start over for some reason, I would choose a larger size. 100KiB is too small for such a large backup set IMO. I would probably set it between 1MiB and 5MiB. This will keep your local job database smaller as there are fewer blocks to track. Database operations will be faster as a result. The only downside is reduced deduplication efficiency.
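
To make the "fewer blocks to track" point concrete, here is a rough back-of-the-envelope sketch. It divides the 1.71 TB source size from the earlier post (assumed here to mean decimal terabytes) by each candidate block size. It ignores deduplication and the fact that every file needs at least one block, so treat the results as order-of-magnitude estimates only. The function name `blocks_for` is hypothetical.

```shell
#!/bin/bash
# Sketch: estimate how many blocks a backup set needs at different block sizes.

# Ceiling division: number of blocks of size $2 needed to hold $1 bytes.
blocks_for() {
    local total_bytes="$1" block_bytes="$2"
    echo $(( (total_bytes + block_bytes - 1) / block_bytes ))
}

SOURCE_BYTES=1710000000000   # 1.71 TB from the post, assumed decimal terabytes

# Compare the default 100 KiB with the suggested 1 MiB and 5 MiB sizes.
for size_kib in 100 1024 5120; do
    printf '%5d KiB blocks: roughly %d blocks\n' \
        "$size_kib" "$(blocks_for "$SOURCE_BYTES" $(( size_kib * 1024 )))"
done
```

The block count falls in direct proportion to the block size, which is why the job database shrinks so much with a larger setting.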

Also thanks for sharing your ideas on pre-scripts. It’s a shame that 2.0.6.1 has been less reliable for you when the back end is unavailable. That definitely needs some investigation!

sgeklor · June 17, 2021, 6:29am

Hi drwtsn32, this is good advice, thank you. I have been aware of the blocksize option since I first started using the software. The reason I no longer change it is that previously (back in 2019) I had some trouble with using a non-default blocksize. The issue was that when I was setting up a failed system from scratch, without an exported settings file, and I wanted to preserve the existing backup, it failed because I did not remember the custom blocksize! From memory it both failed to resume as a target and also failed to restore. There is a chance I am misremembering or falsely attributing these issues to the blocksize option; regardless, that was with an older version of the software.

I note that the latest version has much better restore support, especially for circumstances where the original system is entirely lost.

Anyway, my issues with the blocksize were, of course, user error, but since then I have always left it alone and stayed with the default. So thanks for the recommendation, I’ll be sure to change it next time I have to set up again or start over.

Does anyone know of any other issues (other than being unable to change it once the backup is first started) surrounding use of a custom block size?

sgeklor · June 20, 2021, 12:57am

As a final post, I also want to mention that using scripts to control whether the backup runs results in an error being thrown every time the backup is blocked from running. I can appreciate the logic of this: the script blocks the backup from running by throwing an error, and the backup task reports this as an error.

However, an unwelcome side effect in my case is that the error reporting becomes useless. As I move around with my laptop, on some days there are no targets available, and on others only the network targets are present. On average, depending on my travel and workload, I’ll only backup to the fixed disks about once a week, which means I only get a “perfect” error-free backup run sporadically.

As a result, it is difficult to detect when true errors occur with the backup. Don’t worry, my previous comments about the database errors being resolved by using the scripts still stand.

There is a thread here about giving the option to suppress warnings:

In my case, I would also like to suppress errors, since they are useless and distracting. The more complex implementation would be to allow the script to return at a level equivalent to either warning or error, and then with a suppression on warnings I would get the behaviour I desire. However, on balance, I think it’s not worth it, so please don’t consider this to be a feature request (wrong place for it anyway, I know), more of a general musing on the topic.

ts678 · June 20, 2021, 9:37pm

One quirk is that your scenario of losing a drive (with settings and databases) without config Export will successfully recreate a database, but not show in your options that you used a non-standard blocksize.

Everything seems to work fine though, and if you really want to find the blocksize, trying to change it will cause the next backup attempt to complain, showing both the old size (not visible in the options, as though it were the default) and the new size.

There are lots of people with custom blocksize (because default is small for large backups), and Issues don’t seem to exist, or at least none obvious enough to have a blocksize mention in the title of the issue.

This depends on the exit code. If you’re actually doing exit 1, this should be considered normal. Log:

2021-06-20 17:00:02 -04 - [Information-Duplicati.Library.Main.Controller-AbortOperation]: Aborting operation by request, requested result: Normal

You can watch About → Show log → Live → Information to see how yours runs. Available codes are:

github.com/duplicati/duplicati/blob/5af46b6eecc9fe04c140ec5304dabdd50bdaa3af/Duplicati/Library/Modules/Builtin/run-script-example.sh#L21-L27

    # - 0: OK, run operation
    # - 1: OK, don't run operation
    # - 2: Warning, run operation
    # - 3: Warning, don't run operation
    # - 4: Error, run operation
    # - 5: Error don't run operation
    # - other: Error don't run operation

3 and 5 are kind of noisy (yellow and red GUI popups respectively), but not so noisy as to send email, which I think only reports on the result of a backup run, and here the backup didn’t run. How do you get your reports?

Backing up all the way to the original problem that is being avoided here, would you be able to work on characterizing it, either here or in a separate topic? This sort of thing shouldn’t happen, but it’s hard to resolve a problem that’s very vague or not reproducible easily. For example, I don’t have S3, and local folder backup seems to fail fine (because it lacks its files) if run when the destination folder isn’t there.
EDIT: but my test backup is not as large as yours. Can you identify the simplest backup that does this (basically, steps for anyone to reproduce problem from scratch, with a description of the error seen)?

sgeklor · June 20, 2021, 11:09pm

Looks like I’m going to have to start a new bug report, because the exit codes do not work as expected. Thank you for the exit code list, I didn’t know that Duplicati already supported different error codes. When I wrote my scripts (examples in the original post) I simply used “0” for “OK” and “1” for “error”. After looking at your link, my interpretation is that error code 1 should be “OK, don’t run operation” however what I get from Duplicati is a big red cross and an error reported by the GUI and logs. The logs show “error code 1” is being reported, so the problem does seem to be with how Duplicati processes the error code.

As for recreating the original problem. I would be happy to do that and can post the debugging information, either here or in a new thread or bug report. Cheers.

ts678 · June 21, 2021, 2:13am

Looking more closely at the original post, I notice you used run-script-before-required which is different.

If the script returns a non-zero error code or times out, the operation will be aborted.

Try run-script-before and see if exit code 1 does what you want. Here’s a quote showing option it’s with:

github.com/duplicati/duplicati/blob/5af46b6eecc9fe04c140ec5304dabdd50bdaa3af/Duplicati/Library/Modules/Builtin/run-script-example.sh#L15-L22

    # --run-script-before = <filename>
    # Duplicati will run the script before the backup job and waits for its 
    # completion for 60 seconds (default timeout value). After a timeout a 
    # warning is logged and the backup is started.
    # The following exit codes are supported:
    #
    # - 0: OK, run operation
    # - 1: OK, don't run operation

sgeklor · June 22, 2021, 2:08am

Changing run-script-before-required to run-script-before now results in no error message being thrown by Duplicati, which was the expected behaviour, thanks!
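
For reference, a minimal sketch of how a pre-backup script might use the exit codes quoted earlier in this thread when attached via run-script-before: a missing destination is treated as an expected, quiet skip (exit code 1), while a destination that is present but not writable is reported as a real error (exit code 5). The function name `exit_code_for` and the writability check are illustrative assumptions, not something Duplicati provides.

```shell
#!/bin/bash
# Sketch: map the state of a local destination to run-script-before exit codes.
#   0 = OK, run operation
#   1 = OK, don't run operation
#   5 = Error, don't run operation

# Prints the exit code that matches the state of directory $1.
exit_code_for() {
    local target_dir="$1"
    if [ ! -d "$target_dir" ]; then
        echo 1    # destination absent (e.g. drive not attached): skip quietly
    elif [ ! -w "$target_dir" ]; then
        echo 5    # destination present but not writable: report an error
    else
        echo 0    # destination ready: run the backup
    fi
}
```

The script would then end with `exit "$(exit_code_for "/path/to/target/folder")"`.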

sgeklor · June 29, 2021, 6:50am

An unwanted side effect to using “run-script-before” is that now the backup tasks no longer report the last backup time.

Previously (when using “run-script-before-required”) the interface would show the information from the last successful run:

Last successful backup: Today at 9:51 AM (took 01:51:24)
Next scheduled run: Tomorrow at 1:00 AM
Source: 1.71 TB
Backup: 1.36 TB / 14 Versions

Now, once a backup has been skipped (even once), that information is lost and the interface displays:

Last successful backup: Today at 8:00 AM (took 00:00:00)
Next scheduled run: Tomorrow at 3:00 AM
Source: 0 bytes
Backup: 0 bytes / 0 Versions

This is demonstrably less useful since I can’t see how long it has been since the backup was successful. Thanks for the suggestions, but I think I’ll go back to using “run-script-before-required” and just ignore the errors.

sgeklor · July 19, 2021, 12:02am

One of my backups crapped out again due to being interrupted in the middle of the operation. I can’t post any diagnostic information because when the issue occurs, there are no logs shown for the backup task. Are the logs stored in the database as well?

Anyway, I deleted the backup files and database and started over. This time I increased the block size to 5 MB as suggested, and the database file is now only 500 MB, down from over 5 GB with the default 100 KB block size. This is particularly noteworthy because Duplicati keeps 2-3 backups of the entire database, so the total footprint for this task drops from about 20 GB to about 2 GB. When I make the same change for all four of my backup tasks, the total will be about 8 GB instead of 80 GB, which is rather considerable.

While in theory increasing the block size results in less de-duplication, I can report that for my backup set there is no difference. There is some de-duplication occurring (I can only suspect, since in practice it will be a mix of de-duplication and compression) because the 1.8 TB file set produces a backup size of 1.3 TB, but this has been consistent irrespective of block size, so perhaps there is not as much de-duplication going on as I had thought?

ts678 · July 19, 2021, 12:59am

The normal logs are in the backup database. If an error occurs, logs often end up in server database (About → Show log → Stored). Depending on how the interrupt happened, you might not have either.

If one wants logging despite problems, one can set external log-file=<path> at some log-file-log-level.
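
As a sketch, the two logging options could be written like this in `--name=value` form. The log path is a placeholder, and `Information` is used as the level only because it is the level mentioned for the live log earlier in the thread; pick whatever level suits your debugging.

```shell
#!/bin/bash
# Sketch: external logging options in --name=value form.
# The path is a placeholder; nothing here runs Duplicati.
log_options=(
    --log-file=/path/to/duplicati-backup.log
    --log-file-log-level=Information
)

printf '%s\n' "${log_options[@]}"
```

Because the file is written outside the job database, it should survive even when the database has to be deleted and recreated.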

These might offset a bit, as compression might work better when there’s a larger block to compress.

I don’t think there’s an easy way to measure. I suppose you could compare size without compression.

sgeklor · July 19, 2021, 1:42am

Thank you for the information about the logs. I think I will set that external log option (since there is obviously something wrong with my setup) as I will no doubt encounter this issue again and the logs will be essential to figuring out what has happened.

sgeklor · August 8, 2021, 11:49pm

Reliability is still poor. I’ll start a new thread about the specific error I am experiencing.