SUGGESTION: Improve upload resilience by breaking backups into sub-jobs

I’ve been struggling to get a 500GB backup to Dropbox to work. The initial job has failed a couple of times (internet connection dropouts).

I’ve seen people suggest doing an “initial” backup of a subset of the files, then slowly adding files to the backup set to avoid the downside of a mid-upload failure.

My question is this: could there be a setting within Duplicati for a maximum “job” size, so that Duplicati manages this problem internally?

For example:

500GB backup set
50GB “sub-job” size

Duplicati uploads the first 50GB of the backup, then “completes” it, creating an initial backup.
Duplicati then uploads the next 50GB, then “completes” that, and so on.

That way my downside risk from a failure somewhere along the way is only 50GB, rather than 499GB.
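For context, the manual workaround I mentioned above looks roughly like this when scripted. This is only a sketch of the workaround, not the proposed feature; the target URL and paths are placeholders, and the exact duplicati-cli command name and syntax may differ per platform:

    # Grow the source set across runs so the first upload is small.
    duplicati-cli backup <target-url> /data/documents
    # Once that completes, widen the source set; already-uploaded blocks are deduplicated,
    # so only the newly added folders get uploaded.
    duplicati-cli backup <target-url> /data/documents /data/photos
    duplicati-cli backup <target-url> /data/documents /data/photos /data/media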

Given what I understand about how Duplicati is designed, I can’t imagine this is a complicated thing to implement. It would save a lot of frustration when things fail, and it would avoid people having to manually manage their backup data to work around the problem (backups should be automatic!).

Thoughts?

Gabe


There’s work and discussion on manual stops now. Possibly this could be connected to that, although it does add some issues, e.g. one can no longer speak of a backup success if only part of the backup has succeeded. Presumably there would also be some GUI additions, so it might wind up requiring more extensive code changes.

Stop after Upload is a current discussion (but there are probably quite a few other topics about the issue).

Fix ‘stop after current file’ #3836 is the coding work, though it seems somewhat TBD how things should behave.

There’s also talk in those threads of the “synthetic filelist”, which is the previous backup plus the changes made so far. It’s currently meant for resumption of an interrupted backup, but could possibly be leveraged into a periodic snapshot of progress.

Except that “complete” would be a checkpoint, and might not get the full treatment a finished backup would get.


Thanks for your reply, and that makes sense.

Also, I think this kind of feature would be very useful on a personal computer (where it’s likely the machine will be put to sleep mid-backup on a regular basis).

As for “success”, what’s probably more relevant is whether any data remains to be backed up (even if a job isn’t running, it’s useful to know that there’s 10GB of outstanding data yet to be backed up).

Gabe

IMHO this approach is not correct. A backup must represent the state of the selected files at a specific point in time: that is why some operating systems introduced the concept of snapshots.

Now, suppose you have a 500GB backup job split into 50GB sub-jobs, with an interval of one day per sub-job, and you start the backup at day_0… what is the date of the backup? day_0 or day_10? Another question: what happens if at day_5 I modify a file that was backed up in the first sub-job? Is it backed up again, or does the backup software keep track of each version? These questions matter most in a restore scenario: a backup program must guarantee that it restores the data exactly as it was at the time specified.

So, in an ideal world, a backup job is an atomic and instantaneous operation. Of course, if the data are static (like photos or movies) these problems don’t arise… but for such data a versioned backup is unnecessary: you can simply copy them to an offline HDD or to the cloud.

Of course one can create a backup job manually and decide to back up the most important data first, but there are some caveats. Suppose I create a backup manually and then keep adding files by hand: if I created the backup at day_0, in the future I might assume I can restore a file in its “day_0” state, when it was actually only added to the backup job on day 6 (so that assumption would be wrong).

Of course these concerns are very important in an office or industrial scenario and less so in a home scenario (though that is debatable). Please note that I wrote “backup software” and not “Duplicati”, because the concept of data consistency applies to every backup program.

In my opinion, if you have connection problems you can try playing with the backend parameters related to the size of dblock files and the upload options, or alternatively split the data into several backup jobs and run them one at a time. In that case, if you lose your local database, recreating many small local databases should also take less time than recreating one big one.
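A rough sketch of the “several smaller jobs” idea, assuming the command-line client; the target URLs and paths are placeholders and the exact syntax may differ on your system:

    # One smaller job per data set, each with its own target folder and local database,
    # run one at a time so they don't compete for upload bandwidth.
    duplicati-cli backup <target-url>/documents /data/documents
    duplicati-cli backup <target-url>/photos /data/photos
    duplicati-cli backup <target-url>/media /data/media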

TL;DR: The best fix is a better connection, but workarounds can help. The timing concern varies by situation.

Duplicati (and maybe any Internet backup software) is going to have trouble with a bad connection, so if you can possibly improve that it might be very good. Programs can sometimes work around slightly bad connections using techniques such as reducing file sizes (as was suggested) or increasing their retries.

Choosing sizes in Duplicati
--dblock-size (a.k.a. Remote volume size in the GUI)
--number-of-retries
--retry-delay
--throttle-upload (but use carefully because it’s buggy in current beta releases; should be OK in the next beta)
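As a rough illustration of how those options might combine on one run (the values are arbitrary starting points rather than recommendations, the target URL and path are placeholders, and --throttle-upload is omitted because of the bug noted above):

    # Smaller remote volumes mean less data to re-send per failed upload;
    # more retries with a longer delay help ride out brief connection drops.
    duplicati-cli backup <target-url> /data \
      --dblock-size=10MB \
      --number-of-retries=10 \
      --retry-delay=30s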

If you can show poor results on an Internet speed test, your ISP might also be able to help, although they might want some troubleshooting to make sure it’s their fault and not in your equipment, e.g. weak Wi-Fi.

A point-in-time backup can come very close to the actual time on the backup label if --snapshot-policy makes a snapshot (it has OS and other requirements), but otherwise files just get picked up as the backup runs, and the longer the backup runs, the less certain it becomes when a given file was actually read for backup.
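For example, something like the following asks for a snapshot. This is a sketch only: the policy value shown is illustrative (check the --snapshot-policy documentation for the values your version supports), snapshot support depends on OS privileges, and the target URL and path are placeholders:

    # Ask for a filesystem snapshot (e.g. VSS on Windows) so the files that get read
    # all reflect a single point in time rather than the moment each was reached.
    duplicati-cli backup <target-url> /data --snapshot-policy=required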

Agreed that care should be taken not to make the timing less accurate than it is currently, if this feature is implemented.
There is possibly already a timing inaccuracy in the synthetic filelist idea for an interrupted backup. This tries to show what was in the last backup, updated with whatever was backed up before the interruption.

There’s probably a risk of the synthetic filelist making it look like a backup that finished updating all its files; however, there are those asking to be able to restore anything that was backed up, even if other files haven’t been yet. It’s a tradeoff.

Stop after Upload raises similar issues to what-if-network-fails or what-if-we-intentionally-do-it-in-chunks.

I guess it depends on how particular one is. Sometimes one just wants the latest available copy. If not, a search among different recent backups might be needed to find a favorite. Check contents until pleased; however, if you happen to know the time you want instead, test the restore and see what file time you get.
Certainly there are cases (especially in business) where what you want is set by time, but it’s not always.

This is typically why I keep any jobs that go elsewhere/off-site set up as smaller backups to begin with.
We have the primary on-site backup, then Duplicati as a secondary on-site backup, and something else for a full raw file copy off-site, which also includes the Duplicati backups.

Thanks for all these responses.

I’ve just started using Duplicati (on a NAS) after having used Backblaze / Time Machine. Both of these solutions are very resilient in the case of failure (network connection / machine restarts), which is especially noticeable for larger backups.

Maybe Duplicati (local) and Rclone (to move the backup to the cloud) is a more robust solution for the problem I’m describing, but I’d think this is something Duplicati can handle internally (and I’d prefer to manage just one app on my home NAS rather than two).
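For anyone curious, that two-step setup could look roughly like this. A minimal sketch only: it assumes a local folder as the Duplicati target and an rclone remote (here called “dropbox-remote”) already configured; the paths, remote name, and exact target syntax are placeholders and may vary:

    # Step 1: back up to a local folder on the NAS (fast, unaffected by internet drops).
    duplicati-cli backup file:///volume1/duplicati-backups /data
    # Step 2: mirror the finished backup files to the cloud; if this is interrupted,
    # a re-run only copies whatever is still missing.
    rclone sync /volume1/duplicati-backups dropbox-remote:duplicati-backups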

I recently switched to that type of setup, but not because I had issues with Duplicati storing data directly into B2 - I did it so I could have faster restore/verify/compact operations. My PCs at home now back up directly to my Synology NAS, and then I use Synology’s CloudSync program to sync with B2. It works great.