Combination of offsite disk and online storage?

Hi,

I have roughly 5 TB of slow-changing data I wish to back up. I really like the idea of deduplication and I’m a big fan of FOSS; that’s how I got to Duplicati. Just before this, I played around with Duplicity, which is also very nice but doesn’t do deduplication.
Anyway, here is my storage plan, and I’m wondering if I can get it up and running with Duplicati:

After installing Duplicati on my main machine, I want to write a full backup to a temporarily connected 6 TB hard disk. I want it to be a full backup, so restorable and all that, not just a subset of the backup files. Next, I want the subsequent/incremental backups to go to B2/SSH/OneDrive/… somewhere online.
Why?

  1. Because now I can keep the files (like old tape archives) in a vault at some air-gapped location, safe from intrusion.
  2. I can load the disk without massive wait times (it took me months of continuous uploading to store it all on Backblaze backup, and I don’t feel like doing that again).
  3. A single hard drive is a lot cheaper than a two-year online 5 TB storage plan.
  4. I can choose to move all the files from online storage back onto the disk to “merge” the two and have less data online, to cut cost. I would consider this really premium.

I can quite easily make this work with Duplicity, but to avoid loads of duplicate data when I move a directory (or having to maintain those moves by hand in the backup), I would really appreciate Duplicati’s abilities.
I’ve tried to emulate it with a Docker install and some source files to back up. But as soon as I remove some dblock files from the destination (to emulate files that live on a disk and are not present online), Duplicati is smart enough to recognize that those files are missing, and it won’t continue the backup. Is there any way I can force the backup to continue? I’ve seen allow-missing-source, but I need an allow-missing-dblock-file kind of thing.

(I have tried to find a similar question online but could not find one; if there is one, please excuse me, and a pointer would be appreciated.)

Edit:
After tinkering some more, I discovered the no-backend-verification option, which does have the desired effect, but it might also have some undesired side effects for what I want. My bet is that this would not work when uploading files online, because the index files would need to be read on each backup?
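
For concreteness, a minimal sketch of how that option could be passed on a test run; the duplicati-cli wrapper and the paths here are placeholders of my own, so adjust to your setup:

```python
import subprocess

# Hypothetical source and destination, for illustration only.
SOURCE = "/data/to/backup"
DESTINATION = "file:///mnt/backup-disk"  # later: b2://..., ssh://..., onedrive://...

# Run a backup without verifying the remote file list first, so dblock
# files that have been moved offline do not abort the run.
subprocess.run(
    [
        "duplicati-cli", "backup", DESTINATION, SOURCE,
        "--no-backend-verification=true",
    ],
    check=True,
)
```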

Thanks for writing wonderful software, and thanks in advance for any help!

Store your incrementals locally, then upload them once each daily backup is finished. Use rclone or something similar for that step.
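
Something along these lines, for example; the remote name and the local backup folder are placeholders, and an rclone remote is assumed to be configured already:

```python
import subprocess

# Hypothetical locations, for illustration.
LOCAL_BACKUP_DIR = "/backups/duplicati"  # where Duplicati writes its dblock/dindex/dlist files
REMOTE = "b2:my-bucket/duplicati"        # any configured rclone remote works here

# "copy" uploads new and changed files without deleting anything on the
# remote; "sync" would also mirror local deletions.
subprocess.run(["rclone", "copy", LOCAL_BACKUP_DIR, REMOTE, "--progress"], check=True)
```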

I know that how much benefit you get from de-dupe varies widely with your starting data, but I am consistently surprised at how little it really is. [ETA: for me]

Thanks! So: back up once to a separate disk, remove it, then back up changes locally and rclone those to the cloud.

In case of disaster: download the changes from the cloud, write them to the reattached separate disk, and restore as usual?

For this approach to work you’d also need to use no-auto-compact and set your retention to unlimited. Otherwise at some point Duplicati will want to compact remote volumes and/or prune backup versions.
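
As a rough sketch of what such a job could look like (the duplicati-cli wrapper, the paths, and my reading of the retention defaults are all assumptions):

```python
import subprocess

# Keep every version and never compact, so Duplicati never needs to rewrite
# or delete old dblock files that have been moved offline. Paths are made up.
subprocess.run(
    [
        "duplicati-cli", "backup",
        "file:///backups/duplicati",       # local staging destination
        "/data/to/backup",
        "--no-auto-compact=true",          # never repackage old volumes
        "--no-backend-verification=true",  # don't list/verify remote files first
        # Leave --keep-time / --keep-versions / --retention-policy unset so
        # all versions are retained (my understanding of the defaults).
    ],
    check=True,
)
```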

That being said, I don’t think I’d use this approach; it seems riskier to me. What I personally do is back up directly to my NAS and then sync that to cloud storage. This way I have backup data in two spots: the NAS for fast restores, plus a cloud copy to get the backups off-site in case my house burns down.

Since the removable disk is identified as the cheap element, why not back up to one locally and periodically copy the backup to a second one, which you store offsite?

As for cloud storage, are you planning to rely on the cloud for your “building burns down” backup, but trying to save on volume? In that case, I agree with @drwtsn32: the risk is not worth the savings. See this thread about Inexpensive Cloud Storage Options. Put your whole backup there, intact.

I think this whole discussion involves tradeoffs. Getting off the usual path, where Duplicati has a destination folder with the whole backup in it, invites problems. Yes, one can make the backup fly blind, not looking at the file listings and not sampling the files, but that may hide some issues until disaster recovery finds them. Ouch!

Before the disaster, routine restores may also be unavailable because some old file blocks might be offline. Database management also becomes tricky. Many of the tools expect all of the destination files to be there.

This plan involves the least gather-the-files work, but if the upload took months, how long is the download going to run? Restoring essential files first is possible, but it needs good technique; otherwise the database recreation that “Direct restore from backup files” performs will become annoyingly slow on repeat runs, if not on the initial attempt.

By the way, for big backups, raise blocksize so that the total block count is a million or a few million, or things get slower. Deduplicating at the block level is nice until tracking all those blocks gets slow. File moves will deduplicate regardless.
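
A quick back-of-the-envelope for roughly 5 TB of source data shows why (the 100 KiB figure is, I believe, the long-standing default):

```python
# Rough block-count estimate for ~5 TiB at a few candidate --blocksize values.
data_bytes = 5 * 1024**4  # ~5 TiB

for blocksize_kib in (100, 1024, 5 * 1024):  # 100 KiB, 1 MiB, 5 MiB
    blocks = data_bytes // (blocksize_kib * 1024)
    print(f"blocksize {blocksize_kib:>5} KiB -> ~{blocks:,} blocks")
```

At 100 KiB that is tens of millions of blocks to track; around 5 MiB the count lands near a million.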

If download speed is adequate, online pricing might be reduced in this “slow-changing data” case by using Google Cloud’s Archive class, which is super cheap until download fees or minimum-retention charges add up.

Alternatively, you could go the fully physical route. I think the part-offline, part-online combo is the trickiest to run.

If we’ve already pushed to the point of two hard drives, you could rotate one offsite. Not having all copies of the data exposed to the same disaster means you always have one left, even if it’s the offsite one and therefore a bit stale.

You can even do this with one job definition, keeping the database with its data. This guards against a loss due to Duplicati messing up, because you’ll have two self-contained backups. Redundancy is good…

I see it as risky because --no-backend-verification and --backup-test-samples=0 mean going in the opposite direction from the higher-than-default-quality checking that some people do to try to improve safety.
Keeping all versions wastes space and slows things down. Compacting less may be safer, though; there are known bugs.

It may require premium design and maintenance effort, but it might be possible, with some tradeoffs.
Keeping Duplicati from deleting files at the destination may be easy: don’t delete versions, don’t compact.

Incremental files from the backup can be copied (not moved, because we still want to be able to check the file list and the files) somewhere not subject to the same local disaster, and if the only such place is online storage, copy them there.

Finding out which files are incremental can be date-based. This would be tough with delete and compact: you’d need to sync a delete of old files to wherever they’re stored, which might be online, though the older files are offline.
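
As a sketch of that date-based selection (the paths, the remote name, and the stamp-file idea are all just illustrative assumptions):

```python
import subprocess
from pathlib import Path

# Hypothetical layout: Duplicati writes to LOCAL_DIR, and STAMP records
# when we last shipped files to online storage.
LOCAL_DIR = Path("/backups/duplicati")
STAMP = Path("/backups/.last-online-sync")
REMOTE = "b2:my-bucket/duplicati"

last_sync = STAMP.stat().st_mtime if STAMP.exists() else 0.0

# Anything written since the last sync is an "incremental" file to push online.
new_files = [p for p in LOCAL_DIR.iterdir()
             if p.is_file() and p.stat().st_mtime > last_sync]

for f in new_files:
    # Copy, not move: the local copy stays so the backup can still be checked.
    subprocess.run(["rclone", "copy", str(f), REMOTE], check=True)

STAMP.touch()  # remember when this batch went up
```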

When online storage gets too full, do catch-up routine maintenance (version delete, compact, testing) on the local backup, then (when satisfied) rclone it onto the offsite disk, clear the online storage, and start the cycle again. There might be some further optimizations or gaps to plug, but this whole thing might be going overboard…
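
That catch-up cycle might look roughly like this; the commands come from the duplicati-cli and rclone tools already mentioned, but the exact options and paths are assumptions, so check them against your versions first:

```python
import subprocess

# Hypothetical locations; add your usual --passphrase / --dbpath options
# to the duplicati-cli calls.
LOCAL_PATH = "/backups/duplicati"    # complete local copy of the backup
DEST = "file://" + LOCAL_PATH
OFFSITE_DISK = "/mnt/offsite-disk"   # the rotated drive, temporarily attached
ONLINE = "b2:my-bucket/duplicati"    # online overflow storage

def run(*cmd):
    subprocess.run(cmd, check=True)

# 1. Catch-up maintenance on the complete local copy.
run("duplicati-cli", "delete", DEST, "--keep-versions=10")  # prune old versions
run("duplicati-cli", "compact", DEST)                       # reclaim wasted space
run("duplicati-cli", "test", DEST, "all")                   # verify what is left

# 2. When satisfied, refresh the offsite disk and empty the online bucket.
run("rclone", "sync", LOCAL_PATH, OFFSITE_DISK)
run("rclone", "delete", ONLINE)
```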

How long is download going to take, indeed.

When would you need to download this whole backup? If your building burned down, taking all your other copies with it. So, ideally, never.

This goes to what we security boffins call the Recovery Time Objective, or RTO: how long can you stand being without your data while it is restored from that last-resort backup?

The smaller the RTO, the more resource-intensive your backup scheme needs to be.

So the conversation about getting TB upon TB of cloud storage for <$10 should not be happening.

Everything is tradeoffs.

Oh yeah, I DO go for the cheapest possible cloud storage. But I have consciously decided that my RTO is large, and my backups are segmented enough that I can get the essentials back in a few hours while waiting the month or so it might take for the 8 TB monster to finish downloading.