Planning 20TB backup to B2, ~9 months to upload

Hello,

I’m planning a Duplicati backup to Backblaze B2. I have approx 20TB of data, consisting of VM disk images (live VMs), lots of media (video), and lots of typical user data (photos, documents, etc.).

My internet upload speed isn’t the best: I can only upload to Backblaze at ~7 Mbit/s running 24x7, which means 20 TB is going to take about 9 months.
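(For reference on the math: 7 Mbit/s is roughly 0.875 MB/s, which is about 75 GB per day if the link stays saturated around the clock, so roughly 2.2 TB per month, hence the ~9 month estimate for 20 TB.)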

Duplicati will run in an iocage jail on FreeBSD 11.1 (FreeNAS). The VMs are accessed by a remote hypervisor over NFS and iSCSI, and the user data is accessed over SMB.

I’m looking for advice. Is there anything I should know before kicking this off? In particular, I’m wondering about:

  1. How does Duplicati handle open files? The VM disk images are always in use by another system (NFS and iSCSI). Will they be backed up?
  2. How does Duplicati handle changes to a large file? The VM disk images are modified frequently, so I’d hope it’s doing block-level deltas or something similar so that it doesn’t back up the entire file each time it sees a change.
  3. Is Duplicati reliable with this volume of data? (20 TB, 2 million files)
  4. Data on my system changes regularly - files are added, deleted, and modified. I assume I can configure retention and versioning rules in Duplicati, and that it will purge data from Backblaze B2 in accordance with those rules so I’m not paying to store data that I no longer want or need. Is that true?
  5. While I have this one massive system, I also have 20 PCs running macOS, Windows, etc. that need backing up. I want to back them up to the same Backblaze B2 account, but into different buckets to segregate the data. I assume this is supported, but I’d really like confirmation that Client A won’t accidentally delete Client B’s bucket or data…
  6. Any general recommendations? Pitfalls to avoid?

Thank you for making this excellent tool, and I appreciate your help!

Did you know that Backblaze can send you a disk that you copy your data onto and physically ship back to them to load into the cloud? This might be better than waiting to upload terabytes of data…

Looking at their service, the largest option is a 4 TB drive (3.5 TB of usable space) at $189 each, which means he would need 6 of them, or about $1,134 just to borrow drives for his backups. They do refund the money once the drives are returned, but that is still a lot of out-of-pocket expense for that service.

Duplicati is not great for VM disk images at this point in time. There are no shift-tolerant block-level deltas or anything else clever. It splits the file into fixed-size blocks, and on the next run it does the same thing and uploads only the blocks it hasn’t already stored. Because the blocks are cut at fixed offsets, adding data near the beginning of the VM disk can shift everything after it and force nearly the whole image to be uploaded again.
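To make that concrete, here is a toy illustration in plain shell (this is not Duplicati code, and the 1 MB block size is just for the demo; Duplicati’s default block size is 100 KB) of how fixed-offset blocking reacts to a one-byte prepend versus an in-place overwrite:

```sh
# Toy demo of fixed-offset blocking (GNU/Linux userland assumed).

# Make an 8 MB "disk image" and hash it in 1 MB blocks.
dd if=/dev/urandom of=disk.img bs=1M count=8 2>/dev/null
split -b 1M disk.img before_ && sha256sum before_*

# Prepend a single byte: every block boundary shifts, so every block
# hash changes, and a fixed-offset deduplicator re-uploads everything.
{ printf 'X'; cat disk.img; } > shifted.img
split -b 1M shifted.img shifted_ && sha256sum shifted_*

# Overwrite one block in place: only that block's hash changes,
# which is the case Duplicati handles efficiently.
cp disk.img inplace.img
dd if=/dev/urandom of=inplace.img bs=1M seek=3 count=1 conv=notrunc 2>/dev/null
split -b 1M inplace.img inplace_ && sha256sum inplace_*
```

In practice guest filesystems mostly rewrite sectors in place, so the day-to-day deltas may be tolerable, but anything that shifts data within the image can blow the upload up badly.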

Duplicati has some performance issues with very large datasets, mainly once the local database gets very large. There are people backing up fairly large datasets, though. My 2 cents: it’s much better to split your backup into multiple jobs. That keeps each local DB smaller and gives you more freedom for scheduling backups and tweaking performance settings per job.
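As a rough sketch of what that could look like from the command line (duplicati-cli here stands in for however you invoke Duplicati.CommandLine in the jail, e.g. via mono; bucket names, paths and DB locations are made up, and B2 credentials are omitted):

```sh
# One job per data type: separate folder on the backend, separate local DB.
# Supply B2 credentials per the B2 storage-provider documentation.

duplicati-cli backup "b2://nas-backups/userdata" \
  /mnt/tank/documents /mnt/tank/photos \
  --dbpath=/var/db/duplicati/userdata.sqlite \
  --passphrase="..."

duplicati-cli backup "b2://nas-backups/media" \
  /mnt/tank/video \
  --dbpath=/var/db/duplicati/media.sqlite \
  --passphrase="..." \
  --dblock-size=200MB   # larger remote volumes suit big, rarely-changing files
```

Each job then has its own (smaller) database, its own schedule, and its own retention settings.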

Yes, --retention-policy lets you define essentially any scheme for how many versions to keep and for how long. Versions that fall outside the policy are deleted, and the space is reclaimed on B2 when Duplicati compacts the remote volumes.
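For example (syntax as I understand the --retention-policy documentation; the time frames are just an illustration, not a recommendation):

```sh
# Keep all backups from the last week, one per day up to a month old,
# one per week up to a year old, one per month up to 10 years old,
# and delete anything older than that.
--retention-policy="1W:U,1M:1D,1Y:1W,10Y:1M"
```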

Multiple backups can use the same backend, or even the same bucket, without any problem. What’s important is that each job is configured to store its backups in its own folder. You may prefer to give each machine its own bucket, but that’s up to you, as long as no two jobs share a folder.
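For instance, a layout like this (bucket and folder names made up) keeps everything separated; if you additionally create one B2 application key per bucket, a client machine can’t even see the other buckets:

```sh
# Target URLs -- the key rule is that no two jobs share a folder.
b2://nas-backups/vms        # NAS job 1
b2://nas-backups/media      # NAS job 2
b2://nas-backups/userdata   # NAS job 3
b2://pc-alice/duplicati     # desktop client A, its own bucket
b2://pc-bob/duplicati       # desktop client B, its own bucket
```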

I would strongly recommend using a different tool for the VMs at this point in time.
And, as stated above, I would recommend several smaller, well-defined backup jobs over one large one. That isn’t necessarily a must-have, but it will reduce performance issues and give you more flexibility.

Yes, exactly this. I intend to prioritize backing up my most critical data, then the less critical data can trickle up.

Are you thinking of doing this through scheduling, or by setting up your jobs with a subset of your eventual total backup and then manually adding source folders as the priority stuff completes?

I, and some other users, have used this method to make sure the priority data is backed up sooner than the non-priority data, and also to test various Duplicati features and performance tweaks without waiting days or weeks for the first backup to finish before making changes.
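In command-line terms the trick is just to start the job with only the high-priority sources and add more source folders to the same job later; a sketch with made-up paths (duplicati-cli stands in for however you run Duplicati.CommandLine, and B2 credentials are omitted):

```sh
# Initial runs: critical data only, so it is safely in B2 first.
duplicati-cli backup "b2://nas-backups/userdata" \
  /mnt/tank/documents /mnt/tank/photos \
  --dbpath=/var/db/duplicati/userdata.sqlite --passphrase="..."

# Once that has caught up, append the bulky low-priority folders to the
# SAME job (same target, same DB); only the new data gets uploaded.
duplicati-cli backup "b2://nas-backups/userdata" \
  /mnt/tank/documents /mnt/tank/photos /mnt/tank/archive \
  --dbpath=/var/db/duplicati/userdata.sqlite --passphrase="..."
```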