Duplicati for large archive?

I’m testing Duplicati for deployment on a Linux server (Ubuntu 14.04). Whatever I end up using will have to be able to back up a large archive (about 14 TB of data and growing) to a local NAS and to a cloud storage provider. I’ve been testing on Windows and macOS machines and I have a few questions.

  1. What are the common performance bottlenecks with Duplicati? It seems very slow (about a week at the current rate) to run a 2 TB backup on a fast system going from one internal disk to another, and nothing obvious in my performance metrics (spare CPU, memory, and disk I/O capacity) points to the bottleneck.

  2. Is Duplicati suitable for backing up data archives of more than 15 TB?

  3. Will the incremental backups be faster than the initial backup?

  4. Can Duplicati work with Amazon Glacier and Google Coldline? I suspect the slow retrieval times might hurt its ability to do backup verification.

  5. Can a backup job be configured to store in two destinations simultaneously? Or, in my intended use case, will I need two separate backup jobs, one pointed at the local NAS and one pointed at a cloud storage destination?

Thanks in advance for your help! I’m sorry if I missed an FAQ or guide that answers this.


Hello @jason, welcome to the forum - and thanks for trying out Duplicati!

Let’s see if these answers help you with your decision making:

  1. Performance bottlenecks vary depending on system specs, but since you’ve got “a fast system” you’re likely running into inefficient SQLite database lookups. This is more visible when you have lots of small files (paths) than when the same total size is spread over fewer, larger files.

    There are plans in the works to improve this, but until that work is tested enough to roll out, people who have run into this and moved their SQLite file to an SSD have reported noticeable speed-ups (see the first sketch after this list).

  2. Duplicati should be able to handle sources larger than 15 TB; however, as discussed in #1, the more files you have, the slower backups will run. Some users have worked around this by splitting their backups into multiple jobs.

  3. Yes, incremental backups should be faster than the initial backup - though by how much will depend on how much data changes between backups. For example, if your initial backup took X hours and 50% of your data changes between backups, I wouldn’t be surprised if incremental backups take around X/2 hours to complete.

  4. Unfortunately, Duplicati isn’t really designed for use with “cold storage” destinations like Amazon Glacier (I can’t speak to Google Coldline). That said, it IS possible to use Duplicati with Amazon Glacier, but doing so involves disabling many of the features that help make sure your backup is viable.

    For example, by default Duplicati lists all the files on the destination before a backup to make sure they match what’s in the local database. Similarly, at the end of a backup Duplicati downloads a few archives for testing to make sure they are still accessible and not corrupted.

    With something like Amazon Glacier, once files are moved into cold storage they are no longer visible to a normal file list request, so features like the above have to be disabled (see the second sketch after this list).

  5. While this is a requested feature, at the moment Duplicati can only use one destination per job. So in your intended use case you would indeed need two separate backup jobs - one to the local NAS and one to the cloud destination.

    Note that you COULD have a single job pointed at the local NAS and then use a sync tool like rsync or Syncthing to replicate that destination folder to the cloud storage (see the third sketch after this list), though there are some drawbacks to that approach as well.
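
To illustrate #1, here’s a minimal sketch of pointing a job’s local database at an SSD using the --dbpath advanced option. The paths and destination URL are placeholders, and you should double-check the option name against your Duplicati version:

```
# Hypothetical example: keep this job's local SQLite database on an SSD.
# Paths are placeholders; adjust to your system. Encryption options omitted.
duplicati-cli backup file:///mnt/nas/duplicati/archive /srv/archive \
    --dbpath=/mnt/ssd/duplicati/archive.sqlite
```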
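
For #4, here’s a sketch (not a recommendation) of the kind of verification features that typically get turned off for a cold-storage destination. The option names come from Duplicati’s advanced options as I know them; verify them against your version before relying on this:

```
# Hypothetical example: options often disabled for cold-storage backends.
# This weakens Duplicati's ability to detect a broken backup, so use with care.
#   --no-backend-verification=true  skips listing remote files before each backup
#   --backup-test-samples=0         skips downloading sample archives for testing
#   --no-auto-compact=true          avoids compacting, which requires downloads
duplicati-cli backup <destination-url> /srv/archive \
    --no-backend-verification=true \
    --backup-test-samples=0 \
    --no-auto-compact=true
```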
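
For #5, the “single job plus sync” approach could look something like this hypothetical rsync call that mirrors the NAS copy of the backup files to a second location (host and paths are placeholders):

```
# Hypothetical example: mirror the Duplicati destination folder from the NAS
# to a second location. The trailing slash copies the folder's contents.
rsync -av --delete /mnt/nas/duplicati/archive/ user@offsite-host:/backups/duplicati/archive/
```

One of the drawbacks mentioned above is that with --delete, any damage to the NAS copy gets mirrored to the second copy on the next sync.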

  1. Remember to use large deduplication blocks. Blocks that are too small completely kill performance with large data sets. I’m currently using 256 MB blocks and 4 GB volumes for many larger data sets that are quite static and where bandwidth isn’t the limiting factor (see the sketch after this list).

  2. This depends a lot on point 1 and on how the data set changes. Incremental backups can be very fast, or even slower than the initial backup in cases where all attempts to deduplicate fail and all files need to be completely re-analyzed; we’ve had such cases. The most important things are to make sure deduplication is succeeding and that modified timestamps aren’t updated unless the file contents actually change.

  3. How about Backblaze B2?
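
As a rough illustration of point 1, the block and volume sizes can be set per job. This sketch assumes Duplicati’s --blocksize and --dblock-size options and placeholder paths; note that --blocksize cannot be changed once the first backup has run:

```
# Hypothetical example: larger dedup blocks (256 MB) and remote volumes (4 GB)
# for a big, mostly static data set. Pick --blocksize before the first backup,
# because it cannot be changed afterwards.
duplicati-cli backup file:///mnt/nas/duplicati/archive /srv/archive \
    --blocksize=256MB \
    --dblock-size=4GB
```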


Thanks @Sami_Lehtinen for those important tips that I forgot to include!

I should also mention for #3 that if the backed-up files include things like virtual disk images or encrypted content, where a small “internal” change can completely rewrite the file (or shift the block boundaries), this can also cause subsequent backups to be slow.

Thanks very much for the information!