Has anyone tried using Duplicati to back up a file server? I’m wondering if we can use it to back up a 12TB file server to AWS, bypassing an on-prem backup.
I back up several file servers with Duplicati, a mix of Windows and Linux, all uploading to Wasabi S3. None are quite that large, but I don't see size being a problem other than maybe taking a long time.
For Windows, though, you'll want to run Duplicati as a service so that no one has to stay logged in on the server for backups to run. If you search the forum you'll find threads with more detail on this.
If you decide to do this, you will most certainly want to customize the deduplication block size (default is 100KiB). The default size can result in way too many blocks for a backup that size - you’d possibly end up with over 100 million blocks for your initial backup. The local database (which tracks such blocks) will get very large, slow, and its recreation (like in a DR situation) will take a while.
See this article for more info on choosing block sizes: Choosing Sizes in Duplicati - Duplicati 2 User's Manual
Not sure what would be ideal, but a 10MiB block size would probably be a better starting point. This reduces deduplication efficiency but results in only 1% of the number of blocks to track.
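To put numbers on that, here's a quick back-of-the-envelope calculation (a sketch only; the real block count depends on deduplication and compression):

```python
# Rough block-count estimate for a 12 TiB backup at different
# deduplication block sizes (ignores dedup savings and compression).

TiB = 2**40
KiB = 2**10
MiB = 2**20

source_bytes = 12 * TiB

for label, block_size in [("100KiB (default)", 100 * KiB),
                          ("1MiB", 1 * MiB),
                          ("10MiB", 10 * MiB)]:
    blocks = source_bytes // block_size
    print(f"{label:>17}: ~{blocks:,} blocks")
```

At the 100KiB default that's roughly 129 million blocks for the local database to track; at 10MiB it drops to about 1.3 million.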
Note that you cannot change this value once you've performed your first backup; you'd have to start over with a new backup. (You could of course keep the old backup set around for restores, if you wanted…)
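If you drive Duplicati from the command line, the block size is set when the backup is created. A rough sketch, with made-up bucket name and paths, and credential/endpoint options omitted (check the manual for your backend's exact options):

```shell
# Hypothetical example: back up /srv/data to S3 with a 10MiB dedup
# block size and larger remote volumes. Bucket and path are placeholders;
# authentication options for your backend are omitted here.
duplicati-cli backup \
  "s3://my-backup-bucket/fileserver" \
  /srv/data \
  --blocksize=10MB \
  --dblock-size=200MB
```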
Regarding going straight to AWS, my main concern is that you’d have a lengthy recovery process if there was a disaster and you had to restore everything. This isn’t unique to Duplicati but would apply to any backup product that targeted cloud only.
If there's a reasonable way to split the 12 TB backup into smaller ones, e.g. based on file importance, that might reduce the large-backup impact on the DB, backups, and emergency restores. For maximum safety you want multiple backups anyway. It wasn't clear whether you'd drop local backups entirely, or whether you'd consider a 2-step approach: back up on-prem first, then use Duplicati to back that up remotely. A 2-step backup does make for an awkward 2-step DR process, though…
So the application has a local database? When backing up to S3, are there temporary files that get created? I’m trying to figure out if I would even have space on a server to back up 12TiB of data.
Yes, Duplicati uses deduplication and therefore needs to track blocks and hashes. It does this with a local database.
It also creates temporary files, but by default they aren't very large. The default "remote volume size" is 50MiB, which works fine for most users, and by default it may build up to 4 of these volumes concurrently, so temporary space usage stays in the low hundreds of MiB.
Duplicati reads all your data and breaks it into chunks (default 100KiB as mentioned earlier), deduplicates the chunks (so each unique chunk is stored only once on the back end), and repackages the chunks into volumes (default 50MiB).
You can read more about how Duplicati works here: How the backup process works • Duplicati
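For intuition, the chunk-and-dedup step can be sketched like this. This is a toy illustration, not Duplicati's actual code; the function name and in-memory "store" are made up:

```python
import hashlib

BLOCK_SIZE = 100 * 1024  # Duplicati's default deduplication block size

def dedup_blocks(data: bytes, store: dict) -> list:
    """Split data into fixed-size blocks, store each unique block once,
    and return the list of block hashes that reconstructs the data."""
    hashes = []
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        digest = hashlib.sha256(block).hexdigest()
        store.setdefault(digest, block)  # only store blocks not seen before
        hashes.append(digest)
    return hashes

store = {}
h1 = dedup_blocks(b"A" * 300_000, store)  # 3 blocks, only 2 unique
h2 = dedup_blocks(b"A" * 300_000, store)  # same data: nothing new stored
print(len(h1), len(store))  # prints: 3 2
```

Duplicati's local database plays the role of `store`'s key index here: it has to track every block hash, which is why the block count matters so much for a 12TB backup.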
I recommend you test it out on small sets of data so you can see how it works before you jump into the deep end and start backing up 12TB of data.
It took about 26 minutes for a gigabyte. I don’t see this as being feasible for 12-20TB of data if it normally takes this long. What does everyone think?
The first backup will be slower as it has to process data and upload it. But once it is complete, future backups will be faster. Only changed and new files are analyzed, and only the changed blocks within those files are uploaded.
But yes the first backup can be brutal. Depends on your upload speed and other factors. When I had just a 5Mbps upload speed at home, it took a few weeks to do an initial backup of 500GB.
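A rough way to estimate the initial backup, assuming upload bandwidth is the bottleneck (real runs also spend time hashing and compressing, so treat these as lower bounds):

```python
# Back-of-the-envelope initial-backup duration from upload bandwidth alone.

def upload_days(source_gb: float, mbps: float) -> float:
    """Days to push source_gb gigabytes over an mbps-megabit/s uplink."""
    bits = source_gb * 1e9 * 8          # GB -> bits (decimal units)
    seconds = bits / (mbps * 1e6)
    return seconds / 86400

print(f"500 GB @ 5 Mbps  : {upload_days(500, 5):.1f} days")
print(f"12 TB  @ 100 Mbps: {upload_days(12000, 100):.1f} days")
```

500GB at 5Mbps works out to about 9 days of pure transfer, which lines up with the "few weeks" above once overhead is included; even at 100Mbps, 12TB is over a week of uploading.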
I am backing up 2.5 TB of data right now (not 12 TB, I know), split into 8 different sets. The largest backup set is around 1.2 TB right now but growing.
Because the first backup takes quite some time, I did the biggest backup in smaller portions (starting with a few folders at first and adding more over time). I didn't want to let the other backup sets "wait" too long without a re-run; I run every set daily.
For the 1.2TB set, I started with a few hundred GB and added the rest over time. All in all, at roughly 1MB/s of upload, it took me a few days (around 6 days for 600GB, something like that). I used Duplicati's default block and volume sizes.
But now, the smaller sets with fewer changes during the day finish within 5 to 20 minutes. The largest set, with 1.2TB and the most changes on the source, finishes in around 2 to 3 hours.
As I said, I run them daily and the upload is a 10Mbps connection which results in around 900KB/s to 1MB/s upload speed.
Beyond upload, one question is how long one can wait for restore. Sometimes asymmetric Internet connection speeds help here. My download on a cable modem is 10 times faster than my upload…
There are a few cloud services (such as Backblaze B2) that can ship you a physical device instead, however this is not instant. This is simply a suggestion to consider restore needs as well as backup.
I’m backing up 8 TB of disk images in a single backup set. Works fine, 64 mebibyte deduplication blocks and 4 gibibyte volume files.
Smaller dedupe blocks would probably improve deduplication effectiveness, but in this particular case it doesn't matter much. The most important thing is to have a copy of the data with version history.
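For comparison, rough numbers for that configuration (again ignoring dedup and compression):

```python
# Block and volume counts for an 8 TiB backup with 64MiB dedup blocks
# and 4GiB remote volumes.

MiB = 2**20
GiB = 2**30
TiB = 2**40

source = 8 * TiB
print(f"blocks : ~{source // (64 * MiB):,}")  # tiny local database
print(f"volumes: ~{source // (4 * GiB):,}")   # remote files on the back end
```

Only about 131,000 blocks to track, versus the ~86 million you'd get with the 100KiB default at that size, which is why the database stays manageable.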
With older versions this set used to get corrupted often, but it has been working fine for at least half a year now. (Backup versions are created daily.)
Another large backup set is a backup of all the normal Duplicati backups, though these massive jobs are excluded from that one.