Newbie: What is the most cost-effective way to back up a 12+ terabyte Windows folder daily to AWS Deep Archive?

Hi Folks,

I am new to Duplicati. I have a file server running Windows Server 2022 Standard. If at all possible, I would love to use Duplicati to incrementally back up a specific folder on my F: drive to AWS Deep Archive on a daily basis. I understand the initial backup would be a full backup and all subsequent backups would be incremental. Also, I do not anticipate needing to retrieve data from deep storage (possibly for years), as it is only for safekeeping in case of an emergency.

I have not purchased AWS storage yet, but I can do so as soon as I am sure the above is indeed possible. I assume this is not an uncommon thing to do, but unfortunately I have not found a clear-cut set of instructions for backing up incrementally to AWS Deep Archive.

BTW: I currently perform the backup described above to LTO7 tapes with different backup software. My cost for tapes averages less than $120 per year, so I’m hoping I can replicate a similar scheme using Duplicati and AWS Deep Archive.

I appreciate any advice.

Thanks!

-sul.

You won’t beat $120/year. I just looked at AWS pricing, and S3 Glacier Deep Archive storage is $0.00099 per GB-month. For 12 TiB (about 12,288 GB) that works out to roughly $12.16/month, or about $146/year. This doesn’t even include the transaction (request) costs associated with S3, not to mention restore (retrieval) costs, which are relatively high for this storage tier.
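
For anyone who wants to redo the arithmetic with their own numbers, here is the same back-of-the-envelope math as a quick PowerShell snippet (the rate is the advertised Deep Archive price at the time of writing and is subject to change):

```powershell
# Storage-only estimate; S3 request and retrieval fees are extra.
$sizeGB    = 12 * 1024        # 12 TiB, treated as ~12,288 GB for a rough figure
$ratePerGB = 0.00099          # advertised Deep Archive price per GB-month
$perMonth  = $sizeGB * $ratePerGB
"{0:N2} USD/month, {1:N2} USD/year" -f $perMonth, ($perMonth * 12)
```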

Besides the cost, there are complications and challenges when using Duplicati with archive tiers that are not “online.”

drwtsn32,

Thank you for your honest feedback. I can certainly live with the higher cost, since I would potentially do away with some labor, such as physically managing tapes and maintaining on-premises tape hardware. However, I think Duplicati’s challenges with archive tiers could be a deal breaker.

I wonder: is it that Duplicati is not yet mature enough to robustly handle Deep Archive backups the way I want (daily incremental backups with minimal, low-cost S3 transactions)? Or, more generally, is Deep Archive such a beast that the way I intend to use it is simply not possible with any backup software, even commercial software, without breaking the bank?

Appreciate any insight.

Thank you!

-sul.

Duplicati can work with archive-tier storage, but you must change some default options. See this post for more information:

And this thread:

Also, for a 12 TiB backup you’ll want to either increase the default deduplication “block size” (blocksize), break the backup up into smaller sets, or both.
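
To give a rough idea of what those posts cover, here is a sketch of a command-line backup job aimed at a cold tier. The install path, bucket, prefix, source folder, and credentials are placeholders, and the exact set of options to change is what the linked post and thread discuss, so verify them against your Duplicati version:

```powershell
# Illustrative sketch only - confirm each option against the linked posts and
# your Duplicati version's built-in help before relying on it.
$duplicati = "C:\Program Files\Duplicati 2\Duplicati.CommandLine.exe"   # typical install path

$options = @(
    "backup",
    "s3://my-bucket/deep-archive-backup",   # placeholder destination bucket/prefix
    "F:\DataToProtect",                     # placeholder source folder
    "--auth-username=AKIA...",              # AWS access key ID (placeholder)
    "--auth-password=...",                  # AWS secret access key (placeholder)
    "--s3-storage-class=DEEP_ARCHIVE",      # upload volumes directly to the Deep Archive tier
    "--blocksize=5MB",                      # larger dedup block size for a 12+ TiB source
    "--dblock-size=500MB",                  # larger remote volumes = fewer S3 objects and requests
    "--no-auto-compact=true",               # compacting would download old volumes, costly on cold tiers
    "--no-backend-verification=true",       # skip post-backup listing/verification of remote files
    "--backup-test-samples=0"               # don't download sample volumes for test verification
)

& $duplicati @options
```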

2 Likes

Thanks for jumping in. I don’t have anything on AWS myself, and I personally worry about things like testing backups.

Because this one sounds like emergency-only use, I’m wondering how you will know that it’s working well. Downloading everything for a test restore would cost over $1,000, I think; isn’t typical S3 egress around $90 per terabyte, which would be roughly $1,080 for 12 TB?

Arranging things so the dlist and dindex files stay in hot storage would let you test a database Recreate, but a Recreate sometimes needs a dblock as well.

At one time you were syncing a local backup to a NAS, where at least the local copy could be tested well. That arrangement also allows things like version deletes and compacting of wasted space to happen against the local copy.

Later on, you were pursuing direct backup from Duplicati, and you liked it. Has that favorable view changed?

Does that mean this is a secondary backup, or are you just planning on restores almost never happening?

How important is your data? Usually important data gets more than one backup. One backup plus a sync isn’t quite two, but it allows not only testing but also restoring more often than years apart. It can also cut remote space usage.

You did put the focus on cost, though. Lowest cost might mean less protection, along with other drawbacks.

drwtsn32,

This is all great information, and I sincerely appreciate that you took the time to curate the relevant articles in your reply. I’ll try to navigate through it all. Being an old-fashioned tape-backup guy, I certainly have a lot to learn about the nuances of backing up 12+ terabytes to Deep Archive.
From an outsider looking in, it sounds like not knowing all the “tricks” (block size, verification, etc.) could significantly affect monthly storage costs. Hopefully I’ll gather enough confidence to move forward with S3 Deep Archive or an alternative provider in a way that is not too much more expensive than tapes.

Thanks again!

-sul.

Hi all,

I was actually about to make a post about a very similar situation. I have a server with a 35 TB data set and 8.4 million files (which works out to about 440 KB average file size). We also get about 65% deduplication savings from Windows Server deduplication. With B2, our actual remote size is 20 TB as of last week.

We’ve been using Duplicati with B2 for years and it has worked fairly well (about 4 hours for incremental backups), but lately we’ve been getting zip64 issues with streams over 4 GB (which I recently addressed), and the backup is now taking almost 16 hours (it spiked recently for some reason). We’re using a remote volume size of 500 MB, but I missed the block size recommendation, so we were at 100 KB the whole time, which I am assuming is the culprit.

I’ve decided it’s time to update the block size, so before I dump a new 20+ TB backup to B2, I was wondering whether there are any other variables to adjust for extremely large datasets, beyond what I already have?

auto-vacuum = enabled
auto-vacuum-interval = 1 month
zip-compression-zip64 = true
blocksize = 5 MB
remote volume size = 500 MB
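
For reference, those settings correspond roughly to the following advanced options on the command line (a sketch with the destination and credentials left out; the exact names and the interval syntax are worth checking against the job’s exported command line):

```powershell
# Sketch of the equivalent advanced options; the interval spelling "1M" (one month)
# is an assumption - compare against the exported command line of the actual job.
$tuning = @(
    "--auto-vacuum=true",
    "--auto-vacuum-interval=1M",
    "--zip-compression-zip64=true",
    "--blocksize=5MB",
    "--dblock-size=500MB"          # "remote volume size" in the UI
)
```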

Besides splitting the backup into multiple backup jobs, which is not possible in our environment, is there anything else you’d recommend? We have a symmetrical gigabit fiber connection, and the data is on a RAID 5 SSD volume. The system has a 1 TB primary NVMe drive, 40 GB of RAM, and a modern i5.

Welcome to the forum @Gloomfrost

You hit most of the highlights, although I usually like to keep backups below about 1 million blocks, which for 20 TB of remote data means a blocksize of 20 MB or so. Your file count blows that budget by itself, though, since each file uses at least two blocks (one of them metadata).
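
To put rough numbers on that budget, using the figures given above (my arithmetic, so double-check it):

```powershell
# Rough block-count math for this dataset
$remoteBytes = 20 * 1TB                     # deduplicated data actually stored remotely
$dataBlocks  = $remoteBytes / (20 * 1MB)    # ~1.05 million data blocks at a 20 MB blocksize
$fileMinimum = 8.4e6 * 2                    # ~16.8 million blocks just from per-file minimums
"{0:N0} data blocks, {1:N0} minimum from file count" -f $dataBlocks, $fileMinimum
```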

Performance analysis at this scale has been scarce so far due to limits of equipment and people; however, another person (probably with a faster system, and yours sounds pretty fast) thinks it can handle more blocks.

Regardless of where the speed drop-off really happens, a larger blocksize will almost surely push it further out.

Lots of files and blocks also mean larger databases, which means overflowing the tiny default 2 MB memory cache in SQLite. It doesn’t show up in drive accesses (which I notice immediately on a mechanical drive); instead I can see the OS-level read rate go very high, e.g. in Task Manager or Process Explorer.

However, the increased blocksize will probably reduce database size and access stress. I’m just mentioning it because you asked for other adjustments. As usual, the exact cache need, if you go this route, would probably have to be found experimentally. If the SQL gets too slow, or you see tons of file accesses (Process Monitor can detail those) to the database, its journal, or a file whose name starts with etilqs (likely in Temp), more cache might help.
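
If it comes to that, the knob I have in mind is the environment variable that (I believe) Duplicati passes through to SQLite as extra pragmas; treat the name and syntax below as something to confirm for your version. SQLite reads a negative cache_size as an amount in KiB, so this asks for roughly 200 MB of page cache:

```powershell
# Assumption: Duplicati forwards pragmas from this environment variable to SQLite.
# Machine scope so a service sees it; run from an elevated prompt, then restart Duplicati.
[Environment]::SetEnvironmentVariable("CUSTOMSQLITEOPTIONS_DUPLICATI", "cache_size=-200000", "Machine")
```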

That is a recent SQLite comment. I know the default blocksize is being raised ten-fold, but I don’t know what else might be cooking. Possibly this quote will draw out some more comments, or maybe it won’t. We’ll see.

1 Like