Duplicati and S3 Glacier

I’ve been using Duplicati for a while but only just thought to start shifting infrequently accessed backups to Glacier for long-term storage. The description at How to use Glacier to store backups - Duplicati (google.com) is slightly out of date:

For reference, my rule moves both current and noncurrent versions of all files with the prefix duplicati-b to Glacier 90 days after creation, and to Glacier Deep Archive after 180 days. S3 is only an emergency backup for me in case my local backup fails, so I rarely need to restore from it and am happy to go to Deep Archive as soon as possible.
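A rule like that can be expressed as a single S3 lifecycle configuration. Here is a sketch of roughly what it might look like (the rule ID is my own invention, and you should double-check the field names against the current AWS docs):

```json
{
  "Rules": [
    {
      "ID": "duplicati-dblock-archive",
      "Filter": { "Prefix": "duplicati-b" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" },
        { "Days": 180, "StorageClass": "DEEP_ARCHIVE" }
      ],
      "NoncurrentVersionTransitions": [
        { "NoncurrentDays": 90, "StorageClass": "GLACIER" },
        { "NoncurrentDays": 180, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
```

The prefix filter is why targeting only the duplicati-b (dblock) files works: dindex and dlist files stay in a warmer class and remain immediately readable.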


This is a neat idea…using archive class storage for your secondary backup data only.

Out of curiosity are you running two backup jobs in Duplicati, one targeting local and another S3? Or do you have only one backup job in Duplicati, and then sync that local backup data to S3?

I do the latter: all my systems back up to my NAS, then I use Synology CloudSync to replicate to B2. Only time I’d really need to restore from B2 is if my NAS failed. I might consider using Glacier Deep Archive instead. It’s about 20% the cost of B2 hot storage, but I’ll have to think about how the Glacier overhead and transaction costs come into play…

Actually, the second option is probably smarter! I just have two backup jobs. I’ve never actually had to restore from Amazon, so I’m not sure whether it works properly, I guess.

I have been using Duplicati uploading directly to Deep Glacier. I have multiple local copies of the data + local backups to a RAID array, so the backup is basically an absolute emergency brake. The downside is that it’s not possible to verify the uploads immediately (because the files cannot be retrieved); the upside is that my backup is 225GB and it costs a few dollars per year. Uploading to regular S3 and quickly cycling to Deep Glacier would be more expensive, though it would guard against the marginal risk of a corrupt upload.

Interested in this also. I have been syncing copies of all backups to cloud storage using rclone. Currently LimaLabs, but tentatively migrating to Jottacloud. I have a full sync at each site, but the Personal plan of Jottacloud, with unlimited storage, has me throttled to 3MB/sec upload speed.

My use case for cloud backup is needing a restore after a full-on disaster has destroyed all the various copies in my home.

How do you calculate the cost of doing this in Glacier?

I have many backup jobs targeting AWS S3 Deep Archive. The costs are roughly 10 USD per TB per year, coming with a guaranteed 3 replicas on the AWS side. Since I have some professional experience with AWS already, it was also quite straightforward to add IAM policies ensuring that my standard backup job gets only the access rights to write into the target bucket. Even if the PC gets compromised by a hacker, the AWS credentials they find won’t be sufficient to delete any backup files. That’s a great feature, adding an extra level of security at no extra cost on the AWS side.
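One possible shape for such a write-only policy (the bucket name is a placeholder; the exact action set you need depends on your Duplicati settings, but the key point is that s3:DeleteObject is simply not granted):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListBackupBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-backup-bucket"
    },
    {
      "Sid": "ReadWriteButNeverDelete",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-backup-bucket/*"
    }
  ]
}
```

With versioning enabled on the bucket, even an overwrite by a compromised client only creates a new version rather than destroying the old data.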

The restore process is of course a bit more complicated, but I also successfully tested that one. You trigger an object restore (which costs a bit, but not much in bulk mode). S3 then holds a 2nd copy of your objects in the standard storage class for a specified amount of time, say 5 days. After that period, the 2nd copy is deleted automatically. You always have your deep-archive-stored version of the object as well.
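For reference, the restore trigger described above maps onto the AWS CLI roughly like this (bucket and key are made-up examples):

```shell
# Request a bulk-tier restore; the temporary Standard-class copy
# remains available for 5 days after the restore completes.
aws s3api restore-object \
  --bucket my-backup-bucket \
  --key duplicati-b1234.dblock.zip.aes \
  --restore-request '{"Days": 5, "GlacierJobParameters": {"Tier": "Bulk"}}'

# Check progress: the "Restore" field shows whether the request is ongoing.
aws s3api head-object \
  --bucket my-backup-bucket \
  --key duplicati-b1234.dblock.zip.aes
```

Bulk restores from Deep Archive can take up to around 48 hours, so this is strictly a disaster-recovery path, not something you would do routinely.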

For anyone concerned about a “marginal risk of corrupt upload”: even in the deep archive storage class, you can list the ETag of all objects without needing to restore them. ETags are basically MD5 fingerprints of the uploaded files (for big uploads conducted as multipart uploads, there is a documented logic for how the per-part hashes are combined into the final ETag).
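That multipart ETag logic can be reproduced locally to verify uploads without restoring anything. A minimal sketch, assuming the upload used a fixed, known part size (8 MiB is a common client default, but your uploader may differ):

```python
import hashlib

def s3_etag(path: str, part_size: int = 8 * 1024 * 1024) -> str:
    """Compute the expected S3 ETag of a local file.

    Single-part uploads: plain MD5 hex digest.
    Multipart uploads: MD5 over the concatenated binary digests of each
    part, suffixed with "-<number of parts>". The part size must match
    the one actually used for the upload.
    """
    digests = []
    with open(path, "rb") as f:
        while chunk := f.read(part_size):
            digests.append(hashlib.md5(chunk).digest())
    if len(digests) == 0:   # empty file
        return hashlib.md5(b"").hexdigest()
    if len(digests) == 1:   # single-part upload
        return digests[0].hex()
    combined = hashlib.md5(b"".join(digests))
    return f"{combined.hexdigest()}-{len(digests)}"
```

Comparing the result against the ETag returned by a bucket listing gives you an integrity check on archived objects for free.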


@arbe thanks for sharing your experience!

Thanks so much! $10/TB annual is way less than I thought it would be.

What file size do you find is optimum to upload? Or does it matter?

Also - how do you get files deleted from S3 when Duplicati cleans out old files?

I work with a dblock size of 400MB. It should be less than 500MB (if I remember correctly, there is some AWS limitation on the bulk vs. standard restore modes for big files). Very small files also don’t work well, especially as there is some AWS metadata added to every file (though only in the low KB range, I believe).

Duplicati deleting files is not an issue; I always follow the recommendation here in the forum to work with the parameters:
--keep-versions=-1
--no-auto-compact=true
Hence, Duplicati will not try to delete anything. And at this storage pricing, you have the advantage of a full history of all your files. But of course, the local SQLite database might get quite big with this setting if you run frequent backups. I haven’t looked into this in detail yet.

I don’t think deleting from S3 Glacier is a problem at all. It’s just that with Glacier you pay for a minimum of 90 days storage, so if it’s deleted early you’ll be charged as if it continues to exist for the full 90 days. (With Deep Archive it’s 180 days.) So you could still prune versions if you wanted to.
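To make the minimum-duration charge concrete, here is a small sketch. The per-GB-month prices are my own approximations (roughly the us-east-1 rates, which change over time), not figures from this thread:

```python
# Approximate per-GB-month storage prices (assumed; check current AWS pricing).
GLACIER_PER_GB_MONTH = 0.0036        # S3 Glacier Flexible Retrieval
DEEP_ARCHIVE_PER_GB_MONTH = 0.00099  # S3 Glacier Deep Archive

def storage_cost(size_gb, price_per_gb_month, days_stored, min_days):
    """Storage cost honoring the minimum storage duration:
    objects deleted before min_days are billed as if kept for min_days."""
    billed_days = max(days_stored, min_days)
    return size_gb * price_per_gb_month * billed_days / 30

# A 50 MB dblock deleted after 30 days in Glacier is billed for the full
# 90 days anyway, so the early deletion saves nothing on that object:
early = storage_cost(0.05, GLACIER_PER_GB_MONTH, 30, min_days=90)
full = storage_cost(0.05, GLACIER_PER_GB_MONTH, 90, min_days=90)
assert early == full
```

The practical upshot is that pruning old versions still works; it just doesn’t reduce the bill until the minimum duration has elapsed.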

Unlimited retention will bog down your local sqlite database over time. Depends on how many backups you take per day, how many versions it’s tracking, how large the backup is, your dedupe block size, etc.

I’m currently in the process of making the switch from B2 to S3 for the second copy of my backup data. My primary copy is on a NAS, and it will synchronize with S3. Not sure if it’s too aggressive, but my plan is to use Standard-IA when objects are first placed in the bucket and transition to Deep Archive after 14 days.

@liquefry - how are you doing this? Are you storing your Duplicati files in the root of the bucket? If you have “folders” then it looks like you need to specify the folder name, e.g. folder1/duplicati-b as the prefix filter.

In my case I have numerous “folders” - one for each backup on each PC. As you point out, wildcards aren’t supported in the Lifecycle prefix area, so it looks like I may have to set up a lifecycle rule for each of these folders. One alternative I’ve seen is to use Lambda to tag the objects and then you can filter by tag in the lifecycle policy. I might have to dig into that a bit deeper.

Spent the day messing around with Lambda functions. Never really worked with them before. I managed to get one created that is triggered on S3 put and will tag it if it’s a dblock. Unfortunately it only works intermittently and I have no idea why. I can see in the logs on AWS that the function is being triggered for each upload, but when I check the S3 object only some of them will be tagged.

I’m taking a step back and thinking a simpler solution is to change the lifecycle policy so that instead of working on prefixes or tags, it just works on a minimum object size. If my remote volume size is 50MB, I can set the minimum size to just under 50MB so only the dblocks are matched. No need to match on filename. (None of my dindex or dlist files come anywhere close to 50MB…)
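For anyone wanting to try the same approach: S3 lifecycle filters support a size threshold in bytes. A sketch of such a rule (the ID, threshold, and transition delay are illustrative choices, not from this thread):

```json
{
  "Rules": [
    {
      "ID": "archive-large-dblocks",
      "Filter": { "ObjectSizeGreaterThan": 41943040 },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 14, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
```

41943040 bytes is 40 MiB, comfortably above any dindex/dlist file while still catching 50MB dblocks, even ones that come in slightly under the configured volume size.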

That’s correct, but not really a problem for me. I use the JSON version of the policy generated in the AWS web UI. Then I just copy/paste this with the different prefixes and an arbitrary ID each. That’s less than 5 minutes of work for dozens of “folders”. All this goes into 1 JSON file which you can then activate via the AWS CLI tool:
aws s3api put-bucket-lifecycle-configuration --bucket myBucket --lifecycle-configuration file://myLifecyclePolicyFile.json


Ah, very cool that it can be done that way! Thanks for the tip.

I think I’m good with just using the minimum size setting on the lifecycle policy and not messing with prefixes at all. Tested it out the other day and it worked as expected. In one way it might actually be better, because small dblocks won’t be transitioned to deep archive. Those smaller ones are candidates for consolidation when Duplicati does a compaction.

Once the dust settles it’ll be interesting to compare my cost in S3 vs B2.