Duplicati and S3 Glacier

I’ve been using Duplicati for a while but only just thought to start shifting infrequently accessed backups to Glacier for long-term storage. The description at How to use Glacier to store backups - Duplicati (google.com) is slightly out of date:

For reference, my rule moves both current and noncurrent versions of all files with the prefix duplicati-b to Glacier 90 days after creation, and to Glacier Deep Archive after 180 days. S3 is only an emergency backup for me in case my local backup fails, so I rarely need to restore from it and am happy to go to Deep Archive as soon as possible.
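A rule like that can be expressed as a single S3 lifecycle configuration. Here is a sketch of roughly what it might look like (the rule ID is my own invention, and you should double-check the field names against the current AWS docs):

```json
{
  "Rules": [
    {
      "ID": "duplicati-dblock-archive",
      "Filter": { "Prefix": "duplicati-b" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" },
        { "Days": 180, "StorageClass": "DEEP_ARCHIVE" }
      ],
      "NoncurrentVersionTransitions": [
        { "NoncurrentDays": 90, "StorageClass": "GLACIER" },
        { "NoncurrentDays": 180, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
```

The prefix filter is why targeting only the duplicati-b (dblock) files works: dindex and dlist files stay in a warmer class and remain immediately readable.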


This is a neat idea…using archive class storage for your secondary backup data only.

Out of curiosity are you running two backup jobs in Duplicati, one targeting local and another S3? Or do you have only one backup job in Duplicati, and then sync that local backup data to S3?

I do the latter: all my systems back up to my NAS, then I use Synology CloudSync to replicate to B2. Only time I’d really need to restore from B2 is if my NAS failed. I might consider using Glacier Deep Archive instead. It’s about 20% the cost of B2 hot storage, but I’ll have to think about how the Glacier overhead and transaction costs come into play…

Actually, the second option is probably smarter! I just have two backup jobs. I’ve never actually had to restore from Amazon, so I’m not sure whether it works properly, I guess.

I have been using Duplicati uploading directly to Deep Glacier. I have multiple local copies of the data + local backups to a RAID array, so the backup is basically an absolute emergency brake. The downside is that it’s not possible to verify the uploads immediately (because the files cannot be retrieved); the upside is that my backup is 225GB and it costs a few dollars per year. Uploading to regular S3 and quickly cycling to Deep Glacier would be more expensive, though it would guard against the marginal risk of a corrupt upload.

Interested in this also. I have been syncing copies of all backups to cloud storage using rclone. Currently LimaLabs, but tentatively migrating to Jottacloud. I have a full sync at each site, but the Personal plan of Jottacloud, with unlimited storage, has me throttled to 3MB/sec upload speed.

My use case for cloud backup is needing a restore after a full-on disaster has destroyed all the various copies in my home.

How do you calculate the cost of doing this in Glacier?

I have many backup jobs targeting AWS S3 Deep Archive. The costs are roughly 10 USD per TB per year, coming with a guaranteed 3 replicas on the AWS side. Since I have some professional experience with AWS already, it was also quite straightforward to add IAM policies ensuring that my standard backup job gets only the access rights to write into the target bucket. Even if the PC gets compromised by a hacker, the AWS credentials they find won’t be sufficient to delete any backup files. That’s a great feature, adding an extra level of security at no extra cost on the AWS side.
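One possible shape for such a write-only policy (the bucket name is a placeholder; the exact action set you need depends on your Duplicati settings, but the key point is that s3:DeleteObject is simply not granted):

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ListBackupBucket",
      "Effect": "Allow",
      "Action": ["s3:ListBucket"],
      "Resource": "arn:aws:s3:::my-backup-bucket"
    },
    {
      "Sid": "ReadWriteButNeverDelete",
      "Effect": "Allow",
      "Action": ["s3:GetObject", "s3:PutObject"],
      "Resource": "arn:aws:s3:::my-backup-bucket/*"
    }
  ]
}
```

With versioning enabled on the bucket, even an overwrite by a compromised client only creates a new version rather than destroying the old data.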

The restore process is of course a bit more complicated, but I also successfully tested that one. You trigger an object restore (which costs a bit, but not much in bulk mode). S3 then holds a 2nd copy of your objects in the standard storage class for a specified amount of time, say 5 days. After that period, the 2nd copy is deleted automatically. You always have your deep-archive-stored version of the object as well.
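For reference, the restore trigger described above maps onto the AWS CLI roughly like this (bucket and key are made-up examples):

```shell
# Request a bulk-tier restore; the temporary Standard-class copy
# remains available for 5 days after the restore completes.
aws s3api restore-object \
  --bucket my-backup-bucket \
  --key duplicati-b1234.dblock.zip.aes \
  --restore-request '{"Days": 5, "GlacierJobParameters": {"Tier": "Bulk"}}'

# Check progress: the "Restore" field shows whether the request is ongoing.
aws s3api head-object \
  --bucket my-backup-bucket \
  --key duplicati-b1234.dblock.zip.aes
```

Bulk restores from Deep Archive can take up to around 48 hours, so this is strictly a disaster-recovery path, not something you would do routinely.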

For anyone concerned about a “marginal risk of corrupt upload”: even in the deep archive storage class, you can list the ETag of all objects without needing to restore them. ETags are basically MD5 fingerprints of the uploaded files (for big uploads conducted as multipart uploads, there is a documented logic for how the per-part hashes are combined into the final ETag).
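That multipart ETag logic can be reproduced locally to verify uploads without restoring anything. A minimal sketch, assuming the upload used a fixed, known part size (8 MiB is a common client default, but your uploader may differ):

```python
import hashlib

def s3_etag(path: str, part_size: int = 8 * 1024 * 1024) -> str:
    """Compute the expected S3 ETag of a local file.

    Single-part uploads: plain MD5 hex digest.
    Multipart uploads: MD5 over the concatenated binary digests of each
    part, suffixed with "-<number of parts>". The part size must match
    the one actually used for the upload.
    """
    digests = []
    with open(path, "rb") as f:
        while chunk := f.read(part_size):
            digests.append(hashlib.md5(chunk).digest())
    if len(digests) == 0:   # empty file
        return hashlib.md5(b"").hexdigest()
    if len(digests) == 1:   # single-part upload
        return digests[0].hex()
    combined = hashlib.md5(b"".join(digests))
    return f"{combined.hexdigest()}-{len(digests)}"
```

Comparing the result against the ETag returned by a bucket listing gives you an integrity check on archived objects for free.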


@arbe thanks for sharing your experience!

Thanks so much! $10/TB annual is way less than I thought it would be.

What file size do you find is optimum to upload? Or does it matter?

Also - how do you get files deleted from S3 when Duplicati cleans out old files?

I work with a dblock size of 400MB. It should be less than 500MB (if I remember correctly, there is some AWS limitation on the bulk vs. standard restore modes for big files). Very small files also don’t work well, especially as there is some AWS metadata added to every file (though only in the low KB range, I believe).

Duplicati deleting files is not an issue; I always follow the recommendation here in the forum to work with the parameters:
--keep-versions=-1
--no-auto-compact=true
Hence, Duplicati will not try to delete anything. And at this storage pricing, you have the advantage of a full history of all your files. But of course, the local SQLite database might get quite big with this setting if you run frequent backups. I haven’t looked into this in detail yet.

I don’t think deleting from S3 Glacier is a problem at all. It’s just that with Glacier you pay for a minimum of 90 days storage, so if it’s deleted early you’ll be charged as if it continues to exist for the full 90 days. (With Deep Archive it’s 180 days.) So you could still prune versions if you wanted to.
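To make the minimum-duration charge concrete, here is a small sketch. The per-GB-month prices are my own approximations (roughly the us-east-1 rates, which change over time), not figures from this thread:

```python
# Approximate per-GB-month storage prices (assumed; check current AWS pricing).
GLACIER_PER_GB_MONTH = 0.0036        # S3 Glacier Flexible Retrieval
DEEP_ARCHIVE_PER_GB_MONTH = 0.00099  # S3 Glacier Deep Archive

def storage_cost(size_gb, price_per_gb_month, days_stored, min_days):
    """Storage cost honoring the minimum storage duration:
    objects deleted before min_days are billed as if kept for min_days."""
    billed_days = max(days_stored, min_days)
    return size_gb * price_per_gb_month * billed_days / 30

# A 50 MB dblock deleted after 30 days in Glacier is billed for the full
# 90 days anyway, so the early deletion saves nothing on that object:
early = storage_cost(0.05, GLACIER_PER_GB_MONTH, 30, min_days=90)
full = storage_cost(0.05, GLACIER_PER_GB_MONTH, 90, min_days=90)
assert early == full
```

The practical upshot is that pruning old versions still works; it just doesn’t reduce the bill until the minimum duration has elapsed.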

Unlimited retention will bog down your local sqlite database over time. Depends on how many backups you take per day, how many versions it’s tracking, how large the backup is, your dedupe block size, etc.

I’m currently in the process of making the switch from B2 to S3 for the second copy of my backup data. My primary copy is on a NAS, and it will synchronize with S3. Not sure if it’s too aggressive, but my plan is to use Standard-IA when objects are first placed in the bucket and transition to Deep Archive after 14 days.

@liquefry - how are you doing this? Are you storing your Duplicati files in the root of the bucket? If you have “folders” then it looks like you need to specify the folder name, e.g. folder1/duplicati-b as the prefix filter.

In my case I have numerous “folders” - one for each backup on each PC. As you point out, wildcards aren’t supported in the Lifecycle prefix area, so it looks like I may have to set up a lifecycle rule for each of these folders. One alternative I’ve seen is to use Lambda to tag the objects and then you can filter by tag in the lifecycle policy. I might have to dig into that a bit deeper.

Spent the day messing around with Lambda functions. Never really worked with them before. I managed to get one created that is triggered on S3 put and will tag it if it’s a dblock. Unfortunately it only works intermittently and I have no idea why. I can see in the logs on AWS that the function is being triggered for each upload, but when I check the S3 object only some of them will be tagged.

I’m taking a step back and thinking a simpler solution is to change the lifecycle policy so that instead of working on prefixes or tags, it just works on a minimum object size. If my remote volume size is 50MB, I can set the minimum size to just under 50MB so only the dblocks are matched. No need to match on filename. (None of my dindex or dlist files come anywhere close to 50MB…)
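For anyone wanting to try the same approach: S3 lifecycle filters support a size threshold in bytes. A sketch of such a rule (the ID, threshold, and transition delay are illustrative choices, not from this thread):

```json
{
  "Rules": [
    {
      "ID": "archive-large-dblocks",
      "Filter": { "ObjectSizeGreaterThan": 41943040 },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 14, "StorageClass": "DEEP_ARCHIVE" }
      ]
    }
  ]
}
```

41943040 bytes is 40 MiB, comfortably above any dindex/dlist file while still catching 50MB dblocks, even ones that come in slightly under the configured volume size.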

That’s correct, but not really a problem for me. I use the JSON version of the policy generated in the AWS web UI. Then I just copy/paste this with the different prefixes and an arbitrary ID each. That’s less than 5 minutes of work for dozens of “folders”. All this goes into 1 JSON file which you can then activate via the AWS CLI tool:
aws s3api put-bucket-lifecycle-configuration --bucket myBucket --lifecycle-configuration file://myLifecyclePolicyFile.json


Ah, very cool that it can be done that way! Thanks for the tip.

I think I’m good with just using the minimum size setting on the lifecycle policy and not messing with prefixes at all. Tested it out the other day and it worked as expected. In one way it might actually be better, because small dblocks won’t be transitioned to deep archive. Those smaller ones are candidates for consolidation when Duplicati does a compaction.

Once the dust settles it’ll be interesting to compare my cost in S3 vs B2.