I am looking for a tool for long-term backup. Most of the data will not change much and consists of photos that will never change (~75 %) and documents (the Windows My Documents folder). Since these files are already synced across several computers, the main goal is to protect the data against e.g. user error or encryption malware.
So I plan to use Amazon S3 Intelligent-Tiering. Most data won't change and should therefore automatically end up in the Archive and later the Deep Archive tier, where storage costs almost nothing, but the data should not be accessed or changed there. Since data is shifted back into the expensive tiers every time it is accessed, I would like to understand better how Duplicati works with a custom retention policy and when deleting old backups.
Here are my questions:
Is it possible to choose the S3 Intelligent-Tiering storage class in the setup?
Am I correct that Duplicati only accesses containers holding blocks that change because a backup version is deleted by the retention policy, plus the latest container, where the most recent changes are stored? If so, --no-auto-compact=true is not needed, because containers without changes don't need compacting and are never accessed.
If 80 % of the data never changes, will most of the containers not be accessed and therefore end up in the archive tiers?
Data that changes more often will be in the latest containers, which are accessed often.
A custom retention policy with fewer steps will lead to fewer changes in the containers, so they will be moved to the archive tiers sooner, e.g. 2W:1D,12M:1M,U:6M (one version per day for two weeks, one per month for twelve months, then one per six months forever).
--no-backend-verification=true disables verification completely. Is it possible to verify only freshly uploaded data and leave old data untouched?
If someone already has some experience with S3 Intelligent-Tiering, I would be happy to hear about it.
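Regarding question 1: as far as I understand, Duplicati simply uploads objects with whatever storage class its S3 backend is told to use, while the automatic archiving inside Intelligent-Tiering is a bucket-level setting configured on the AWS side, not in Duplicati. Below is a minimal boto3 sketch of that bucket configuration; the bucket name and configuration id are placeholders, and the 90/180-day thresholds are the AWS minimums.

```python
# Sketch: enable the optional Archive / Deep Archive access tiers for
# S3 Intelligent-Tiering on the backup bucket. The bucket name and the
# configuration id are placeholders.
import boto3

s3 = boto3.client("s3")

s3.put_bucket_intelligent_tiering_configuration(
    Bucket="my-duplicati-backups",      # hypothetical bucket name
    Id="archive-old-objects",
    IntelligentTieringConfiguration={
        "Id": "archive-old-objects",
        "Status": "Enabled",
        "Tierings": [
            # Objects not accessed for 90 days move to Archive Access,
            # after 180 days to Deep Archive Access (AWS minimum values).
            {"Days": 90, "AccessTier": "ARCHIVE_ACCESS"},
            {"Days": 180, "AccessTier": "DEEP_ARCHIVE_ACCESS"},
        ],
    },
)
```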
I think the automatic tiering between the Frequent and Infrequent Access tiers would be fine, but in my opinion the archive tiers should be avoided. Some people here have experimented with Glacier and got backups to work, but I don't know how their restores go when object availability is measured in hours.
Duplicati accesses (reads) the files in the bucket when you do a restore, run verifications, run compactions, recreate the database, etc. Some of these functions can be disabled. Also, unless you're using unlimited retention, data blocks are going to be pruned, so even older files in your bucket may end up with 'wasted' space, possibly triggering a compaction event. You would still want to use --no-auto-compact if your goal is to avoid accessing the colder storage tiers.
If you turn off verification and compaction, then yes, the files in your bucket will not be accessed and will automatically move to a lower tier per your settings. But as mentioned above, the archive tiers could be a serious issue: if you need to restore or recreate the database, it may not work right.
The most recent files aren't accessed any more often than older files if you disable compaction and verification. With those things disabled, once a file is placed in the bucket it won't be read again unless you do a restore, a database recreation, etc. If you are not using unlimited retention, Duplicati will delete files as data ages out. With compaction disabled, it won't be able to delete a dblock file until all data referenced in that file has aged out.
Specific custom retention settings don’t really affect this with compacting disabled.
I don't think there's a way to have Duplicati test only the files that were just uploaded, but I could be wrong. The automatic verification chooses random files to test.
If your goal is to reduce costs, I recommend checking out Backblaze B2 or Wasabi. Both are hot storage, priced at about the same level as the AWS S3 Glacier tier, and you'll avoid all the potential issues.
ad 1) If I ever need the backup, which I think will never happen since it's only the last line of defence, I could manually have Amazon shift the data to the frequent tier by accessing it. But as far as I understand it, Amazon will automatically shift it back to the frequent tier anyway as soon as I try to fetch the backup. So if there is a problem, I just wait a day and run the restore again. Time is not an issue for me.
ad 2+3) I plan to use unlimited retention. I'm thinking of a retention policy like 2W:1D,6M:1M,U:6M. So after six months there should be no more pruning or compaction, and therefore the data would not be accessed anymore. Right?
I don't care how long it takes to retrieve a backup, and I will most likely never need it. Therefore it should be as cheap and as secure as possible for long-term storage. Both Backblaze at $0.005/GB and Wasabi at a minimum of $5.99 are more expensive than S3 Deep Archive at $0.00099/GB. Backblaze doesn't have an SLA, while S3 stores the data in three different data centers with eleven nines of durability, and I can choose to keep my data in the EU. So basically, if the data is never accessed, storing it costs next to nothing.
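To make the price gap concrete, here is a quick back-of-the-envelope comparison using the per-GB prices quoted above; the 200 GB data size is just an assumed example, not my actual backup size.

```python
# Rough monthly storage-cost comparison using the quoted per-GB prices.
# The data size is an assumed example value.
data_gb = 200

price_per_gb_month = {
    "Backblaze B2": 0.005,       # $/GB-month
    "S3 Deep Archive": 0.00099,  # $/GB-month
}

for provider, price in price_per_gb_month.items():
    print(f"{provider}: {data_gb * price:.2f} $/month")

# Backblaze B2: 1.00 $/month
# S3 Deep Archive: 0.20 $/month
```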
Yes, I think Amazon does recall the data automatically upon access, but I don't know how that works together with Duplicati. I have not tested archive storage at all, but I imagine Duplicati will try to access one file, get a timeout, and then fail the entire restore. Amazon will only know that one file read was attempted. Maybe there's a way to recall all objects using the AWS console.
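Recalling all objects can also be scripted instead of clicking through the console. Below is a rough sketch that requests a bulk retrieval for every archived object before attempting a Duplicati restore; it assumes the objects sit in the GLACIER or DEEP_ARCHIVE storage classes and uses a placeholder bucket name (objects archived inside Intelligent-Tiering are restored with a slightly different request).

```python
# Sketch: request a bulk retrieval for every archived object in the bucket
# before attempting a Duplicati restore. Assumes GLACIER / DEEP_ARCHIVE
# storage classes; the bucket name is a placeholder.
import boto3

s3 = boto3.client("s3")
bucket = "my-duplicati-backups"  # hypothetical bucket name

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        if obj.get("StorageClass") in ("GLACIER", "DEEP_ARCHIVE"):
            s3.restore_object(
                Bucket=bucket,
                Key=obj["Key"],
                RestoreRequest={
                    "Days": 7,  # keep the temporary copy around for a week
                    "GlacierJobParameters": {"Tier": "Bulk"},  # cheapest tier
                },
            )
```

Once the retrievals finish (bulk requests against Deep Archive can take up to about 48 hours), the Duplicati restore can be run against the temporarily readable copies.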
Sorry, when I said unlimited I meant ‘retain all versions’. You are pruning versions, so compaction will happen unless you disable it.
And yes, Deep Archive is really, really cheap, but it may not work out in the end. I recommend you test it on a small data set first before committing to it, in case there is a hidden danger there. If you can report back, that'd be great!
In the worst case, I manually recall the data first, so that it is back in the standard tier, and then do the restore.
As far as I understand Duplicati, it stores the files as blocks and the blocks in containers, so new versions should end up in new containers. After six months I keep only one version per six months. So at the latest after one year there should be no versions left to prune, and the containers should not be accessed anymore. Or have I misunderstood something?
On the other hand, since Deep Archive is very cheap, it may even be cheaper to keep more versions and get the containers into Deep Archive earlier.
Testing is difficult because it takes half a year to get data into the deep archive. But maybe I will make a test backup of a small data sample and try to restore it in half a year.
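While waiting, the transition can at least be watched without reading the objects themselves: listing the bucket and HEAD requests show the storage class (and, for Intelligent-Tiering objects, the archive status) without triggering a retrieval. A small sketch with a placeholder bucket name:

```python
# Sketch: check which tier the test backup's objects are in without
# downloading their contents. The bucket name is a placeholder.
import boto3

s3 = boto3.client("s3")
bucket = "my-duplicati-test-backup"  # hypothetical bucket name

paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=bucket):
    for obj in page.get("Contents", []):
        # StorageClass is e.g. STANDARD, INTELLIGENT_TIERING or DEEP_ARCHIVE.
        # For Intelligent-Tiering objects that have been archived, HEAD also
        # reports an ArchiveStatus of ARCHIVE_ACCESS or DEEP_ARCHIVE_ACCESS.
        head = s3.head_object(Bucket=bucket, Key=obj["Key"])
        print(obj["Key"], obj.get("StorageClass"), head.get("ArchiveStatus", "-"))
```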
Because of deduplication, the file blocks for your most recent backup may be spread across many different backup files (the 'containers', as you call them). With each backup, Duplicati only stores blocks that have never been stored before (and yes, those go into one or more new dblock files).
Versions over a year old will no longer get pruned, if that's what you mean. But pruning will still happen among the more recent backup versions.
I think with deduplication it can be hard to predict which backup files (containers) will have space freed up as backup versions are pruned.
You seem intent on going down this deep archive route! I would definitely recommend disabling compaction and disabling verification. Keeping all versions may be a good option as well, but it might depend on how often you run your backup job, since the local job database has to track all these versions. Speaking of which, depending on how large your source data set is, there are probably other optimizations you should make (like adjusting the deduplication chunk size, etc.).
Thanks to your suggestions I will change my strategy a little. I will stay with Deep Archive, because I think it is the least expensive option with very high reliability for data that I expect never to need to restore.
I will create a new folder for the data that I want to back up. Then only very few files will change, and those are small files, so pruning and compaction are useless anyway. Does this also mean I don't need any retention policy, since all data is kept anyway?
I will create a few small test backups. In half a year they should be in Deep Archive, and then I can test how the restore works. I agree with you that a backup whose restore I have to doubt is useless.
Thanks for all the input. I did some more reading on S3 and took another look at Backblaze's pricing. I realized that Backblaze is so cheap that it isn't worth the hassle with S3. I will give Backblaze a try.
I think you'll be happy with it. I tried S3, Wasabi, and Backblaze, and all three worked very, very well for me. I decided on Backblaze B2 because of its price and because you can have the bucket contents shipped to you if you want. Wasabi is nice in that it is about as cheap and has no egress costs.
I like this thread, because I’ve been thinking the same stuff a lot.
"If I need a backup, which I think will never happen since it's only the last line of defence"
But you’ve got a serious problem there.
You need to test backups regularly, which means reading backups. Duplicati seems to be so unreliable that without that step the end result is exactly the same as no backup at all. And do not rely on the Duplicati test feature, it's also unreliable; do a full restore on a completely separate system (without the local database).
Secondly, I used to think mostly the same, until the SBG2 fire happened. I had to restore, ahem, a lot of DR backups which were kept just in case, for the event that 'never happens'. If I hadn't done the regular testing, I probably would have had many backup sets that were beyond restore or repair.
But sure, for DR backups I've been thinking the same and using the archive tiers, where it just takes time to get the volumes. Because of point 1, though, you'll probably need to keep the primary copy on higher-availability storage, which you then simply mirror to the DR storage using rclone, and then you can deal with the tiers in that client as well.
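For what it's worth, that primary-plus-DR-mirror setup can also be sketched without rclone when both copies are S3-compatible buckets. A rough boto3 version with hypothetical bucket names, copying new objects into the DR bucket directly into a colder storage class (the post above uses rclone for this step instead):

```python
# Sketch: mirror a "hot" primary Duplicati bucket into a DR bucket,
# writing the DR copies straight into the DEEP_ARCHIVE storage class.
# Bucket names are placeholders.
import boto3

s3 = boto3.client("s3")
primary = "duplicati-primary"  # hypothetical hot bucket
dr = "duplicati-dr"            # hypothetical DR bucket

paginator = s3.get_paginator("list_objects_v2")

# Keys already present in the DR bucket, so only new objects get copied.
existing = set()
for page in paginator.paginate(Bucket=dr):
    existing.update(obj["Key"] for obj in page.get("Contents", []))

for page in paginator.paginate(Bucket=primary):
    for obj in page.get("Contents", []):
        if obj["Key"] not in existing:
            s3.copy_object(
                Bucket=dr,
                Key=obj["Key"],
                CopySource={"Bucket": primary, "Key": obj["Key"]},
                StorageClass="DEEP_ARCHIVE",
            )
```

This only adds new files and never deletes anything on the DR side, which also protects the mirror against malware that wipes the primary bucket.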
Btw, because the SBG2 restore went without serious problems, I'll make a donation to the Duplicati project, because it really did save my case this time.