I have a 5TB external USB drive.
I want to back up this drive to S3 storage.
First of all, so that there are no miscommunications, I want to make one thing clear:
I want to back up FROM the external HD, not TO the external HD.
I want to back up some folders on this HD, totaling about 2 TB of files.
The first problem is that this HD is not always connected, so I want to set things up in a way that doesn’t cause problems.
An example of a problem that can happen is the backup starts when the HD is not connected, and Duplicati considers that the files have been deleted, and deletes the files on S3.
I want to prevent this from happening; otherwise the backup would have to start over every time I reconnected the external HD.
In this same backup I will also add some local folders (from C:), and those are present at all times, unlike the external HD, which is sometimes connected and sometimes not.
Can I make these 2 backups together, selecting the local folders and the folders on the external HD, even though I know that some will always be present and others will not?
If at some point I manually remove folders from the backup folder list, will they be deleted from S3?
While I’m at it, I’d like to ask for advice on the best settings to use, considering that:
I currently have 1TB of files (but it can go up to 2TB)
There are about 200,000 files in roughly 5,000 folders, of many different types, sizes, and formats.
There are thousands of small files of a few KB each, and hundreds of large files of several GB each, including files of 5, 10, 15, 20, 40, and 50 GB.
I intend to create a lifecycle rule in S3 to move the files to Glacier.
Considering all this, what are the best settings for my case?
There’s a lot to unpack here, and many of these are interdependent.
I think you have a couple options here. If the external HD was in its own backup job, you could use run-script-before to decide if the backup should run or not. (It could test for the presence of the external hard drive.)
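As a sketch of what that pre-backup check could look like (the `E:\backup-folders` marker path is just an assumed example; I’m relying on Duplicati’s documented script exit codes, where 0 means “OK, run the backup” and 1 means “OK, but skip this run”, and on the `DUPLICATI__OPERATIONNAME` environment variable it sets when invoking scripts):

```python
import os
import sys

# Hypothetical marker path on the external drive -- adjust to your setup.
DRIVE_MARKER = r"E:\backup-folders"

def exit_code_for(path: str) -> int:
    """Map drive presence to a run-script-before exit code.

    Per Duplicati's scripting docs: 0 = OK, run the backup;
    1 = OK, but skip this run (no error, nothing deleted on the back end).
    """
    return 0 if os.path.isdir(path) else 1

# Duplicati sets DUPLICATI__OPERATIONNAME when it invokes the script,
# so only exit with the code when actually called as run-script-before.
if os.environ.get("DUPLICATI__OPERATIONNAME"):
    sys.exit(exit_code_for(DRIVE_MARKER))
```

On Windows, `--run-script-before` expects an executable, so you’d point it at a small .bat wrapper that runs this script and forwards its exit code (`@python C:\scripts\check_drive.py` followed by `@exit /b %ERRORLEVEL%`).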
But in question 2 you say that you also want files on C: in the same backup. Must it be in the same backup?
This depends on your retention policy. What do you intend to use?
They will be if dictated by your retention policy. Also, you can use the PURGE command to delete files from the backup data.
Then I would recommend setting --blocksize to 1 or 2 MB. This is the deduplication block size and must be set before the first backup is performed – it cannot be changed later. (See Choosing sizes in Duplicati for more info.)
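To make the trade-off concrete, here’s the rough arithmetic (a sketch: Duplicati tracks one entry per deduplication block, so the block count is what stresses the local database; 100 KB is the long-time default block size):

```python
def block_count(total_bytes: int, block_size: int) -> int:
    """How many dedupe blocks Duplicati must track for a given source size."""
    return -(-total_bytes // block_size)  # ceiling division

TB, MB, KB = 1024**4, 1024**2, 1024

# 2 TB of source data at the 100 KB default vs. a 1 MB blocksize:
print(block_count(2 * TB, 100 * KB))  # 21474837  (~21.5 million blocks)
print(block_count(2 * TB, 1 * MB))    # 2097152   (~2.1 million blocks)
```

Roughly a tenfold reduction in database entries, which is why the larger block size matters for multi-TB backups.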
Archive-class storage with most providers can cause issues with Duplicati. If you insist on using it, you probably need to set a couple options as mentioned in this post. Also you probably should use unlimited retention.
If you use unlimited retention, then your earlier concerns may be irrelevant (except the PURGE-ing of data).
First of all, thank you for your detailed and thoughtful response, I greatly appreciate the help.
I could even split it into 2 different backup jobs; for convenience I’d prefer everything in the same backup, but I can split them if needed.
But I think that even with a script doing the check (which I don’t know how to write), any problem, such as the HD being disconnected or unmounted for whatever reason (and this does happen sometimes), would make Duplicati delete all the files from S3.
Ideally there would be some option for Duplicati to understand that it is an external drive, and not delete anything when the drive is not connected/mounted – only delete files once it verifies they were really deleted from the external drive.
Is there any way to do this?
I intend to configure it to delete files 7 days after they were deleted from the drives.
(But I don’t want it to delete them if the drive goes 7 days without being mounted.)
Okay, I’ll use this configuration.
The problem with unlimited retention is that I have large files that are constantly changing, for example some 40–50 GB VHDs that are modified with each use.
Over time, this will get astronomical.
Nope, it wouldn’t delete files on the back end just because they are not present on the front end. Remember Duplicati is a backup tool, not a synchronization tool.
Duplicati works by storing backup versions - that is, the way the filesystem looked at the time of backup. Depending on your retention policy, those versions are retained for some amount of time and then deleted (unless you are using unlimited retention).
I wouldn’t use 7-day retention then. If you had a single backup job covering both your external drive and your C: drive, a 7-day retention won’t take into account an external USB drive that has been missing for 7+ days, and it will delete those files from the backup.
What if you set up separate backup jobs and didn’t schedule the USB external drive for automatic backup? Instead you could trigger that backup manually whenever the drive is connected. If you do this, the 7-day retention would probably be fine.
But I do think 7-day retention is really short. Duplicati’s dedupe engine makes it very efficient to store many versions without greatly increasing storage. It does depend somewhat on your churn rate and how dedupe-able your files are, of course. As an example, on my main PC I have 283 backup versions going back to the fall of 2017, when I first started using Duplicati. The front end size is 45GB and the back end size is only 110GB.
Yep, unlimited retention will expand over time. But you are trying to use Glacier. Glacier has a 180 day minimum charge time, so your 7 day retention is at odds with that and you may find it ends up costing you a lot more than you expect.
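A quick sketch of the early-deletion arithmetic (the per-GB-month price below is an illustrative placeholder, not a quote – check current AWS pricing):

```python
def billed_days(age_days: int, minimum_days: int = 180) -> int:
    """Storage days billed for an object under a minimum-duration charge."""
    return max(age_days, minimum_days)

# A volume deleted by a 7-day retention policy is still billed
# as if it had been stored for the full 180 days:
print(billed_days(7))    # 180

# Illustrative cost for 1 TB deleted after only 7 days, assuming a
# placeholder price of $0.004 per GB-month (NOT current AWS pricing):
price_per_gb_month = 0.004
cost = 1024 * price_per_gb_month * (billed_days(7) / 30)
print(round(cost, 2))    # 24.58
```

So with short retention you pay the full minimum duration for every volume you churn, which is the hidden cost this answer is warning about.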
If you were using Standard S3, or Standard-IA, then you definitely wouldn’t need to use unlimited retention. But Glacier changes things quite a bit. You also can’t easily do restores without changing the storage class of all your objects.
There is a simple alternative here, which is to attach the backup drive to a machine with its own Duplicati installation, e.g. a NAS server.
If you want to control the drive’s backup with Duplicati from another machine, I feel that overcomplicates it. But it could probably be made less complicated with a possibly simple code change: just a per-backup flag marking it as a backup from an external drive, so Duplicati skips certain things when the drive is offline. I don’t know how Duplicati would react to that, hence only “possibly” simple.
Or maybe Duplicati shouldn’t do anything when the source location (perceived as local, not remote) isn’t found, since it might clearly be an external drive. I don’t know what Duplicati sees there (the type of connection to the drive), but there might be an easy way to identify that and simply block the backup from running in the first place, without a new setting.