I have been using Duplicati for a couple of years now, and after some struggles with the initial setup, everything has been working great. I have four backup tasks set up, and recently I noticed that one of them always takes 3+ hours even when no files have changed, while the other three finish in about 10 minutes, even with a lot more data.
I compared the logs for all the backups and noticed that while the other three had a relatively small number of “Source files examined”, the one that always takes 3+ hours examines 1.7 million source files in only 630 GB of data, while the other tasks have more like 20,000 files over a couple of TB. So apparently there is a huge number of small files in this task.
How does Duplicati handle this? Is it bad for my disks to run for 3+ hours every week? Is there any way I can optimize this backup task, maybe with certain parameters or something?
Duplicati enumerates all the files from the selected sources, and this involves a lot of directory accesses. I did quite a few tests this weekend, and IIRC scanning 250,000 files took less than 5 minutes (SSD hardware, lots of free RAM, 8 recent cores), so 3 hours for 1.7 million seems a bit slow. It may be linked to your system (not enough memory, slow hardware, or a slow operating system), and in that case there is not much that can be done.
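If you want to see which folders contribute most of those 1.7 million files, something like this on the NAS shell can help (GNU coreutils; /volume1/data is just a placeholder for your actual source path):

du --inodes -d1 /volume1/data | sort -n | tail

That lists the top-level directories with the highest file counts, which makes it easier to decide what to exclude or clean up.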
If by any chance it’s the database that is the choke point, you could try updating to Duplicati 2.0.7 and setting the environment variable
CUSTOMSQLITEOPTIONS_DUPLICATI=cache_size=-200000
to give more memory to the database engine.
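If Duplicati runs in Docker on the NAS (just my assumption about your setup), the variable needs to be set on the container itself, for example:

docker run -d --name duplicati -e CUSTOMSQLITEOPTIONS_DUPLICATI=cache_size=-200000 duplicati/duplicati

(plus whatever volume and port options you normally use). A negative cache_size is interpreted by SQLite as KiB, so -200000 is roughly 195 MiB of page cache.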
I managed to delete almost 700,000 files already, because it turns out I had multiple complete backups of Windows installations lying around on my NAS.
I’m mainly just wondering if this many small files is bad for my hard disks or something, and if I should seriously do something about it. I had one hard drive in my NAS fail recently, a little early for its age. It could be a coincidence of course, but when I went looking around the NAS I found this super-high file count, so I’m just trying to connect some dots :).
Windows NTFS has a shortcut Duplicati can use for this, but I suspect this is your NAS running Duplicati in Docker, where that doesn’t apply.
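(For anyone on Windows reading this: the shortcut is the NTFS USN change journal, which, if I remember the option name correctly, is enabled per backup with the advanced option usn-policy, e.g.

--usn-policy=auto

so Duplicati can ask the filesystem what changed instead of walking every directory.)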
Hard to say, although speculation is possible. Some people think failures are kind of random, but
if you’d asked this on a forum with a lot of hardcore hardware people, you’d get more opinions…
Since nobody seems to want to jump in, I’ll try. First, your disk doesn’t directly see files; that’s the OS’s job.
The disk is probably pretty busy doing random accesses, as opposed to sequential reads of large files.
If you have any drive monitoring software, you might want to use it to see what’s actually going on.
If it can give you S.M.A.R.T. stats, that will feed into the next question – are you overusing the drives?
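If the NAS gives you shell access, smartmontools is one way to pull those stats (a sketch; I’m assuming the data disk shows up as /dev/sda, adjust as needed):

smartctl -H /dev/sda
smartctl -A /dev/sda

The first prints the drive’s overall health self-assessment, the second the attribute table (reallocated sectors, power-on hours, and on some drives total LBAs read/written).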
Hard drive makers seem to be adding more metrics to their specs; we now have workload ratings, e.g. Annualized Workload Rate (Seagate), which looks at the reads and writes per year for the different classes and varieties of drives they sell. Is it real or marketing? I’d think it’s probably a bit of both…
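To put a rough number on it (using Seagate’s definition as far as I recall it): annualized workload ≈ (lifetime TB read + written) × 8760 / power-on hours. A drive that has moved 30 TB in 1,200 power-on hours would be running at about 30 × 8760 / 1200 ≈ 219 TB/year, which you’d compare against the rated workload on the spec sheet (commonly around 180 TB/year for NAS-class drives and 550 TB/year for enterprise ones).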
There is little large-scale data to answer such questions. Backblaze’s reports don’t study workload.
I think they’ve looked at other ways to predict failures. Keeping an eye on S.M.A.R.T. yourself might be one way.
Different NAS and drive vendors may have other ways, some of which are tied up in the WD mess.