What is the backup process with lots of small files?

Hi there,

I have been using Duplicati for a couple of years now, and after some struggles with the initial setup, everything has been working great. I have four backup tasks set up, and recently I noticed that one of them always takes 3+ hours even when no files have changed, while the other three are done in about 10 minutes, even with a lot more data.

I compared the logs for all backups and noticed that while the other three backups had a relatively small number of “Source files examined”, the one that always takes 3+ hours has 1.7 million source files examined in only 630 GB of data, while the other tasks have more like 20,000 files over a couple of TB. So apparently there is a huge number of small files in this task.

How does Duplicati handle this? Is it bad for my disks to run for 3+ hours every week? Is there any way I can optimize the backup task for this? Maybe with certain parameters or something?

Hello

Duplicati enumerates all the files from the selected sources, and this involves a lot of directory accesses. I did quite a few tests this weekend, and IIRC scanning 250,000 files took less than 5 minutes (SSD hardware, lots of free RAM, 8 recent cores), so 3 hours for 1.7M seems a bit slow. It may be linked to your system (not enough memory, slow hardware, or a slow operating system), and in that case there is not much that can be done.
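
To put rough numbers on that comparison (simple arithmetic, assuming the 3 hours is spent mostly on scanning):

    # files scanned per second, roughly
    echo $((250000 / 300))     # the test above: ~833 files/s (250,000 files in 5 minutes)
    echo $((1700000 / 10800))  # your job: ~157 files/s (1.7M files in about 3 hours)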

If by any chance the database is the bottleneck, you could try updating to Duplicati 2.0.7 and setting the environment variable
CUSTOMSQLITEOPTIONS_DUPLICATI=cache_size=-200000
to give more memory to the database engine.
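
If Duplicati runs in Docker (as it often does on a NAS), the variable has to be set on the container itself. Here is a minimal sketch with docker run; the image name is an assumption, so keep whatever volumes and ports you already use and just add the -e flag:

    # cache_size=-200000 asks SQLite for roughly 200,000 KiB (~195 MiB) of page cache.
    docker run -d --name duplicati \
      -e CUSTOMSQLITEOPTIONS_DUPLICATI=cache_size=-200000 \
      duplicati/duplicati

If you use docker-compose instead, the same assignment goes under the service's environment: section.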

More statistics, please. Below are examples from the job home page and the log.

It sounds like your source is 630 GB. High version counts can add delay.

[screenshot: job home page showing the Source size]

It sounds like your Examined count is 1.7 million files while the other values are zero:

[screenshot: job log showing the Examined files count]

If the other values are not zero, you can look in the Complete log, e.g. at BytesUploaded, to see if it’s low too.

Your status bar should show “Counting” soon after the backup starts. After that, it’s got 1.7 million files to look over.

Windows NTFS has a shortcut (the USN change journal) that Duplicati can use, but I suspect this is your NAS running Duplicati in Docker.

Checking for changes on 1.7 million files at about 150 per second on hard drives seems plausible: 1,700,000 ÷ 150 is roughly 11,300 seconds, or a bit over 3 hours.

It looks like this:

I managed to delete almost 700,000 files already, because it turns out I had multiple complete backups of Windows installations lying around on my NAS.
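
For anyone else chasing this down, a rough way to see which top-level folders hold the most files (a minimal sketch; /volume1/data is just a placeholder for your own source folder):

    # Count files under each top-level folder and list the biggest offenders first.
    for d in /volume1/data/*/; do
        printf '%s\t%s\n' "$(find "$d" -type f | wc -l)" "$d"
    done | sort -rn | head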

I’m mainly just wondering if this many small files is bad for my hard disks or something, and if I should seriously do something about it. I had one hard drive in my NAS fail recently, and it seems a little early for its age. It could be a coincidence of course, but when I went looking around the NAS I found this very high file count, so I’m just trying to connect some dots :).

Windows NTFS has a shortcut (the USN change journal) that Duplicati can use, but I suspect this is your NAS running Duplicati in Docker.

Yes, that is correct.

Hard to say, although speculation is possible. Some people think failures are kind of random, but
if you’d asked this on a forum with a lot of hardcore hardware people, you’d get more opinions…

Since nobody seems to want to jump in, I’ll try. First, your disk doesn’t directly see files; that’s the OS’s job.
The disk is probably pretty busy doing random accesses, as opposed to sequential reads of large files.

If you have any drive monitoring software, you might want to use it to see what’s actually going on.
If it can give you S.M.A.R.T. stats, that will feed into the next question: are you overusing the drives?
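
If the NAS gives you shell access, smartmontools can pull those stats directly. A minimal sketch, assuming the drive shows up as /dev/sda (adjust the device name for your system):

    # Requires the smartmontools package; /dev/sda is an assumption.
    smartctl -H /dev/sda   # overall health self-assessment
    smartctl -A /dev/sda   # attribute table: reallocated sectors, pending sectors, power-on hours, etc.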

Hold on a sec. When did HDDs get SSD-style workload rate limits?

Hard drive makers seem to be adding more metrics to their specs; we now have workload ratings, e.g.
Annualized Workload Rate (Seagate), which looks at the reads and writes per year for the different
classes and varieties of drives they sell. Is it real or marketing? I’d think it’s probably a bit of both…

Seagate’s page also says they can use this to limit warranty claims. Another vendor (now in the news):
Western Digital Begins Flagging 3-Year Old HDDs As Needing Replacement – just going by age…

There is little large-scale data to answer such questions. Backblaze’s reports don’t study workload.
I think they’ve looked at other ways to predict failures. Looking at S.M.A.R.T. yourself might be one way.
Different NAS and drive vendors may have other ways, some of which are tied up in the WD mess.

EDIT:

Drive Failure Over Time: The Bathtub Curve Is Leaking finds that, yes, mechanical drives can wear.
Google’s Disk Failure Experience summarizes a long 2007 report with more subtleties in the matter.