Capabilities and limitations near 1 million files?

Does anyone have any experience with larger sets of data? I’m looking at Duplicati for a few different servers, but one of them has been a bit of a pain for certain other backup products that I have tested.

Currently it’s sitting at 84.4 GB (90,710,489,019 bytes), which isn’t too bad, but this is spread across 866,808 files (3,223 folders), and the sheer number of files has caused slowdowns in a couple of products I have tested. While this averages out to around 100 KB/file, a good chunk of the files are much smaller.

The vast majority of these files (currently 859,368 of 866,808) have immutable content; the only operations performed are adding new files, moving files between directories (including renames), and deletions. The other files are essentially rewritten whenever they’re touched (I would not expect any 4 KB block to match a previous version of the file, as the files are small and regularly have data removed from within them).

Is this within the expected/intended use of Duplicati? Any suggestions for configuration knobs to tune? I would expect enabling --usn-policy might be worthwhile?

I intend to store this backup directly to B2 (this is a different environment than my local+copy-to-cloud question; I’m exploring where Duplicati would be a good fit, and ideally I’d like to find a solution that works across the gamut of data under my responsibility, to replace the piecemeal solutions I rely on now).
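For concreteness, the command I’m planning to test looks roughly like the sketch below. Treat it as a sketch only: the bucket, prefix and source path are placeholders, the B2 authentication options are left out, and I’m assuming the Duplicati 2 command-line option names (--dblock-size for remote volume size, --blocksize for the dedup block size, --usn-policy for the NTFS change journal); on Windows the executable would be Duplicati.CommandLine.exe rather than duplicati-cli.

```
# Sketch only: placeholder bucket/prefix/source path, B2 auth options omitted
# --dblock-size  remote volume size (default 50MB)
# --blocksize    deduplication block size (default 100KB)
# --usn-policy   use the NTFS USN journal to detect changed files
duplicati-cli backup "b2://my-bucket/server-backup" "D:\data" \
  --backup-name="static-set" \
  --dblock-size=200MB \
  --blocksize=100KB \
  --usn-policy=auto
```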

As long as the line of demarcation is pretty clear and easily established/maintained, it might be worth your while to set up two discrete backup sets: one for the more “immutable” subset of files, and one for the more dynamic subset. Various settings would most likely have different sweet spots between these two; the main ones I can think of off the top of my head are volume size (larger for the more static files, smaller for the more dynamic ones) and frequency of backup job runs. You would also probably want to implement, tweak and fine-tune Retention Policy settings, being especially heavy-handed on the set with the more static files (and by that I just mean they’d be thinned out more quickly and more heavily, and maybe with a shorter max lifespan, if any).
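As a rough illustration of what I mean (a sketch only; the timeframes are placeholders to show the --retention-policy syntax, not recommendations for your data):

```
# More static set: bigger remote volumes, old versions thinned out aggressively
--dblock-size=200MB
--retention-policy="7D:1D,4W:1W,12M:1M"

# More dynamic set: smaller volumes, run more often, versions kept denser for longer
--dblock-size=50MB
--retention-policy="30D:1D,12M:1W"
```

Each retention entry reads as “within this timeframe, keep at most one version per interval”, so the static set above drops to weeklies after a week and monthlies after a month, while the dynamic set keeps dailies for a full 30 days.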

Hi there :slight_smile:
I use Duplicati with backup sets containing up to 3.2 million files and 7 TB of data.
Runs fine, except that loading and/or rebuilding the databases takes a while :wink:


Unfortunately it’s the bigger, static set of files that really matters; the dynamic set is basically just metadata describing the static data set (plus some user-set flags). The data in the dynamic set is added/removed automatically based on the presence of the static set. I think it makes the most sense to match the backup and retention policies, but I need to ponder possible restore scenarios.

I’m running a backup on what was intended to be a subset of the static data, but due to a copy/paste mistake in the configuration there is no --exclude, so it turns out I’m less than an hour away from having a first full backup. If I need to tweak further I’m willing to delete and restart, but this gives me a starting point to test against.
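For what it’s worth, the bit I dropped was just a directory exclude along these lines (the path is purely hypothetical, and I’m going from my reading of the filter docs that a trailing separator excludes the folder and everything under it):

```
# Hypothetical example of the filter that got lost in the copy/paste
--exclude="D:\data\derived\"
```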

Awesome, thanks! My biggest worry is being on the edge of what has been done before, as that means I could bump into all sorts of undiscovered weirdness.

Thanks to both! I’ll post again if I run into any issues, but at least the initial run seems to be pretty smooth.
