Best practice for large data set (18TB)?

Hi

First, thank you to the development team for an awesome piece of software.

I have an 18TB data set currently backed up to Crashplan. But since Crashplan Home will no longer be available, I'm looking for a new solution, which is how I found Duplicati.

The 18TB data set consists mainly (~75%) of uncompressed black-and-white image files (.BMP) of about 15MB each (raw data from optical white-light 3D scanning). So there is huge potential for data reduction using compression and deduplication. On Crashplan the allocated archive size is about 6TB.

I've done some tests using Duplicati (local files to a local folder) with mostly default settings, and I've noticed the backup process slows down a good bit after a few hundred GB. I'm using a Xeon E3-1225 v3 with 32GB of RAM.

Are there any best practices for large data sets?

Thanks,
Chris

It depends on your ultimate goal. The current design of Duplicati doesn't scale well performance-wise when dealing with large numbers (or lengths) of file names/paths. It works just fine, but takes a while to get there - especially with file browsing during restores.

So if performance is your objective then you might consider having multiple jobs each handling a subset of your source data.

However, deduplication is most efficient when lots of files with similar content exist, so if reduced destination space is your goal, then a single backup will likely use less space at the destination.

We can't really predict exactly what performance improvements we'll see from the new design. I don't know whether existing backup sqlite files will be "upgradable" to the new format or not, but if an upgrade path is planned, then it might be worth taking the deduplication benefits and performance hit on a single backup now, based on the hoped-for improvements to come. But that's just me.

3 Likes

Thanks for your response.

Performance is not my goal. The intention is to create a storage archive, with little or no incremental change over time, using as small a footprint as possible. My only concern regarding performance is whether the initial backup will take months rather than weeks or days. The archive will eventually be stored remotely, but the initial backup will be done locally.

I assume the performance bottleneck for large local datasets is linked to access time and size of the deduplication table?

New design? From the current 2.0 beta?

Thanks,
Chris

The canary versions have some performance improvements over the 2.0 beta in areas like encryption and hashing, but the biggest performance hit with sources that have large numbers of files is usually in the sqlite database lookups.

Basically, the current database design (even in canary) uses an inefficient method to store file paths. So if you have 10,000 x 5MB files you'll get worse database performance than if you have 50 x 1,000MB files, even though in both cases you'll end up with about the same number of blocks (~512,000) and remote archive files storing those blocks (~1,000 - assuming NO deduplication).
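
To put rough numbers on that (a sketch of my own, just applying the default 100KB block and 50MB dblock sizes - not anything Duplicati reports itself):

```python
# Rough arithmetic behind the example above (my own sketch, not Duplicati code).
# Assumes the default 100KB block size and 50MB dblock (remote volume) size.

KB, MB = 1024, 1024**2

block_size = 100 * KB       # default block size
dblock_size = 50 * MB       # default remote volume size

for label, file_count, file_size in [("many small files", 10_000, 5 * MB),
                                     ("few large files", 50, 1_000 * MB)]:
    total = file_count * file_size
    blocks = total // block_size     # ~512,000 in both cases
    volumes = total // dblock_size   # ~1,000 in both cases (assuming no dedup)
    print(f"{label}: {blocks:,} blocks, ~{volumes:,} remote volumes, "
          f"{file_count:,} paths in the database")
```

Same block and volume counts either way - the difference is the 10,000 vs 50 paths the database has to store and look up.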

3 Likes

BTW: A good way of showing your appreciation for a post is to like it: just press the :heart: button under the post.

If you asked the original question, you can also mark an answer as the accepted answer which solved your problem using the tick-box button you see under each reply.

All of this also helps the forum software distinguish interesting from less interesting posts when compiling summary emails.

1 Like

Not to be flip but that sounds like a job for RAR files, not backup.

2 Likes

I did a 1TB test and I'm getting a compression and dedupe ratio very close to what I got on Crashplan, which is very impressive.

Some questions:

  1. Is a 1GB file size reasonable?
  2. Is there a way to do a rough estimate of the deduplication database size? Just to make sure I won't run out of space on my SSD.
  3. Are you guys planning to keep future versions backwards compatible?

Thanks,
Chris

1 Like

From what I understand RAR doesn’t do deduplication (unless you re-compress the archive to get some sort of deduplication)?

With block-based deduplication, 75% of my files (all unique files) get a reduction ratio of 4-5, which is why deduplication makes a lot of sense in my case.

Since I'm using FreeNAS for both the local and the remote server, it would be ideal to use the built-in ZFS compression/deduplication and replicate that to the remote server. The problem is the memory requirement: a minimum of 5GB of RAM per TB of data.
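
To put that in perspective (a rough calculation of my own, just applying that rule of thumb to the full data set):

```python
# Back-of-the-envelope for the ZFS dedup rule of thumb quoted above
# (~5GB of RAM per TB of pool data); illustrative only, not a ZFS sizing tool.
pool_size_tb = 18     # the full data set
ram_per_tb_gb = 5     # quoted rule of thumb
print(f"Estimated RAM for ZFS dedup: ~{pool_size_tb * ram_per_tb_gb} GB")
# ~90 GB - far more than the 32GB in my backup machine
```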

However, I'm open to suggestions for other solutions.

Thanks,
Chris

For volume size? Yes!

If you are primarily looking for dedup of whole files, you can also set --block-size=1mb to get fewer (but larger) blocks.

Each block (defined by --block-size, default 100kb) takes up ~40 bytes, so you get roughly “(data size / block size) * 40 bytes” for just storing the hashes.
Then you also need to store all paths, which is done inefficiently, so you need to store each path at least once.
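
To put that into numbers for the data set in question (a rough sketch of the arithmetic only - it ignores paths, logs and everything else in the database):

```python
# Rough estimate of the block-hash bookkeeping described above:
# (data size / block size) * ~40 bytes. Path storage comes on top of this
# and is the part that currently scales badly.

TB, MB, KB = 1024**4, 1024**2, 1024

data_size = 18 * TB   # the full data set from the first post

for label, block_size in (("100KB (default)", 100 * KB), ("1MB", MB)):
    blocks = data_size / block_size
    hash_gb = blocks * 40 / 1024**3
    print(f"block size {label}: ~{blocks / 1e6:.0f}M blocks, "
          f"~{hash_gb:.1f} GB of hash entries")
```

At the default block size that works out to roughly 7GB of hash entries for 18TB of source data; at 1MB blocks it drops to well under 1GB.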

Yes! In the sense that any new version will automatically work with data created by older versions.

2 Likes

Out of curiosity, if I set --block-size=1TB would that effectively disable block level deduplication? (Obviously, exactly matching files would still be deduplicated.)

Just to be clear, as with the version 1 to version 2 shift, the backup data created by older versions should still be usable, but that doesn't necessarily mean the backup jobs themselves will be usable, right?

(Not that there’s any plan that I’m aware of to change backup job formats again.)

1 Like

Awesome! Yes, volume.

Any idea when you guys will release the next ‘stable’ version? Just trying to resist the temptation of using the canary.

Thanks!

Yes. And you would get really big volumes …

The switch from 1.3.x to 2.0 was a big one. If we ever decide to make such a big change again (not likely) the two could be incompatible. But otherwise I intend to make sure the data always works with newer versions.

There is a bunch of upgrade logic in the code already, so it currently supports automatic upgrading from the very first 2.0 build to the current version.

As soon as possible :slight_smile:

We made this list, but I have not been able to fix any issues for the last few weeks: 2.0 stable Milestone · GitHub

1 Like

I'm sorry, but I think I'm missing something about estimating database sizes. Following your tip, for 1GB of data it would be (1,024 / 0.1) * 0.04 ≈ 410 KB per 1GB of data (10,240 blocks of 100KB at ~40 bytes each). But I tested a backup job of 10GB and the local database ended up at 45MB (or 4.5MB per 1GB).

Should I just expect it to be like that for every backup? 4.5MB for each 1GB, using the default 100KB blocks and 50MB dblocks.

For what it's worth, there's no way I'd go with dblocks this small for so large a backup set - I'd look into going with dblocks between 200MB and 2GB (depending on your download bandwidth considerations, if any).
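
To put some rough numbers behind that suggestion (my own arithmetic, assuming the ~6TB deduplicated size mentioned in the first post): the dblock size is a trade-off between how many files end up at the destination and how much has to be downloaded to restore a small amount of data.

```python
# Trade-off sketch for the dblock (remote volume) size suggestion above.
# Assumes ~6TB of deduplicated data at the destination, per the first post;
# the numbers are illustrative only.

TB, GB, MB = 1024**4, 1024**3, 1024**2

backend_data = 6 * TB

for dblock in (50 * MB, 500 * MB, 2 * GB):
    volumes = backend_data // dblock
    # Restoring even a single small file means downloading the whole volume(s)
    # its blocks live in, so bigger volumes cost more download per small restore.
    print(f"dblock {dblock // MB:>5}MB: ~{volumes:,} remote volumes, "
          f"~{dblock // MB}MB downloaded to restore one small file")
```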

Are you referring to a backup set as large as the 6TB~18TB the OP mentioned? Or the 1GB example I wrote about?

I have a 50/5 Mbps broadband connection, so my upload usually tops out at 500~600KB/s.

In the case of local backups, we really should raise these values, right? Maybe 1MB blocks?!

I suspect it’s mostly the “Then you also need to store all paths, which is done inefficiently…” part of that post that’s causing your database to be bigger than you expected.

Well, that and the fact that job-specific logs are ALSO stored in that database. You can use the --log-retention setting to configure that on a per-job basis, if you'd like:

--log-retention (default 30 days)
Set the time after which log data will be purged from the database.
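
Putting rough numbers on that (my own back-of-the-envelope, using only the ~40 bytes per block figure from earlier and the 45MB-for-10GB observation): the block hashes alone account for roughly 410KB per GB of source data, so most of the remaining ~4MB per GB will be paths, logs and other bookkeeping.

```python
# Comparing the theoretical hash storage with the observed database growth.
# Figures come from the posts above; the split is a guess, not a measurement.

GB, MB, KB = 1024**3, 1024**2, 1024

block_size = 100 * KB                          # default block size
hash_per_gb = (GB / block_size) * 40           # ~410 KB of hash entries per GB
observed_per_gb = 45 * MB / 10                 # 45MB database for a 10GB backup

other_per_gb = observed_per_gb - hash_per_gb   # paths, logs, indexes, ...
print(f"hash entries: ~{hash_per_gb / KB:.0f} KB/GB, "
      f"other bookkeeping: ~{other_per_gb / MB:.1f} MB/GB")
```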

Sorry, I was thinking of the OP. 1GB would be fine with the default size, IMHO.