Configuration tips for large (200TB) SME cloud backup

Hi everyone,

We’re setting up a cloud backup solution for our SME. Cost is a factor, so we’re exploring the use of Duplicati with Backblaze. Data volumes are large (~200TB) with a large number of files (200,000,000+), spread across a number of mounts that vary in size. The data is relatively static, so once the initial upload is complete we should be OK. We have a symmetric 1 Gbit/s line with no volume caps.

Has anyone here had experience configuring Duplicati for a project like this? Any suggestions relating to performance would be gratefully received; we are wondering, for example, how choices of block size and volume size might affect things. Are there practical limits to the size of any single database that might influence how we split the backups?

Many thanks for your help,
Chris


Hello

Does 1G mean 1 gigabit? If yes, you could not go below roughly 450 hours of upload (nearly 19 days), even if the network ran at its full nominal rate the whole time. You would need 10 Gbit/s and a system that could feed it, and I don’t quite see how to push Duplicati beyond 1 Gbit/s without using a quantum computer from 2050.
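A back-of-the-envelope sketch of that arithmetic, assuming the line runs flat out at its nominal rate with no protocol or processing overhead at all:

```python
# Rough lower bound for moving ~200 TB over a 1 Gbit/s line,
# assuming the link runs at its full nominal rate with no overhead.
data_bytes = 200e12        # ~200 TB
line_bits_per_s = 1e9      # 1 Gbit/s

seconds = data_bytes * 8 / line_bits_per_s
print(f"{seconds / 3600:.0f} hours (~{seconds / 86400:.0f} days)")
# -> about 444 hours, i.e. close to 19 days, as a best case; the same math
#    applies to a full restore in the other direction.
```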
Just use a block-based backup, I say.

Yes, 1 gigabit, and it will of course take a good month for the initial upload. However, most of the data is static and subsequent uploads will be incremental only and therefore much smaller.

I don’t think it’s reasonable to recommend Duplicati for this amount of data.

So what’s the benefit of Duplicati?!!

What’s the benefit of a car if it can’t even handle 20 tons of rock like any decent truck?

I see, you’re right, and I have almost the same issue. @gpatel-fr, what is the best upload speed to the cloud when using Duplicati? Does it depend on the cloud provider, or is it all down to the Duplicati configuration?

I think you might hold the record for size here, but Google allows numeric range search, so you can try
"duplicati" "1..999 tb" – which unfortunately misses cases where it’s all run together, such as 200TB.

The rough rule of thumb for Duplicati scaling is to stay below about 1 million blocks, otherwise the SQL gets slow. Sometimes raising the blocksize from the rather small default of 100 KB (good for about 100 GB of source data) solves that, but if there are lots of files, the tiny blocks (around a hundred bytes each) that hold file metadata ruin this, since you get at least one of those per file regardless of blocksize.
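A minimal sketch of that arithmetic for the numbers in this thread, assuming (as I understand it) roughly one small metadata block per file no matter what the blocksize is:

```python
# How the block count scales for ~200 TB across ~200 million files.
data_bytes = 200e12       # ~200 TB of source data
n_files = 200_000_000     # ~200 million files

for blocksize in (100e3, 1e6, 10e6, 50e6):   # 100 KB up to 50 MB
    data_blocks = data_bytes / blocksize      # blocks holding file contents
    meta_blocks = n_files                     # assumed ~1 tiny metadata block per file
    print(f"blocksize {blocksize / 1e6:5.1f} MB: "
          f"{data_blocks / 1e6:7.0f}M data blocks + {meta_blocks / 1e6:.0f}M metadata blocks")

# Even a huge blocksize leaves ~200M metadata blocks, two orders of magnitude
# past the ~1M rule of thumb, so the file count alone is a problem here.
```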

Although available volunteers and their equipment are too scarce to focus on performance testing, I recently tried a test with a larger backup than my usual rather selective one, and hit this issue because I had forgotten that I’d set up a folder of 10 million zero-byte files, so that’s an extreme small-file setup.

Basically (as near as I can tell) it bogged down in the database despite a very generous blocksize. Observations were made both at the drive level (the backup went to an external drive) and on file accesses.

Recently one trick was added, which is to increase the database cache; I haven’t tried it yet… Duplicati was running as a Windows service, where it’s harder to play with the environment variable. Possibly, after we gain confidence in that new knob, it will become available as a regular Duplicati option.
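For a manual command-line run (not the service case), my understanding is that the trick is an environment variable passing SQLite settings through to Duplicati. A rough sketch of the idea – the variable name, its value format, and the option values are from memory rather than checked against current docs, and the bucket and paths are placeholders:

```python
import os
import subprocess

# Sketch: run a Duplicati CLI backup with a larger SQLite page cache.
# CUSTOMSQLITEOPTIONS_DUPLICATI and its value format are from memory --
# verify against the current Duplicati docs/forum before relying on this.
env = dict(os.environ)
env["CUSTOMSQLITEOPTIONS_DUPLICATI"] = "cache_size=-200000"  # negative = KiB, so ~200 MB

subprocess.run(
    [
        "duplicati-cli", "backup",
        "b2://my-bucket/backups",   # placeholder Backblaze B2 destination
        "/data",                    # placeholder source mount
        "--blocksize=5MB",          # raised from the 100 KB default, per the discussion above
        "--dblock-size=200MB",      # larger remote volumes for a big backup
    ],
    env=env,
    check=True,
)
```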

My motivation for this test was that I had been using Macrium Reflect Free for occasional PC image backups; however, it’s going away, so I was looking for a replacement and wondered whether any file backup could perform as fast. The answer, at least for Duplicati, is that it can’t – but I’m happy enough with an image, provided I can restore specific files. Basically that, plus frequent selective file backup, works for me.

You’ve got both problems – lots of data and lots of files – so I think you should be looking elsewhere.
“Maximum usable size of the repository? Petabyte scale possible?” is a Kopia thread on large cases.

My personal opinion is that scaling well to a petabyte, or even to 200 TB, will not be possible for Duplicati; however, reliable backups (an issue for any solution) at smaller sizes are still a worthwhile goal.

Kopia is a somewhat newer entrant. You’ll see the thread mention restic, which I think is older, but it reportedly didn’t scale to huge sizes well either. Finding a mature free solution may be hard. Good luck…

If this is a bet-the-business situation, choose carefully and consider having several backups available.

Since most of the data is static, you probably want something that deduplicates well for the additional data. Many solutions do that.

There is certainly less expensive storage, even from major vendors, but there is usually a catch such as a minimum file retention period. Some may involve cold storage, which can be a pain if you want restores. I’m simply pointing this out because storage at this scale might get costly, and the choice goes hand in hand with backup software selection.

If you go image-based (I guess this was also referred to as “block based” above), then the question is how well deduplication works if you need frequent backups too. Some people have tried combining a base image with frequent backups done a different way; you can find them here asking for backups of recent file changes.

That’s about all I can suggest for generalities, but there are forums where large backups get discussed.

Err, that’s exactly what I have done since the last release. But I am aiming far lower than 200 TB.

Thank you very much for your advice, guys. We’re looking into other options too and allocating budget, as I think a paid solution may be the best option here.

You’ve also done lots more, so thank you. If you say your focus was performance testing, then I stand corrected.
The point I’m making is that this is a small team, with nobody volunteering purely for performance testing.
Compared to one company I used to work at, which had a well-funded performance department, we’re pretty light…

A figure like the “best upload speed to the cloud” asked about earlier is an example of the kind of number such a department would try to establish. Here nobody knows, because nobody reports it, and it depends on a huge number of factors. There’s always some limiting aspect, but where?

One comment I forgot to make in the first batch of advice: especially for a business, consider the cost of downtime in the event of a disaster. A full restore might be even slower than the initial backup. Is that survivable?

So if disaster occurs, is this restore the only thing that will be downloading for weeks or months, or is there competition for the line?
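To make that concrete, a small sketch of the restore window as a function of how much of the line the restore actually gets (the shares are just illustrative):

```python
# Days to pull ~200 TB back down, depending on the share of a 1 Gbit/s line
# the restore actually gets (other traffic, provider limits, overhead...).
data_bits = 200e12 * 8
line_bits_per_s = 1e9

for share in (1.0, 0.5, 0.25):
    days = data_bits / (line_bits_per_s * share) / 86400
    print(f"{share:4.0%} of the line -> ~{days:.0f} days of downloading")

# -> roughly 19 / 37 / 74 days, before any processing on the restore side.
```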

Duplicati also has a local database that is sometimes hard to rebuild, and that rebuild comes before a restore if you have lost the system, unless you took a countermeasure for a faster initial restore, e.g. backing up that database as well.
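A minimal sketch of that countermeasure, assuming a typical Linux install where the job databases are SQLite files under ~/.config/Duplicati (the paths are assumptions – a Windows service keeps them elsewhere – and it should run when no backup is in progress):

```python
# Copy Duplicati's local job databases somewhere safe after each backup run,
# so a disaster restore does not have to start with a long database rebuild.
import shutil
from pathlib import Path

DB_DIR = Path.home() / ".config" / "Duplicati"      # typical Linux location (assumption)
DEST = Path("/mnt/secondary/duplicati-databases")   # hypothetical second location

DEST.mkdir(parents=True, exist_ok=True)
for db in DB_DIR.glob("*.sqlite"):
    shutil.copy2(db, DEST / db.name)                # copy2 keeps timestamps
print(f"copied job databases to {DEST}")
```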

“Server Backup 101: On-premises vs. Cloud-only vs. Hybrid Backup Strategies” may have useful ideas. Going with a single cloud-only backup, paid or not, may mean an enormous download to get all the data back.