250GB downloaded to restore a 2GB backup

I have set up Duplicati with default settings to an S3-equivalent cloud storage (using Storj).

I decided to test a restoration of about 2GB worth of data from a pool of about 1TB worth of incremental backups built up over a 6-month period.

What I noticed was that it kept downloading and checking many files; overall it took 5 hours for Duplicati to find all the files it needed and download them, and it downloaded 250GB worth of data just to get that 2GB. Is this the normal process for Duplicati? Is there any way I can make the restoration more efficient?

I apologize if I make any undue assumptions regarding your knowledge of how Duplicati works.

How Duplicati works (greatly oversimplified):

Duplicati divides the files you are backing up into blocks, and when it has enough of those blocks it packages them into volumes that are sent to your storage location. Duplicati keeps a database that records which blocks came from which files, and which volumes they were stored in.

Duplicati does this so that if only part of a file changes it doesn’t need to back up the entire file again. It adds only the blocks that changed to the backup.

When you restore a file from a backup, Duplicati will consult the database and make a list of blocks that it needs in order to reassemble the file, and a list of volumes it will have to retrieve in order to obtain those blocks.

If you are lucky, all or nearly all of the blocks you need will be contained in a small number of volumes. However, depending on your circumstances the blocks you need may be spread across many different volumes, and Duplicati must download the entire volume even if it only needs a single block that is stored within it.
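As a rough illustration of that planning step (a conceptual sketch with made-up names, not Duplicati's actual code):

```python
# Conceptual sketch of restore planning, not Duplicati's real code.
# Assumes two made-up lookup tables standing in for the local database:
#   file_blocks:  file path  -> ordered list of block hashes
#   block_volume: block hash -> name of the remote volume holding it

def plan_restore(paths, file_blocks, block_volume):
    """Return the blocks needed and the set of volumes to download."""
    needed_blocks = []
    volumes_to_fetch = set()
    for path in paths:
        for block_hash in file_blocks[path]:
            needed_blocks.append(block_hash)
            # Even if only one block is needed from a volume,
            # the whole volume has to be downloaded later.
            volumes_to_fetch.add(block_volume[block_hash])
    return needed_blocks, volumes_to_fetch

# Example: one small file whose three blocks ended up in three different volumes
file_blocks = {"notes.txt": ["h1", "h2", "h3"]}
block_volume = {"h1": "volume-001.zip.aes",
                "h2": "volume-042.zip.aes",
                "h3": "volume-117.zip.aes"}
blocks, volumes = plan_restore(["notes.txt"], file_blocks, block_volume)
print(len(blocks), "blocks from", len(volumes), "volumes")
```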

This is all to say that, depending heavily on your exact circumstances, this can be normal operation, especially if you have very large files that change a little bit with each backup, such as virtual machine images.

There are a few things you can do to make this process more efficient, but as with most things there are tradeoffs involved.

Using smaller volumes can reduce the amount of "extra" data Duplicati needs to download during a restore, but will increase the total number of volumes that need to be stored. Some storage providers limit the number of files you can store, or suffer from performance problems if the number gets too high, so this avenue might be limited.

Splitting large backup jobs into smaller ones can also improve efficiency but at the cost of having more backup jobs to manage.

Also, generally speaking, larger block sizes tend to perform better, at the cost of being less space-efficient. Block size can’t be adjusted for existing backups, which is another significant tradeoff.
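As a back-of-the-envelope sketch of the volume-size tradeoff mentioned above (all numbers here are assumptions for illustration, not measurements):

```python
# Rough worst-case model: if the blocks you need are scattered so that each
# touched volume only contributes a little useful data, the total download is
# roughly (number of touched volumes) x (volume size).

def worst_case_download_gb(restore_gb, useful_mb_per_volume, volume_mb):
    touched_volumes = restore_gb * 1024 / useful_mb_per_volume
    return touched_volumes * volume_mb / 1024

# Assumed example: 2 GB to restore, ~1 MB of useful blocks per touched volume
for volume_mb in (50, 150):
    gb = worst_case_download_gb(2, 1, volume_mb)
    print(f"{volume_mb} MB volumes -> roughly {gb:.0f} GB downloaded")

# Smaller volumes cut the per-volume overhead, but also mean many more files
# on the remote (roughly total_backup_size / volume_size of them).
```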


Very much. I’ll poke a little more.

Are you testing a regular restoration or a partial disaster recovery restoration? The latter must build a database, which could account for your "checking many files" and "find all the files". With an existing database, Duplicati avoids all that.

What are you looking at, anyway, that leads to your conclusions about what's going on? Got any better details?

Do you have files that change incrementally? Duplicati uploads the changes, so on a restore it must gather them.

Deduplication can add to download. A duplicate block avoids upload, but might be downloaded on a restore.

Are these small files or large? I "think" a file gets its timestamps from the remote, and this can add to the downloads.

Large files would feel this effect less, but a large file that changes a lot might find its blocks scattered around.

Small files can also be scattered, e.g. if they get created at different times, they initially go into different volumes.

Compacting files at the backend is another way to scatter blocks across the destination if data churn gets large.

How the backup process works
How the restore process works
Choosing sizes in Duplicati
Increasing the Remote Volume Size

The downside of using larger volumes is seen when restoring files. As Duplicati cannot read data from inside the volumes, it needs to download the entire remote volume before it can extract the desired data. If a file is split across many remote volumes, e.g. due to updates, this will require a large number of downloads to extract the chunks.

What’s your Options screen Remote volume size?

Some increase is normal for the reasons stated, but the extent (yours seems quite high) depends on specifics.

Regular restoration on the same system as the one running Duplicati, so it's not rebuilding the database.

Yes, many of the files would be changing in very small amounts.

Small files - largest would be 10MB in size

Not compacting - no-auto-compact = true

Remote volume size is 150MB, is that too big maybe??

This is probably the main reason, I just need to find a way to make this more efficient.
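As a quick sanity check on those figures (treating them as approximate):

```python
# Quick arithmetic on the reported numbers, for illustration only.
downloaded_gb = 250
restored_gb = 2
volume_mb = 150

volumes_downloaded = downloaded_gb * 1024 / volume_mb            # ~1700 volumes
useful_mb_per_volume = restored_gb * 1024 / volumes_downloaded   # ~1.2 MB each

print(f"~{volumes_downloaded:.0f} volumes downloaded, "
      f"~{useful_mb_per_volume:.1f} MB of restored data per 150 MB volume")
```

So each downloaded volume contributed only around 1 MB of the restored data on average, which fits the picture of many small files changing slightly with every backup.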

That is hard to optimize without giving up some of the key assumptions in Duplicati.

Each backend is treated as a "simple" storage, and each has only the following operations: GET, PUT, LIST, DELETE.

This simplification ignores some features that may be implemented by the storage vendor, in exchange for being versatile across providers. Basically, it means that there is no way to get part of a file (i.e. to get a single block) without getting the whole volume file. This is further complicated by the volume files being compressed and encrypted.

To get just the blocks you need, the backend (S3 equivalent in this case) would need to understand how to decrypt the file, decompress it, and return the blocks inside. And the S3 API would need a way of asking for this, instead of just asking for the volume file.
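A minimal sketch of that "simple storage" idea (a hypothetical interface, not Duplicati's actual backend code):

```python
# Hypothetical sketch of the "simple storage" abstraction described above.
# The backend only moves whole files; it knows nothing about blocks,
# compression, or encryption.
from abc import ABC, abstractmethod

class SimpleBackend(ABC):
    @abstractmethod
    def get(self, name: str) -> bytes: ...       # download a whole volume file
    @abstractmethod
    def put(self, name: str, data: bytes): ...   # upload a whole volume file
    @abstractmethod
    def list(self) -> list[str]: ...             # list volume file names
    @abstractmethod
    def delete(self, name: str): ...             # remove a volume file

def fetch_block(backend: SimpleBackend, volume_name: str, block_hash: str,
                decrypt, decompress):
    # To get one block, the client has to download the entire volume, then
    # decrypt and decompress it locally before it can even look inside.
    raw = backend.get(volume_name)
    archive = decompress(decrypt(raw))   # assume this yields {block_hash: bytes}
    return archive[block_hash]
```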

In your case the problem is amplified, because each small change causes a new block to be created, and the new blocks from across all the files are grouped together into a larger volume. Since this happens on almost every backup, a full volume has to be downloaded for each change.

One way of dealing with it could be to make multiple backup jobs, so each VM (or similar) is stored in its own backup. This would mean that only the data relevant to a restore is included in that backup.


Is there any specific time or download quantity target? Is this a frequently repeated operation to optimize?

There are lots of tradeoffs possible. As a Storj user, maybe price matters. Are you familiar with their plan?

Are you familiar with Storj segments, and price/performance differences for different sizes of stored files?

You can possibly reduce unwanted downloads with smaller volumes, but the losses might be in speed and cost.

You can possibly reduce the space-saving deduplication references to other volumes by using a larger blocksize.

You're not alone. That's actually expected with an ever-growing data set. To get stuff restored, everything stored needs to be downloaded, because it's still needed. Most of my backups are similar. And that's why I've at times been complaining about the compaction speed and/or asking for options which would allow it to be broken into smaller chunks.

But that's not a bug, it's a feature. The only bug is that if you start compaction and stop it, it can lead to data corruption. I haven't yet seen anyone confirm that this major failure has been fixed. But when it is, I'll be quite happy. A bonus, of course, would be a sane way to set a time limit for compaction: a clean exit after 8 hours of compaction or something similar.

Basically all chunking and de-duplicating systems suffer from the same problem, nothing new there. Unless there’s server side compaction.

Today I just restored a ~300 GB database and sure, it took quite a while, but otherwise worked out as expected. Running the compaction (and getting rid of backup sets, aka versions) reduces the storage and the time needed to restore.

I had one test binary from someone which fixed some of the corruption problems. Does the latest version contain these fixes? Could someone please confirm that for me? Then I'll update the clients and STFU about this topic, which makes me tired.

You’ve been in several of these, maybe:

was the plan, but I don't know what you did after the Canary came out below. If you need a Beta, it's coming soon.

and if a Canary is too risky, there was a Beta release candidate out recently which you could test:

2.0.8.0_experimental_2024-04-19

but there’s also

but I’m not sure this one happened yet, because I think it looks like this pull request:


2.0.8.1 Beta is now out, and it seemingly stuck to the format of the Experimental being a Beta release candidate.

The problem with that plan is that it might not install on the brand-new Ubuntu, but the timing of finding that out was bad.

I think it is possible to proactively re-group the data to ensure the data for the latest versions is grouped together. It does require bandwidth and time to do, but it could significantly speed up the restore.

I have looked at that a few times and it looks pretty solid. I will prioritize getting a review of that ready, but no, it is not in the beta release.

I think it is a matter of time before the Mono packages stop working anyway, so I would like to focus on the .NET8 builds that have recent Debian/Ubuntu support.

MacOS support is also hanging by a thread for the 2.0.8.1 release unfortunately.

Sure, but is it worth it? A backup restore should hopefully be a very rare operation. Another thing I've been thinking about for a long time, but haven't bothered to actually do, is putting different data types into independent backup jobs / sets.

As an example, thinking about efficiency, it's really stupid to put large encrypted files into the same backup set where I'm putting small logs which are retained for a long time. If I separated those, the large encrypted files would likely expire in batches, which would allow deleting whole block files; as it is, those annoyingly small log files mixed in between cause the files to be compacted (or read back when restoring) when most of the data has expired anyway.

Grouping data with the same retention period (in the source) would make perfect sense. Large encrypted files might be there for just a few days, and the small logs might hang around for months. Mixing those into the same block file is a great example of turning the restore into a very slow process and/or making compaction a really heavy and time-consuming job.
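Purely as an illustration of that grouping idea (hypothetical paths and retention values, not a real configuration):

```python
# Hypothetical split of one mixed backup into retention-based jobs.
# The point: data that expires together lands in the same volume files,
# so whole volumes can be deleted instead of being compacted.
jobs = {
    "large-encrypted-short-retention": {
        "sources": ["/data/encrypted-archives"],   # made-up path
        "keep": "a few days",
    },
    "small-logs-long-retention": {
        "sources": ["/var/log/app"],               # made-up path
        "keep": "several months",
    },
}
for name, job in jobs.items():
    print(f"{name}: {job['sources']} kept for {job['keep']}")
```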

Maybe I’ll do that one day, but to be honest, it hasn’t been a problem, even if I know it’s silly and really inefficient. The compacting of that data set still takes less than 12 hours (on local disk), so I haven’t minded it too much.

It would be nice to write official guidance on what to do for the known weak or trouble spots as-released.
macOS got code-level recovery from Python 2 loss, but I think the documentation on install is unsettled.

Ubuntu 24.04 LTS ā€˜Noble Numbat’ Is Now Available for Download, Here’s What’s New

Users of Ubuntu 23.10 (Mantic Minotaur) will be offered an automatic upgrade to Ubuntu 24.04 LTS (Noble Numbat) soon after the release. However, Ubuntu 22.04 LTS (Jammy Jellyfish) users will be able to upgrade (officially) on August 15th, 2024, when the Ubuntu 24.04.1 LTS point release arrives.

is one thought on when the 24.04 pressure will be increasing. I'm not sure we'll have a .NET 8 Beta by August, although possibly the planned-changes list can be reduced if there's a need to hit a specific target date.

If possible, we should update and use the Kestrel server from ASP.NET.

so there's some wiggle room, and maybe it won't make the first release. Sometimes wishes must wait. There's also now a chance that Duplicati Inc. could think of something that will drive the next client release.