Restore operation very slow and causes huge network traffic

After updating to Duplicati 2.2.0.3 I tried a restore operation of an existing backup to verify that it works. The backup is located in the cloud and accessed via WebDAV. The backed-up user data comprise about 55,000 files with a total size of 29.9 GB. The size of the backup in the cloud is 34.3 GB.

The restore operation has now been running for about 8 hours. So far 4,163 files have been restored, with 26.64 GB of 29.9 GB. However, the restore process has become very slow. Less than one file per second is being restored now, and it's still getting slower. I am getting serious doubts whether the restore will complete at all. The download network traffic from the cloud is constantly at the maximum rate of about 10 MByte/s. In the 8 hours since the start of the restore, 292 GByte (!!) have been downloaded from the cloud. However, as I said, the overall backup size in the cloud is only 34.3 GB. This means that Duplicati must be downloading the same data again and again and again.

Does anybody have advice on what I can do to improve this?

I’m using the standard Duplicati options and a fairly recent notebook (AMD Ryzen 7 7730U, 32 GB RAM, SSD). OS Windows 11 25H2.

See note on --use-legacy-restore if it worked better before:

Thank you for this suggestion. Unfortunately this option didn’t have a significant impact.

The Duplicati manual says: "As Duplicati cannot read data from inside the volumes, it needs to download the entire remote volume before it can extract the desired data. If a file is split across many remote volumes, e.g. due to updates, this will require a large amount of downloads to extract the chunks." This could explain the huge network traffic. Therefore I reduced the volume size from 50 MB to 20 MB, but this also had no significant impact.

Finally I reduced the number of snapshots in the backup from 44 to 3, but this also didn’t improve the restore process.

Your opening sounded like you saw a slowdown on 2.2.0.3. Or was this your first test?

This was your early conclusion (and the new restore can do that), but then you discovered another theory in the manual, related to how volume sizing matters.

Now we have to try to determine what’s going on, and you’ve started on testing.

You can’t do this so well after-the-fact, as older larger volumes will stay around.

This seems plausible to me. Deleting versions doesn’t directly reduce what the chosen version has to collect, although it can sometimes be helped by compact.

You can look in any backup log to see roughly what the compact did, if anything.

Retention settings

After deleting one or more versions, Duplicati will mark any data that can no longer be referenced as waste, and may occasionally choose to run a compact process that deletes unused volumes and creates new volumes with no wasted space.

A side effect of combining still-useful data from many volumes into fewer is that chances increase that the data from several updates could go into one volume.

I would suggest looking at the job log from the backup that reduced snapshots.

Make sure you turned use-legacy-restore on. Just adding it in the GUI might not do it.

Logs can also show which way you're going. A log-file even at information level shows all the downloads, so it can support or refute the idea of repeated downloads for caching reasons. The old restore downloads a dblock once and distributes its data content as needed. This is described in the article on the design of the new restore.

About Duplicati → Logs → Live is an easy way to watch restore. Verbose level provides more details. You can also use log-file and log-file-log-level.

EDIT 1:

Basically, make sure the option doesn't look like the screenshot below, where the option is turned off:

[screenshot: the restore-legacy option shown turned off]

Downloading 8.5 times the total backup size (and still going) seems unlikely if the legacy restore is actually in use. I could maybe see a smaller multiple of the total, because file metadata (such as time stamps) is set using a separate later step.

The developers know the restore design (especially new one) better than I do.

Now I got it right with the option for legacy restore. In the GUI it has a different name (restore-legacy), which is why I overlooked it and tried to add use-legacy-restore as a user option. Thank you so much for your patience!!

Now the restore is running as expected: after 20 minutes, 20,000 of 55,000 files and 10 GB of 29.9 GB have already been restored. The new restore mechanism really does seem to behave strangely.

I could not manage to get the log file option to work. Duplicati always reports that it can't access the log file directory. I tried several directories and checked their permissions, and they look fine. However, as the restore is working now, the log issue is not so important.

Hopefully this thread may help other users; I guess I might not be the only one struggling with this issue. It's quite unfortunate that I lost two days on it. I had already started to test some other backup software (Kopia, which looks quite nice and has the advantage that the snapshots can be mounted as a virtual drive).

I see that's what I called for, so I'll take responsibility. Not sure where I got the name from, although the short description might be a guess. Sometimes they're very close.

It seems to provide more pitfalls than the old one. It can also use lots of temporary space.

I'm not sure why that is. AFAIK it's just a file, so it "should" follow the permissions of the Duplicati process user (which might be different from the user driving the browser).

I’m glad it’s working, and I’m concerned that the new restore seems a problem.

Thanks for the feedback!

To clarify what the two ways of restore do:

Legacy restore

The goal of the legacy (or original) restore method was:

Download each volume at most 1 time, and then extract everything from that one volume before processing the next volume.

The way this works is roughly:

  • Find all files that need to be restored
  • Find all blocks that are needed
  • Find all volumes that contain needed blocks
  • For each volume
    • For each block
      • Patch all files that need this block
  • When done, check that all restored files are correct (hash check)

There is no parallelism, even though it could be parallelized by downloading multiple volumes and then patching in parallel as well.

This is a bit simplistic and, especially on spinning disks, suboptimal, as the writes happen more or less at random.
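The volume-at-a-time loop above can be sketched as follows. This is an illustrative Python toy (Duplicati itself is written in C#), with made-up data structures; it only demonstrates the key invariant that each remote volume is downloaded at most once:

```python
def legacy_restore(file_blocks, volumes, download):
    """Toy sketch of the legacy, volume-based restore loop.

    file_blocks: {path: [block_id, ...]} - blocks each file needs, in order.
    volumes: {volume_name: set_of_block_ids} - index of the remote volumes.
    download: fetches one remote volume, returning {block_id: bytes}.
    """
    restored = {p: [None] * len(bs) for p, bs in file_blocks.items()}
    needed = {b for bs in file_blocks.values() for b in bs}

    for name, contents in volumes.items():
        if not (needed & contents):
            continue                       # nothing useful in this volume
        data = download(name)              # at most one download per volume
        for path, blocks in file_blocks.items():
            for i, b in enumerate(blocks):
                if b in data:
                    restored[path][i] = data[b]   # patch the file in place

    return {p: b"".join(parts) for p, parts in restored.items()}
```

For example, a file whose three blocks live in two volumes is patched incrementally as each volume arrives, and each volume is fetched exactly once.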

Default restore

The default, or new restore, works with a different goal:

Write each file’s content in-order from start to finish.

The way this works is roughly:

  • Find all files that need to be restored
  • Sort files by biggest-first
  • Fetch blocks in order needed to restore file
  • Write each block in order

The restore process is heavily parallelized so we restore multiple files at a time. The writing is done sequentially, which should give the theoretical maximum write speed for the disk (if the network can keep up).

One issue with this method is that we may need the same volume for multiple files. To handle this, the downloaded volume is stored in the temp folder until there are no more pending requests for it.

Because this can take up a significant amount of disk space, we introduced a cache limit. When the cache is full, we start to evict volumes, even if they might be needed later. If you are unlucky, and your numbers suggest this, then volumes are repeatedly evicted even if they are needed later.
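The eviction behavior can be illustrated with a toy LRU cache in Python. The LRU policy here is an assumption for illustration (the real implementation may differ in detail), but it shows the unlucky case: when the cache holds fewer volumes than the working set of the files being restored, every pass over the files re-downloads the same volumes.

```python
from collections import OrderedDict

class VolumeCache:
    """Toy LRU cache over downloaded volumes (illustrative only; the
    actual eviction policy in Duplicati may differ in detail)."""

    def __init__(self, max_volumes, download):
        self.max_volumes = max_volumes
        self.download = download
        self.cache = OrderedDict()
        self.downloads = 0                       # count of remote fetches

    def get(self, volume):
        if volume in self.cache:
            self.cache.move_to_end(volume)       # hit: refresh LRU position
            return self.cache[volume]
        self.downloads += 1                      # miss: (re-)download
        data = self.cache[volume] = self.download(volume)
        if len(self.cache) > self.max_volumes:
            self.cache.popitem(last=False)       # evict least recently used
        return data
```

With three volumes A, B, C needed in turn by two files, a 2-volume cache fetches 6 times (every access evicts the volume the next access needs), while a 3-volume cache fetches only 3 times: the repeated-eviction case described above.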

For now, the workaround would be to increase the value of restore-cache-max to hold more volumes, and perhaps also adjust restore-cache-evict to be less aggressive in evicting volumes.

Future

Naturally, the default settings should "just work", so we are looking into how to optimize the flow. One thing we can do is look at the restore order, so we retain the original goal but also attempt to pick an order where there is maximal overlap between different restores.

Hope this clears up a bit why the new restore can download more data than what is needed.

Hi kenkendk, thank you very much for explaining the differences between the "legacy" and "default" restore modes. I understand the ideas behind both concepts. Even though the legacy mode works fine for me, I was curious and tweaked the parameters of the default mode to see how this might improve the restore. I increased the restore cache size from 4 GB to 16 GB (which is about half the size of the backup). I also changed the restore-cache-evict value from the default to 20. Unfortunately these two changes didn't result in a significant improvement.

For anyone else that is drawn to this subject line, I had what sounds like the same problem.

TL;DR: Try the "--restore-volume-cache-hint" option. It controls the amount of temp space for storing remote dblock, etc. files locally. The default is 5 GB (volume size x 100). The parameter wants a size suffix, e.g. --restore-volume-cache-hint=20GB.

I periodically do a full, --no-local-db restore of my backups. One of them was taking many hours (I let it run to completion one time, and I think it ran to 40+ hrs). Other backups might run to a few hours at most, usually much less. I also use a dedicated temp directory (--tempdir) and I could see a LOT of file churn. My log files recorded many downloads of the same files; I stopped counting after 50 downloads of the same file!

For the record, the backup in question tips the scales at 60GB for remote storage with thousands of small files, most of which don’t change once created, but there’s fairly constant addition and removal.

Anyway, I saw the "--restore-volume-cache-hint" option in the source code and, basically, set it to the size of the backup (i.e. the total size of the dblock, etc. files).

Result? Restore took around 45min.

The default value for the restore volume cache is volume size x 100, or 5 GB with the default volume size of 50 MB. Once the cache size is exhausted, volumes are deleted by a "least recently used" (LRU) eviction policy (don't quote me on that). Sounds reasonable.

The problem, though, is that the new restore process is 'file-based' whereas the previous, legacy process is 'volume-based'. I'll explain (what follows is a very simplistic take on what kenkendk explained above but, I think, it gets my point across more clearly).

The legacy process iterates through each volume in the backup, extracts all the file data it contains, which can belong to many files, and builds 'scaffold' files which contain the extracted pieces of each file. As each new volume is downloaded, its data is used to incrementally patch the missing data in each file, until all the volumes are processed and, if all goes to plan, all your files are fully patched and restored. You can see the process in the logs if you enable verbose logging. And, if you watch the temp folder, you'll see a single volume being processed at a time.

The new process iterates through the list of files to restore and retrieves all volumes that contain data for each file. If all the file's data is in a single volume, you're golden. If, on the other hand, the data is spread over many volumes, you're going to need to retrieve all those volumes. Say a volume contains a single 128 KB (I think?) block from your file; the rest of the volume is basically hoping a file it has data for will be processed before it gets evicted. At 50 MB/volume and 128 KB/block, a single volume could potentially have data for 400 files. With large backups and a small cache, you can see the problem.
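A quick back-of-the-envelope check of the 400-files figure, using the sizes mentioned in this thread (50 MB volumes, 128 KB blocks; both are assumptions about this particular configuration):

```python
# Worst-case fan-out: if every block in a volume belongs to a different
# file, one 50 MB volume can hold blocks for up to 400 distinct files.
VOLUME_SIZE = 50 * 1024 * 1024     # 50 MB default remote volume (dblock)
BLOCK_SIZE = 128 * 1024            # 128 KB block size assumed above

blocks_per_volume = VOLUME_SIZE // BLOCK_SIZE
print(blocks_per_volume)           # 400, matching the estimate above
```

So in the worst case, each of those 400 files keeps the volume pinned in the cache until its own restore is complete.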

Interestingly, the cache eviction policy already has an "evict if all files with data in this volume have been processed" step which operates even when the cache is barely used, but the cache size takes precedence and triggers the LRU qualification once exceeded.

I also tried the legacy restore process (--restore-legacy=true), which came in at about 3.5 hrs, miles ahead of the default restore configuration for this particular backup, but it's easy to see why the new process, an impressive improvement, was implemented.

I know this is a long post, but I'll also add that I think the new restore process is definitely worthwhile. However, it needs to be more aggressive in its cache allocation when required, and it would be good if there was some indication of when the cache was exhausted. Alternatively, some level of auto-tuning could be useful too.

Thanks for reading.

Hi aureliandevel, thank you for the detailed explanation. Together with the response from kenkendk, this makes things really clear for an average user who doesn't look at the source code. I have now activated the "--restore-volume-cache-hint" option and set it to the size of the backup. This is the solution! The restore completed in just 57 minutes, in my case twice as fast as with the legacy restore. Perfect. Really great advice.

However, I believe the current restore implementation might lead to much frustration for unaware users who don't know (and should not need to know) all the details. Just in case some developer is reading this post, I would like to suggest a possible improvement: if the download bandwidth to the backup is limited (backup in the cloud) and the same volumes are downloaded multiple times, it would be very helpful to issue a warning explaining what is happening and suggesting increasing "--restore-volume-cache-hint" if possible, or using the legacy restore.

By the way, I now also got the logging option working. My problem was that I configured a directory for the log file, but Duplicati expects a specific file name.

Hi @hbl2bk

Glad that your issue was resolved.

FYI, I’ve raised Discussion: Proposed changes to new (default) restore process volume caching to take this further.

Hi @hbl2bk! I’m the author of the reworked restore flow, and while I’m a bit late to the discussion (sorry!), I thought I’d confirm everything that’s been stated.

The outlines provided by @kenkendk and @aureliandevel are both correct and unearth the underlying issue of poor volume cache utilization. I originally designed the flow to focus on increasing parallelism and sequential disk writing. This required layers of caching to keep the individual parts busy and to reduce repeated work.

For backups that had been running for quite some time, it turns out that many files are spread across many volumes, leading to a substantial number of volumes being kept in cache at once (makes sense now, but again, all is clear in retrospect :slight_smile: ). This in turn resulted in temp stores being filled up, which led to the introduction of the cache size limiting option.

My initial thought was that while some volumes would have to be redownloaded, evicting them would free up storage space alongside the "no longer needed" evictions. Turns out I was wrong; the remote volumes are downloaded multiple times, resulting in aggressive disk writes and internet connection usage.

I do have some ideas in mind, and I like the suggestions that have already been proposed here. So hopefully we'll be able to fix this issue. I'll, however, move the discussion to the other thread.

Hi Carl, it's excellent to know that you read this thread. Thank you for your reply; I'm sure you will find some improvements. Luckily, in the meantime we have the workarounds described above.
