Processor features and deduplication


#1

I recently moved my Duplicati installation from a MacBook Pro (i7-3520M CPU @ 2.90GHz, running Ubuntu 18.04 server) to an older machine I’ve repurposed as a fileserver (Core2 Duo CPU 6400 @ 2.13GHz, also Ubuntu 18.04 server). The first backup from the new installation has been running for 3 days now, and has only got through about 500 gigabytes. This is much slower than previously (the last backup took 5:57 in total according to the UI). It is on “Backup_ProcessingFiles”, and reports a speed of 1 byte/s.

I’m trying to figure out if the slow speed is a hardware limitation that will come up every time, or just an issue with this first backup because of some incidental changes.

  • File metadata has changed: all files have a new owner UID.
  • The backup used to use two directories as a source. They’ve been merged into one directory source.

As I understand it, this is enough that it will review every file, and update the index with the new location (“delete” the old location and add the new). It isn’t uploading anything, because deduplication catches it all.

I expected this process to go much faster, though. SHA sums aren’t exactly hard to calculate, but I can’t think of what else is tough about this. What’s likely to be the bottleneck?

Here’s a screenshot of system tools - ethernet, disk, cpu, and memory are all largely idle.

(edits for extra info)


#2

Hi @ohthehugemanatee, welcome to the forum!

You’re probably correct that every file is being re-hashed - and it’s probably not the SHA calcs slowing things down so much as the database lookups to see if the hash already exists.

The database SQL is a known less-than-optimal part of Duplicati that we’re (slowly) working on improving. Until then, I’m guessing things will be slow this first run (think of it as a “first backup” even though little is actually being re-backed up) and future runs will probably be much faster.
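
To make the hash-then-lookup pattern concrete, here is a rough Python sketch. It is not Duplicati’s actual code or schema: the table name `block`, the helper names, and the 100 KiB block size are all assumptions for illustration. The point is only that every block costs one SHA-256 plus one index query, so a slow database (or a slow disk under it) can dominate even when nothing new is uploaded.

```python
import hashlib
import sqlite3

# Assumed block size for illustration only; not read from any Duplicati config.
BLOCK_SIZE = 100 * 1024


def open_index(path=":memory:"):
    """Open (or create) a hypothetical block index: one row per known block hash."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS block (hash TEXT PRIMARY KEY, size INTEGER)"
    )
    return db


def process_stream(db, stream, block_size=BLOCK_SIZE):
    """Hash each block of `stream` and look it up in the index.

    Returns (new_blocks, known_blocks). Only new blocks would need uploading;
    known blocks still cost a hash and a database query each.
    """
    new = known = 0
    while True:
        chunk = stream.read(block_size)
        if not chunk:
            break
        digest = hashlib.sha256(chunk).hexdigest()
        row = db.execute(
            "SELECT 1 FROM block WHERE hash = ?", (digest,)
        ).fetchone()
        if row is None:
            db.execute(
                "INSERT INTO block (hash, size) VALUES (?, ?)",
                (digest, len(chunk)),
            )
            new += 1
        else:
            known += 1
    db.commit()
    return new, known
```

Running `process_stream` twice over the same data with the same index reports every block as new the first time and as known the second time, which is roughly the situation after moving files to a new path: nothing to upload, but every block still gets hashed and looked up.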

Note that depending on your retention policy, the old location isn’t immediately “deleted” so you’re effectively working on a double sized database for the moment.

Oh - and for future reference you can paste screenshots directly into posts here.

(But that’s a nice system tool you’ve got and I hadn’t considered using my NextCloud for such uses. :slight_smile: )


#3

I’m wondering if the system is disk-bound. I think these are different tools. The htop “Load average” being somewhat high with CPU use being somewhat low is explainable by I/O waiting, for example in the red “D”. Bottom screenshot might be the top disk users shown by iotop, with quite a few Duplicati threads unseen.

iotop

In addition, the total I/O bandwidth read and written during the sampling period is
displayed at the top of the interface. Total DISK READ and Total DISK WRITE values
represent total read and write bandwidth between processes and kernel threads on the one
side and kernel block device subsystem on the other. While Actual DISK READ and Actual
DISK WRITE values represent corresponding bandwidths for actual disk I/O between kernel
block device subsystem and underlying hardware (HDD, SSD, etc.).

What’s odd is that total and actual are exactly the same. If these are Duplicati file scans, I’d have expected some read-ahead into memory. You could try running sar -b or something to see if you’re disk-read-bound.
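
If sar isn’t installed, a quick way to check for I/O wait is to sample the aggregate `cpu` line of /proc/stat twice and see what share of the interval went to iowait. The sketch below is Linux-only, and the function names are made up for this post. It assumes the usual field order (user, nice, system, idle, iowait, irq, softirq, steal, then guest fields, which are skipped because they are already counted in user and nice).

```python
import time


def read_cpu_times(stat_text):
    """Parse the aggregate 'cpu' line of /proc/stat into a list of tick counts."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            return [int(v) for v in fields[1:]]
    raise ValueError("no aggregate cpu line found")


def iowait_fraction(before, after):
    """Share of elapsed CPU time spent in iowait between two samples.

    Only the first eight fields are summed; the guest fields that follow
    are already included in user and nice.
    """
    deltas = [b - a for a, b in zip(before[:8], after[:8])]
    total = sum(deltas)
    return deltas[4] / total if total else 0.0


def sample_iowait(interval=5.0, path="/proc/stat"):
    """Take two samples `interval` seconds apart and return the iowait share."""
    with open(path) as f:
        before = read_cpu_times(f.read())
    time.sleep(interval)
    with open(path) as f:
        after = read_cpu_times(f.read())
    return iowait_fraction(before, after)
```

Calling `sample_iowait()` while the backup is running gives a number between 0 and 1. A consistently large value alongside low user/system CPU would point toward the disk rather than the processor, though it doesn’t tell you which process is doing the waiting.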