Very slow backup of many files

thommyX · September 30, 2017, 1:42pm

I have an SVN repository with about 110.000 revisions, which comes to about 225.000 files totalling 28 GB.

As is normal for an SVN repository, only a few files change when new revisions are committed. So from one week to the next there are maybe 200 new files totalling a few MB (not GB), and sometimes nothing changes at all.

If Duplicati now makes a second, third,… backup (with nothing to do) it takes about 55 minutes.
I also set the switch --check-filetime-only=true but the time doesn’t change: it still takes between 53 and 55 minutes.

If I copy the complete repository from source to target with robocopy it takes less than 55 minutes and a second robocopy just checking for changes takes about 11 seconds.

Is there anything I can do to accelerate?

Only new files and files with a changed timestamp need to be backed up.

JonMikelV · September 30, 2017, 3:20pm

With so many files my GUESS is that the sqlite lookups to fetch the “last backed up” date to compare to the current file times are what’s taking most of the time. Depending on your source disk speed, the file scans to get the current times could also be a decent chunk of that.

Assuming that’s the case, there are some discussions going on regarding how to improve sqlite performance however I don’t believe any actual coding has started based on those talks.
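
To make the guess above concrete, here is a minimal Python sketch of a per-file lookup pattern: one file stat plus one sqlite query for every file in the backup set. The table name `file_state`, its columns, and the function `changed_files` are hypothetical and chosen for illustration; they are not Duplicati's actual schema or code.

```python
import os
import sqlite3


def changed_files(conn, paths):
    """Return paths whose current mtime differs from the recorded one.

    Issues one SELECT per file, so the number of database round trips
    grows with the number of files even when nothing has changed.
    """
    changed = []
    for path in paths:
        current = os.stat(path).st_mtime_ns  # file scan on the source disk
        row = conn.execute(
            "SELECT mtime_ns FROM file_state WHERE path = ?", (path,)
        ).fetchone()  # one database lookup per file
        if row is None or row[0] != current:
            changed.append(path)
    return changed


# Hypothetical schema, for illustration only; not Duplicati's real tables.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE file_state (path TEXT PRIMARY KEY, mtime_ns INTEGER)")
```

With roughly 225.000 files, this loop would perform roughly 225.000 stats and 225.000 queries on every run, whether or not anything changed.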

sanderson · September 30, 2017, 4:48pm

What OS is your source and what is your target/destination?

I’ve got a much larger backup set (2.5ish TB, 300K files), and nightly syncs are taking about 35-45 minutes. A large percentage (about 80%) of that is doing a directory list of all the files on the remote SFTP server to verify they exist. The actual process to check for updated files and update the small number of changes is only taking 5-10 minutes.

thommyX · September 30, 2017, 9:05pm

Source is a Windows 7 machine, SVN on a SATA drive
Destination is a Windows 7 machine, SATA drive
Duplicati runs on a third Windows 7 machine (the Duplicati-PC); its OS drive is an SSD and the Duplicati DB files are on this SSD

All 3 are wired with Gigabit Ethernet.

Source and Destination are used as Windows Shared Folders.

A Robocopy started on Duplicati-PC, copying from source to target takes about 45 Minutes. Depending on what is running on source and/or target.

A Robocopy started the second time on Duplicati-PC (so also with check if files have changed!!!) takes 7-15 seconds.

A Duplicati First-Time-Backup started on Duplicati-PC from source to target takes hours and hours. But that isn’t a problem.

A Duplicati Second-Time-Backup with nothing to copy started on Duplicati-PC takes about 50 minutes, even with the switch to compare on a timestamp basis.

So the question is: Why can robocopy decide if something needs to be copied in 11 seconds while Duplicati needs 50 minutes?

Before Duplicati I was using Areca Backup, and this needed some minutes but not nearly an hour.

So I will wait for a new version. Maybe an improvement to the SQL access will speed up the process.

JonMikelV · October 1, 2017, 12:34am

Robocopy is able to do a direct btree style FAT to FAT comparison. Duplicati should get btree FAT type speeds on the source but does a much less optimized sqlite lookup on the other side.

On top of that, when a timestamp difference is seen robocopy can do a simple file copy. Duplicati has to chop the source file up into blocks (default size 100KB).
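
As a rough illustration of why a metadata-only comparison is cheap, the following Python sketch compares two directory trees by file size and modification time only, without reading any file contents. It is a simplified stand-in for the kind of check described above, not robocopy's actual algorithm; the function names `snapshot` and `needs_copy` are made up for this example.

```python
import os


def snapshot(root):
    """Map relative path -> (size, mtime_ns) using directory metadata only."""
    result = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            st = os.stat(full)
            result[os.path.relpath(full, root)] = (st.st_size, st.st_mtime_ns)
    return result


def needs_copy(source_root, target_root):
    """Return relative paths that are new or differ in size or timestamp."""
    source = snapshot(source_root)
    target = snapshot(target_root)
    return sorted(p for p, meta in source.items() if target.get(p) != meta)
```

Both sides are answered from file system metadata, so an unchanged tree costs two directory walks and some dictionary lookups, with no database and no hashing involved.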

Each of those blocks gets hashed then another sqlite lookup happens to see whether or not that block has changed. When a changed block is found it gets set aside for later upload. When enough changed blocks have been identified to fill up a dblock (archive file) they get zipped together, the resulting compressed file encrypted (if enabled), uploaded, and records for all the changed blocks added to the sqlite database file.
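
The block step described above can be sketched in Python as follows. This is an assumption-laden illustration, not Duplicati's code: SHA-256 is assumed as the hash, a plain in-memory set stands in for the sqlite block table, and `new_blocks` is a hypothetical name. Zipping, encryption, and upload are left out.

```python
import hashlib
import io

BLOCK_SIZE = 100 * 1024  # the 100KB default block size mentioned above


def new_blocks(stream, known_hashes, block_size=BLOCK_SIZE):
    """Yield (hash, data) for each block whose hash is not yet known.

    known_hashes is a plain set standing in for the sqlite block table;
    each membership test here corresponds to one database lookup.
    """
    while True:
        data = stream.read(block_size)
        if not data:
            break
        digest = hashlib.sha256(data).hexdigest()  # assumed hash function
        if digest not in known_hashes:
            known_hashes.add(digest)
            yield digest, data  # set aside for a later upload
```

For example, `list(new_blocks(open(path, "rb"), known))` would return only the blocks of `path` that have not been seen before, which is why a file with a changed timestamp costs a full read and hash even if its content is identical.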

So in all those potential Duplicati steps one or more are the bottlenecks on your system. We’re working on adding some internal step monitoring to help identify exactly where slowdowns may occur, but they’re not done yet.

Until we’ve got those metrics, figuring out specific causes of slowdowns can be a bit of an art.

I’m glad to hear you like Duplicati’s functionality enough to keep future versions in mind!

thommyX · October 1, 2017, 6:48am

Thank you for the reply.

And yes: I like it!

mizu42 · October 1, 2017, 10:42am

Very happy to hear you will add data logging to enable finding the root cause of the performance bottleneck. Probably also helpful for locating it in my case.
Are you planning to drop an interim version having some log data to dig in?
Thanks!
Mike

JonMikelV · October 1, 2017, 12:12pm

Probably not as soon as people (including me) would like. As a volunteer effort having specific timetables is difficult since how much time people have to spend developing can vary widely from things as simple as a change of seasons.

However, as an open source project there are ways to speed things up, including visiting Duplicati on GitHub and:

checking out the code (C#, .NET, HTML, CSS, etc.) and submitting your own proposed changes
offering a Bounty to entice others to work on features you’d like to see (SIA support is a great example of that)
starting a discussion about the feature in the Features category so existing developers can get a better idea of exactly what’s expected out of a feature as well as gauge general interest in having it added, just vote with your (button)

kenkendk · October 5, 2017, 7:35am

I am aware of at least one performance bottleneck:

It talks about disk space usage, but it is also a performance problem.

Once this is fixed, Duplicati can make really fast “gimme all files in the folder” queries and this will significantly speed things up, as we do not need to make a database query per file.
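
A minimal Python sketch of the “all files in the folder” idea, assuming a hypothetical table that stores the folder separately from the file name. The schema and the functions `folder_state` and `changed_in_folder` are invented for illustration and do not reflect Duplicati's real database layout.

```python
import sqlite3


def folder_state(conn, folder):
    """Fetch recorded timestamps for every file in one folder with one query."""
    rows = conn.execute(
        "SELECT name, mtime_ns FROM file_state WHERE folder = ?", (folder,)
    )
    return dict(rows)


def changed_in_folder(conn, folder, current):
    """Compare a folder listing {name: mtime_ns} against the recorded state."""
    recorded = folder_state(conn, folder)  # single query for the whole folder
    return sorted(n for n, m in current.items() if recorded.get(n) != m)


# Hypothetical schema that stores the folder separately from the file name.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE file_state (folder TEXT, name TEXT, mtime_ns INTEGER, "
    "PRIMARY KEY (folder, name))"
)
```

The number of queries then scales with the number of folders rather than the number of files, and the per-file comparison happens in memory.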

jack_reacher · December 11, 2017, 4:01pm

nice article,