Multi-threading block and file hashing

Hello all. I have seen the posts about slow performance, and experienced it myself, so I decided to look into it. I know that most (all?) of the work Duplicati does is single-threaded, and if I am not mistaken, Ken is working on multi-threaded uploading.

The plan I had was to take the ProcessStream function from BackupHandler.cs, extract it into its own class, and make all of its functionality multi-threaded. Block hashing, file hashing, block list hashing, and then adding the block to the database/uploading it would each run on separate threads.
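
To make the idea concrete, here is a rough sketch of the kind of split I mean. This is not the actual Duplicati code; the class name, the block size, and the use of SHA-256 are placeholders for illustration. The whole-file hash is updated on the reading thread while every block hash is pushed onto the thread pool, so the reader never waits for hashing to finish:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

// Illustrative only: block hashes are computed on worker threads while the
// whole-file hash is updated incrementally on the reading thread.
static class ParallelBlockHasher
{
    private const int BlockSize = 100 * 1024; // example size, not Duplicati's actual setting

    public static async Task<KeyValuePair<byte[], List<byte[]>>> HashAsync(Stream source)
    {
        var blockTasks = new List<Task<byte[]>>();

        using (var fileHasher = SHA256.Create())
        {
            var buffer = new byte[BlockSize];
            int read;
            while ((read = await source.ReadAsync(buffer, 0, BlockSize)) > 0)
            {
                // Keep the full-file hash on the reading thread.
                fileHasher.TransformBlock(buffer, 0, read, null, 0);

                // Copy the block so the next read does not overwrite it,
                // then hash it on a worker thread while reading continues.
                var block = new byte[read];
                Array.Copy(buffer, block, read);
                blockTasks.Add(Task.Run(() =>
                {
                    using (var sha = SHA256.Create())
                        return sha.ComputeHash(block);
                }));
            }

            fileHasher.TransformFinalBlock(new byte[0], 0, 0);
            var blockHashes = new List<byte[]>(await Task.WhenAll(blockTasks));
            return new KeyValuePair<byte[], List<byte[]>>(fileHasher.Hash, blockHashes);
        }
    }
}
```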

From some hacky, very preliminary testing with just the hashing moved onto separate threads, I am seeing roughly a 30% to 50% performance improvement. I wrote a test so I didn't have to run the whole program; it just compares the existing function against the class I am writing, so there is no uploading or database work, only reading and hashing.
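
Roughly, the comparison harness has this shape (a simplified sketch, not my exact test code; the two delegates stand in for the existing ProcessStream logic and the new class):

```csharp
using System;
using System.Diagnostics;
using System.IO;

// Placeholder benchmark shape: run two hashing implementations over the same
// file and compare wall-clock time. A warm-up pass is discarded first so OS
// file caching affects both measured runs roughly equally.
static class HashBenchmark
{
    public static void Compare(string path, Action<Stream> sequential, Action<Stream> parallel)
    {
        Time("warm-up (sequential)", path, sequential);
        Time("warm-up (parallel)", path, parallel);

        Time("sequential", path, sequential);
        Time("parallel", path, parallel);
    }

    private static void Time(string label, string path, Action<Stream> hash)
    {
        using (var stream = File.OpenRead(path))
        {
            var watch = Stopwatch.StartNew();
            hash(stream);
            watch.Stop();
            Console.WriteLine("{0}: {1} ms", label, watch.ElapsedMilliseconds);
        }
    }
}
```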

My question is: is this something that is desirable? Is anyone else working on it? And if Ken is working on multi-threaded uploading, is he changing this code anyway?

I would love to be able to contribute some code to this project and help make it better!


It sounds great to me, but I'm not sure what @kenkendk is working on in that respect. I was under the impression (based only vaguely on what I've seen here) that he's mainly working on multi-threaded uploads next, so there's a good chance your efforts won't be redundant.

If possible, you might want to include database work in your tests, as a good chunk of the reported issues seem to be due to database performance.

On top of that, I'm not convinced SQLite handles concurrency well, so the threads may end up blocking each other.

That being said, I’m sure there’s still performance to be gained from your code.

The fix I am working on splits everything into processes. This ensures maximum performance on all levels (directory reading, file reading, hashing, compression, and encryption).
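
The general shape is a set of independent stages connected by bounded queues, so a slow stage applies back-pressure instead of stalling everything else. The sketch below only illustrates the pattern using BlockingCollection; the actual branch code is structured differently and has more stages (compression, encryption, upload):

```csharp
using System;
using System.Collections.Concurrent;
using System.IO;
using System.Security.Cryptography;
using System.Threading.Tasks;

// Pattern illustration only: independent stages connected by bounded queues,
// so a slow stage applies back-pressure instead of blocking the whole backup.
static class PipelineSketch
{
    public static void Run(string[] files)
    {
        var blocks = new BlockingCollection<byte[]>(boundedCapacity: 64);
        var hashed = new BlockingCollection<Tuple<byte[], byte[]>>(boundedCapacity: 64);

        // Stage 1: read files and split them into blocks.
        var reader = Task.Run(() =>
        {
            foreach (var file in files)
                using (var stream = File.OpenRead(file))
                {
                    var buffer = new byte[100 * 1024];
                    int read;
                    while ((read = stream.Read(buffer, 0, buffer.Length)) > 0)
                    {
                        var block = new byte[read];
                        Array.Copy(buffer, block, read);
                        blocks.Add(block);
                    }
                }
            blocks.CompleteAdding();
        });

        // Stage 2: hash each block (several of these could run in parallel).
        var hasher = Task.Run(() =>
        {
            using (var sha = SHA256.Create())
                foreach (var block in blocks.GetConsumingEnumerable())
                    hashed.Add(Tuple.Create(sha.ComputeHash(block), block));
            hashed.CompleteAdding();
        });

        // Stage 3: the real pipeline would compress, encrypt, and upload here.
        var consumer = Task.Run(() =>
        {
            foreach (var item in hashed.GetConsumingEnumerable())
                Console.WriteLine(BitConverter.ToString(item.Item1));
        });

        Task.WaitAll(reader, hasher, consumer);
    }
}
```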

It already works for the backup operation; you can get it from the concurrent_processing branch.

I am in the process of rewriting the other functions to use the same structure.

The delete/repair/compact functions all call each other, so it needs a big push to get all operations working. Also, the BackendManager is currently implemented twice, once in the old version and once in the new, and I would like to remove the old version completely.

Part of this rewrite is also to provide a thread-safe interface to the database, as there is currently a bit of an ad-hoc locking system. There is also a lot of state shared in unexpected ways, which causes problems (search for “volume not finished”); this will be fixed with the update, as each process gets explicit input.
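
To give an idea of the direction, here is a pattern sketch (not the actual interface in the branch): all database work is funneled through one dedicated thread, so callers never touch the connection directly and SQLite only ever sees a single writer:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Pattern illustration: callers enqueue work items, and a single dedicated
// thread executes them against the database connection in order. This avoids
// SQLite's write-concurrency limits without ad-hoc locking.
sealed class SingleWriterDatabase : IDisposable
{
    private readonly BlockingCollection<Action> _work = new BlockingCollection<Action>();
    private readonly Task _worker;

    public SingleWriterDatabase()
    {
        _worker = Task.Run(() =>
        {
            foreach (var action in _work.GetConsumingEnumerable())
                action();
        });
    }

    // Callers get a Task back, so they can await the result without ever
    // touching the connection themselves.
    public Task<T> RunAsync<T>(Func<T> databaseOperation)
    {
        var tcs = new TaskCompletionSource<T>();
        _work.Add(() =>
        {
            try { tcs.SetResult(databaseOperation()); }
            catch (Exception ex) { tcs.SetException(ex); }
        });
        return tcs.Task;
    }

    public void Dispose()
    {
        _work.CompleteAdding();
        _worker.Wait();
    }
}
```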

The rewrite requires a fresh head from me, and I am lacking the time to complete it, but hopefully I will get to it soon.

That all sounds good. I will take a look at the concurrent_processing branch to see how it all works. If I can’t effectively contribute in this area, I’ll look around and see what other things I can contribute to. Thanks for all your hard work, Ken!
