Dealing with backends for implementing parallel uploads

seantempleton · February 16, 2019, 2:43am

I’ve been working on implementing parallel uploads to a backend and I have a question on the backends interface.

The IBackend and IStreamingBackend interfaces contain synchronous functions, none return Tasks. Therefore calling Put() from within a new Task.Run() call means the operation cannot be canceled. On the other hand it could be called from a new Thread that can be aborted but I don’t particularly like aborting threads in the middle of what they are doing as that can lead to corruption and other problems.

What is everyone’s opinion on converting void Put() to Task Put()? I would be willing to do the work of updating all of the backends but there’s no way I could test them all myself. In fact, I probably couldn’t test a majority of them.
The second problem is that some libraries would need to be updated, like updating System.Net.FtpClient to FluentFTP, and other code like CoreUtility.CopyStream() would need to be changed to support copying asynchronously.

Thoughts?

-Sean

seantempleton · February 20, 2019, 3:05am

I’m going to go ahead with changing Put() to return a Task.

Next problem after that is reporting progress. The GUI is clearly built for reporting progress of a single file upload at a time. Should we choose one of the files currently being uploaded and update the progress of that file to the GUI for now, letting the others “silently” upload, and make a pass later at changing the GUI?

warwickmm · February 20, 2019, 4:09am

This sounds like quite a “task”. Best of luck! I have a local test suite that I use to test operations on a variety of backends (too embarrassing to share, but perhaps one day if people think it’s useful), so I can help out with some testing if needed.

In case you haven’t seen it, there’s a related GitHub issue:

Also, do you see a way to do this that might open up the possibility for parallel jobs in the future?

ts678 · February 20, 2019, 2:58pm

Basically, let’s not make that worse. I think there are issues even currently, and test difficulties might exist.
I’m not a Duplicati developer, but reliability and maintainability matter to me personally and as I help others.

I’m not sure how much it’s used now, but maybe this could be put to some purpose in error handling tests:
“A testing backend that randomly fails”

seantempleton · February 21, 2019, 1:36am

Haha warwickmm, I see what you did there with that task comment. I like it! It sounds like your test suite could prove to be useful.
Thanks for the link to the GitHub issue. I can see how this work could help get us towards running jobs in parallel, but there will be plenty more work to make that happen. I want to see that happen as well and am willing to help out.

ts678, I agree with the goal of reliability. My goal is not to make anything worse, ideally either staying at the same level of reliability or increasing it. I’ve been testing my changes locally, having all blocks verified after the backup is done, as well as doing restores and then making binary comparisons from the original source files to the restored files, and everything has matched up so far. Those tests are only on my machine backing up to another hard drive. There are more changes to make and definitely a lot more testing to go.

warwickmm · February 21, 2019, 4:12am

@ts678 makes a good point. We currently have some unresolved reoccurring issues where users have a frustrating experience and it is difficult to nail down the exact cause of the issues. We should try to ensure that the introduction of more parallelism in the process doesn’t hinder our ability to help debug issues and lead to poorer user satisfaction.

kenkendk · February 21, 2019, 10:32am

When I wrote the backend interface, there was no Task construct in .Net. Had I written it today, it would certainly be async.

I don’t think we need to do that. For older libraries that do not support the non-blocking calls, we simply wrap them in Task.Run(). This does not help in any way with it being non-blocking, but it allows us to use the same interface for everything.

In later versions of the .Net framework, there is a copy method, including an async copy method on streams, so we could probably target a newer version of the framework and remove the function. See also the huge amount of work done by @mnaiman here:

Net 5 migration by mnaiman · Pull Request #3124 · duplicati/duplicati · GitHub

As I have stated before: my biggest reason for not doing parallel uploads is the cleanup required in case of failures. For example, it is possible that you generate block1, index1, block2, index2. Since we are doing parallel uploads, it could be that we complete in the order index1, index2, block2, block1.

If we then loose the connection at a random point, we cannot immediately clean up the remote destination. On the next connection, we need to know what to remove. And on restore from a “dirty” destination we need it to handle the out-of-order upload even if index1 exists but block1 does not.

This one is parallel on a different “level”. Currently all backups run in the same process (i.e. the server or tray icon process). This prevents us from running multiple instances, as some resources are shared, notably the certificate validation code is “per process”. To support multiple backups running in parallel, we need to support spawning the backups in a separate process (easy) and piping back progress (slightly complicated). For many reasons, such as force termination, I would like to implement this anyway.

I have thought about that a few times, and I think the GUI needs to change to show what is going on. For the upload part, I think we can cheat a bit and just treat parallel uploads as a single “metafile” upload (i.e. sum the file sizes and speeds).

seantempleton · February 22, 2019, 2:08am

@warwickmm Agreed! I’m willing to do whatever is needed to ensure this feature doesn’t make anyone’s experience worse.

I was wondering if that might be the case.

That’s a reasonable way of working around it. My primary motivation for wanting to switch things over to non-blocking calls wherever possible was because they generally support taking a CancellationToken. I have updated the FTP backend to FluentFTP locally and there was no breaking API changes. I haven’t actually tested it yet though.

That function, Stream.CopyToAsync(), is available in the version of the framework we are using now. That said I opted to make a CopyStreamAsync() function as CopyToAsync() will create a buffer on each call. JSONWebHelper for instance iterates over headers and for each header calls CopyStream(). We can use the same buffer for each of those headers rather than having the framework create a buffer on each call. Totally a premature optimization on my part.

Currently the BackendUploader uploads the block file, and if that succeeds, then the index file. Without making the code complex enough to handle all possibilities like you mention, it seems to me that keeping that pattern but parallelizing it should result in the same functionality and failure modes as we have now.

I was about to head down that path. I’ll try it out and see what it’s like.

seantempleton · February 28, 2019, 12:30am

I think I have this working now. The changes made so far:

Uploads are now run on separate threads.
IBackend and IStreamingBackend will now return a Task for the Put() function.
All the backends have been updated for the new signature.
Upgraded from System.Net.FtpClient to FluentFTP.
Added a new option to control how many concurrent uploads are allowed.
Created a new class that tracks the progress of all uploads and throttles them as necessary. All uploads meaning just those going through the BackendUploader.

I have run the File, FTP, and Dropbox backends. Is there anything else I’m overlooking that needs to be changed? Any type of specific testing to find more bugs? So far everything seems to be running properly and verify completes succesfully at the end of each run with it set to verify all volumes.

warwickmm · February 28, 2019, 2:16am

Can you try running a few backups, interrupting them in a few different ways, and attempting to run the backup again? Just curious how the database transactions are handled, etc.

If you’re willing to push your branch, I can run a a few of my tests when I have some time.

seantempleton · March 1, 2019, 3:33am

I pushed a branch, parallelupload, with all of my changes if anyone wants to look at it / test it out.

One thing I just thought of is we will probably need a per-backend maximum concurrent upload limit. I’m sure some backends will vary wildly as to how many uploads they will support.

Maybe users could set the global limit to a value, say 10 as an example, and each individual backend could have a hardcoded maximum that the provider supports. I think that would be nicer than the user not knowing what to set the value to and possibly getting cryptic errors if they set it too high.

warwickmm · March 1, 2019, 5:11am

While running some command-line tests, I’m getting errors with all my backends: Amazon Cloud, Box, B2 Cloud, Dropbox, Google Cloud, Google Drive, Local, Microsoft OneDrive, SFTP.

I also see some unit test failures (for example, BorderTests.Run10k). Do the unit tests fail when you run them?

seantempleton · March 1, 2019, 12:39pm

Thanks for testing @warwickmm. Can you share how you were testing with the Dropbox and Local backends? I’ve been running those through the GUI and they have been succeeding for me.

I’m sad to say I forgot about the fact that there are unit tests. I’m sure those will fail for me when I run them. I will look into it.

JonMikelV · March 1, 2019, 1:48pm

Thanks for doing so this with everybody!

@seantempleton, I’m curious what the non-happy path is for transfer failures - are they errors or warnings with X retries?

If all upload attempts fail on a thread is the job aborted (other there’s cancelled) or do you continue uploading everything else?

warwickmm · March 2, 2019, 3:22am

My test runs the backup command a few times and checks the exit codes:

$ duplicati-cli help returncodes

Duplicati reports the following return/exit codes:
  0 - Success
  1 - Successful operation, but no files were changed
  2 - Successful operation, but with warnings
  50 - Backup uploaded some files, but did not finish
  100 - An error occurred
  200 - Invalid commandline arguments found

The initial backup run should return 0.
After modifying a source file, the second run should return 0.
Without modifying a source file, the third run should return 1 (this fails).

Also, when running the test command, I see the following output

  Listing remote folder ...
  Downloading file (1.36 KB) ...
  Downloading file (1.25 KB) ...
  Downloading file (2.72 KB) ...
  Downloading file (3.56 KB) ...
Failed to process file duplicati-i9b8432b94db44378a1a4a03644c8b52b.dindex.zip.aes => Invalid JSON, expected "String", but got Null, 
  Downloading file (3.53 KB) ...
  Downloading file (1.28 KB) ...
Failed to process file duplicati-i893890c6cfe4489aa63af85aca096e3b.dindex.zip.aes => Invalid JSON, expected "String", but got Null, 
  Downloading file (3.47 KB) ...
duplicati-i9b8432b94db44378a1a4a03644c8b52b.dindex.zip.aes: 1 errors
	Error: Invalid JSON, expected "String", but got Null, 

duplicati-i893890c6cfe4489aa63af85aca096e3b.dindex.zip.aes: 1 errors
	Error: Invalid JSON, expected "String", but got Null,

seantempleton · March 4, 2019, 4:14am

Thanks warwickmm, I was able to see those same errors when running the unit tests and the tests are now all passing.

As it is currently written it should behave the same as before, X retries and then there will be an error. Any other current uploads will be cancelled. It would be feasible to let other current uploads finish transferring and then throw an error. I wasn’t sure which way to go so went with the cautious approach of being able to alert the user immediately.

JonMikelV · March 4, 2019, 2:00pm

I think that’s the right way to go - as far as I know anything uploaded as part of an eventually failed backup just gets deleted as cleanup during the start of the next backup.

warwickmm · March 5, 2019, 4:05am

I’m still getting some errors with Amazon Cloud (seeing {"message":"Internal failure"} in the output), and Box.com (it appears that no files are uploaded).

@seantempleton, do you want to move this to a GitHub pull request? It’s a bit easier to review the code and provide feedback there.

seantempleton · March 5, 2019, 4:33am

That’s kind of what I was afraid of, that I would wind up breaking a bunch of backends. But… I’m committed, can’t stop now!

I think I see the problem in the Box backend. That was a pretty small change. I arguably did a little refactoring that wasn’t strictly needed.

I’ll go ahead and create a PR like you suggest @warwickmm.

kenkendk · March 28, 2019, 9:29am

I made a proof-of-concept implementation of running with isolated backends here:

It took a bit longer than expected and does not work perfectly just yet. The logging system uses the call context (essentially the stack trace) but once methods are executing in two different processes, the call stack is not preserved and some manual plumbing is required to re-wire it.