Dealing with backends for implementing parallel uploads

What is the advantage of isolating each backend in a separate process? Simply in case one crashes? As opposed to running multiple threads in the main process and instantiating classes as needed for the different backends, e.g. instantiating two BackendUploaders, one for B2 and one for Amazon. I guess that approach would require a ton of refactoring for all the different classes involved.

This is conceptually the same as running two instances of Duplicati side by side, right? So if I have 10 files to upload, they will be hashed, encrypted, etc. once for each backend? It would be cool to do all the work as normal and then just split at the point of upload to the different backends. Thinking ahead though, you could have different block sizes for the two backends, different encryption, etc.

Crash resilience is not so important. There is exception handling all the way, so it would have to crash in a nasty way for it to matter.

The problem solved with multiple processes is interference control. There are a bunch of shared variables and settings in .Net (basically anything static) that are per-process. Most importantly, the SSL certificate validator can only be set for the entire process. That means if the user has a particular certificate hash allowed (or allows all certificates), this setting is applied to all HTTP requests, including OAuth and updater checks.
(This design exists because HTTP requests in .Net are pool-based, so you cannot easily map a request back to its source.)
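
To make the process-wide part concrete, here is a minimal sketch (not Duplicati's actual code) of how such a validator is typically installed; the AllowHash helper and the thumbprint comparison are just for illustration:

```csharp
using System.Net;
using System.Net.Security;
using System.Security.Cryptography.X509Certificates;

static class CertificatePolicySketch
{
    public static void AllowHash(string acceptedThumbprint)
    {
        // ServerCertificateValidationCallback is a static property, so this one
        // assignment applies to every HttpWebRequest in the process, whether it
        // belongs to a backend transfer, an OAuth call or an updater check.
        ServicePointManager.ServerCertificateValidationCallback =
            (sender, certificate, chain, sslPolicyErrors) =>
                sslPolicyErrors == SslPolicyErrors.None
                || (certificate is X509Certificate2 cert
                    && cert.Thumbprint == acceptedThumbprint);
    }
}
```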

Another problem is cancellation, where we want to force-stop a transfer. This is not currently possible for some backends, as they do not offer a cancellation token. It does sort-of work because we can Dispose the source stream, but it is possible for the backend to hang. Due to worker pools (especially with Tasks) it is not possible to simply kill the thread, as there is no telling which thread is actually doing the work.
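
A minimal sketch (not Duplicati code) of that fallback: with no cancellation token, the only lever is disposing the source stream and hoping the backend notices, and the worker is whatever thread-pool thread the scheduler picked, so there is nothing safe to kill. FakeBackendPut below is a stand-in for a backend upload:

```csharp
using System;
using System.IO;
using System.Threading.Tasks;

class UploadAbortSketch
{
    static async Task Main()
    {
        var source = new MemoryStream(new byte[1024 * 1024]);

        // The upload runs on whichever thread-pool thread the scheduler picks,
        // so there is no specific thread we could safely kill.
        var upload = Task.Run(() => FakeBackendPut(source));

        await Task.Delay(100);

        // Force-stop attempt: pull the stream out from under the uploader.
        // The next read throws, but a backend blocked inside a network call
        // may simply hang instead of noticing.
        source.Dispose();

        try { await upload; }
        catch (ObjectDisposedException) { Console.WriteLine("Upload aborted."); }
    }

    // Stand-in for a backend upload that reads the source stream in chunks.
    static void FakeBackendPut(Stream stream)
    {
        var buffer = new byte[4096];
        while (stream.Read(buffer, 0, buffer.Length) > 0)
            Task.Delay(10).Wait(); // simulated network write
    }
}
```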

That would be the next step. If we can use the Controller class remotely, it is possible to run multiple backups side by side with no interference problems and with super simple kill functionality.
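
A hypothetical sketch of why the kill part becomes simple once each backup lives in its own process (this is not an existing feature, and the command-line arguments below are placeholders):

```csharp
using System.Diagnostics;

class ProcessIsolationSketch
{
    static Process StartBackup(string jobArguments) =>
        Process.Start(new ProcessStartInfo
        {
            FileName = "Duplicati.CommandLine.exe", // assumed worker entry point
            Arguments = jobArguments,
            UseShellExecute = false
        });

    static void Main()
    {
        // Two backups of the same source to different destinations, each in its
        // own process, so no static state (certificate validator, etc.) is shared.
        var jobA = StartBackup("backup <destination-A> <source-path> ...");
        var jobB = StartBackup("backup <destination-B> <source-path> ...");

        // Force-stopping one job cannot hang or disturb the other.
        jobA.Kill();
        jobB.WaitForExit();
    }
}
```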

That sounds like the “RAID backend” idea where you split the uploads over multiple providers. I have not followed up on that idea, as the failure detection is complicated. You need to somehow know which destinations have which files, and then deal with cases where two destinations report the same file but with different metadata or content.

I am not doubting you, but I'm curious: what makes the validator only work per process? I see nothing static in it, and it looks like you can load it up with all the valid hashes of the certificates you want.
Maybe I am thinking of this wrong, but if the user has a certificate allowed for a certain backend, then that hash shouldn't match the certificate for any other backend, right? So wouldn't it still be fine in that case to run multiple backends in the same process?

With the change to PutAsync, cancellation tokens are now passed to all the backends. Whether it works properly from there or not, I can't say.
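
Roughly the situation, sketched with a made-up IUploadBackend interface (the real backend interface shape may differ): the token only helps if the backend actually forwards it to the underlying HTTP call.

```csharp
using System.IO;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

interface IUploadBackend
{
    Task PutAsync(string remotename, Stream stream, CancellationToken cancelToken);
}

class ForwardingBackend : IUploadBackend
{
    private static readonly HttpClient Client = new HttpClient();

    // Token is forwarded, so cancelling the caller aborts the HTTP request.
    public Task PutAsync(string remotename, Stream stream, CancellationToken cancelToken) =>
        Client.PutAsync("https://backend.example/" + remotename,
                        new StreamContent(stream), cancelToken);
}

class IgnoringBackend : IUploadBackend
{
    private static readonly HttpClient Client = new HttpClient();

    // Token is accepted but never passed on, so cancellation has no effect
    // for this backend.
    public Task PutAsync(string remotename, Stream stream, CancellationToken cancelToken) =>
        Client.PutAsync("https://backend.example/" + remotename,
                        new StreamContent(stream));
}
```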

Obviously this could be made as complex as you could imagine (as evidenced by the thread on the RAID idea). My thought process was to keep it simple and save the user time in one scenario: wanting to back up one source to multiple destinations.

Treat it like setting up two jobs that would normally run independently, except that it just so happens that all the hashing, etc. is done once. The uploads go to their respective destinations, and for all intents and purposes the database would look the same as if you created two jobs right now from the same source and the same backup settings, and ran them one after the other. But… all of this is easier said than done, that's for sure.

At least some backends do not use the token, so it only works in some places.

You register it using System.Net.ServicePointManager, which is static. You can see how it is done here:

The HttpContextSettings class kind of fixes it, in that calls made through the AsyncHttpRequest class honor it, but other calls do not. The S3 library and others simply use the .Net WebRequest, which is served by a pool and thus does not preserve the call stack.

You can see the comment here, where we just pick any settings we can find:

What’s the number of parallel uploads by default, and can it be tweaked with a parameter? Advanced Options - Duplicati 2 User's Manual doesn’t seem to mention this.

The manual hasn't caught up with the code (which is still not in beta), and there is no guarantee it will catch up if and when that code is released.

  --asynchronous-concurrent-upload-limit = 4
    When performing asynchronous uploads, the maximum number of concurrent
    uploads allowed. Set to zero to disable the limit.

This is available from Duplicati.CommandLine.exe help advanced, or from the Commandline menu in the UI with some adjustments.
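
For example (the storage URL and source path are placeholders, and 8 is just an arbitrary value):

```
Duplicati.CommandLine.exe help advanced
Duplicati.CommandLine.exe backup <storage-url> <source-path> --asynchronous-concurrent-upload-limit=8
```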

There is also always a description in the UI when adding the parameter, and it will always display the default value as well.

But yeah, it's not always the developer who ends up documenting their change, which is of course a bit sad since it may end up being undocumented :smile:

To sound a little less circular (i.e. one can only find the description and default if one already knows the option name):

So basically the idea is that one goes fishing in the program itself for the latest options, gets the option name and a short description as a guide from the dropdown, and then picking from the dropdown adds it to the list with the long description.

Any way you do it (even via the manual), finding options takes a bit of looking (though it is sometimes machine-searchable). Ideally, when features are announced in the release notices, usage directions (or a link to the manual) would be included. Things aren't that sophisticated yet, and as with most things, it's limited by the volunteers available to do better.