Google Drive: 500 internal server error

NTWarrior · November 7, 2018, 7:08am

I’m backing up from Windows to Google Drive (G Suite), and yesterday my PC crashed during a backup. Since then, backups will get as far as verifying and I can see the traffic but after a few hours the backup terminates with a 500 error:

   at Duplicati.Library.Main.BackendManager.List()
   at Duplicati.Library.Main.Operation.FilelistProcessor.RemoteListAnalysis(BackendManager backend, Options options, LocalDatabase database, IBackendWriter log, String protectedfile)
   at Duplicati.Library.Main.Operation.FilelistProcessor.VerifyRemoteList(BackendManager backend, Options options, LocalDatabase database, IBackendWriter log, String protectedfile)
   at Duplicati.Library.Main.Operation.BackupHandler.PreBackupVerify(BackendManager backend, String protectedfile)
   at Duplicati.Library.Main.Operation.BackupHandler.<RunAsync>d__19.MoveNext()
--- End of stack trace from previous location where exception was thrown ---
   at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()
   at CoCoL.ChannelExtensions.WaitForTaskOrThrow(Task task)
   at Duplicati.Library.Main.Controller.<>c__DisplayClass13_0.<Backup>b__0(BackupResults result)
   at Duplicati.Library.Main.Controller.RunAction[T](T result, String[]& paths, IFilter& filter, Action`1 method)
   at Duplicati.Library.Main.Controller.Backup(String[] inputsources, IFilter filter)
   at Duplicati.Server.Runner.Run(IRunnerData data, Boolean fromQueue)

Would repairing the database help at all, or is the problem squarely with Google Drive? I’ve now got >10TB with a 50MB volume size; could Drive be struggling with that many files?

kenkendk · November 7, 2018, 9:57am

The 500 error means that some error happens on Google’s servers when attempting to list the files present there.

I think the problem is only related to Google Drive. Unfortunately, it is really hard to figure out why it fails, as it happens on a server we cannot access, and with credentials/data I don’t have, on an encrypted connection.

One potential route is to look at the network traffic using this approach:

Note: If you try this, be sure not to share the log files with anyone, as they can contain credentials that would allow someone to take over your google account.

ipitcher · November 9, 2018, 5:24pm

Starting a few days ago, I have been having the same problem with some of my larger backups that use my G Suite Google Drive account for storage. I have a clone of one of the failing backups running which uses local storage, and that one works well as usual. I’m wondering if this is a coincidence, or if there has been a recent change on the Google Drive end that is causing this. I should note that my problem did not coincide with a computer crash or interrupted backup. Just a normal scheduled backup that has been running for months without issue.

It seems like List operations are failing some of the time with large backups. Smaller backups using the same Google Drive account seem fine.

Operation List with file  attempt 1 of 5 failed with message: The remote server returned an error: (500) Internal Server Error. => The remote server returned an error: (500) Internal Server Error.
Backend event: List - Retrying:  ()

Occasionally, after a couple of runs that fail with 500 errors, the operation will complete successfully after a few retries.

JonMikelV · November 9, 2018, 7:15pm

Thanks for the local vs. remote comparison.

Perhaps you have so many destination files that Google is taking too long to list them - have you tried adjusting the --http-operation-timeout parameter?

The default timeout is 10m - does it seem to take about that long for the error to appear?

This option changes the default timeout for any HTTP request, the time covers the entire operation from initial packet to shutdown
Default value: “”

NTWarrior · November 9, 2018, 7:19pm

After a few more attempts, my backup has started running again so it may have been simply a temporary issue with Google Drive.

JonMikelV · November 9, 2018, 7:30pm

Temporary GoogleDrive issue…I’m not sure if that’s a good thing or a bad one.

ipitcher · November 9, 2018, 8:04pm

Good idea. I did an A/B test with and without an increased “http-operation-timeout” and with the timeout set higher, I don’t seem to be getting the timeout/retry issue. Without the timeout set, it seemed to be failing with a minute or less between retries.

When you say the default is “10m” do you mean 10 minutes? That doesn’t seem right. For testing, I set it to 5 minutes, which seems like an absurd amount of time, but it worked. I’ll try moving it down in 30 second increments to see what I can get away with.

JonMikelV · November 9, 2018, 9:52pm

Yes, I meant 10 minutes - but that was based on what was said in another post, I don’t know if that’s actually what is currently in the codebase.

Glad to hear the timeout parameter helped (maybe)!

ipitcher · November 11, 2018, 7:03pm

OK, I spoke too soon. It doesn’t seem like adjusting --http-operation-timeout was really having a positive effect like it seemed; however, setting --number-of-retries=15 and --retry-delay=30s did reliably succeed for me. It makes sense, considering that the Google Drive List operation was failing and returning a 500 error within several seconds and not actually timing out. Increasing the retries and the delay seems to skew the odds in favor of the list succeeding. I think I’m going to try listing these folders with a quick REST API program to see if I can replicate the problem outside of Duplicati.
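
To illustrate why more retries with a longer delay can ride out an intermittent 500, here is a minimal Python sketch of a fixed-delay retry loop. This is not Duplicati's code: `list_with_retries` and the `list_remote` callable are hypothetical names, and the defaults only mirror the --number-of-retries=15 and --retry-delay=30s values mentioned above.

```python
import time


def list_with_retries(list_remote, retries=15, delay_seconds=30.0, sleep=time.sleep):
    """Call list_remote() until it succeeds or the retries are used up.

    list_remote is a hypothetical callable standing in for the remote
    List operation; it is assumed to raise an exception on a 500 response.
    """
    last_error = None
    # One initial attempt plus `retries` further attempts.
    for attempt in range(retries + 1):
        try:
            return list_remote()
        except Exception as error:  # a real client would catch only HTTP 5xx errors
            last_error = error
            if attempt < retries:
                sleep(delay_seconds)  # fixed pause before the next attempt
    raise last_error
```

If each attempt fails independently with some probability, every extra attempt lowers the chance that all of them fail, which would match the "skew the odds" observation.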

ipitcher · November 13, 2018, 1:39am

I wanted to narrow down the conditions that cause this error, and I think I’m onto something. Using the command line gdrive program, I found that listing the folder containing the affected backup fails with a 500 error when I don’t limit the results. E.g., gdrive list --query "'<FolderID>' in parents" --max 699 succeeds, but gdrive list --query "'<FolderID>' in parents" --max 0 (no limit) will fail.

The strange thing is that the upper limit seems to be a moving target, i.e., sometimes --max 1000 will work, and other times it won’t. This is very frustrating, as I can’t even delete files to change my retention rules and run a new backup.

JonMikelV · November 13, 2018, 7:36pm

Interesting tool, thanks for sharing it!

Assuming I’m understanding things correctly, the gdrive tool helped isolate the issue to Google Drive (or the connection to it), so the only “fix” is to either get Google Drive (or the connection) to not time out and/or update Duplicati to better respond to this when it happens.

I don’t know how Duplicati currently handles it, but perhaps we should start with --max 0 and if that fails go to --max 1000 and for each failure after that cut the max in half until we get to 0 at which point it’s a true failure.

Of course this assumes Duplicati already handles aggregating the multiple fetches (which it might not yet do).

Alternatively, we could look at the database which should be able to tell us how many files to expect and then start with a max of double that…
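
A rough Python sketch of the "start unlimited, then halve the limit on each failure" idea, combined with aggregating the pages. Everything here is an assumption for illustration: `list_all` and `fetch_page` are hypothetical names, and `fetch_page(limit, token)` stands in for whatever call actually lists one page of remote files. It is not how Duplicati or gdrive is implemented.

```python
def list_all(fetch_page, start_limit=1000):
    """List every file, shrinking the page limit whenever a request fails.

    fetch_page(limit, token) is a hypothetical callable: limit=None means
    "no explicit limit" and token=None means "first page". It returns
    (items, next_token) and is assumed to raise on a server error.
    """
    items = []
    token = None
    limit = None  # first try without an explicit limit
    while True:
        try:
            page, next_token = fetch_page(limit, token)
        except Exception:
            # Shrink the request: None -> start_limit -> half -> ... -> 0
            limit = start_limit if limit is None else limit // 2
            if limit == 0:
                raise  # a true failure: even a one-item page could not be listed
            continue  # retry the same page with the smaller limit
        items.extend(page)  # aggregate the results of successive fetches
        if next_token is None:
            return items
        token = next_token
```

The sketch keeps the reduced limit for all later pages rather than growing it again, which is a simplification; since the working limit seemed to be a moving target, a real implementation might want to re-probe.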

kenkendk · November 13, 2018, 9:05pm

Yes, Duplicati already handles pagination:

github.com/duplicati/duplicati/blob/master/Duplicati/Library/Backend/GoogleServices/GoogleDrive.cs#L485

                    mimeType = FOLDER_MIMETYPE,
                    labels = new GoogleDriveFolderItemLabels { hidden = true },
                    parents = new GoogleDriveParentReference[] { new GoogleDriveParentReference { id = parent } }
                };

                var data = System.Text.Encoding.UTF8.GetBytes(JsonConvert.SerializeObject(folder));

                return m_oauth.GetJSONData<GoogleDriveFolderItem>(WebApi.GoogleDrive.CreateFolderUrl(m_teamDriveID), x =>
                {
                    x.Method = "POST";
                    x.ContentType = "application/json; charset=UTF-8";
                    x.ContentLength = data.Length;
                }, req =>
                {
                    using (var rs = req.GetRequestStream())
                        rs.Write(data, 0, data.Length);
                });
            }
        }
    }

We currently rely on Google’s API to use a sensible default for the limit, but we could easily change that to use a forced limit. The docs say default is 100:

ipitcher · November 16, 2018, 5:50pm

Interesting. This problem seems to have magically disappeared. My previously failing scheduled backups began working again today. If Duplicati depends on the Google Drive API to set a limit, maybe something on Google’s end was bungled, causing list queries with no specified limits to return more than 100 records at a time?

JonMikelV · November 16, 2018, 6:17pm

I don’t suppose you’d believe we had Google rewrite their API to work better with Duplicati, would you?

We’ve seen things like this in the past with other providers where a few (or even all) users have odd issues with the LIST step of things then they magically resolve themselves.

I wish there was a better way to pin down exactly where the issue is coming from so we could assure users without a bunch of manual steps. But in the end (at least this time) it sounds like Duplicati is “doing the right thing” by saying “hey - something isn’t right with the destination provider, let’s hold off on doing any backups until it’s resolved.”

ts678 · November 16, 2018, 6:23pm

For whatever it’s worth, I saw this and tried it two days ago and saw the problem. Just now, all is well again…

There were 15 backend files at the time of failure (now 18 with today’s backup) so it might not be a size bug. Proof is not certain because Duplicati has been restarted (I’m not sure what version it was), but I’ve seen the error come and go before. I’m not sure what causes it, and web searches didn’t offer me a definitive answer.

Somewhere there’s an official Google document that advises exponential backoff, so if it was still happening, a --retry-delay of something larger than the default 10 seconds would have been worth playing with, which has already been done above with good results. Network tracing as mentioned above might also help, and the experiments from @ipitcher (thanks!) are also sounding useful in figuring out what affects the error.
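
For reference, exponential backoff just means doubling the wait after each failed attempt, usually with a small random jitter so that many clients do not retry in lockstep. A minimal Python sketch of such a delay schedule follows; `backoff_delays` is a hypothetical helper, the 10-second base mirrors the default retry delay mentioned above, and the 600-second cap and one-second jitter are arbitrary assumptions rather than values taken from Google or Duplicati.

```python
import random


def backoff_delays(retries, base_seconds=10.0, cap_seconds=600.0,
                   jitter=True, rng=random.random):
    """Return the wait before each retry: base, 2*base, 4*base, ... up to a cap."""
    delays = []
    for attempt in range(retries):
        # Double the delay on every attempt, but never exceed the cap.
        delay = min(cap_seconds, base_seconds * (2 ** attempt))
        if jitter:
            # rng() is assumed to return a float in [0, 1): up to one extra second.
            delay += rng()
        delays.append(delay)
    return delays
```

With jitter disabled, five retries would wait 10, 20, 40, 80 and 160 seconds, so a short outage is retried quickly while a longer one is not hammered with requests.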