Does it also plausibly explain the second part, where Puts didn’t actually leave files (ouch)?
Yes (ouch indeed). The destination server is told in advance how large the file is. When the timeout happens, the transfer stops, and the destination server correctly discards the incomplete upload.
Since the failure is treated as a requested cancellation, no errors are triggered. During the post-backup verification step, this missing file is then seen as a partial upload and purged from the database, leaving no trace of the problem besides an entry in the table RemoteOperation.
This post-backup cleanup was the reason that no testing found it. It simply purges the data associated with the missing file, making it as if some operations never happened. Only in the rare cases where it affects dlist files does it become immediately visible to the user.
It is hard to find in the database, but what you will see is that there is a PUT entry in RemoteOperation mentioning the file, but it is missing in Remotevolume.
Because files can be removed later, a missing entry in Remotevolume can be correct or a symptom of the error.
I have been building a more strict verification and cleanup strategy, so errors of this kind will be caught should they happen in other places.
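For anyone who wants to look for this pattern in their own local database, here is a rough sketch of the check described above: list PUT entries in RemoteOperation whose file has no row in Remotevolume. The column names (Operation, Path, Timestamp, Name) and the demo file names are assumptions for illustration, so compare them with your actual schema first, and run it against a copy of the database rather than the live file. The helper names are hypothetical and not part of Duplicati.

```python
import sqlite3


def find_unmatched_puts(conn):
    """Return (path, timestamp) for PUT operations with no Remotevolume row.

    Assumed columns: RemoteOperation(Operation, Path, Timestamp) and
    Remotevolume(Name). Adjust to the real schema if it differs.
    """
    query = """
        SELECT op.Path, op.Timestamp
        FROM RemoteOperation AS op
        LEFT JOIN Remotevolume AS vol ON vol.Name = op.Path
        WHERE LOWER(op.Operation) = 'put'
          AND vol.Name IS NULL
        ORDER BY op.Timestamp
    """
    return conn.execute(query).fetchall()


def build_demo_db():
    """Build a tiny in-memory database with made-up rows for demonstration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE RemoteOperation (
            ID INTEGER PRIMARY KEY, Timestamp INTEGER,
            Operation TEXT, Path TEXT);
        CREATE TABLE Remotevolume (
            ID INTEGER PRIMARY KEY, Name TEXT, State TEXT);
        INSERT INTO RemoteOperation (Timestamp, Operation, Path) VALUES
            (100, 'put', 'duplicati-b1.dblock.zip.aes'),
            (110, 'put', 'duplicati-b2.dblock.zip.aes'),
            (120, 'list', '');
        -- Only the first upload is recorded as a remote volume.
        INSERT INTO Remotevolume (Name, State) VALUES
            ('duplicati-b1.dblock.zip.aes', 'Verified');
    """)
    return conn


if __name__ == "__main__":
    for path, timestamp in find_unmatched_puts(build_demo_db()):
        print(path, timestamp)
```

A hit from this query is only a candidate. As noted above, files can be removed later, so a PUT without a matching Remotevolume row may be perfectly legitimate and needs to be checked against what is actually on the destination.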
I think this is the issue in question.
I thought it was a known issue that was being addressed, but if it isn’t I will certainly provide whatever info I can. I will post info the next time it happens.