Timeout and missing files on B2 with canary 2.1.0.125

Today I have a backup that is reporting missing files.

  • “repair” won’t run without --rebuild-missing-dblock-files, as mentioned in my last post
  • the repair still reports missing blocks, as expected
  • “purge-broken-files” fails with “The operation PurgeBrokenFiles has failed => Hash was null”

Fortunately this is a relatively small backup so deleting and recreating the database is not a problem.

On a Windows 10 system running repair:

Failed to access file: C:\WINDOWS\system32\config\systemprofile\AppData\Local\Duplicati\SOKWEROHXU.sqlite => The process cannot access the file 'C:\WINDOWS\system32\config\systemprofile\AppData\Local\Duplicati\SOKWEROHXU.sqlite' because it is being used by another process.
Failed to access file: C:\WINDOWS\system32\config\systemprofile\AppData\Local\Duplicati\Duplicati-server.sqlite => The process cannot access the file 'C:\WINDOWS\system32\config\systemprofile\AppData\Local\Duplicati\Duplicati-server.sqlite' because it is being used by another process.

The repair otherwise appears to have run OK.

I have switched some of the backups that seem to be more prone to the problem over to the S3 compatible backend. We’ll see how it goes…

Then there is some detective work required. How did those files disappear?

Judging from the timestamps and the filenames, this looks like a single backup that has disappeared, specifically the one from 2025-07-18, so not a new backup. A guess would be that this backup was deleted by the previous backup run, but somehow the database update was not persisted. The alternative is that something else deleted the files, which does not sound plausible.

Do you have any way of figuring out where the files went? B2 logs? Duplicati logs?

Still valid. The idea is to detect stalls and otherwise not care how long the operation takes.

It is not stalled, as I understand it. The problem is that we activated chunked uploads to B2, and it looks like this is handled incorrectly with that backend. The transfers appear to actually complete, but we do not register the transfer as complete when it is done; instead we wait for a response from B2, which can be delayed.

With the fix, the activity timeout should only cover the transfer itself.

That should work the other way around. There should be more files registered (the “failed” uploads), but this is usually fixed automatically.

Did you also see the problem with a higher timeout value?

I share the concern and would like to understand how to replicate the issue.
EDIT: Created an issue to track this.

Yes, that certainly looks to be the case. Is there any way to log this on the B2 side, so we know if B2 considers the file received or not? We rely on B2 to send a 200 - OK status code back, but maybe there is a deviation somewhere?

I would really like a look, maybe I can find a clue somewhere.

Yes, that is the rule.

Yes, that is the logic. Once the transfer starts (or really, when the file is ready to be transmitted), its status is recorded as “Uploading” in the database. This is committed to disk before the actual transfer is done to prevent “unknown” files from being found on the remote destination.

After the transfer completes, the status is updated to “Uploaded”, but this is a bit less strict in commit timing due to the asynchronous nature of the upload.

During the file list check, a file that is listed as “Uploading” but is found in its full length is promoted to “Uploaded” to account for the gap in transaction commit.

If the file is “Uploading” and missing or incomplete, the database entry is removed, as it is assumed the transfer failed. If a partial file exists on the remote, it is deleted as well. Some other code can sometimes create a replacement file (index files and filelists).

If the file is “Uploaded” and missing or incomplete, this is an error, because the backend has acknowledged the transfer but now the file is gone. And this is what @jberry02 reports is happening.
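To make those rules concrete, here is a minimal sketch of the reconciliation step. All names below (RemoteVolumeState, VolumeEntry, the callbacks) are hypothetical stand-ins for illustration, not the actual Duplicati code:

    using System;
    using System.Collections.Generic;

    enum RemoteVolumeState { Uploading, Uploaded, Verified }

    record VolumeEntry(string Name, RemoteVolumeState State, long Size);

    static class FileListCheck
    {
        // dbEntries: what the local database believes exists remotely.
        // listedSizes: name -> size of each file actually found on the remote.
        public static void Reconcile(
            IEnumerable<VolumeEntry> dbEntries,
            IReadOnlyDictionary<string, long> listedSizes,
            Action<string, RemoteVolumeState> setState,
            Action<string> removeDbEntry,
            Action<string> deleteRemoteFile)
        {
            foreach (var entry in dbEntries)
            {
                var found = listedSizes.TryGetValue(entry.Name, out var size);
                switch (entry.State)
                {
                    case RemoteVolumeState.Uploading when found && size == entry.Size:
                        // Found in full length: the transfer completed but the
                        // "Uploaded" commit was not persisted, so promote it.
                        setState(entry.Name, RemoteVolumeState.Uploaded);
                        break;

                    case RemoteVolumeState.Uploading:
                        // Missing or incomplete: assume the transfer failed.
                        // Drop the entry, and delete any partial remote file.
                        removeDbEntry(entry.Name);
                        if (found) deleteRemoteFile(entry.Name);
                        break;

                    case RemoteVolumeState.Uploaded when !found || size != entry.Size:
                        // The backend acknowledged this transfer, yet the file
                        // is gone or truncated: the error reported above.
                        throw new Exception($"Missing remote file: {entry.Name}");
                }
            }
        }
    }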

Do you have the stack trace for that one? We have discovered an error in the Test code that does this, and I think we have generally fixed it, but I would like to verify.

Do you have the exact string so I can edit it? I think it suggests “purge” because you could theoretically purge the files in question manually, but maybe purge-broken-files is a better suggestion.

Backblaze B2 uploads fail, often with timeout. Big read-write-timeout may help. #6438

I believe it’s because on B2 we use StreamContent(timeoutStream, B2_RECOMMENDED_CHUNK_SIZE), where B2_RECOMMENDED_CHUNK_SIZE is 100MB. So each write operation to the time-observing stream takes longer than the Timeout itself.

It seems this was misinterpreted from the documentation, as there the chunk size pertains to operations using b2_start_large_file & b2_get_upload_part, while we use the b2_upload_file URL.
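To illustrate the failure mode, here is a stripped-down stand-in for the time-observing stream (the real implementation differs; this only shows the mechanics). StreamContent copies from its inner stream one buffer-sized chunk at a time, so activity is only observed once per chunk:

    using System;
    using System.IO;

    // Hypothetical stand-in: treats a long gap between reads as a stall.
    class ActivityTimeoutStream : Stream
    {
        private readonly Stream _inner;
        private readonly TimeSpan _timeout;
        private DateTime _last = DateTime.UtcNow;

        public ActivityTimeoutStream(Stream inner, TimeSpan timeout)
        { _inner = inner; _timeout = timeout; }

        public override int Read(byte[] buffer, int offset, int count)
        {
            // StreamContent calls this once per buffer-sized chunk. With a
            // 100MB buffer, pushing the previous chunk to the network can
            // easily take longer than the timeout, so this fires even though
            // bytes were flowing the whole time.
            if (DateTime.UtcNow - _last > _timeout)
                throw new TimeoutException("No read activity within timeout");
            _last = DateTime.UtcNow;
            return _inner.Read(buffer, offset, count);
        }

        // Boilerplate pass-throughs for the remaining abstract members.
        public override bool CanRead => true;
        public override bool CanSeek => false;
        public override bool CanWrite => false;
        public override long Length => _inner.Length;
        public override long Position
        { get => _inner.Position; set => throw new NotSupportedException(); }
        public override void Flush() { }
        public override long Seek(long offset, SeekOrigin origin) => throw new NotSupportedException();
        public override void SetLength(long value) => throw new NotSupportedException();
        public override void Write(byte[] buffer, int offset, int count) => throw new NotSupportedException();
    }

    // With the 100MB value, the chunk size and the check interval coincide:
    //   new StreamContent(timeoutStream, 100 * 1024 * 1024);  // one check per 100MB
    //   new StreamContent(timeoutStream);                     // default buffer, frequent checks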

I don’t think there’s any chunked upload, but the large buffer was dropped to the small default (and what the optimal buffer size is seems like a good question – maybe the default is not it).

Backblaze B2 Event Notifications Now Generally Available might work, but is some work.

This is why I keep asking about ability to log low level activity for debugging purposes.
Network tracing used to be able to do this (but not without issues) in .NET Framework.

The question is whether it’s an error that can be recovered from, like a file erroneously left at “Uploading”. It might depend on the scenario. There are probably also questions of how much of an alert to give the user. The far extreme would be: a “Verified” file vanishes and Duplicati runs purge-broken-files automatically, which seems to me like a step too far.

Can “purge” delete dblock files? I was thinking it could only delete the files within the fileset?

Re: logs, etc.

I have a new example from overnight - I’ll see if I can’t put all the logs and other info together.

The default StreamContent buffer is 8KB, buffering data between the input stream and HTTP request stream. Network transmission is limited to MTU (~1.5KB), so buffer size doesn’t impact network performance if no bottlenecks exist. A 100MB buffer with a 15s timeout risks a timeout if a chunk takes >15s (e.g., slow I/O), as our custom idle stream detection checks only after a full chunk. The 8KB default (~12,500 chunks for 100MB, ~10ms each) ensures frequent checks, avoiding timeouts, so we reverted to 8KB. Benchmarking larger buffers might improve input stream/disk reads but likely not HTTP writes.
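Spelling out the arithmetic behind that (binary units; the figures above are rounded):

    using System;

    const int chunk = 8 * 1024;           // 8 KB buffer
    const int total = 100 * 1024 * 1024;  // 100 MB upload
    Console.WriteLine(total / chunk);     // 12800 chunks, so an activity check
                                          // roughly every ~10 ms, far below 15 s
    Console.WriteLine(12800 * 10 / 1000); // ~128 s of total transfer at that rate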

This got rid of the 100MB buffer.

I uploaded the logs from the repair sequence to Google Drive and sent you a share link. Since the bug causing the file(s) to be missing has been identified, I did not provide the log from the backup that claims to have completed yet has a missing file. If you still want a look at the prior logs, let me know.

It did not present this issue this time around. I will try to remember to post a stack trace if it happens again.

The backup storage destination is missing data files. You can either enable `--rebuild-missing-dblock-files` or run the purge command to remove these files. The following files are missing: duplicati-b47558abd5b524d3fb302b3e68a2a07b9.dblock.zip.aes

I tried doing a “purge” (not “purge-broken-files”) against the specified missing file, but it refused to even try, saying I needed to run repair. See the logs.

After doing “purge-broken-files” everything appears to be OK.

I recall you talking about this before (not sure if for this B2 issue or diagnosing some other case) and forgot to ask @kenkendk if there is a particular reason why we can’t have a build with Network tracing enabled.

No reason we cannot do that. I have investigated using HttpDiagnosticsEventListener and then redirecting the logged events to the regular log system, and it looks like this could work.
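For reference, a minimal sketch of the idea, with Console.WriteLine standing in for the regular log system; “System.Net.Http” is one of the built-in EventSource names:

    using System;
    using System.Diagnostics.Tracing;

    sealed class HttpEventRelay : EventListener
    {
        protected override void OnEventSourceCreated(EventSource source)
        {
            // Related sources: System.Net.Sockets, System.Net.Security,
            // System.Net.NameResolution
            if (source.Name == "System.Net.Http")
                EnableEvents(source, EventLevel.LogAlways);
        }

        protected override void OnEventWritten(EventWrittenEventArgs e)
        {
            var payload = e.Payload is null ? "" : string.Join(", ", e.Payload);
            // Stand-in for the regular log system:
            Console.WriteLine($"[{e.EventSource.Name}] {e.EventName}: {payload}");
        }
    }

    // Keep an instance alive while tracing:
    //   using var relay = new HttpEventRelay();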

The MTU is quite far away from StreamContent. At the TCP level, I think Nagle’s algorithm is on by default to coalesce data. Above TCP is generally TLS, where a TLS record can be up to 16 KiB, and might encrypt and package less efficiently if it gets smaller inputs of data.
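(The Nagle default is at least easy to verify from .NET:)

    using System;
    using System.Net.Sockets;

    // Nagle coalescing is per-socket and enabled by default:
    var socket = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    Console.WriteLine(socket.NoDelay);  // False => Nagle's algorithm is active
    socket.NoDelay = true;              // opt out, send small segments immediately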

This worried me because web info seems to think the default StreamContent buffer is 4KB and you think 8KB. To add confusion, Microsoft source and docs suggest it’s really 81920.

Google site:learn.microsoft.com/en-us/dotnet/ 81920 and see if you think that fits, and if so, that seems big enough in theory anyway. I also tried tracing code, and found this long and internals-dependent source of the 81920, but I’m not 100% certain it’s applicable.

Yes, 100% it’s 81920. I myself misinterpreted it from here whilst digging for buffer sizes when I was writing the fix.
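For anyone following along, the two StreamContent constructors in question; the single-argument form uses that internal default:

    using System.IO;
    using System.Net.Http;

    var stream = Stream.Null;                  // placeholder input
    var a = new StreamContent(stream);         // internal default buffer: 81920 bytes
    var b = new StreamContent(stream, 81920);  // equivalent, made explicit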

Yes, I was looking at the IO sequence on the stream: in the stream’s read/write loop, the input stream (disk) is read in 80KB chunks at the abstract level (downstream influenced by device block size) and written to the output (socket) in 80KB chunks (then influenced by the MTU). Both are indeed far away from the StreamContent abstraction.

  Listing remote folder ...
The operation PurgeBrokenFiles has failed => Hash was null

System.Exception: Hash was null
   at Duplicati.Library.Main.Database.LocalListBrokenFilesDatabase.GetOrphanedIndexFiles(CancellationToken token)+MoveNext()
   at Duplicati.Library.Main.Database.LocalListBrokenFilesDatabase.GetOrphanedIndexFiles(CancellationToken token)+MoveNext()
   at Duplicati.Library.Main.Database.LocalListBrokenFilesDatabase.GetOrphanedIndexFiles(CancellationToken token)+MoveNext()
   at Duplicati.Library.Main.Database.LocalListBrokenFilesDatabase.GetOrphanedIndexFiles(CancellationToken token)+System.Threading.Tasks.Sources.IValueTaskSource<System.Boolean>.GetResult()
   at Duplicati.Library.Main.Operation.ListBrokenFilesHandler.GetBrokenFilesetsFromRemote(IBackendManager backendManager, BasicResults result, LocalListBrokenFilesDatabase db, Options options)
   at Duplicati.Library.Main.Operation.ListBrokenFilesHandler.GetBrokenFilesetsFromRemote(IBackendManager backendManager, BasicResults result, LocalListBrokenFilesDatabase db, Options options)
   at Duplicati.Library.Main.Operation.PurgeBrokenFilesHandler.RunAsync(IBackendManager backendManager, IFilter filter)
   at Duplicati.Library.Main.Operation.PurgeBrokenFilesHandler.RunAsync(IBackendManager backendManager, IFilter filter)
   at Duplicati.Library.Utility.Utility.Await(Task task)
   at Duplicati.Library.Main.Controller.RunAction[T](T result, String[]& paths, IFilter& filter, Func`3 method)
   at Duplicati.Library.Main.Controller.RunAction[T](T result, Func`3 method)
   at Duplicati.Library.Main.Controller.PurgeBrokenFiles(IFilter filter)
   at Duplicati.CommandLine.Commands.PurgeBrokenFiles(TextWriter outwriter, Action`1 setup, List`1 args, Dictionary`2 options, IFilter filter)
   at Duplicati.CommandLine.Program.ParseCommandLine(TextWriter outwriter, Action`1 setup, Boolean& verboseErrors, String[] args)
   at Duplicati.CommandLine.Program.RunCommandLine(TextWriter outwriter, TextWriter errwriter, Action`1 setup, String[] args)
Return code: 100

Thanks! We had not caught that one. I have a fix ready.

Confirming in 2.1.1.101_canary_2025-08-19 after hitting in 2.1.1.100_canary_2025-08-08.

Fretting a bit about the apparent lack of the fix in the 2.1.2.0_beta_2025-08-20 source.