Backblaze B2 times out irregularly

I am having issues with Backblaze B2:

System.Net.WebException: Error: ConnectFailure (Connection timed out) ---> System.Net.Sockets.SocketException: Connection timed out

There are several threads on the forum about this issue, and a few others that say the issue disappeared by itself for the user.

What I find very strange (and can't explain) is the following:

  • I have 1 bucket with 3 different folders.

  • Each folder has its own backup: 17 MB, 2.2 GB, and 1.5 TB.

  • Each of these backups uses the same
    B2 Application ID, the same B2 Application Key, and the same encryption key - the folder separation is done because I consider some data critical enough to store more than once, even at the remote location.

  • If I do a test connection, the first 2 backups test successfully - right away.

  • The 3rd - the biggest backup - times out.

  • I was able to replicate the timeout in both the UI and the CLI (Linux).

  • I also have a second API key that is read-only and that I use for the CLI - specifically the compare command
    Using Duplicati from the Command Line - Duplicati 2 User's Manual

  • The CLI test was done on one machine, but I used both the API ID and Key. This machine has Duplicati installed but it is not normally used (except for the CLI compare now and then);
    the UI test was on another machine (the one that normally runs the backups).
    Both machines are on the same OS.
    I am using Linux - PopOS 20.04,
    running Duplicati 2.0.7.1_beta_2023-05-25.

Does anybody have ideas about what I can do? I did open a case with Backblaze (still waiting to hear from them), but my concern is that if the same API ID and Key work for one folder, it will circle back to the integration (Duplicati).

I am quite perplexed with the behavior.

So, it worked in the past with the same folder? How many files are in the remote folder? Maybe the timeout is too short to list them all at once, which is what the test command does.

Almost 2 years.
About 71000 files - I've had about 72500 in the past.
1780 GB - I've had 1880 GB in the past.

Hello,

did you set up an alarm for when you go beyond the hosting bandwidth limits?

A timeout is possible; however, if I understand the B2 driver correctly, b2-page-size limits the number of entries in a single HTTP transaction to 500 by default, so it's not sending the 71000 file names all at once. You can still try to fiddle with http-operation-timeout in the backend options.
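
To experiment with that, one way (untested; the bucket, folder, and the two B2 keys below are placeholders, and 10m is just a guess at a generous timespan value) is to append the option to a CLI command you already run against the big backup, for example:

# placeholders: replace the bucket/folder and the two B2 keys with your own
duplicati-cli compare "b2://mybucket/bigfolder?auth-username=KEYID&auth-password=APPKEY" 0 1 --http-operation-timeout=10m

If that helps, the same http-operation-timeout option should also be settable per backup in the UI as an advanced option.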

BTW, if you want to have some very precious data duplicated in a remote location, you can set up free accounts for up to 5 GB (sometimes even more than that) with other providers.

I am a bit lost - sorry - can you clarify, please?

Is this an option in Duplicati?
On B2, I have set limits but I never reached them.

I never changed this setting, so it is at 500 (the recommendation is 100 to 1000).

This one is not set.
The setting's description says:
This option changes the default timeout for any HTTP request, the time covers the entire operation from initial packet to shutdown
Default value: “”

By timing it, I can see that each "Listing remote folder" line takes about 3 minutes. I can try increasing the above to see if that changes anything.

When it fails on the CLI, I see

Enter encryption passphrase: 
  Listing remote folder ...
  Listing remote folder ...
  Listing remote folder ...
  Listing remote folder ...
  Listing remote folder ...

System.Net.WebException: Error: ConnectFailure (Connection timed out) ---> System.Net.Sockets.SocketException: Connection timed out
  at System.Net.Sockets.SocketAsyncResult.CheckIfThrowDelayedException () [0x00017] in <a85c1a570f9a4f9f9c3d2cfa5504e34f>:0 
  at System.Net.Sockets.Socket.EndConnect (System.IAsyncResult asyncResult) [0x0002f] in <a85c1a570f9a4f9f9c3d2cfa5504e34f>:0 
  at System.Net.WebConnection.Connect (System.Net.WebOperation operation, System.Threading.CancellationToken cancellationToken) [0x00006] in <a85c1a570f9a4f9f9c3d2cfa5504e34f>:0 
  at System.Threading.Tasks.TaskFactory`1[TResult].FromAsyncCoreLogic (System.IAsyncResult iar, System.Func`2[T,TResult] endFunction, System.Action`1[T] endAction, System.Threading.Tasks.Task`1[TResult] promise, System.Boolean requiresSynchronization) [0x00021] in <12b418a7818c4ca0893feeaaf67f1e7f>:0 
--- End of stack trace from previous location where exception was thrown ---

When the CLI works (a simple compare):

Enter encryption passphrase: 
  Listing remote folder ...
  Listing remote folder ...
  Listing remote folder ...
  Listing remote folder ...
  Downloading file (12.85 MB) ...
  Downloading file (12.85 MB) ...
Listing changes

so - it timed out 4 times, but after that it downloaded the 2 files

I tried creating another API key. The strange thing is that even this new key worked on the CLI but failed in the UI with a timeout.

My understanding is that when the user clicks the Test connection button, it should simply verify that the API key and password are accepted by B2. I don't think a listing should be initiated - I would rather expect an okay / not-okay response. Not sure if my understanding is correct here.

My connection is quite fast (250 Mbit/s). I am puzzled because even when it was slower I was able to manage more files.

Backends vary, but for most of them the test is a list, without looking at the file names. Simply not breaking is enough.

Unfortunately, yours broke with a ConnectFailure. Some people are wondering whether it's too much data, but the
WebExceptionStatus Enum describes ConnectFailure as "The remote service point could not be contacted at the transport level", so I'm wondering whether that, plus "Connection timed out", means that TCP tried but could not connect.

On Linux, running time nc to an IP that won't answer a ping takes 2 minutes and 11 seconds, while netstat -an | grep SYN_SENT shows TCP still trying. You could see whether you get that during the test's wait time.

EDIT 1:

The counter from netstat -s | grep TCPSynRetrans will go up while it's retrying, if this is indeed an initial connect failure.

EDIT 2:

Windows also uses SYN_SENT, but it times out much faster. I'm also not seeing a TCPSynRetrans stat there.

I'm not sure how many connections are needed to finish a list, but if there are multiple, that might be how the file count could affect the chance of a connect problem, depending on the connection reuse policy.

A network trace of this would be nice, but just using netstat is a far easier way to start testing a theory.
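
For example, running something like this in another terminal while the Test connection is hanging (just a rough sketch using the same netstat commands) would show whether a connect attempt is stuck and whether the SYN retransmit counter is climbing:

# repeat every 10 seconds while the test is running
while true; do
    date
    netstat -an | grep SYN_SENT        # any connection stuck waiting for the SYN-ACK?
    netstat -s | grep TCPSynRetrans    # cumulative count of retransmitted SYNs
    sleep 10
done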

The point of the alarm - something that is set up in B2 - is to warn you explicitly if you have reached your limits.

This is most definitely not what Duplicati is doing: it is trying to download the whole file list to cache it for further accesses. The goal is to work around the possible existence of files with the same name on the backend by using the most recent version (if I understand the code correctly). So theoretically it could be possible that the download endpoint URL changes during the test.

Something must have changed somewhere. The Duplicati B2-specific code has not changed in at least 3 years. The underlying helper library (Json) has been upgraded more recently though, with Canary 105 in May 2023, so if the problem was seen only after upgrading to the 2.0.7.1 beta, it could be because of this upgrade.
Or it could be because of a change on the hoster's (B2's) side. The B2 API is not supposed to change, but there are sometimes deviations between theory (compatibility) and reality. And the delivery of data by B2 sure seems complicated. I vaguely remember people having trouble accessing some download servers - maybe with B2 - because of DNS trouble.

Yes this looks like it.

It's the easiest thing to check (it's just a netstat), but it's not guaranteed to be the only case. TCP is full of timers and retries, and some have limits. On Linux (Windows likely has something similar), one can see some of the retrying:

$ netstat -s | grep -i retrans
    20 segments retransmitted
    TCPLostRetransmit: 16
    TCPSynRetrans: 20

tcp(7) — Linux manual page dips a toe in some of the complexity, and also cites some other references.
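
One concrete example of those limits: the initial connect gives up after net.ipv4.tcp_syn_retries retransmissions. With the usual default of 6 (an assumption; check your own system), the exponential backoff (1+2+4+...+64 seconds) adds up to a bit over 2 minutes, which matches the nc timing above. You can check your value with:

$ sysctl net.ipv4.tcp_syn_retries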

Just wanted to give you an update

bojo@serval2:~$ netstat -s | grep -i retrans
    18592 segments retransmitted
    TCPLostRetransmit: 1538
    14575 fast retransmits
    3 retransmits in slow start
    TCPSynRetrans: 486

The above result is after I upgraded the router. I tried to eliminate possible issues on my side.

This is before upgrading the router

bojo@serval2:~$ netstat -s | grep TCPSynRetrans
    TCPSynRetrans: 486
bojo@serval2:~$ netstat -s | grep TCPSynRetrans
    TCPSynRetrans: 486
bojo@serval2:~$ netstat -s | grep TCPSynRetrans
    TCPSynRetrans: 486
bojo@serval2:~$ netstat -s | grep TCPSynRetrans
    TCPSynRetrans: 486

  • I tried retrieving a file. Duplicati said it succeeded, but for some reason I was unable to find it.

  • Over the course of a few days I noticed that the DB got corrupted. I should have kept the errors, but I did not. At one point it said that something was 0 when it expected 1, or vice versa. On another try it pointed to a unique constraint.

Either way, my conclusion was that the DB had to be fixed somehow.

  • I tried repair - it did not help.

  • I tried recreate (delete and repair). It took a few retries, but eventually it started rebuilding the DB.

  • Initially it downloaded a few files that were about 12 MB (I noticed the system retried a few times but continued).

  • After that it started on the .dlist.zip.aes files - these were small (a few KB) but there were lots of them, about 36000.

  • The system was downloading steadily, one at a time (at this point I had a case open with B2 and had provided the technician with information).

  • I left Duplicati to download - it took almost 6 hours.

  • What I noticed is that after 1-2 hours, the download rate changed from 1 .dlist.zip.aes at a time to about 5 at a time. I may never know whether something changed on B2's side or not.

  • After the DB recreation completed successfully, I was able to run a backup that succeeded.

  • The next scheduled backup will start in a few minutes, and I am hoping it will complete okay.

To me, one of 2 possibilities exists:

  • Perhaps corruption in the DB caused the issue (the odd part is that on Dec 31 I was not getting a corrupted-DB message, just timeouts).
  • There may have been something on B2's side, or on the route, that interfered. I may never know what it was.

And also, the feedback from B2 support:

  • Their question is whether I have much bigger files. I have kept the default settings, so the biggest file would be 50 MB, give or take, if I am not mistaken.
  • They are suggesting a manual upload to the folder as a test. I have not done that because the DB was recreated and I have 1 successful backup.
  • They are suggesting increasing the timeout.

I think this comes back to the suggestion from @gpatel-fr.
Would that be a global setting or a per-backup setting? What would be a good value to keep in mind?
It may not be applicable now, but I would like to keep it as a reference in case the issue repeats.

  • Lastly - B2 support is curious what your thoughts are.

A single data point is meaningless; the stats are cumulative. You should look at them while the issue is happening, preferably on a system that's idle enough that something else isn't also driving network activity and retries.

EDIT:

That is the kind of activity where you'd want to keep an eye on the network, whether it finally works or fails.

Is the original use still acting up? I was thinking of "Test connection", which you did. What was the CLI run? Doing a list at the start of a backup would be normal, so in that sense it would be like Test connection.

You could try such things entirely separately (but don't hurt the backup folder) using the command-line tool Duplicati.CommandLine.BackendTool.exe, or use Duplicati.CommandLine.BackendTester.exe instead.
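
For example (hedged: the URL is a placeholder and the exact tool syntax may differ slightly by version; on Linux these run under mono), a bare list with BackendTool is close to what Test connection does, while BackendTester runs a full upload/download/delete cycle and therefore must point at an empty scratch folder, never the real backup folder:

# list only - roughly what the UI "Test connection" exercises
mono Duplicati.CommandLine.BackendTool.exe list "b2://mybucket/bigfolder?auth-username=KEYID&auth-password=APPKEY"

# full exercise - use an EMPTY test folder, not the backup folder
mono Duplicati.CommandLine.BackendTester.exe "b2://mybucket/scratchfolder?auth-username=KEYID&auth-password=APPKEY"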

Or maybe all is well now, for unknown reasons, due to unknown errors. It’s sounding pretty messy now.

Uh. Sorry, but this can't be correct. We know that you have about 1.5 TB on your backend, your remote file size is about 50 MB, and you have about 70000 files in total. If we assume that half of those are data files, that gives 35000 × 50 MB ≈ 1.75 TB. So the rough calculation fits, and it implies that you probably have about 35000 dindex files - there is usually one small index file for each big data file. So the thousands of small files you saw must have been dindex files, not dlist files.

Let's assume these files were around 20 KB each. Downloading 35000 files of that size one at a time is dominated by per-request round trips rather than raw bandwidth, so maybe 2 hours even on your 250 Mbit/s connection; add the database rebuild itself and you get to something like 6 hours (see the rough arithmetic below). However, what does not make much sense is the downloading getting faster after 2 hours - the database rebuild should get slower as the database grows, so this acceleration must have come from the network, not from Duplicati limiting the speed. So the wild guess I would make is that your network may be unreliable at some moments. Moving big amounts of data is a sure way to trigger problems when something on the network is unreliable, so having trouble only with your big backup at least seems to make sense. Not sure if you can get further by retrying the database recreate after setting the HTTP timeout to 5 minutes (or some other value) and watching what happens with netstat -s during the process. It sure would be painful.
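
As a very rough back-of-the-envelope check (the 20 KB file size and the ~0.2 s of per-request overhead are just guesses):

    35000 × 20 KB ≈ 700 MB ≈ 5.6 Gbit → 5.6 Gbit / 250 Mbit/s ≈ 22 s  (bandwidth alone)
    35000 × 0.2 s ≈ 7000 s ≈ 2 h  (sequential round trips)

so with that many small files the time is dominated by round trips rather than bandwidth, which also fits the downloads arriving one at a time.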

Duplicati could benefit from better debugging at the backend level. It would allow external techies like your Backblaze support to get a better view of what is happening at the client level - it sure beats people working at the client level trying to divine what is happening at the server level, something that is covered by business confidentiality. It should be possible in this case because the underlying library - Newtonsoft - is very mature and has support for debug tracing. If I had a hundred heads and two hundred hands, I'd get to work on it immediately and it would be in the next Canary :slight_smile:

It is very likely that I was wrong about the particular file type. I admit there is still a lot to learn.

I just found that I kept one of the messages. Here is what I was observing in the live log in the UI.
This was visible yesterday, at the very beginning, when I started recreating the DB.

20231231T200000Z.dlist.zip.aes attempt 4 of 5 failed with message: Error: ConnectFailure (Connection refused) 

Needless to say, there are no errors now.

Yes - these were small files that were arriving 1 at a time, and then the speed increased.
Total repair time:
Jan 2, 2024 11:46 PM - Operation: Repair
Duration 05:48:24

I could not detect any issue. I tested multiple times with https://www.speedtest.net/ and it never slowed down. It is a fiber optic connection and it has been quite reliable. If there were any issues, most likely they were somewhere else.

Deleting the DB again could set me back 12 hours or more:
close to 6 hours for the repair,
the first backup took close to 6 hours 30 minutes (completed okay),
and the normal scheduled backup took 35 minutes (this is very much what I expect to see).

What I could do, however (not sure if there would be a benefit), is import the backup configuration on a second machine. My backup runs daily, and I must ensure that the 2 are not going to collide.

For what it's worth, yesterday I was using the UI on one machine and the CLI on the other.
Both were having issues.
And here are some results from a CLI test today:

bojo@serval2:~$ duplicati-cli compare "b2***" 0 1 --full-result

Enter encryption passphrase: 
  Listing remote folder ... (yesterday I had at least 2 of these and after 5 it would fail)
  Downloading file (12.85 MB) ...
  Downloading file (12.85 MB) ...
Listing changes
  0: 1/1/2024 12:06:46 AM
  1: 12/31/2023 9:44:03 PM

Size of backup 0: 1.50 TB

Before executing the above

bojo@serval2:~$ netstat -s | grep -i retrans
    18627 segments retransmitted
    TCPLostRetransmit: 1538
    14575 fast retransmits
    3 retransmits in slow start
    TCPSynRetrans: 486

During the process

bojo@serval2:~$ netstat -s | grep -i retrans
    18651 segments retransmitted
    TCPLostRetransmit: 1552
    14575 fast retransmits
    3 retransmits in slow start
    TCPSynRetrans: 486
bojo@serval2:~$ netstat -s | grep -i retrans
    18657 segments retransmitted
    TCPLostRetransmit: 1556
    14575 fast retransmits
    3 retransmits in slow start
    TCPSynRetrans: 486

After completion

bojo@serval2:~$ netstat -s | grep -i retrans
    18658 segments retransmitted
    TCPLostRetransmit: 1556
    14575 fast retransmits
    3 retransmits in slow start
    TCPSynRetrans: 486

I am glad it works now.
Thank you all!

Don't forget to immediately disable the schedule before saving the imported job… and possibly also disable the source if it's compatible with your second computer - that is, if it's something like C:, it could exist on both computers, so a wrong click on the 'backup' button could bring problems.

This completed in
Duration 03:45:56
compared to yesterday
Duration 05:48:24

did not observe any timeout in the UI - did not have to adjust the timeout
I did observe initial slowdown as yesterday - it was always fetching the small files a few at a time.

Over time (for the duration of the restore) this changed

From

bojo@serval2:~$ netstat -s | grep -i retrans
    18627 segments retransmitted
    TCPLostRetransmit: 1538
    14575 fast retransmits
    3 retransmits in slow start
    TCPSynRetrans: 486

to

bojo@serval2:~$ netstat -s | grep -i retrans
    19111 segments retransmitted
    TCPLostRetransmit: 1671
    14575 fast retransmits
    3 retransmits in slow start
    TCPSynRetrans: 497

The change was gradual - a bit at a time.
Also, I could not leave the machine dedicated to Duplicati;
there were items like a browser and an email client still active.

This machine is much faster compared to the original, but I don't think the processor speed makes much of a difference here.

Now I think that the initial slowdown is just the usual Duplicati behaviour: it first imports the dlist files, then the dindex files, and finally, if loading the dlist and dindex files is not conclusive, some or all of the dblock (actual data) files. The dlist files are more complex to handle than the dindex files, so they are slower to load.

I am sorry - I made a typo yesterday and only realized it now. I meant to say "I did not observe…"

Either way - I have 2 good backups and the system appears to be working now.