Tales of my own stupidity


#1

I’ve been doing various tests with Duplicati on some smallish datasets and found it to work just fine as a CrashPlan replacement.
So I went on and tried it on the whole set of files I want to back up, totaling around 500 GB in about 653,000 files.
As the upload speed at home is deplorable, I used local file storage, with the intent of moving the disk to a location where the upload speed is better and then sending the backup files (dblock, dlist, dindex) to B2 using rclone.
The local backup took 4.5 days while the upload itself only took 8 hours.
However, while experimenting with the rclone command line, I somehow mixed up the arguments, and about 100 files were deleted from the hard drive before I had a chance to stop my botched command.
Clearly, I should have double-checked the command line before running it, but that’s why this message is a tale of my own stupidity.
And for good measure, I still uploaded all the remaining files to B2.
So, the current situation is that I have a broken backup both on the local hard drive and in the B2 bucket.
I went on and read the documentation here:
https://duplicati.readthedocs.io/en/latest/08-disaster-recovery/

And because I believed only dindex files were missing, I clicked the “repair” option in the web interface. That was last Friday at 23:00 Paris time, and it’s still running!
So basically, repairing the database is taking far longer than redoing the entire backup would.

What have I missed?
What can I try?
The SQLite file for that job is about 1.2 GB in size; could this explain the lengthy repair?

Thanks for reading.


#2

I don’t know that you missed anything. While repairs are generally much faster than a database Recreate, some people do report them taking a long time.

If the files you accidentally deleted include dblock files, then most likely the repair will identify the missing files but not actually resolve the issue.

A missing dblock file can’t be replaced, so you’ll likely have to try running list-broken-files (to see what you’ve lost) and then purge-broken-files to actually purge them from the backup.

Note that even in its current state you can still restore (what’s restorable) from the backup, so consider that before trashing it.


#3

Well, it is still running as I type this, and the progress bar in the WebUI has not moved a bit in all this time.
What can I do to get a sense of whether or not it is close to finishing?
Is there somewhere I could look to see what’s taking it so long?
As I’m building the solution myself (the sources are 2.0.3.6_canary_2018-04-23), could there be a set of changes that I could apply to fix these overly long delays?
Or, better, what tracing elements could I add to give feedback to the team as to where it’s spending so much time?

I’m really willing to help find the root cause of those delays so that they can be fixed.


#4

OK, it’s been 18 days and it is still stuck at the same progress step, eating CPU, so I decided to move things along and asked for a thread dump by sending SIGQUIT to the mono process.
Most threads are simply waiting for things to happen, but there is one thread that seems to be running and has this stack trace:

"WorkerThread<IRunnerData>"  at <unknown> <0xffffffff>
  at (wrapper managed-to-native) FasterHashing.InteropOpenSSL10_libssl_so_1_0_0.EVP_DigestUpdate (intptr,byte[],uint) [0x0000b] in <787629f3dc434608a1b2bf412946d433>:0
  at FasterHashing.OpenSSL10_libssl_so_1_0_0_HashAlgorithmSHA256.HashCore (byte[],int,int) [0x0001b] in <787629f3dc434608a1b2bf412946d433>:0
  at System.Security.Cryptography.HashAlgorithm.ComputeHash (byte[],int,int) [0x00068] in <a83e350a7a6e4b22b8caffb7f2e8a60c>:0
  at Duplicati.Library.Main.Operation.RepairHandler.RunRepairRemote () [0x00db6] in <273b7499be7146249f50f02ffdc4dabb>:0
  at Duplicati.Library.Main.Operation.RepairHandler.Run (Duplicati.Library.Utility.IFilter) [0x001aa] in <273b7499be7146249f50f02ffdc4dabb>:0
  at Duplicati.Library.Main.Controller/<>c__DisplayClass18_0.<Repair>b__0 (Duplicati.Library.Main.RepairResults) [0x0001d] in <273b7499be7146249f50f02ffdc4dabb>:0
  at Duplicati.Library.Main.Controller.RunAction<T_REF> (T_REF,string[]&,Duplicati.Library.Utility.IFilter&,System.Action`1<T_REF>) [0x00124] in <273b7499be7146249f50f02ffdc4dabb>:0
  at Duplicati.Library.Main.Controller.RunAction<T_REF> (T_REF,Duplicati.Library.Utility.IFilter&,System.Action`1<T_REF>) [0x00008] in <273b7499be7146249f50f02ffdc4dabb>:0
  at Duplicati.Library.Main.Controller.Repair (Duplicati.Library.Utility.IFilter) [0x0001b] in <273b7499be7146249f50f02ffdc4dabb>:0
  at Duplicati.Server.Runner.Run (Duplicati.Server.Runner/IRunnerData,bool) [0x003f3] in <77df1fba921442aa947ecda2cb8bc432>:0
  at Duplicati.Server.Program/<>c.<RealMain>b__42_2 (Duplicati.Server.Runner/IRunnerData) [0x00001] in <77df1fba921442aa947ecda2cb8bc432>:0
  at Duplicati.Library.Utility.WorkerThread`1<Tx_REF>.Runner () [0x00203] in <a09b6028bcff428bb7c0bfd99903b001>:0
  at System.Threading.ThreadHelper.ThreadStart_Context (object) [0x00014] in <a83e350a7a6e4b22b8caffb7f2e8a60c>:0
  at System.Threading.ExecutionContext.RunInternal (System.Threading.ExecutionContext,System.Threading.ContextCallback,object,bool) [0x00071] in <a83e350a7a6e4b22b8caffb7f2e8a60c>:0
  at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext,System.Threading.ContextCallback,object,bool) [0x00000] in <a83e350a7a6e4b22b8caffb7f2e8a60c>:0
  at System.Threading.ExecutionContext.Run (System.Threading.ExecutionContext,System.Threading.ContextCallback,object) [0x0002b] in <a83e350a7a6e4b22b8caffb7f2e8a60c>:0
  at System.Threading.ThreadHelper.ThreadStart () [0x00008] in <a83e350a7a6e4b22b8caffb7f2e8a60c>:0
  at (wrapper runtime-invoke) object.runtime_invoke_void__this__ (object,intptr,intptr,intptr) [0x0004d] in <a83e350a7a6e4b22b8caffb7f2e8a60c>:0

It seems as if it’s taking ages to compute the SHA256 hash of a block or an index file.
So I’m wondering whether it’s actually stuck in an infinite loop somewhere around the hashing method, or whether an extremely large size is being passed to the hashing mechanism.
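
For what it’s worth, hashing a single 100 KB block with the plain framework call from the stack trace is nowhere near slow enough to explain this on its own. A quick timing loop like the one below (plain .NET, nothing Duplicati-specific) comes out at around a millisecond or less per block even on old hardware, so either something enormous is being fed to ComputeHash, or it is being called an enormous number of times:

    using System;
    using System.Diagnostics;
    using System.Security.Cryptography;

    class HashSanityCheck
    {
        static void Main()
        {
            // One 100 KiB buffer, i.e. the default Duplicati block size.
            var block = new byte[100 * 1024];
            new Random(42).NextBytes(block);

            using (var sha = SHA256.Create())
            {
                const int rounds = 10000;
                var sw = Stopwatch.StartNew();
                for (int i = 0; i < rounds; i++)
                    sha.ComputeHash(block, 0, block.Length);
                sw.Stop();

                // ~1 GB of data in total; even an old CPU gets through this in
                // a handful of seconds, i.e. roughly a millisecond per block.
                Console.WriteLine("{0} blocks hashed in {1} ms", rounds, sw.ElapsedMilliseconds);
            }
        }
    }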

Any opinion here?


#5

A stack trace is a static view, so it’s hard to distinguish “it never moved” from “it’s back here again”.

I wonder if you could spot anything moving (or not) by looking at the menu’s About --> Show log --> Live --> Profiling.

What OS is this? Some have tools aimed at looking at I/O usage (e.g. iotop or sar) or low-level call tracing…

For a lower-tech approach, is the repair reading from the local hard drive, and if so, does that drive have an activity LED?


#6

I understand that this is just a snapshot, but before stopping the repair, I had to try to see what’s going on.
Looking at the profiling log, the latest entry is from August 16 at 11 PM, which is when I started the repair. The entry is for a SELECT query that took 13 seconds to retrieve the missing blocks.
This is running on an (ancient) Linux system where I do have activity LEDs for the hard drive, and clearly they are not lit up the way they were while the initial backup was running for 4 days.
I also tried iotop, and it confirms what the LEDs show: there is next to no I/O activity.

My next idea is to stop the repair and edit the RepairHandler sources to add as much console logging as I can, to get a better picture of what’s running and what values are being passed to the various functions that are called.
But I’m not too keen on doing that, as I’m afraid it would leave the database in the “repair in progress” state, making it unusable.
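
What I have in mind is nothing fancy, just plain Console.WriteLine calls dropped around the suspicious spots, something like the sketch below (the helper and the values logged are purely illustrative; the real variable names in RepairHandler.cs will differ):

    using System;

    // Tiny trace helper I would drop into RepairHandler.cs for the duration of
    // the experiment; timestamps make it easy to see where the time goes.
    static class RepairTrace
    {
        public static void Log(string fmt, params object[] args)
        {
            Console.WriteLine("{0:HH:mm:ss.fff} [repair] {1}",
                DateTime.Now, string.Format(fmt, args));
        }
    }

    class Example
    {
        static void Main()
        {
            // The kind of calls I would sprinkle around the suspect loops
            // (the volume and file names here are made up):
            RepairTrace.Log("starting remote repair pass");
            RepairTrace.Log("processing missing volume {0}", "duplicati-b0001.dblock.zip");
            RepairTrace.Log("hashing source file {0}", "/home/user/some/file.dat");
        }
    }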


#7

Your interest in figuring this out is wonderful. You have the sources and build tools. If you have a debugger, that might reduce the need to instrument the source. Sometimes the JIT optimization needs to be subdued in order to get meaningful debug results. Stopping the repair may not be necessary if you can just attach to the running process and hope breakpointing will offer clues. I’m not aware of a way to un-optimize the code while it’s running; however, there are options for that if you restart, and especially if you do another source rebuild.
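
For example, if you do another source rebuild, you can pin a suspect method with the standard MethodImpl attribute so the JIT leaves it alone. This is plain .NET rather than anything Duplicati-specific, and the method below is only a stand-in:

    using System;
    using System.Runtime.CompilerServices;

    class DebugFriendly
    {
        // NoInlining/NoOptimization keep the JIT from rearranging this method,
        // so breakpoints, stepping and locals behave predictably in a debugger.
        [MethodImpl(MethodImplOptions.NoInlining | MethodImplOptions.NoOptimization)]
        static int SuspectMethod(int input)
        {
            int doubled = input * 2;   // a local you can actually inspect
            return doubled + 1;
        }

        static void Main() => Console.WriteLine(SuspectMethod(20));
    }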

An unusable database can sometimes be “fixed” with the “Recreate (delete and repair)” button on the web UI for the job; however, I’d suggest backing up the database first, because there’s also a chance it won’t work (too much damage, coding bugs, etc.). Note that a Recreate (even if perfectly coded) can only do so much given missing backend files, so there will still be something missing. A re-done backup would be cleaner.

Thanks for fighting the problem. If you get tired of not having a backup, you can probably export/import jobs. Duplicati can also run multiple copies of itself somewhat independently, although this can become confusing and isn’t usually advised, because the web service port moves (e.g. the service gets 8200, the Tray Icon gets 8300). The CommandLine executable runs no server inside. Keep the database areas separate if you decide to try this.


#8

Well, being a developer myself, I tend to dig into the sources if I know the language well enough. My main language is Delphi, but since I do C# from time to time, I’m not in unknown territory here.
However, I’m used to the Visual Studio integrated debugger, and because MonoDevelop won’t compile on the ancient beast I’m using, I’m stuck with gdb, which is really tough to work with.
I did manage to get the stack traces for all running threads, but I also managed to crash the process itself…
So I went back to the good old ugly technique of writing to the console, and I found something interesting inside RepairHandler.cs: it takes 55 seconds to do the Common part, and then the Remote part never ends. In my case there are only missing volumes, and it gets stuck at the very first one, which happens to be of type RemoteVolumeType.Blocks.
At line 353, it iterates over the result of GetSourceFilesWithBlocks, and that loop goes on for what seems like forever. It isn’t actually forever; I stopped it after it had read 15,938,851 blocks. With a block size of 100 KB, and because each block only ever has one source, this means about 1,520 GB of data had been hashed…
Clearly, something is wrong here, because my entire backup set is only around 500 GB!
To push things further, I stored the result of GetSourceFilesWithBlocks in a local variable and called the Count() method to output the number of items in the collection.
And to my not-so-great surprise, this led after a while to an arithmetic overflow exception. I believe this is because the counter went above the maximum value for an integer, which is a totally unrealistic number of items.
It’s as if the collection iterator never says “I’m done”.
I had a look at the source code for GetSourceFilesWithBlocks, which I found inside LocalRepairDatabase.cs, and it seemed innocuous enough: it simply runs a SQL query that obviously cannot return that many items.
But the loop on bs.Done seems to be the culprit here, and I got a bit lost trying to understand how it could possibly be set to true, as it appears this happens inside the Sources getter, which seems totally counter-intuitive to me.
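
To check that I’m reading that pattern correctly, here is a stripped-down model of what bothers me. This is my own toy code, not Duplicati’s, but it shows how a Done flag that is only set while the Sources getter is being enumerated stays false forever if the consumer stops short of the end:

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Toy model (my names, not Duplicati's): Done is only set as a side effect
    // of fully enumerating the Sources getter.
    class BlockSource
    {
        private readonly Func<IEnumerable<string>> m_query;
        public bool Done { get; private set; }

        public BlockSource(Func<IEnumerable<string>> query) { m_query = query; }

        public IEnumerable<string> Sources
        {
            get
            {
                // Re-runs the query on every enumeration; Done only flips if
                // the caller walks all the way to the end of the sequence.
                foreach (var s in m_query())
                    yield return s;
                Done = true;
            }
        }
    }

    class Demo
    {
        static void Main()
        {
            var bs = new BlockSource(() => new[] { "fileA", "fileB" });

            int spins = 0;
            while (!bs.Done && spins++ < 5)   // capped at 5 here; unbounded for real
            {
                // The consumer only ever looks at the first item...
                var first = bs.Sources.First();
                Console.WriteLine("hashing " + first);
                // ...so Done never becomes true and "fileA" comes back every time.
            }
        }
    }

If the real loop ever abandons the inner enumeration early, the flag would never flip and the same file would keep coming back, which is exactly the kind of endless behaviour I’m seeing.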

Right now I can live with this backup in the state it’s in, so I can investigate further if you give me directions as to what you want me to instrument.


#9

Well, a bit more detail: it’s not the first volume that loops forever, it’s the second. The first one ends with an error saying it is missing 511 blocks and that I should use the purge-broken-files command.
Then the second one loops forever, so I modified my instrumentation a bit to output the name of the file that gets hashed by the code starting at line 366.
And as I guessed, it’s always the same file that gets processed over and over again. This explains the absence of disk I/O: the system has the file’s contents in its RAM cache.

But that does not explain why the loop gets stuck on that one particular file, whose name has nothing special about it: no spaces, no apostrophes, no commas, just plain ASCII.


#10

I have submitted the following issue: