Really frustrated with Duplicati's lack of feedback

Greetings,
tl;dr - The lack of visibility and feedback is killing Duplicati and making it hard to use for large scale backups. The use of sqlite is also problematic, because it doubles down on the lack of visibility into the work actually being done.

Rant++ follows:

I started trying to use Duplicati recently, backing up a low-end dedicated server to BackBlaze B2. My first attempt was an unmitigated disaster. I gave it everything it should back up; it decided there was ~1.6TB when there’s only ~1.2TB, across ~5 million files, which is about right. But it backed up so incredibly slowly that eventually…it…slowed…to…zero…bytes…and finally I had to kill it, because it was running 99.6% of one CPU, and there were no logs in the UI, no data transfers, no updates…nothing.

Fine, I read some folks saying that it can’t handle that, so I split it up into a bunch of smaller jobs. Okay, so jobs run in the order you created them if they’re scheduled and past-due, and the first job was my old one, so it failed again. I’m getting really used to systemctl restart duplicati. Around here is where I learned that I have to tell it not to immediately restart running jobs, and to spend ~2 minutes paused once it starts up, otherwise it just locks up again. :frowning:

Okay, I’ve descheduled all the jobs, and made them run-on-demand only. Great! One of the jobs seems to have finished…only…it…slowed…to…0 bytes per second…at…the…end. Again, no feedback, no information, no data transfers happening, just 99.6% cpu.

Okay. Let’s dig some more. I found out that running ANALYZE on the sqlite3 database might speed things up. Let’s try that. HEY! IT WORKED! That backup completed, albeit only ~13GB of data…but…the…second…job…did the same thing. Okay, so restart duplicati, analyze the database, and…hey, it went much faster!

Only now it sits at ‘finishing backup…’ with…you guessed it, no network traffic, no log files, no UI updates. So I bounced it, tried analyze again, and when it started up…

Completing previous backup …

Okay, what’s it doing? No. Idea.

Fine, let’s turn on logging to a file, set it to verbose, turn on profiling (whoops, that turns off verbose!) and see what it’s doing in detail. Only no, it’s not…logging…anything. Six. Hours. Nothing logged, >99% CPU usage on one core, and nothing logged.

We’re talking about one job, 126GB, 201,048 files, and it can’t be arsed to tell me what the hell it’s doing?

Last message I got before the 6 hour delay:

[Information-Duplicati.Library.Main.Operation.FilelistProcessor-KeepIncompleteFile]: keeping protected incomplete remote file listed as Temporary: duplicati-20240416T040635Z.dlist.zip.aes
[Information-Duplicati.Library.Main.Operation.Backup.UploadSyntheticFilelist-PreviousBackupFilelistUpload]: Uploading filelist from previous interrupted backup

Okay, but…there’s no network traffic. It’s not doing anything. It’s not logging anything. I don’t know what’s going wrong, because it’s not logged, sent to the UI, or tracked in any way.

I’ve been fighting with this for days, and I’m exhausted. gdb -p {pid} and thread apply all bt gives me this, among other threads:

Thread 16 (Thread 0x7f13b26dd700 (LWP 9284)):
#0  0x00007f13b2f9a174 in sqlite3VdbeExec () from /lib64/libsqlite3.so.0
#1  0x00007f13b2f9dcff in sqlite3_step () from /lib64/libsqlite3.so.0
#2  0x000000004187d570 in ?? ()
#3  0x00007f13ba5408d8 in ?? ()
#4  0x0000000000007530 in ?? ()
#5  0x00007f13ba540970 in ?? ()
#6  0x00007f137810ebf0 in ?? ()
#7  0x00007f13ba540970 in ?? ()
#8  0x00007f139c000fe0 in ?? ()
#9  0x0000000000000000 in ?? ()

Given that all the other threads are various forms of do_futex_wait and pthread_cond_wait (i.e. waiting on other things to happen), and that this is the only thread obviously attempting to do anything right now, it looks like it’s stuck in some kind of sqlite3 operation. If this were postgresql or something else, I could log in to the shared server and ask what the top queries are, but it’s not.

There’s no visibility. No feedback. The logging and UI are not helpful, because (apparently) sqlite just…locks up, and consumes a single core of CPU. Forever.

What can I do to help fix this? I like the UI, clunky as it is, and I like its intent; everything else seems so much worse. But it’s just not able to handle large backups, and it needs to be, because the data folks store is only getting bigger.

– Morgan

Edit: For reference, it’s not running out of memory, or swapping. The Duplicati.server.exe process is running at 0.5% of memory, and I have over 20GB of RAM still unused on the server. It’s just stuck in a truly awful, beyond-terrible SQL query that it somehow managed not to log before running it. :frowning:

1 Like

Welcome to the forum @cyberfox

Yes, it doesn’t scale well. Possibly the right developers can improve this, but they’re typically scarce.

Yes, sometimes a profiling log view in live log is necessary to figure out it has gotten into a slow query. Regarding SQLite, is the problem that the embedded DB resource use is hard to observe separately? Duplicati tries for a simple install without a separate database to install, maintain, maybe even backup.

Ah, you answered this later, going to the effort of running gdb to see where it was at, and commenting:

Recent comment from the primary developer:

and there’s plenty of other potential future stuff, however progress continues to be limited by resources.
Previously I would have said volunteers, but there’s now some company funding, so possibility of hiring.

Delivered as promised. I’ll try to reply, but some of this is known, yet limited by the available resources.

Yes, that’s one of the workarounds, and it’s effective against the block loading generated by lots of files.

A small number of large files can be kept under control with a large blocksize to keep the block count lower. Unfortunately, blocksize cannot be changed on an existing backup, and there is little guidance in advance.

The top item on the Options screen is “Remote volume size”, and it does try to steer anybody paying attention to the Remote Volume Size section of the manual. Above that section is a discussion of the block size, which could say more.

I’ve dropped hints that maybe block size deserves a similar visible question somewhere in job creation.

So there’s potential user manual work (for those who read it), potential GUI work (for those who read it), potential speedups (typically someone reports a slow spot, and maybe someone figures out what to do), after-the-fact workarounds as support (kind of late), and potential testing to steer all of the above.

There is typically a desire to make things reliable, and making them work fast is sometimes less urgent.

is an exception, but it’s probably being driven by .NET Framework and mono (and libraries for them, etc.) starting to fade away. Duplicati migration to newer .NET adds speed (but maybe not in SQL) as a bonus.

I’m surprised it got to 0. Seeing it go up and down is normal. It’s an average speed, and uploads are occasional. During non-initial backups, you might see the average slowly sink, then rise again as a remote volume goes up.

Run PRAGMA optimize upon closing database connection #3745
is (I think) where things stand now, and there was a discussion of whether to analyze or rely on optimize.
PRAGMA optimize is (at least in current docs) also suggested every few hours; I don’t think we do that, though. The run at close time also doesn’t happen (I assume) if the process gets killed rather than closing normally, so maybe optimize should at least be run at open, even if that’s normally redundant with the run from a clean close.
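
Running it at open would be a small change; a rough sketch of the idea (using Microsoft.Data.Sqlite here for illustration, which is not Duplicati’s actual data layer):

```csharp
using Microsoft.Data.Sqlite;

static class JobDatabase
{
    // Hypothetical helper: run PRAGMA optimize right after opening, so a prior
    // unclean shutdown (which skipped the close-time run) is still covered.
    public static SqliteConnection Open(string path)
    {
        var conn = new SqliteConnection($"Data Source={path}");
        conn.Open();
        using (var cmd = conn.CreateCommand())
        {
            cmd.CommandText = "PRAGMA optimize;"; // cheap when statistics are already fresh
            cmd.ExecuteNonQuery();
        }
        return conn;
    }
}
```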

This is pretty close to the rough rule of thumb of blocksize being tuned for about 100 GB per backup (roughly 1 million blocks at the default 100 KB blocksize), but it’s not something that’s been well explored, given the limits of equipment and available personnel to go explore. There are also special situations that are the exception to the general rule, so let’s see what’s coming next:

disable-synthetic-filelist

--disable-synthetic-filelist = false
If Duplicati detects that the previous backup did not complete, it will generate a filelist that is a merge of the last completed backup and the contents that were uploaded in the incomplete backup session.

If it’s an actual SQLite bug, their team seems quite responsive, but they need an accurate description of the bug, which is a good step even if it turns out to be just Duplicati and a routine query that turns disastrous in query planning.

First step is a reproducible test case. Next is maybe a look at About → Show log → Live → Profiling, but the behavior of that is a little unreliable for me. It skips output; during a DB recreate, for example, this gets very obvious.

--log-file=<path> --log-file-log-level=profiling is a better start, and you can make an even huger file by adding

  --profile-all-database-queries (Boolean): Activates logging of all database
    queries
    To improve performance of the backups, frequent database queries are not
    logged by default. Enable this option to log all database queries, and
    remember to set either --console-log-level=Profiling or
    --log-file-log-level=Profiling to report the additional log data
    * default value: false

Once the query is known, it can potentially be isolated to run in something like sqlitebrowser for possible replication of the slow query, then one can run EXPLAIN QUERY PLAN to try to figure out why it got slow.
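
If a GUI tool isn’t handy, the same check can be scripted; a rough sketch with Microsoft.Data.Sqlite (the path and the query below are placeholders, and this is not Duplicati code):

```csharp
using System;
using Microsoft.Data.Sqlite;

// Print SQLite's plan for a suspect query against a *copy* of the job database.
// Replace the path and the SQL with whatever the profiling log showed.
var dbCopy = "/tmp/backup-job-copy.sqlite";
var suspectQuery = "SELECT name FROM sqlite_master WHERE type = 'table'"; // placeholder

using var conn = new SqliteConnection($"Data Source={dbCopy};Mode=ReadOnly");
conn.Open();
using var cmd = conn.CreateCommand();
cmd.CommandText = "EXPLAIN QUERY PLAN " + suspectQuery;
using var reader = cmd.ExecuteReader();
while (reader.Read())
    Console.WriteLine(reader.GetString(reader.FieldCount - 1)); // the "detail" column

// Look for "SCAN" over big tables where a "SEARCH ... USING INDEX" was expected.
```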

Sometimes one rewrites it, sometimes one adds indexes. I’m not an SQL guru, but we could use some.

If there’s something going wrong that’s not SQL, the profiling log (or a verbose log, or even less) can help because the GUI (thankfully?) isn’t throwing all that stuff at you all the time. But sometimes you need that.

EDIT 1:

A developer would have to weigh in on this, but a possible alternative to the constant flood of internal detail which the profiling log tends to create (especially with `profile-all-database-queries`) would be logs-as-needed: operations that take too long get noticed, and a note gets made somewhere other than a job database, because job databases get rolled back on kills due to the lack of a clean commit if the job doesn’t finish completely.

I’m not sure of your technical background (looks better than average), but there are lots of chances to help.

EDIT 2:

Maybe something can also show in the GUI, but I’m not sure how to give meaningful info on internal status. Status bar could probably use a few more phases, but it likely can’t show every potentially slow SQL query. Putting timeouts in various spots likely to be slow or hang could also be handy for things like backend work.

EDIT 3:

seems to be where the timing information comes from, but it just waits for Dispose instead of monitoring. While I’m not a C# developer, I’m pretty sure it would be possible to notice a slow query while it’s running.

is an example of how a query gets into SQLite, and there are a finite number of them, all using this pattern. Results of the queries vary, and one might have to be careful of a slow consumer looking like a slow query…

2 Likes

Greetings,
So I noticed this just before I’m going to sleep, unfortunately. Here’s a rough question; you seem to have some insight into the code…

When it logs the timing information, I see that it writes the ‘finished’ block after it’s done. That’s no help for queries that never end, and the server has to be killed. As a test, I’ve been letting the Completing previous backup run.

-rw-r--r--. 1 root root 26659 Apr 16 17:54 duplicati-piclog.log
[root@my ~]# date
Thu Apr 18 00:47:48 PDT 2024

So far it’s at around 32 hours, I think?

Anyway, what I was going to ask is this…a common mistake people make when using logging tools (or, heaven forfend, writing their own) is forgetting that when you’re logging for debugging purposes like profiling, you have to flush after every write. Some logging frameworks allow for ‘unbuffered’ writing, which is just as good. Do you know if the log methods are unbuffered, or flush after each write?

My instinct tells me no, because I don’t see profiling timing in the log file until it’s done, which is too late for queries that have pathological behavior. My C# has been limited to some Unity development for fun, and helping convert away from it for work, so I don’t know enough about the core logging frameworks, presuming that’s what’s being used.

Any insights?

– Morgan

This is “roll-your-own” logging :scream: but it does flush after messages, so you can use tail -f <path> to see the progress. Using --log-level=profiling will include a starting message for all queries to catch this particular issue.

You should see a message like this:

Starting - ExecuteQuery: SELECT ....

This will be written and flushed before the actual query is executed.
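
Roughly, the file sink behaves like this (a simplified sketch, not the actual logging code):

```csharp
using System;
using System.IO;

// Toy log sink: every message is written and flushed immediately, so that
// `tail -f` sees the "Starting - ..." line before the slow work begins.
sealed class FlushingFileLog : IDisposable
{
    private readonly StreamWriter _writer;

    public FlushingFileLog(string path)
    {
        _writer = new StreamWriter(path, append: true) { AutoFlush = true };
    }

    public void WriteMessage(string level, string message)
    {
        // AutoFlush pushes each line to the OS right away, at the cost of more small writes.
        _writer.WriteLine($"{DateTime.Now:yyyy-MM-dd HH:mm:ss zz} - [{level}]: {message}");
    }

    public void Dispose() => _writer.Dispose();
}
```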

I think it was a good choice some 10+ years ago because it keeps the memory usage low and performs relatively fast B+ tree lookups. But it does not scale well with 1+ TiB backups, which are more common today. I am aware of the scaling issues, and intend to work on it soon-ish.

1 Like

As a side note, I suspect some query logging got lost in the profile-all-database-queries PR.
Seemingly, only LocalBackupDatabase.cs got updated to explicitly pass the new writeLog parameter.
Regardless, try to find INSERT INTO "IndexBlockLink" in a profiling log, even with all queries on.
Lack of query logging impedes debugging.

It looks from the PR like the intention was that code which does not pass the writeLog boolean parameter gets a default of true, so unless there is a codepath I am missing, it should be logged.
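
For reference, the mechanism is just a C# default parameter value; a toy example, not the real method signature:

```csharp
using System;

class QueryRunner
{
    // Call sites that omit writeLog get the default of true, so logging
    // is only skipped where a caller passes false explicitly.
    public void ExecuteQuery(string sql, bool writeLog = true)
    {
        if (writeLog)
            Console.WriteLine($"Starting - ExecuteQuery: {sql}");
        // ... run the query here ...
    }
}

class Demo
{
    static void Main()
    {
        var runner = new QueryRunner();
        runner.ExecuteQuery("SELECT 1");                  // logged: writeLog defaults to true
        runner.ExecuteQuery("SELECT 1", writeLog: false); // explicitly silenced
    }
}
```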

Optional arguments seems to say default can be given, but I’m having trouble seeing it.
Optional parameters and arguments gives other examples and talks about overloading.
Regardless, these topics are beyond me, so I’ll just mention the issue exists somehow.

I don’t know if the Profiling live log is always able to come to life to see what has already happened, but if not, an out-of-the-way-and-already-user-unfriendly place to read might be About → System Info, where Server state properties is already one place to look for clues, e.g. to distinguish an upload backlog from slow production of files to upload. There are already lines that get far too long, and the selection goes away when the screen refreshes, so the capture method is to quickly select and copy the whole screen (e.g. Ctrl-A Ctrl-C) and paste into a text editor.

The current database query could probably just show up as a line there, and refresh may show it still running.

Ideally I’d prefer database queries to be identified by source file and line number, like C and C++ can do, but it sounds like C# needs pdb files around to pull that off from a stack trace. Lacking such info, one can guess using source search.
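
(Though, if I read the language docs right, C#’s caller-info attributes can capture the call site at compile time, without pdbs; a rough sketch of the idea, not something Duplicati does today:)

```csharp
using System;
using System.IO;
using System.Runtime.CompilerServices;

static class TaggedQuery
{
    // The compiler fills in the caller's file and line at each call site,
    // so a profiling log could say which source line issued the SQL.
    public static void LogStart(string sql,
        [CallerFilePath] string file = "",
        [CallerLineNumber] int line = 0)
    {
        Console.WriteLine($"Starting - ExecuteQuery ({Path.GetFileName(file)}:{line}): {sql}");
    }
}

// Usage: TaggedQuery.LogStart(sql); prints the calling file and line along with the query.
```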

Being able to easily identify queries would also help with profiling, e.g. to add up all the time used by one query with varying parameters. Right now, long queries stand out, but a million runs of a fast query don’t get a total time.

Did you happen to use some tool such as top that breaks that down into different components such as

           us, user    : time running un-niced user processes
           sy, system  : time running kernel processes
           ni, nice    : time running niced user processes

The pidstat command can do that too, and show other activity. On Windows, SQLite can start thrashing because the default cache size is too small. The OS can soak up the reads, so the disk seems idle; however, SQLite is actually getting really slow because it’s constantly churning through pages, invisibly…

On Windows, you can view this in Process Explorer. I’m not sure if Linux has similar statistics and tools.

SQLite has a default cache size of 2 MB, which pales in comparison to a big database. It can be raised with
CUSTOMSQLITEOPTIONS_DUPLICATI=cache_size=-20000 or similar in the environment, e.g. in /etc/default/duplicati for a systemd service. A negative cache_size is in KiB, so the above raises it to about 20 MB, but if you have lots of RAM, you can give it lots.

I’m not sure if cache is your situation though (because there’s so little info), but for some situations it is…

Greetings,
Here’s the pidstat for the process:

[root@my ~]# pidstat -p 9275 1 5
Linux 3.10.0-1160.49.1.el7.x86_64 (my.cyberfox.com) 	04/18/2024 	_x86_64_	(8 CPU)

02:48:10 PM   UID       PID    %usr %system  %guest    %CPU   CPU  Command
02:48:11 PM     0      9275   90.00   10.00    0.00  100.00     1  mono-sgen
02:48:12 PM     0      9275   90.00   10.00    0.00  100.00     1  mono-sgen
02:48:13 PM     0      9275   91.00    9.00    0.00  100.00     1  mono-sgen
02:48:14 PM     0      9275   90.00   11.00    0.00  101.00     1  mono-sgen
02:48:15 PM     0      9275   91.00    9.00    0.00  100.00     1  mono-sgen
Average:        0      9275   90.40    9.80    0.00  100.20     -  mono-sgen

I will say that it did eventually continue. The point of discontinuity in the logs is this:

2024-04-16 17:54:20 -07 - [Information-Duplicati.Library.Main.Operation.FilelistProcessor-KeepIncompleteFile]: keeping protected incomplete remote file listed as Temporary: duplicati-20240416T040635Z
.dlist.zip.aes
2024-04-16 17:54:20 -07 - [Information-Duplicati.Library.Main.Operation.Backup.UploadSyntheticFilelist-PreviousBackupFilelistUpload]: Uploading filelist from previous interrupted backup
2024-04-18 14:14:10 -07 - [Information-Duplicati.Library.Main.Operation.Backup.RecreateMissingIndexFiles-RecreateMissingIndexFile]: Re-creating missing index file for duplicati-bcaed7c15a15b44f6a12b1
5da5225e90b.dblock.zip.aes

Between that message and

2024-04-18 14:29:40 -07 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-b6b2e149994454e9c8d60c5a4853141d8.dblock.zip.aes (111.75 KB)

some 15 minutes later, it dumped 759,344 timestamped lines into the log file. So ~800 lines per second. It feels like it was actually work done during the long-dark-teatime-of-the-sqlite-query, which only dropped to disk once the query was done. All of them were of the form:

2024-04-18 14:14:16 -07 - [Verbose-Duplicati.Library.Main.Operation.Backup.FileEnumerationProcess-IncludingPath]: Including path as no filters matched: /home/{filename}

So there are details again, but no profiling information. (Which very well might be my fault!)

Anyway, it’s back to doing this:

and it never recognized that the backup completed. That said, if I’m reading it right, for the last ~40 minutes it’s been processing a 1.7GB production.log file, which, in all fairness, probably should have been excluded.

But it’s stuck in sqlite world again, as far as I can tell:

Thread 1 (process 16498):
#0  0x00007f13b2f4a87d in sqlite3VdbeMemRelease () from /lib64/libsqlite3.so.0
#1  0x00007f13b2f6e6db in sqlite3VdbeMemFromBtree () from /lib64/libsqlite3.so.0
#2  0x00007f13b2f9bd00 in sqlite3VdbeExec () from /lib64/libsqlite3.so.0
#3  0x00007f13b2f9dcff in sqlite3_step () from /lib64/libsqlite3.so.0
#4  0x000000004187d570 in ?? ()
#5  0x00007f13ba70b980 in ?? ()
#6  0x0000000000007530 in ?? ()
#7  0x00007f13ba70c548 in ?? ()
#8  0x00007f137810ebf0 in ?? ()
#9  0x00007f13ba70c548 in ?? ()
#10 0x00007f13b12c69f8 in ?? ()
#11 0x0000000000000000 in ?? ()
(gdb)

I’ll see if I’ve screwed up the logging, since I think I have it set to both verbose and log profiling information, and something tells me those two are incompatible.

– Morgan

You can’t have both in one browser window (because it’s a dropdown), and I don’t know the internal coding.

There is a chance that a Duplicati expert can comment on live log, but SQLite internals may be too difficult.

I’m not familiar with pidstat behavior, but the fact that there’s some system time mixed in might mean I/O. Windows Process Monitor is good at seeing that. On Linux, I suppose strace can, but wow this is too deep.

Maybe we can get a comment from an expert on whether or not you should be able to get a profiling log up.

By the way, both the status bar phases and the log lines can unfairly blame whatever named itself last, since a message that only announces a start, with no announce of a finish, gets credited with all activity until the next start.

2024-04-16 17:54:20 -07 - [Information-Duplicati.Library.Main.Operation.Backup.UploadSyntheticFilelist-PreviousBackupFilelistUpload]: Uploading filelist from previous interrupted backup

Having said that, I wonder if the synthetic file list took a really long time, or if it was something run afterwards?

2024-04-18 14:14:10 -07 - [Information-Duplicati.Library.Main.Operation.Backup.RecreateMissingIndexFiles-RecreateMissingIndexFile]: Re-creating missing index file for duplicati-bcaed7c15a15b44f6a12b1
5da5225e90b.dblock.zip.aes

I think the above is pretty fast. A dindex compactly lists the contents of the dblock it’s associated with, so a database recreate ideally doesn’t need to download dblocks. If the DB is good, it can replace a missing dindex; however, sometimes the missing dindex has no recorded paired dblock, so the dindex is basically empty…

Another developer question. I think SQL runs on the main thread (maybe for serialization?), so it might stop other things that also need that thread. I’m really guessing about that, but maybe a developer can shed light…

The live log seems to not run forever. Have you tried restarting Profiling from scratch, to look for activity?

The internal logging actually supports multiple “sinks” each with different log levels, but you cannot set both on the commandline. Here the rule is “last option wins”.

I doubt it. The logging thread is separate from the SQLite thread, so I cannot see any reason they should block each other. Most likely, after the query completed, the listing of files was processed, and that is more or less one log message per file, as you see.

I would use the “Export as commandline” feature, then copy that command and add:

--log-file=<path> --log-file-log-level=Profiling

You can then run this from the commandline, which allows you to kill it right at the start.
You should quickly be able to scan the log file to see if you get any profiling messages at all.

If you add --profile-all-database-queries=true then you should see a lot of profiling messages.
The queries that are excluded by default are not expected to have any performance impact, so I would not think this is needed.

There are no internal SQLite logs possible. The way I would usually go about it, is first identify the problematic query (using the --log-file-log-level=Profiling option). Once I know what query is the problem, then load it into another tool, like SQLiteBrowser, and analyze the query.

That requires that one thread (that is not in SQLite) inspects the stack of another thread (that is in SQLite) and figures out that it has not just been repeating a query.

Alternatively, the code can be updated so the caller updates some shared state with a timestamp and the query it is running, so a reporting thread can look at that state and report if it is still in the same (long-running) query.

There is no support for this in the current code, but it would be nice to add, even if it only highlights the problem.

I don’t follow. When Duplicati starts a query, start a timer. If the timer fires before the query ends, act on that. Possibly SQLite can repeat a query internally, but watching timing at the Duplicati level still seems useful.
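
Roughly what I mean, as a sketch (I’m not a C# developer, so treat this as pseudocode in C# clothing; the names are made up and this is not Duplicati code):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

static class SlowQueryWatchdog
{
    // Wrap a query so that, if it is still running after `threshold`,
    // a warning with the SQL text is logged while the query keeps going.
    public static T Run<T>(string sql, Func<T> execute, TimeSpan threshold,
        Action<string> logWarning)
    {
        using var finished = new CancellationTokenSource();
        _ = Task.Delay(threshold, finished.Token).ContinueWith(t =>
        {
            if (!t.IsCanceled)
                logWarning($"Query still running after {threshold}: {sql}");
        }, TaskScheduler.Default);

        try { return execute(); }
        finally { finished.Cancel(); } // stop the timer if the query ended in time
    }
}

// Usage (hypothetical): SlowQueryWatchdog.Run(sql, () => cmd.ExecuteNonQuery(),
//     TimeSpan.FromSeconds(30), msg => Console.WriteLine(msg));
```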

Regardless, just tossing out some possible ideas on ways to find these things for support or fixes.