If possible, it would be good to log these, to see whether they sometimes correlate with problems that show up later.
If you clicked on my links to issues, you can see that I was starting to try a process-killing stress tester.
I’ve cut back on that due to lack of time (not enough forum volunteers) and the difficulty of isolating issues.
As mentioned, developers want reproducible test cases. That’s a level beyond initial random kill tests…
Historical and original-design questions are hard to answer because there is limited or no access to the people who would know; however, my impression is that Duplicati takes extensive steps to be able to recover. Not all of them were perfect. For example, after an interrupted backup, the next backup builds a “synthetic filelist” composed of the backup before the interrupted one plus whatever the interrupted one managed to finish. A coding bug before 2.0.6.x caused that to fail. One area that is also improving is the ability of the Stop button to drive shutdown: 2.0.5.x could probably do “Stop after current file” safely, and 2.0.6.x thinks that it made “Stop now” safe…
Hard-kill safety is somewhat a different neighborhood, and HUP safety is a little different again, because systems often follow a HUP with a TERM or worse if the process can’t shut down fast enough. With uploads midway, it’s hard. The best you can do is try to pick up the pieces later, which Duplicati does try. Are there gaps? Quite possibly.
Last I tried looking, I didn’t see signal handling, but if one were to add it, what should the handler do? Transaction rollback at the next backup is handled by SQLite (we hope), and a last-second Duplicati commit is potentially even worse than letting SQLite do a rollback. Transactions (and SQL in general) also aren’t in my expertise.
Good practice, and it probably required some testing. Do you use a kill tester, then clean up any gaps it finds? Testing isn’t every developer’s thing. Some are lucky enough to have test groups; Duplicati could use one. There’s some unit testing to help find regressions, but I’m not sure how “rough” it is on the software. Ideally the test code or test team would beat up the software pretty hard, just as the real world might…
Some people are good at breaking software and then characterizing the failure well, to assist the developers. I try to help. You actually seem to have systems and logs, so you are in a good position, but it does take a lot of work.
I think that as Duplicati matures, it’s improved from “falls over all the time, all by itself” to being pretty good at mainline backup (while still having difficulty with the rarer compact). Increasingly rare situations get increasingly hard.
How do you debug rare mystery problems? Here, we rely on logs and maybe databases, neither of which is as complete as it could be. Sometimes logging at the ExplicitOnly level, or special-purpose logging, is used.
Having a DB bug report compatible with the SQL used in production would help, but sanitization renames the tables. One time I wrestled with altering a received DB to match, but it would be good not to need that. I might open an Issue.
I’m a bit afraid of design bugs that might be hard to fix. For example, there are lots of active threads doing their work simultaneously, and concurrency is frequently confusing to humans (and even to debugging tools).
Some of the concurrent work involves the database, but the commit/rollback model seems to fit a simpler sequential design better. I am not a DB professional, and I suspect none have helped with the general design.
Specifically, I haven’t been able to understand the design well enough to know whether a non-commit from thread A can be circumvented by a commit from thread B (because it’s time for thread B to commit, even if not for thread A). This is where Duplicati needs someone with database skills, plus at least a bit of C# reading, to go and look.
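To illustrate the worry in plain SQLite terms (a generic sketch, not Duplicati’s actual connection handling, which may differ): when two threads share one connection, a commit issued by either thread commits everything pending on that connection, because the transaction belongs to the connection, not to the thread that wrote the rows:

```python
import sqlite3
import threading

# One connection shared by two threads; SQLite transactions belong to the
# connection, not to the thread that did the writing.
conn = sqlite3.connect(":memory:", check_same_thread=False)
conn.execute("CREATE TABLE work (owner TEXT)")
lock = threading.Lock()

def thread_a():
    with lock:
        conn.execute("INSERT INTO work VALUES ('A')")  # A is not ready to commit yet

def thread_b():
    with lock:
        conn.execute("INSERT INTO work VALUES ('B')")
        conn.commit()  # commits B's row AND A's pending row, ready or not

ta = threading.Thread(target=thread_a)
ta.start(); ta.join()
tb = threading.Thread(target=thread_b)
tb.start(); tb.join()

conn.rollback()  # too late: nothing is pending, both rows are already durable
rows = sorted(r[0] for r in conn.execute("SELECT owner FROM work"))
```

Whether Duplicati’s threads actually share connections or transactions this way is exactly the question I can’t answer without that database-plus-C# reviewer.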
Development work on issues has definitely been confused by the who-did-it question (e.g. when profiling logs show a certain SQL operation), but my question is deeper than that. I don’t know if anyone’s dug that far…
Did it seem good all the way to a clean, typical end? I’m not sure what yours look like, but mine are different.
Never mind; later on it looks less good. I can see that this progressive writing and reading will be bumpy.
I’m not removing questions that get answered later. The information is good, and we’re following the same line.
duplicati-b74ebc19b6b294681bf092e11e4969ec1.dblock.zip.aes
wasn’t named before, but now it is. For files like these, one approach is to have kept a running log (or to pull together a set of logs) to see the prior history of that file.
You can also get a clue (up to the default 30-day deletion) from the DB’s RemoteOperation table. Filter by name, but don’t trust it totally for recent activity, since a DB rollback (if one happened) can also roll back the last few log entries.
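For example, against a toy stand-in for the RemoteOperation table (the real table has more columns, and the names here are from memory, so verify them against your own database), the filter-by-name lookup might look like:

```python
import sqlite3

# Hypothetical miniature of Duplicati's RemoteOperation table; check the
# actual schema in your database before relying on these column names.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE RemoteOperation (Timestamp INTEGER, Operation TEXT, Path TEXT)"
)
conn.executemany(
    "INSERT INTO RemoteOperation VALUES (?, ?, ?)",
    [
        (1000, "put", "duplicati-bEXAMPLE.dblock.zip.aes"),
        (2000, "get", "duplicati-bEXAMPLE.dblock.zip.aes"),
        (3000, "put", "duplicati-iEXAMPLE.dindex.zip.aes"),
    ],
)

def file_history(conn, name):
    # Prior history of one remote file, oldest first.
    return conn.execute(
        "SELECT Timestamp, Operation FROM RemoteOperation"
        " WHERE Path = ? ORDER BY Timestamp",
        (name,),
    ).fetchall()
```

The same caveat as above applies: if the database rolled back, the most recent rows of this table may have rolled back with it.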
For files that still exist (like this one), the backend date can give a very vague clue, i.e. when the file was uploaded.
For the files now perceived as missing, I haven’t lined them all up, but I see some on the earlier delete list, so this might be one of those cases where an abnormal end in a compact forgot, due to rollback, some things it had already done. That might also explain the “Extra unknown” file. Was it uploaded earlier in the log you quoted? Never mind; you showed it later from the server view. I’m still not sure what your normal end looks like in either log…
suggests you believe this is an atypical ending. Is the Duplicati log also atypical? Offhand, the sequence looks strange earlier too. I thought a compact would be a set of downloads, then the upload of the new dblock and dindex, then deletion of the files that are now obsolete because their still-in-use blocks were repackaged.
I suspect a rollback at the next backup, caused by a non-commit in the run before (which maybe got killed by SIGHUP?). A commit that was delayed, with backend.WaitForEmpty as the delayer, fits; so if Duplicati jumped out of the compact that way, then this one wasn’t SIGHUP (but it still makes me nervous). Possibly it never did the commit, which fits with it forgetting that it had deleted a load of files.
But if you say it should commit everything before the operations are done (and the backend handling is pretty complex and not my area of expertise), then that also seems like a logic mistake. So which way is it?
Regardless, I think we’re narrowing in on causes, but this does remind me of some of the issues I filed.
You might have noticed they’re sort of stalled, partly due to not having the exact steps to reproduce yet.
There is also a perpetual lack of volunteers (anybody out there?), and a definite need for certain skills.
Not everything needs high skill, but volunteers are scarce in general, even for helping in the forum.