FTP delete not atomic / verified / transactional (?)

If possible, it would be good to log these to see whether they correlate with more problems later – sometimes they do.

If you clicked on my links to issues, you can see that I was starting to try a process-killing stress tester.
I’ve cut down on that due to lack of time (not enough forum volunteers), and difficulty of isolating issues.
As mentioned, developers want reproducible steps. That’s an additional level beyond the initial random kill tests…

Historical and original design questions are hard to answer due to limited or no access to the needed parties; however, my impression is that Duplicati takes extensive steps to be able to recover. Not all were perfect. For example, after an interrupted backup the next backup makes a “synthetic file list” composed of the backup before the interrupted one, plus whatever got done in the interrupted one. There was a coding bug before 2.0.6.x that caused that to fail. One area that is also improving is the Stop button’s ability to drive shutdown. 2.0.5.x could probably do Stop after current file safely. 2.0.6.x thinks that it made Stop now safe…

Hard-kill safety is somewhat a different neighborhood, and HUP safety is a little different again, because systems often follow a HUP with a TERM or worse if the system can’t close down fast. With uploads midway, it’s hard. The best you can do is try to pick up the pieces later, which Duplicati does try. Are there gaps? Quite possibly.

Last I tried looking, I didn’t see signal handling, but if one were to add that, what would one do to handle it? Transaction rollback at the next backup is handled by SQLite (we hope), and a last-second Duplicati commit is potentially even worse than letting SQLite do a rollback. Transactions (or SQL) also aren’t my area of expertise.

Good practice, and probably required some testing. Do you use a kill tester, then clean up any gaps seen? Testing isn’t every developer’s thing. Some are lucky enough to have test groups. Duplicati could use one. There’s some unit testing to try to help find regressions, but I’m not sure how “rough” it is on the software. Ideally the test code or test team would beat up the software pretty hard, just as the real world might do…

Some people are good at breaking software then characterizing well, to assist the developers. I try to help. You actually seem to have systems and logs, so are in a good position, but it does take a lot of work to do.

I think as Duplicati matures, it’s improved from “falls over all the time, all by itself” to getting pretty good at mainline backup (but having difficulty with the rarer compact). Increasingly rare situations get increasingly hard.

How do you debug rare mystery problems? Here, we rely on logs and maybe databases, neither of which are as perfect as they could be. Sometimes logging at ExplicitOnly or special-purpose logging is used.

Having a DB bug report compatible with production SQL uses would help, but sanitization renames tables. One time I wrestled with altering a received DB, but it would be good not to need that. I might open an Issue.

I’m a bit afraid of design bugs that might be hard to fix. For example, there are lots of active threads doing their things simultaneously. Concurrency is frequently confusing to humans (and even to debugging tools).

Some of the concurrent work involves the database, but the commit/rollback model seems to fit a simpler sequential model better. I am not a DB professional, and I suspect none have helped with general designs.

Specifically, I haven’t been able to understand the design well enough to know if a non-commit from thread A can be circumvented by a commit from thread B (because it’s time for thread B, even if not for thread A). This is where Duplicati needs someone with database skills plus at least a bit of C# reading to go and look.

Development work on issues has definitely gotten confused by the who-did-it question (e.g. if profiling logs show a certain SQL operation), but my question is deeper than that. I don’t know if anyone’s dug that far…

Did it seem good all the way to a clean typical end? I’m not sure what yours are like, but mine are different.
Never mind. Later on it looks less good. I can see that this progressive writing and reading will be bumpy.
I’m not removing questions asked that are answered later. Info is good, and we’re following the same line.

duplicati-b74ebc19b6b294681bf092e11e4969ec1.dblock.zip.aes wasn’t named before, but now is. For these, one approach is to have had a running log (or pull together a set) to see the prior history of that file.
You can also get a clue (up to the default 30-day deletion) from the DB RemoteOperation table. Filter by name, and don’t trust it totally for recent things, as DB rollback (if done) can also roll back the last few log entries.
For files that still exist (like this), the backend date can give a very vague clue, i.e. when the file was uploaded.
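
If you want to script the lookup instead of eyeballing it, something like this rough Python sketch against the job database should do it (the RemoteOperation layout here is my guess, so adjust the column names and database path to what your DB actually has):

```python
# Sketch: pull one file's history out of the local job database's RemoteOperation table.
# The path and column names are assumptions; check them against your own database.
import sqlite3

con = sqlite3.connect("path/to/job-database.sqlite")   # hypothetical path to the job DB
name = "duplicati-b74ebc19b6b294681bf092e11e4969ec1.dblock.zip.aes"

for row in con.execute(
        "SELECT Timestamp, Operation, Path, Data FROM RemoteOperation "
        "WHERE Path = ? ORDER BY Timestamp", (name,)):
    print(row)   # one line per put/get/list/delete the DB still remembers for that file
```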

For the now-perceived-as-missing files, I haven’t lined them all up, but I see some on the earlier delete list, so this might be one of those cases where some abnormal end in compact forgot some things it had done due to rollback. That might also explain the “Extra unknown” file. Is that uploaded earlier in log you quoted? Never mind, you showed it later from server view. I still am not sure what your normal end is in either log…

suggests you believe this is an atypical ending. Is the Duplicati log also atypical? Offhand, the sequence looks strange earlier too. I thought the compact would be a set of downloads, then an upload of the new dblock and dindex, then deletions of the files that are now obsolete because of repackaging of their still-in-use blocks.

I suspect rollback at next backup from a non-commit in the one before (that maybe got killed by SIGHUP?)

is a possible suspect for a commit that was delayed, and the delayer being backend.WaitForEmpty fits

so if Duplicati jumped out of the compact that way, then this one wasn’t SIGHUP (but it still makes me nervous); possibly it never did the commit, which fits with it forgetting that it had deleted a load of files.

But if you say it should commit everything before the operations are done (and the backend handling is pretty complex and not my area of expertise), then that also seems like a logic mistake. So which way?

Regardless, I think we’re narrowing in on causes, but this does remind me of some of the issues I filed.
You might have noticed they’re sort of stalled, partly due to not having the exact steps to reproduce yet.

There is also always a lack of volunteers (anybody out there?), and some definite need of certain skills.
Not everything needs high skills, but there’s a lack of volunteers in general, even in helping in the forum.

I’ll check out VSS. Personally, I think the HUP practically only ever happens when FTPS hangs; the time frame I allow for the backup run / compact run (which I run as separate steps) is very generous.

Yes, I do, and just as I replied above, it occurs very rarely, but it’s still in place just to be sure. I can check the VSS aspect.

Yes, I’ve seen that thread before. And I liked it even back then.

You really wouldn’t believe how many times I’ve gone through that discussion during my career.

Agreed. Yet it takes just a very, very tiny mistake to fail completely at that. (I’ve done it; sometimes it was painful to locate the issue.) Some rare threading / timing / whatever issues are almost impossible to repeat. But I’ll get back to that a bit later.

That’s good. I haven’t seen any other polite way to tell the process to quit than using HUP. Is there one?

Agreed, but you don’t want to know how bad the situation is when you’ve got N+1 systems broken down after a sudden data center power loss. Then someone would need to manually start fixing those, from more or less random error states and corrupted data. No ffff-reaking way. It’s already bad enough that there’s downtime, but it’s worse if the downtime becomes 100x longer because there’s nobody (or only very limited competent resources) to fix the messed-up systems.

Don’t worry, any production-quality SQL (or other ACID) database will most likely work great, even with hard kills. That’s why there are rollback / roll-forward logs (with SQLite you can choose), and it’s very, very reliable. When someone claims there’s a database problem, they’re most likely wrong. It’s possible, but it’s very low probability compared to other code being broken. It’s like a CPU model being broken and executing code incorrectly: sure, it’s possible, but still very unlikely.
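
Just to illustrate what I mean by choosing the log type, here’s a minimal sketch (plain Python/SQLite, nothing Duplicati-specific, and the file name is made up):

```python
# Sketch: picking SQLite's durability mode, i.e. the rollback-journal vs. WAL choice.
import sqlite3

con = sqlite3.connect("state.sqlite", isolation_level=None)  # autocommit; we manage transactions
con.execute("PRAGMA journal_mode=WAL;")    # roll-forward log; readers don't block the writer
con.execute("PRAGMA synchronous=FULL;")    # fsync at commit so a hard kill can't lose a commit

con.execute("BEGIN")
con.execute("CREATE TABLE IF NOT EXISTS demo (id INTEGER PRIMARY KEY, note TEXT)")
con.execute("INSERT INTO demo (note) VALUES (?)", ("all of this survives a kill, or none of it",))
con.execute("COMMIT")
```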

My personal saying is that it’s totally impossible and silly to try to kill-test these problems away. It’s just like trying to make programs thread-safe by running them and seeing what kind of problems occur. It’s a losing approach.

The correct way is to design programs correctly and then very, very carefully handle the key points where state transitions happen. On top of that, have good cleanup / recovery code for when task X is still marked as running in the state when the program starts the next time. The recovery code also needs to take care of situations which weren’t actually created by the program itself. As an example, data packet X is ready to be transferred in the outbound queue. Well, what if X isn’t present anymore? In the very early days, I had issues when someone modified something manually, so the program would try to recover from a known state, which was impossible due to the manually made changes. Later I changed all my programs so that they very clearly log the state violation, but still resume. This makes recovery much more robust, in case someone decides to fiddle with something they shouldn’t touch.
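
Something like this rough sketch (all names made up, just to show the shape of “log the violation but resume”):

```python
# Sketch: recovery that tolerates state it did not create, instead of refusing to start.
import logging, os

log = logging.getLogger("recovery")

def recover_outbound_queue(queue_rows, queue_dir):
    """queue_rows: (id, filename) pairs the state DB says are still waiting to be sent."""
    for row_id, filename in queue_rows:
        path = os.path.join(queue_dir, filename)
        if not os.path.exists(path):
            # State violation: the DB says the packet exists, the disk says otherwise.
            # Log it loudly, skip the entry, and keep going.
            log.error("state violation: queued packet %s (id %s) is missing; skipping",
                      filename, row_id)
            continue
        resend(path)          # hypothetical retry of the interrupted transfer

def resend(path):
    log.info("re-sending %s", path)
```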

Yet sometimes it’s hard to define what should be fatal and always abort, versus what to try to recover from. As an example, if someone from the devops team decides to delete some table instead of clearing it. Yes, I recover from that automatically nowadays, but I didn’t always.

I always remember how I messed up OpenBazaar at one point. I cleared the configuration data directory, I didn’t delete it. And then the program saw that the directory existed, tried to read the configuration, and of course failed. Duh!

When you mentioned compacting reliability, downloads, uploads and deletes, that means the state recovery error I quoted from the logs did occur during data compaction. So that would be a good point to statically verify the code: that all the operations are properly journaled and recovered in case the task doesn’t get completed.

That’s always very painful. I do everything I can to avoid such issues. But when that happens, yes, extreme logging around the expected problem area is used. Usually in a temporary runtime log, which is deleted if everything passes correctly. And when things get very, very painful and technically impossible to repeat in any sane way, then I write secondary runtime validator code. Sometimes even double-layer validation code. Of course at first that sounds and feels stupid (what do you mean, why do you check that 1 + 2 = 3?), but after some time you start to realize that it’s necessary.

That’s basically integration testing and monitoring code running in real time. It does lots of cross-checks on data, inputs, outputs, generic logical data integrity validation and so on. Of course when something fails, then there’s the pain of deciding which of the outputs is actually correct; the only thing you know is that there’s a mismatch, but what’s causing it and why, and how it should actually work, is... ahh, juicy. But the bonus is that the operations which might be causing problems get clearly flagged for analysis. Without that it’s totally hopeless to see that we’ve processed 4 billion transactions and the inputs and outputs don’t quite match. What’s wrong? Well, something, somewhere.

But that’s very, very important. The worst part of what happens with systems is that someone asks you to change one thing quickly. But because the whole thing is a huge jigsaw puzzle, you really can’t change one thing without affecting everything else.

The joke is that if someone says the specification is good, you give the specification to three teams. They each develop a program according to the specification (using different tools; for example, one program written in Go, one in Java and one in Python). Then the Go code runs on ARM Linux, the Java code runs on an Intel Windows Server, and the Python code runs on ARM with a kernel a year older, and all the inputs and outputs are compared. If the outputs from all three implementations match, then it’s good; if not, it’s broken. That gives full-stack validation, up from the hardware, compiler, runtime, etc. level. Expensive, hard? Sure. Results guaranteed? Not 100%. But it’s still better than nothing.

True, all the fun race conditions and so on. With this hardware it works, but with that hardware it doesn’t, because the code happens to execute a bit differently. Been there, done that.

With SQLite you can’t have write concurrency; if you have long write operations, you’ll end up with lock contention instantly. I’ve optimized several projects for that. The key is to read the data, process it, then quickly check that the state is still as expected, update the database and commit. That guarantees global progress, even if some tasks end up retrying repeatedly. The drawback is that if the same data is constantly updated by both faster and slower operations, the slower ones will persistently fail. But that’s clearly bad design then.
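
A minimal sketch of that read-process-verify-commit pattern (table and column names are made up):

```python
# Sketch: optimistic concurrency on SQLite with short write transactions.
# Assumed schema: CREATE TABLE items (id INTEGER PRIMARY KEY, payload TEXT, version INTEGER);
import sqlite3, time

def process_item(db_path, item_id, work):
    con = sqlite3.connect(db_path, isolation_level=None)
    while True:
        # 1. Read outside of any write lock.
        payload, version = con.execute(
            "SELECT payload, version FROM items WHERE id=?", (item_id,)).fetchone()
        result = work(payload)           # 2. Do the slow processing with no lock held.

        # 3. Short write transaction: re-check the version, then update and commit.
        con.execute("BEGIN IMMEDIATE")   # take the write lock only now
        cur = con.execute(
            "UPDATE items SET payload=?, version=version+1 WHERE id=? AND version=?",
            (result, item_id, version))
        if cur.rowcount == 1:
            con.execute("COMMIT")
            return                        # our update won
        con.execute("ROLLBACK")           # somebody else changed it; retry from the read
        time.sleep(0.1)
```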

Of course with SQL Server and others you can make exactly the same mistakes; lock contention is a very classic problem.

Actually making that mistake is possible, because commit / rollback is connection-wide; depending on how it’s implemented in Duplicati, it’s totally possible to fail with that. Calling rollback will roll back stuff you didn’t think it would roll back. There are no cursor-specific transactions. The simplest remedy with low database write loads is to start every updating operation with an exclusive lock. Sure, very inefficient, but also very safe. Comparable to the Python GIL.
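
Roughly like this (a sketch, not Duplicati code):

```python
# Sketch: the "start every updating operation with an exclusive lock" remedy.
# BEGIN EXCLUSIVE serializes writers, so one connection's rollback can't surprise another writer.
import sqlite3

def exclusive_update(db_path, sql, params=()):
    con = sqlite3.connect(db_path, isolation_level=None, timeout=30)
    con.execute("BEGIN EXCLUSIVE")    # whole-database write lock; waits up to `timeout` seconds
    try:
        con.execute(sql, params)
        con.execute("COMMIT")
    except Exception:
        con.execute("ROLLBACK")       # only this transaction's work is rolled back
        raise
    finally:
        con.close()
```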

A long time ago, I wrote several “worst case stress testers” for different database backends. No, I didn’t expect the database to fail, but I wanted to know all the ways my code could fail with a specific database backend. I found quite a few coding patterns for code which looks good, but doesn’t actually work atomically.

Yes, agreed. I just took the long philosophical path to the topic. Haha.

That’s why I always have the extreme debug logging in a separate file / database with explicit flush / sync, so log loss shouldn’t happen. Of course that makes performance suck in general terms. I’ve also seen some programmers write logs only when things go wrong. Which is of course ridiculous: when things go wrong enough, there won’t be any logs. Classic trap.
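
The habit is basically this (a tiny sketch, names made up):

```python
# Sketch: every log record is forced to disk before the program continues,
# so a hard kill can't lose the most recent entries.
import os

class DurableLog:
    def __init__(self, path):
        self._f = open(path, "a", encoding="utf-8")

    def write(self, message):
        self._f.write(message + "\n")
        self._f.flush()                  # push Python's buffer to the OS
        os.fsync(self._f.fileno())       # push the OS cache to the disk

    def close(self):
        self._f.close()

# Usage: log = DurableLog("compact-debug.log"); log.write("intent: delete volume X")
```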

It seems that the program crashed with that empty path. I’ve truncated the delete list, but because operations are sequential, it shows everything that’s logically required. It did operations on remote storage. After the crash, recovery didn’t kick in correctly. Either the operation wasn’t logged as completed, and/or the recovery code didn’t work correctly / didn’t activate.

Reminds me of a situation where a devops guy used to delete the database WAL because it was huge and the roll-forward would take so long. But you bleeping can’t do that. It only makes the situation much worse than just waiting for recovery to happen; you could even end up with torn pages, and at the very least transaction safety is completely gone, as well as data integrity.

I believe that in this case the program crashed with that null path error, and on top of that the atomicity of the operation didn’t work correctly. So my guess is that we’ve got two overlapping errors.

If the atomicity is working correctly, it doesn’t matter how the termination happened. But in this case it wasn’t HUP, because it was way too quick.

If so, it’s a clear logic violation, which just needs to be fixed.

Partially agreed. But if the logic works correctly, it should first write what it’s going to do and commit that, and then commit again when it’s done. If the operation was killed, aborted or whatever, it should still continue with the correct recovery procedures when the program starts the next time. From a logical perspective this is incredibly simple, yet I’ve discussed these issues with developers for decades. The worst part is that sometimes they’ll “fix something” by making it just “less likely to happen”, which of course isn’t a fix at all.

That’s right, from the status list you nicely linked earlier, there should be statuses like “deleting” and “uploading”. And those should be committed. When recovery kicks in, it should see whether the deletes are actually done and whether the uploads are completed. And if possible, either roll back and retry the whole operation or, if very well written, even resume the operation where it was left off.
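
A rough sketch of what I mean, with a made-up table and statuses (this is not Duplicati’s actual schema): commit the intent, do the work, commit the completion, and let a recovery pass finish anything left in the intent state.

```python
# Assumed schema: CREATE TABLE remote_state (name TEXT PRIMARY KEY, status TEXT);
# Connections are opened with sqlite3.connect(path, isolation_level=None) so BEGIN/COMMIT are explicit.
import sqlite3

def mark(con, name, status):
    con.execute("BEGIN IMMEDIATE")
    con.execute("INSERT OR REPLACE INTO remote_state (name, status) VALUES (?, ?)", (name, status))
    con.execute("COMMIT")                        # durable before / after the remote action

def delete_volume(con, backend, name):           # backend is a hypothetical remote client
    mark(con, name, "Deleting")                  # intent committed first
    backend.delete(name)                         # the actual remote delete
    mark(con, name, "Deleted")                   # completion committed afterwards

def recover(con, backend):
    # On startup, anything still "Deleting" may or may not have happened remotely.
    for (name,) in con.execute("SELECT name FROM remote_state WHERE status = 'Deleting'").fetchall():
        if backend.exists(name):                 # delete never happened: redo it
            backend.delete(name)
        mark(con, name, "Deleted")               # either way, the end state is consistent
```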

I’ve dealt with this kind of situation before. Let’s say the data block is huge and the connection is bad. If you’re uploading a 16 gigabyte file, and the connection cuts off randomly every 2 - 15 minutes on a bad, slow satellite link, you’ll never finish the upload if you restart the operation from the beginning. In this case, when it says it’s uploading a file, it’s very important to check whether the operation can be resumed.

Another issue is some RESTful APIs, which lack something obvious like a transaction ID. If I’m supposed to send something exactly once, but the data doesn’t contain any unique identifier, how is that going to work out? I can tell you: it’s never going to work. Yet I’ve seen plenty of solutions like this. Something was sent, and no reply ever came back. Should we retry, risking delivering twice, or shouldn’t we retry, risking not delivering at all? Boom! The correct way to fix this is a transaction ID, with the remote able to reply either “we already got that” or “thanks, now we got it”.
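
In sketch form (a hypothetical toy, no real API involved): the sender retries freely because the receiver can tell a duplicate from a new request.

```python
import uuid

class Receiver:
    def __init__(self):
        self.seen = {}                    # transaction_id -> payload (a DB table in practice)

    def handle(self, transaction_id, payload):
        if transaction_id in self.seen:
            return "already got that"     # duplicate delivery is harmless
        self.seen[transaction_id] = payload
        return "thanks, now we got it"

def send_exactly_once(receiver, payload, max_retries=5):
    txid = str(uuid.uuid4())              # generated once, reused for every retry
    for _ in range(max_retries):
        try:
            return receiver.handle(txid, payload)   # stands in for the real network call
        except ConnectionError:
            continue                      # no reply: retry with the SAME transaction ID
    raise RuntimeError("gave up after retries")
```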

Only when the operation is completed can it be flagged as completed, and as mentioned before, if there’s a need for extra safety and efficiency, the completed task can even be turned into verified-as-completed and then removed. I.e. after the delete was reported OK, it’s also confirmed that the file isn’t in the file list anymore.

Based on this, I would personally take steps to review the code and verify how the file transfer / recovery state is updated and committed. It’s very, very likely it’s flawed; it’s almost obvious.

And to be clear, I haven’t been deleting the database .journal or -wal files just to troll you guys.

Edit: added the statement about problematic cases where correct operation is impossible to guarantee, like badly designed RESTful APIs.

But because we’ve wasted (?) so much time already, here’s debug data from the RemoteOperation table…

Well this just confirmed what I said earlier… No surprises here.

Last event before the “magic list” appearance:
(21397, 1954, 1628723865, 'put', 'duplicati-i9cb931576b6c4b0690e86e65005260b5.dindex.zip.aes', '{"Size":225245,"Hash":"zn9/QI2uoeBx+yy1kj3EQ402gj9L9sPb7LWzLYUOmUU="}')

What’s missing here?!?

The file then magically appears in list operation:
(21398, 1955, 1628809239, 'list', ...

And the delete operation is here too:
(21412, 1983, 1629780078, 'delete', 'duplicati-b74ebc19b6b294681bf092e11e4969ec1.dblock.zip.aes', None)

So? Where’s the intention to upload and / or the actual upload? Depending of course on whether the completion is a separate entry or whether it just updates the intention to upload… But the end result is still the same… Boom, 100% confirmed to be broken…

Edit, added last operation with timestamp before the list op.

Due to the seeming lack of a signal handler, HUP probably just does the default action, which is to terminate the process.
I just tested this using strace to watch signal calls on SIGHUP, and it looked like pretty much an instant kill.
The only possible nicer way I know of is a third-party tool that drives the web UI from a CLI client in Python.
Pectojin / duplicati-client has an abort command (abort a task) that might do it, but I don’t personally use the client.

I’m not worried about the database itself (though it sometimes gets low-level corruption and can’t reopen), but more about the usage. A commit should be done at the “right” time per design, not on emergency stops.

If you mean from the start, there is no such luxury now. If you mean by learning the design well, it takes time…

Although the entire design of the states could probably use some review to make sure it seems correct, and to understand exactly how to use them (in terms of what gets changed, and in what order), possibly having the DB record Deleting (and it’s not really recorded until a Commit) would mark the intention, so a sudden death before the destination delete was done could be finished at the next backup. A larger question involves all the files in the compact cycle of downloads/upload/deletes, and much of it may be asynchronous, which means a commit here while action is happening elsewhere might affect DB work done elsewhere; however, not doing a commit leaves a hole in case of sudden death. So “just needs to be fixed” may need quite a bit of study, yet I think it’s a good lead that we’ve been building here (and thanks for your reporting).

Beyond file transfers, there’s a lot of book-keeping done in the compact, because all of the blocks that had been in the old dblock files are now in the new one. There’s a chance that the new dblock will fail to upload, which means it gets retried under a different dblock name; but that also means a new dindex file is made, because the dindex file has the name of its associated dblock file inside it. There’s some complexity here; however, somebody (preferably a developer) will have to dig into the current design and see if flaws are visible. The problem with that gets back to the lack of volunteer developers. There’s not a lot of time for deep digs today…

Thanks for the reference; if it’s possible, I could make that call instead of HUP.

I write and run my own software anyway. There’s no problem making API calls to close Duplicati. But I also assume the Duplicati CLI version doesn’t run an HTTP server. Yet I haven’t checked that. Ahh, annoying pointless speculation; I’ll check it right now.

Checked it; it seems the process, at least by default, doesn’t listen on any port or provide any HTTP API while running. Just as well there could be a command like Duplicati.CommandLine.exe stop, yet usually I don’t care how things are implemented, as long as things are done right.

Strange; that really shouldn’t happen. Sounds strange and unlikely. So far I haven’t had any problems with SQLite3, even though the processes get terminated a lot, daily. Because it’s hard-kill safe, there’s no need to carefully avoid it. Are you sure it’s a corrupted SQLite3 database, and not logical high-level corruption which prevents a program like Duplicati from starting up and reading its own data? Of course there are ways to sabotage that, like removing file system sync barriers, disabling the journal, allowing extremely long commit times of minutes or tens of minutes, and enabling data writeback. Then it’s good luck with a hard power-off, but who would do that in production?

Exactly. Or even if the delete is completed but the response isn’t received (or whatever causes it not to finish as expected).

A larger question involves all the files in the compact cycle of downloads/upload/deletes, and much of it may be asynchronous, which means a commit here while action is happening elsewhere might affect DB work done elsewhere; however, not doing a commit leaves a hole in case of sudden death. So “just needs to be fixed” may need quite a bit of study, yet I think it’s a good lead that we’ve been building here (and thanks for your reporting).

Exactly. Depending on how small the segments are in which the compact is committed to the local database, there can be simpler or more complex ways to recover. Either status is committed more often locally, so less analysis is required when recovering, or it’s committed more seldom, and then more advanced logic is required to recover. This is where efficiency comes in: sometimes I write absurdly simple, often-committing code to keep the recovery simple. But at times that ruins performance so badly that it’s better to do just simple state logging, with more complex analysis in case there’s a need to recover from an interruption.

Also at this point, it’s important to acknowledge whether the files to be uploaded will be written to “temp” or to “queue”. What’s the difference? Temp can be cleared at any time without violating integrity, but queue can’t.

I would personally implement this in a way where: first, N blocks are downloaded to temp, then those are compacted into new blocks, which are placed in the outbound queue and synced to disk. This is the rollback section, which can be reverted without any problems at any time. Then I would mark the intent to upload the new blocks and delete the old blocks (in that specific order). This is the roll-forward section, which will be retried on recovery and can’t be skipped. When this step completes, the uploads and deletes are marked as completed. And maybe at this stage it would make sense to verify the deletes and uploads, so we can safely update the local database with the new state.
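
As a pseudocode-level Python sketch (db, backend, queue and repack are all hypothetical stand-ins, just to show the two sections and the ordering):

```python
def compact(db, backend, queue, repack, old_volumes):
    # --- rollback section: may be discarded at any time without harming integrity ---
    temp_files = [backend.download_to_temp(v) for v in old_volumes]
    new_volume = repack(temp_files)          # repackage the still-in-use blocks
    queue.add_and_fsync(new_volume)          # the outbound queue survives a kill; temp does not

    # --- roll-forward section: once the intent is committed, recovery must finish it ---
    db.commit_intent(upload=[new_volume], delete=old_volumes)   # in that specific order

    backend.upload(new_volume)
    for v in old_volumes:
        backend.delete(v)

    db.commit_completed(upload=[new_volume], delete=old_volumes)
    # Optionally verify the uploads and deletes against a backend listing here,
    # and only then update the rest of the local database to the new state.
```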

It’s not that hard when you just make it a habit to build things in very clear, isolated blocks which can be reverted or retried.

With the steps above, even in the case where someone clears the upload queue, there’s no data loss. Yet it means there’s a need to check whether the uploads and the deletes were completed. If the uploads weren’t completed and the upload files are lost, then we revert back, failing the roll-forward section. If the uploads can be resumed or the deletes redone, then it’s all good and we continue forward.

A simple solution would of course be to always roll back unless deletes of the old volumes have already taken place (a simple solution instead of an efficient one). Then the compaction is just retried.

Sometimes it’s necessary to create a separate table for “pending updates”, just like there’s the queue directory and so on. Way too often developers modify the actual data directly, causing problems if commit is called (or something is rolled back). Also, keeping transactions short and clean helps tremendously. It also reduces lock contention.
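
A minimal sketch of the idea, with a made-up schema: staged changes live in their own table and only get applied to the real data in one short transaction.

```python
import sqlite3

con = sqlite3.connect("state.sqlite", isolation_level=None)
con.executescript("""
CREATE TABLE IF NOT EXISTS volumes         (name TEXT PRIMARY KEY, state TEXT);
CREATE TABLE IF NOT EXISTS pending_updates (name TEXT PRIMARY KEY, new_state TEXT);
""")

def stage(name, new_state):
    # Cheap and restartable: staging rows never touch the real data.
    con.execute("INSERT OR REPLACE INTO pending_updates VALUES (?, ?)", (name, new_state))

def apply_pending():
    # One short transaction applies everything, so a rollback can't leave half-edited data.
    con.execute("BEGIN IMMEDIATE")
    con.execute("""UPDATE volumes SET state =
                     (SELECT new_state FROM pending_updates
                      WHERE pending_updates.name = volumes.name)
                   WHERE name IN (SELECT name FROM pending_updates)""")
    con.execute("DELETE FROM pending_updates")
    con.execute("COMMIT")
```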

Search for “database disk image is malformed” at duplicati.com and github.com/duplicati/duplicati/issues
The problem with having 4 million backups run per month is that unlikely things arise every now and then.
How To Corrupt An SQLite Database File is SQLite’s official position, but diagnosing user issues is tough.

One time when I was buying a hard drive, I noticed an SMR drive’s data sheet not guaranteeing integrity if power was cut. I was avoiding SMR for other reasons, but I definitely don’t want any easy drive corruption.

This is more common. The “database disk image is malformed” is a low-level corruption of the database.

Thanks for the thought. I don’t know how that compares to the current design. Feel free to look if you wish.

This is one of my worries. Duplicati handles tables directly, and tries to control commit time, but a commit may come from a different thread. This is all my speculation though. I haven’t read through all of the code.

You can see the backup startup here, and Channel Pipeline helps understand the parts and the data flows.

Nice quote; I was going to refer to that same post earlier, when you were afraid of ACID database transaction malfunction / commit loss.

In my case, luckily, 99% of the backups are run on enterprise hardware with disk checksums, triple-disk data redundancy (using parity), ECC RAM, etc. On top of that, the servers run databases and do encryption and compression all the time. So if the hardware were broken, the situation would be totally… Yeah, Duplicati is just a very, very small part of it.

From my logs it’s pretty clear that that’s most likely the problem. State updates and commits / recovery aren’t working as they’re supposed to.

Getting back to this once again, now with 100% confirmed fresh data. This is the content of the manifest files which were lost… It’s totally… Yeah, totally fked…

{"Version":2,"Created":"20211028T053227Z","Encoding":"utf8","Blocksize":102400,"BlockHash":"SHA256","FileHash":"SHA256","AppVersion":"2.0.5.114"}

This just confirms that basic things are unfortunately fatally flawed. In other words: normal, broken, unreliable software. These kinds of obvious logic flaws lead to data loss and corruption, and unrecoverable (or hard-to-recover) situations.

You lost some manifest files? From which type of files? How do you show what was lost?

Here’s yours:

{"Version":2,"Created":"20211028T053227Z","Encoding":"utf8","Blocksize":102400,"BlockHash":"SHA256","FileHash":"SHA256","AppVersion":"2.0.5.114"}

Here’s mine:

{"Version":2,"Created":"20211016T182135Z","Encoding":"utf8","Blocksize":102400,"BlockHash":"SHA256","FileHash":"SHA256","AppVersion":"2.0.6.100"}

I’m no manifest file wizard, but yours look OK. Can you clarify why you cite manifest files?

Manifest files were “lost” due to the compaction process being aborted. Duplicati itself deleted them. Just what we’ve been covering in the several threads. Repair recreates them and then the next compaction deletes them again.

Yet that “issue” stops backups from working until repair is run. I’ve been considering adding auto-repair to my script, but I’d generally say that’s an extremely bad way of doing things and might leave some other, more serious issues uninvestigated.

That’s exactly why there should, and must, be a clear logical recovery process which does clean recovery from error situations. Yet I’ve seen countless programs violate this basic rule, like Duplicati.

But if we look at it the other way around, normal backup operations seem to be working very well now. It’s just the compaction which is slightly broken and, afaik, should get fixed. I think we’ve spent 10x as much time talking about this as it would take to fix it for someone who knows the source and operation of the program. But this seems to be a pretty common issue with many projects.

Just saying that compaction isn’t currently safe, and shouldn’t be used, if anyone cares about data integrity.

Let me see if I can puzzle out what “lost” means, or please clarify rather than throwing it in double quotes.

In this topic, manifest files suddenly jumped in yesterday. Manifest files are never standalone files deleted from destination the way a dblock, dindex, or dlist file would be. They’re a file included in all such file types. The other topic where you mentioned manifest also showed one here without any explanation about “lost”.

Sorry, I have no idea why you’re pointing to manifest files. Do you see a manifest file problem or a problem with some other type of file? Got error messages? Got logs? Got the time history of what happened when?

Repair should never recreate a manifest file without a containing file, and compact should never delete one. These are internal files. Have you found a problem at the file internals level, or is it a dblock/dindex problem which seems more likely based on the known issues? You can view a log file (live log is fine) for file names.

Repair also has a limited ability to recreate destination files. It can do dindex and dlist, but dblock is tougher.

If you kept a better log file, look in it. See if you see a complaint with “Missing file” in its text. If there is some history from the previous backup, post that to show how it ended. Issue might be found on the next backup.

If you see a manifest file named in your log, please post it. I don’t think you will though. They’re internal files.

Have you been leaving the problem system alone, or have you already done things to it? A look by you (or a bug report) would have been nice. If the system has been left alone, please copy the DB or get a DB bug report snapshot. Saved log files are a more reliable view, but go well with a DB snapshot to look for inconsistencies such as an actual dindex file delete that the DB then forgets about via rollback. That open issue was referenced before.

Confirmed with the latest canary version. 100% obvious bug:

– Server side log –
File was first added

2022-02-24T23:24:57.136Z >> [FTP Session 7226] SIZE /duplicati-id32b2f754a1a4aa2bc1019c3a7962731.dindex.zip.aes
2022-02-24T23:24:57.214Z >> [FTP Session 7226] STOR duplicati-id32b2f754a1a4aa2bc1019c3a7962731.dindex.zip.aes
2022-02-24T23:24:57.870Z >> [FTP Session 7226] SIZE duplicati-id32b2f754a1a4aa2bc1019c3a7962731.dindex.zip.aes

Then compaction deleted the file
2022-03-14T00:43:45.826Z >> [FTP Session 8959] DELE duplicati-id32b2f754a1a4aa2bc1019c3a7962731.dindex.zip.aes
Then Duplicati hung due to “missing files”, which is obviously a bug…
Then repair fixed the issue by uploading the index file back…

2022-03-25T09:42:34.825Z >> [FTP Session 8295] SIZE /duplicati-id32b2f754a1a4aa2bc1019c3a7962731.dindex.zip.aes
2022-03-25T09:42:34.965Z >> [FTP Session 8295] STOR duplicati-id32b2f754a1a4aa2bc1019c3a7962731.dindex.zip.aes
2022-03-25T09:42:35.028Z >> [FTP Session 8295] SIZE duplicati-id32b2f754a1a4aa2bc1019c3a7962731.dindex.zip.aes

Very clearly broken by bad implementation, not a back-end durability issue.

Edit continued:

Checked more logs; there are actually similar problems with multiple backup sets as well. Which is, well, just more proof that it’s logically and systemically broken.

Similar to below, but does the timing fit, meaning was the “missing files” just after compact?
You can also check job logs, system logs, maybe FTP logs, to try to look for an interruption.
Error during compact forgot a dindex file deletion, getting Missing file error next run. #4129

I know there are interruptions; there’s no need to look especially for those. My point is still that the journal & journal recovery should take care of the “problems” the interruption created when things resume as usual.

And as we concluded in the other thread, this problem is not FTP-related; it seems to be a generic flaw in the compact & journal process. FTP (AFTP) itself seems to be working exactly as intended and designed.

Anyway, is there any soft way to ask Duplicati to stop? As far as I remember, it doesn’t care about a friendly HUP and requires a hard kill… Is there any other way to signal “please stop as soon as it’s safe”?

I’ve also noticed two interesting aspects lately. For some reason Duplicati seems to leave one idle FTPS session open during the backup operations. Yet it’s closed when Duplicati finishes, so it’s certainly not a problem, just a bit strange. I haven’t figured out why that happens. Maybe it’s used only for the main directory content checks, because it opens an FTPS session and then uploads each file in a separate session without closing the first one.

I also saw this in the logs, which is quite interesting. Does the FTPS client fail to recognize that the session is already active and logged in?

FluentFTP.FtpAuthenticationException: Already logged in. QUIT first.
   at Duplicati.Library.Main.AsyncDownloader.AsyncDownloaderEnumerator.AsyncDownloadedFile.get_TempFile()
   at Duplicati.Library.Main.Operation.CompactHandler.DoCompact(LocalDeleteDatabase db, Boolean hasVerifiedBackend, IDbTransaction& transaction, BackendManager sharedBackend)
   at Duplicati.Library.Main.Operation.CompactHandler.Run()
   at Duplicati.Library.Main.Controller.RunAction[T](T result, String[]& paths, IFilter& filter, Action`1 method)
   at Duplicati.Library.Main.Controller.Compact()
   at Duplicati.CommandLine.Commands.Compact(TextWriter outwriter, Action`1 setup, List`1 args, Dictionary`2 options, IFilter filter)
   at Duplicati.CommandLine.Program.ParseCommandLine(TextWriter outwriter, Action`1 setup, Boolean& verboseErrors, String[] args)
   at Duplicati.CommandLine.Program.RunCommandLine(TextWriter outwriter, TextWriter errwriter, Action`1 setup, String[] args)

In the GUI, you use the Stop button. Safest is “Stop after current file”. Newer way is so-called “Stop now” which doesn’t really mean right-now because in either case the files that are in progress finish uploading.

If you’re on the command-line client (and I think you are), then I know of no equivalent way to ask nicely, although it’s conceivable that the third-party Duplicati Client abort command is better than a process kill.

A process kill can be done using different signals: sending Ctrl-C, SIGTERM, SIGKILL and so on. But again, a correctly designed and implemented program should also deal with a hard kill without problems.
taskkill can also be used with or without /force, and without the force it just tells the program to quit as soon as it suits the program. As an example with compact, I would assume that means either almost immediately, by rolling back the latest changes (a few seconds maybe, depending on the remote), or a bit slower, by finishing the uploads and running the deletes but stopping before starting the next download cycle. Yet this can take quite a while, anywhere from seconds to several minutes.

I agree, but let me quote myself from yesterday…

The posts are repeating and going in circles. One major block of dialog was from August 2021 above.
I’m not going to post the whole enormous conversation again, but I’ll make a couple of points instead.

If anyone wants a random kill tester to try, you’re welcome to my perl script which uses taskkill /f.
This is part of the “test” work where I keep asking for volunteers (also developers, writers, and more).
The theory is that testing is more easily done by less experienced people, who learn as they do tests.
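
If perl isn’t your thing, the general shape is roughly this (a minimal Python sketch with hypothetical job arguments, not my actual script):

```python
# Sketch of a random kill tester: start a job, kill it hard at a random moment, repeat,
# then see whether the next run recovers cleanly or reports "Missing file" style damage.
import random
import subprocess
import time

JOB_CMD = ["Duplicati.CommandLine.exe", "backup", "..."]   # placeholder arguments; use your own job
PROCESS_NAME = "Duplicati.CommandLine.exe"

for run in range(100):
    proc = subprocess.Popen(JOB_CMD)
    time.sleep(random.uniform(5, 300))                      # let the job run for a random while
    subprocess.call(["taskkill", "/f", "/im", PROCESS_NAME])  # hard kill, like a crash
    proc.wait()
```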

The goal is to run with enough logging and debugging so as to figure out a step-by-step for developer investigation (file an issue on it, with steps). Then we need more developers capable of coding the fix.
This has been a gap lately. My own well-described interrupted compact issue has not been fixed in two years.

Duplicati is a community effort, reliant on volunteers for all of its improvements, and its very existence.
Anybody should be capable of doing something positive, even if it’s occasional forum help. Please do.


Today I saw three similar failures; I mean that Duplicati first deleted files and then says that THE FILES are missing. Well, what did you expect would follow from deleting the files? Very systemically, logically and obviously broken.

And this has absolutely nothing to do with backend durability. Interestingly, all the files reported missing are dindex files.

Backup started at 30.3.2022 8.22.45
Checking remote backup ...
  Listing remote folder ...
Missing file: duplicati-**i6b91a0fac8524e8cbff58e80c2a9df64**.dindex.zip.aes
Missing file: duplicati-ib8905cf1660142558fc9c4b66aa08207.dindex.zip.aes
Missing file: duplicati-ia056ba6143b542d58e4a1df96011694d.dindex.zip.aes
Missing file: duplicati-i632c7a10e43d424a826cd6611fd51ed8.dindex.zip.aes
Missing file: duplicati-i73c682c718be4ef4aca4d6ddb669f4f4.dindex.zip.aes
Missing file: duplicati-ie5c2ff0c09684da6acb9dc98e7d6b72c.dindex.zip.aes
Missing file: duplicati-if2b8627f93504d25a719927c8605088b.dindex.zip.aes
Found 7 files that are missing from the remote storage, please run repair

And before that, Duplicati crashed with this error message:

FluentFTP.FtpAuthenticationException: Already logged in. QUIT first.
   at Duplicati.Library.Main.AsyncDownloader.AsyncDownloaderEnumerator.AsyncDownloadedFile.get_TempFile()
   at Duplicati.Library.Main.Operation.CompactHandler.DoCompact(LocalDeleteDatabase db, Boolean hasVerifiedBackend, IDbTransaction& transaction, BackendManager sharedBackend)
   at Duplicati.Library.Main.Operation.CompactHandler.Run()
   at Duplicati.Library.Main.Controller.RunAction[T](T result, String[]& paths, IFilter& filter, Action`1 method)
   at Duplicati.Library.Main.Controller.Compact()
   at Duplicati.CommandLine.Commands.Compact(TextWriter outwriter, Action`1 setup, List`1 args, Dictionary`2 options, IFilter filter)
   at Duplicati.CommandLine.Program.ParseCommandLine(TextWriter outwriter, Action`1 setup, Boolean& verboseErrors, String[] args)
   at Duplicati.CommandLine.Program.RunCommandLine(TextWriter outwriter, TextWriter errwriter, Action`1 setup, String[] args)

A crash isn’t technically a hard kill, but recovery can follow exactly the same protocol.

And before the crash, Duplicati said this, which almost makes me laugh…

  Deleting file duplicati-b58b6e364501b4225b449007843aa24dc.dblock.zip.aes (32,00 MT) ...
  Deleting file duplicati-**i6b91a0fac8524e8cbff58e80c2a9df64**.dindex.zip.aes (1,75 MT) ...
  Deleting file duplicati-b086a7738ee5644e8a0f76db19be4f409.dblock.zip.aes (24,33 MT) ...
  Deleting file duplicati-ib8905cf1660142558fc9c4b66aa08207.dindex.zip.aes (2,47 MT) ...
  Deleting file duplicati-bb8eedf9d1aac427e8fd33027c491189e.dblock.zip.aes (3,28 MT) ...
  Deleting file duplicati-ia056ba6143b542d58e4a1df96011694d.dindex.zip.aes (352,50 KT) ...
  Deleting file duplicati-b8c9a6ccf684449afbfb8525f3cb72632.dblock.zip.aes (2,51 MT) ...
  Deleting file duplicati-i632c7a10e43d424a826cd6611fd51ed8.dindex.zip.aes (347,67 KT) ...
  Deleting file duplicati-b35bdde309cd14959b6fc7396de9f4739.dblock.zip.aes (3,04 MT) ...
  Deleting file duplicati-i73c682c718be4ef4aca4d6ddb669f4f4.dindex.zip.aes (338,78 KT) ...
  Deleting file duplicati-b7c0b93b414f144a2b169b47cba35207c.dblock.zip.aes (2,84 MT) ...
  Deleting file duplicati-ie5c2ff0c09684da6acb9dc98e7d6b72c.dindex.zip.aes (350,29 KT) ...
  Deleting file duplicati-bb6c5652dd17641f8bd5033802560f1e8.dblock.zip.aes (3,30 MT) ...
  Deleting file duplicati-if2b8627f93504d25a719927c8605088b.dindex.zip.aes (317,39 KT) ...

Which actually proves all of my previous points right. The journal log isn’t working or isn’t being handled correctly, and/or the recovery fails. On top of that, there are other code smells which cause secondary problems, like why the database rebuild fails… All are now 100% confirmed and proven issues.

OK, nothing new here. Classic problems; I’ve seen similar failures with other projects and programs (including my own) countless times. Usually these problems go like: ffffff sssss grauh, some rage, then frantic fixing, a bit of extra logging (of the transaction handling), adding a few test cases, and then the problem is gone forever.

** The stars around the filenames are just to emphasize the filenames in the log entries; they aren’t really part of the names.