FTP delete not atomic / verified / transactional (?)

I think I’m seeing a pattern. I’m not sure if it’s a single flaw causing this or multiple overlapping issues, but with a fairly recent Duplicati and an FTPS server (Filezilla) I’m seeing a pattern that is causing most of the trouble with my setup.

Backup runs as normal, but when compaction is happening or old files are being deleted, the delete operation clearly isn’t atomic / transactional. At least that is my long-standing suspicion based on the logs. I haven’t checked the source, but I’m pretty sure about what I’m seeing.

  1. Duplicati is running as usual …
  2. Duplicati is downloading, reprocessing, uploading and deleting files during compact, etc.
  3. Something, like updates, admin work, or a network issue (whatever it is technically), causes the FTP server to be unreachable.
  4. Duplicati tries to delete file XXXYXXX, N times, and then gives up / gets terminated by max runtime Window.
  5. After that, when Duplicati runs, it halts on an “extra file”. Well, it’s not an extra file, it’s a file you just didn’t delete yet.

Logs:
Deleting file duplicati-20210804T210014Z.dlist.zip.aes (28,00 KT) …

Then for several days
Extra unknown file: duplicati-20210804T210014Z.dlist.zip.aes
Found 1 remote files that are not recorded in local storage, please run repair

And finally with repair …
Deleting file duplicati-20210804T210014Z.dlist.zip.aes (28,00 KT) …

This seems to be a pretty systemic flaw. Based on this alone I naturally can’t say exactly what’s wrong; it could be a Filezilla server-side issue, or it could be a Duplicati issue. At least it’s clear that the delete isn’t verified by Duplicati, because the file clearly isn’t deleted.

I think that by checking the source it would be very easy to see whether the code is broken, and/or to add an option to verify deletes (which I think shouldn’t be necessary after all).

  • Thank you

I’m glad to hear any comments / views on this.

Btw. This is so common, I’ve seen these flaws in countless systems over decades, ugh! Even in audited banking software…

If the Duplicati team says it’s naturally a Filezilla issue, I can set up a few tests and then create a ticket on their side. But I truly hope Filezilla, after all these years, isn’t this badly broken.

But it is verified, right?

I’m not a core developer by any stretch, but from what I understand Duplicati can detect errors when it attempts to delete a file. That’s why it is retrying the deletion N times, as you noted above. Eventually it will give up.

Upon next job operation, it verifies the files it sees on the back end and knows there is an “extra” file. It makes you fix this before it will continue.

Perhaps you are suggesting a major change in how Duplicati behaves if it cannot delete a file? Should it roll back all of the changes that resulted in a file needing to be deleted? (Perhaps it was due to a compaction event, or data retention policy.) To me that doesn’t really seem like the answer. And what happens if it was deleting several files and only failed to delete the last one?

I’m interested in the root cause of why files weren’t deleted. From previous discussions on the forum, I’ve seen people have this issue because of permissions or restrictions at the back end. Is Filezilla denying deletions? Maybe the NTFS permissions are?

Some Duplicati users have intentionally set permissions so that files cannot be deleted or modified after they are uploaded to the back end, as extra insurance against backups getting damaged by human error or ransomware. Duplicati can operate in this type of mode if you use unlimited retention and disable compaction. It will never try to delete files with that setup.

Usually when talking about atomic operations, there are two types of segments:
Roll Back
Roll Forward

If an error happens in the roll-back section, the work is rolled back and all is good, no harm done. Similarly, if there’s an issue in the roll-forward section, it means it will be retried.
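
As a minimal sketch of that split (generic Python with a hypothetical intents table, nothing Duplicati-specific): everything before the commit of the intent can simply be discarded, and everything after it must be re-driven on the next start.

  import sqlite3

  def delete_remote(name: str) -> None:
      # Stand-in for the real side effect, e.g. an FTP DELE.
      print("DELE", name)

  def delete_with_intent(db: sqlite3.Connection, name: str) -> None:
      # Roll-back section: nothing external has happened yet, so a crash here
      # just loses an uncommitted row and no harm is done.
      db.execute("INSERT INTO intents(action, target, state) VALUES ('delete', ?, 'pending')", (name,))
      db.commit()  # from this commit on we are in the roll-forward section

      # Roll-forward section: a crash here must lead to a retry at the next start.
      delete_remote(name)
      db.execute("UPDATE intents SET state = 'done' WHERE action = 'delete' AND target = ?", (name,))
      db.commit()

  if __name__ == "__main__":
      con = sqlite3.connect(":memory:")
      con.execute("CREATE TABLE intents(action TEXT, target TEXT, state TEXT)")
      delete_with_intent(con, "duplicati-20210804T210014Z.dlist.zip.aes")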

Now the state clearly didn’t correctly record the roll-forward requirement, and the system broke itself down. If there was an error with the delete, and deletion is mandatory, then the deletion should have been retried (later).

In this case the delete retry clearly didn’t occur later; it would have worked, since there are no issues with permissions. This seems to be happening quite often, yet I don’t know exactly why.

Verifying a delete is different from trusting the server’s claim that the delete was successful. As stated, I’ve seen so much broken software that I generally don’t trust any software, but sometimes it’s possible to guess how likely it is that something essential is badly broken.

Just for fun, a link to my blog, which addresses these issues because I’ve gotten sick and tired of this stuff:

Yet I’m not using aftp; maybe I should just change a bunch of backups to use it and see if that fixes the issue.

Edit after around half an hour of thinking… I’ll disable AV for the backups. I knew it was pointless, but in this case it could even be harmful if the file handle is being held by some other process. Also, the other thread where I complained about unrestorable backups (which is a dead serious issue, no kidding!) might be caused by i / b files being left in the directory.

I’ll do that, but it still changes nothing I said above. It just might make it “less probable”.

I’m not familiar enough with this part of Duplicati, but I believe it has some method already to track files that it thinks should be deleted. I’m not exactly sure why it doesn’t try deleting the file again automatically at the start of the next backup operation.

Your blog posting is good and I don’t disagree with the approach. Unfortunately Duplicati cannot follow that recipe exactly. Duplicati is designed to be storage agnostic and can only use the most rudimentary operations that are common to all storage platforms, namely: put, get, delete, list files. It cannot do a rename. Object storage (like S3) doesn’t support rename, append, move, or other types of operations.
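
For illustration only, the lowest common denominator described here really is roughly this small (a Python sketch, not Duplicati’s actual C# backend interface):

  from abc import ABC, abstractmethod

  class Backend(ABC):
      """Storage-agnostic primitives: no rename, append, move, or partial update."""

      @abstractmethod
      def put(self, remote_name: str, local_path: str) -> None: ...

      @abstractmethod
      def get(self, remote_name: str, local_path: str) -> None: ...

      @abstractmethod
      def delete(self, remote_name: str) -> None: ...

      @abstractmethod
      def list(self) -> list: ...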

In any case, I think it’s a good idea to disable AV on the backup data. It really is pointless, and on Windows it may be causing file locks that prevent deletion.

The .... section above is probably where it broke. It would be nice to have more info about that gap.

Possibly you don’t have a log-file=<path> with at least log-file-log-level=retry set up to record the log.
This is often the case, but when actively trying to debug, logs (and maybe DB bug reports) may help.
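
For example, adding something like the following to the job’s extra command-line options keeps a persistent record across runs (the two option names are the ones mentioned above; the path is just an example):

  --log-file="C:\duplicati-logs\job.log" --log-file-log-level=Retry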

One guess would be step 4 “Duplicati tries to delete file XXXYXXX, N times, and then gives up / gets terminated by max runtime Window.” but which one is it, and what does “max runtime Window” do? There’s no runtime limit that Duplicati provides, as far as I know. Is there an external monitor in use?

Because you ask: I think “I think by checking the source, it’s very easy to see …” is rarely how it goes; however, it goes better if it’s less of a mystery and more of a Steps to Reproduce, as the Issues template requests.

After interrupted backup, next backup failed with remote files that are not recorded in local storage #4485
didn’t get any steps to reproduce, but you can see some of the chat about SQL DBs and transactions.
403 error during compact forgot a dindex file deletion, getting Missing file error next run. #4129
also gets into them. Duplicati uses SQL DB commit and rollback, among other things. Are you an expert?

I’ll point to some code in case anyone wants to look. I’m not a C# or SQL expert, but my ideas here are:

You can see above that there’s tracking of the remote volumes using states. They are defined in code at:

Other commentary from the original author is at Recover automatically from interrupted backups #1243 .

It looks to me at first glance like an extra file must be unknown to the Remotevolume table, not just in an odd state.
A theory that it deleted the file and forgot that it deleted it would have to explain how the DB row got removed.
It would be nice to have clues from the log messages in a Retry level log, or even at Information level.

As you can see in the above code and comments, delete is supposed to be verified. Can you find a flaw?
For searching in source code, a lot of times one can find the state handling by searching for enum name.

I don’t know for sure that this failure is a Compact, but one can see the state change and later actions at:

and its requirement is to be able to list all files (see the backend.List() at top of RemoteListAnalysis) which might be more than one wants after every delete. There’s also the point that it shouldn’t be needed, referring to “(which I think shouldn’t be necessary after all).” in original post. It is verified, just not instantly, which possibly means you find out later. I’m not immediately blaming a server, but server logs might help.

I don’t want to pick further at the “Just for fun link” on file transfer, but while Duplicati can’t rename remote files, it does a similar operation on upload retries by uploading the same contents under a different name. When an upload failure is reported by the server, one never knows if that file made it up (all or part) or not. Rather than make assumptions, the idea seems to be to use state tracking to try to do the right thing later.

Odd results can occur. For example, here I’m seeing a file that actually made it up being promoted from Uploading.
The problem is, it’s a file that’s not needed now (because it was presumed lost) and it’s not hooked up right…

Remote file referenced … but not found in list, registering a missing remote file #4586

Check the details later in this post: I think there’s a clear transactionality / atomicity failure. And it’s not about the file transfer, it seems.

Agreed, and when I went to disable it, I found it had already been disabled. At one time I did see lots of issues with “locked files” with different AV products, but I don’t remember Defender ever causing problems on Windows 10.

This isn’t exactly the pattern I’ve seen in some cases, but it is a good example, and because I do have the server logs, I actually think this could lead to an improvement / fix in the future.

Yes, I’ll HUP Duplicati after a specific runtime, which is usually very permissive, but it’s there to ensure that it won’t be keeping file locks when high activity starts. Those locks tend to crash a few programs using the files. Also, of course, if files or groups of files are modified during a backup, it’ll lead to dead serious data integrity issues on another level, but that’s a different story. I’ve got reasons to make sure files aren’t locked and to try to maintain data integrity while the backup runs.

The point of this statement was to see in general whether specific situations, like termination of the process, are dirty and non-recoverable by design. When I write my own programs, I usually make them so that they always recover automatically. But sometimes, when I write software which isn’t production quality, it’s well known that if it’s aborted or whatever happens, it breaks down and requires manual steps to fix.

Actually an excellent reference, because I think this might be the same issue I’ve got here.

And for sure, I do use MS SQL Server, SQLite, MongoDB, Redis, and one proprietary database a lot in production, as well as writing hybrid integrations which deal with all the nasty stuff, like moving data between databases, files, NFS / SMB file shares, email, FTP, web services, or whatever, atomically and transactionally. As mentioned, it requires very specific steps to make sure that everything works correctly, especially in situations where some data needs to be sent once, and exactly once. Some of the solutions are nasty hybrids, where several separate technologies are used to process something, and that’s exactly where correctly designed and automated roll-forward / roll-back handling becomes incredibly important. All of my production integrations also handle fully automatic “hard kill at any time” recovery. - I love SQLite; it’s simple to use and extremely reliable.

Tracking with states is good, and I don’t think the FileListProcessor itself is broken; it correctly detects what has happened. But I think the state tracking logic has some issues. I’ll open this up a bit more with the logs I’ve got, and I’ll look at the states being used and exactly when those changes are made and committed.

Honestly, I have to say, I love that list. It’s exactly as it’s supposed to be. Now there’s clearly just some issue with how the data about those states is updated, i.e. the states deleting, deleted and uploading, uploaded, verified. Perfect!

So clearly it’s not supposed to be broken by design, which I was a bit afraid of.

That’s exactly the conclusion I’m also coming to. But because you asked for extra logs, I’ve got some, and I think they explain pretty well what happened. I’ve cropped the non-essential stuff from the logs, which I personally think is all good, and left only the key rows.

Yes, I saw it, and loved it. Currently my best bet is that the states in the local DB aren’t being updated and/or committed correctly, which leads to an “invalid state”, or maybe it’s better to say an inaccurate state, in the database later.

It might be that it’s only compact that’s doing something slightly incorrectly.

This is more efficient and a good approach; I’ve got nothing against it. Yet it requires that the state in the local DB is correct, which is starting to look like it isn’t. - Having a temporary option to verify a delete using a list-after-delete could have been a good thing to test, unless there’s a proper implementation. But because there is a more efficient and proper implementation, writing such temporary hacks is pointless and just adds more code.

Yes, I know. Mostly the rename step is there because in many cases the shared folder can be polled by other process(es), which might start processing the partial file as soon as they see the filename in a listing. When the directory is only used by a single Duplicati process, the rename step isn’t required, but it’s good to verify at least the file size in the listing. I’ve understood that this is being done.
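
A sketch of that shared-folder convention for a plain FTP destination, using Python’s standard ftplib (this is the general pattern being described here, not something Duplicati’s storage-agnostic backends can rely on, since many of them cannot rename; SIZE support on the server is assumed):

  import os
  from ftplib import FTP

  def upload_then_rename(ftp: FTP, local_path: str, final_name: str) -> None:
      # Upload under a temporary name so a poller never sees a partial file.
      tmp_name = final_name + ".part"
      with open(local_path, "rb") as f:
          ftp.storbinary(f"STOR {tmp_name}", f)
      # Verify at least the size in the listing before exposing the final name.
      if ftp.size(tmp_name) == os.path.getsize(local_path):
          ftp.rename(tmp_name, final_name)
      else:
          ftp.delete(tmp_name)  # partial upload: discard it and let the caller retry
          raise IOError(f"size mismatch after uploading {final_name}")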

Exactly.

But, after all this philosophical discussion, to the facts:
Duplicati logs (essential stuff)
Running compact…

  Uploading file (32,00 MT) ...
  Uploading file (219,97 KT) ...
  Deleting file duplicati-b39e650a3b9a04f00a574b6e80f0f2020.dblock.zip.aes (32,00 MT) ...
  Deleting file duplicati-i696ee4c75dac485d8989c804d3578c2d.dindex.zip.aes (316,12 KT) ...
  Deleting file duplicati-b30226d623d4c4b018b38c6363f4c5805.dblock.zip.aes (31,91 MT) ...
  Deleting file duplicati-ic230dfa9f8574cb1a13b040a7a6d1048.dindex.zip.aes (1,02 MT) ...
  Deleting file duplicati-b78d5a6480700450d88f53fbfa4a1afcc.dblock.zip.aes (9,52 MT) ...
  Deleting file duplicati-i64167367696d40f7bb5df754755ab738.dindex.zip.aes (862,18 KT) ...
  Deleting file duplicati-bbb15de95676a419f9bc540dd0db3d08f.dblock.zip.aes (31,91 MT) ...
  Deleting file duplicati-ia8ea7ae5b6b54065ba1fd6d4656e6fe9.dindex.zip.aes (903,70 KT) ...
  Deleting file duplicati-b97e75daa567341a3a3a9a271c1774a0f.dblock.zip.aes (6,48 MT) ...
  Deleting file duplicati-i841e01e5afab4c0ab4dc23890d14bbf4.dindex.zip.aes (667,23 KT) ...
  Deleting file duplicati-b2e61b1d0e10b45d28dbab5382a40f49a.dblock.zip.aes (31,91 MT) ...
  Deleting file duplicati-ia7d9f1b2ce41405ba8f81690f74e9db7.dindex.zip.aes (1,02 MT) ...
  Deleting file duplicati-b1abacc4284d944f0b736d4b031b097f6.dblock.zip.aes (10,43 MT) ...
  Deleting file duplicati-i897cd07e155a4f0096a2241a360a3f3d.dindex.zip.aes (870,03 KT) ...
  Deleting file duplicati-b0053fc052f1d46ca9ee06330fc26b949.dblock.zip.aes (31,91 MT) ...
  Deleting file duplicati-ief3ae5cd6c304de38517ecd56572451c.dindex.zip.aes (1,02 MT) ...
  Deleting file duplicati-b8facb9d1bf7748879748555b39778282.dblock.zip.aes (10,09 MT) ...
  Deleting file duplicati-i26c8c0e7b0ca464ea01cee3f2e8c72d0.dindex.zip.aes (865,58 KT) ...
  Deleting file duplicati-bafe10150a69d4d31be438fdc1cacd901.dblock.zip.aes (31,90 MT) ...
  Deleting file duplicati-i848fed9439dd4224a4f100659793f243.dindex.zip.aes (1,02 MT) ...
  Deleting file duplicati-b37e3326b73da4736b6f64a01c7ea08db.dblock.zip.aes (10,20 MT) ...
  Deleting file duplicati-i7c2b688e1bb74cc98bd95976f84497ef.dindex.zip.aes (865,75 KT) ...
  Deleting file duplicati-b8682a5b224b94937bf3acc8483339bbf.dblock.zip.aes (30,78 MT) ...
  Deleting file duplicati-i0c0c9258c4bb4c2c847fd4d1f31c850f.dindex.zip.aes (1,81 MT) ...
  Downloading file (32,00 MT) ...
  Uploading file (32,00 MT) ...
  Uploading file (32,00 MT) ...
Update "2.0.6.3_beta_2021-06-17" detected

It seems good, but here’s the next log for the same instance, while trying to back up stuff:

Checking remote backup ...
  Listing remote folder ...
Extra unknown file: duplicati-b74ebc19b6b294681bf092e11e4969ec1.dblock.zip.aes
Missing file: duplicati-i696ee4c75dac485d8989c804d3578c2d.dindex.zip.aes
Missing file: duplicati-ic230dfa9f8574cb1a13b040a7a6d1048.dindex.zip.aes
Missing file: duplicati-i64167367696d40f7bb5df754755ab738.dindex.zip.aes
Missing file: duplicati-ia8ea7ae5b6b54065ba1fd6d4656e6fe9.dindex.zip.aes
Missing file: duplicati-i841e01e5afab4c0ab4dc23890d14bbf4.dindex.zip.aes
Missing file: duplicati-ia7d9f1b2ce41405ba8f81690f74e9db7.dindex.zip.aes
Missing file: duplicati-i897cd07e155a4f0096a2241a360a3f3d.dindex.zip.aes
Missing file: duplicati-ief3ae5cd6c304de38517ecd56572451c.dindex.zip.aes
Missing file: duplicati-i26c8c0e7b0ca464ea01cee3f2e8c72d0.dindex.zip.aes
Missing file: duplicati-i848fed9439dd4224a4f100659793f243.dindex.zip.aes
Missing file: duplicati-i7c2b688e1bb74cc98bd95976f84497ef.dindex.zip.aes
Missing file: duplicati-i0c0c9258c4bb4c2c847fd4d1f31c850f.dindex.zip.aes
Found 1 remote files that are not recorded in local storage, please run repair
Backend verification failed, attempting automatic cleanup => Found 1 remote files that are not recorded in local storage, please run repair

Yet, interestingly, the first compact run also produced stderr output:

System.ArgumentNullException: Path cannot be null.
Parameter name: path
   at Duplicati.Library.Main.BackendManager.WaitForEmpty(LocalDatabase db, IDbTransaction transation)
   at Duplicati.Library.Main.Operation.CompactHandler.DoDelete(LocalDeleteDatabase db, BackendManager backend, IEnumerable`1 deleteableVolumes, IDbTransaction& transaction)
   at Duplicati.Library.Main.Operation.CompactHandler.DoCompact(LocalDeleteDatabase db, Boolean hasVerifiedBackend, IDbTransaction& transaction, BackendManager sharedBackend)
   at Duplicati.Library.Main.Operation.CompactHandler.Run()
   at Duplicati.Library.Main.Controller.RunAction[T](T result, String[]& paths, IFilter& filter, Action`1 method)
   at Duplicati.Library.Main.Controller.Compact()
   at Duplicati.CommandLine.Commands.Compact(TextWriter outwriter, Action`1 setup, List`1 args, Dictionary`2 options, IFilter filter)
   at Duplicati.CommandLine.Program.ParseCommandLine(TextWriter outwriter, Action`1 setup, Boolean& verboseErrors, String[] args)
   at Duplicati.CommandLine.Program.RunCommandLine(TextWriter outwriter, TextWriter errwriter, Action`1 setup, String[] args)

That says absolutely nothing to me, except of course the exception type is clear. But why?

At this point, I think the local DB state is messed up and the automatic recovery doesn’t work correctly, which naturally shouldn’t happen, whatever the reason for the crash was. It could just as well have been unplugging the power cord, hard-restarting the server, or whatever.

The server-end logs are all good and just confirm the previous statement; nothing fishy at the server end.

... deletes as expected, all good ...
12.8.2021 2.17.56 - DELE duplicati-i0c0c9258c4bb4c2c847fd4d1f31c850f.dindex.zip.aes
12.8.2021 2.17.56 - 250 File deleted successfully
12.8.2021 2.18.47 - RETR duplicati-b5975d108bfb24a74a2d119efa43b91a2.dblock.zip.aes
12.8.2021 2.18.53 - 226 Successfully transferred "/duplicati-b5975d108bfb24a74a2d119efa43b91a2.dblock.zip.aes"
12.8.2021 2.20.16 - STOR duplicati-b74ebc19b6b294681bf092e11e4969ec1.dblock.zip.aes
12.8.2021 2.20.20 - 226 Successfully transferred "/duplicati-b74ebc19b6b294681bf092e11e4969ec1.dblock.zip.aes"
12.8.2021 2.20.20 - LIST duplicati-b74ebc19b6b294681bf092e11e4969ec1.dblock.zip.aes
... And the second run ...
13.8.2021 2.00.43 - LIST
It only retrieved the list and decided that things are messed up, no more entries.

The question remains: why is the client unaware of the successful deletes and of the files it has sent to the server? Is a state update or some key commit missing? (Sure, I’ve made similar mistakes myself countless times, can’t deny it.) It seems like a pretty basic logic mistake.

This case wasn’t the situation where the delete command was issued several times, probably due to the FTP server going down during the process, but I think it’s a very good and clear example. Because there’s interest in seeing what’s wrong, next time I see something interesting I’ll dig up the logs to begin with.

Edit: added the 226 status rows to the Filezilla log… Sorry if the quotes are messed up, I didn’t really focus on those.

Quite a lengthy response! I don’t have the time at the moment to respond to many of your comments but this one caught my eye:

I would really recommend not HUP’ing Duplicati if you can avoid it. Depending on where it was in the operation, it may cause issues. Have you considered an alternative solution? Using VSS will totally avoid the issue of Duplicati locking files.

If possible, it would be good to log these to see whether they correlate with more problems later on, as they sometimes do.

If you clicked on my links to issues, you can see that I was starting to try a process-killing stress tester.
I’ve cut down on that due to lack of time (not enough forum volunteers), and difficulty of isolating issues.
As mentioned, developers want reproducible cases. That’s an additional level beyond the initial random kill tests…

Historical and original design questions are hard to answer due to limited or no access to the needed parties; however, my impression is that Duplicati takes extensive steps to be able to recover. Not all of them were perfect. For example, after an interrupted backup, the next backup makes a “synthetic file list” composed of the backup before the interrupted one, plus whatever got done in the interrupted one. There was a coding bug before 2.0.6.x that caused that to fail. One area that is also improving is the ability of the Stop button to drive shutdown. 2.0.5.x could probably do “Stop after current file” safely; 2.0.6.x thinks that it made “Stop now” safe…

Hard-kill safety is somewhat of a different neighborhood, and HUP safety is a little different, because systems often follow a HUP with a TERM or worse if the program can’t close down fast. With uploads midway, it’s hard. The best you can do is try to pick up the pieces later, which Duplicati does try to do. Are there gaps? Quite possibly.

The last time I tried looking, I didn’t see signal handling, but if one were to add it, what would one do to handle it? Transaction rollback at the next backup is handled by SQLite (we hope), and a last-second Duplicati commit is potentially even worse than letting SQLite do a rollback. Transactions (and SQL) also aren’t in my area of expertise.
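
For what it’s worth, a handler usually only records the request and lets the worker stop at the next safe point; here is a generic Python sketch of that idea (not a proposal for Duplicati’s actual threading model):

  import signal
  import time

  stop_requested = False

  def request_stop(signum, frame):
      # Only record the request; the worker decides when it is safe to stop.
      global stop_requested
      stop_requested = True

  signal.signal(signal.SIGTERM, request_stop)
  if hasattr(signal, "SIGHUP"):          # SIGHUP exists only on POSIX platforms
      signal.signal(signal.SIGHUP, request_stop)

  while not stop_requested:
      # ... process one unit of work, commit the database, then re-check the flag ...
      time.sleep(1)
  print("stopping at a safe point")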

Good practice, and probably required some testing. Do you use a kill tester, then clean up any gaps seen? Testing isn’t every developer’s thing. Some are lucky enough to have test groups. Duplicati could use one. There’s some unit testing to try to help find regressions, but I’m not sure how “rough” it is on the software. Ideally the test code or test team would beat up the software pretty hard, just as the real world might do…

Some people are good at breaking software and then characterizing it well, to assist the developers. I try to help. You actually seem to have systems and logs, so you are in a good position, but it does take a lot of work.

I think as Duplicati matures, it’s improved from “falls over all the time, all by itself” to getting pretty good at mainline backup (but having difficulty with the rarer compact). Increasingly rare situations get increasingly hard.

How do you debug rare mystery problems? Here, we rely on logs and maybe databases, neither of which are as perfect as they could be. Sometimes logging at ExplicitOnly or special-purpose logging is used.

Having a DB bug report compatible with production SQL uses would help, but sanitization renames tables. One time I wrestled with altering received DB, but it would be good to not need that. I might open an Issue

I’m a bit afraid of design bugs that might be hard to fix. For example, there are lots of active threads doing their things simultaneously. Concurrency is frequently confusing to humans (and even to debugging tools).

Some of the concurrent work involves the database, but the commit/rollback model seems to fit a simpler sequential model better. I am not a DB professional, and I suspect no DB professional has helped with the general design.

Specifically, I haven’t been able to understand the design well enough to know if a non-commit from thread A can be circumvented by a commit from thread B (because it’s time for thread B, even if not for thread A). This is where Duplicati needs someone with database skills plus at least a bit of C# reading to go and look.

Development work on issues has definitely gotten confused by the who-did-it question (e.g. if profiling logs show a certain SQL operation), but my question is deeper than that. I don’t know if anyone’s dug that far…

Did it seem good all the way to a clean typical end? I’m not sure what yours are like, but mine are different.
Never mind. Later on it looks less good. I can see that this progressive writing and reading will be bumpy.
I’m not removing questions asked that are answered later. Info is good, and we’re following the same line.

duplicati-b74ebc19b6b294681bf092e11e4969ec1.dblock.zip.aes wasn’t named before, but now is. For these, one approach is to have had a running log (or to pull together a set) to see the prior history of that file.
You can also get a clue (up to the default 30-day deletion) from the DB RemoteOperation table. Filter by name, and don’t trust it totally for recent things, as a DB rollback (if done) can also roll back the last few log entries.
For files that still exist (like this one), the backend date can give a very vague clue, i.e. when the file was uploaded.

For the now-perceived-as-missing files, I haven’t lined them all up, but I see some on the earlier delete list, so this might be one of those cases where an abnormal end in compact forgot some things it had done, due to rollback. That might also explain the “Extra unknown” file. Was that uploaded earlier in the log you quoted? Never mind, you showed it later from the server view. I’m still not sure what your normal end looks like in either log…

suggests you believe this is an atypical ending. Is the Duplicati log also atypical? Offhand, the sequence looks strange earlier too. I thought the compact would be a set of downloads, then an upload of the new dblock and dindex, then deletions of the files that are now obsolete because their still-in-use blocks were repackaged.

I suspect a rollback at the next backup from a non-commit in the one before (which maybe got killed by SIGHUP?).

is a possible suspect for a commit that was delayed, and backend.WaitForEmpty being the delayer fits,

so if Duplicati jumped out of the compact that way, then this one wasn’t SIGHUP (though it still makes me nervous); possibly it never did the commit, which fits with it forgetting that it had deleted a load of files.

But if you say it should commit everything before the operations are done (and the backend handling is pretty complex and not my area of expertise), then that also seems like a logic mistake. So which way?

Regardless, I think we’re narrowing in on causes, but this does remind me of some of the issues I filed.
You might have noticed they’re sort of stalled, partly due to not having the exact steps to reproduce yet.

There is also always a lack of volunteers (anybody out there?), and some definite need of certain skills.
Not everything needs high skills, but there’s a lack of volunteers in general, even in helping in the forum.

I’ll check out VSS. I personally think the HUP practically never happens except when FTPS hangs; the time frame allowed for the backup / compact runs (which I run as separate steps) is very generous.

Yes, I do, and just as I replied above, it very rarely occurs, but it’s still in place just to be sure. I can check the VSS aspect.

Yes, I’ve seen that thread before. And I liked it even back then.

You really wouldn’t believe how many times I’ve gone through that discussion during my career.

Agreed. Yet completely failing at that takes just a very, very tiny mistake. (I’ve done it; sometimes it was painful to locate the issue. Some rare threading / timing / whatever issues are almost impossible to reproduce.) But I’ll get back to that a bit later.

That’s good. I haven’t seen any other polite way to tell the process to quit than using HUP. Is there one?

Agreed, but you don’t want to know how bad the situation is when you’ve got N+1 systems broken down after a sudden data center power loss, and then someone would need to manually start fixing those from more or less random error states and corrupted data. No ffff-reaking way. It’s already bad enough that there’s downtime, but it’s worse if the downtime becomes 100x longer because there’s nobody (or very limited competent resources) to fix the messed-up systems.

Don’t worry, any production-quality SQL (or other ACID) database will most likely work great, even with hard kills. That’s why there are roll-back / roll-forward logs (in the case of SQLite you can choose which), and it’s very, very reliable. When someone claims that there’s a database problem, they’re most likely wrong. It’s possible, but it’s very low probability compared to other code being broken. It’s just like a CPU model being broken and executing code incorrectly: sure, it’s possible, but still very unlikely.
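
For reference, the journal choice mentioned here is a one-line SQLite PRAGMA, and the defaults are already hard-kill safe; it takes deliberately weakening them to lose that (a sketch):

  import sqlite3

  con = sqlite3.connect("state.sqlite")
  # WAL = roll-forward log; DELETE (the default) = rollback journal.
  con.execute("PRAGMA journal_mode=WAL")
  # FULL (or NORMAL under WAL) syncs at the right points; OFF is the
  # "good luck with a hard power-off" setting described later in this thread.
  con.execute("PRAGMA synchronous=FULL")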

My personal saying is that it’s totally impossible, and silly, to try to kill-test these problems out. It’s just like trying to make programs thread-safe by running them and seeing what kinds of problems occur. It’s the losing way.

The correct way is to design programs correctly and then very, very carefully handle the key points where state transitions happen. On top of that, have good cleanup / recovery code for when a task X is still marked as running when the program starts the next time. The recovery code also needs to take care of situations which weren’t actually created by the program itself. As an example: data packet X is marked as ready to be transferred in the outbound queue. Well, what if X isn’t present anymore? In the very early days, I had issues where someone modified something manually, so that the program would try to recover from a known state, which was impossible due to the manually made changes. Later I changed all my programs so that they very clearly log the state violation, but still resume. This makes recovery much more robust in case someone decides to fiddle with something they shouldn’t touch.
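
A minimal startup-recovery sketch of that idea, reusing the hypothetical intents table from the earlier sketch (generic Python, not Duplicati code):

  import logging
  import sqlite3

  log = logging.getLogger("recovery")

  def delete_remote(name: str) -> None:
      # Stand-in for the real backend delete call.
      print("DELE", name)

  def recover(db: sqlite3.Connection, remote_files: set) -> None:
      """Re-drive every delete intent that was committed but never marked done."""
      rows = db.execute(
          "SELECT target FROM intents WHERE action = 'delete' AND state = 'pending'").fetchall()
      for (target,) in rows:
          if target not in remote_files:
              # State violation (e.g. someone removed the file by hand):
              # log it clearly, but resume instead of aborting.
              log.warning("pending delete for %s, but the file is already gone", target)
          else:
              delete_remote(target)
          db.execute("UPDATE intents SET state = 'done' WHERE target = ?", (target,))
      db.commit()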

Yet sometimes it’s hard to define what should be fatal and always abort, versus what to try to recover from. For example, if someone from the devops team decides to delete some table instead of clearing it. Yes, I recover from that automatically nowadays, but I didn’t always.

I always remember how I messed up OpenBazaar at one point. I cleared the configuration data directory, but didn’t delete it. Then the program saw that the directory existed and tried to read the configuration, which of course failed. - Duh!

When you mentioned compacting reliability (downloads, uploads, and deletes), it means that the state recovery error I quoted from the logs did occur during data compaction. So that should be a good place to statically verify the code: that all the operations are properly journaled and recovered in case the task doesn’t get completed.

That’s always very painful. I do everything I can to avoid such issues, but when they happen, yes, extreme logging around the expected problem area is used, usually in a temporary runtime log which is deleted if everything passes correctly. And when things get very, very painful and technically impossible to reproduce in any sane way, then I write secondary runtime validator code, sometimes even double-layer validation code. Of course at first that sounds and feels stupid (what do you mean, why do you check that 1 + 2 = 3?), but after some time you start to realize that it’s necessary. That’s basically integration testing and monitoring code running in real time: doing lots of cross-checks on data, inputs, outputs, generic logical data-integrity validation, and so on. Of course, when something fails, there’s the pain of deciding which of the outputs is actually correct; the only thing you know is that there’s a mismatch, but what’s causing it and why, and how it should work, is, ahh, juicy. But the bonus is getting the operations which might be causing problems clearly flagged for analysis. Without that, it’s totally hopeless to see that we’ve processed 4 billion transactions and the inputs and outputs clearly don’t match. What’s wrong? Well, something, somewhere.

But that’s very, very important. The worst thing that happens with systems is that someone asks you to change one thing quickly, but because the whole thing is a huge jigsaw puzzle, you really can’t change one thing without affecting everything else.

The joke is that if someone says the specification is good, you give the specification to three teams. They each develop a program according to the specification (using different tools; for example, one program written in Go, one in Java, and one in Python). Then the Go code runs on ARM Linux, the Java code runs on an Intel Windows Server, and the Python code runs on ARM with a kernel version a year older, and all the inputs and outputs are compared. If the output from all three implementations matches, then it’s good, and if not, then it’s broken. That way we get full-stack validation, up from the hardware, compiler, runtime, etc. level. - Expensive, hard? Sure. Results guaranteed? Not 100%. But it’s still better than nothing.

True, all the fun race conditions and so on. With this hardware it works, but with that hardware it doesn’t, because the code happens to execute a bit differently. Been there, done that.

With SQLite you can’t have write concurrency; if you have long write operations, you’ll end up with lock contention instantly. I’ve optimized several projects for that. The key is to read the data, process it, then quickly check that the state is still as expected, update the database, and commit. That guarantees global progress, even if some tasks might end up retrying repeatedly. The drawback is that if the same data is constantly being updated by faster and slower operations, the slower ones will persistently fail, but that’s clearly bad design then.
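
A sketch of that read / process / re-check / commit loop, using a version column as the cheap “state is still as expected” check (hypothetical schema):

  import sqlite3

  def process_one(db: sqlite3.Connection, row_id: int) -> bool:
      """Returns True on success, False if the row changed underneath us (caller retries)."""
      value, version = db.execute(
          "SELECT value, version FROM work WHERE id = ?", (row_id,)).fetchone()

      result = value.upper()  # the slow processing happens outside any write lock

      cur = db.execute(
          "UPDATE work SET value = ?, version = version + 1 WHERE id = ? AND version = ?",
          (result, row_id, version))
      db.commit()
      return cur.rowcount == 1  # 0 rows means someone else got there first: retry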

Of course, with SQL Server and others you can make exactly the same mistakes; lock contention is a very classic problem.

Actually, making that mistake is possible, because the commit / rollback is connection-wide; depending on how it’s implemented in Duplicati, it’s totally possible to fail at that. Calling rollback will roll back stuff you didn’t think it would roll back. There are no cursor-specific transactions. The simplest remedy with low database write loads is to start every updating operation with an exclusive lock. Sure, very inefficient, but also very safe. Comparable to the Python GIL.
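
The exclusive-lock remedy is a one-liner in SQLite (a sketch; it serializes all writers, which is exactly the point):

  import sqlite3

  con = sqlite3.connect("state.sqlite", isolation_level=None)  # manage transactions by hand
  con.execute("BEGIN EXCLUSIVE")  # take the database write lock up front
  con.execute("UPDATE work SET value = 'x' WHERE id = 1")
  con.execute("COMMIT")           # or ROLLBACK on failure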

A long time ago, I wrote several “worst case stress testers” for different database backends. No, I didn’t expect the databases to fail, but I wanted to know all the ways my code could fail with a specific database backend. I found quite a few coding patterns that produce code which looks good but doesn’t actually work atomically.

Yes, agreed. I just took the long philosophical path to the topic. Haha.

That’s why I always have the extreme debug logging in a separate file / database with explicit flush / sync, so data loss shouldn’t happen. Of course that makes performance suck in general terms. I’ve also seen some programmers write logs only when things go wrong, which is of course ridiculous: when things go wrong badly enough, there won’t be any logs. Classic trap.

It seems that the program crashed with that empty path. I’ve truncated the delete list, but because the operations are sequential, it shows everything which is logically required. It did operations on the remote storage; after the crash, recovery didn’t kick in correctly. Either the operation wasn’t logged as being completed, and/or the recovery code didn’t work correctly / didn’t activate.

Reminds me of a situation where a devops guy used to delete the database WAL because it was huge and the roll-forward would take so long. But you bleeping can’t do that. It only makes the situation much worse than just waiting for recovery to happen; you could even end up with torn pages, and at the very least transaction safety is completely gone, as well as data integrity.

I believe that in this case the program crashed with that null path error, and on top of that the atomicity of the operation didn’t work correctly. So my guess is that we’ve got two overlapping errors.

If the atomicity is working correctly, it doesn’t matter why the termination happened. But in this case it wasn’t a HUP, because it was way too quick.

If so, it’s a clear logic violation, which just needs to be fixed.

Partially agreed. But if the logic works correctly, it should have first written what it’s going to do and committed that, and then committed again when it’s done. If the operation was killed, aborted, or whatever, it should still continue with the correct recovery procedures when the program starts the next time. - Actually, from a logical perspective this is incredibly simple, yet I’ve discussed these issues with developers for decades. The worst part is that sometimes they’ll “fix something” by making it just “less likely to happen”, which of course isn’t a fix at all.

That’s right: from the status list you nicely linked earlier, there should be statuses like “deleting” and “uploading”, and those should be committed. When recovery kicks in, it should check whether the deletes are actually done and whether the uploads are completed, and if possible either roll back and retry the whole operation or, if very well written, even resume the operation where it left off.

I’ve dealt with this kind of situation before. Let’s say the data block is huge and the connection is bad. If you’re uploading a 16-gigabyte file, and the connection cuts off randomly every 2 - 15 minutes over a bad, slow satellite link, you’ll never finish the upload if you restart the operation from the beginning. In this case, when it says “uploading file”, it’s very important to check whether the operation can be resumed.

Another issue is some RESTful APIs which lack something obvious like a transaction ID. If I’m supposed to send something exactly once, but the data doesn’t contain any unique identifier, how is that going to work out? I can tell you: it’s never going to work. Yet I’ve seen plenty of solutions like this. Something was sent and never got a reply. Should we retry, risking delivering it twice, or should we not retry, risking not delivering it at all? - Boom! The correct way to fix this is a transaction ID, with the remote end being able to tell us either “we already got that” or “thanks, now we got it”.
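
A sketch of the transaction-ID fix (the endpoint and the post callable are hypothetical; the only point is that the same ID travels with every retry so the receiver can deduplicate):

  import uuid

  def deliver_exactly_once(payload: dict, post) -> None:
      """post(url, json) is whatever HTTP client you use; the receiver is assumed
      to answer 'already got that' when it sees a repeated transaction_id."""
      # Assign the ID once, before the first attempt (and in a real system,
      # persist it, so a process restart also reuses it).
      message = dict(payload, transaction_id=str(uuid.uuid4()))
      while True:
          try:
              post("https://example.invalid/receive", message)  # hypothetical endpoint
              return
          except OSError:
              continue  # safe to retry: redelivery with the same ID is idempotent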

Only when the operation is completed can it be flagged as completed, and as mentioned before, if there’s a need for extra safety (at some cost in efficiency), even a completed task can be promoted to verified-as-completed and only then removed. I.e., after the server said the delete was OK, it’s confirmed that the file isn’t in the file list anymore.
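
Pulled together for the delete case this thread started with, the sequence reads roughly like this (a Python sketch over plain ftplib; the state names echo Duplicati’s remote-volume states, but the table and code here are hypothetical):

  import sqlite3
  from ftplib import FTP, error_perm

  def delete_verified(db: sqlite3.Connection, ftp: FTP, name: str) -> None:
      # 1. Record and commit the intention, so it survives a crash.
      db.execute("UPDATE remotevolume SET state = 'Deleting' WHERE name = ?", (name,))
      db.commit()

      # 2. Issue the delete; a 550 may just mean the file is already gone.
      try:
          ftp.delete(name)
      except error_perm:
          pass

      # 3. Verify against the listing before recording the final state.
      if name not in ftp.nlst():
          db.execute("UPDATE remotevolume SET state = 'Deleted' WHERE name = ?", (name,))
          db.commit()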

Based on this, I would personally take steps to inspect and verify in the code how the file transfer / recovery state is updated and committed. It’s very, very likely flawed; it’s almost obvious.

And to be clear, I haven’t been deleting the database .journal or -wal files just to troll you guys.

Edit: added the statement about problematic cases where correct operation is impossible to guarantee, like badly designed RESTful APIs.

But because we’ve wasted (?) so much time already, here’s debug data from the RemoteOperation table…

Well this just confirmed what I said earlier… No surprises here.

Last event before the “magic list” appearance:
(21397, 1954, 1628723865, 'put', 'duplicati-i9cb931576b6c4b0690e86e65005260b5.dindex.zip.aes', '{"Size":225245,"Hash":"zn9/QI2uoeBx+yy1kj3EQ402gj9L9sPb7LWzLYUOmUU="}')

What’s missing here?!?

The file then magically appears in list operation:
(21398, 1955, 1628809239, 'list', ...

And the delete operation is here too:
(21412, 1983, 1629780078, 'delete', 'duplicati-b74ebc19b6b294681bf092e11e4969ec1.dblock.zip.aes', None)

So? Where’s the intention to upload and / or the actual upload? Depending, of course, on whether the completion is a separate entry or whether it just updates the intention to upload… But the end result is still the same… Boom, 100% confirmed to be broken…

Edit: added the last operation with a timestamp before the list op.

Due to the seeming lack of a signal handler, HUP probably just does the default action, which is to terminate the process.
I just tested this using strace to watch signal calls on SIGHUP, and it looked like pretty much an instant kill.
The only possible nicer way I know of is a third-party tool that drives the web UI from a CLI client in Python.
Pectojin / duplicati-client has an abort command to abort a task, which might do it, but I don’t personally use the client.

I’m not worried about the database itself (though it sometimes gets low-level corruption and can’t reopen), but more about the usage. A commit should be done at the “right” time per the design, not on emergency stops.

If you mean designing it correctly from the start, there is no such luxury now. If you mean learning the design well, that takes time…

Although the entire design of the states could probably use some review to make sure it is correct, and to understand exactly how to use them (in terms of what gets changed, and in what order), possibly having the DB record Deleting (and it’s not really recorded until a commit) would mark the intention, so a sudden death before the destination delete was done could be finished at the next backup. A larger question involves all the files in the compact cycle of downloads/upload/deletes, and much of it may be asynchronous, which means a commit here while action is happening elsewhere might affect DB work done elsewhere; however, not doing a commit leaves a hole in case of sudden death, so “just needs to be fixed” may need quite a bit of study. Still, I think it’s a good lead that we’ve been building here (and thanks for your reporting).

Beyond file transfers, there’s a lot of book-keeping done in the compact, because all of the blocks that had been in the old dblock files are now in the new one. There’s the chance that the new dblock will fail upload, which means it gets retried under a different dblock name, but that also means a new dindex file is made, because the dindex file has the name of its associated dblock file inside it. There’s some complexity here; however, somebody (preferably a developer) will have to dig into the current design and see whether flaws are visible. The problem with that gets back to the lack of volunteer developers. There’s not a lot of time for deep digs today…

Thanks for the reference; if it’s possible, I could make that call instead of HUP.

I write and run my own software anyway, so there’s no problem making API calls to close Duplicati. But I also assume the Duplicati CLI version doesn’t run an HTTP server. Yet I haven’t checked that. Ahh, annoying pointless speculation; I’ll check it right now.

Checked it; it seems that the process, at least by default, doesn’t listen on any port or provide any HTTP API while running. Just as well, there could be a command like Duplicati.CommandLine.exe stop, yet usually I don’t care how things are implemented, as long as things are done right.

Strange, that really shouldn’t happen. It sounds strange and unlikely. So far I haven’t had any problems with SQLite3, even though the processes get terminated a lot daily; because it’s hard-kill safe, there’s no need to carefully avoid that. Are you sure it’s a corrupted SQLite3 database, and not logical high-level corruption, which prevents a program like Duplicati from starting up and reading its own data? Of course there are ways to sabotage that, like removing file system sync barriers, disabling the journal, allowing extremely long commit times of minutes or tens of minutes, and enabling data writeback. Then it’s good luck with a hard power-off, but who would do that in production?

Exactly. Or even if the delete is completed but the response isn’t received (or whatever causes it to not finish as expected).

A larger question involves all the files in the compact cycle of downloads/upload/deletes, and much of it may be asynchronous, which means a commit here while action is happening elsewhere might affect DB work done elsewhere; however, not doing a commit leaves a hole in case of sudden death, so “just needs to be fixed” may need quite a bit of study. Still, I think it’s a good lead that we’ve been building here (and thanks for your reporting).

Exactly. Depending on how small the segments are in which the compact is committed to the local database, there can be simpler or more complex ways to recover. Either status is committed more often locally, so less analysis is required when recovering, or it’s committed more seldom, and then more advanced logic is required to recover. This is where efficiency comes in: sometimes I write absurdly simple, frequently-committing code to keep the recovery simple. But at times that ruins performance so badly that it requires just simple state logging and more complex analysis in case there’s a need to recover from an interruption.

Also, at this point it’s important to decide whether the files to be uploaded are written to “temp” or to “queue”. What’s the difference? Temp can be cleared at any time without violating integrity, but queue can’t.

I would personally implement this so that first the N blocks are downloaded to temp, then compacted into new blocks, which are placed in the outbound queue and synced to disk. This is the roll-back section, which can be reverted without any problems at any time. Then I would mark the intent to upload the new blocks and delete the old blocks (in that specific order). This is the roll-forward section, which will be retried on recovery and can’t be skipped. When this step completes, the uploads and deletes are marked as completed. Maybe at this stage it would also make sense to verify the deletes and uploads, so we can safely update the local database with the new state.
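
As a sketch of that ordering (hypothetical directories, intents table, and repack helper; backend is anything with the put/get/delete primitives sketched earlier):

  import os
  import shutil
  import sqlite3

  def repack(tmp_dir: str) -> str:
      # Stand-in: would rewrite the still-used blocks from tmp_dir into a new volume.
      path = os.path.join(tmp_dir, "new-volume.dblock.zip.aes")
      open(path, "wb").close()
      return path

  def compact_once(db: sqlite3.Connection, backend, old_names: list,
                   tmp_dir: str = "temp", queue_dir: str = "queue") -> None:
      # Roll-back section: everything under tmp_dir may be wiped at any time.
      for name in old_names:
          backend.get(name, os.path.join(tmp_dir, name))
      new_path = repack(tmp_dir)
      new_name = os.path.basename(new_path)
      queued = shutil.move(new_path, os.path.join(queue_dir, new_name))

      # Mark the intent (upload new, then delete old) and commit: roll-forward begins.
      db.execute("INSERT INTO intents(action, target, state) VALUES ('upload', ?, 'pending')", (new_name,))
      db.executemany("INSERT INTO intents(action, target, state) VALUES ('delete', ?, 'pending')",
                     [(n,) for n in old_names])
      db.commit()

      # Roll-forward section: retried on recovery, never silently skipped.
      backend.put(new_name, queued)
      db.execute("UPDATE intents SET state = 'done' WHERE target = ?", (new_name,))
      for name in old_names:
          backend.delete(name)
          db.execute("UPDATE intents SET state = 'done' WHERE target = ?", (name,))
      db.commit()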

It’s not that hard when you just make it a habit to do things in very clear, isolated blocks which can be reverted or retried.

With the steps above, even in the case where someone clears the upload queue, there’s no data loss. It does mean there’s a need to check whether the uploads were completed and whether the deletes were completed. If the uploads weren’t completed and the upload files are lost, then we revert, failing the roll-forward section. If the uploads can be resumed or the deletes redone, then it’s all good and we continue forward.

A simple solution would of course be to always roll back unless the deletes of the old volumes already took place. - A simple solution instead of an efficient one. Then the compaction would just be retried.

Sometimes it’s necessary to create a separate table for “pending updates”, just like there’s the queue directory and so on. Way too often, developers modify the actual data directly, causing problems if commit is called (or something is rolled back). Also, keeping transactions short and clean helps tremendously; it also reduces lock contention.

Search for “database disk image is malformed” at duplicati.com and Issues · duplicati/duplicati · GitHub
The problem with having 4 million backups run per month is that unlikely things arise every now and then.
How To Corrupt An SQLite Database File is SQLite’s official position, but diagnosing user issues is tough.

One time when I was buying a hard drive, I noticed an SMR drive’s data sheet not guaranteeing integrity if power was cut. I was avoiding SMR for other reasons, but I definitely don’t want any easy drive corruption.

This is more common. The “database disk image is malformed” is a low-level corruption of the database.

Thanks for the thought. I don’t know how that compares to the current design. Feel free to look if you wish.

This is one of my worries. Duplicati handles tables directly, and tries to control commit time, but a commit may come from a different thread. This is all my speculation though. I haven’t read through all of the code.

You can see the backup startup here, and Channel Pipeline helps understand the parts and the data flows.

Nice quote; I was going to refer to that same post earlier, when you were afraid of ACID database transaction malfunction / commit loss.

In my case, luckily, 99% of the backups are run on enterprise hardware with disk checksums, triple-disk data redundancy (using parity), ECC RAM, etc. On top of that, the servers run databases and do encryption and compression all the time. So if the hardware were broken, the situation would be totally … Yeah, Duplicati is just a very, very small part of it.

From my logs it’s pretty clear that that’s most likely the problem: state updates and commits / recovery aren’t working as they’re supposed to.

Getting back to this once again, now with 100% confirmed fresh data. This is the content of the manifest files which are lost… It’s totally … Yeah, totally fked…

{"Version":2,"Created":"20211028T053227Z","Encoding":"utf8","Blocksize":102400,"BlockHash":"SHA256","FileHash":"SHA256","AppVersion":"2.0.5.114"}

This just confirms that basic things are unfortunately fatally flawed. In other words: normal, broken, unreliable software. These kinds of obvious logic flaws lead to data loss and corruption, and to unrecoverable (or hard to recover) situations.

You lost some manifest files? From which type of files? How do you show what was lost?

Here’s yours:

{"Version":2,"Created":"20211028T053227Z","Encoding":"utf8","Blocksize":102400,"BlockHash":"SHA256","FileHash":"SHA256","AppVersion":"2.0.5.114"}

Here’s mine:

{"Version":2,"Created":"20211016T182135Z","Encoding":"utf8","Blocksize":102400,"BlockHash":"SHA256","FileHash":"SHA256","AppVersion":"2.0.6.100"}

I’m no manifest file wizard, but yours look OK. Can you clarify why you cite manifest files?

The manifest files were “lost” due to the compaction process being aborted. Duplicati itself deleted them. That’s just what we’ve been covering in the several threads. Repair recreates them, and then the next compaction deletes them again.

Yet that “issue” stops backups from working until repair is run. I’ve been considering adding auto-repair to my script, but I would generally say that it’s an extremely bad way of doing things and might leave some other, more serious issues uninvestigated.

That’s exactly why there should, and must, be a clear logical recovery process which does clean recovery from error situations. Yet I’ve seen countless programs violate this basic rule, like Duplicati.

But if we look at it the other way around, normal backup operations seem to be working very well now. It’s just the compaction which is slightly broken and, afaik, should get fixed. I think we’ve spent 10x the time talking about this compared to the time needed to fix it by someone who knows the source and the operation of the program. But this seems to be a pretty common issue with many projects.

Just saying that compaction isn’t currently safe, and shouldn’t be used, if anyone cares about data integrity.

Let me see if I can puzzle out what “lost” means, or please clarify rather than throwing it in double quotes.

In this topic, manifest files suddenly jumped in yesterday. Manifest files are never standalone files deleted from destination the way a dblock, dindex, or dlist file would be. They’re a file included in all such file types. The other topic where you mentioned manifest also showed one here without any explanation about “lost”.

Sorry, I have no idea why you’re pointing to manifest files. Do you see a manifest file problem or a problem with some other type of file? Got error messages? Got logs? Got the time history of what happened when?

Repair should never recreate a manifest file without a containing file, and compact should never delete one. These are internal files. Have you found a problem at the file internals level, or is it a dblock/dindex problem which seems more likely based on the known issues? You can view a log file (live log is fine) for file names.

Repair also has a limited ability to recreate destination files. It can do dindex and dlist, but dblock is tougher.

If you kept a better log file, look in it. See if you see a complaint with “Missing file” in its text. If there is some history from the previous backup, post that to show how it ended. Issue might be found on the next backup.

If you see a manifest file named in your log, please post it. I don’t think you will though. They’re internal files.

Have you been leaving the problem system alone, or have you already done things to it? A look by you (or a bug report) would have been nice. If the system has been left alone, please copy the DB or get a DB bug report snapshot. Saved log files are a more reliable view, but go well with a DB snapshot to look for inconsistencies, such as a dindex file actually being deleted but the DB then forgetting about it via rollback. That open issue was referenced before.

Confirmed with the latest canary version. A 100% obvious bug:

– Server side log –
File was first added

2022-02-24T23:24:57.136Z >> [FTP Session 7226] SIZE /duplicati-id32b2f754a1a4aa2bc1019c3a7962731.dindex.zip.aes
2022-02-24T23:24:57.214Z >> [FTP Session 7226] STOR duplicati-id32b2f754a1a4aa2bc1019c3a7962731.dindex.zip.aes
2022-02-24T23:24:57.870Z >> [FTP Session 7226] SIZE duplicati-id32b2f754a1a4aa2bc1019c3a7962731.dindex.zip.aes

Then compaction deleted the file
2022-03-14T00:43:45.826Z >> [FTP Session 8959] DELE duplicati-id32b2f754a1a4aa2bc1019c3a7962731.dindex.zip.aes
Then Duplicati hung due to “missing files”, which is obviously a bug…
Then repair fixed the issue by uploading the index file back…

2022-03-25T09:42:34.825Z >> [FTP Session 8295] SIZE /duplicati-id32b2f754a1a4aa2bc1019c3a7962731.dindex.zip.aes
2022-03-25T09:42:34.965Z >> [FTP Session 8295] STOR duplicati-id32b2f754a1a4aa2bc1019c3a7962731.dindex.zip.aes
2022-03-25T09:42:35.028Z >> [FTP Session 8295] SIZE duplicati-id32b2f754a1a4aa2bc1019c3a7962731.dindex.zip.aes

Very clearly broken by bad implementation, not a back-end durability issue.

Edit continued:

I checked more logs; there is actually a similar problem with multiple backup sets as well, which is, well, just more proof that it’s logically and systemically broken.

Similar to below, but does the timing fit, meaning was the “missing files” just after compact?
You can also check job logs, system logs, maybe FTP logs, to try to look for an interruption.
Error during compact forgot a dindex file deletion, getting Missing file error next run. #4129