Any way to recover backup (repair and purge-broken-files don't help)?

That section has been in the documentation since at least 2010 (feel free to find the commit that added it if you want to be sure; lang.in is the relevant file). I think that should be old enough, and a page from 2008 made no comment about selection within a group, so it may or may not have been supported for longer.

The GetBlocklists function is supposed to return all the blocklists stored in a certain volume. I don’t see any relevance for a check here; that would be done in VerifyConsistency. The blocksets are not really duplicates; they only share some data, which is intended. The check in repair is to make sure that the generated blocklist matches the expected hash, and that will still be done.
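
As an aside for readers, here is roughly what that check amounts to, assuming my reading of the format is right (a blocklist is the concatenation of the raw 32-byte block hashes, and its hash is the SHA-256 of that, Base64-encoded in the database). A Python sketch, not the actual C#:

    import base64
    import hashlib

    def blocklist_hash(block_hashes_b64):
        # Concatenate the raw 32-byte block hashes in order, then SHA-256 the result
        # and Base64-encode it, which is my reading of how a blocklist hash is formed.
        blocklist = b"".join(base64.b64decode(h) for h in block_hashes_b64)
        return base64.b64encode(hashlib.sha256(blocklist).digest()).decode()

    def check_blocklist(expected_hash_b64, block_hashes_b64):
        # Repair-style check: regenerate the blocklist hash, compare to the expected one.
        return blocklist_hash(block_hashes_b64) == expected_hash_b64

    h1 = base64.b64encode(hashlib.sha256(b"block one").digest()).decode()
    h2 = base64.b64encode(hashlib.sha256(b"block two").digest()).decode()
    print(check_blocklist(blocklist_hash([h1, h2]), [h1, h2]))   # True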

That won’t work because BlocksetID and Index need to be from the same row.

Bad phrasing on my part. I guess it would be unique blocksets (and Duplicati will insist on that) with a shared blocklist defining shared data at their ends. Alignment also matters due to fixed-block dedup.

One of my pet peeves about the code is that comments on function inputs, processing, and outputs can be lacking.
Looking at the code and its callers for a while can get you there, but it’d be nicer if it were a little less difficult.

I guess the argument (which seems to hold true in the bug report DB) is that there’s no point in checking a duplicate (how do I say this concisely?) blocklist usage for correctness by looking into its tables. It isn’t much of a gain (if any), because the block hashes in the second column should be the same in order to generate the same blocklist hash; that’s how hashes work. But is it possible for the BlocksetEntry table to have a problem, undetected elsewhere, that would show up if each blockset use were checked?

Possibly this is too ambitious, and I’m not too sorry if it gets lost. Just saying it might be a worse check.

6. Aggregate Queries Can Contain Non-Aggregate Result Columns That Are Not In The GROUP BY Clause in “Quirks, Caveats, and Gotchas In SQLite” has some interesting notes on SQLite behavior.

Option to make SQLite stricter with non-aggregate columns is a recent debate on the pros and cons.
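
For anyone who hasn’t run into the quirk itself, here is a minimal sqlite3 sketch (made-up table, nothing to do with Duplicati’s schema) showing what a bare column does in a GROUP BY query:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE t (grp TEXT, val INTEGER, note TEXT);
        INSERT INTO t VALUES ('a', 1, 'first'), ('a', 2, 'second'), ('b', 3, 'only');
    """)
    # "note" is neither aggregated nor in GROUP BY; SQLite picks it from some row
    # of each group instead of raising an error the way stricter databases do.
    print(con.execute("SELECT grp, COUNT(*), note FROM t GROUP BY grp").fetchall())
    # With MAX(val) present, SQLite documents that the bare column comes from
    # the row holding the maximum, which is the one useful form of the quirk.
    print(con.execute("SELECT grp, MAX(val), note FROM t GROUP BY grp").fetchall())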

As far as I can tell, the only other option would be to move more logic into C#: distinguish by BlocksetID + Index (because the same blocklist could appear multiple times in a blockset, which might also make a fun test case), and use C# to remove duplicate hashes from the result enumerable.

Printing out the BlocksetID made it very easy to see the run-together 13 at the end, but as I think you’re suggesting, it’s more than just looking for contiguous repeats. In the topic DB, the first of 14 is far before repetitions start at the end. I’m not sure it’s worth checking reuse, especially since the common case will probably be what’s seen here – no bizarre stuff like sticking a new blocklist’s worth of data in the middle.

I’m tempted to favor a somewhat SQLite-flavored query (other databases have other ways to do it) and see what happens. If this error keeps coming up, then we’ll know some other damage is happening.

We’re not hearing much from a key person yet, but I still like the idea of a Canary for this and VACUUM.
PRAGMA secure_delete might also work, according to the page, and would need less temporary space.
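
For reference, both are one-liners against the local database; a sketch with a placeholder path, not a recommendation of one over the other:

    import sqlite3

    con = sqlite3.connect("backup-database.sqlite")   # placeholder path

    # Overwrites deleted content as it goes; needs little extra space,
    # but makes future deletes a bit slower.
    con.execute("PRAGMA secure_delete = ON")

    # Rebuilds the whole file, dropping old content, but temporarily needs
    # roughly the size of the database in free space.
    con.execute("VACUUM")
    con.close()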

EDIT:

I don’t know all the checks of BlocksetEntry table, but VerifyConsistency tests blockset sanity and more.

There’s something to be said for code to stick to its job. A dindex rebuild isn’t meant to be a full check.

I want to look at this query / discussion more later in case I have some ideas, but right now what has occurred to me is that ‘GROUP BY hash’ might be extremely risky?

I am making a lot of assumptions here (as I don’t know how Duplicati operates internally) but – in a fully general case – hash collisions are a thing that happens.

So running ‘group by hash’ might eliminate valid results – can someone comment on whether that would be a concern?

I doubt it. It would be a concern if Duplicati used, say, a 16-bit checksum like an old protocol such as TCP uses.
The Hash collision article thinks MD5 can be attacked quite quickly, but we’re not in an attack situation here.
The article for some reason stops at 160-bit SHA-1, but SHA-2 is better. Duplicati uses SHA-256.
SHA-3 exists, but it sounds like it’s mainly there to have something different as a just-in-case alternative.

You can look at the references to hash in that, or read the schema with a tool, or refer to Schema.sql.
Every hash there is a SHA-256, so if collisions were a problem (one has never been found), a huge number of things would be prone to break. Interestingly, there’s a claim that SHA-512 might work faster:

but converting an existing backup to it is impossible. There’s also more storage space required.
You can do your own research to see if a SHA-256 collision has ever been found. I haven’t found that it has.
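
The speed claim is easy to spot-check on a given machine; SHA-512 processes larger chunks per round, so it often wins on 64-bit CPUs that lack SHA-256 hardware instructions. A throwaway sketch:

    import hashlib
    import timeit

    data = b"x" * 102400   # one default-size block

    for algo in ("sha256", "sha512"):
        seconds = timeit.timeit(lambda a=algo: hashlib.new(a, data).digest(), number=2000)
        print(algo, round(seconds, 3), "seconds for 2000 blocks")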

Having said that, though, there are times when Duplicati also checks lengths for unknown reasons.

Maybe it is faster with Size in the index? I don’t know, but maybe we could match the index better here.
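
One cheap way to look at the index question is EXPLAIN QUERY PLAN; the table and index below are just stand-ins I made up for the sketch, not Duplicati’s actual schema:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript('''
        CREATE TABLE "Block" ("ID" INTEGER PRIMARY KEY, "Hash" TEXT, "Size" INTEGER);
        CREATE UNIQUE INDEX "BlockHashSize" ON "Block" ("Hash", "Size");
    ''')
    # Matching both columns lets SQLite use the whole composite index.
    print(con.execute('EXPLAIN QUERY PLAN SELECT "ID" FROM "Block" '
                      'WHERE "Hash" = ? AND "Size" = ?', ("abc", 42)).fetchall())
    # Matching only Hash still uses the index, but only its leading column.
    print(con.execute('EXPLAIN QUERY PLAN SELECT "ID" FROM "Block" '
                      'WHERE "Hash" = ?', ("abc",)).fetchall())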

Regarding performance, my non-SSD system takes 6 or 7 minutes (before caches warm up) to do a query such as the one we’ve been playing with on the bug report, which is one reason I like any chance of a speed-up.

Technically there’s probably another one, which is to not use a star wildcard in the subquery, because 2023-05-16 (3.42.0) stopped computing unused columns in queries. 2023-09-11 (3.43.1) fixed a bug:

Distinct subquery with UNION ALL gives unexpected result
and I hope some random SQLite build in a Linux distro doesn’t bite us on that someday. I guess we’ll find out. Meanwhile, we might be losing some performance by computing things we don’t need, but it’s not a lot. One never knows for sure what the effect will be on the entire query, and no, I haven’t benchmarked it.

From a style point of view, it would seem better to say which columns you want, and it avoids questions like which Hash column the user of the subquery gets when they say “A”.“Hash”. There are three:

but admittedly (if one looks a bit down) there’s a WHERE which makes sure their values are the same.

While I do not know why, my suspicion is that a hash collision probability could be higher when the size of the block is not always the same.

If one had an attacker trying to produce a collision, letting them change the data size would make it easier; however, I’m not finding that anyone has ever found a SHA-256 collision. Its search space is:

>>> print(2 ** 256)
115792089237316195423570985008687907853269984665640564039457584007913129639936

(first time I’ve tested the Python 3 claim of unlimited integer precision – which is sometimes desired)

Opinions that I can find on your thought include the following. I don’t know who’s really an expert and who isn’t:

Are hash collisions with different file sizes just as likely as same file size? gets answers that seem to vary from “no” to perhaps reducing the above astronomically large number by 5 digits for the default 102400-byte block size.

When hashing, do longer messages have a higher chance of collisions? is in a security context, where longer passwords are generally viewed as safer. A 1-character password is too easy to brute-force…

Why haven’t any SHA-256 collisions been found yet? is from a crypto group, and probably makes it irrelevant whether hash collisions are higher or lower given varying block size. It’s low risk either way.
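
To put a number on “low risk either way”, the usual birthday-bound estimate for a 256-bit hash puts the chance of any collision among n random hashes at about n*(n-1)/2**257, in the same spirit as the big number above:

    n = 10 ** 12                     # a trillion blocks, far more than any backup here
    print(n * (n - 1) / 2 ** 257)    # roughly 4e-54, i.e. not going to happen by chance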

EDIT 1:

Although I think we’re more in a random-chance situation than an attack situation when it comes to finding a collision,
Rainbow table is how one finds weak short passwords really easily, and why a salted hash is preferred. Beyond that, people go out of their way to use many rounds of slow algorithms to deter the guessing:

How many rounds of hashing is enough for a password manager?

EDIT 2:

Why doesn’t Bitwarden officially recommend Argon2? is another example of algorithm versus rounds.
Fortunately, Duplicati just uses the SHA-256 (maybe augmented by Size) as its way to identify things.
The exception to that is the GUI login password, but that’s quite far out of scope for a dindex blocklist.

My first hunch was that relying on a hash to properly and uniquely identify a huge block of data is a terrible idea, and that I should consider no longer trusting Duplicati at all (this comes, e.g., from a background of using HashMaps for ages).

But I did check the links you’ve posted (huge thanks, btw!) and now I think that you/Duplicati may well be right and that it’s perfectly fine in the ‘real world’ due to the size and methodology of the SHA-256 calculation.

there is also this one:

commit 796459df5d20718fa89d045da9ba781d39251411
Author: Kenneth Skovhede <kenneth@hexad.dk>
Date:   Wed May 14 14:44:56 2014 +0200

    Fix for hash collisions with different size inputs

I’d say that it pushes toward the hash collision hypothesis for this code. I don’t know if that makes it more true; however, the intention seems clear.

I’ve never used one, but I’ve used languages that store key-value pairs in a hash table, where the goal is to spread the data quickly across some number of available buckets and deal with any collisions that occur because you were trying to map into a small number of buckets. Hash usage here is completely different.

Interesting finding. For a block, one can get extremely short ones, down to 1 byte (technically there’s a hash for the zero-byte block that doesn’t really get stored), but a blocklist or a hash is at least 32 bytes.
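
As a footnote on that zero-byte block, its hash is easy to reproduce, and it also shows what “augmented by Size” looks like as a lookup key. A hypothetical illustration, not Duplicati’s code:

    import base64
    import hashlib

    # SHA-256 of the zero-length block, Base64-encoded the way hashes look in the DB.
    empty_hash = base64.b64encode(hashlib.sha256(b"").digest()).decode()
    print(empty_hash)   # 47DEQpj8HBSa+/TImW+5JCeuQeRkm5NMpJWZG3hSuFU=

    # Keying by (hash, size) rather than hash alone, as the 2014 commit message hints.
    key = (empty_hash, 0)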

I don’t think it’s relevant to the bug we’re trying to fix, but can you comment on the proposal and its rollout?

Not yet, except of course that it’s a great search and an excellent proposal. I’d need to look more closely at the code to see whether it would be best to change the C# code along with the SQL, as said by @Jojo-1000.
The only nit I have is that I don’t like the use of ‘*’ in this SQL.
If I had any part in a fix, the first thing I would do is replace it with BlocklistHash.“Index”, but it’s a detail, a style thing.

was my picking on that, with a few more notes than that nit actually required.

Thank you, and it seems like we have some time to consider things even more.
Despite four people going at the SQL, we might still have missed something, so
giving any release some test exposure before going to Beta seems a good plan.

The increased difficulty with bug reports also impedes debugging until that’s in Beta,
although we could ask for a Beta hotfix just to get that put out if we really had to.
Meanwhile, I’m pretty sure there’s an awkward dance one can do to run a VACUUM.

Basically, this potentially triggers the release planning on Canary(ies?) and Beta.

The code intention must be seen in context.

(and the fix looks like a string concatenation)

might be a bug fix to the prior memory cache:

On the way out (maybe it wasn’t worthwhile?):

I’m sticking with the idea that SHA-256 has no collision problem, until solid evidence appears.
If anyone says 256 bits is too few, look at the big number above. Three similar programs chose 256 bits.

it seems so:

commit 7e10ebbc07eb2fa8ba232c7805dc41643e410721 (tag: v2.0.0.67-2.0.0.67_preview_2014-10-31)
Author: Kenneth Skovhede <kenneth@hexad.dk>
Date:   Fri Oct 31 15:21:54 2014 +0100

    Version bump to v2.0.0.67-2.0.0.67_preview_2014-10-31

commit ef04c2652350e0a7f356cf11fb872b747e83ff3c
Author: Kenneth Skovhede <kenneth@hexad.dk>
Date:   Fri Oct 31 15:14:30 2014 +0100

    Disabling caches pr. default, use the temporary option --old-lookup-memory-defaults to re-enable caches

commit e6f68eac67335b7b2fa3d8034ee729e786b00fc1
Author: Kenneth Skovhede <kenneth@hexad.dk>
Date:   Fri Oct 31 11:01:23 2014 +0100

    Fixed one more speed issue when running with low-memory, it is now only marginally slower when not using memory caches.

commit d05f6f13246923eccd586d7b303e5cc2ad79b474
Author: Kenneth Skovhede <kenneth@hexad.dk>
Date:   Thu Oct 30 22:53:15 2014 +0100

    Improved performance when running with all in-memory caches disabled by a factor of 16.
    On a smaller dataset, the performance is now only a little more than twice as slow with a low memory footprint compared to the default with all memory caches enabled.

I wonder if the tests for ‘marginally slower’ involved a really significant number of blocks.
If you remember, I tested that database inserts were really slowing down performance, even with an SSD, once over one terabyte was backed up. On the other hand, maybe it was not realistic anyway to have a memory cache for that kind of backup.

Here are a few thoughts on things one may try when plotting out how they work together.

Right now the SQL leaves the C# to try to figure out when it’s got a full or a new blocklist.
An inability to properly recognize a new one in the run-together case is the bug we have here.
SQL has better information, so it could pass hints to reduce complicated analysis in the C#.

but then one gets a lot of BlocksetID values that one has to wade through to see what’s changed.
One can use a result column to more directly flag when the SQL moved to another blocklist:

CASE WHEN “B”.“Index” % 32 = 0 THEN “A”.“BlocksetID” END

and if a NULL is a problem, produce 0, and if that’s somehow a worry, use -1 or something.
Maximum number of block hashes in a blocklist is 32 for the test case. At defaults it’s 3200.
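
Here is a toy sqlite3 sketch of that flag column idea, using a cut-down stand-in for BlocksetEntry rather than the real query:

    import sqlite3

    HASHES_PER_BLOCKLIST = 32   # matches the test case; 3200 at default options

    con = sqlite3.connect(":memory:")
    con.execute('CREATE TABLE "BlocksetEntry" ("BlocksetID" INTEGER, "Index" INTEGER)')
    con.executemany('INSERT INTO "BlocksetEntry" VALUES (?, ?)',
                    [(13, i) for i in range(40)])

    # The extra column is NULL mid-blocklist and carries the BlocksetID exactly when
    # a new blocklist starts, so the C# side can spot the boundary directly.
    rows = con.execute('''
        SELECT "BlocksetID", "Index",
               CASE WHEN "Index" % ? = 0 THEN "BlocksetID" END AS "NewBlocklistFor"
        FROM "BlocksetEntry" ORDER BY "BlocksetID", "Index"
    ''', (HASHES_PER_BLOCKLIST,)).fetchall()
    print(rows[0], rows[31], rows[32])   # flagged at Index 0 and 32, NULL in between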

Unfortunately it looks like SQLite does not come out-of-the-box with Base64. This impedes
attempting to use GROUP BY and group_concat to build a blocklist hash in the SQL code.

One could still try building it as a JSON or other list of Base64 strings if that might help any.
Recognizing exact duplicate blocklists is easier when it’s a string and not, say, 3200 strings.

One could even boil the string down to a SHA-256 hash if desired, creating a blocklist hash
out of the ordered list of Base64 block hashes, instead of the 32 byte versions after decode.
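
If anyone wants to prototype that from the Python side before touching the C#, a rough sketch (placeholder table and hashes; note that ordering inside group_concat only became controllable with the ORDER BY argument added in newer SQLite, 3.44 if I remember right):

    import hashlib
    import sqlite3

    con = sqlite3.connect(":memory:")
    con.executescript('''
        CREATE TABLE "Entry" ("BlocksetID" INTEGER, "Index" INTEGER, "Hash" TEXT);
        INSERT INTO "Entry" VALUES (13, 0, 'aaaa'), (13, 1, 'bbbb'), (14, 0, 'aaaa');
    ''')

    # One concatenated string of Base64 hashes per blockset; order is the caveat above.
    rows = con.execute('''
        SELECT "BlocksetID", group_concat("Hash")
        FROM "Entry" GROUP BY "BlocksetID"
    ''').fetchall()

    # Hash the joined Base64 strings themselves (the simpler variant mentioned above);
    # hashing the decoded 32-byte values would need the Base64 decode SQLite lacks.
    for blockset_id, joined in rows:
        print(blockset_id, hashlib.sha256(joined.encode()).hexdigest()[:16])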

I’m still not sure if it’s worth the time and effort to check every blocklist use in BlocksetEntry.
There’s already pretty good sanity checking, although it’s not aimed exactly at this situation.

Even though I’m putting out lots of ideas here, extensive change is likely a riskier approach.

was talking about the simple plan. A bigger change merits more testing, maybe even in Canary.
As proven by this bug, real-world usage provides edge cases that we can’t easily just create.

I did try a test case of changing character 32769 to a “b”, and the dindex got a double entry:

I’m using 7-Zip because it shows me the Size better than Windows, and it has a CRC to see
whether the blocklist is (barring CRC collisions that I don’t really worry about) the same one.

It would seem a bit of work (although Duplicati does far worse) to need to collect all of them,
then figure out what the duplicates are. Possibly an ORDER BY could gather them together.

As two more style nits: a comma join is harder to read than JOIN, and join conditions in WHERE
are harder to read than using ON for the join conditions and leaving WHERE for filtering.

    AND "B"."Index" >= ("A"."Index" * 3200)
    AND "B"."Index" < (("A"."Index" + 1) * 3200)

Isn’t this always equivalent (but longer and maybe slower) to:

"B"."Index" / 3200 = "A"."Index"

and integer division goes nicely with proposed integer modulo.
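
For non-negative indexes they do select the same rows, since SQLite truncates integer division; a quick sanity check below. The one thing the range form has going for it is that it keeps “B”.“Index” bare, so an index on that column stays usable, whereas an expression like “B”.“Index” / 3200 generally doesn’t.

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute('CREATE TABLE "Pairs" ("A" INTEGER, "B" INTEGER)')
    con.executemany('INSERT INTO "Pairs" VALUES (?, ?)',
                    [(a, b) for a in range(3) for b in range(0, 9600, 100)])

    range_form = con.execute('SELECT COUNT(*) FROM "Pairs" '
                             'WHERE "B" >= "A" * 3200 AND "B" < ("A" + 1) * 3200').fetchone()[0]
    division_form = con.execute('SELECT COUNT(*) FROM "Pairs" '
                                'WHERE "B" / 3200 = "A"').fetchone()[0]
    print(range_form, division_form)   # same count from both predicates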

To me, though, the need is to fix a bug without breaking things. Beyond that, the tradeoffs get worse. Although we might have some time to work on this particular topic, there’s also this problem happening:

so I’m still hoping for a quick path to a Canary for testing, then to actually get a fix into a Beta release.
Ditto for the VACUUM; for me that works well, whereas PRAGMA secure_delete worked worse.

That’s strange, but I am unable to repro this with current (trunk) Duplicati.

You cited the original test case with a hex editor changing the first byte from 0x00 to 0x01.
Make sure you don’t change the file’s length. It should remain at its original 32769 bytes.

Similar note for the text editor case. The editor might add an end-of-line, but keep the length steady.
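
If it helps, here is one way to do the edit that cannot change the length (the file name is a placeholder for whatever the test case uses):

    # Overwrite the first byte in place; the size stays at 32769 bytes because we
    # neither truncate nor append.
    with open("testfile.bin", "r+b") as f:
        f.seek(0)
        f.write(b"\x01")    # was 0x00 in the original test case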

If you’re doing either version of the test case, there’s also SQL query output to match.
If you have line endings, it gets a little harder as we’d have to get output for that case.

I don’t think it’s been fixed, but you could certainly see what happens in the released one.

EDIT 1:

It’s also useful to save the original dindex for reference, to see if the Repair one is similar.

EDIT 2:

There may also be a chance that exact data and line endings matter, as there’s sorting.

EDIT 3:

If using text editor, there are directions around on how to suppress final line ending, e.g.

How to stop vim from adding a newline at end of file?
How to stop Gedit, Gvim, Vim, Nano from adding End-of-File newline char?

and I didn’t want to get into line endings in the test case because they’re OS-dependent.

I made this C# test case to reproduce it, if you want to try that:

If you want to go with the GROUP BY fix, I can make a PR for that.