Recreate Database Inconsistency - Unable to Repair

rcmaniac25 · August 8, 2020, 3:01am

Note: I had a bit of a screwup and never got around to resolving it… I had a backup job that was last updated Feb 2020, and then something happened: there was an error with the DB, and we all know that Duplicati doesn’t try to auto-repair those.

Cynicism aside, I (now) went to try a repair… no luck. So I went to do a rebuild. It takes about 6 days to do a rebuild (not 100% sure why; there are 18 versions and ~180GB of backed-up data, and some people have much larger backups or more versions and theirs runs faster).

When a rebuild finishes, I usually come back to it simply telling me that the database repair failed to complete. But after 3 more rebuild attempts, I finally caught the actual error. VerifyConsistency (called in the RecreateDatabaseHandler) fails with “Found inconsistency in the following files while validating database: <file, actual size ##, dbsize ##, blocksetid: ##> (2 files). Run repair to fix it”.

I went to run repair and it failed with “Failed to repair, after repair 1 blocklisthashes were missing”.

I’ve tried to dive into the SQL and figure out why repair can’t repair the DB, but it’s over my head (never was very good with SQL). What can I do, or which code should I try to poke at, to repair this without spending another 6 days on a rebuild that will end in failure? I’ve already looked at the code and saw that the repair/rebuild functions have not been touched between the latest beta and the latest canary, so unless there’s something I missed, I don’t see anything improving by running something newer.

Ideas? I am capable of code changes to run temporarily-modified libraries, but am not good with SQL.

Windows 10 x64 (Version 1909)
ServerVersionName : - 2.0.5.1_beta_2020-01-18
BaseVersionName : 2.0.3.3_beta_2018-04-02
CLRVersion : 4.0.30319.42000
CLROSInfo : {"Platform":"Win32NT","ServicePack":"","Version":"10.0.18363.0","VersionString":"Microsoft Windows NT 10.0.18363.0"}

ts678 · August 11, 2020, 3:07pm

Then you’re better at building this than I am. Neither of us is good with SQL, but browsing DB is easier.

Any more info, such as how progress bar moved? If it slowed way down around 90%, that’d mean yours went into final search for missing blocks, and was downloading all the dblock files to search. Activity can easily be seen from the logs if you have live log or log-file at Verbose level to watch the Recreate at work.

If progress never got that far, then it was just reading dlist and dindex files, and DB might have been slow.

I hope that means it took awhile to code-catch. Hopefully the rebuild attempts at least worked consistently. You’re highly determined (and not in a hurry, and can afford to download) if you did 4 attempts (24 days?!).
That might be good because getting into manual work is likely to be a pretty tough task, without guarantee.

You should probably make an exact backup copy of the destination files to ensure they don’t get damaged. Database backup is less critical, but is easy, and might allow you to try different things without a Recreate.

The good news is that you seem to have only a couple of trouble spots (2 on Recreate, 1 on Repair), and that might be possible to locate further and fix manually by either purging a few broken spots with tools, or attempting to do so directly with destination file or database manipulation. The latter are advanced work…

Disaster Recovery shows method of fixing by purging a few bad spots with tools. The trick may be to find the bad spots. You might have them on your Recreate errors. The Repair error probably needs a DB look, using DB Browser for SQLite. If you prefer, Creating a bug report will let somebody else start, but privacy sanitization will mean only you can look up the actual filenames, e.g. if you want to try a specific file purge.

Is the intent of this to get a long-running backup with valuable history backing up again? Just recovering its files is much easier, and there are two levels of tools that could be used. Fresh backup starts are also easy.

rcmaniac25 · August 12, 2020, 9:47pm

Already ahead. I have experience from a prior time trying to fix a DB where I download the whole backup destination to my NAS, run restore locally (I keep 2 backups on the NAS, one for operating on, one as untouched in case I need to restore the active one) so I don’t have to pay anything for operations (I use Backblaze B2) and let it go.

Given the length of time, I usually don’t pay attention. But my NAS gives me file-access logs and I know that it got to dblocks.

Well, it means I caught the actual error instead of coming back to an impractical number of warning/error logs where I can’t tell the real error from the consequential ones (like: “my computer won’t turn on” is the consequential error to “I have a power outage”, which is the real error).

I may have to play with purging to see if I can eliminate the files with issues (I’m fine with those losing history), but I’d otherwise like to keep backup history (even if it’s quite old now). I feel 75% of doing a backup is keeping the files, 25% is the history… but that’s still a big number. I’ve been using DB Browser to look at things. I’ll see what I can do on a bug report (the DB is ~800MB, so I don’t think GitHub would be too happy with that being uploaded).

I’ll read into the Disaster Recovery, as I probably glossed over it. My goal is to have a backup target running again with history as intact as possible. I should’ve done this months ago, but life happens.

ts678 · August 12, 2020, 11:26pm

The SQL, while hard for me to read, is easier to read if it’s not squashed on one line. I think the query is:

SELECT *
FROM (
	SELECT "N"."BlocksetID"
		,(("N"."BlockCount" + 512 - 1) / 512) AS "BlocklistHashCountExpected"
		,CASE 
			WHEN "G"."BlocklistHashCount" IS NULL
				THEN 0
			ELSE "G"."BlocklistHashCount"
			END AS "BlocklistHashCountActual"
	FROM (
		SELECT "BlocksetID"
			,COUNT(*) AS "BlockCount"
		FROM "BlocksetEntry"
		GROUP BY "BlocksetID"
		) "N"
	LEFT OUTER JOIN (
		SELECT "BlocksetID"
			,COUNT(*) AS "BlocklistHashCount"
		FROM "BlocklistHash"
		GROUP BY "BlocksetID"
		) "G" ON "N"."BlocksetID" = "G"."BlocksetID"
	WHERE "N"."BlockCount" > 1
	)
WHERE "BlocklistHashCountExpected" != "BlocklistHashCountActual"

courtesy of http://poorsql.com/. Commenting out the last line with -- shows output when no errors exist. BlocksetID can be looked up in File view to see what Path might be in trouble and need to be purged…
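A minimal sketch of how that check could be scripted instead of pasted into a DB browser, assuming Python's built-in sqlite3 module and a copy of the database. The toy tables carry only the columns the query touches (BlocksetID, plus Path on File) and are not the real Duplicati schema; the function names and sample paths are made up for illustration. The 512 is turned into a bound parameter because the right value depends on the backup's block size and hash size, and an ORDER BY is added only to make the output deterministic.

```python
import sqlite3

# The query from the post above. The hard-coded 512 is replaced by a bound
# parameter (hashes per blocklist); ORDER BY is an addition for stable output.
MISMATCH_QUERY = """
SELECT * FROM (
    SELECT "N"."BlocksetID",
           (("N"."BlockCount" + ? - 1) / ?) AS "BlocklistHashCountExpected",
           CASE WHEN "G"."BlocklistHashCount" IS NULL THEN 0
                ELSE "G"."BlocklistHashCount" END AS "BlocklistHashCountActual"
    FROM (SELECT "BlocksetID", COUNT(*) AS "BlockCount"
          FROM "BlocksetEntry" GROUP BY "BlocksetID") "N"
    LEFT OUTER JOIN (SELECT "BlocksetID", COUNT(*) AS "BlocklistHashCount"
                     FROM "BlocklistHash" GROUP BY "BlocksetID") "G"
      ON "N"."BlocksetID" = "G"."BlocksetID"
    WHERE "N"."BlockCount" > 1
)
WHERE "BlocklistHashCountExpected" != "BlocklistHashCountActual"
ORDER BY "BlocksetID"
"""


def find_mismatched_blocksets(conn, hashes_per_blocklist):
    """Return (BlocksetID, expected, actual) rows where the counts disagree."""
    params = (hashes_per_blocklist, hashes_per_blocklist)
    return conn.execute(MISMATCH_QUERY, params).fetchall()


def paths_for_blockset(conn, blockset_id):
    """Look up which paths reference a blockset (File has Path and BlocksetID)."""
    cur = conn.execute(
        'SELECT "Path" FROM "File" WHERE "BlocksetID" = ? ORDER BY "Path"',
        (blockset_id,),
    )
    return [row[0] for row in cur]


def build_toy_db():
    """Hypothetical in-memory stand-in for a database copy, for illustration."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE "BlocksetEntry" ("BlocksetID" INTEGER);
        CREATE TABLE "BlocklistHash" ("BlocksetID" INTEGER);
        CREATE TABLE "File" ("Path" TEXT, "BlocksetID" INTEGER);
        """
    )
    # Blockset 1: 5 blocks, 1 blocklist hash. Blockset 2: 3 blocks, 1 hash.
    # Blockset 3: 1 block (skipped by BlockCount > 1). Blockset 4: 2 blocks, 0 hashes.
    entries = [(1,)] * 5 + [(2,)] * 3 + [(3,)] + [(4,)] * 2
    conn.executemany('INSERT INTO "BlocksetEntry" VALUES (?)', entries)
    conn.executemany('INSERT INTO "BlocklistHash" VALUES (?)', [(1,), (2,)])
    conn.executemany(
        'INSERT INTO "File" VALUES (?, ?)',
        [("/data/big.bin", 1), ("/data/ok.bin", 2), ("/data/other.bin", 4)],
    )
    return conn


if __name__ == "__main__":
    db = build_toy_db()
    for blockset_id, expected, actual in find_mismatched_blocksets(db, 4):
        print(blockset_id, expected, actual, paths_for_blockset(db, blockset_id))
```

On the toy data, passing 4 hashes per blocklist flags blocksets 1 and 4, while passing 2 flags blockset 2 as well, so a wrong constant can inflate the number of reported problems. Against a real database it would be safer to open a copy rather than the live file.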

rcmaniac25 · August 13, 2020, 4:29am

I ran said query and got nervous because it said I had 304 issues. Double checked (as I had run the command on my own at one point) and realized the block sizes were different. Once adjusted (3200 instead of 512), I got my 2 blocksets again.

Note: my blocksize is 100KB, and it uses SHA256. So block count per blocklist = (100 * 1024) / (256 / 8) = 3200
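The arithmetic in that note can be sketched as follows, assuming a 256-bit hash (SHA-256, so 32 bytes per hash) and a block size given in bytes. The function names are made up for illustration; the second function mirrors the rounding-up expression used in the query earlier in the thread.

```python
def hashes_per_blocklist(blocksize_bytes: int, hash_bits: int) -> int:
    """How many raw hashes fit into one block of the given size."""
    hash_bytes = hash_bits // 8
    return blocksize_bytes // hash_bytes


def expected_blocklist_hashes(block_count: int, per_blocklist: int) -> int:
    """Ceiling division, same shape as (BlockCount + N - 1) / N in the query."""
    return (block_count + per_blocklist - 1) // per_blocklist


if __name__ == "__main__":
    per_list = hashes_per_blocklist(100 * 1024, 256)
    print(per_list)                                   # 3200
    print(expected_blocklist_hashes(3200, per_list))  # 1
    print(expected_blocklist_hashes(3201, per_list))  # 2
```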

My issue now is that trying to run purge via command line results in the DB trying to verify its consistency before doing a purge… which doesn’t help as it fails verification.

I ran list-broken-files just in case, but it says there are no broken files.

I also ran repair with verbose logging and it was able to repair one of the two (but doesn’t tell me which). Since it didn’t fix everything, it fails and doesn’t commit. I’ll remove the exception so I have at least one file fixed and figure out which one it is. Might make it easier to repair one than two.

rcmaniac25 · August 13, 2020, 4:49am

Err, maybe not… I don’t have the mental time to get around the signature-verified update mechanism. All I want is to run a local copy of a build, not steal someone else’s backup. I’ll figure this out another time.

ts678 · August 13, 2020, 2:53pm

Grumble. I hate it when the damage-repair tools won’t run because the backup is damaged, but I guess it makes some sense because unexpected conditions could make things worse. That’s possibly why most consistency check failures before routine operations such as backup bail out rather than keep on going…

rcmaniac25 · August 21, 2020, 8:39am

I’m back… and I got my backup back too!

So, the whole time I kept trying to do all the fixes and whatnot with the web GUI’s Commandline options. This caused some issues as the updater does signature verification, and doesn’t like my changed files. While I get the safety precaution, sometimes it’s good to give a user the ability to destroy their backup so long as they know that it’s a risky option (e.g. --do-dangerous-thing=true => “warning, if you really want to do this, set the value to ‘understandrisk’” => --do-dangerous-thing=understandrisk).
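The opt-in idea from that post could look roughly like this. It is purely a sketch: --do-dangerous-thing is the made-up example from the post, not an existing Duplicati option, and the function name, accepted spellings, and message text are assumptions.

```python
RISK_ACK = "understandrisk"
TRUTHY = {"true", "1", "yes", "on"}


def resolve_dangerous_option(value: str):
    """Return (enabled, warning) for a hypothetical --do-dangerous-thing value.

    The risky behaviour is enabled only by the explicit acknowledgement value;
    a plain "true" is refused with a warning telling the user what to type.
    """
    normalized = value.strip().lower()
    if normalized == RISK_ACK:
        return True, None
    if normalized in TRUTHY:
        warning = (
            "warning, if you really want to do this, "
            f"set the value to '{RISK_ACK}'"
        )
        return False, warning
    return False, None


if __name__ == "__main__":
    print(resolve_dangerous_option("true"))
    print(resolve_dangerous_option("understandrisk"))
```

The point of the design is that a generic "true" never enables the risky path by accident; the user has to type the acknowledgement itself.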

I upgraded to VS2019, switched to the proper tag (2.0.5.1_beta), and it was time to run Commandline from, well, the command line. The first step was learning about the AUTOUPDATER_USE_APPDOMAIN env. variable, because the debugger would only attach to the host process and not the updater-spawned subprocess (and I didn’t see any obvious “just run this, don’t use the updater” options. According to the debugger there’s like 170 of them or something, not including the “internal” one(s)).

First up: repair

The main thing I did was change FixMissingBlocklistHashes's final sanity check before committing to the DB from a UI exception to a warning message.

First run through, I went from 2 mis-sized block sets to 1. Normally, that would fail. But my change meant it committed the change. I then ran it again, and went from 1 mis-sized block set to 0. Why it didn’t get both in one run, no idea. I still have the DB backups, but I’m not sure what a dev would want, so I won’t provide too much info.

After checking for missingblocksets (RunRepairCommon) it did RunRepairRemote and found no errors.

Second up: purge

Now, that doesn’t fix verification errors before backups, so time to remove the 2 problem files I had (which were still showing up).

As mentioned before, purging was failing in verification. As commandline doesn’t produce a “filtercommand”, just a “filter”, it runs consistency verification and fails. Simple: comment the verification out.

I followed the whole execution (first via dry-run, then for real): it found all the versionsets that had the files, removed the files, updated the DB, deleted the files, checked if compact should be run (it didn’t) and… all was good.

My backup is working again and I didn’t lose any history (that wasn’t removed by the first runthrough of the retention policy).

Currently in the process of syncing all the files from the local NAS back to B2 before I switch back to that. I still have the backups of the DB (at different stages too), so if needed, I can provide them if some dev can tell me how to sanitize the DB (I have DB Browser, as mentioned earlier).

I’m wondering if the purging of invalid files should be possible with “purge-broken-filesets”, though it may be a bit smaller than a full fileset.

ts678 · August 22, 2020, 4:02pm

I was not part of the original design, and its author seems too busy to comment on the forum; however, a GitHub pull request may be a way to get a yes/no opinion on things. Pull requests for any issues are welcomed too.

rcmaniac25 · September 6, 2020, 7:24am

I’m not near a PR, but an Issue will probably be good enough.

rcmaniac25 · September 6, 2020, 8:17am

Issue created: Inconsistant Files Should be considered Broken and be purge-able with purge-broken-filesets · Issue #4310 · duplicati/duplicati · GitHub