Backup issues following NAS crash - Hash mismatch

Hi all,

I have been backing up faithfully to a FreeNAS 11.1 box for some time now, using SSH as the transport. All has been working really well, but due to an unknown issue (possibly faulty RAM) the FreeNAS server has rebooted unexpectedly, once in the middle of a Duplicati backup.

Since the last unexpected reboot, I now get this error when my backup job runs:

Failed: Hash mismatch on file "/tmp/dup-03cbf397-7527-4c2a-9558-2b7bcf84a1ea", recorded hash

I have followed through a few forum posts on this message, and I ran a verify of my backups according to How to test / verify your backups.

This results in the following output:

ErrorID: FilesetDifferences

Unexpected difference in fileset version 5: 27/09/2018 02:00:02 (database id: 109), found 19602 entries, but expected 19603

Return code: 100
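
For reference, the verify command I ran looked roughly like this (just a sketch - the destination URL and database path are placeholders for my real ones, and on Ubuntu the CLI wrapper is duplicati-cli rather than calling mono Duplicati.CommandLine.exe directly):

# Verify a sample of the remote files against the local database (placeholder URL and dbpath)
duplicati-cli test "ssh://freenas.local//mnt/tank/backups?auth-username=backupuser" 1 \
  --dbpath=/root/.config/Duplicati/BACKUPJOB.sqlite --full-result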

That moves me forward, but I’m not quite sure where to go next. I believe I probably have a corrupt ZIP file - I’m presuming an upload got truncated because of the hard reboot of the NAS. I tried running a database repair, and the next scheduled backup ran fine. Then the following one errored with the hash mismatch again, so the problem (I believe) still exists.

What is my best plan to deal with this? I accept that something will probably be lost over this and that’s fine, especially if I’m right and this is a truncated upload. I want my backups to continue from here on without erroring out each time, so at least current changes are being backed up.

Any advice appreciated.

Thanks!

Ouch! I hope you were able to verify (and replace) that bad memory!

At this point I think you pretty much just need to purge the bad version (in your case 5). I could be wrong, so feel free to wait for a second opinion - but if I’m right then hopefully this will help:


I notice your “broken” version has a fairly old date (09/27/2018) on it - it’s odd that something other than the most recent backup (version 0) would be involved in any file transfers UNLESS a compact was going on - do you think that could have been the process running during the reboot?

Thank you so much for your quick response - just running through that process now (have kicked off a list-broken-files).
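
In case it helps anyone else following along, the command I kicked off was along these lines (a sketch - the destination URL and database path are placeholders for my real ones):

# Ask Duplicati which files, if any, are no longer restorable from the remote volumes
duplicati-cli list-broken-files "ssh://freenas.local//mnt/tank/backups?auth-username=backupuser" \
  --dbpath=/root/.config/Duplicati/BACKUPJOB.sqlite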

In answer to your question, I’m not sure about a Compact operation. I have Smart Retention set for this backup, so I presume that on an almost nightly basis it has to prune the tree somehow, given the need to keep the requested number of versions. However, I’m not experienced enough with Duplicati to say whether that counts as the kind of operation that could have caused this. Thank you, however, for this amazing piece of software and the support you give through these forums - that I can say I’m truly grateful for!

Ok this is not what I expected - I was sure I had a bad file in there given all the forum posts I’ve read so far, and it would even have made sense given the crash of my NAS. However I just ran a list-broken-files from the shell with --full-result and this was the result:

MainOperation: ListBrokenFiles
BrokenFiles: []
ParsedResult: Success
Version: 2.0.4.4 (2.0.4.4_canary_2018-11-14)
EndTime: 15/11/2018 22:07:27 (1542319647) 
BeginTime: 15/11/2018 22:04:46 (1542319486)
Duration: 00:02:41.1848130

How should I proceed from here?

Hmm, I’m not sure - but I did just notice your initial hash mismatch error was in the tmp folder. You’re running Duplicati on Linux with FreeNAS as the destination, right?

Yes that’s correct - Duplicati running as root on Ubuntu 16.04 server, destination FreeNAS 11.1-U6 over SSH.

Ok, then I think the failure scenario was that the server went down while Duplicati was downloading a file (to the /tmp folder) for verification or compacting. This could cause the hash mismatch.

I suspect it’s unrelated to the FilesetDifferences error…

After looking at other cases of this in the forum, I think this is probably the least disruptive way to fix a limited problem (although we’re not sure how limited this one is…), but you probably meant The DELETE command, specifying a version - which sometimes takes special techniques to determine, because the error message uses version numbers that count up with time, whereas the rest of the UI uses versions where 0 is the latest. Some people have reported success with database Recreate, but that can take a while. How big is the backup?
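
If the DELETE route is the one you take, the shape of the command is roughly this (a sketch only - the URL and database path are placeholders, and it’s worth double-checking the version number against the GUI before running it, for the numbering reason above):

# Remove a single backup version (here version 5) from the remote store and the local database
duplicati-cli delete "ssh://freenas.local//mnt/tank/backups?auth-username=backupuser" \
  --version=5 --dbpath=/root/.config/Duplicati/BACKUPJOB.sqlite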

Although I wish there were a definitive guide to the capabilities and limitations of the tools, the list-broken-files and purge-broken-files commands are, I think, primarily based on database queries that check for files lost due to things like dindex and dblock file issues. There’s a lab-induced example of those failures at Disaster Recovery.

And, yes, the above Unexpected difference in fileset issue may or may not relate to Hash mismatch, which is a complaint about a downloaded file not having the expected content. A question is when it went bad, and an interrupted download seems less likely unless this is an old download somehow still getting complaints. Does that /tmp file date/time offer any clues? Is it known whether that exact file is somehow downloaded repeatedly?

Another possible explanation is the sampled test described below. If a compact ran from the regular backup, there might be signs in the job log. Here’s part of my log from this morning, when the automatic compact ran:

CompactResults:
DeletedFileCount: 26
DownloadedFileCount: 13
UploadedFileCount: 4

Some people focus on current backups, while others focus on keeping old versions. Getting current backups going is easy if one doesn’t care about old ones. If there are some especially critical files, those can even be backed up first, and having several smaller backups can even be good long-term, for scheduling and uptime. Possible backend damage is very slow to detect, because the only way to find damage that escapes a quick look at file sizes is to download everything and inspect it. The TEST command can do that, and such a sampled test runs routinely after each backup, as configured by the --backup-test-samples option, but forcing all files on the backend to be sampled can only be done indirectly, by giving a huge count to the test command.
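
To make that concrete, the indirect way looks something like this (a sketch with placeholder URL and database path; the count just has to exceed the number of files on the backend):

# Oversized sample count, so that (in effect) every remote volume gets downloaded and verified
duplicati-cli test "ssh://freenas.local//mnt/tank/backups?auth-username=backupuser" 999999 \
  --dbpath=/root/.config/Duplicati/BACKUPJOB.sqlite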

Difficult to say. None are ideal, but if you could describe “best” in terms of user experience, it may help guide the choice. Some options may also require quite a bit of work and technical skill. Depending on equipment, a hybrid approach is possible where a fresh backup begins, and the old one is moved to another Duplicati install to see what can be recovered.

Thanks for your detailed response - a check of /tmp reveals nothing - there are no files in there relevant to Duplicati, so whatever is happening is transient. I don’t encrypt the backups as they are only stored locally at present, so following another forum post I delved into the shell and ran this on my backup store on the FreeNAS box:

find . -name '*.zip' -exec echo -n '{}' \; -exec unzip -t -q {} \;

That resulted in one corrupt file out of about 12k 50MB zip files:

./duplicati-b46dc352cec934fcfab1f65862a176c6b.dblock.zipunzip: 1 checksum error(s) found.
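
(For anyone repeating this, a variant that prints only the archives that fail the test - instead of interleaving every filename with unzip’s output - would be something along these lines, assuming the same find and unzip are available:)

# Report only the zip volumes whose CRC test fails
find . -name '*.zip' -exec sh -c 'unzip -t -q "$1" > /dev/null 2>&1 || echo "CORRUPT: $1"' sh {} \;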

In terms of “best” - I agree it’s subjective - and I don’t think there’s anything from an old version that I need to keep hold of. I suppose I’m trying to learn good practice when working with Duplicati, and for me the “best” is to resolve the issue with minimal loss of historic backups. For example, the post that recommended the “find” command above suggested you could move the file aside and more or less continue (obviously losing whatever was in the corrupt file, if I understood correctly). In order of preference, I guess it would be:

  1. Lose a specific zip file and carry on, losing whatever material was in it.
  2. Lose a whole historic version and carry on.
  3. Lose the whole backup and restart.

As an aside, is there a way I could have detected the corrupt zip file from within Duplicati? I am aware that if I’d encrypted the backup, I couldn’t have used “unzip -t” to detect it.

When you did your test / verify above did you use the --full-remote-verification=true parameter?

My guess is the test process found the fileset difference it reported and stopped checking at that point - so if there had been no fileset error, it would have moved on to the file testing step and found the issue then.

As for moving forward, keep in mind that you don’t have to “lose” the whole backup. You could declare this backup frozen and keep it around (broken file and all) for its historical contents and just start a new one moving forward…

If it were encrypted, you could first try using either the SharpAESCrypt.exe program shipped with Duplicati or the AES Crypt for Linux program. Their integrity check sounds like it should be able to find any corruption; ZIP tools may vary. Duplicati.CommandLine.RecoveryTool.exe happily downloaded and unzipped a zip file that I’d corrupted, until I chopped a big chunk out of the middle instead of just altering bytes (first at the end, and then in various places). The 7-Zip Test caught even the first one. Perhaps ZIP programs have both permissive and pedantic modes? You could, if you like, repeat my test to see how sensitive your unzip is. If it looks good, you could try the “move aside” approach (and move it far enough aside that Duplicati cannot even see it - preferably move it totally out of that folder).
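
To sketch what decrypting a single volume for checking would look like (the install path is from my Ubuntu machine and may differ, the passphrase and file names are placeholders, and I’m going from memory on SharpAESCrypt’s mode/password/input/output argument order):

# Decrypt one .aes volume with the SharpAESCrypt tool bundled with Duplicati
mono /usr/lib/duplicati/SharpAESCrypt.exe d "my-passphrase" duplicati-bXXXX.dblock.zip.aes /tmp/check.zip
# Then CRC-test the decrypted zip
unzip -t -q /tmp/check.zip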

If the hash problem was a random old file being tested, it could come up again. Too bad you can’t see it now to check whether it’s the same file unzip -t found. I’m also uncertain what the future holds for the Unexpected difference in fileset error.

--full-remote-verification isn’t documented to say it tests any further; it simply looks deeper into the sampled files. Documentation can be wrong, and I haven’t looked at the code. Still, I wonder if the word “full” is a bit misleading, just as “repair” and “list-broken-files” are meant in a limited way. See my earlier lament about capabilities and limitations.

I actually haven’t looked at the documentation for it - I’m strictly going by @kenkendk’s post. :slight_smile:

That post agrees with the documentation that it makes it a deep check. What we needed here was a wide one, and best would be one that doesn’t download all files. See How often does duplicati test your backup archives? although this does hit the limitation that not all storage types can hash for you. See survey info.

I’m glad that FreeNAS had enough tools to do a wide check (however “deep” that check is) of the zip files, however even deeper would be --full-remote-verification if we could get it to do all files, but now I see how:

Should test / verify commands & --backup-test-samples support percentage parameter? had @JonMikelV saying “all”, and I think I tracked the code to here, where I see that “all” or “everything” just sets a huge number.

Testing “all” and “everything” on two very corrupted (but same size) dblock files gave two “Hash mismatch” errors, so I think that refutes the stopped-early theory. We can now do “full”, “wide” AND “deep” verification, provided “slow” is also permissible, and we both commented on how to start fresh while assessing damage.
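
For the record, the shape of that command is roughly this (placeholder URL and database path, using the Linux duplicati-cli wrapper):

# "all" (or "everything") samples every remote volume; --full-remote-verification also inspects contents in depth
duplicati-cli test "ssh://freenas.local//mnt/tank/backups?auth-username=backupuser" all \
  --dbpath=/root/.config/Duplicati/BACKUPJOB.sqlite --full-remote-verification=true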

Heh - that just makes me think of my favorite “scoping” quote.
“Fast, Right, or Cheap - choose two.” :smiley:

Yes - I believe you are correct: DELETE allows one to delete a specific version, while my incorrectly suggested PURGE is for removing non-restorable items from the backup. Though in this instance one would hope that PURGE would have been enough.


@jamesfreeman959, sorry to get distracted there… :slight_smile:

So at this point we know there is a corrupt file in your destination, but in theory that shouldn’t stop your backups from running. Are you getting any errors during a normal backup right now?

If so, let’s see if we can pin those down; otherwise, we can look at getting that corrupt dblock file out of there.

It’s been quiet. Are backups back to normal? I wasn’t even certain that the corrupt dblock would hurt, because there are times (e.g. when a system is shut down or a network connection fails) when a partial file might be left behind (if the storage type behaves that way), and there are cleanups for interrupted backups.

Looking at the bad dblock date may say whether it’s an old one or something that got hurt in the NAS crash. Looking for it in Job --> Show log --> Remote might also show history. Database also has harder-to-get info.
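
For example, something like the following would show the file’s timestamp on the NAS and what the local database records for that volume (sqlite3 plus the Remotevolume table/column names are my assumption about the local database schema, and the paths are placeholders):

# On the FreeNAS box: when was the suspect dblock last written?
ls -l /mnt/tank/backups/duplicati-b46dc352cec934fcfab1f65862a176c6b.dblock.zip

# On the Duplicati machine: what state and size does the local database record for it?
sqlite3 /root/.config/Duplicati/BACKUPJOB.sqlite \
  "SELECT Name, State, Size FROM Remotevolume WHERE Name LIKE '%b46dc352%';"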

If current backups are running, there might still be some latent issues to trip over during random file testing; however, any problems with filesets might eventually get removed, unless the retention setting keeps all backups.

Thanks so much for checking in - I really do appreciate it! The corrupt dblock file seemed to be causing backups to fail (that was my understanding, as backup status e-mails gave a status of “Fatal” - I could of course have been wrong about that), so I thought, as an initial test, I would move the file aside and re-create the database so Duplicati knew exactly what it still had.

That of course might have been a mistake - the creation of a new database has now been running for 4 days and isn’t complete yet. Of course, whilst it is running, backups are not, so I am beginning to wonder whether this was actually such a good idea, and whether I might have been better off either moving the backup aside and starting afresh, or trying to delete version 5 (which seemed to contain the errant file). I’m not sure what the database rebuild time should be like for my backup given an SSH transport over a local network - the progress bar in the Duplicati GUI is most of the way over to the right, but proceeding ever slower, so it’s difficult to judge the actual time remaining.

Any thoughts on how to proceed? I presume the database rebuild is simply a matter of patience, but without a way to judge estimated time remaining I’m a little edgy.

Yeah…database rebuilds should be a last resort, and we really should add a “Warning - this might take a LONG time” message to the process. :slight_smile:

Generally it is an exercise in patience - I suspect the reason it’s so long is that Duplicati has to check the database to see whether each incoming block hash already exists. As the database grows, this step takes longer and longer - which would explain the apparent stalling near the end of the process.
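
Roughly speaking - and the table and column names here are just my understanding of the local database schema, so treat this as a sketch - each incoming block ends up doing an existence check of this shape, which gets slower as the Block table grows:

# The kind of per-block existence lookup that happens during a recreate (schema names are my assumption)
sqlite3 /root/.config/Duplicati/BACKUPJOB.sqlite \
  "SELECT ID FROM Block WHERE Hash = '<base64-block-hash>' AND Size = 102400;"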

Thanks @JonMikelV - yes, I have noticed this before and unfortunately conveniently forgot it when it came to triggering this one. According to the forum it’s been 8 days since your reply, and the re-creation is still running. This of course means well over a week without a backup, so I begin to suspect that I might actually be better off simply archiving off this backup and starting again. In truth, I can’t think of anything that I have deleted/damaged in the backup set, so if I’m really honest I could probably afford to lose it entirely - it’s more about having a spare copy of data, but archival is always nice for those “oops” moments (e.g. you delete your Duplicati database… :wink: )

Well, it’s been 3 weeks since the reply where you mentioned the 8-day re-create, so I’m hoping it has finished (or was aborted) by now. :slight_smile:

If you started a fresh backup, did you restructure it at all? For example, a few smaller backups would likely be faster (at backing up and database recreates) than a single monolithic job.

And as you said, an abandoned backup isn’t useless - it can always be archived off somewhere (spare USB drive, cold storage etc.) and can still be restored from in the future (using “Direct restore from backup files”) if it is ever needed.

Actually, now that I think about it - using a decent retention policy is a great way to make sure backups don’t get too huge over time. But a similar method could be to just start a new one (say, every year) and relegate the previous year(s) to archival backups.

And of course, making a snapshot of your destination files at ANY time (not during a backup process of course) allows for creation of a usable (even without local database) backup. :slight_smile:
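
Since your destination is on FreeNAS, that can be as simple as a ZFS snapshot of the dataset holding the backup files (the dataset name below is a placeholder):

# Snapshot the dataset that holds the Duplicati destination folder
zfs snapshot tank/backups@duplicati-$(date +%Y-%m-%d)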

Thanks for checking in @JonMikelV - really appreciate that! Yes, I archived it off and have started afresh. I used it as an excuse to upgrade to Duplicati 2.0.4.5 at the same time, which seems to be working, but it also throws an error after the backup because it’s trying to back up its own database (which of course changes during the backup!). This was automatically excluded in 2.0.3.3, where I came from, and a quick trawl of Google search results suggests this is an issue on 2.0.4.4 and above for a number of users, though it’s difficult for me to say how many.

Snapshots are a great idea! I wish I had thought of that - they are going to FreeNAS so that is an ideal target platform! Also good to know I don’t need the database for a restore of an archived backup - that’s a big time saver!