Backup Test block selection logic

Good evening, I have been using the backup-test-percentage option for a little while and it's been working perfectly fine.
I picked 4% for a backup that happens every 2 weeks, which means that roughly the volume equivalent of the entire backup should be checked over a year.

4% per backup x (52 weeks per year / 2 weeks per backup) = 4% x 26 backups = 104% equivalent per year
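
As a quick sanity check, here is that arithmetic as a minimal Python sketch (the variable names are mine, purely for illustration). Note this is volume-equivalent coverage, not a guarantee that every block actually gets touched, which is exactly the question below:

```python
# Rough annual coverage estimate for the numbers above.
test_percentage = 4          # value given to backup-test-percentage
backup_interval_weeks = 2    # one backup every two weeks

runs_per_year = 52 / backup_interval_weeks            # 26 runs
annual_coverage = test_percentage * runs_per_year     # 4% * 26 = 104%

print(f"~{annual_coverage:.0f}% of the backup volume sampled per year")
```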

One thing, though, is that I cannot find anything that discusses how the blocks are selected each time the backup runs. Would anyone know? I'm assuming it's pseudo-random rather than sequential, which means that technically not all data would necessarily be checked.

It is random, but the list of files to test is selected from three groups in this preference order:

  1. Files that have never been tested
  2. If more samples are needed (depending on settings), then select from files that have been tested before, with preference towards the ones that have been tested less
  3. If still more samples are needed, then choose from the files that have been tested the most.
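
Purely as an illustration, that preference order could be sketched roughly like this in Python (this is not Duplicati's actual code, which is written in C#; the function name, the `test_count` attribute, and the overall structure are assumptions made for the example):

```python
import random

def choose_test_samples(files, samples_needed):
    """Illustrative sketch of the selection order described above.

    Assumes each entry in `files` has a `test_count` attribute recording
    how many times that remote volume has been verified before.
    """
    # 1. Files that have never been tested come first, in random order.
    never_tested = [f for f in files if f.test_count == 0]
    random.shuffle(never_tested)
    chosen = never_tested[:samples_needed]

    # 2./3. If more samples are needed, fall back to previously tested files.
    #       Shuffling first and then doing a stable sort by test_count means
    #       the least-tested files are picked first (group 2) and the
    #       most-tested ones are only reached last (group 3), with random
    #       tie-breaking among files tested equally often.
    if len(chosen) < samples_needed:
        previously_tested = [f for f in files if f.test_count > 0]
        random.shuffle(previously_tested)
        previously_tested.sort(key=lambda f: f.test_count)
        chosen += previously_tested[:samples_needed - len(chosen)]

    return chosen
```

The point is simply that sampling is random but weighted towards whatever has been verified the least, so repeated runs spread coverage out rather than drawing uniformly each time.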

Here’s the code if you’re curious:

Interesting, thanks for sharing the logic and code. Haven't read code in a while! I can see how the logic makes sense if someone chose to test the entire library on every backup. One could also choose a value that asks for more blocks than the backup contains, in which case the logic would simply stop once all blocks have been tested.

I set up the percentage mentioned in my first post for cloud storage, thinking that it might behave like a single hard drive, where errors can occur and blocks can get damaged. I have since spent a bit of time researching how safe data is in cloud storage, and I realized that covering 100% of the backup in a year is statistically overkill for cloud storage, especially when there are costs to download the data for testing.

The following Backblaze posts helped in defining durability:

Here is the math for blocks in Duplicati terms or “objects” in cloud terms:

  • As they stated in the third link, for 11 nines of durability (99.999999999%), it would take a backup with 1 million objects in B2 (which would be 50 TB with 50 MB blocks) about 10 million years to have a block go bad.
  • For a backup that contains 100,000 blocks (closer to what I have), it would take about one million years before one block becomes corrupt.
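
For anyone who wants to redo that back-of-the-envelope calculation, here is a rough sketch of the second bullet (assuming 11 nines means an annual per-object loss probability of 10^-11 and that losses are independent; the 10-million-year figure in the first bullet is quoted from the Backblaze post rather than derived from this exact formula):

```python
# Back-of-the-envelope expected time until the first lost block, assuming
# 11 nines of durability means a 1-in-10^11 chance of losing any given
# object per year, and that losses are independent.
annual_loss_probability = 1e-11      # 99.999999999% durability
objects_in_backup = 100_000          # roughly the block count mentioned above

expected_losses_per_year = objects_in_backup * annual_loss_probability
years_to_first_expected_loss = 1 / expected_losses_per_year

print(f"~{years_to_first_expected_loss:,.0f} years")   # ~1,000,000 years
```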

Either way, testing becomes futile I guess, unless I misunderstood something.

The strategy Duplicati uses by default is back-end agnostic, so yes, I could see it being less useful for back ends with very high durability. That being said, there aren't many downsides to doing the testing.

The Backend type graphs in Usage statistics for Duplicati show that most usage is not cloud storage; however, one can configure the tested amount as high or low as one feels comfortable with for one's storage. It's sometimes hard to find out how statistically reliable one's storage is, but there's more to it (see below).

Most Duplicati storage looks to be a local folder, although sometimes that's actually remote (e.g. SMB), which adds some additional failure modes (and they do get hit, though other checks might notice these).

A file could also be corrupt the instant it lands, if Duplicati uploads it wrong in the first place (that was a bug at one time).
Testing blocks in a more end-to-end fashion is more reliable because it covers more of the code path.

Interesting stats indeed, I might end up leaving it as-is then, thanks!