Verification after backup - High download volume

Bart80.2 · April 28, 2021, 7:14pm

Hi,
I have multiple backup jobs from my NAS to JottaCloud.
For most of them I have set backup-test-percentage to 5 (in the general settings); for some of the backup jobs I have set it to 10 (in the individual backup job’s settings).
However, when I compare how much is downloaded with the total backup volume, the percentage that is downloaded seems to be roughly double what it should be. I am checking this by comparing the value for “BytesDownloaded” in the full log (under “Testresults”, “BackendStatistics”) with the backup size shown under the job in the main screen.
It seems consistent - for the jobs where I put backup-test-percentage to 5%, about 10% is downloaded and for those where I put it to 10% about 20% is downloaded. It is also consistent with the download volume shown in my ISP’s traffic statistics.
Backup-test-samples is not set so this could not be the cause.
I am running Duplicati - 2.0.5.1_beta_2020-01-18 on a Synology DSM 216+ via Docker.

Is this expected behaviour? How much data is effectively being tested: the percentage that I set, or double that?

ts678 · May 1, 2021, 2:10am

Welcome to the forum @Bart80.2

The TEST command says what a sample is (it’s not one file – it’s one each of three types).
Backup Test block selection logic describes the detailed method by which files are chosen.

You can look at the size of your different backup files. The default size of a dblock is 50 MB.
There’s a smaller dindex for each dblock, and the dlist file size varies with backup file count.

One question is – what does it mean to test 10 percent of the backup? One view might say
that it means testing 10 percent of the dblock files, and because it’s a set, some other ones.
That would be the sample set count. To compute it roughly, take total file count * percent / 2.
The divide-by-two is because there is a dindex for every dblock. The question is – did the code divide?

github.com/duplicati/duplicati/blob/beaf03562fdcf4425e962085bdf7175d6a465f49/Duplicati/Library/Main/Operation/BackupHandler.cs#L299-L300

long remoteVolumeCount = m_database.GetRemoteVolumes().LongCount(x => x.State == RemoteVolumeState.Verified);

long samplesToTest = Math.Max(m_options.BackupTestSampleCount, (long)Math.Round(remoteVolumeCount * (m_options.BackupTestPercentage / 100D), MidpointRounding.AwayFromZero));

Add backup-test-percentage in addition to backup-test-samples [$25] #3296 asked for a “percentage of your backup”, but what does that mean, given the sample set idea? I’m thinking code should divide by 2.
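
To make the question concrete, here is a rough Python sketch of the two C# lines quoted above, next to the divide-by-2 variant I am suggesting. It is an illustration only, not Duplicati code: the function names are made up, the default of 1 for backup-test-samples is an assumption, and the halved variant is my idea rather than anything that exists in the source.

```python
import math


def round_half_away_from_zero(x: float) -> int:
    """Approximate C# Math.Round(x, MidpointRounding.AwayFromZero)."""
    return int(math.copysign(math.floor(abs(x) + 0.5), x))


def samples_to_test(verified_volumes: int, percentage: float,
                    sample_count: int = 1) -> int:
    """Mirror of the quoted expression: the larger of backup-test-samples
    and the rounded percentage of ALL verified remote volumes
    (dblock + dindex + dlist counted together)."""
    by_percentage = round_half_away_from_zero(
        verified_volumes * (percentage / 100.0))
    return max(sample_count, by_percentage)


def samples_to_test_halved(verified_volumes: int, percentage: float,
                           sample_count: int = 1) -> int:
    """Hypothetical variant that divides by 2 because every dblock has a
    matching dindex. This is a suggestion, not existing behaviour."""
    by_percentage = round_half_away_from_zero(
        verified_volumes * (percentage / 100.0) / 2)
    return max(sample_count, by_percentage)


if __name__ == "__main__":
    # Made-up example: 2000 verified remote volumes, 10 percent requested.
    print(samples_to_test(2000, 10))
    print(samples_to_test_halved(2000, 10))
```

Since each sample is a set of files and not a single file, the first function is the one that can end up downloading more dblock files than the percentage suggests.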

How about if you look at backend statistics and post your KnownFileCount and FilesDownloadedCount
on a backup that doesn’t also do a Compact (which would add downloads) to see how the math lands?

Bart80.2 · May 2, 2021, 11:11am

Hi, thanks for your answer.
A sample is 1 dblock, 1 dindex and 1 dlist file. All dblock files are ± the same size, and the dindex and dlist files are a lot smaller (if not negligible compared with the dblock files), so the percentage should more or less reflect the total size.
I did what you requested as a test. I enabled the option “no-auto-compact” and ran a backup job with a total size of 18,55 GB (backup data):
“KnownFileCount”: 773
“FilesDownloadedCount” does not appear in the log, instead I find “FilesDownloaded”: 83.
“BytesDownloaded”: 2046355815 => Amounts to about 2 GB which is a bit over 10% of the backup size (backend). However, percentage is set to 5.
So, still with “no-auto-compact” enabled, the downloaded size is still double.

Could there be another cause?

Bart80.2 · May 16, 2021, 1:25pm

I did some further analysis on it, checking the contents of the “remotevolume” table in the backup’s SQLite DB. I analyzed the contents of the table before and after the backup job (knowing that the data to be backed up would not have changed so essentially the only interesting thing that would happen would be the verification).
I assumed that a change in the “VerificationCount” column means that the volume was verified. (Its value could jump from 0 to 6, but I knew that when a volume is verified for the first time, its VerificationCount is set to the maximum VerificationCount over all remote volumes. I suppose that is to avoid new blocks being disproportionately favoured by the logic that decides which volumes are tested, so nothing to be surprised about here.)
I found that of the 1081 volumes of type “Blocks”, 1083 of type “Index” and 5 of type “Files”, 108 “Blocks” + 108 “Index” + 5 “Files” remote volumes were verified. This is in line with the “FilesDownloaded” number in the logs, and the total size of the downloaded volumes is exactly the same as the value in “BytesDownloaded”.
Note that both the number of files and the downloaded data are ± 10% of the total backup data, even though “Backup-test-percentage” is set to 5%.
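
For anyone who wants to repeat this comparison, here is a minimal Python sketch of the before/after check. It assumes two copies of the job database taken while Duplicati is idle, a table named “Remotevolume” with the columns “Name”, “Type”, “Size” and “VerificationCount” (only the table, “Type” values and “VerificationCount” are mentioned above; “Name” and “Size” are assumptions about the schema), and the file paths are hypothetical.

```python
import sqlite3
from collections import Counter


def verification_counts(db_path):
    """Read name -> (type, size, verification count) from one DB copy.

    Table and column names are assumptions about the local database schema.
    """
    con = sqlite3.connect(db_path)
    try:
        rows = con.execute(
            "SELECT Name, Type, Size, VerificationCount FROM Remotevolume"
        ).fetchall()
    finally:
        con.close()
    return {name: (vtype, size, count) for name, vtype, size, count in rows}


def verified_between(before, after):
    """Count volumes whose VerificationCount changed between two snapshots,
    grouped by type, and add up their sizes."""
    per_type = Counter()
    total_bytes = 0
    for name, (vtype, size, count) in after.items():
        old = before.get(name)
        if old is not None and count != old[2]:
            per_type[vtype] += 1
            total_bytes += size or 0
    return per_type, total_bytes


if __name__ == "__main__":
    # Hypothetical paths to copies made before and after the backup run.
    before = verification_counts("before.sqlite")
    after = verification_counts("after.sqlite")
    per_type, total_bytes = verified_between(before, after)
    print(dict(per_type), total_bytes)
```

The per-type counts can then be compared with “FilesDownloaded”, and the byte total with “BytesDownloaded”.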

I see similar behaviour for backup jobs where “Backup-test-percentage” is 10%: there the file counts and downloaded data amount to ± 20% of the total backup data size.

It seems to me that this is structural: if you set a “backup-test-percentage”, the verified size is systematically ± equal to (backup data size) * percentage * 2. I don’t see where the *2 comes from or why it is there.

Now that I know it, I can live with it, but I doubt that this is the desired behaviour…

ts678 · May 16, 2021, 9:21pm

This was explained earlier, including some actual source code. The code doesn’t multiply by two directly at this step, but there’s a later multiply by 2 or 3, because a sample size of 1 should be at least 1 dindex and 1 dblock:

The TEST command

A sample consists of 1 dlist, 1 dindex, 1 dblock.

however I’m assuming it’s smart enough not to re-verify the same file (maybe a rare dlist) multiple times.

So if your files were, say, 1000 dblock with 1000 dindex (one per dblock) plus a negligible number of dlist, the two lines of code would compute the 10% sample as 10% of 2000 = 200. By data volume, 200 dblock downloads, 200 dindex downloads, and maybe all of the dlist files would be dominated by 200 dblock files which would then make you ask why 20% of the destination stored bytes got downloaded, instead of 10%.
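
The arithmetic in that example can be sketched in Python as below. This is a back-of-the-envelope model, not Duplicati code: it assumes that all remote volumes count toward the percentage, that each sample fetches one dblock, one dindex and one dlist, and that no file is fetched twice. The function names are made up.

```python
import math


def expected_downloads(dblocks, dindexes, dlists, percentage):
    """Estimate how many files of each type the post-backup test fetches.

    The sample count is taken from the TOTAL number of remote volumes,
    then each sample pulls one file of every type (capped at what exists).
    """
    total = dblocks + dindexes + dlists
    samples = math.floor(total * percentage / 100 + 0.5)  # round half up
    downloads = {
        "dblock": min(samples, dblocks),
        "dindex": min(samples, dindexes),
        "dlist": min(samples, dlists),
    }
    return samples, downloads


def shares(dblocks, dindexes, dlists, percentage):
    """Return (share of dblock files fetched, share of all files fetched).

    The dblock share approximates the share of bytes, because dblock
    files dominate the stored size.
    """
    _, downloads = expected_downloads(dblocks, dindexes, dlists, percentage)
    total = dblocks + dindexes + dlists
    return downloads["dblock"] / dblocks, sum(downloads.values()) / total


if __name__ == "__main__":
    # The example above: 1000 dblock + 1000 dindex, dlist ignored, 10 percent.
    print(expected_downloads(1000, 1000, 0, 10), shares(1000, 1000, 0, 10))
    # The counts reported from the remotevolume table, at 5 percent.
    print(expected_downloads(1081, 1083, 5, 5), shares(1081, 1083, 5, 5))
```

With the reported counts (1081 + 1083 + 5 volumes at 5%) this model gives 108 samples and 108 + 108 + 5 = 221 downloads, which matches what was observed. For the earlier job, 5% of 773 files rounds to 39 samples, and 39 + 39 + 5 = 83 would fit the reported “FilesDownloaded” if that job had 5 dlist files, which was not stated.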

So if you set backup-test-percentage to 10 percent, you would download about 20% measured by volume. Measured by file count, you would download roughly 400 of the 2000 files, which is also about 20% (ignoring the dlist downloads in this example).

The help text says:

Use this option to specify the percentage (between 0 and 100) of files to test. If the backup-test-samples option is also provided, the number of samples tested is the maximum implied by the two options.

So if an unusually careful user wants to check exactly all their files, they could try that by giving a sample size of 1000. If one argues that the user should be able to give a percentage, I’d say 100% should do the same. However, 100% of 2000 files gives 2000 samples. I’m pretty sure the excess is ignored, but it’s double what is needed.
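
A small Python sketch of that point, under the same assumptions as before (sample count taken from the total file count, one dblock per sample, excess samples ignored). The function names are made up, and it only searches whole-number percentages.

```python
import math


def samples_for(total_files, percentage):
    """Sample count implied by a percentage of all remote files."""
    return math.floor(total_files * percentage / 100 + 0.5)  # round half up


def smallest_full_percentage(dblocks, total_files):
    """Smallest whole percentage whose sample count already reaches
    every dblock, or None if even 100 percent is not enough."""
    for pct in range(0, 101):
        if samples_for(total_files, pct) >= dblocks:
            return pct
    return None


if __name__ == "__main__":
    # 1000 dblock + 1000 dindex, dlist ignored as in the example.
    print(samples_for(2000, 100))
    print(smallest_full_percentage(1000, 2000))
```

In this model a setting of 50 already yields 1000 samples, one per dblock, so anything above that requests more samples than there are dblock files.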

Assuming above makes sense (and you are also encouraged to test to see if theory matches behavior) a useful path to follow would be to open a GitHub Issue making the case. This forum does not track issues, and the person who would know whether or not the behavior is as desired can be best reached in GitHub.

Add option to specify percentage of samples for verification #3582 is the developer work. Ask developers. See original feature request which was also cited previously (right under the code). You got my opinion…