Compare backup sizes of different versions?

vbs · June 2, 2024, 3:49pm

I want to better understand how backup space is used with incremental backups. As far as I understand it, Duplicati tries to reuse identical blocks of binary data in the source files as much as possible.
So only when chunks of data in the source files are modified (or added) will the backup take up more space on the backend. Otherwise, the existing blocks of data are reused.
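
To make the idea concrete, here is a minimal sketch of fixed-size block deduplication. It is not Duplicati's code: the tiny block size, the use of SHA-256, and the names `split_blocks` and `backup` are assumptions chosen for illustration.

```python
import hashlib

BLOCK_SIZE = 4  # tiny on purpose; real block sizes are far larger


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def backup(data: bytes, store: dict[str, bytes]) -> int:
    """Add unseen blocks to the store and return the bytes newly stored."""
    new_bytes = 0
    for block in split_blocks(data):
        key = hashlib.sha256(block).hexdigest()
        if key not in store:  # a known hash means the block is reused
            store[key] = block
            new_bytes += len(block)
    return new_bytes


store: dict[str, bytes] = {}
first = backup(b"AAAABBBBCCCCDDDD", store)   # every block is new
second = backup(b"AAAABBBBXXXXDDDD", store)  # only the third block changed
print(first, second)
```

In this toy model the second version only adds the one changed block to the store, which is the behavior described above.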

So, is it possible to show (by a command line or something) stuff like this:

Between two backup versions, how much data is identical and reused? This would also answer how much actual backend space was used when creating the later backup version.
Similar question regarding specific files: Let’s say I have a large 10 GB file in my backup. Is it possible to show how much additional data it needed compared to an existing older backup?
I have a large ZIP file and I would like to learn how well block reuse works for it (does the whole ZIP change if a tiny portion of the included data changes?).

I played around with the compare command, and it showed which files were added and/or modified, but I could not see any data regarding actual HDD space consumption.

Or are there any other possibilities to get some insight besides the mentioned questions?

Thank you!

ts678 · June 2, 2024, 6:49pm

Some of this probably falls into the difficult level of “something”. The database has a lot of raw data on files; however, I’m not sure what a reasonable presentation would be if a feature request were created for this.

Generally people are probably just happy that it works, rather than trying to dig deep into exact behavior.

A simple approximation is the "BytesUploaded" number in the job log’s Complete log. However, a Compact (shown in the regular job log) might throw that off if it repackages existing blocks.

Not that I know of, unless you want to go into the database. However, if you have a repetitive pattern, backing up just that file would let you see how much data is uploaded. Basically, use the first method above.

This is a question for whoever designed your .ZIP program. According to the below, its handling may vary:

If I delete a something from a zip file, can it still be recovered?

ZIP (file format) (Wikipedia) was cited in the answer above, but ultimately it likely depends on the zip program.
You could certainly test it, e.g. back up a bunch of files, then see if the size shrinks if you delete a few of them.
If it tries to shrink the file, that might be tough to deduplicate because its blocks will have moved around…
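
The concern about blocks moving around can be illustrated with a small sketch. It assumes blocks are cut at fixed offsets (an assumption about the chunking scheme), uses a toy block size and a hypothetical helper `block_hashes`, and compares an in-place overwrite with a deletion near the start of the data.

```python
import hashlib


def block_hashes(data: bytes, block_size: int = 8) -> set[str]:
    """Hash every fixed-offset block of the data."""
    return {hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)}


original = bytes(range(64))  # 8 distinct blocks of 8 bytes each
# Overwrite the second block in place; all offsets stay the same.
overwritten = original[:8] + b"\xff" * 8 + original[16:]
# Delete 2 bytes near the start; everything after them shifts.
shrunk = original[:3] + original[5:]

base = block_hashes(original)
reused_overwrite = len(base & block_hashes(overwritten))
reused_shrunk = len(base & block_hashes(shrunk))
print(reused_overwrite, reused_shrunk)
```

With fixed offsets, the overwrite keeps every untouched block reusable, while the small deletion changes the content of every later block. Whether a real ZIP file behaves like the first or the second case depends on how the zip program rewrites the archive.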

vbs · June 3, 2024, 4:34pm

Hm, one piece of information that might (?!) be easy to add to the backup result report would be uploadedBytes for every file separately. That would make it possible to see (for larger files) how much of the file had to be uploaded and how much could be reused from existing blocks. For easy reading, maybe add the source file size as well.
The report would become quite big because every file in the backup would have to be listed, but it could be made optional.

Yeah, I understand, usually you don’t have to question it if it just works. In my case, my Google Drive storage is a bit limited (200 GB) and I would like to know whether I could reduce the backup size with some simple measures.

kenkendk · June 4, 2024, 6:23am

Generally, things that are reported “per file” tend to blow up, due to the number of files.

The local database keeps track of what remote dblock files are created for a given operation. Using this information, you can see which new dblock files are created. The blocks in these can then be mapped back to the files they belong to.

It is a bit of SQL gymnastics, but I think it would be a straightforward task for someone proficient with SQL. I would prefer if this was a commandline feature as it looks like it will only be used for diagnostics.
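
A rough sketch of that mapping is below. The table and column names (`Remotevolume.OperationID`, `Remotevolume.Type = 'Blocks'`, `Block.VolumeID`, `BlocksetEntry`, `File`) are assumptions based on my reading of the local database and should be checked against the real schema. The toy in-memory database exists only so the query can run; against a real backup you would open a copy of the job's database instead.

```python
import sqlite3

# Bytes of blocks stored in dblock volumes created by one operation,
# attributed to the file paths whose blocksets reference them.
NEW_BYTES_SQL = """
SELECT Path, SUM(Size) AS NewBytes FROM (
    SELECT DISTINCT f.Path AS Path, b.ID AS BlockID, b.Size AS Size
    FROM Remotevolume rv
    JOIN Block b ON b.VolumeID = rv.ID
    JOIN BlocksetEntry be ON be.BlockID = b.ID
    JOIN File f ON f.BlocksetID = be.BlocksetID
    WHERE rv.OperationID = ? AND rv.Type = 'Blocks'
)
GROUP BY Path
ORDER BY NewBytes DESC, Path
"""

# Toy stand-in for the real database (simplified, assumed schema).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE Remotevolume (ID INTEGER, OperationID INTEGER, Type TEXT);
    CREATE TABLE Block (ID INTEGER, Size INTEGER, VolumeID INTEGER);
    CREATE TABLE BlocksetEntry (BlocksetID INTEGER, BlockID INTEGER);
    CREATE TABLE File (ID INTEGER, Path TEXT, BlocksetID INTEGER);
    INSERT INTO Remotevolume VALUES (1, 1, 'Blocks'), (2, 2, 'Blocks');
    INSERT INTO Block VALUES (1, 100, 1), (2, 100, 1), (3, 40, 2), (4, 60, 2);
    -- /a.bin: old content = blocks 1,2; new content = blocks 1,3
    INSERT INTO BlocksetEntry VALUES (10, 1), (10, 2), (20, 1), (20, 3), (30, 4);
    INSERT INTO File VALUES (1, '/a.bin', 10), (2, '/a.bin', 20), (3, '/b.bin', 30);
""")

new_bytes = db.execute(NEW_BYTES_SQL, (2,)).fetchall()
print(new_bytes)
```

Two caveats on this sketch: a new block shared by several files is counted once for each of them, and a compact that moves existing blocks into new volumes would make old data look new.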

ts678 · June 4, 2024, 11:29am

The value is from a section:

    "BackendStatistics": {
      "RemoteCalls": 29,
      "BytesUploaded": 112855294,

which deals with Destination files. I don’t think it has Source file details. There might be a similar issue gathering statistics earlier in the processing, but I’m not certain. If you look at Channel Pipeline, it has a number of processing steps, and what one would like would be one that knows each file path, and also added (not currently in backup) blocks resulting from the file path. I don’t think (not sure) that this exists.
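
To track the upload volume from run to run, the value can be pulled out of a saved copy of the log. This sketch assumes the Complete log has been saved as JSON text containing the `BackendStatistics` object shown above; the helper name `bytes_uploaded` and the sample string are illustrative only.

```python
import json


def bytes_uploaded(complete_log_json: str) -> int:
    """Return BackendStatistics.BytesUploaded from a saved Complete log."""
    log = json.loads(complete_log_json)
    return int(log["BackendStatistics"]["BytesUploaded"])


# Sample built from the fragment quoted above.
sample = '{"BackendStatistics": {"RemoteCalls": 29, "BytesUploaded": 112855294}}'
uploaded = bytes_uploaded(sample)
print(f"{uploaded / 1024**2:.1f} MiB uploaded")
```

As noted, this is a destination-side total for the whole run, so it says nothing about which source files caused the uploads.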

Better developer documentation would be nice, but the last time I looked, I thought the main backup path was:

FileEnumerationProcess.cs → SourcePaths → MetadataPreProcess.cs → ProcessedFiles → FilePreFilterProcess.cs → AcceptedChangedFile → FileBlockProcessor.cs → StreamBlock → StreamBlockSplitter.cs → OutputBlocks → DataBlockProcessor.cs → BackendRequest

which shows how BackendStatistics is pretty far removed from file paths. I think DataBlockProcessor.cs

github.com/duplicati/duplicati
Duplicati/Library/Main/Operation/Backup/DataBlockProcessor.cs (78ea50c83)

    var newBlock = await database.AddBlockAsync(b.HashKey, b.Size, blockvolume.VolumeID);
    b.TaskCompletion.TrySetResult(newBlock);

    if (newBlock)
    {
        blockvolume.AddBlock(b.HashKey, b.Data, b.Offset, (int)b.Size, b.Hint);

is the first spot that knows which blocks are new, and at this time it’s just blocks. Does it still know path?

github.com/duplicati/duplicati
Duplicati/Library/Main/Operation/Backup/DataBlock.cs (78ea50c83)

    await channel.WriteAsync(new DataBlock() {
        HashKey = hash,
        Data = data,
        Offset = offset,
        Size = size,
        Hint = hint,
        IsBlocklistHashes = isBlocklistHashes,
        TaskCompletion = tcs
    });

looks like “no” to me, but an expert developer (they are rare, which limits ability to add features) may differ.

One easy way to reduce backup size is to back up fewer folders and files, e.g. based on a Source analysis, using any number of easily available tools. This works less well if you have a lot of exclusions that the tools don’t know about. Duplicati’s reports can find big files after excludes, but they’re not very good at adding up the total of small files.
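
If no disk usage tool is at hand, a short script can add up small files per folder. This is a generic sketch, not anything Duplicati provides; the function name `size_by_top_level` is made up for the example, and it knows nothing about the backup job's exclude filters.

```python
import os
from collections import Counter


def size_by_top_level(root: str) -> Counter:
    """Total file sizes in bytes, grouped by top-level folder under root."""
    totals: Counter = Counter()
    for dirpath, _dirnames, filenames in os.walk(root):
        rel = os.path.relpath(dirpath, root)
        # Files directly in root are grouped under "."
        top = "." if rel == "." else rel.split(os.sep)[0]
        for name in filenames:
            try:
                totals[top] += os.lstat(os.path.join(dirpath, name)).st_size
            except OSError:
                pass  # unreadable or vanished file; skip it
    return totals
```

Calling `size_by_top_level("/path/to/source").most_common(10)` would then list the ten largest top-level folders, which can point to candidates for exclusion.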

Until you run The PURGE command, versions still retained will hold whatever files they originally backed up.

vbs · June 5, 2024, 7:24pm

Ahh thanks, that’s interesting. I will try to explore the DB a bit and I will see if I can make sense of it.

Ok, thanks, too bad. I analyzed my data a bit and was able to reduce the source size quite a bit. Let’s see how that affects the final size of the full backup set.

ts678 · June 5, 2024, 9:25pm

Some help:

Local database format

Database and destination internals

I can probably answer specific questions, but basically Fileset has simple version info, FilesetEntry has Files in a given Fileset, File view maps that to a Blockset, BlocksetEntry has Blocks for you to compare somehow.
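
As a starting point, the table walk described above could look like the sketch below, which counts the blocks two versions have in common. The column names (`FilesetEntry.FilesetID`, `File.BlocksetID`, `BlocksetEntry.BlockID`, `Block.Size`) are assumptions to verify against the schema documentation, and the toy in-memory database (where `File` is a plain table) only stands in for a copy of the real one.

```python
import sqlite3

# Data block IDs referenced by the files of one fileset (one backup version).
BLOCKS_OF = """
    SELECT be.BlockID
    FROM FilesetEntry fe
    JOIN File f ON f.ID = fe.FileID
    JOIN BlocksetEntry be ON be.BlocksetID = f.BlocksetID
    WHERE fe.FilesetID = {param}
"""

# Count and total size of blocks present in both filesets.
SHARED_SQL = f"""
SELECT COUNT(*), COALESCE(SUM(Size), 0) FROM Block WHERE ID IN (
{BLOCKS_OF.format(param=':newer')}
INTERSECT
{BLOCKS_OF.format(param=':older')}
)
"""

# Toy stand-in for the real database (simplified, assumed schema).
db = sqlite3.connect(":memory:")
db.executescript("""
    CREATE TABLE FilesetEntry (FilesetID INTEGER, FileID INTEGER);
    CREATE TABLE File (ID INTEGER, Path TEXT, BlocksetID INTEGER);
    CREATE TABLE BlocksetEntry (BlocksetID INTEGER, "Index" INTEGER, BlockID INTEGER);
    CREATE TABLE Block (ID INTEGER, Hash TEXT, Size INTEGER);
    INSERT INTO Block VALUES (1, 'h1', 100), (2, 'h2', 100), (3, 'h3', 100), (4, 'h4', 50);
    -- old version of the file uses blocks 1,2,3; new version uses 1,2,4
    INSERT INTO BlocksetEntry VALUES (10, 0, 1), (10, 1, 2), (10, 2, 3),
                                     (20, 0, 1), (20, 1, 2), (20, 2, 4);
    INSERT INTO File VALUES (1, '/data/big.zip', 10), (2, '/data/big.zip', 20);
    INSERT INTO FilesetEntry VALUES (1, 1), (2, 2);
""")

shared_count, shared_bytes = db.execute(SHARED_SQL, {"newer": 2, "older": 1}).fetchone()
print(shared_count, shared_bytes)
```

This only looks at file data blocks; metadata and blocklist blocks are left out, so the totals would be slightly lower than what is actually stored.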

vbs · June 6, 2024, 11:14am

Oh, that’s nice. Thanks a lot! I had already fired up DBeaver and looked around in the tables, but I had trouble understanding what I saw. I searched for some documentation on the structure but couldn’t find any, so your links are highly appreciated!

vbs · June 8, 2024, 8:13am

We are getting somewhere:

Fetching block IDs for files for operation 114...
100%|███████████| 30863/30863 [00:00<00:00, 38331.36it/s] 
Operation 114 has 489961 blocks

Fetching block IDs for files for operation 109...
100%|███████████| 30712/30712 [00:00<00:00, 38453.04it/s] 
Operation 109 has 489583 blocks

Searching for shared blocks...
100%|███████| 489961/489961 [00:00<00:00, 5508176.00it/s] 
Operation 114 share amount: 363915/489961 (74.3%)

Retrieving block size information for all shared blocks...   
100%|█████████| 363915/363915 [00:07<00:00, 49727.92it/s] 

Cumulated size of shared blocks in operation 114: 34866.73 megabyte

Krisco · October 31, 2024, 6:42pm

You’re spot on about how Duplicati reuses data blocks for incremental backups! I’ve found the CLI pretty useful for digging into this stuff. When I want to see what’s changed between backups, I use the list command—it gives me a good overview of what’s new or modified without overwhelming me with data. For specific files, I’ve played around with the compare command, especially for large files like ZIPs, and it’s interesting to see how a tiny change can impact the overall backup size. The web interface is also handy; I usually check the “Show Files” option to see what’s included in each backup version. It’s not always exact on space used, but it definitely gives me a clearer picture.

vbs · November 1, 2024, 7:46am

There is also now this UI tool for inspecting backups: