Blockset hash collisions?

crypdick · January 3, 2021, 7:28pm

I’m working on a dashboard for tracking the lineage of files. I’ve joined Duplicati sqlite3 tables (File, Metadataset, Blockset, FilesetEntry) to make a table such that each row is a unique version of a file.

I expect each row, then, to have unique combinations of (FilesetEntry.FilesetID, File.Path, Blockset.ID, Blockset.FullHash**). However, on inspection, I’m seeing thousands of non-duplicate, non-empty files being mapped to the same Blockset hash (some up to 5-8k times). These clashing files have different File.ID, filetypes, contents, but they seem to all have the exact same FilesetEntry.Lastmodified timestamp.

Am I doing it wrong? Am I misunderstanding what a BlockSet is? **Based on “Choosing Sizes - The Block Size”, it sounds like each file is stored in its own Blockset, each set being made of <=100 KB blocks/file chunks. So, each unique file version should have its own unique Blockset hash.

ts678 · January 3, 2021, 9:32pm

Welcome to the forum @crypdick

Sounds interesting. Can you say more about what that means? It would clarify what you’re trying to do. Typically I think of ancestry when I think of lineage, but I have a hard time conceptualizing that to files…

Are you doing initial development on a big backup? It would be easier to start on a small well-known one.

Hard to say without any SQL posted (and after that, it might take a better SQL person than I am, to look).

EDIT: FilesetEntry.FilesetID looks a little suspicious, because you can get a lot of these if you just back up an unchanging file again and again. Each entry will have the same LastModified, FileID, and so on from there.

kenkendk · January 3, 2021, 11:20pm

What you write makes sense. Each unique “data stream” has its own BlockSet. If you are seeing that many unrelated files point to the same blockset, my guess is that you are somehow getting the “metadata stream”, meaning a JSON representation of the items metadata. The metadata is also stored as a blockset, and it is not uncommon for many entries to have identical metadata.
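
To illustrate the difference between the two streams, here is a minimal sketch using an in-memory SQLite database. The tables are a heavily simplified stand-in for the real schema (only the columns needed for the join), and all rows and hash values are made up. It shows that joining through Metadataset.BlocksetID yields one shared hash for many files, while joining through File.BlocksetID yields one hash per distinct content.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Simplified, hypothetical subset of the Duplicati tables.
CREATE TABLE Blockset (ID INTEGER PRIMARY KEY, Length INTEGER, FullHash TEXT);
CREATE TABLE Metadataset (ID INTEGER PRIMARY KEY, BlocksetID INTEGER);
CREATE TABLE File (ID INTEGER PRIMARY KEY, Path TEXT, BlocksetID INTEGER, MetadataID INTEGER);

-- Blocksets 1 and 2 hold file contents; blockset 3 holds one metadata JSON blob.
INSERT INTO Blockset VALUES
    (1, 700, 'content-hash-a'),
    (2, 900, 'content-hash-b'),
    (3, 120, 'metadata-hash');
-- In this toy data both files have identical metadata, so they share one record.
INSERT INTO Metadataset VALUES (1, 3);
INSERT INTO File VALUES
    (1, '/movies/a.mkv', 1, 1),
    (2, '/movies/b.mkv', 2, 1);
""")

# Data stream: File.BlocksetID -> Blockset.ID
CONTENT_HASH = """
SELECT f.Path, b.FullHash
FROM File f
JOIN Blockset b ON b.ID = f.BlocksetID
ORDER BY f.Path
"""

# Metadata stream: File.MetadataID -> Metadataset.ID -> Blockset.ID
METADATA_HASH = """
SELECT f.Path, b.FullHash
FROM File f
JOIN Metadataset m ON m.ID = f.MetadataID
JOIN Blockset b ON b.ID = m.BlocksetID
ORDER BY f.Path
"""

content_rows = con.execute(CONTENT_HASH).fetchall()
metadata_rows = con.execute(METADATA_HASH).fetchall()
print("content: ", content_rows)
print("metadata:", metadata_rows)
```

If a dashboard query shows thousands of unrelated paths with one hash, checking which of these two join paths it follows is probably the first thing to try.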

The database format has a little documentation, that should make it easier to figure out what-goes-where: https://github.com/duplicati/duplicati/blob/master/Duplicati/Library/Main/Database/Database%20schema/Schema.sql

There is also a picture with the internal relations drawn:

Local database format · duplicati/duplicati Wiki · GitHub

crypdick · January 4, 2021, 12:07am

Sure. My ultimate vision is to use Duplicati for de-duplicated “cold storage”. For instance, I want to be able to upload movies to S3, delete them locally, and then retrieve them at will. However, based on this conversation:

you have to pick the right [backup] when restoring. If you make two backups and delete files inbetween then you won’t be able to see the deleted files in the newest snapshot.

So, the goal is to address my content based on the file path, rather than by snapshot: for each unique file path, I want to see the full revision history, regardless of backup date.

Yes, I am using real backups. I’ll clean up my code a bit and post later.

crypdick · January 4, 2021, 12:09am

Ah! Wish I had that ER diagram earlier, I pieced it together the hard way

ts678 · January 4, 2021, 12:15am

I guess you haven’t found my posts advising against that. Duplicati is a backup program, not an archiver intended to hold your one copy of presumably precious data. Especially don’t do that with Beta software. Usually part of the post says it’s hard to find files later, but I guess you’re trying to work around that issue.

EDIT: Regarding “de-duplicated”, it’s not likely to help much with video, unless you have exact duplicates.

ts678 · January 4, 2021, 1:40am

One question would be how much videos get revised. If you edit them, I’m not sure how well deduplication would help. Deduplication works on fixed block boundaries counted from the start of the file, so an insertion near the start would shift every later block and defeat it…
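
A small sketch of why an early insert defeats fixed-boundary deduplication. This is not Duplicati's code: the 4-byte block size, the sample data, and the use of SHA-256 here are only for illustration.

```python
import hashlib

def block_hashes(data, block_size):
    """Split data at fixed offsets from the start and hash each block."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

BLOCK_SIZE = 4  # tiny on purpose; real block sizes are far larger

original = b"AAAABBBBCCCCDDDD"
appended = original + b"EEEE"   # change at the end of the file
inserted = b"X" + original      # one byte inserted at the start

base = set(block_hashes(original, BLOCK_SIZE))
shared_after_append = len(base & set(block_hashes(appended, BLOCK_SIZE)))
shared_after_insert = len(base & set(block_hashes(inserted, BLOCK_SIZE)))

print("blocks reused after append:", shared_after_append)
print("blocks reused after insert:", shared_after_insert)
```

Appending keeps every existing block boundary, so all four original blocks are reused. The one-byte insert shifts every boundary, so none are.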

On the same note, do you ever change timestamps separately from content? The OS typically forces a timestamp change when content changes. Duplicati will consider any such change a change, but you might not care.

Minimal output would seem to map path to its backup versions, but file timestamp might also be useful, if files tend to get changed after they first show up. That seems like the basics for path, version, and where.

Sorry, I thought it was just a chunk of SQL. I’ve been playing in DB Browser for SQLite on a test database.

kenkendk · January 4, 2021, 8:14am

There is something along those lines in Duplicati (the find command):

github.com

duplicati/duplicati/blob/master/Duplicati/CommandLine/help.txt#L97-L114


      
          > duplicati.commandline.exe help find
          > duplicati.commandline.exe help list

          Usage: find <storage-URL> ["<filename>"] [<options>]

            Finds specific files in specific backups. If <filename> is specified, all occurrences of <filename> in the backup are listed. <filename> can contain * and ? as wildcards. File names in [brackets] are interpreted as regular expression. Latest backup is searched by default. If entire path is specified, all available versions of the file are listed. If no <filename> is specified, a list of all available backups is shown.

            --time=<time>
              Shows what the files looked like at a specific time. Absolute and relative times can be specified.
            --version=<int>
              Shows what the files looked like in a specific backup. If no version is specified the latest backup (version=0) will be used. If nothing is found, older backups will be searched automatically.
            --include=<string>
              Reduces the list of files in a backup to those that match the provided string. This is applied before the search is executed.
            --exclude=<string>
              Removes matching files from the list of files in a backup. This is applied before the search is executed.
            --all-versions=<boolean>
              Searches in all backup sets, instead of just searching the latest

If you scroll to the right (linebreaks are missing), you can read that it will list all versions of a file, if you give it a path.
You can then get the file version you want with the restore command and setting --version to the version you want.

But, as @ts678 notes, Duplicati was not designed for fast random access to individual files. Due to blocks being chunked into .zip containers, Duplicati will need to download more data than just the right blocks, and will need to do some decompression as well.

crypdick · January 4, 2021, 8:26pm

Thanks for all the great info everyone. My files are:

largish (0.7-5GB) files
infrequently accessed
not precious
immutable (I do not plan to edit these videos)

Over the years I have created multiple HDD backups and some large files have multiple copies. I suspect each time I pasted files the timestamps got updated. @ts678 is there a way for Duplicati to ignore timestamps when hashing files?

@kenkendk I’m assuming that Duplicati tries to group blocksets from the same file in the same containers. So if I were fetching a 5 GB file stored in 50 MB containers, in the worst-case scenario I’d have to fetch <100 MB of unrelated chunks, right?

ts678 · January 4, 2021, 9:38pm

Possibly not quite as you want, but the question is a bit vague.

Timestamps affect the decision of whether to read through the file to look for changes and hash it. Sometimes files are kept open (databases might do this), so the timestamp might not reflect the contents.
In that case there is an option to scan the file anyway, but if you mean not scanning an unknown new file, that would result in an incomplete backup as files are added. New files need to be processed. An exact duplicate would be fully deduplicated, so it would not need any new content blocks. It might need new metadata…

Testing with File Explorer on Windows showed a copy and paste kept Modified time, got new Created.

You can experiment with things yourself. Here’s a test where I copied a file and made another backup:

File table
ID  Path                                BlocksetID  MetadataID
2   C:\backup source\short.txt          4           2
3   C:\backup source\short - Copy.txt   4           3

FilesetEntry table
FilesetID   FileID  Lastmodified
3           2       637453191320521051 2021-01-04 01:05:32
4           2       637453191320521051 2021-01-04 01:05:32
4           3       637453191320521051 2021-01-04 01:05:32

Fileset table
ID  OperationID VolumeID    IsFullBackup    Timestamp
3   3           6           1               1609722340 2021-01-04 01:05:40
4   4           10          1               1609793048 2021-01-04 20:44:08

It’s the same content, so it can use the same blockset. The Created time changed, so the metadata changed, and the copy therefore gets a new entry in the File view. FilesetEntry, however, only has the Modified time, so it keeps the original time.
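
The two raw time formats in the tables above can be decoded as follows. This is a sketch that assumes Lastmodified is a .NET-style tick count (100 ns units since 0001-01-01) and Fileset.Timestamp is Unix seconds, both in UTC, which matches the human-readable values shown next to them. The helper names are my own.

```python
from datetime import datetime, timedelta, timezone

def ticks_to_datetime(ticks):
    """Convert 100-nanosecond ticks since 0001-01-01 to a naive datetime (assumed UTC)."""
    return datetime(1, 1, 1) + timedelta(microseconds=ticks // 10)

def unix_to_datetime(seconds):
    """Convert Unix seconds to a naive datetime in UTC."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc).replace(tzinfo=None)

last_modified = ticks_to_datetime(637453191320521051)  # FilesetEntry.Lastmodified
first_backup = unix_to_datetime(1609722340)            # Fileset.Timestamp, ID 3
second_backup = unix_to_datetime(1609793048)           # Fileset.Timestamp, ID 4

print(last_modified.replace(microsecond=0))
print(first_backup)
print(second_backup)
```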

A simple presentation format for your viewer could perhaps not worry about blockset hash or Creation, simply showing the path, modified time, and backup time. Or if immutable and paths never reused, the modified time becomes less interesting too, but it’s your dashboard, so you get to decide the display…

EDIT:

Here’s a log line showing the information that led to the decision that a file needs to be opened for examination:

2021-01-04 15:44:08 -05 - [Verbose-Duplicati.Library.Main.Operation.Backup.FilePreFilterProcess.FileEntry-CheckFileForChanges]: Checking file for changes C:\backup source\short - Copy.txt, new: True, timestamp changed: True, size changed: True, metadatachanged: True, 1/4/2021 1:05:32 AM vs 1/1/0001 12:00:00 AM

Here’s the metadata for the new file. To find it, match File MetadataID to ID in Metadataset table, match BlocksetID to BlocksetID in BlocksetEntry table, match BlockID to ID in Block table, and open it in .zip

{"win-ext:accessrules":"","CoreAttributes":"Archive","CoreLastWritetime":"637453191320521051","CoreCreatetime":"637453895932425055"}
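
The lookup chain described above can be written as one query. The sketch below runs it against a toy in-memory database. Only the File row comes from the test above; the other IDs, the block hash, the block size, and the volume name are hypothetical. The extra join to a Remotevolume table (to name the .zip) is my assumption about where Block.VolumeID points.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Simplified, hypothetical subset of the tables involved in the lookup.
CREATE TABLE File (ID INTEGER PRIMARY KEY, Path TEXT, BlocksetID INTEGER, MetadataID INTEGER);
CREATE TABLE Metadataset (ID INTEGER PRIMARY KEY, BlocksetID INTEGER);
CREATE TABLE BlocksetEntry (BlocksetID INTEGER, "Index" INTEGER, BlockID INTEGER);
CREATE TABLE Block (ID INTEGER PRIMARY KEY, Hash TEXT, Size INTEGER, VolumeID INTEGER);
CREATE TABLE Remotevolume (ID INTEGER PRIMARY KEY, Name TEXT);
""")

path = "C:\\backup source\\short - Copy.txt"
con.execute("INSERT INTO File VALUES (3, ?, 4, 3)", (path,))
# Everything below is made-up sample data.
con.execute("INSERT INTO Metadataset VALUES (3, 5)")
con.execute("INSERT INTO BlocksetEntry VALUES (5, 0, 7)")
con.execute("INSERT INTO Block VALUES (7, 'hypothetical-block-hash', 137, 11)")
con.execute("INSERT INTO Remotevolume VALUES (11, 'hypothetical.dblock.zip')")

METADATA_BLOCKS = """
SELECT f.Path, b.Hash, b.Size, rv.Name
FROM File f
JOIN Metadataset m    ON m.ID = f.MetadataID          -- File.MetadataID -> Metadataset.ID
JOIN BlocksetEntry be ON be.BlocksetID = m.BlocksetID -- BlocksetID -> BlocksetEntry
JOIN Block b          ON b.ID = be.BlockID            -- BlockID -> Block.ID
JOIN Remotevolume rv  ON rv.ID = b.VolumeID           -- which .zip holds the block
WHERE f.ID = ?
ORDER BY be."Index"
"""

rows = con.execute(METADATA_BLOCKS, (3,)).fetchall()
for row in rows:
    print(row)
```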

kenkendk · January 4, 2021, 9:46pm

The way I understand the question, you can set --skip-metadata which will not store any attributes, permissions, or other metadata. The metadata has very little impact on the storage, but can create “new” versions of the files over time.

That would be the best case. Duplicati does not really do anything smart, it just adds the blocks. But since Duplicati scans files sequentially, the blocks tend to be stored sequentially in the same .zip. For video data, there is very little chance that two files will share a block, so you would most likely always hit the “best case” (at most 50MiB - 1 byte overhead).

Worst case would be blocks spread over multiple .zip files, giving you 99% overhead for each block.
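
The worst-case figure can be checked with quick arithmetic. This is a rough sketch, assuming 100 KiB blocks and 50 MiB volumes (the sizes mentioned in this thread) and ignoring compression and encryption. The function names are made up for the example.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB

def volumes_if_packed(file_size, volume_size):
    """Volumes spanned if the file's blocks sit back to back (ceiling division)."""
    return -(-file_size // volume_size)

def worst_case_overhead_fraction(block_size, volume_size):
    """Share of a downloaded volume that is unrelated data when only one block is wanted."""
    return 1 - block_size / volume_size

file_size = 5 * GIB
volume_size = 50 * MIB
block_size = 100 * KIB

packed = volumes_if_packed(file_size, volume_size)
overhead = worst_case_overhead_fraction(block_size, volume_size)

print("volumes when packed sequentially:", packed)
print("unrelated share per volume, one wanted block: {:.1%}".format(overhead))
```

With these numbers, a volume downloaded for a single block is almost entirely unrelated data, which is consistent with the "99% overhead" estimate.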

ts678 · January 14, 2021, 9:54pm

It’s been quiet lately, but here’s a chunk of SQL I wrote that might do something like what you want:

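SELECT "File"."Path"
	,"FilesetEntry"."Lastmodified"
	,MIN("Fileset"."Timestamp") AS "Timestamp" -- earliest backup holding this version
FROM "File"
JOIN "FilesetEntry" ON "File"."ID" = "FilesetEntry"."FileID"
JOIN "Fileset" ON "FilesetEntry"."FilesetID" = "Fileset"."ID"
GROUP BY "File"."Path" -- one row per path and modification time
	,"FilesetEntry"."Lastmodified"
ORDER BY "File"."Path"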
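
For anyone wanting to try this from a script, here is a sketch that runs the same join with Python's sqlite3 module, against toy tables filled with the rows from my earlier test. The tables are simplified (Fileset keeps only ID and Timestamp). The query uses MIN("Timestamp") to report the first backup containing each version and adds a count of backups, which is my own addition.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
-- Simplified stand-ins for the real tables; only the joined columns are kept.
CREATE TABLE File (ID INTEGER PRIMARY KEY, Path TEXT, BlocksetID INTEGER, MetadataID INTEGER);
CREATE TABLE FilesetEntry (FilesetID INTEGER, FileID INTEGER, Lastmodified INTEGER);
CREATE TABLE Fileset (ID INTEGER PRIMARY KEY, Timestamp INTEGER);
""")

short = "C:\\backup source\\short.txt"
copy = "C:\\backup source\\short - Copy.txt"
ticks = 637453191320521051

con.executemany("INSERT INTO File VALUES (?, ?, ?, ?)",
                [(2, short, 4, 2), (3, copy, 4, 3)])
con.executemany("INSERT INTO FilesetEntry VALUES (?, ?, ?)",
                [(3, 2, ticks), (4, 2, ticks), (4, 3, ticks)])
con.executemany("INSERT INTO Fileset VALUES (?, ?)",
                [(3, 1609722340), (4, 1609793048)])

HISTORY = """
SELECT "Path"
    ,"Lastmodified"
    ,MIN("Timestamp") AS "FirstBackup"
    ,COUNT(*) AS "Backups"
FROM "File"
JOIN "FilesetEntry" ON "File"."ID" = "FilesetEntry"."FileID"
JOIN "Fileset" ON "FilesetEntry"."FilesetID" = "Fileset"."ID"
GROUP BY "Path"
    ,"Lastmodified"
ORDER BY "Path"
"""

history = con.execute(HISTORY).fetchall()
for row in history:
    print(row)
```

Each output row is one version of one path: the path, its modified time, the first backup that contains it, and how many backups contain it.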