Blockset hash collisions?

I’m working on a dashboard for tracking the lineage of files. I’ve joined Duplicati sqlite3 tables (File, Metadataset, Blockset, FilesetEntry) to make a table such that each row is a unique version of a file.

I expect each row, then, to have unique combinations of (FilesetEntry.FilesetID, File.Path, Blockset.ID, Blockset.FullHash**). However, on inspection, I’m seeing thousands of non-duplicate, non-empty files being mapped to the same Blockset hash (some up to 5-8k times). These clashing files have different File.ID, filetypes, contents, but they seem to all have the exact same FilesetEntry.Lastmodified timestamp.

Am I doing it wrong? Am I misunderstanding what a Blockset is? **Based on “Choosing Sizes - The Block Size”, it sounds like each file is stored in its own Blockset, each set being made of <=100KB blocks/file chunks. So, each unique file version should have its own unique Blockset hash.

Welcome to the forum @crypdick

Sounds interesting. Can you say more about what that means? It would clarify what you’re trying to do. Typically I think of ancestry when I think of lineage, but I have a hard time conceptualizing that to files…

Are you doing initial development on a big backup? It would be easier to start on a small well-known one.

Hard to say without any SQL posted (and after that, it might take a better SQL person than I am, to look).

EDIT: FilesetEntry.FilesetID looks a little suspicious, because you can get a lot of these if you just back up an unchanging file again and again. Each entry will have the same LastModified, FileID, and so on from there.

What you write makes sense. Each unique “data stream” has its own Blockset. If you are seeing that many unrelated files point to the same blockset, my guess is that you are somehow getting the “metadata stream”, meaning a JSON representation of the item’s metadata. The metadata is also stored as a blockset, and it is not uncommon for many entries to have identical metadata.
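To make the data-stream versus metadata-stream distinction concrete, here is a minimal sketch in Python with sqlite3. The tables are a simplified stand-in with only the columns needed for the join, and the paths and hashes are made up for illustration; check the column names against Schema.sql before running anything like this on a real database.

```python
import sqlite3

# Simplified stand-in for the Duplicati tables (assumed columns, made-up rows).
con = sqlite3.connect(":memory:")
con.executescript("""
CREATE TABLE File (ID INTEGER PRIMARY KEY, Path TEXT, BlocksetID INTEGER, MetadataID INTEGER);
CREATE TABLE Metadataset (ID INTEGER PRIMARY KEY, BlocksetID INTEGER);
CREATE TABLE Blockset (ID INTEGER PRIMARY KEY, Length INTEGER, FullHash TEXT);
INSERT INTO Blockset VALUES (1, 1000, 'hash-of-a'), (2, 2000, 'hash-of-b'), (3, 137, 'hash-of-shared-metadata');
INSERT INTO Metadataset VALUES (1, 3);
INSERT INTO File VALUES (1, '/data/a.bin', 1, 1), (2, '/data/b.bin', 2, 1);
""")

# File.BlocksetID leads to the content hash; File.MetadataID leads, via
# Metadataset.BlocksetID, to the hash of the metadata JSON instead.
QUERY = """
SELECT f.Path, content.FullHash AS ContentHash, meta.FullHash AS MetadataHash
FROM File f
JOIN Blockset content ON content.ID = f.BlocksetID
JOIN Metadataset m ON m.ID = f.MetadataID
JOIN Blockset meta ON meta.ID = m.BlocksetID
ORDER BY f.Path
"""
rows = con.execute(QUERY).fetchall()
for row in rows:
    print(row)
```

In this toy data the two files have different content hashes but the same metadata hash, which is the pattern you would see if the join went through Metadataset.BlocksetID rather than File.BlocksetID.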

The database format has a little documentation, that should make it easier to figure out what-goes-where: duplicati/Schema.sql at master · duplicati/duplicati · GitHub

There is also a picture with the internal relations drawn:


Sure. My ultimate vision is to use Duplicati for de-duplicated “cold storage”. For instance, I want to be able to upload movies to S3, delete them locally, and then retrieve them at will. However, based on this conversation:

you have to pick the right [backup] when restoring. If you make two backups and delete files inbetween then you won’t be able to see the deleted files in the newest snapshot.

So, the goal is to address my content based on the file path, rather than by snapshot: for each unique file path, I want to see the full revision history, regardless of backup date.


Yes, I am using real backups. I’ll clean up my code a bit and post later.

Ah! Wish I had that ER diagram earlier, I pieced it together the hard way :grimacing:

I guess you haven’t found my posts advising against that. Duplicati is a backup program, not an archiver intended to hold your one copy of presumably precious data. Especially don’t do that with Beta software. Usually part of the post says it’s hard to find files later, but I guess you’re trying to work around that issue.

EDIT: Regarding “de-duplicated”, it’s not likely to help much with video, unless you have exact duplicates.

One question would be how much videos get revised. If you edit them, I’m not sure how well deduplication would help. Deduplication works on fixed block boundaries from the start, so early insert would defeat it…
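A toy illustration of the fixed-boundary point, in Python. This is not Duplicati’s code, just a sketch that hashes fixed-size blocks with SHA-256 and counts how many block hashes survive an edit; the 4-byte block size is only there to keep the example readable.

```python
import hashlib

def block_hashes(data, block_size):
    """Hash each fixed-size block, cutting at offsets 0, block_size, 2*block_size, ..."""
    return [
        hashlib.sha256(data[i:i + block_size]).hexdigest()
        for i in range(0, len(data), block_size)
    ]

BLOCK_SIZE = 4  # tiny on purpose; real block sizes are far larger
original = b"AAAABBBBCCCCDDDD"
appended = original + b"EEEE"   # new data at the end
inserted = b"X" + original      # one byte inserted at the start

original_set = set(block_hashes(original, BLOCK_SIZE))
shared_after_append = len(original_set & set(block_hashes(appended, BLOCK_SIZE)))
shared_after_insert = len(original_set & set(block_hashes(inserted, BLOCK_SIZE)))

print(shared_after_append, shared_after_insert)
```

Appending leaves all four original blocks reusable, while a single byte inserted at the front shifts every boundary and leaves none.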

On same note, do you ever change timestamps separate from content? OS typically forces a timestamp change when content changes. Duplicati will consider any such change a change, but you might not care.

Minimal output would seem to map path to its backup versions, but file timestamp might also be useful, if files tend to get changed after they first show up. That seems like the basics for path, version, and where.

Sorry, I thought it was just a chunk of SQL. I’ve been playing in DB Browser for SQLite on a test database.

There is something along those lines in Duplicati (the find command):

If you scroll to the right (linebreaks are missing), you can read that it will list all versions of a file, if you give it a path.
You can then get the file version you want with the restore command and setting --version to the version you want.

But, as @ts678 notes, Duplicati was not designed for fast random access to individual files. Due to blocks being chunked into .zip containers, Duplicati will need to download more data than just the right blocks, and will need to do some decompression as well.

Thanks for all the great info everyone. My files are:

  • largish (0.7-5GB) files
  • infrequently accessed
  • not precious
  • immutable (I do not plan to edit these videos)

Over the years I have created multiple HDD backups and some large files have multiple copies. I suspect each time I pasted files the timestamps got updated. @ts678 is there a way for Duplicati to ignore timestamps when hashing files?

@kenkendk I’m assuming that Duplicati tries to group blocksets from the same file in the same containers. So if I was fetching a 5GB file stored in 50MB containers, in the worst case scenario I’ll have to fetch <100MB of unrelated chunks, right?

Possibly not just as you want, but question is a bit vague.

Timestamps affect the decision of whether or not to read through the file to look for changes, and hash it. Sometimes files are kept open (databases might do this), so the timestamp might not reflect the contents.
This case has an option to scan it anyway, but if you mean don’t scan an unknown new file, that would result in an incomplete backup as files are added. New files need to be processed. An exact duplicate would be fully deduplicated, so would not need any new content blocks. It might need new metadata…

Testing with File Explorer on Windows showed a copy and paste kept Modified time, got new Created.

You can experiment with things yourself. Here’s a test where I copied a file and made another backup:

File table
ID  Path                                BlocksetID  MetadataID
2   C:\backup source\short.txt          4           2
3   C:\backup source\short - Copy.txt   4           3

FilesetEntry table
FilesetID   FileID  Lastmodified
3           2       637453191320521051 2021-01-04 01:05:32
4           2       637453191320521051 2021-01-04 01:05:32
4           3       637453191320521051 2021-01-04 01:05:32

Fileset table
ID  OperationID VolumeID    IsFullBackup    Timestamp
3   3           6           1               1609722340 2021-01-04 01:05:40
4   4           10          1               1609793048 2021-01-04 20:44:08

It’s the same content, so it can use the same blockset. Created time changed, so metadata changed, therefore it gets new entry in File view, however FilesetEntry only has Modified, so keeps original time.
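For a dashboard, the two time columns above need different conversions. Judging by the value pairs shown in the tables, FilesetEntry.Lastmodified looks like .NET ticks (100 ns units counted from 0001-01-01) and Fileset.Timestamp looks like Unix seconds, both in UTC. A small Python sketch under that assumption (the function names are just for illustration):

```python
from datetime import datetime, timedelta, timezone

def ticks_to_datetime(ticks):
    """Convert .NET-style ticks (100 ns since 0001-01-01 UTC) to a datetime."""
    # 10 ticks per microsecond; sub-microsecond precision is dropped.
    return datetime(1, 1, 1, tzinfo=timezone.utc) + timedelta(microseconds=ticks // 10)

def unix_to_datetime(seconds):
    """Convert Unix seconds to a UTC datetime."""
    return datetime.fromtimestamp(seconds, tz=timezone.utc)

last_modified = ticks_to_datetime(637453191320521051)
backup_time = unix_to_datetime(1609722340)

print(last_modified.strftime("%Y-%m-%d %H:%M:%S"))
print(backup_time.strftime("%Y-%m-%d %H:%M:%S"))
```

The two values are taken from the FilesetEntry and Fileset rows above, so the printed times should match the human-readable times next to them.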

A simple presentation format for your viewer could perhaps not worry about blockset hash or Creation, simply showing the path, modified time, and backup time. Or if immutable and paths never reused, the modified time becomes less interesting too, but it’s your dashboard, so you get to decide the display…

EDIT:

Here’s a log line showing the information that led to deciding that a file needs to be opened for examination:

2021-01-04 15:44:08 -05 - [Verbose-Duplicati.Library.Main.Operation.Backup.FilePreFilterProcess.FileEntry-CheckFileForChanges]: Checking file for changes C:\backup source\short - Copy.txt, new: True, timestamp changed: True, size changed: True, metadatachanged: True, 1/4/2021 1:05:32 AM vs 1/1/0001 12:00:00 AM

Here’s the metadata for the new file. To find it, match the File MetadataID to ID in the Metadataset table, match its BlocksetID to BlocksetID in the BlocksetEntry table, match BlockID to ID in the Block table, and open that block in the .zip:

{"win-ext:accessrules":"","CoreAttributes":"Archive","CoreLastWritetime":"637453191320521051","CoreCreatetime":"637453895932425055"}
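The lookup chain described above (File to Metadataset to BlocksetEntry to Block) can be written as one join. This is a sketch in Python with sqlite3 against a simplified stand-in schema. The File row is the one from the test above, but the Metadataset, BlocksetEntry, and Block IDs and the block hash are hypothetical placeholders, and the real tables have more columns.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Simplified stand-in tables; only the columns needed for the chain.
con.executescript("""
CREATE TABLE File (ID INTEGER PRIMARY KEY, Path TEXT, BlocksetID INTEGER, MetadataID INTEGER);
CREATE TABLE Metadataset (ID INTEGER PRIMARY KEY, BlocksetID INTEGER);
CREATE TABLE BlocksetEntry (BlocksetID INTEGER, "Index" INTEGER, BlockID INTEGER);
CREATE TABLE Block (ID INTEGER PRIMARY KEY, Hash TEXT, Size INTEGER);
""")

path = r"C:\backup source\short - Copy.txt"
con.execute("INSERT INTO File VALUES (3, ?, 4, 3)", (path,))
con.execute("INSERT INTO Metadataset VALUES (3, 7)")     # hypothetical BlocksetID
con.execute("INSERT INTO BlocksetEntry VALUES (7, 0, 9)")  # hypothetical BlockID
con.execute("INSERT INTO Block VALUES (9, 'example-block-hash', 137)")

# File.MetadataID -> Metadataset.ID -> BlocksetEntry.BlocksetID -> Block.ID
QUERY = """
SELECT f.Path, b.ID AS BlockID, b.Hash
FROM File f
JOIN Metadataset m ON m.ID = f.MetadataID
JOIN BlocksetEntry be ON be.BlocksetID = m.BlocksetID
JOIN Block b ON b.ID = be.BlockID
WHERE f.Path = ?
ORDER BY be."Index"
"""
rows = con.execute(QUERY, (path,)).fetchall()
print(rows)
```

The block hash found this way is what you would then look for inside the .zip to read the metadata JSON.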

The way I understand the question, you can set --skip-metadata which will not store any attributes, permissions, or other metadata. The metadata has very little impact on the storage, but can create “new” versions of the files over time.

That would be the best case. Duplicati does not really do anything smart, it just adds the blocks. But since Duplicati scans files sequentially, the blocks tend to be stored sequentially in the same .zip. For video data, there is very little chance that two files will share a block, so you would most likely always hit the “best case” (at most 50MiB - 1 byte overhead).

Worst case would be blocks spread over multiple .zip files, giving you 99% overhead for each block.
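The worst-case figure can be checked with some rough arithmetic. This Python sketch assumes the sizes mentioned earlier in the thread (100KB blocks, 50MB containers, a 5GB file), treats them as binary units, and ignores compression and encryption, so the numbers are only indicative.

```python
BLOCK_SIZE = 100 * 1024          # assumed block size in bytes
VOLUME_SIZE = 50 * 1024 * 1024   # assumed container (.zip) size in bytes
FILE_SIZE = 5 * 1024 ** 3        # example file size in bytes

# Number of blocks in the file, rounded up.
blocks = -(-FILE_SIZE // BLOCK_SIZE)

# Worst case: every block sits in a different container, so fetching one
# block means downloading a whole container of otherwise unrelated data.
overhead_fraction = 1 - BLOCK_SIZE / VOLUME_SIZE
worst_case_download = blocks * VOLUME_SIZE

print(f"blocks: {blocks}")
print(f"overhead per block: {overhead_fraction:.1%}")
print(f"worst-case download: {worst_case_download / 1024 ** 4:.2f} TiB")
```

Under these assumptions the unrelated share of each container is a little over 99%, in line with the figure quoted above.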


It’s been quiet lately, but here’s a chunk of SQL I wrote that might do something like what you want:

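-- One row per (Path, Lastmodified); MIN picks the earliest backup that contains that version.
SELECT "File"."Path"
	,"FilesetEntry"."Lastmodified"
	,MIN("Fileset"."Timestamp") AS "Timestamp"
FROM "File"
JOIN "FilesetEntry" ON "File"."ID" = "FilesetEntry"."FileID"
JOIN "Fileset" ON "FilesetEntry"."FilesetID" = "Fileset"."ID"
GROUP BY "File"."Path"
	,"FilesetEntry"."Lastmodified"
ORDER BY "File"."Path"
	,"FilesetEntry"."Lastmodified"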
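If you want to try a query of this shape without touching a real backup database, here is a small Python sketch that loads the test rows shown earlier in the thread into an in-memory SQLite database. The tables are cut down to the columns the query needs, and taking MIN of the Timestamp (so each version reports the first backup that contains it) is my own choice rather than anything Duplicati prescribes.

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Cut-down stand-ins for the real tables; only the joined columns are kept.
con.executescript("""
CREATE TABLE File (ID INTEGER PRIMARY KEY, Path TEXT, BlocksetID INTEGER, MetadataID INTEGER);
CREATE TABLE FilesetEntry (FilesetID INTEGER, FileID INTEGER, Lastmodified INTEGER);
CREATE TABLE Fileset (ID INTEGER PRIMARY KEY, Timestamp INTEGER);
""")

original = r"C:\backup source\short.txt"
copy = r"C:\backup source\short - Copy.txt"
ticks = 637453191320521051

con.executemany("INSERT INTO File VALUES (?, ?, ?, ?)",
                [(2, original, 4, 2), (3, copy, 4, 3)])
con.executemany("INSERT INTO FilesetEntry VALUES (?, ?, ?)",
                [(3, 2, ticks), (4, 2, ticks), (4, 3, ticks)])
con.executemany("INSERT INTO Fileset VALUES (?, ?)",
                [(3, 1609722340), (4, 1609793048)])

QUERY = """
SELECT f.Path, fe.Lastmodified, MIN(fs.Timestamp) AS FirstBackup
FROM File f
JOIN FilesetEntry fe ON f.ID = fe.FileID
JOIN Fileset fs ON fe.FilesetID = fs.ID
GROUP BY f.Path, fe.Lastmodified
ORDER BY f.Path, fe.Lastmodified
"""
rows = con.execute(QUERY).fetchall()
first_backup = {path: first for path, _, first in rows}
for row in rows:
    print(row)
```

With this data each path collapses to a single version: the original file first appears in the earlier backup and the copy in the later one, even though the original is present in both filesets.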