Why the heck CAN'T we change the blocksize?

In theory this is possible, but it would be cumbersome and would need skilled new developers with lots of time.
Volunteers are very much encouraged in all areas, including helping on the forum, but there are few of them.

Run time would also be a big problem, because one would basically have to reconstruct all versions
of all files (undoing the savings of deduplication), then back them up again with the new block size.

An actual restore does not download an entire dblock file every time it needs one block from that file.
Instead, it downloads each required dblock file once and scatters the needed blocks from it into the various files being restored.
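
Roughly like the sketch below, if it helps to picture that scatter step. This is my illustration only, not the real code: `BlockTarget` is a type I made up, and entries-named-by-block-hash is an assumption here.

```csharp
using System.Collections.Generic;
using System.IO;
using System.IO.Compression;

// Where one needed block belongs in one restoring file (made-up type for illustration).
record BlockTarget(string BlockHash, string RestorePath, long Offset);

static class RestoreSketch
{
    // Download-once, scatter-many: open a dblock archive a single time and copy
    // each needed block into every spot it occupies in the files being restored.
    public static void ScatterBlocks(string dblockZipPath, IEnumerable<BlockTarget> targets)
    {
        using var archive = ZipFile.OpenRead(dblockZipPath);
        foreach (var t in targets)
        {
            var entry = archive.GetEntry(t.BlockHash);   // assume one archive entry per block
            if (entry == null)
                continue;                                // that block lives in another dblock

            using var block = entry.Open();
            using var file = new FileStream(t.RestorePath, FileMode.OpenOrCreate, FileAccess.Write);
            file.Seek(t.Offset, SeekOrigin.Begin);
            block.CopyTo(file);                          // place the block at its offset
        }
    }
}
```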

I’m not sure offhand how one could efficiently do the sort of block-level repackaging you’re asking for.
If you or anybody else has a great idea, feel free to mention it. Even better, any volunteers to write it?

Throwing hardware at it (roughly the size of the destination after decompression) might allow something like keeping every block of the destination quickly accessible on local storage, basically as files named with the Base64 of each block's SHA-256 hash. This would offer faster retrieval than opening up individual .zip files, which actually use that same file-per-block format internally but might also have compression and encryption…
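
A toy version of that block-store idea, just to show the hash-based naming. The substitution of filesystem-unfriendly Base64 characters is my own assumption, not an established convention:

```csharp
using System;
using System.IO;
using System.Security.Cryptography;

// Sketch: every block is a local file whose name is the Base64 SHA-256 of its
// contents, so lookup by hash is a single path probe instead of opening a .zip.
static class LocalBlockStore
{
    static string NameFor(byte[] block) =>
        Convert.ToBase64String(SHA256.HashData(block))
               .Replace('/', '_').Replace('+', '-');   // my assumption: make the name filesystem-safe

    public static string Put(string storeDir, byte[] block)
    {
        var path = Path.Combine(storeDir, NameFor(block));
        if (!File.Exists(path))                 // same content already stored = deduplicated for free
            File.WriteAllBytes(path, block);
        return path;
    }

    public static byte[] Get(string storeDir, string blockName) =>
        File.ReadAllBytes(Path.Combine(storeDir, blockName));
}
```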

How the backup process works
How the restore process works
Developer documentation
Local database format

You can’t change the blocksize on an existing database because the idea of a fixed blocksize is highly ingrained in the design and code. As you can see in the documents above, a file larger than one block
is represented by listing the blocks it contains. That’s done using a concatenation of the block hashes.
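
If it helps to see that concretely, the idea is roughly this toy chunk-and-hash sketch (not Duplicati's actual code; how the concatenated list itself gets stored is omitted). Note how the blocksize sits right in the chunking loop, so every stored hash list silently assumes that same size:

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Security.Cryptography;

// A large file is cut into fixed-size blocks and described by the
// concatenation of the block hashes, in file order.
static class BlocklistSketch
{
    public static byte[] BuildBlocklist(string path, int blocksize)
    {
        var hashes = new List<byte>();
        using var stream = File.OpenRead(path);
        var buffer = new byte[blocksize];
        int read;
        while ((read = stream.Read(buffer, 0, blocksize)) > 0)
        {
            // Hash exactly the bytes read (the final block is usually short).
            hashes.AddRange(SHA256.HashData(buffer.AsSpan(0, read)));
        }
        return hashes.ToArray();   // 32 bytes per block, concatenated
    }
}
```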

Blocks are used during backup by looking to see whether a block that "looks" (by hash) like what was just read is already in the backup. If a new block were a different size, it wouldn't match the older, smaller blocks, which would hurt deduplication, although over time the smaller older blocks might get churned away as the data changes.
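
The dedup check is essentially a lookup of the block hash among what's already stored, and a standalone toy like this (mine, not Duplicati code) shows the mismatch problem: chunk the same data at 100 KB and at 1 MB, and the two sets of block hashes share nothing:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Security.Cryptography;

static class DedupSketch
{
    // Chunk data at a fixed blocksize and collect the hash of each block.
    static HashSet<string> HashesAt(byte[] data, int blocksize)
    {
        var set = new HashSet<string>();
        for (int off = 0; off < data.Length; off += blocksize)
        {
            int len = Math.Min(blocksize, data.Length - off);
            set.Add(Convert.ToBase64String(SHA256.HashData(data.AsSpan(off, len))));
        }
        return set;
    }

    public static void Main()
    {
        var data = new byte[10 * 1024 * 1024];
        new Random(1).NextBytes(data);

        var small = HashesAt(data, 100 * 1024);       // old backup's blocks
        var large = HashesAt(data, 1024 * 1024);      // same data, new blocksize
        Console.WriteLine($"shared hashes: {small.Intersect(large).Count()}");  // prints 0
    }
}
```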

Regardless, as long as there are different block sizes in one backup, there's a new need to track them.
That means changes to the destination file format, the database design, its SQL, and a lot of C# code.

Suggest better values for dblock-size and block size #2466 has a 2017 comment from the original author on dynamic block sizes, an idea that was considered too ambitious for the available time, but I'm not sure of the details.

Personally, I'm thinking that if nothing else, the default should be raised. A rough rule of thumb I use (from a test where database performance started to drop) is 1 million blocks per backup, meaning that at the current 100 KB default, anything over about 100 GB gets potentially slow. If the default were bumped even to 1 MB for new backups, a 1 TB backup might do well.
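
For the arithmetic behind that rule of thumb (assuming the 100 KB default blocksize, which is what the 100 GB figure implies; binary units used for round numbers):

```csharp
using System;

// Back-of-envelope: blocks per backup is roughly backup size / blocksize.
static class BlockCountEstimate
{
    public static void Main()
    {
        (string size, long bytes)[] backups = { ("100 GB", 100L << 30), ("1 TB", 1L << 40) };
        long[] blocksizes = { 100 * 1024, 1024 * 1024 };   // 100 KB default vs. a 1 MB default

        foreach (var (size, bytes) in backups)
            foreach (var bs in blocksizes)
                Console.WriteLine($"{size} at {bs / 1024} KB blocks -> ~{bytes / bs:N0} blocks");
        // 100 GB at 100 KB blocks -> ~1,048,576 blocks (right at the ~1 million mark)
        // 1 TB at 1024 KB blocks  -> ~1,048,576 blocks (back in the same comfortable range)
    }
}
```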

Deduplication might get slightly worse. Possibly compression will work a little better. They might offset…

There is no performance lab with large hardware, and no performance team to benchmark. Volunteers?

Maybe some database volunteer could also help see if some of the SQL can be made to run faster. Anybody?
That probably wouldn't help the database size, which can also get large on a big backup with small blocks.