Logic and usecase for remote subfolders

As a non-coder, I guess this shouldn’t be too complicated to do. Maybe it will be a bit easier to use only this part of the sequence:

“1/1/1”, “1/1/2”, “1/1/3”, “1/2/1”, “1/2/2”, “1/2/3”, “1/3/1”, “1/3/2”, “1/3/3”

So folder 1 only contains subfolders “1”, “2” and “3” (no files). I suppose this will make the algorithm a bit easier to code.

However, I see a few issues using this folder structure:

  • There is no relation between the filenames and the location where the files are stored. The only way to search for a file is listing the contents of each folder and check if it’s there.
  • To avoid extensive folder list operations, an extension of the local db structure is required. Personally I think storing information in the local databse should be avoided where possible: currently most issues in Duplicati are related to the local db, every addition of new fields could make things worse.
  • If a new file has to be uploaded, the only way to calculate the location is using the total number of files at the backend. However, when applying a retention policy or a compact operation, an unknown number of files (DBLOCK and DINDEX) will be deleted, decreasing the total number of files. When using the new total filecount for calculating the subfolder for the next upload, probably a subfolder is chosen that already contains the maximum number of uploaded files. A workaround should be implemented to avoid this, I have no idea how to do this, other than store folder information (number of files for every folder) in the local db. I’m not sure if this is recommendable (see previous item).
  • DBLOCK and DINDEX both have a random filename. We cannot be sure that a DBLOCK/DINDEX pair are stored in the same folder. So for each file (both DBLOCK and DINDEX), folder location should be stored somewhere. However they belong together, the location of one file cannot be calculated using the filename/location of the other file.

I still prefer a mechanism where the location can be calculated using the filename. No extra information needs to be stored in the database, and theoretically migration back and forth between using subfolders or not should be possible (making it possible to restore using a third party tool to do a restore).

Because filenames are completely random, we have to cheat a bit when generating the filename for a new file to be uploaded. The best way I can think of, is calculating a few characters of the filename to match a particular checksum. That checksum can be used as folder name (see “Approach 3” in the first post).

1 Like

Would we really have to cheat and manipulate filenames? How about we let Duplicati generate filenames like it always does, but then run the name through a hash function. Take the the last 2 or 3 characters (depending on how many folders you want to end up with) to determine where the file should be placed on the back end.

Example:

echo 'duplicati-b04f7ec9562cb47ba9e992c87e90b5da3.dblock.zip.aes' | sha256sum

92bf201ff334a1fac711fcc6f102a4c4dc888da2b7c78c6c3c81b1579533adf3

this file could be placed in /f/3

echo 'duplicati-b4cbd825ee4454ec88f068d4ab88ec7b8.dblock.zip.aes' | sha256sum

dcd7b14a006735546fb0f47551691be1f8d559cbd27fd74e09aa402eea459228

And this file would be placed in /2/8

Nature of hashing functions is that the distribution should be even. The files would end up be splitting into 16 ^ X folders, where X is the number of characters you pull off the end of the hash.

Two characters = 256 folders
Three characters = 4096 folders

Edit - well maybe if they are completely random already it’s unnecessary to run them through a hash function. Just use the last 2 or 3 chars of the original filename lol - before the .dblock.zip.aes of course.

Edit2 - perhaps it would still be useful for dlist files which do not have random names.

I think I now have the logic and code worked out for the directories.

I was thinking it would be a benefit to be able to calculate the path using the filename. I did this in my first iteration of creating subfolders.

Though now I’m not sure what benefit of being able to calculate the existing path. The reason being that I’m storing the paths are in the database. This entire calculation only occurs when a volume file is being uploaded to the backend. At all other times the files are looked up in the database.

For retention, there is little to be done to avoid directories becoming sparse. But a fix for that may be a job which consolidates files from sparse directories. This job could be part of the retention checking.

This would result in thousands of folders that contain only one file. If the number of folders would be limited (using only the last 2 characters of a filename, max 256 folders) would limit the total backup size too much.
I was thinking of a way to create about 3000 folders that fill up to about 3000 files, using no more folders than needed.
A random filename would distribute the files over too many folders.

One benefit is that no changes are required to the local db structure.
Another benefit could be that filling up sparse folders could be done during backup operations (just generate a filename that “fits” in that folder and upload it, instead of downloading and re-uploading during a compact operation). But I guess somehow that could be implemented using your idea too.

Good point! You’ve clearly thought about this a lot more than I have.

I don’t need to modify the db structure. The path is added to the filename.

Any subfolder solution will need to deal with sparse folders. Given that the db holds all the filenames and paths we can calculate where to place additional files and target sparse folders, if we want that as a solution.

The important part, to me, I have the subfolders coded and working; and it works with existing backups. The algorithm to place folders can be easily modified or replaced even with existing backups. So later on any other ideas for file placement can just be added in.

I think I have the algo done for max filling folders with a flat structure.

I’m going to start with 2000 files per folder and 1000 folders per folder.

The first 2000 files go in the root “/” directory. The next 2000 files go in “/1/”.

The 25,654,396 file goes in “/12/826/”

So this seems to reach the goals. It is pretty flat, works with the limit of 5000 items per folder, and supports a lot of files.

I just ran a test with 10 files per folder and 10 folders per folder on 6GB of data at 50MB blocks.

2 Likes

Excellent!

Looks like this approach works great, even for existing backups without subfolders. Great work, love it!

Can I make a few suggestions?

  • To get backend files and folders sorted nicely, leading zeroes could be added to folder names: 0001 instead of 1.
  • In Duplicati, filenames have a prefix, a dash and a single character to indicate the filetype. All files are preceeded with duplicati-b or duplicati-i by default.
    To make clear that a folder contains Duplicati backup files, the folders could have the same prefix: <prefix>-d (example: duplicati-d0001).
  • Isn’t the 2-level tree a bit overkill? If you just stick to 2000 folders in the backend root with 2000 files per folder, you can backup up to 100 TB per backup job when using the default 50MB DBLOCK size. If you increase this to 4000 files per folder, the max is 200 TB (2000 folders * 4000 files * 25MB average file size). I guess this should be enough for any backup job.
  • You could make a 1-level structure (see above) default and introduce an advanced option: --folder-depth=n (default 1). Folder depth could be set to 0 for no subfolders and 2 for an extra tree level for superlarge backups.
  • Other useful advanced options that could be added:
    --max-files-per-remote-subfolder=n (default 2000 or maybe more). Can be used to set the number of files in every folder.
    --max-folders-per-remote-subfolder=n (default 2000). Can be used to specify how many subfolders are created before a new parent folder is created.

Thanks. And the options are definitely a good idea. I used your suggested parameters.

“2-level” well… the logic is already done. And it isn’t just 2-level. It will go indefinitely. The folder structure will not be a bottle-neck.

For most users they’ll likely never branch out of the 1st level of folders.

Regarding file sorting and 0001, why even the need to have them sorted?

The filenames are pretty long, which is helpful to keep them unique in a single folder. But at 2000 files per directory the filenames can likely be shortened and thereby reduce the database size. We would want to try and keep the filenames long enough to not have collisions since we might want to reorganize the files in a future process. Though perhaps the filenames could actually grow in length as the number of files increase since I already have the total file count handy in the process. Something to think about further.

Not a special reason, it could be handy when accessing the remote files using a file browser instead of the Duplicati software. For example, if you want to download the files to/from local storage for migration to another backend, of for disaster recovery tasks.
Apart from that, it just gives a cleaner look.

The filenames contain 32 random characters, because it’s a standard GUID.
But all characters (except the 13th) are random, so the filenames could be truncated to save some space in the local DB. I don’t have a clue how how much the db size would decrease because of shoreter filenames.

Agreed.

So when a backup runs, it is not creating these files in an incredibly excessive speed. It isn’t a bottle-neck. So I think we can shorten the name, run a check of the database for a name collision, and just use shorter names. Thoughts?

Either way… it isn’t really a pressing issue… but a nice-to-have.

PR#3873 has been added as a work-in-progress as backends get modified for subfolder support.

Nice work… Look forward to checking it out!

Is this still in progress ?
Pull request was closed.

Welcome to the forum @Andrey_Nikolaev

The note at PR close was:

Closing this due to lack of activity. We can revisit this later if necessary.

It seems it’s on hold. Its developer stopped, so it may need a restart. You?
I don’t think the need has gone away, but there are lots of competing ones.

I would be glad to, but I’m mainly code for web, and have almost 0 experience in C#

I think (but am not sure) that the current developer team is lower on web skills than C#.

https://github.com/duplicati/duplicati/issues?q=is%3Aissue+is%3Aopen+label%3AUI
has some issues that a skilled web person (I’m not) could probably resolve fairly easily.

Password saver may autofill too many places, getting --auth-password is not supported #4102
may or may not be easy, but bothers me because it bites so often. Here’s today’s case:

Restore: Invalid password or corrupted data AND missing file

I think there are some ways to keep autofill out of fields. One issue is we do want stars.
The problem possibly comes from being a single-page web app. It confuses browsers.

Simple web UI design study proposed a UI revamp while noting our AnglularJS is aging.

UX category and to some extent Features category have other things needing web skill.
At the moment, there are resource limits in pretty much all skills, but I hope it improves.
Being all-volunteer, Duplicati progress relies on whatever its volunteers are able to do…

This feature appears to also be very useful for non-remote devices. I have an external hard drive with a FAT32 file system (that’s how it came) and Duplicati backup usually stops working after 11 backups, for all I know because the maximum number of files in the directory is reached.

So this feature would help me a lot. This would also mean that the sub folder functionality should also be implemented for non-remote targets.

Looks like FAT32 supports 65536 entries in a single directory. From what I can see a file takes 1 entry plus 1 more for every 13 characters in the filename. Duplicati filenames (with default prefix and with encryption) are up to 58 characters long, which I guess means they would consume up to 6 directory entries each. So maybe about 10k files maximum? Does that sound like what you have in your FAT32 target before Duplicati stops working? How large is the source data being protected?

The source is about 300GB. One backup that stopped working has 10924 files, so your assessment might be right. The other stopped at about the same number of backups, so it could be a similar number of files. So yes, that could be the issue. I got different errors with the first stalled backup and the second, gut that could also be because of different versions of Duplicati.

Does source files area change heavily from backup to backup? You can get a sense for that in the growth of the backup on its home page overview where it has backup size and number of versions. More is in job logs, such as Source files statistics in main page, and various BackendStatistics.in Complete log.

On Options screen 5, you can limit your versions or thin with age. More importantly, you can increase the Remote volume size probably quite a bit. The default 50 MB might suit typical home Internet connections however with a local drive your transfer rate is probably far higher, so you can worry less about larger files.

Choosing sizes in Duplicati

If you have large datasets and a stable connection, it is usually desirable to user a larger volume size.

You can change this size (unlike the blocksize) whenever you like, but existing volumes aren’t immediately grown. If I recall correctly, the decision to compact will decide the old volumes are underfilled, so will swing into action to cure that. This may take awhile to do all the downloading and uploading, so pick a good time.