Logic and use case for remote subfolders

Nice writeup and clever solution! It sounds great to me.

Any updates on implementing one of these solutions?

It’s an issue I’m actively looking into now. No ETA.

For context: Box.com max number of files per folder limitation.

My question would be which will overload first – the remote files-per-folder or the Duplicati database? Improving scaling in one area works until you hit some scaling problem in the next area. So it goes…

Without doubt the remote files-per-folder.

For OneDrive, the max files per folder is still 5000. Using the default DBLOCK size of 50 MB, the maximum amount of data per backup is about 125 GB (2500 DBLOCK and 2500 DINDEX files, 2500 * 50 MB = 125 GB). The number of backup versions reduces this (one DLIST file for every version). The amount of modified data in the available backup versions reduces it further. Uncompacted DBLOCK files could reduce the maximum backup data even further.
So with OneDrive’s 5000 files limit, I guess the maximum protected source data will be somewhere around 80 GB, which is way too small.

This limit forced me to increase the DBLOCK size to 2 GB for my backup job that protects 4 TB of videos and uses OneDrive as storage provider.
3000 subfolders with 3000 files per folder would increase the 125 GB limit to 200 TB, which is virtually unlimited.

For larger backups, I agree that you should choose a larger DBLOCK size. Even more important is to think about the internal block size (in my videos backup job, I increased the default 100 KB to 10 MB). Increasing the block size will drastically reduce the local database size (the database of my 4 TB backup is about 80 MB and takes about 10 minutes to recreate!).

Is OneDrive v2 still doing that 5000 file limit for you? Back when there was a choice, a couple of people switched from OneDrive original (now gone) to the v2 version (Graph API) and the problem went away.

Results have been kind of mixed, but that might also be because “OneDrive” is a very broad MS branding.

This is getting at what I was talking about, but I agree that if there’s still a 5000 file limit, it’ll likely hurt first.

Choosing sizes in Duplicati is worth a read for anyone who wants to learn more about this sort of configuration tuning…

Interesting, I’ll play with the internal block size. I don’t expect many changes to existing files in our backups, but rather lots of media files being added. I think 10 MB should drastically improve performance.

Keep in mind that you cannot change the internal block size for existing backup jobs.

Right, I’ll have to blow things up.

Can anyone help with an algorithm to generate the following series?

“1”, “2”, “3”, “1/1”, “1/2”, “1/3”, “2/1”, “2/2”, “2/3”, “3/1”, “3/2”, “3/3”, “1/1/1”, “1/1/2”, “1/1/3”, “1/2/1”, “1/2/2”, “1/2/3”, “1/3/1”, “1/3/2”, “1/3/3”

As a non-coder, I guess this shouldn’t be too complicated to do. Maybe it will be a bit easier to use only this part of the sequence:

“1/1/1”, “1/1/2”, “1/1/3”, “1/2/1”, “1/2/2”, “1/2/3”, “1/3/1”, “1/3/2”, “1/3/3”

So folder 1 only contains subfolders “1”, “2” and “3” (no files). I suppose this will make the algorithm a bit easier to code.
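
For illustration, one way the full series could be generated (a minimal sketch; the function name and the choice of Python are arbitrary here):

    from itertools import product

    def folder_series(symbols=("1", "2", "3"), max_depth=3):
        # Yield all 1-part paths, then all 2-part paths, and so on,
        # each part drawn from `symbols` and joined with "/".
        for depth in range(1, max_depth + 1):
            for parts in product(symbols, repeat=depth):
                yield "/".join(parts)

    print(list(folder_series()))
    # ['1', '2', '3', '1/1', '1/2', ..., '3/3/3']

The simpler fixed-depth variant is just the last level of this sequence, i.e. product(symbols, repeat=3) on its own.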

However, I see a few issues using this folder structure:

  • There is no relation between the filenames and the location where the files are stored. The only way to search for a file is listing the contents of each folder and checking if it’s there.
  • To avoid extensive folder list operations, an extension of the local db structure is required. Personally I think storing extra information in the local database should be avoided where possible: currently most issues in Duplicati are related to the local db, and every addition of new fields could make things worse.
  • If a new file has to be uploaded, the only way to calculate its location is using the total number of files at the backend. However, when applying a retention policy or a compact operation, an unknown number of files (DBLOCK and DINDEX) will be deleted, decreasing the total number of files. Using the new total file count to calculate the subfolder for the next upload would probably pick a subfolder that already contains the maximum number of files. A workaround would be needed to avoid this; I have no idea how to do that, other than storing folder information (the number of files in every folder) in the local db, and I’m not sure that’s recommendable (see previous item).
  • DBLOCK and DINDEX files both have random filenames, so we cannot be sure that a DBLOCK/DINDEX pair is stored in the same folder. The folder location of each file (both DBLOCK and DINDEX) would therefore have to be stored somewhere: even though the two files belong together, the location of one cannot be calculated from the filename or location of the other.

I still prefer a mechanism where the location can be calculated using the filename. No extra information needs to be stored in the database, and in theory migration back and forth between using subfolders or not would be possible (which also keeps a restore with a third-party tool possible).

Because filenames are completely random, we have to cheat a bit when generating the filename for a new file to be uploaded. The best way I can think of is calculating a few characters of the filename so that the name matches a particular checksum. That checksum can then be used as the folder name (see “Approach 3” in the first post).
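
As a rough illustration of that idea (the helper name, the use of SHA-256 and the 2-character checksum are assumptions for this sketch, and the retry loop is just the simplest way to show it; a real implementation might set the characters directly):

    import hashlib
    import uuid

    def name_for_folder(target_folder, prefix="duplicati-b",
                        suffix=".dblock.zip.aes", checksum_chars=2):
        # Keep generating random volume names until the short checksum of
        # the name matches the desired folder label, so the folder can
        # later be re-derived from the filename alone.
        while True:
            candidate = prefix + uuid.uuid4().hex + suffix
            checksum = hashlib.sha256(candidate.encode()).hexdigest()[-checksum_chars:]
            if checksum == target_folder:
                return candidate

    # Find a name that belongs in folder "3f" (roughly 256 attempts on average).
    print(name_for_folder("3f"))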


Would we really have to cheat and manipulate filenames? How about we let Duplicati generate filenames like it always does, but then run the name through a hash function. Take the last 2 or 3 characters (depending on how many folders you want to end up with) to determine where the file should be placed on the back end.

Example:

echo 'duplicati-b04f7ec9562cb47ba9e992c87e90b5da3.dblock.zip.aes' | sha256sum

92bf201ff334a1fac711fcc6f102a4c4dc888da2b7c78c6c3c81b1579533adf3

This file could be placed in /f/3

echo 'duplicati-b4cbd825ee4454ec88f068d4ab88ec7b8.dblock.zip.aes' | sha256sum

dcd7b14a006735546fb0f47551691be1f8d559cbd27fd74e09aa402eea459228

And this file would be placed in /2/8

The nature of hash functions is that the distribution should be even. The files would end up being split into 16 ^ X folders, where X is the number of characters you pull off the end of the hash.

Two characters = 256 folders
Three characters = 4096 folders

Edit - well maybe if they are completely random already it’s unnecessary to run them through a hash function. Just use the last 2 or 3 chars of the original filename lol - before the .dblock.zip.aes of course.

Edit 2 - perhaps it would still be useful for DLIST files, which do not have random names.
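
In code, the idea might look roughly like this (a sketch only; the hash choice and the number of characters are arbitrary):

    import hashlib

    def folder_for_filename(filename, chars=2):
        # Hash the filename and use the last `chars` hex characters,
        # one character per folder level; chars=2 spreads files over
        # 16 ^ 2 = 256 folders, chars=3 over 4096.
        digest = hashlib.sha256(filename.encode("utf-8")).hexdigest()
        return "/".join(digest[-chars:])

    print(folder_for_filename("duplicati-b04f7ec9562cb47ba9e992c87e90b5da3.dblock.zip.aes"))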

I think I now have the logic and code worked out for the directories.

I was thinking it would be a benefit to be able to calculate the path using the filename. I did this in my first iteration of creating subfolders.

Though now I’m not sure what the benefit of being able to calculate the existing path would be, because I’m storing the paths in the database. This entire calculation only occurs when a volume file is being uploaded to the backend. At all other times the files are looked up in the database.

For retention, there is little to be done to avoid directories becoming sparse. But a fix for that may be a job which consolidates files from sparse directories. This job could be part of the retention checking.

This would result in thousands of folders that contain only one file. Limiting the number of folders (using only the last 2 characters of a filename, max 256 folders) would limit the total backup size too much.
I was thinking of a way to create about 3000 folders that each fill up to about 3000 files, using no more folders than needed.
A random filename would distribute the files over too many folders.

One benefit is that no changes are required to the local db structure.
Another benefit could be that filling up sparse folders could be done during backup operations (just generate a filename that “fits” in that folder and upload it, instead of downloading and re-uploading during a compact operation). But I guess that could somehow be implemented using your idea too.

Good point! You’ve clearly thought about this a lot more than I have.

I don’t need to modify the db structure. The path is added to the filename.

Any subfolder solution will need to deal with sparse folders. Given that the db holds all the filenames and paths, we can calculate where to place additional files and target sparse folders, if we want that as a solution.

The important part, to me, is that I have the subfolders coded and working, and it works with existing backups. The algorithm to place folders can easily be modified or replaced, even with existing backups. So later on, any other ideas for file placement can just be added in.

I think I have the algorithm done for maximally filling folders with a flat structure.

I’m going to start with 2000 files per folder and 1000 folders per folder.

The first 2000 files go in the root “/” directory. The next 2000 files go in “/1/”.

The 25,654,396th file goes in “/12/826/”.

So this seems to reach the goals. It is pretty flat, works with the limit of 5000 items per folder, and supports a lot of files.

I just ran a test with 10 files per folder and 10 folders per folder on 6 GB of data at 50 MB blocks.
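
For anyone following along, here is a rough sketch of how that kind of index-to-path mapping could work (the exact 0-/1-based and rounding conventions are guesses, not the actual implementation):

    def folder_for_index(file_index, files_per_folder=2000, folders_per_folder=1000):
        # The first `files_per_folder` files stay in the backend root, the
        # next batch goes into folder "1", then "2", and so on; once the
        # first level is exhausted, another level is added ("1/1", "1/2", ...).
        folder_number = file_index // files_per_folder
        if folder_number == 0:
            return ""  # backend root
        # Express the folder number with "digits" 1..folders_per_folder,
        # one digit per directory level (bijective numeration).
        parts = []
        n = folder_number
        while n > 0:
            digit = n % folders_per_folder
            if digit == 0:
                digit = folders_per_folder
            parts.append(str(digit))
            n = (n - digit) // folders_per_folder
        return "/".join(reversed(parts))

    print(folder_for_index(0))      # "" (root)
    print(folder_for_index(2000))   # "1"
    print(folder_for_index(4000))   # "2"
    print(folder_for_index(25, files_per_folder=10, folders_per_folder=10))  # "2"

With small test values like 10 files per folder and 10 folders per folder, the paths stay short, which makes the behaviour easy to verify.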


Excellent!

Looks like this approach works great, even for existing backups without subfolders. Great work, love it!

Can I make a few suggestions?

  • To get backend files and folders sorted nicely, leading zeroes could be added to folder names: 0001 instead of 1.
  • In Duplicati, filenames have a prefix, a dash and a single character to indicate the file type. All files are preceded with duplicati-b or duplicati-i by default.
    To make clear that a folder contains Duplicati backup files, the folders could have the same prefix: <prefix>-d (example: duplicati-d0001).
  • Isn’t the 2-level tree a bit overkill? If you just stick to 2000 folders in the backend root with 2000 files per folder, you can back up as much as 100 TB per backup job when using the default 50 MB DBLOCK size. If you increase this to 4000 files per folder, the max is 200 TB (2000 folders * 4000 files * 25 MB average file size). I guess this should be enough for any backup job.
  • You could make a 1-level structure (see above) default and introduce an advanced option: --folder-depth=n (default 1). Folder depth could be set to 0 for no subfolders and 2 for an extra tree level for superlarge backups.
  • Other useful advanced options that could be added:
    --max-files-per-remote-subfolder=n (default 2000 or maybe more). Can be used to set the number of files in every folder.
    --max-folders-per-remote-subfolder=n (default 2000). Can be used to specify how many subfolders are created before a new parent folder is created.

Thanks. And the options are definitely a good idea. I used your suggested parameters.

“2-level” well… the logic is already done. And it isn’t just 2-level. It will go indefinitely. The folder structure will not be a bottleneck.

Most users will likely never branch out of the first level of folders.

Regarding file sorting and 0001, why is there even a need to have them sorted?

The filenames are pretty long, which is helpful to keep them unique in a single folder. But at 2000 files per directory the filenames could likely be shortened, which would reduce the database size. We would want to keep the filenames long enough to avoid collisions, since we might want to reorganize the files in a future process. Though perhaps the filenames could actually grow in length as the number of files increases, since I already have the total file count handy in the process. Something to think about further.

Not a special reason; it could be handy when accessing the remote files using a file browser instead of the Duplicati software. For example, if you want to download the files to/from local storage for migration to another backend, or for disaster recovery tasks.
Apart from that, it just gives a cleaner look.

The filenames contain a 32-character random string, because it’s a standard GUID.
All characters (except the 13th) are random, so the filenames could be truncated to save some space in the local DB. I don’t have a clue how much the db size would decrease because of shorter filenames.