Logic and use case for remote subfolders


#1

The problem:

@jcbadmin’s question made me think about how using subfolders at the remote location could be useful. As the uploaded files have no direct relation to the source folder structure, there is no added value in creating (sub)folders at the backup location.

However, distributing backup files across a set of subfolders could improve Duplicati in one important way.
The default size of a DBLOCK file is 50 MB, and each DBLOCK file is accompanied by a small DINDEX file. For larger backup sources, this results in a long list of files in a single folder, which has several drawbacks:

  • Many files in a single folder can degrade performance if the backend file system doesn’t handle large directories well.
  • Manual operations, like copying files to other storage media (from/to a USB disk when migrating to another storage provider), are difficult to split into a number of sub-operations.
  • Some storage providers (Microsoft OneDrive, SharePoint) have a limit on the number of files per folder. In OneDrive/SharePoint you can store up to 5000 files per folder, so with the default volume size you can back up roughly 200 GB per backup job. If the number of remote files exceeds the 5000 limit, the backup fails and it’s not easy to recover from that. Unless you use some hacks, OneDrive and SharePoint are not suitable for larger backups.
    That’s too bad: I have a 5 TB storage quota in my Office 365 OneDrive account, and for my backup of video files I’m forced to use a gigantic 2 GB upload volume size.

A possible solution:

According to the documentation, Duplicati requires only these 4 operations for any backend: GET, PUT, LIST, DELETE. The CREATEFOLDER operation is not listed there, but works for almost any backend: Duplicati can already create a folder at the backend (in the Add backup wizard, step 2: if a path does not exist, Duplicati can create it for you).

If Duplicati could create subfolders and distribute DBLOCK and DINDEX files across these folders, large backups would be possible, even with storage providers that allow only a limited number of files per folder.
For example, Microsoft OneDrive backends limit the number of entries per folder to 5000. With 3000 folders containing 3000 files each, more than 200 TB could be stored for a single backup job using the default 50 MB volume size! (1 DBLOCK plus 1 DINDEX is about 50 MB, so the average file size is about 25 MB. 3000 folders × 3000 files = 9,000,000 files; 9,000,000 × 25 MB = 225,000,000 MB, roughly 225 TB.)
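
A quick back-of-the-envelope check of that figure, as a sketch (the 3000 × 3000 layout is just the example above, chosen to stay below the 5000-entry limit; this is not Duplicati code):

```python
# Rough capacity estimate for the subfolder scheme (example figures from this post).
folders = 3000            # subfolders at the remote location
files_per_folder = 3000   # DBLOCK + DINDEX files per subfolder, safely below the 5000 limit
avg_file_size_mb = 25     # a ~50 MB DBLOCK plus its tiny DINDEX averages ~25 MB per file

total_files = folders * files_per_folder      # 9,000,000 files
total_mb = total_files * avg_file_size_mb     # 225,000,000 MB
print(f"{total_files:,} files, ~{total_mb // 1_000_000} TB")  # ~225 TB
```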

The requirements:

If subfolder support were implemented, these requirements should be met:

  • Current backup jobs should keep working without changing anything to the backend files.
  • Restore operations should work with any version of Duplicati that supports subfolders, regardless of whether the backup location contains subfolders or not.
  • Minimal changes (or no changes) to the local database: Duplicati should “know” where a certain file can be found. This is not trivial, because DBLOCK and DINDEX filenames contain a random sequence of hexadecimal characters.
  • Backward compatibility: it should somehow be possible to do restore operations from subfolder-enabled backends using older versions of Duplicati (with some manual actions).

Preparation:

Before a backup operation starts, Duplicati performs a LIST command to see if the remote location is reachable and contains the backup files that it expects. In this phase, Duplicati could check for the existence of subfolders.
If the remote location is empty (initial backup), Duplicati should create subfolders to store the files in groups. An optional advanced option, like --use-subfolders=on/off/auto, could change the default behaviour.
If the remote location contains one or more DBLOCK and DINDEX files in the root, Duplicati should not create subfolders and keep working as it always did.
If the remote location contains one or more DLIST files and a number of subfolders, the backup job should use subfolders.

Approach 1: simple and powerful

The first approach that comes to mind is to create a subfolder for all files whose names share the same first 2 characters after the prefix. For example: duplicati-bc231488dbb5e8b1a79a4ed0c1d82fb35.dblock.zip.aes could be stored in /duplicati-bc (duplicati is the default prefix that’s also used for files).
You could even create a deeper folder structure, so this file could also be stored in /duplicati-b/duplicati-c, or in /duplicati-bc/duplicati-23.
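
A minimal sketch of how the subfolder could be derived under this approach (the function and folder naming are illustrative, not existing Duplicati code):

```python
def subfolder_for(filename: str, prefix: str = "duplicati") -> str:
    """Derive the subfolder name from the first 2 characters after the prefix."""
    rest = filename[len(prefix) + 1:]   # the part after "duplicati-", e.g. "bc231488..."
    return f"{prefix}-{rest[:2]}"       # e.g. "duplicati-bc"

print(subfolder_for("duplicati-bc231488dbb5e8b1a79a4ed0c1d82fb35.dblock.zip.aes"))
# -> duplicati-bc
```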

This approach meets most requirements (the local DB structure can stay unchanged, it is backward compatible, and restore operations work for both storage layouts). It should be relatively easy to implement, and calculating the correct subfolder from the filename is fast and straightforward.

However, there’s one major drawback that makes this approach unusable: with filenames containing random hex sequences, the number of folders grows very fast and each folder will most likely contain only a very small number of files. When using subfolders for filenames with the same first 2 digits and a sub-subfolder for each file with the same 3rd/4th hex digits, you end up with more than 65000 folders, resulting in 65000 LIST operations for each backup/restore/compact/repair operation.

Approach 2: Filenames with increasing numbers.

Start with duplicati-00000000000000000000000000000001.dblock.aes and increase this number for each new file. Store the file in a folder named duplicati-000001-005000 until the max-files-per-folder limit (5000) is reached. For file 5001, a new folder named duplicati-005001-010000 is created, and so on.
The first part of the hex sequence could stay random.
Advantages: each subfolder fills up until the maximum is reached, and it’s quite easy to calculate the next number to be used.
Disadvantage: part of the hex sequence is no longer random.
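
A sketch of the folder naming for this approach (the folder-name format and the 5000 limit follow the example above; this is not existing Duplicati code):

```python
def folder_for_sequence(n: int, files_per_folder: int = 5000) -> str:
    """Return the folder name for the n-th uploaded file (1-based numbering)."""
    first = ((n - 1) // files_per_folder) * files_per_folder + 1
    last = first + files_per_folder - 1
    return f"duplicati-{first:06d}-{last:06d}"

print(folder_for_sequence(1))     # duplicati-000001-005000
print(folder_for_sequence(5001))  # duplicati-005001-010000
```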

Approach 3: Best of both worlds, semi-random numbers

A slightly less straightforward approach combines the first two approaches. Effectively this results in a folder with a random number, containing files with (near) random numbers. Each folder fills up until the number of files reaches the per-folder limit. The location of a file can be calculated from its filename using a relatively easy procedure, without using the local DB.
To make a folder fill up with files before the next folder is created, we have to cheat a little when generating a filename. This is the procedure (a sketch in code follows the list):

  • When the first upload volume is created, a “real” random filename is generated, for example duplicati-be879d9922aadf2796b558bf88e676bb8.dblock.zip.aes. The random part is e879d9922aadf2796b558bf88e676bb8.
  • To calculate the folder name for storing this file, we split the random part into 8 groups of 4 hexadecimal digits: e879 d992 2aad f279 6b55 8bf8 8e67 6bb8.
  • We calculate the sum of these parts: e879 + d992 + 2aad + f279 + 6b55 + 8bf8 + 8e67 + 6bb8 = 4d09d. The last 4 digits (d09d) are the checksum for this filename.
  • A new folder duplicati-d09d is created and the file is stored in this folder.
  • For the next filename, a random sequence of 28 characters is generated, for example 254d52eab3499613419edcd5eae2.
  • For this sequence, the checksum is calculated the same way: 254d + 52ea + b349 + 9613 + 419e + dcd5 + eae2 = 3cae8, checksum is cae8.
  • This checksum is subtracted from the first checksum (d09d). If the result is negative, 10000 (hexadecimal) is added. So, d09d - cae8 = 05b5. This value is appended to the 28-character random sequence, so the new filename will be duplicati-254d52eab3499613419edcd5eae205b5.dblock.zip.aes. This file has the same checksum as the first one (d09d) and will be stored in the same subfolder.
  • This way, zillions of semi-random filenames can be generated that have the same checksum and are thus stored in the same subfolder. From the random-looking filename, the correct folder name can be calculated without using any additional source (such as the local DB).
  • If the max-files-per-folder limit is reached, a new “real” random filename is generated, its checksum is calculated, and a new subfolder is created.
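
A small Python sketch of this procedure (the 4-digit checksum, the 28 + 4 character split and the folder naming follow the example above; nothing here is existing Duplicati code):

```python
import secrets

def checksum(hex_str: str) -> int:
    """Sum the 4-digit hex groups and keep only the last 4 hex digits (mod 0x10000)."""
    return sum(int(hex_str[i:i + 4], 16) for i in range(0, len(hex_str), 4)) & 0xFFFF

def first_random_name() -> tuple[str, int]:
    """Fully random 32-char name; its checksum selects the subfolder for this batch."""
    hex32 = secrets.token_hex(16)             # 32 random hex characters
    return hex32, checksum(hex32)

def next_name_with_checksum(target: int) -> str:
    """28 random hex chars plus a 4-digit correction so the checksum equals `target`."""
    hex28 = secrets.token_hex(14)             # 28 random hex characters
    correction = (target - checksum(hex28)) & 0xFFFF
    return hex28 + f"{correction:04x}"

def folder_for(hex32: str, prefix: str = "duplicati") -> str:
    """The subfolder can always be recomputed from the filename alone."""
    return f"{prefix}-{checksum(hex32):04x}"

# The example from the post: e879...6bb8 has checksum d09d, so it goes into duplicati-d09d,
# and every name generated with target 0xd09d lands in that same folder.
assert folder_for("e879d9922aadf2796b558bf88e676bb8") == "duplicati-d09d"
name, target = first_random_name()
assert folder_for(next_name_with_checksum(target)) == folder_for(name)
```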

Proof of concept:

I tested this approach using a Google worksheet. It can be opened by clicking this link.
The first tab (Calculation) demonstrates how filenames are generated. In steps 1-3, enter a prefix, a 32-char hex string and a 28-char hex string. The 28-char string is extended to a 32-char string with the same checksum.
In the second tab, you can enter a 32-char hex string or generate a random one. 1000 new random hex strings with the same checksum are generated automatically.
The third tab (Max files) can be used to calculate how much data can be stored for a given number of folders, files per folder and upload volume size.

Conclusion:

As far as I can see, introducing backend subfolders would drastically increase backup capacity for some widely used storage providers, without breaking compatibility with the current storage model. When running the first command (LIST), the storage model in use (with/without subfolders) can be detected easily.
Using the checksum-approach, filenames are still 99.99% random, but subfolders will be created and filled sequentially, resulting in a minimal amount of subfolders.
The only parts that are affected are the PUT, GET, DELETE and LIST operations: for each file, a checksum must be calculated to determine the correct remote path.
Restore operations using an older version of Duplicati that doesn’t support subfolders are still possible with an easy hack: move all remote files to a single folder and start the restore pointing to that folder.
Some additional commandline options could define the behaviour, like --use-remote-subfolders=on/off/auto and --max-files-per-remote-folder=n.
--use-remote-subfolders could default to auto. Every storage type could have its own preferred setting for turning it on or off, and its own preferred number of files per folder. The commandline options could override these defaults.

I’m aware this is all just theory and there is a good chance that I’m overlooking some important things. But in general terms, I guess it should be technically possible to implement this in Duplicati.


#2

In case you haven’t seen it, related issue here:


#3

Try OneDrive v2. Manual testing and user experience with OneDrive for Business suggest it avoids the 5000 limit. On the other hand, I don’t know if there’s some higher limit that it might run into if it keeps adding more files.

‘No files were found at the remote location’ using Onedrive for business
Failed Backup with id: 3 (Missing Files)


#4

Great write-up!

I haven’t heard much rumbling about LIST timeouts in a while but that doesn’t mean a performance boost wouldn’t be welcome. :slight_smile:

I need to re-read the checksum math to understand folder naming but from what I could tell there wasn’t anything missing.


#5

Thanks, didn’t notice that one.
However, all mentioned drawbacks are addressed by the checksum approach:

  • “the code would need to traverse all the folders to find all the files.”
    Using the checksum approach, the subfolder where each file is located can easily be calculated from the filename.
  • “To support both sub-folder and non-sub-folder setups, there needs to be a marker file inside the root folder.”
    A marker file is not needed: the initial LIST command can easily detect whether there are one or more <prefix>-*.dblock files in the root path. In that case, operations should start in non-subfolder mode. If the remote path contains one or more DLIST files and one or more <prefix>-xxxx folders, the operation should start in subfolder mode. If the remote location doesn’t contain any Duplicati backup files/folders, the operation should start in the default mode for the selected storage provider. (See the sketch after this list.)
  • “use a simple naming scheme, like “folder-1”. It would be simple to reconstruct the path from the local database”
    Using the checksum method, nothing needs to be changed in the local DB, because each filename has a direct relation to the name of the folder it is located in.
  • “the filename is “prefix-tXXYYZZNNNNNNN”, corresponding to the path “XX/YY/ZZ/prefix-tXXYYZZNNNNNNN”. That would leave a maximum of 256 folders on each level with a random distribution.”
    The problem with this approach is that it results in the creation of many subfolders with only one or a few files in each. The checksum approach creates one folder, which fills up with files until a predefined limit is reached. When the number of files reaches this limit, the next folder is created, resulting in a minimum number of folders (and a minimum number of required LIST commands in each operation).
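
To illustrate the marker-free detection described in the second bullet, here is a rough sketch (the mode names, the "/" suffix convention for folder entries and the function itself are assumptions, not an existing Duplicati API):

```python
def detect_storage_mode(listing: list[str], prefix: str = "duplicati") -> str:
    """Decide the storage mode from the LIST result of the remote root folder.

    `listing` holds entry names in the root; folder entries are assumed to end with "/".
    """
    has_root_dblocks = any(n.startswith(f"{prefix}-") and ".dblock." in n for n in listing)
    has_dlist = any(".dlist." in n for n in listing)
    has_subfolders = any(n.startswith(f"{prefix}-") and n.endswith("/") for n in listing)

    if has_root_dblocks:
        return "flat"               # keep working as Duplicati always did
    if has_dlist and has_subfolders:
        return "subfolders"
    return "backend-default"        # empty remote: use the provider's preferred default
```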

On top of that, the checksum method is quite backward-compatible. Existing backups can keep using the current (non subfolder) storage scheme and the local DB scheme is identical for both methods.
Restoring from remote subfolders using an old Duplicati version is even possible with minimal manual preparation, by moving all DBLOCK/DINDEX files to a single folder.

I’ve tried that; the backup failed as soon as the number of remote files reached 5000. I’m not sure which Duplicati version I used, but I’m pretty sure it was the OneDrive v2 backend.
In addition, it puts me off that I’m hardly able to access my remote files with anything other than Duplicati itself and the OneDrive web interface. Using WebDAV or the OneDrive client, accessing folders with more than 5000 files results in an error or an empty folder view.


#6

How does your design fare over time?

As retention kicks in older folders could become sparse/empty. Makes me wonder things like:

  • Should empty folders be deleted?
  • Should compacting join sparse folders?

#7

Using the LIST command executed at the beginning of each backup operation, Duplicati can inventory the number of files in each subfolder. That information could be used to fill the gaps caused by a prior COMPACT operation, beginning with the subfolder with the smallest number of missing files.
This results in a minimum number of subfolders and makes empty folders very unlikely.
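
As a sketch of how that target folder could be picked from the LIST inventory (the dictionary input, the limit and the function name are assumptions, not existing Duplicati code):

```python
def pick_target_folder(file_counts: dict[str, int], max_files: int = 5000) -> str | None:
    """Pick the subfolder to fill next: the fullest folder that still has room.

    `file_counts` maps subfolder name -> current file count (from the initial LIST).
    Returns None when every folder is full, so a new subfolder must be created.
    """
    candidates = {name: count for name, count in file_counts.items() if count < max_files}
    if not candidates:
        return None
    # The folder with the fewest missing files fills up first, keeping the folder count minimal.
    return max(candidates, key=candidates.get)

print(pick_target_folder({"duplicati-d09d": 5000, "duplicati-cae8": 4200, "duplicati-05b5": 150}))
# -> duplicati-cae8
```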

In the exceptional case that all files were deleted from a subfolder, it will automatically be filled with DBLOCK and DINDEX files as soon as other subfolders are 100% filled with backup files. Optionally empty folders could be deleted at the beginning of a backup job or during a COMPACT operation.

Using this approach, there is no need to join sparse folders. Merging folders would generate a lot of traffic, because all files from a subfolder must be downloaded, renamed and re-uploaded to the destination folder (Duplicati doesn’t support a MOVE command for backend operations). It would also result in unnecessary operations on the local DB to register the renamed filenames.


#8

Well now that you explain it, it’s obvious. :slight_smile:


#9

Nice writeup and clever solution! It sounds great to me.