Logic and use case for remote subfolders


#1

The problem:

@jcbadmin’s question made me think about how using subfolders at the remote location could be useful. As the uploaded files have no direct relation to the source folder structure, there is no added value in creating (sub)folders at the backup location.

However, distributing backup files across a set of subfolders could improve Duplicati in one important way.
The default size of a DBLOCK file is 50 MB, and each DBLOCK file is accompanied by a small DINDEX file. For larger backup sources, this results in a long list of files in a single folder, which has several drawbacks:

  • Many files in a single folder can degrade performance if the backend file system doesn’t handle large directories well.
  • Manual operations, like copying files to other storage media (from/to a USB disk when migrating to another storage provider), are difficult to split into a number of sub-operations.
  • Some storage providers (Microsoft OneDrive, SharePoint) have a limit on the number of files per folder. In OneDrive/SharePoint you can store up to 5000 files per folder, so with the default volume size you can back up roughly 200 GB per backup job. If the number of remote files exceeds the 5000 limit, the backup fails and it’s not easy to recover from that. Unless you use some hacks, OneDrive and SharePoint are not suitable for larger backups.
    That’s too bad: I have a 5 TB storage quota in my Office 365 OneDrive account, and for my backup of video files I’m forced to use a gigantic 2 GB upload volume size.

A possible solution:

According to the documentation, Duplicati requires only these 4 operations for any backend: GET, PUT, LIST, DELETE. The CREATEFOLDER operation is not listed there, but works for almost any backend: Duplicati can already create a folder at the backend (in the Add backup wizard, step 2: if a path does not exist, Duplicati can create it for you).

If Duplicati could create subfolders and distribute DBLOCK and DINDEX files across these folders, large backups would be possible, even with storage providers that allow only a limited number of files per folder.
For example, Microsoft OneDrive backends limit the number of entries per folder to 5000. With 3000 folders containing 3000 files each, more than 200 TB could be stored for a single backup job using the default 50 MB volume size! (1 DBLOCK plus 1 DINDEX is about 50 MB, so the average file size is about 25 MB. 3000 folders × 3000 files = 9,000,000 files; 9,000,000 × 25 MB = 225,000,000 MB, roughly 225 TB.)
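
A quick back-of-the-envelope check of that figure, as a sketch (the 3000 × 3000 layout is just the example above, chosen to stay below the 5000-entry limit; this is not Duplicati code):

```python
# Rough capacity estimate for the subfolder scheme (example figures from this post).
folders = 3000            # subfolders at the remote location
files_per_folder = 3000   # DBLOCK + DINDEX files per subfolder, safely below the 5000 limit
avg_file_size_mb = 25     # a ~50 MB DBLOCK plus its tiny DINDEX averages ~25 MB per file

total_files = folders * files_per_folder      # 9,000,000 files
total_mb = total_files * avg_file_size_mb     # 225,000,000 MB
print(f"{total_files:,} files, ~{total_mb // 1_000_000} TB")  # ~225 TB
```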

The requirements:

If subfolder support were implemented, these requirements should be met:

  • Current backup jobs should keep working without changing anything to the backend files.
  • Restore operations should work with any version of Duplicati that supports subfolders, regardless of whether the backup location contains subfolders or not.
  • Minimal changes (or no changes) to the local database: Duplicati should “know” where a certain file can be found. This is not trivial, because DBLOCK and DINDEX filenames contain a random sequence of hexadecimal characters.
  • Backward compatibility: it should somehow be possible to do restore operations from subfolder-enabled backends using older versions of Duplicati (with some manual actions).

Preparation:

Before a backup operation starts, Duplicati performs a LIST command to see if the remote location is reachable and contains the backup files that it expects. In this phase, Duplicati could check for the existence of subfolders.
If the remote location is empty (initial backup), Duplicati should create subfolders to store the files in groups. An optional advanced option, like --use-subfolders=on/off/auto, could change the default behaviour.
If the remote location contains one or more DBLOCK and DINDEX files in the root, Duplicati should not create subfolders and keep working as it always did.
If the remote location contains one or more DLIST files and a number of subfolders, the backup job should use subfolders.

Approach 1: simple and powerful

The first approach that comes to mind is to create a subfolder for all files whose names share the same first 2 characters after the prefix. For example: duplicati-bc231488dbb5e8b1a79a4ed0c1d82fb35.dblock.zip.aes could be stored in /duplicati-bc (duplicati is the default prefix that’s also used for files).
You could even create a deeper folder structure, so this file could also be stored in /duplicati-b/duplicati-c, or in /duplicati-bc/duplicati-23.
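
A minimal sketch of how the subfolder could be derived under this approach (the function and folder naming are illustrative, not existing Duplicati code):

```python
def subfolder_for(filename: str, prefix: str = "duplicati") -> str:
    """Derive the subfolder name from the first 2 characters after the prefix."""
    rest = filename[len(prefix) + 1:]   # the part after "duplicati-", e.g. "bc231488..."
    return f"{prefix}-{rest[:2]}"       # e.g. "duplicati-bc"

print(subfolder_for("duplicati-bc231488dbb5e8b1a79a4ed0c1d82fb35.dblock.zip.aes"))
# -> duplicati-bc
```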

This approach meets most requirements (the local DB structure can stay unchanged, it is backward compatible, and restore operations work for both storage layouts). It should be relatively easy to implement, and calculating the correct subfolder from the filename is fast and straightforward.

However, there’s one major drawback that makes this approach unusable: with filenames containing random hex sequences, the number of folders grows very fast and each folder will most likely contain only a very small number of files. When using subfolders for filenames with the same first 2 digits and a sub-subfolder for each file with the same 3rd/4th hex digits, you end up with more than 65000 folders, resulting in 65000 LIST operations for each backup/restore/compact/repair operation.

Approach 2: Filenames with increasing numbers.

Start with duplicati-00000000000000000000000000000001.dblock.aes and increase this number for each new file. Store the file in a folder named duplicati-000001-005000 until the max-files-per-folder limit (5000) is reached. For file 5001, a new folder named duplicati-005001-010000 is created, and so on.
The first part of the hex sequence could stay random.
Advantages: each subfolder fills up until the maximum is reached, and it’s quite easy to calculate the next number to be used.
Disadvantage: part of the hex sequence is no longer random.
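
A sketch of the folder naming for this approach (the folder-name format and the 5000 limit follow the example above; this is not existing Duplicati code):

```python
def folder_for_sequence(n: int, files_per_folder: int = 5000) -> str:
    """Return the folder name for the n-th uploaded file (1-based numbering)."""
    first = ((n - 1) // files_per_folder) * files_per_folder + 1
    last = first + files_per_folder - 1
    return f"duplicati-{first:06d}-{last:06d}"

print(folder_for_sequence(1))     # duplicati-000001-005000
print(folder_for_sequence(5001))  # duplicati-005001-010000
```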

Approach 3: Best of both worlds, semi-random numbers

A slightly less straightforward approach combines the first two approaches. Effectively this results in a folder with a random number, containing files with (near) random numbers. Each folder fills up until the number of files reaches the per-folder limit. The location of a file can be calculated from its filename using a relatively easy procedure, without using the local DB.
To make a folder fill up with files before the next folder is created, we have to cheat a little when generating a filename. This is the procedure (a sketch in code follows the list):

  • When the first upload volume is created, a “real” random filename is generated, for example duplicati-be879d9922aadf2796b558bf88e676bb8.dblock.zip.aes. The random part is e879d9922aadf2796b558bf88e676bb8.
  • To calculate the folder name for storing this file, we split the random part into 8 groups of 4 hexadecimal digits: e879 d992 2aad f279 6b55 8bf8 8e67 6bb8.
  • We calculate the sum of these parts: e879 + d992 + 2aad + f279 + 6b55 + 8bf8 + 8e67 + 6bb8 = 4d09d. The last 4 digits (d09d) are the checksum for this filename.
  • A new folder duplicati-d09d is created and the file is stored in this folder.
  • For the next filename, a random sequence of 28 characters is generated, for example 254d52eab3499613419edcd5eae2.
  • For this sequence, the checksum is calculated the same way: 254d + 52ea + b349 + 9613 + 419e + dcd5 + eae2 = 3cae8, checksum is cae8.
  • This checksum is subtracted from the first checksum (d09d). If the result is negative, 10000 (hexadecimal) is added. So, d09d - cae8 = 05b5. This value is appended to the 28-character random sequence, so the new filename will be duplicati-254d52eab3499613419edcd5eae205b5.dblock.zip.aes. This file has the same checksum as the first one (d09d) and will be stored in the same subfolder.
  • This way, zillions of semi-random filenames can be generated that have the same checksum and are thus stored in the same subfolder. From the random-looking filename, the correct folder name can be calculated without using any additional source (such as the local DB).
  • If the max-files-per-folder limit is reached, a new “real” random filename is generated, its checksum is calculated, and a new subfolder is created.
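
A small Python sketch of this procedure (the 4-digit checksum, the 28 + 4 character split and the folder naming follow the example above; nothing here is existing Duplicati code):

```python
import secrets

def checksum(hex_str: str) -> int:
    """Sum the 4-digit hex groups and keep only the last 4 hex digits (mod 0x10000)."""
    return sum(int(hex_str[i:i + 4], 16) for i in range(0, len(hex_str), 4)) & 0xFFFF

def first_random_name() -> tuple[str, int]:
    """Fully random 32-char name; its checksum selects the subfolder for this batch."""
    hex32 = secrets.token_hex(16)             # 32 random hex characters
    return hex32, checksum(hex32)

def next_name_with_checksum(target: int) -> str:
    """28 random hex chars plus a 4-digit correction so the checksum equals `target`."""
    hex28 = secrets.token_hex(14)             # 28 random hex characters
    correction = (target - checksum(hex28)) & 0xFFFF
    return hex28 + f"{correction:04x}"

def folder_for(hex32: str, prefix: str = "duplicati") -> str:
    """The subfolder can always be recomputed from the filename alone."""
    return f"{prefix}-{checksum(hex32):04x}"

# The example from the post: e879...6bb8 has checksum d09d, so it goes into duplicati-d09d,
# and every name generated with target 0xd09d lands in that same folder.
assert folder_for("e879d9922aadf2796b558bf88e676bb8") == "duplicati-d09d"
name, target = first_random_name()
assert folder_for(next_name_with_checksum(target)) == folder_for(name)
```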

Proof of concept:

I tested this approach using a Google worksheet. It can be opened by clicking this link.
The first tab (Calculation) demonstrates how filenames are generated. In steps 1-3, enter a prefix, a 32-char hex string and a 28-char hex string. The 28-char string is extended to a 32-char string with the same checksum.
In the second tab, you can enter a 32-char hex string or generate a random one. 1000 new random hex strings with the same checksum are generated automatically.
The third tab (Max files) can be used to calculate how much data can be stored for a given number of folders, files per folder and upload volume size.

Conclusion:

As far as I can see, introducing backend subfolders would drastically increase backup capacity for some widely used storage providers, without breaking compatibility with the current storage model. When running the first command (LIST), the storage model in use (with/without subfolders) can be detected easily.
Using the checksum-approach, filenames are still 99.99% random, but subfolders will be created and filled sequentially, resulting in a minimal amount of subfolders.
The only parts that are affected are the PUT, GET, DELETE and LIST operations: for each file, a checksum must be calculated to determine the correct remote path.
Restore operations using an older version of Duplicati that doesn’t support subfolders are still possible with an easy hack: move all remote files to a single folder and start the restore pointing to that folder.
Some additional commandline options could define the behaviour, like --use-remote-subfolders=on/off/auto and --max-files-per-remote-folder=n.
--use-remote-subfolders could default to auto. Every storage type could have its own preferred setting for turning it on or off, and its own preferred number of files per folder. The commandline options could override these defaults.

I’m aware this is all just theory and there is a good chance that I’m overlooking some important things. But in general terms, I guess it should be technically possible to implement this in Duplicati.


#2

In case you haven’t seen it, related issue here:


#3

Try OneDrive v2. Manual testing and user experience with OneDrive for Business suggest it avoids the 5000 limit. On the other hand, I don’t know if there’s some higher limit that it might run into if it keeps adding more files.

‘No files were found at the remote location’ using Onedrive for business
Failed Backup with id: 3 (Missing Files)


#4

Great write-up!

I haven’t heard much rumbling about LIST timeouts in a while but that doesn’t mean a performance boost wouldn’t be welcome. :slight_smile:

I need to re-read the checksum math to understand folder naming but from what I could tell there wasn’t anything missing.


#5

Thanks, didn’t notice that one.
However, all mentioned drawbacks are addressed by the checksum approach:

  • “the code would need to traverse all the folders to find all the files.”
    Using the checksum approach, the subfolder where each file is located can easily be calculated from the filename.
  • “To support both sub-folder and non-sub-folder setups, there needs to be a marker file inside the root folder.”
    A marker file is not needed: the initial LIST command can easily detect whether there are one or more <prefix>-*.dblock files in the root path. In that case, operations should start in non-subfolder mode. If the remote path contains one or more DLIST files and one or more <prefix>-xxxx folders, the operation should start in subfolder mode. If the remote location doesn’t contain any Duplicati backup files/folders, the operation should start in the default mode for the selected storage provider. (See the sketch after this list.)
  • “use a simple naming scheme, like “folder-1”. It would be simple to reconstruct the path from the local database”
    Using the checksum method, nothing needs to be changed in the local DB, because each filename has a direct relation to the name of the folder it is located in.
  • “the filename is “prefix-tXXYYZZNNNNNNN”, corresponding to the path “XX/YY/ZZ/prefix-tXXYYZZNNNNNNN”. That would leave a maximum of 256 folders on each level with a random distribution.”
    The problem with this approach is that it results in the creation of many subfolders with only one or a few files in each. The checksum approach creates one folder, which fills up with files until a predefined limit is reached. When the number of files reaches this limit, the next folder is created, resulting in a minimum number of folders (and a minimum number of required LIST commands in each operation).
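
To illustrate the marker-free detection described in the second bullet, here is a rough sketch (the mode names, the "/" suffix convention for folder entries and the function itself are assumptions, not an existing Duplicati API):

```python
def detect_storage_mode(listing: list[str], prefix: str = "duplicati") -> str:
    """Decide the storage mode from the LIST result of the remote root folder.

    `listing` holds entry names in the root; folder entries are assumed to end with "/".
    """
    has_root_dblocks = any(n.startswith(f"{prefix}-") and ".dblock." in n for n in listing)
    has_dlist = any(".dlist." in n for n in listing)
    has_subfolders = any(n.startswith(f"{prefix}-") and n.endswith("/") for n in listing)

    if has_root_dblocks:
        return "flat"               # keep working as Duplicati always did
    if has_dlist and has_subfolders:
        return "subfolders"
    return "backend-default"        # empty remote: use the provider's preferred default
```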

On top of that, the checksum method is quite backward-compatible. Existing backups can keep using the current (non subfolder) storage scheme and the local DB scheme is identical for both methods.
Restoring from remote subfolders using an old Duplicati version is even possible with minimal manual preparation, by moving all DBLOCK/DINDEX files to a single folder.

I’ve tried that; the backup failed as soon as the number of remote files reached 5000. I’m not sure which Duplicati version I used, but I’m pretty sure it was the OneDrive v2 backend.
In addition, it puts me off that I’m hardly able to access my remote files with anything other than Duplicati itself and the OneDrive web interface. Using WebDAV or the OneDrive client, accessing folders with more than 5000 files results in an error or an empty folder view.


#6

How does your design fare over time?

As retention kicks in older folders could become sparse/empty. Makes me wonder things like:

  • Should empty folders be deleted?
  • Should compacting join sparse folders?

#7

Using the LIST command executed at the beginning of each backup operation, Duplicati can inventory the number of files in each subfolder. That information could be used to fill the gaps caused by a prior COMPACT operation, beginning with the subfolder with the smallest number of missing files.
This results in a minimum number of subfolders and makes empty folders very unlikely.
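
As a sketch of how that target folder could be picked from the LIST inventory (the dictionary input, the limit and the function name are assumptions, not existing Duplicati code):

```python
def pick_target_folder(file_counts: dict[str, int], max_files: int = 5000) -> str | None:
    """Pick the subfolder to fill next: the fullest folder that still has room.

    `file_counts` maps subfolder name -> current file count (from the initial LIST).
    Returns None when every folder is full, so a new subfolder must be created.
    """
    candidates = {name: count for name, count in file_counts.items() if count < max_files}
    if not candidates:
        return None
    # The folder with the fewest missing files fills up first, keeping the folder count minimal.
    return max(candidates, key=candidates.get)

print(pick_target_folder({"duplicati-d09d": 5000, "duplicati-cae8": 4200, "duplicati-05b5": 150}))
# -> duplicati-cae8
```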

In the exceptional case that all files were deleted from a subfolder, it will automatically be filled with DBLOCK and DINDEX files as soon as other subfolders are 100% filled with backup files. Optionally empty folders could be deleted at the beginning of a backup job or during a COMPACT operation.

Using this approach, there is no need to join sparse folders. Merging folders would generate a lot of traffic, because all files from a subfolder must be downloaded, renamed and re-uploaded to the destination folder (Duplicati doesn’t support a MOVE command for backend operations). It would also result in unnecessary operations on the local DB to register the renamed filenames.


#8

Well now that you explain it, it’s obvious. :slight_smile:


#9

Nice writeup and clever solution! It sounds great to me.