The problem:
@jcbadmin’s question made me think about how using subfolders at the remote location could be useful. As the uploaded files have no direct relation to the source folder structure, there is no added value in creating (sub)folders at the backup location.
However, distributing backup files across a set of subfolders could improve Duplicati in one important way.
The default size of a DBLOCK file is 50 MB. Each DBLOCK file is accompanied by a small DINDEX file. For larger backup sources, this results in a long list of files in a single folder, which has several drawbacks:
- Many files in a single folder can slow down performance if the backend file system does not handle large numbers of files in one folder well.
- For manual operations, like copying files to other storage media (from/to a USB disk when migrating to another storage provider), it’s difficult to split the work into a number of smaller sub-operations.
- Some storage providers (Microsoft OneDrive, SharePoint) have a limit on the number of files per folder. In OneDrive/SharePoint, you can store up to 5000 files per folder, so with the default volume size you can back up only about 125 GB per backup job (2500 DBLOCK/DINDEX pairs of roughly 50 MB each). If the number of remote files exceeds the 5000 limit, the backup fails and it’s not easy to recover from that. Unless you use some hacks, OneDrive and SharePoint are not suitable for larger backups.
That’s too bad: I have a 5 TB storage quota in my Office 365 OneDrive account. For my backup of video files, I’m forced to use a gigantic 2 GB upload volume size.
A possible solution:
According to the documentation, Duplicati requires only these 4 operations for any backend: GET, PUT, LIST, DELETE. The CREATEFOLDER operation is not listed there, but works for almost any backend: Duplicati can already create a folder at the backend (in the Add backup wizard, step 2: if a path does not exist, Duplicati can create it for you).
If Duplicati could create subfolders and distribute DBLOCK and DINDEX files across these folders, large backups would be possible, even with storage providers that allow only a limited number of files per folder.
For example, Microsoft OneDrive limits the number of entries per folder to 5000. With 3000 folders of 3000 files each, more than 200 TB can be stored for a single backup job using the default 50 MB volume size (1 DBLOCK and 1 DINDEX together are about 50 MB, so the average file size is about 25 MB; 3000 folders × 3000 files = 9,000,000 files; 9,000,000 × 25 MB ≈ 225 TB).
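For reference, the arithmetic above as a tiny Python calculation (the figures are the example numbers from this post, not measured values):

```python
# Back-of-the-envelope capacity check using the example numbers above.
volume_size_mb = 50                     # default DBLOCK volume size
avg_file_size_mb = volume_size_mb / 2   # a DBLOCK/DINDEX pair is ~50 MB, so ~25 MB per file
folders = 3000
files_per_folder = 3000

total_files = folders * files_per_folder        # 9,000,000 files
total_mb = total_files * avg_file_size_mb       # 225,000,000 MB
print(f"{total_mb / 1_000_000:.0f} TB")         # ~225 TB for a single backup job
```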
The requirements:
If subfolders were implemented, these requirements should be met:
- Current backup jobs should keep working without changing anything to the backend files.
- Restore operations should work with any version of Duplicati that supports subfolders, regardless of whether the backup location contains subfolders or not.
- Minimal changes (or no changes) to the local database: Duplicati should “know” where a certain file can be found. This is not trivial, because DBLOCK and DINDEX filenames contain a random sequence of hexadecimal characters.
- Backward compatibility: somehow you have to be able to do restore operations from subfolder-enabled backends using older versions of Duplicati (with some manual actions).
Preparation:
Before a backup operation starts, Duplicati performs a LIST command to see if the remote location is reachable and contains the backup files that it expects. In this phase, Duplicati could check for the existence of subfolders.
If the remote location is empty (initial backup), Duplicati should create subfolders to store files groupwise. An advanced option, like --use-subfolders=on/off/auto, could change the default behaviour.
If the remote location contains one or more DBLOCK and DINDEX files in the root, Duplicati should not create subfolders and should keep working as it always has.
If the remote location contains one or more DLIST files and a number of subfolders, the backup job should use subfolders.
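A minimal sketch of what that detection could look like, written in Python for readability; the (name, is_folder) LIST result and the --use-subfolders option are assumptions from this proposal, not Duplicati’s actual backend interface:

```python
def use_remote_subfolders(remote_entries, use_subfolders="auto"):
    """Decide whether this backup job should use remote subfolders.

    remote_entries: result of the initial LIST call, assumed here to be
    a list of (name, is_folder) tuples.
    use_subfolders: value of the hypothetical --use-subfolders option.
    """
    if not remote_entries:
        # Empty remote location (initial backup): use subfolders unless disabled.
        return use_subfolders != "off"

    has_data_files_in_root = any(
        not is_folder and (".dblock." in name or ".dindex." in name)
        for name, is_folder in remote_entries
    )
    has_subfolders = any(is_folder for _, is_folder in remote_entries)

    if has_data_files_in_root:
        return False   # existing flat layout: keep working as before
    if has_subfolders:
        return True    # DLIST files plus subfolders: subfolder layout
    return use_subfolders == "on"
```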
Approach 1: simple and powerful
The first approach that comes to mind is to create a subfolder for all files that start with the same first 2 characters of the random part. For example, duplicati-bc231488dbb5e8b1a79a4ed0c1d82fb35.dblock.zip.aes could be stored in /duplicati-bc (duplicati is the default prefix that’s also used for files). You can even create a deeper folder structure, so this file could also be stored in /duplicati-b/duplicati-c, or in /duplicati-bc/duplicati-23.
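A small sketch of how the subfolder could be derived from the filename in this approach (the function name and the levels parameter are just illustration, using 2 characters per folder level):

```python
def approach1_folder(filename: str, prefix: str = "duplicati", levels: int = 1) -> str:
    """Approach 1: name the subfolder(s) after the first hex characters
    of the file's random part, 2 characters per folder level."""
    random_part = filename[len(prefix) + 1:]          # strip "duplicati-"
    return "/".join(
        f"{prefix}-{random_part[2 * i:2 * i + 2]}" for i in range(levels)
    )

name = "duplicati-bc231488dbb5e8b1a79a4ed0c1d82fb35.dblock.zip.aes"
print(approach1_folder(name))            # duplicati-bc
print(approach1_folder(name, levels=2))  # duplicati-bc/duplicati-23
```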
This approach meets most requirements (Local DB structure can stay unchanged, backward compatible, restore operations from both storage methods possible). It should be relatively easy to implement and calculating the correct subfolder from the filename is fast and straightforward.
However, there’s one major drawback that makes this approach unusable: with filenames containing random hex sequences, the number of folders grows very fast and each folder will most likely contain only a very small number of files. When using subfolders for filenames with the same first 2 characters and a sub-subfolder for each file with the same 3rd/4th hex characters, you end up with more than 65000 folders, resulting in 65000 LIST operations for each backup/restore/compact/repair operation.
Approach 2: Filenames with increasing numbers.
Start with duplicati-00000000000000000000000000000001.dblock.aes and increase this number for each new file. Store the file in a folder named duplicati-000001-005000 until the max-files-per-folder limit (5000) is reached. For file 5001, a new folder named duplicati-005001-010000 is created, and so on.
The first part of the hex sequence could stay random; only the trailing part needs to increase.
Advantages: each subfolder fills up until the maximum is reached, and it’s quite easy to calculate the next number to be used.
Disadvantage: part of the hex sequence is not random anymore.
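A sketch of the folder calculation for this approach, assuming a simple sequential volume counter (the range-based folder naming is illustrative, not an existing Duplicati convention):

```python
def approach2_folder(seq: int, files_per_folder: int = 5000, prefix: str = "duplicati") -> str:
    """Approach 2: map a sequential volume number to a range-named folder.

    Files 1..5000 go to duplicati-000001-005000, files 5001..10000 to
    duplicati-005001-010000, and so on."""
    first = ((seq - 1) // files_per_folder) * files_per_folder + 1
    last = first + files_per_folder - 1
    return f"{prefix}-{first:06d}-{last:06d}"

print(approach2_folder(1))      # duplicati-000001-005000
print(approach2_folder(5001))   # duplicati-005001-010000
```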
Approach 3: Best of both worlds, semi-random numbers
A slightly less straightforward approach combines the first two approaches. Effectively this results in folders with random-looking names, containing files with (near) random names. Each folder fills up until the number of files reaches the per-folder limit. The location of a file can be calculated from its filename using a relatively simple procedure, without using the local DB.
To make a folder fill up with files before the next folder is created, we have to cheat a little bit when generating a filename. This is the procedure (a code sketch follows the list):
- When the first upload volume is created, a “real” random filename is generated, for example duplicati-be879d9922aadf2796b558bf88e676bb8.dblock.zip.aes. The random part is e879d9922aadf2796b558bf88e676bb8.
- To calculate the folder name for storing this file, we split the random part into 8 groups of 4 hexadecimal digits: e879 d992 2aad f279 6b55 8bf8 8e67 6bb8.
- We calculate the total of these parts: e879 + d992 + 2aad + f279 + 6b55 + 8bf8 + 8e67 + 6bb8 = 4d09d. The last 4 digits (d09d) are the checksum for this filename.
- A new folder duplicati-d09d is created and the file is stored in this folder.
- For the next filename, a random sequence of 28 characters is generated, for example 254d52eab3499613419edcd5eae2.
- For this sequence, the checksum is calculated the same way: 254d + 52ea + b349 + 9613 + 419e + dcd5 + eae2 = 3cae8, so the checksum is cae8.
- This checksum is subtracted from the first checksum (d09d). If the result is negative, 10000 is added to the result. So d09d - cae8 = 05b5. This value is appended to the 28-character random sequence, making the new filename duplicati-254d52eab3499613419edcd5eae205b5.dblock.zip.aes. This file has the same checksum as the first one (d09d) and will be stored in the same subfolder.
- This way, zillions of semi-random filenames can be generated that all have the same checksum and are thus stored in the same subfolder. From the random-looking filename, the correct folder name can be calculated without using any additional source (such as the local DB).
- If the max-files-per-folder limit is reached, a new “real” random filename is generated, its checksum is calculated and a new subfolder is created.
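To make the procedure concrete, here is a minimal Python sketch of the checksum scheme described above. The function names are mine and nothing here comes from the Duplicati codebase; it only reproduces the worked example:

```python
import secrets

def checksum4(hex32: str) -> int:
    """Sum the 8 groups of 4 hex digits; keep only the last 4 hex digits."""
    return sum(int(hex32[i:i + 4], 16) for i in range(0, 32, 4)) & 0xFFFF

def folder_for(random_part: str, prefix: str = "duplicati") -> str:
    """Folder name derived purely from the 32-char random part of a filename."""
    return f"{prefix}-{checksum4(random_part):04x}"

def next_random_part(target_checksum: int) -> str:
    """Generate a new 32-char random part with the given checksum.

    The first 28 characters are random; the last 4 are chosen so that the
    group sum lands on the target checksum (adding 0x10000 when the
    subtraction would go negative, i.e. working modulo 0x10000)."""
    head = secrets.token_hex(14)                                            # 28 random hex chars
    partial = sum(int(head[i:i + 4], 16) for i in range(0, 28, 4)) & 0xFFFF
    tail = (target_checksum - partial) & 0xFFFF
    return f"{head}{tail:04x}"

# The worked example from the list above:
first = "e879d9922aadf2796b558bf88e676bb8"
print(folder_for(first))                       # duplicati-d09d

second = next_random_part(checksum4(first))
print(folder_for(second))                      # duplicati-d09d (same subfolder)
```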
Proof of concept:
I tested this approach using a Google worksheet. It can be opened by clicking this link.
The first tab (Calculation) demonstrates how filenames are generated. In steps 1-3, enter a prefix, a 32-character hex string and a 28-character hex string. The 28-character string is extended to a 32-character string with the same checksum.
In the second tab, you can enter a 32-character hex string or generate a random one. 1000 new random hex strings with the same checksum are generated automatically.
The third tab (Max files) can be used to calculate how much data can be stored for a given number of folders, files per folder and upload volume size.
Conclusion:
As far as I can see, introducing backend subfolders would drastically increase backup capacity for some widely used storage providers, without breaking compatibility with the current storage model. When running the first command (LIST), the storage model in use (with or without subfolders) can be detected easily.
Using the checksum approach, filenames are still largely random (only the last 4 of the 32 hex characters are derived), but subfolders will be created and filled sequentially, resulting in a minimal number of subfolders.
The only parts that are affected are the PUT, GET, DELETE and LIST operations: for each file, a checksum must be calculated to determine the correct remote path.
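Building on the checksum sketch above, resolving the remote path for those operations could be a small wrapper around the existing backend calls (hypothetical helper, reusing folder_for from the earlier sketch; DLIST files would stay in the root, as assumed in the Preparation section):

```python
def remote_path(filename: str, prefix: str = "duplicati") -> str:
    """Map a DBLOCK/DINDEX filename to its remote path under a subfolder."""
    # Skip the leading "b" after "duplicati-", matching the example above
    # where the 32-char random part starts after it.
    random_part = filename.split("-", 1)[1][1:33]
    return f"{folder_for(random_part, prefix)}/{filename}"

print(remote_path("duplicati-be879d9922aadf2796b558bf88e676bb8.dblock.zip.aes"))
# duplicati-d09d/duplicati-be879d9922aadf2796b558bf88e676bb8.dblock.zip.aes
```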
Restore operations using an older version of Duplicati that doesn’t support subfolders are still possible with an easy hack: move all remote files to a single folder and start the restore pointing to that folder.
Some additional command line options could define the behaviour, like --use-remote-subfolders=on/off/auto and --max-files-per-remote-folder=n. --use-remote-subfolders could have a default setting of auto. Every storage type could have its own preferred setting for turning it on or off, or a specific number of files per folder; the command line options could overrule that setting.
I’m aware this is all just theory and there is a good chance that I’m overlooking some important things. But in general terms, I guess it should be technically possible to implement this in Duplicati.