I spent a weekend looking at the slow performance of the restore page when it has to list files and folders.
All examples compare the load time of the current Duplicati API against raw custom SQL queries, on a 40GB dataset with a 350MB database and some 74k files, running on a laptop with an SSD and an i5-4288U CPU.
The restore page uses the same API path for most of its actions. This is nice for the flexibility it provides, but it seems to cause horrible performance, because each call does everything instead of just what you want.
I’m dividing this post into a section for each call made by the frontend, where I’ll provide details on what’s called, what data we get, some performance stats, and my suggestion for a query that performs better.
Get restore points (backup sets)
/api/v1/backup/13/filesets
returns a list of json objects
{
Version: 0,
Time: "2018-03-09T22:06:31+01:00",
FileCount: -1,
FileSizes: -1
}
Time: 1.37 sec
Takes ~2 ms using query:
select
(select count(*) from Fileset b where b.Timestamp > Fileset.Timestamp) as Version,
strftime('%Y-%m-%dT%H:%M:%S', datetime(Timestamp, 'unixepoch')) as Timestamp,
'-1' as FileCount,
'-1' as FileSizes
from Fileset
order by Timestamp desc
returns list of rows:
{
Version: 0,
Timestamp: 2018-04-08T17:00:00,
FileCount: -1,
FileSizes: -1
}
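The proposed query can be sanity-checked against a throwaway SQLite database. The sketch below is only a stand-in: the Fileset table is simplified from the column names used in this post, not the real Duplicati schema, and the Version subquery counts newer filesets so the most recent backup comes out as Version 0.

```python
import sqlite3

# Simplified stand-in for Duplicati's Fileset table (schema assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Fileset (ID INTEGER PRIMARY KEY, Timestamp INTEGER)")
conn.executemany("INSERT INTO Fileset (ID, Timestamp) VALUES (?, ?)",
                 [(1, 1520629591),   # 2018-03-09 (UTC)
                  (2, 1523206800)])  # 2018-04-08 (UTC)

rows = conn.execute("""
    select
        (select count(*) from Fileset b where b.Timestamp > Fileset.Timestamp) as Version,
        strftime('%Y-%m-%dT%H:%M:%S', datetime(Timestamp, 'unixepoch')) as Timestamp,
        '-1' as FileCount,
        '-1' as FileSizes
    from Fileset
    order by Timestamp desc
""").fetchall()

for row in rows:
    print(row)  # newest fileset first, with Version 0
```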
Get filesystem folders
/api/v1/filesystem?onlyfolders=true&showhidden=true
returns a list of json objects, example object:
{
"text": "My Documents",
"id": "%MY_DOCUMENTS%",
"cls": "folder",
"iconCls": "x-tree-icon-special",
"check": false,
"leaf": false,
"resolvedpath": "/Users/rune",
"hidden": false,
"symlink": false
}
Time: 70 ms
Not worth optimizing, though I’m not sure it is even used during restore.
It seems to exist for translating between system shortcuts and their absolute paths, and its performance doesn’t degrade with backup size.
Get root folder
/api/v1/backup/17/files/*?prefix-only=true&folder-contents=false&time=2018-03-09T22%3A06%3A55%2B01%3A00
returns a list of files (1 file), a list of filesets (1 file set), and 3 info variables
{
"prefix-only": "true",
"folder-contents": "false",
"time": "2018-03-09T22:06:31+01:00",
"Filesets": [
{
"Version": 0,
"Time": "2018-03-09T22:06:31+01:00",
"FileCount": 68662,
"FileSizes": 44252728456
}
],
"Files": [
{
"Path": "/Users/rune",
"Sizes": []
}
]
}
Time: 4.75 sec
This query is ridiculous. It does all that work just to get the path used as input for the next query, making it take almost 10 seconds in total to view JUST the top-level folders.
This can be done by simply sorting the table by Path and getting the first row.
select Path from File order by Path limit 1
Takes 1 ms
returns
{
Path: /Users/rune/git/
}
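Because SQLite sorts text lexicographically, the shortest common-prefix path sorts first, which is what makes the one-row query work. A minimal sketch (assumed, simplified File table):

```python
import sqlite3

# Simplified stand-in for Duplicati's File table (schema assumed).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE File (ID INTEGER PRIMARY KEY, Path TEXT)")
conn.executemany("INSERT INTO File (Path) VALUES (?)", [
    ("/Users/rune/git/duplicati/README.md",),
    ("/Users/rune/git/",),
    ("/Users/rune/git/duplicati/",),
])

# The root folder is a prefix of every other path, so it sorts first.
root = conn.execute("select Path from File order by Path limit 1").fetchone()[0]
print(root)  # /Users/rune/git/
```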
Get files and folders in the root folder
/api/v1/backup/17/files/%2F?prefix-only=false&folder-contents=true&time=2018-03-09T22%3A06%3A55%2B01%3A00&filter=%2F
returns the same format. List of files (multiple), list of filesets (1), and 4 info variables
{
"prefix-only": "false",
"folder-contents": "true",
"time": "2018-03-09T22:06:55+01:00",
"filter": "/",
"Filesets": [
{
"Version": 0,
"Time": "2018-03-09T22:06:55+01:00",
"FileCount": 96554,
"FileSizes": 188312003061
}
],
"Files": [
{
"Path": "/etc/",
"Sizes": [
-1
]
},
{
"Path": "/home/",
"Sizes": [
-1
]
},
{
"Path": "/opt/",
"Sizes": [
-1
]
},
{
"Path": "/var/",
"Sizes": [
-1
]
}
]
}
Time: 4.96 sec
Why are we getting the file count and file sizes? They do not appear to be used in the frontend.
Query to get files and folders (no invisible files)
Takes 400-500 ms depending on path length (because we rely on regex matching)
select File.Path from Fileset
left join FilesetEntry on FilesetEntry.FilesetID = Fileset.ID
left join File on File.ID = FilesetEntry.FileID
where Timestamp = '1520629591' and Path REGEXP "^\/Users\/rune\/git\/duplicati\/[^\.]{1}[^\/]+\/?$"
Query to get visible and invisible folders without files
Takes 130 ms to just get a list of file paths
select File.Path from Fileset
left join FilesetEntry on FilesetEntry.FilesetID = Fileset.ID
left join File on File.ID = FilesetEntry.FileID
where BlocksetID < 0 and Timestamp = '1520629591' and Path REGEXP "^\/Users\/rune\/[^\/]+\/?$"
Explanation of the regex:
^ - start of path
/Users/rune/ - our root/current browsing path
[^\.]{1} - the file/folder name does NOT start with a "."
[^\/]+ - pick up anything until the next slash
\/? - expect 1 or 0 slashes (a file or a folder)
$ - the path must stop here (or else it’s in a sub directory)
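One practical note: SQLite has no built-in REGEXP implementation, so the host application must register one before queries like the above will run. The sketch below does that in Python and runs the visible-entries query against a simplified, assumed stand-in for the three tables (not the real Duplicati schema), browsing /Users/rune/ for illustration:

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# "X REGEXP Y" in SQLite calls the user function regexp(Y, X): pattern first.
conn.create_function("REGEXP", 2,
                     lambda pattern, value: re.search(pattern, value) is not None)

# Simplified stand-ins for the tables used in this post (schema assumed).
conn.executescript("""
    CREATE TABLE Fileset (ID INTEGER PRIMARY KEY, Timestamp INTEGER);
    CREATE TABLE File (ID INTEGER PRIMARY KEY, Path TEXT, BlocksetID INTEGER);
    CREATE TABLE FilesetEntry (FilesetID INTEGER, FileID INTEGER);
    INSERT INTO Fileset VALUES (1, 1520629591);
    INSERT INTO File VALUES
        (1, '/Users/rune/git/',           -100),  -- visible folder
        (2, '/Users/rune/.ssh/',          -100),  -- hidden: starts with "."
        (3, '/Users/rune/notes.txt',         7),  -- visible file
        (4, '/Users/rune/git/duplicati/', -100);  -- deeper level: filtered out
    INSERT INTO FilesetEntry VALUES (1, 1), (1, 2), (1, 3), (1, 4);
""")

rows = conn.execute(r"""
    select File.Path from Fileset
    left join FilesetEntry on FilesetEntry.FilesetID = Fileset.ID
    left join File on File.ID = FilesetEntry.FileID
    where Timestamp = '1520629591'
      and Path REGEXP '^\/Users\/rune\/[^\.]{1}[^\/]+\/?$'
""").fetchall()
print([r[0] for r in rows])  # only the visible direct children
```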
Expand a folder
/api/v1/backup/17/files/%2Fetc%2F?prefix-only=false&folder-contents=true&time=2018-03-09T22%3A06%3A55%2B01%3A00&filter=%2Fetc%2F
It’s the same function as listing the root folders, so same call, same query. But for the sake of completeness I ran the benchmarks anyway.
Time: 2.34 sec
Seems to be less for smaller sub folders, but it still takes 2+ seconds to return a couple of files in a sub folder.
Takes the same 400-500 ms, depending on path, using the proposed queries above.
Expand all subfolders (Frontend doesn’t really support this, just an example)
Takes 400-700 ms depending on paths and sub items
select File.Path from Fileset
left join FilesetEntry on FilesetEntry.FilesetID = Fileset.ID
left join File on File.ID = FilesetEntry.FileID
where Timestamp = '1520629200' and Path REGEXP "^\/Users\/rune\/git\/duplicati\/[^\/]+"
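The only difference from the earlier patterns is that the trailing \/?$ anchors are gone, so the regex matches everything beneath the prefix, at any depth. A minimal sketch (same assumed REGEXP registration, joins omitted for brevity):

```python
import re
import sqlite3

conn = sqlite3.connect(":memory:")
# SQLite needs a user-registered REGEXP function (pattern is the first arg).
conn.create_function("REGEXP", 2,
                     lambda pattern, value: re.search(pattern, value) is not None)
conn.execute("CREATE TABLE File (ID INTEGER PRIMARY KEY, Path TEXT)")
conn.executemany("INSERT INTO File (Path) VALUES (?)", [
    ("/Users/rune/git/duplicati/README.md",),
    ("/Users/rune/git/duplicati/Duplicati/Library/Main/",),
    ("/Users/rune/git/other/",),  # outside the prefix: no match
])

# Without a trailing "$", every path under the prefix matches.
rows = conn.execute(
    r"select Path from File where Path REGEXP '^\/Users\/rune\/git\/duplicati\/[^\/]+'"
).fetchall()
print([r[0] for r in rows])
```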
Conclusions
So to load the restore page by default showing the visible files and folders in the root folder:
Duplicati queries:
get fileset - 1.37sec
get filesystem - 70ms
get root folder - 4.75 sec
get files and folders in root folder - 4.96 sec
total - 11.15 sec
Raw queries:
get fileset - 1ms
get filesystem - 70ms
get root folder - 1ms
get files and folders in root folder - 500ms
total - 572 ms
We’re looking at around a 90-95% reduction in the time to get the data required to display the default view.
Additionally, we’re looking at close to an 80% reduction in the time to open sub folders.
It should be noted that this is a slightly unfair comparison, since the Duplicati timings are measured from HTTP request to HTTP response, including time spent logging, while the raw queries measure only the time to fetch the data.
Nonetheless, I think it proves the point: there is a lot of performance to gain by moving from generic queries to more specialized ones.
The /api/v1/backup/17/files/ API call ends up in these methods:
Duplicati/Library/Main/Operation/ListFilesHandler.cs#L27
Duplicati/Library/Main/Controller.cs#L313
These are called by the commandline as well as the web API, which may explain why they return too much data.
I think the best option is to write some new API paths using customized queries to provide just the data required by the frontend. That being said, I haven’t had time to play with the actual code implementation, so this is still very much theoretical.