Optimizing load speed on the restore page

Pectojin · April 10, 2018, 7:00pm

I spent a weekend looking at the slow performance of the restore page when it has to list files and folders.

All examples compares performance of current Duplicati API load time vs raw custom SQL queries on a 40GB dataset, a 350MB database, and some 74k files. Running on a laptop with SSD and an i5 4288U CPU.

The restore page uses the same API path for most of it’s actions. This is nice for the flexibility it provides, but it seems to cause horrible performance because each call does everything instead of just what you want.

I’m dividing this post up in a section for each call made by the frontend where I’ll provide details on what’s called, what data we get, some performance stats, and my suggestions for a query that performs better.

Get restore points (backup sets)

/api/v1/backup/13/filesets
returns a list of json objects

{
    Version: 0,
    Time: "2018-03-09T22:06:31+01:00",
    FileCount: -1,
    FileSizes: -1
}

Time: 1.37 sec

Takes ~2 ms using query:

select 
(select count(*) from Fileset b where Fileset.id >= b.id) as Version, 
strftime('%Y-%m-%dT%H:%M:%S', datetime(Timestamp, 'unixepoch')) as Timestamp,
'-1' as FileCount,
'-1' as FileSizes
from Fileset
order by Timestamp desc

returns list of rows:

{
  Version: 0,
  Timestamp: 2018-04-08T17:00:00,
  FileCount: -1,
  FileSizes: -1
}

Get filesystem folders

/api/v1/filesystem?onlyfolders=true&showhidden=true

returns a list of json objects, example object:

{
    "text": "My Documents",
    "id": "%MY_DOCUMENTS%",
    "cls": "folder",
    "iconCls": "x-tree-icon-special",
    "check": false,
    "leaf": false,
    "resolvedpath": "/Users/rune",
    "hidden": false,
    "symlink": false
}

Time: 70 ms

Not worth optimizing, but I’m not sure if it is used during restore.

It seems to be for translating absolute paths to the system equivalent shortcuts and it doesn’t drop in performance with backup size.

Get root folder

/api/v1/backup/17/files/*?prefix-only=true&folder-contents=false&time=2018-03-09T22%3A06%3A55%2B01%3A00

returns a list of files (1 file), a list of filesets (1 file set), and 3 info variables

{
  "prefix-only": "true",
  "folder-contents": "false",
  "time": "2018-03-09T22:06:31+01:00",
  "Filesets": [
    {
      "Version": 0,
      "Time": "2018-03-09T22:06:31+01:00",
      "FileCount": 68662,
      "FileSizes": 44252728456
    }
  ],
  "Files": [
    {
      "Path": "/Users/rune",
      "Sizes": []
    }
  ]
}

Time: 4.75 sec

This query is ridiculous. It literally just gets the path to be used in the next query. Making it take 9 seconds to view JUST the top level folders.

This can be done by simply sorting the table by Path and getting the first row.

select Path from File order by Path limit 1

Takes 1 ms

returns

{
  Path: /Users/rune/git/
}

Get files and folders in the root folder

/api/v1/backup/17/files/%2F?prefix-only=false&folder-contents=true&time=2018-03-09T22%3A06%3A55%2B01%3A00&filter=%2F

returns the same format. List of files (multiple), list of filesets (1), and 4 info variables

{
  "prefix-only": "false",
  "folder-contents": "true",
  "time": "2018-03-09T22:06:55+01:00",
  "filter": "/",
  "Filesets": [
    {
      "Version": 0,
      "Time": "2018-03-09T22:06:55+01:00",
      "FileCount": 96554,
      "FileSizes": 188312003061
    }
  ],
  "Files": [
    {
      "Path": "/etc/",
      "Sizes": [
        -1
      ]
    },
    {
      "Path": "/home/",
      "Sizes": [
        -1
      ]
    },
    {
      "Path": "/opt/",
      "Sizes": [
        -1
      ]
    },
    {
      "Path": "/var/",
      "Sizes": [
        -1
      ]
    }
  ]
}

Time: 4.96 sec

Why are we getting file count and file sizes? It does not appear to be used in the frontend.

Query to get files and folders (no invisible files)

Takes 400-500ms depending on path length (due to the fact that we rely on regex matching)

select File.Path from Fileset
left join FilesetEntry on FilesetEntry.FilesetID = Fileset.ID
left join File on File.ID = FilesetEntry.FileID
where Timestamp = '1520629591' and Path REGEXP "^\/Users\/rune\/git\/duplicati\/[^\.]{1}[^\/]+\/?$"

Query to get visible and invisible folders without files

Takes 130 ms to just get a list of file paths

select File.Path from Fileset
left join FilesetEntry on FilesetEntry.FilesetID = Fileset.ID
left join File on File.ID = FilesetEntry.FileID
where BlocksetID < 0 and Timestamp = '1520629591' and Path REGEXP "^\/Users\/rune\/[^\/]+\/?$"

Explanation of regex

^ start of path
/Users/rune/ our root/current browsing path
[^\.]{1} file/folder name does NOT start with .
[^\/]+ pick up anything until the next slash
\/? expect 1 or 0 slashes (a file or a folder)
$ the path must stop here (or else it’s in a sub directory)

Expand a folder

/api/v1/backup/17/files/%2Fetc%2F?prefix-only=false&folder-contents=true&time=2018-03-09T22%3A06%3A55%2B01%3A00&filter=%2Fetc%2F

It’s the same function as listing the root folders, so same call, same query. But for the same of completeness I did the benchmarks.

Time: 2.34 sec

Seems to be less for smaller sub folders, but still 2+ sec to return a couple of files in a sub folder

Takes the same 400-500ms depending on path using the proposed queries above

Expand all subfolders (Frontend doesn’t really support this, just an example)

Takes 400-700 seconds depending on paths and sub items

select File.Path from Fileset
left join FilesetEntry on FilesetEntry.FilesetID = Fileset.ID
left join File on File.ID = FilesetEntry.FileID
where Timestamp = '1520629200' and Path REGEXP "^\/Users\/rune\/git\/duplicati\/[^\/]+"

Conclusions

So to load the restore page by default showing the visible files and folders in the root folder:

Duplicati queries:

get fileset - 1.37sec
get filesystem - 70ms
get root folder - 4.75 sec
get files and folders in root folder - 4.96 sec
total - 11,15 sec

Raw queries:

get fileset - 1ms
get filesystem - 70ms
get root folder - 1ms
get files and folders in root folder - 500ms
total - 572 ms

We’re looking at around 90-95% reduction in time to get the data required to display the default view.

Additionally, we’re looking at close to 80% reduction in time to open sub folders as well.

It should be noted this is a slightly unfair comparison since the Duplicati queries are measured in time from HTTP request to HTTP response, including time spent logging, while the queries are raw time to get the data.

Nonetheless, I think it proves a point - There is a lot of performance to gain by moving from generic queries to more specialized queries.

The /api/v1/backup/17/files/ API call ends up in these methods:
Duplicati/Library/Main/Operation/ListFilesHandler.cs#L27
Duplicati/Library/Main/Controller.cs#L313
These are called by the commandline as well as the web API, which may explain why they return too much data.

I think the best option is to write some new API paths using customized queries to provide just the data required by the frontend. That being said, I haven’t had time to play with the actual code implementation, so this is still very much theoretical.

kenkendk · April 23, 2018, 7:37am

I agree, we can easily add new specialized query API endpoints, if the generic ones are too slow (as your numbers indicate).

I have postponed looking into the listing speed because I want to rewrite the paths.

Currently they are all stored in plain:

C:\Data\Folder1\
C:\Data\Folder1\file1.txt
C:\Data\Folder1\file2.txt
C:\Data\Folder2\
C:\Data\Folder2\file1.txt
...

This is nice and simple, but it has a huge overhead as the data is not compressed and tends to be very repetitive. This makes the SQLite database larger and slows down queries.

My idea is to make a new table that is called something like PathPrefix and then store all prefixes there, eg.:

1,C:\Data\
2,C:\Data\Folder1\
3,C:\Data\Folder2\
...

So we can then simply store:

1,Folder1\
2,file1.txt
2,file1.txt
1,Folder2\
3,file2.txt
...

This reduces the amount of path storage a lot, with only having a single integer-based join to get the full path. An upside to this is that it is really fast to do folder listings, as we just need to grab the prefix-id, and then return all that matches that ID.

We could in principle store this even more compact, such that a prefix can depend on another prefix (i.e. store all folder names) but this makes it much slower to build each path, as we need an unknown number of repeated lookups.

We can omit it, if it is not used, but I think the idea was to use it in the UI for a tooltip or similar.

Pectojin · April 23, 2018, 7:44am

I like the ideal of moving paths into a new table. That will allow us to create much better queries than I did above as well as reducing size.

I think we should reconsider getting the file count and file sizes for now. Maybe put them in a separate call so we can fetch it later if required?
The reason none of my queries gets those numbers is that I found it to be really expensive to get out of the DB.

kenkendk · April 23, 2018, 7:48am

Fine by me

Pectojin · April 23, 2018, 7:51am

But if we’re looking to restructure the database doing that first makes sense

kees-z · April 23, 2018, 7:59am

Please forgive me if I say stupid things, I’m not a developer…
Wouldn’t it save more space in the local DB (and probably make it faster) to store each individual item instead of the full path? If all items get a unique ID and an additional “Child of” field, parent folders of a subfolder just have to be stored once.

So:

C:\Data\Folder1\
C:\Data\Folder1\file1.txt
C:\Data\Folder1\file2.txt
C:\Data\Folder2\
C:\Data\Folder2\file1.txt

Could also be stored like this:

ID,Parent,Name
1,0,C:\
2,1,C:\Data
3,2,Folder1
4,2,Folder2
5,3,file1.txt
6,3,file2.txt
7,4,file1.txt

Pectojin · April 23, 2018, 8:04am

Not a stupid question

It would use less space, but you could risk having to recursively join the table on itself 10 times to get the full path for a file. So you trade disk space for CPU and lookup time.

kenkendk · April 23, 2018, 8:13am

Exactly.

That would store it in the most compressed manner. However, to get the full path for C:\Data\Folder1\file1.txt, you would need to first read:

5,3,file1.txt

Then figure out you need 3, read that, figure out you need 2, read that, then figure out you you need 1, read that, see that you are done.

Since we don’t know the number of steps required (the path “depth” can be arbitrarily long), we need to implement some kind of “while” loop, making table lookups. This is not efficient to do with relational databases, which is why I want exactly one step. It also makes sense to use “one” as we frequently want “all content of this folder”, which is then nicely grouped. If we choose “two”, we could save (potentially) more disk space, but there is no good rule for where to split the path (after 4 folders? 5?).

kees-z · April 23, 2018, 8:13am

If the table is used by the file picker this makes no sense, does it?
Browsing a folder structure is always top-down. One query should be required to list the contents of the root folder: Give me all items that are child of ID 1 will result in C:\Data.
When the user clicks to open Data, another single query is needed to show the contents: Give me all items that are child of ID 2, which results in Folder1 and Folder2.
This is not different from storing the full path of each folder, the query will become longer, but in both scenarios there’s just one query needed for each folder that’s opened in the file picker.
Am I missing something?

Pectojin · April 23, 2018, 8:23am

From a semantic point of view it’s logical to store them in a parent-child structure because that’s what directories are.

It even makes a lot of sense from functional or objective programming perspectives.

However, it’s tricky to implement and runs slower in a relational database due to the way the SQL is structured

kenkendk · April 23, 2018, 8:28am

Yes, in that sense it makes a lot of sense. We get the same functionality by just splitting it once.

Where it hurts with multiple splits, is when we need to build the full paths. For instance, if we need to test a filter, we need to build the full path before we can evaluate the filter. Also when writing the list of paths (in the dlist file), we want to know the full paths.

It is possible that we can implement all the existing uses of the paths with some kind of stack/queue to avoid looking up each parent (it should already be known, when we list the contents). But with the single split, we can simply write an SQL view that returns the full paths, and it will be approximately as fast as it is now, with almost no code change.

kenkendk · April 23, 2018, 9:34am

Ok, I made a quick go at this. The re-structure is quite simple to do in SQL:

/*
The PathPrefix contains a set
of path prefixes, used to minimize
the space required to store paths
*/
CREATE TABLE "PathPrefix" (
    "ID" INTEGER PRIMARY KEY,
    "Prefix" TEXT NOT NULL
);
CREATE UNIQUE INDEX "PathPrefixPrefix" ON "PathPrefix" ("Prefix");


/*
The FileLookup table contains an ID
for each path and each version
of the data and metadata
*/
CREATE TABLE "FileLookup" (
    "ID" INTEGER PRIMARY KEY,
    "PrefixID" INTEGER NOT NULL,
    "Path" TEXT NOT NULL,
    "BlocksetID" INTEGER NOT NULL,
    "MetadataID" INTEGER NOT NULL
);

/* Fast path based lookup, single properties are auto-indexed */
CREATE UNIQUE INDEX "FileLookupPath" ON "FileLookup" ("PrefixID", "Path", "BlocksetID", "MetadataID");


/* Build the prefix table */
INSERT INTO "PathPrefix" ("Prefix")
SELECT DISTINCT
    CASE SUBSTR("Path", LENGTH("Path")) WHEN  '/' THEN
        rtrim(SUBSTR("Path", 1, LENGTH("Path")-1), replace(replace(SUBSTR("Path", 1, LENGTH("Path")-1), "\", "/"), '/', ''))
    ELSE
        rtrim("Path", replace(replace("Path", "\", "/"), '/', ''))
    END AS "Prefix"
FROM "File";

/* Build the path lookup table */
INSERT INTO "FileLookup" ("Path", "PrefixID", "BlocksetID", "MetadataID")

SELECT 
  SUBSTR("Path", LENGTH("ParentFolder") + 1) AS "Path", 
  "ID" AS "PrefixID", 
  "BlocksetID", 
  "MetadataID" 
FROM

(SELECT "Path", "BlocksetID", "MetadataID",
    CASE SUBSTR("Path", LENGTH("Path")) WHEN  '/' THEN
        rtrim(SUBSTR("Path", 1, LENGTH("Path")-1), replace(replace(SUBSTR("Path", 1, LENGTH("Path")-1), "\", "/"), '/', ''))
    ELSE
        rtrim("Path", replace(replace("Path", "\", "/"), '/', ''))
    END AS "ParentFolder"
FROM "File") "A" INNER JOIN "PathPrefix" "B" ON "A"."ParentFolder" = "B"."Prefix";

DROP TABLE "File";

CREATE VIEW "File" AS SELECT "A"."ID" AS "ID", "B"."Prefix" || "A"."Path" AS "Path", "A"."BlocksetID" AS "BlocksetID", "A"."MetadataID" AS "MetadataID" FROM "FileLookup" "A", "PathPrefix" "B" WHERE "A"."PrefixID" = "B"."ID";


UPDATE "Version" SET "Version" = 8;

Then we just need to adjust the code that inserts into the File table to insert into the prefix+lookup tables instead.

Pectojin · April 23, 2018, 10:23am

Nice

I’ll take a look at some of the query options for the restore page.

There are also some calls from the commandline library, which rely on that common query that is currently listing files from the DB table. It would have to be reworked.

Do you know of any other dependencies?

kees-z · April 23, 2018, 10:30am

Ah, that makes sense. I was thinking of the file picker only, but of course there’s more that makes use of the paths stored in the db that would have to query the db multiple times. Thanks for the clarification!

kenkendk · April 23, 2018, 10:44am

With the view in place, we do not need to rewrite the queries, but of course, if we can speed it up we should.

The only real change is to make sure the inserts are creating both entries, afaik, there are no other issues.

Pectojin · April 23, 2018, 10:46am

Ah, good point. I had forgotten about the view.

Let’s keep it simple and then we can improve performance other places later

kenkendk · April 23, 2018, 8:37pm

I have implemented the refactored code now. There are a few places where it can be optimized further, but at least this uses the reduced storage space and should make it significantly faster to do the prefix-based lookups for the UI: