You’re probably thinking just about scaling up. There’s also scaling down, e.g. to back up a NAS. Embedded databases can sometimes be persuaded to scale up, but the awkwardness of a non-embedded database may make it hard to go small and maintain it. It’s likely a separate process using more files for its databases than SQLite does, and possibly one that has to run in a Docker container.
For maintenance, SQLite is (more or less) a single-file database. You don’t want to end up where a special backup procedure and tool is needed to back up the database for your backup program. Recently, we’ve been dodging that with database recreate from the destination, but that gets slow when the backup grows large, because there are too many files and too many blocks to insert back in.
A fancier database comes up sometimes, and you can see the other opinions. The most recent:
The “many roundtrips” probably refers to the opposite side of database use. I talked about the problem of a few large queries getting slow, but there are also millions of tiny ones that add up.
To see what I mean (and make a really big profiling log), you can also add the option below to enjoy what I think are, by default, mostly per-block SQL requests on small blocks. They add up, but they don’t stand out in the profiling log the way the occasional really slow queries do.
Some profiling tools can add up the total time cost of repeated fast queries. Duplicati doesn’t, but there’s only so much that can be done short of a redesign that stops issuing them.
Generally in SQL, I think things like inserting rows into the database one at a time are expensive.
--profile-all-database-queries (Boolean): Activates logging of all database
queries
To improve performance of the backups, frequent database queries are not
logged by default. Enable this option to log all database queries, and
remember to set either --console-log-level=Profiling or
--log-file-log-level=Profiling to report the additional log data
* default value: false
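To put a rough number on that row-at-a-time insert cost, here’s a minimal sketch in plain Python sqlite3 (a toy table and sizes, not Duplicati’s actual schema or code) comparing per-row committed inserts against one batched transaction:

```python
import sqlite3
import time

def timed(label, fn):
    start = time.perf_counter()
    fn()
    print(f"{label}: {time.perf_counter() - start:.2f}s")

con = sqlite3.connect("toy.sqlite")
con.execute("CREATE TABLE IF NOT EXISTS Block (Hash TEXT, Size INTEGER)")
rows = [(f"hash{i}", 1024) for i in range(10_000)]

def one_at_a_time():
    # Every row pays its own commit (journal write and sync).
    for r in rows:
        con.execute("INSERT INTO Block VALUES (?, ?)", r)
        con.commit()

def batched():
    # One transaction for the whole batch; the per-row overhead mostly vanishes.
    with con:
        con.executemany("INSERT INTO Block VALUES (?, ?)", rows)

timed("row-at-a-time", one_at_a_time)
timed("batched", batched)
con.close()
```

As I understand it, most of the gap is per-commit journal and sync cost, which is why the difference tends to be largest on spinning disks.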
In a sense, this is also a scaling up and down problem. Does any one solution cover all cases?
The introduction of “Duplicati, Inc.” is an unknown, but it might force a greater emphasis on large scale.
If you look at other comments on SQLite, you’ll see some saying it’s single-threaded, which is a consequence of its being a library that runs on its caller’s thread. I think a caller that wants threading could implement it in the application. It might not be simple, and it requires thought about how the database is used.
Threading here is aimed at the annoying Task Manager (etc.) view of getting only one CPU core’s worth of SQL processing on your far-higher-core system. There’s also the problem that the hard drive may become the bottleneck first. I’m not quite sure how the really big databases avoid that issue.
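As a toy illustration of what “the application does the threading” could look like (this is not how Duplicati works today, and the table and query are invented), each worker thread opens its own connection, and WAL mode lets readers run concurrently:

```python
import sqlite3
from concurrent.futures import ThreadPoolExecutor

DB = "toy.sqlite"

def setup():
    con = sqlite3.connect(DB)
    con.execute("PRAGMA journal_mode=WAL")  # readers don't block the writer
    con.execute("CREATE TABLE IF NOT EXISTS Block (Hash TEXT, Size INTEGER)")
    with con:
        con.executemany("INSERT INTO Block VALUES (?, ?)",
                        [(f"hash{i}", i % 4096) for i in range(50_000)])
    con.close()

def count_range(bounds):
    # Each worker thread uses its own connection; SQLite itself still runs on the calling thread.
    lo, hi = bounds
    con = sqlite3.connect(DB)
    try:
        (n,) = con.execute(
            "SELECT COUNT(*) FROM Block WHERE Size BETWEEN ? AND ?", (lo, hi)
        ).fetchone()
        return n
    finally:
        con.close()

if __name__ == "__main__":
    setup()
    ranges = [(i, i + 1023) for i in range(0, 4096, 1024)]
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(sum(pool.map(count_range, ranges)))
```

Writes are a different story: SQLite allows only one writer at a time, so splitting up read-heavy work is the easier win.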
Personally, I have a feeling that SQLite’s nested-loop joins are where Duplicati’s big joins get slow.
I am not an SQLite wizard, but here are some docs:
Using SQLite In Multi-Threaded Applications
The current implementation of SQLite uses only loop joins. That is to say, joins are implemented as nested loops.
For each table read by the query, the output of EXPLAIN QUERY PLAN includes a record for which the value in the “detail” column begins with either “SCAN” or “SEARCH”. “SCAN” is used for a full-table scan, including cases where SQLite iterates through all records in a table in an order defined by an index. “SEARCH” indicates that only a subset of the table rows are visited.
Ways to gain speed include adding indexes to encourage the query planner to choose a search over a scan. I think I had a query the other day that chose to scan on the innermost loop. It was slow.
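For a concrete (made-up) example of reading those plans, EXPLAIN QUERY PLAN shows whether an index moves a lookup from SCAN to SEARCH; the exact wording varies a bit by SQLite version:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE Block (Hash TEXT, Size INTEGER)")

def plan(sql):
    # Print the "detail" column of each EXPLAIN QUERY PLAN row.
    for row in con.execute("EXPLAIN QUERY PLAN " + sql):
        print(row[-1])

query = "SELECT Size FROM Block WHERE Hash = 'abc'"
plan(query)  # roughly: SCAN Block  (full-table scan)

con.execute("CREATE INDEX BlockHashIndex ON Block (Hash)")
plan(query)  # roughly: SEARCH Block USING INDEX BlockHashIndex (Hash=?)
```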

Or, using a hierarchical structure for storing metadata of files/blocks?
You’d have to clarify that. The database doesn’t store any of that, but it does track it; for example, metadata such as a timestamp is stored in some dblock file alongside blocks of some files’ data. Blocks don’t have metadata, but they may be metadata, or they may be a blocklist. It’s all in a rather flat layout in storage. Inside the database, SQLite already uses B-tree tables for performance, and how to execute a query is left to the database. That doesn’t say a hierarchy can’t be used, but how?

The best practice for current implementation of making a large backup is to stay below the above-mentioned upper limit by splitting into small backup tasks and/or adjusting block size carefully, and enable the zip64 support.
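To show why “adjusting block size carefully” matters so much: each block corresponds to roughly one row in the local database, so the block count largely sets the table sizes the queries have to chew through. A back-of-the-envelope sketch, with purely illustrative sizes:

```python
# Illustrative only: estimate how many block rows different blocksizes imply.
source_bytes = 2 * 1024**4  # pretend 2 TiB of source data
for blocksize in (100 * 1024, 1024**2, 5 * 1024**2):  # 100 KiB, 1 MiB, 5 MiB
    blocks = source_bytes // blocksize
    print(f"{blocksize // 1024:>5} KiB blocksize -> about {blocks:,} blocks to track")
```

Deduplication and compression change the exact numbers, but the order of magnitude is what drives both the insert volume and the join sizes.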
There are quite a few other performance paths, such as usn-policy on Windows to avoid scanning the drive. If you’re in a millions-of-files situation, the scan can be slow; however, the USN journal itself has a fixed size and may need enlarging with Windows tools such as fsutil usn.
Options exist to increase upload parallelism, CPU concurrency, and other things; it’s tuning for specific situations and desires. If you want to play with SQLite cache tuning, search the forum for CUSTOMSQLITEOPTIONS_DUPLICATI. Feel free to look into whichever of these settings work well for your case. Favor results over general opinions, as situations vary a lot, and both the equipment and the people to investigate are scarce. Try some ideas.
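For a sense of what those cache tunings end up doing at the SQLite level (this is plain sqlite3 with example values, not Duplicati’s own option handling), the knobs are per-connection PRAGMAs:

```python
import sqlite3

con = sqlite3.connect("toy.sqlite")

# Example values only, not recommendations.
con.execute("PRAGMA cache_size = -200000")   # negative means KiB, so roughly a 200 MB page cache
con.execute("PRAGMA temp_store = MEMORY")    # keep temporary tables/indexes for sorts in RAM
con.execute("PRAGMA mmap_size = 268435456")  # allow memory-mapping up to 256 MB of the file

for pragma in ("cache_size", "temp_store", "mmap_size"):
    print(pragma, con.execute(f"PRAGMA {pragma}").fetchone()[0])
con.close()
```

Whether any of this helps depends on how much of the database fits in memory, so measure rather than assume.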

I actually have more than 20 million files in this specific backup.
Wow. I made a 10-million-file tree of empty files (the 1 TB drive on this desktop won’t hold big ones), tried to back it up with Duplicati, found it taking too long, and added an exclude. I formerly used Macrium Reflect Free for a fast backup to a USB drive. That’s gone now, and file backups are slower.
I’m not sure why. Probably the image backup just dumps the Windows Master File Table as data, rather than having to walk it and inspect individual files for clues suggesting they need to be read.