Crash at end of backup because filelist is more than 4GB, efficient way to recover?

Hey! Thanks for all your work on creating a nice cross-platform backup tool!

I tried to back up a folder to B2 that’s about 1.6 TB across millions of files. After a week of waiting, it crashed with:

Attempted to write a stream that is larger than 4GiB without setting the zip64 option

From what I can gather, that’s because of my large file count. Is there an efficient way to get to a valid state from here? Everything is uploaded, so I’d prefer not to have to download or upload everything again.

(Also, creating a bug report doesn’t seem to work either (e.g. it takes forever). I presume at some point it will finish, but it will probably be too big to share anyway.)

Welcome to the forum @bob1

was the only other report, and it only serves to make me think you had to set the option yourself.

  --zip-compression-zip64 (Boolean): Toggles Zip64 support
    The zip64 format is required for files larger than 4GiB, use this flag to
    toggle it
    * default value: False

The last thing uploaded is normally the dlist file with the correct date. Does the B2 web UI show that there?
Alternatively, <job> → Show log → Remote might show it on the first screen; however, a failure frequently rolls back database transactions that didn’t finish, and so may erase log data there.

Another large backup needing a larger blocksize. The default 100 KB slows down past 100 GB. Scaling it up helps, but can’t counteract millions of files, as each of those makes at least 2 blocks: one holds data, and the other holds metadata such as permissions. The database gets big, and the dlist file gets big (I’m assuming your Options screen Remote volume size isn’t set super high).
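To put rough numbers on that floor (purely assumed counts for illustration, not anything measured from this backup):

    # Back-of-envelope sketch: every file contributes at least one data block
    # plus one metadata block, so the file count alone sets a floor on block count.
    # The 3 million figure is just a stand-in for "millions of files".
    file_count = 3_000_000
    min_blocks = file_count * 2      # 1 data block + 1 metadata block per file
    print(f"at least {min_blocks / 1e6:.0f} million blocks, before counting data past each file's first block")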

Choosing sizes in Duplicati

What you should have is a series of dblock (default 50 MB, but can be bigger) and dindex files being uploaded, and a dlist after them. If you see huge files or lack a dlist, then that’s not good.
Awkward manual recovery may be possible. On the other hand, blocksize can only be changed with a fresh start on the correct blocksize. Splitting the backup might also be a reasonable approach.

Thanks a lot for your reply :slight_smile:

Yeah, I spotted the zip64 option and have now set it to “True”, obviously. An automatic retry with the option changed would have been nice, haha.

Last file is unfortunately: duplicati-b002aecdef5ce4244b40bdd6e74266509.dblock.zip.aes

I did have the forethought to change the volume size to 200MB (I was afraid much higher than 50MB would be too “weird” and I’d be inviting bugs).

So I guess my best course of action is to redo everything, with a volume size of 200 MB, a blocksize of 1 MB, and zip64 on?

I’m still trying to talk you into splitting the backup, but if you won’t, then it sounds like ZIP64 is definitely needed. Remote volume size seems reasonable. Blocksize could maybe be bigger.

Generally the rule of thumb is that speed decreases past 1 million blocks in the database, so multiple millions of files already means it’ll be slow, regardless of the maximum size of data blocks…

Pushing blocksize higher means less deduplication, so more B2 space use, but more speed.
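For a rough sense of scale on the ~1.6 TB in question (simple arithmetic only, ignoring deduplication and the per-file metadata blocks):

    total_bytes = 1.6e12                       # roughly the 1.6 TB source
    for blocksize in (100 * 1024, 1 * 2**20, 5 * 2**20):
        blocks = total_bytes / blocksize
        print(f"{blocksize // 1024:>5} KiB blocksize -> ~{blocks / 1e6:.1f} million data blocks")
    # Compare against the ~1 million rough guideline above; the per-file
    # metadata blocks come on top and don't shrink with a larger blocksize.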

With such a large backup, database recreate time can also be pretty large, so be sure to test disaster recovery if the time matters. At the very least, test a small direct restore occasionally.

There is an edge case when there are no file lists uploaded at all, and in this case it does not work to simply retry. Making an initial upload with one file will create the dlist file.

If the full upload crashes, like in your case, you can simply run repair and it will write a “synthetic” filelist based on the local database.

But yeah, would be nice if it worked without any dlist files on the remote destination :smiley:

It doesn’t seem to work. I’m not sure if the initial backup is special. I’ve seen it work in backup

and that’s the only call I can find. In my backup test, I did have to run Repair first because of

  • 2024-03-26 12:40:56 -04 - [Error-Duplicati.Library.Main.Operation.FilelistProcessor-ExtraRemoteFiles]: Found 2 remote files that are not recorded in local storage, please run repair
  • 2024-03-26 12:40:56 -04 - [Error-Duplicati.Library.Main.Operation.BackupHandler-FatalError]: Fatal error RemoteListVerificationException: Found 2 remote files that are not recorded in local storage, please run repair

after trying Backup right after rudely interrupting Duplicati with a process kill. I ran a Repair, then another Backup. No sign of a synthetic filelist on the Repair dropdown menu, just the latest run; however, that might have been good enough to solve the request here, except for the bad blocksize.

On any future interruptions, it’d probably be worth just trying Backup again to see what it can do.

It does sometimes. On a second, smaller killed backup (slightly upload-throttled for predictability), the result was a rerun backup as expected (from the prior test) and also a synthetic filelist for the first try here.


Profiling log has:

2024-03-26 15:06:22 -04 - [Profiling-Timer.Finished-Duplicati.Library.Main.Operation.Common.DatabaseCommon-CommitTransactionAsync]: PreSyntheticFilelist took 0:00:00:00.124
2024-03-26 15:06:22 -04 - [Information-Duplicati.Library.Main.Operation.Backup.UploadSyntheticFilelist-PreviousBackupFilelistUpload]: Uploading filelist from previous interrupted backup

Job log Complete log has:

  "Messages": [
    "2024-03-26 15:06:13 -04 - [Information-Duplicati.Library.Main.Controller-StartingOperation]: The operation Backup has started",
    "2024-03-26 15:06:14 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: List - Started:  ()",
    "2024-03-26 15:06:14 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: List - Completed:  (18 bytes)",
    "2024-03-26 15:06:14 -04 - [Information-Duplicati.Library.Main.Operation.FilelistProcessor-KeepIncompleteFile]: keeping protected incomplete remote file listed as Temporary: duplicati-20240326T190513Z.dlist.zip",
    "2024-03-26 15:06:14 -04 - [Information-Duplicati.Library.Main.Operation.FilelistProcessor-Remove incomplete file]: removing incomplete remote file listed as Uploading: duplicati-b42ed9f9386194912a3ff57ea82025ec7.dblock.zip",
    "2024-03-26 15:06:14 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Delete - Started: duplicati-b42ed9f9386194912a3ff57ea82025ec7.dblock.zip (936.03 KB)",
    "2024-03-26 15:06:14 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Delete - Completed: duplicati-b42ed9f9386194912a3ff57ea82025ec7.dblock.zip (936.03 KB)",
    "2024-03-26 15:06:16 -04 - [Information-Duplicati.Library.Main.Operation.FilelistProcessor-Remove incomplete file]: removing incomplete remote file listed as Uploading: duplicati-b2bd41ab28fe84243b41fcdf6d685554e.dblock.zip",
    "2024-03-26 15:06:16 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Delete - Started: duplicati-b2bd41ab28fe84243b41fcdf6d685554e.dblock.zip (953.94 KB)",
    "2024-03-26 15:06:16 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Delete - Completed: duplicati-b2bd41ab28fe84243b41fcdf6d685554e.dblock.zip (953.94 KB)",
    "2024-03-26 15:06:18 -04 - [Information-Duplicati.Library.Main.Operation.FilelistProcessor-Remove incomplete file]: removing incomplete remote file listed as Uploading: duplicati-b4d2cff2c7f784500977ee8736e5066bf.dblock.zip",
    "2024-03-26 15:06:18 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Delete - Started: duplicati-b4d2cff2c7f784500977ee8736e5066bf.dblock.zip (956.25 KB)",
    "2024-03-26 15:06:18 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Delete - Completed: duplicati-b4d2cff2c7f784500977ee8736e5066bf.dblock.zip (956.25 KB)",
    "2024-03-26 15:06:20 -04 - [Information-Duplicati.Library.Main.Operation.FilelistProcessor-SchedulingMissingFileForDelete]: scheduling missing file for deletion, currently listed as Uploading: duplicati-b05e3a72ccd5f4d64a6202ede756c81e5.dblock.zip",
    "2024-03-26 15:06:20 -04 - [Information-Duplicati.Library.Main.Operation.FilelistProcessor-Remove incomplete file]: removing incomplete remote file listed as Uploading: duplicati-beb6eda77c5f0485aa3f2003550debf09.dblock.zip",
    "2024-03-26 15:06:20 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Delete - Started: duplicati-beb6eda77c5f0485aa3f2003550debf09.dblock.zip ()",
    "2024-03-26 15:06:20 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Delete - Completed: duplicati-beb6eda77c5f0485aa3f2003550debf09.dblock.zip ()",
    "2024-03-26 15:06:22 -04 - [Information-Duplicati.Library.Main.Operation.Backup.UploadSyntheticFilelist-PreviousBackupFilelistUpload]: Uploading filelist from previous interrupted backup",
    "2024-03-26 15:06:23 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-b8c30a40f693c4f9fa032840adeee92ab.dblock.zip (943.29 KB)",
    "2024-03-26 15:06:23 -04 - [Information-Duplicati.Library.Main.BasicResults-BackendEvent]: Backend event: Put - Started: duplicati-b814569284588475c9d1b3150bdb34e3c.dblock.zip (928.97 KB)"
  ],

So basically it looks like this is not the no-dlist edge case, but needing a Repair first can block it.
Probably worth just retrying on future interrupted backups. It might also be a natural hope…

Yes, that is the problem. The repair will recognize that the dblock files are superfluous and want to delete them. It should be possible to just treat this case as if there were an empty dlist file, and then make the synthetic dlist file based on the contents.

Regarding just making the backup smaller: I already split the backup into smaller parts, and making it even smaller would be somewhat unnatural. Though possible in principle, of course.

Repair doesn’t seem to do anything, but I think that’s the edge case you described. Uploading one file and then running repair also doesn’t help. But that also follows from the above conversation, if I understand correctly.

I’ll just redo everything with the correct settings and see if that helps. Is there a way to set the zip64 option globally? I could not find it in settings → advanced options.

Things slow down if there are more than approximately 1 million blocks, regardless of block size and the file count of the backup object. That is, the optimal upper size limit of a backup task is determined by, and only by, how acceptable the negative consequences of using large block sizes are. The larger the block size we can afford (given its costs), the larger the backup object can be without things slowing down (due to hash collisions, I guess).
Am I getting it right?

What does the dlist list: blocks or volumes? I didn’t seem to find this in the manual.
Knowing this would help judge whether toggling --zip-compression-zip64 is necessary for a specific backup task.

Welcome to the forum @kyo

It’s just a rough rule of thumb, and nobody has had spare time to try extensive what-if testing; however, I think the stated conclusion goes against what I was saying about the file counts:

If you mean the optimal upper limit is determined by the need for speed, that’d probably be correct, unless you hit the dlist file size limit being discussed here, with the workaround being to use ZIP64. Specific destinations can also have their own limits on the maximum number of files, if things get big.

Generally agree. I’d point out that the cost of less deduplication varies a lot with the type of files. Sometimes you have totally duplicated files, which will deduplicate fully regardless of block size. A small blocksize helps catch things like repeated data patterns inside or between the files.

The space increase from poorer deduplication due to a larger blocksize is capped, as per the above: identical files still get deduplicated, and the loss of inside-file and between-file deduplication bottoms out at no deduplication, which is capped at the actual file size. Also recall that the data is compressed.
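A toy model of fixed-size block deduplication (just hashing blocks in Python; not Duplicati’s actual pipeline) shows both effects: identical files dedupe fully at any block size, while repetition inside a file only dedupes when the blocksize is small enough to isolate it:

    import hashlib, os

    def stored_bytes(files, blocksize):
        # Count each distinct block once across all files, as a crude stand-in
        # for block-level deduplication.
        seen = set()
        for data in files:
            for i in range(0, len(data), blocksize):
                seen.add(hashlib.sha256(data[i:i + blocksize]).digest())
        return len(seen) * blocksize

    pattern = os.urandom(1024)
    # A 1 MiB file where every other 1 KiB chunk is the same repeated pattern.
    file_a = b"".join(pattern if i % 2 == 0 else os.urandom(1024) for i in range(1024))
    file_b = file_a  # an identical copy elsewhere in the source tree

    for bs in (1024, 64 * 1024):
        one = stored_bytes([file_a], bs)
        both = stored_bytes([file_a, file_b], bs)
        print(f"blocksize {bs:>6}: one copy ~{one // 1024} KiB stored, the identical copy adds {(both - one) // 1024} KiB")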

The slowdown below, in the SQL, is potentially exponential in table size, and hugely unpredictable.

The slowdowns are due to SQL and SQLite which have to do extensive work to figure things out. Look at About → Show log → Live → Profiling (or log-file) and you might find some queries slow.

SQLite by default uses a cache_size of a mere 2 MB of memory, and people have multi-GB databases. What can happen when the cache is exhausted is constant repeated reading of files on the drive instead. Sometimes this doesn’t even show up as drive activity, as it’s satisfied from the OS cache instead (still at a large speed cost compared to memory). More memory isn’t a sure cure, as it’s virtual, and the OS might go into paging/swapping instead. There’s a not-yet-in-the-manual way to set the SQLite cache.
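For what it’s worth, the cache is easy to inspect and change on a plain SQLite connection (generic sqlite3 usage with a made-up filename; a negative cache_size means KiB, and the pragma only affects the connection that issues it, so this says nothing about how Duplicati itself receives the setting):

    import sqlite3

    con = sqlite3.connect("backup-job.sqlite")           # hypothetical copy of a job database
    print(con.execute("PRAGMA cache_size").fetchone())   # typically (-2000,), i.e. about 2 MB
    con.execute("PRAGMA cache_size = -1048576")          # negative means KiB, so roughly 1 GiB
    print(con.execute("PRAGMA cache_size").fetchone())
    con.close()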

Query Planning explains why it’s tough to give definitive answers about SQL processing speeds.

For any given SQL statement, there might be hundreds or thousands or even millions of different algorithms of performing the operation. All of these algorithms will get the correct answer, though some will run faster than others. The query planner is an AI that tries to pick the fastest and most efficient algorithm for each SQL statement.

Temporary Files Used By SQLite has other things that can go to disk if sizes get too big. Something faster than a mechanical hard drive is another thing that can sometimes raise speeds. The original rule of thumb was probably from me watching repeated hard drive reads on temporary files.

Sysinternals Process Monitor and Process Explorer are helpful tools if you really want to explore.

You can even watch it do a Restore, which runs a dblock at a time and writes blocks as needed, which might involve many files (deduplication in reverse) and small random writes (especially as the backup ages and compact runs). A larger blocksize probably improves restore speeds as well.

Search at the top of the screen finds dlist, although oddly it missed the other reference. Try these:

How the backup process works, and Processing a large file if your files exceed your blocksize.
How the restore process works

The dlist lists files and what blocks they use. Which block volume a block sits in can change at any time due to compact.

The easiest way to know if you’re about to need ZIP64 is to see if you’re getting close to overflow.

If you have to guess, it’s mainly full paths, so it gets bigger the more paths there are and the deeper they go. Files with multiple blocks don’t have their blocks listed directly, due to the blocklist indirection linked above.
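If you do want a crude guess ahead of time, walking the source and adding up path lengths plus some per-entry overhead gives an order-of-magnitude number to compare against 4 GiB. A sketch with assumed numbers; the real dlist also holds hashes, timestamps, and attributes, and the 200-bytes-per-entry overhead is only a guess:

    import os

    root = "/path/to/source"       # hypothetical source folder
    overhead_per_file = 200        # guessed bytes of JSON metadata per entry

    total = 0
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            total += len(os.path.join(dirpath, name).encode("utf-8")) + overhead_per_file

    print(f"very rough dlist size guess: {total / 2**30:.2f} GiB (4 GiB is the non-zip64 ceiling)")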

On first test, it complained of a mismatch and forced a repair. Second test got luckier and worked.

You probably looked over the dropdown list, which tends to offer only universal options for all jobs. Destination-specific options aren’t there, and compression with .zip is optional, though the alternative .7z fails badly, is considered Experimental (per the help), and is barely used, per the usage reporter.

--zip-compression-zip64

You can probably use the Edit as text box on the right, type the option in, and verify with Commandline for the purposes of a normal run (don’t actually run it in Commandline unless you want to), but know that the export options don’t pick up the global default settings, and I’m not sure if that’s a bug or a feature.
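If I have the syntax right, the line to add in that text box should look something like:

  --zip-compression-zip64=true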


was a prior discussion, if anyone wants to get deep into the technicalities of Duplicati .zip files.

has more details on what’s in a .dlist, best used when looking at one. Get a filelist.json
from a dlist file in an unencrypted backup (or decrypt an encrypted dlist, if you find that simpler).

Thanks for the incredibly detailed explanation.
My understanding is that the speed bottleneck from either having more than 0.5 million actual files (like in the OP’s case) or having more than 1 million blocks (when the backup object is multiple hundreds of GBs or even several TBs, which isn’t uncommon in this era, I guess) stems from an inherent scaling limitation of SQLite. If this is true, would replacing SQLite with a more scalable DB let the backup object scale? Or using a hierarchical structure for storing the metadata of files/blocks?
Anyway, I get it. The best practice with the current implementation for making a large backup is to stay below the above-mentioned upper limit by splitting into small backup tasks and/or adjusting the block size carefully, and to enable zip64 support.

I actually have more than 20 million files in this specific backup. So I guess I’ll keep backing up tar files by hand (the current system). But I’m actually curious how “workable” it is, so I’ll try the following as an experiment:

  • zip64 option
  • blocksize at 400 KB
  • volume size at 200 MB
  • Configure SQLite properly (rough sketch after this list)
    - WAL mode
    - synchronous=NORMAL
    - in memory temp storage
    - cache_size to 16MB
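For the SQLite part, here is roughly what I mean, written as pragmas on a plain sqlite3 connection just to spell out the intent (I assume Duplicati itself would need these passed through its own mechanism rather than from Python, since pragmas only apply to the connection that issues them):

    import sqlite3

    con = sqlite3.connect("backup-job.sqlite")   # hypothetical job database path
    for pragma in (
        "journal_mode = WAL",        # write-ahead logging
        "synchronous = NORMAL",      # fewer syncs than FULL
        "temp_store = MEMORY",       # keep temp tables/indices in RAM
        "cache_size = -16384",       # negative means KiB, so ~16 MB of page cache
    ):
        print(pragma, con.execute(f"PRAGMA {pragma}").fetchall())
    con.close()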

Any other suggestions are welcome

You’re probably thinking just about scaling up. There’s also scaling down, e.g. to back up a NAS. Embedded databases can sometimes be persuaded to scale up, but the awkwardness of a non-embedded database may make it hard to go small and maintain it. It’s likely a separate process using more files for its databases than SQLite, and possibly having to run in a Docker container.

For maintenance, SQLite is (somewhat) a single-file database. You don’t want to get to where a special backup procedure and tool is needed to back up the database for your backup program. Recently, we’ve been dodging that with database recreate from the destination, but that grows slow when the backup grows large, because of too many files and too many blocks to insert back in.

A fancier database comes up sometimes, and you can see the other opinions. The most recent:

The “many roundtrips” probably refers to the opposite side of database use. I talked about the problem of a few large queries getting slow, but there are also millions of tiny ones that add up.

To see what I mean (and make a really big profiling log), you can also add the option below to enjoy what I think are, by default, mostly per-block SQL requests on small blocks. They add up; however, they don’t stand out in the profiling log the way the occasional really slow queries do.

Some profiling tools can add up the total time cost of repeated fast queries. Duplicati doesn’t; however, there’s only a limited amount that can be done short of a redesign that stops issuing them.

Generally in SQL, I think things like inserting rows into the database one at a time are expensive.

  --profile-all-database-queries (Boolean): Activates logging of all database
    queries
    To improve performance of the backups, frequent database queries are not
    logged by default. Enable this option to log all database queries, and
    remember to set either --console-log-level=Profiling or
    --log-file-log-level=Profiling to report the additional log data
    * default value: false
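To illustrate the earlier point about one-at-a-time inserts (a generic sqlite3 comparison, nothing Duplicati-specific), the same rows go in far faster when batched into a single transaction:

    import sqlite3, time

    rows = [(i, f"hash-{i}") for i in range(200_000)]
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE block (id INTEGER, hash TEXT)")

    t0 = time.perf_counter()
    for r in rows:                       # one statement and one commit per row
        con.execute("INSERT INTO block VALUES (?, ?)", r)
        con.commit()
    print(f"row at a time: {time.perf_counter() - t0:.2f}s")

    t0 = time.perf_counter()
    with con:                            # one transaction for the whole batch
        con.executemany("INSERT INTO block VALUES (?, ?)", rows)
    print(f"batched:       {time.perf_counter() - t0:.2f}s")
    # On an on-disk database each per-row commit also pays for a sync, so the
    # gap there is far larger than this in-memory test shows.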

In a sense, this is also a scaling up and down problem. Does any one solution cover all cases?

Introducing “Duplicati, Inc.” is an unknown, but it might force a greater emphasis on large scale.

If you look at other comments on SQLite, you’ll see some saying it’s single-threaded, which is a function of it being a library that uses the thread of its caller. I think a caller that wants threading could implement it in the application. It might not be simple, and it requires thought on its usage.

Threading here is aimed at the annoying Task Manager (etc.) view of only getting one CPU core’s worth of SQL processing on your far-higher-core system. There’s also the problem that the hard drive might collapse first. I’m not sure quite how big databases manage to avoid that issue.

Personally, I have a feeling that SQLite nested loop joins are where Duplicati’s big joins get slow.

I am not an SQLite wizard, but here are some docs:

Using SQLite In Multi-Threaded Applications

Order of Tables in a Join

The current implementation of SQLite uses only loop joins. That is to say, joins are implemented as nested loops.

Table and Index Scans

For each table read by the query, the output of EXPLAIN QUERY PLAN includes a record for which the value in the “detail” column begins with either “SCAN” or “SEARCH”. “SCAN” is used for a full-table scan, including cases where SQLite iterates through all records in a table in an order defined by an index. “SEARCH” indicates that only a subset of the table rows are visited.

Ways to gain speed include adding indexes to encourage the query planner to choose a search over a scan. I think I had a query the other day that chose to scan on the innermost loop. It was slow.
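A small, generic illustration with a hypothetical table (not Duplicati’s real schema): EXPLAIN QUERY PLAN shows the switch from a full SCAN to a SEARCH once a suitable index exists:

    import sqlite3

    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE block (id INTEGER PRIMARY KEY, hash TEXT, size INTEGER)")
    query = "SELECT id FROM block WHERE hash = ?"

    print(con.execute("EXPLAIN QUERY PLAN " + query, ("abc",)).fetchall())
    # detail column shows "SCAN block": every row would be visited

    con.execute("CREATE INDEX block_hash ON block(hash)")
    print(con.execute("EXPLAIN QUERY PLAN " + query, ("abc",)).fetchall())
    # detail column now shows a SEARCH using block_hash instead of a full scan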

You’d have to clarify that. The database doesn’t store any of that, but it does track it; for example, metadata such as a timestamp is stored in some dblock file alongside blocks of some files’ data. Blocks don’t have metadata, but they may be metadata. They might also be a blocklist. It’s all in a rather flat layout in storage. In the database, SQLite already uses B-tree tables for performance. How to work a query is left to the database, which doesn’t mean a hierarchy can’t be used, but how?

There are quite a few other performance paths, such as usn-policy to avoid drive scanning on Windows. If you’re in a millions-of-files situation, the scan can be slow; however, the USN journal itself has a fixed size, and may need enlarging using Windows tools such as fsutil usn.

Options exist to increase upload parallelism, CPU concurrency, and other things. It’s tuning for specific situations and desires. If you want to play with SQLite cache tuning, search the forum for CUSTOMSQLITEOPTIONS_DUPLICATI. Feel free to look into which of these settings work well for your case. Favor results over any general opinions, as situations vary a lot, and both the equipment and the personnel to investigate are rare. Try some ideas.

Wow. I made a 10-million-file tree of empty files (the 1 TB drive on this desktop won’t hold big ones), tried to back it up with Duplicati, found it taking too long, and added an exclude. Formerly I used Macrium Reflect Free for a fast backup to a USB drive. That’s gone, and file backups are slower.

I’m not sure why. Probably the image backup just dumps the Windows master file table as data, rather than having to walk it and inspect individual files to look for clues suggesting a need to read them.