Why restore needs so much time verifying remote data

Thanks for the experiment; it seems to doom exFAT as a storage format for Duplicati.
That may be just as well, since there is a cloud of uncertainty about exFAT reliability: it has no file allocation table duplication, which seems to mean that a single failure can wreck data recovery.
Since the big advantage of exFAT over old FAT is that it can handle bigger (>4 GB) files, while Duplicati stores data in files that are typically much smaller (50 MB), the whole idea seems rather misguided.

I chose it so I could use it with Linux, Windows (and Mac). Is there a preferred filesystem that works well for both Linux and Windows?

Thanks so much for investigating!

Not sure if it would make a difference, but it would be worth trying DirectoryInfo.EnumerateFiles instead of Directory.EnumerateFiles, because that also returns all the metadata we need. Maybe its implementation is more optimized than doing a name lookup for each file. If you still have your folder, we could test the speed difference.
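
For illustration, here is a minimal sketch of the two approaches (not Duplicati’s actual listing code; the folder path and the printed fields are just placeholders):

using System;
using System.IO;

string path = @"D:\Backup";   // placeholder folder

// List names, then build a FileInfo per name: reading each FileInfo's properties
// triggers a separate by-name lookup in the directory.
foreach (string name in Directory.EnumerateFiles(path))
{
    var fi = new FileInfo(name);
    Console.WriteLine($"{fi.Name}\t{fi.Length}\t{fi.LastWriteTimeUtc:o}");
}

// Suggested alternative: DirectoryInfo.EnumerateFiles yields FileInfo objects whose
// metadata should come from the directory scan itself, so no second lookup per file.
foreach (FileInfo fi in new DirectoryInfo(path).EnumerateFiles())
{
    Console.WriteLine($"{fi.Name}\t{fi.Length}\t{fi.LastWriteTimeUtc:o}");
}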

The folder will likely be around a while. The files are zero bytes, so it’s compact. It was filled by:

#!/usr/bin/perl
use strict;
use warnings;

# Create 122,000 empty files named with an 8-digit numeric prefix and a 50-character suffix.
my $suffix = "0123456789" x 5;
for (my $i = 0; $i < 122000; $i++) {
    my $prefix = sprintf("%08d", $i);
    open my $fh, ">>", "${prefix}-${suffix}" or die $!;
}

Given how rarely that drive gets accessed, I’d probably test using a little USB stick, if I had a spare.
If you do, give it a try. There’s also a way to put a filesystem in a file, but I forget which OS.

I tested 100 repeated lists of my 2600-element backup folder on an NTFS SATA HDD to get an idea of whether this could be beneficial:

List only names: 00:00:00.2101002
List + lookup by name: 00:00:00.5752416
List DirectoryInfo: 00:00:00.2419649

So listing the FileInfo objects directly is almost as fast as listing only the names, and twice as fast as looking them up by name. On exFAT the gap will probably be even bigger. I will see if I can create a test myself, and also test on Mono.

Test code
using System.Linq;   // needed for .Select() below

string path = @"D:\Backup";
var timeInfo = TimeSpan.Zero;
var timeListNames = TimeSpan.Zero;
var timeLookupNames = TimeSpan.Zero;
var watch = System.Diagnostics.Stopwatch.StartNew();

for (int i = 0; i < 100; ++i) {
    // List FileInfo objects directly via DirectoryInfo
    watch.Restart();
    System.IO.FileInfo[] fileInfos2 = new System.IO.DirectoryInfo(path).GetFiles();
    watch.Stop();
    timeInfo += watch.Elapsed;

    // List file names only
    watch.Restart();
    string[] files = System.IO.Directory.GetFiles(path);
    watch.Stop();
    timeListNames += watch.Elapsed;

    // Continue timing (Start, not Restart) so the lookup total includes the listing above
    watch.Start();
    System.IO.FileInfo[] fileInfos = files.Select(f => new System.IO.FileInfo(f)).ToArray();
    watch.Stop();
    timeLookupNames += watch.Elapsed;
}
Console.WriteLine($"List only names: {timeListNames}\nList + lookup by name: {timeLookupNames}\nList DirectoryInfo: {timeInfo}");

Update:

SD card with exFAT and 110000 files, 10 repeats:

List only names: 00:00:01.9148963
List + lookup by name: 00:00:03.7029486
List DirectoryInfo: 00:00:02.1024485

Looks similar. I don’t know why the Duplicati list operation takes orders of magnitude longer than my test, but listing via DirectoryInfo is still a big improvement. There is also definitely caching going on: the first access to the directory takes an extra 2 seconds, whether in File Explorer or the script. These numbers are after caching.

I can’t test this on Mono, because while writing the files my VM decided it no longer likes to access this device, and I can’t even open it any more.

What file systems can you use on a virtual hard disk (VHD)? suggests you can put exFAT on there.

I suspect that’s the reason my drive was manufactured with exFAT. It possibly reduces such issues.
Linux NTFS support has been improving. I don’t follow macOS.

I think NTFS will work on Linux. However, a better question is: what would be the use, on Linux, of a Duplicati backup of Windows files? You will not be able to continue using the drive for backups after switching operating systems, so it will only be used to restore the files.
In that case, the quality of NTFS write support under Linux is moot: the drive will be used read-only anyway.
So my advice is to reformat the drive with NTFS.

I’m unclear on the status of Mac. There was a parenthesized “(and mac)”, but it got ignored later.
Looking briefly, Mac looks like the hardest, while Linux has several NTFS drivers (an old one, and a newer emerging one).

I haven’t benchmarked different file counts in the folder, but repeated linear search usually scales as the square: a very rough look at the time per request suggested a roughly linear slowdown per request, and the number of requests also grows with the file count, so the total grows roughly as the square.

Workarounds, if exFAT needs to be kept (though I’m not sure it does), are to scale the file count back down, maybe by splitting the big backup into smaller ones, and/or by raising Remote volume size on the Options screen.

is on the right track, except what’s probably meant is not blocksize, whose default is 100 KB (so too small), but dblock-size, which most people would set in the Remote volume size field. You can also look at:

Choosing sizes in Duplicati talks about both sizes, but more about using fewer files (which means more exFAT speed). The default 100 KB blocksize applies inside the dblock volumes. A small blocksize makes for a big database and slow SQL beyond perhaps 100 GB. Your 3 GB files get tracked as roughly 30 thousand blocks each. Is that necessary?
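
To make the sizing concrete, here is some rough arithmetic (the larger blocksize, larger volume size, and the total backup size are only illustrative examples, not recommendations):

// Rough sizing arithmetic from the defaults mentioned above; the "bigger" values
// and the 200 GB total are hypothetical examples.
long sourceFile = 3L * 1024 * 1024 * 1024;      // one 3 GB source file

long defaultBlock = 100 * 1024;                 // default blocksize
long biggerBlock  = 1 * 1024 * 1024;            // example larger blocksize

Console.WriteLine(sourceFile / defaultBlock);   // ~31,000 blocks tracked per file
Console.WriteLine(sourceFile / biggerBlock);    // ~3,000 blocks per file

long backupSize    = 200L * 1024 * 1024 * 1024; // example total backup size
long defaultVolume = 50 * 1024 * 1024;          // default Remote volume size (dblock-size)
long biggerVolume  = 500 * 1024 * 1024;         // example larger volume

Console.WriteLine(backupSize / defaultVolume);  // ~4,100 dblock files to list on exFAT
Console.WriteLine(backupSize / biggerVolume);   // ~400 dblock files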

EDIT:

I do have one data point on the drive. My 1692-file (500 MB volume) backup took 8 seconds to list.
You have about 72 times as many files. Linear scaling would put that at about 10 minutes, so what you see is over linear, although not quite square either (but square is closer, and this is extremely imprecise measuring).
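
Spelling out that arithmetic (the 8-second, 1692-file data point and the ~72× file-count ratio are from above; the rest is just projection):

// Projection from the 8-second / 1692-file data point (rough arithmetic only).
double baseSeconds = 8;
double ratio = 122000.0 / 1692;                        // ~72x as many files

Console.WriteLine(baseSeconds * ratio / 60);           // linear: ~10 minutes
Console.WriteLine(baseSeconds * ratio * ratio / 3600); // square: ~11.5 hours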

Because running the backup from the Linux server was very slow, I decided to use a Windows client over an SMB share for the first backup, and then continue under Linux.
Sadly, that isn’t possible.

Also, yes, I found it much simpler to restore from different systems.

Should I run some tests with my files, or go ahead and move these files to an NTFS partition?

I hacked in the file listing using DirectoryInfo and just created a new backup in the folder with many files.

The new list method completed after ~20 seconds. For comparison, I am now running the lookup by name (to rule out that it is a drive speed difference) and it is still running after 8 minutes. If it really takes hours, that is a strong argument to change the way our file listing is done. I think my simple test was faster because it did not actually access any metadata, so it was not loaded.

Edit:
Yes, that took about an hour. I think it is clear that the speed can be improved a lot by changing the list method.


Using PowerShell (.NET Framework) as a proof of concept, my big folder lists in 50 seconds; however, I’m only getting the name and timestamps because I didn’t see an easy way to match exactly what Duplicati grabs…

Get-ChildItem "E:\fill" | Select-Object Name, *Time > $null

Any tests you run will be a little faster after the move. Think about the tests you’ve just been doing.

The blocksize won’t change though, and will slow things down forever. Are you sure you prefer to keep it?
Changing it will probably mean a fresh backup, but that will probably finish faster using a larger blocksize.

What’s being backed up? If the Linux server itself, that surprises me, but I’m not on Linux all that much.

Generally the recommendation is to put Duplicati on the system being backed up, to avoid the network.

I don’t think I understand your restore approach either. Are you restoring Linux by way of Windows?

I also just noticed that the remote operation log will contain two complete file list results for every backup, with all of the 120k filenames and metadata in json format. I don’t think it is a major contributor to backup size, but I wonder if it is necessary to keep that much detail.

log-retention at least deletes them eventually. They’re sometimes useful (but large) complements to the other log entries. I’ve even put them in a spreadsheet (they’re pretty close to CSV) to sort them by date, which can give one a view of earlier activity after log-retention has purged the records from that time.

Not totally necessary, but sometimes useful. Also another reason for big backups to use bigger volumes.

I don’t back up my system. It’s only my data on that server.

So, will someone create a pull request for it?

What is the problem with doing that? If I go through all the filelist.json files in the dlist.zip files and change the paths of the files, will it then be possible?

You’re probably in uncharted territory with hacking on the internals, and there are several to hack.

Editing the dlist does nothing to change the database (which is always supposed to stay in sync).
It’s possible to manually edit the database, but are you sure you can make it match exactly?
The best way to be sure is to recreate the database from your hacked dlist files. While in the filelist, notice:

"metahash":"dPOJyJ0X8wo7LaKnDKQgXqSbfflPc0apa5L1EVvCAis=","metasize":137

which is the typical size for Windows metadata such as permissions. Linux uses a different format, meaning that if you resume the backup on Linux, it will scan all the files and get the new metadata.

Metadata should be relatively small by volume. New backups don’t change existing backups, so the old metadata will remain as-is, and I’m not sure what happens if you actually attempt a restore of what appears (from its path) to be a Linux file but actually has Windows metadata from your initial backup.
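
If anyone does try the filelist.json route, it would look roughly like the untested sketch below. This is purely illustrative, not a supported procedure: it assumes unencrypted .zip dlist files, uses naive text replacement, the file name and prefixes are placeholders, and it should only ever be run against copies. As noted above, the database would then have to be recreated from the edited dlists.

// Untested sketch: rewrite path prefixes in filelist.json inside a *copy* of a dlist zip.
// In the JSON text, Windows backslashes appear escaped as "\\".
using System.IO;
using System.IO.Compression;

string dlistCopy = @"D:\work\some-dlist-copy.zip";   // placeholder file name
string oldPrefix = @"C:\\Users\\me\\data\\";         // escaped form, as it appears in the JSON
string newPrefix = "/home/me/data/";

using (var zip = ZipFile.Open(dlistCopy, ZipArchiveMode.Update))
{
    ZipArchiveEntry entry = zip.GetEntry("filelist.json");
    string json;
    using (var reader = new StreamReader(entry.Open()))
        json = reader.ReadToEnd();

    entry.Delete();                                   // replace the entry with the edited text
    ZipArchiveEntry edited = zip.CreateEntry("filelist.json");
    using (var writer = new StreamWriter(edited.Open()))
        writer.Write(json.Replace(oldPrefix, newPrefix));
}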

There are other experimental ways which could be tried (this whole idea is an experiment, but some parts are more doubtful than others). The database does not store full pathnames these days, but pulls the path prefixes off into their own table to save space. Individual paths have a name and point to a prefix.

Duplicati can recreate missing dlist files from an intact database, so it might be possible to edit the path prefixes, delete (or, more safely, hide for now) the dlist files, ask for a repair, and see if new dlists are made.
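
For the record, “edit the path prefixes” would be something like the untested sketch below, run against a copy of the local database with Duplicati stopped. The PathPrefix table and Prefix column names are an assumption about the current schema; inspect your own database (for example with an SQLite browser) and adjust before trying anything.

// Untested sketch: rewrite path prefixes in a *copy* of the local database.
// Table/column names (PathPrefix, Prefix) are assumptions -- verify against your schema.
using System.Data.SQLite;   // System.Data.SQLite package; any SQLite client would do

string dbCopy = @"D:\work\backup-db-copy.sqlite";   // placeholder path to the copied database

using (var conn = new SQLiteConnection($"Data Source={dbCopy}"))
{
    conn.Open();
    using (var cmd = conn.CreateCommand())
    {
        cmd.CommandText = "UPDATE PathPrefix SET Prefix = replace(Prefix, @old, @new)";
        cmd.Parameters.AddWithValue("@old", @"C:\Users\me\data\");
        cmd.Parameters.AddWithValue("@new", "/home/me/data/");
        cmd.ExecuteNonQuery();
    }
}

After that, per the idea above, the dlist files would be hidden and a repair run, to see whether Duplicati writes new ones with the changed paths.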

The problem is it’s complex, and you might be the first. Whatever comes out of it, you get to keep it.

I’m still not clear on whether the earlier wish was to restore the Linux server’s data by way of Windows, moving it back to the Linux server via SMB, or whether the preferred path (as it once was) is to do everything on Linux. Currently, given this line of questioning, I’m guessing you’d prefer all-Linux but don’t wish to back up again because it was slow the first time, even though a lot can be gained, as mentioned, from a fresh backup.