Write operations on source disk

Snowman8373 · February 1, 2024, 2:38pm

Hello,

I am currently using Duplicati to back up some local files from my PC’s SSD to an external HDD. I noticed in the Windows Task Manager that there are constant write operations to my local SSD (where the source folder sits), at up to 30 MB/s.

Does anybody know what is happening here? Shouldn’t there only be reading operations? In the long run this would wear out the SSD.

Thanks for any clarification on what is going on.

ts678 · February 1, 2024, 3:38pm

No. It’s not a file copy. Assuming we’re talking C: drive, there’s likely a database and temp files.

Database management
tempdir
How the backup process works

Snowman8373 · February 1, 2024, 3:50pm

Ahh, now I see the database file growing, I just did not think of that. Of course there would be some temp files, but the write rate seemed very high to me. But now with the database it makes sense (and this is also my first backup so a lot to write).

Thanks for your fast response

ts678 · February 1, 2024, 4:10pm

One can move the database to the external HDD. That’s what I do, mostly to have a nice bundle if the PC HDD ever dies. Without that, one would have to recreate the DB to do disaster recovery. It also balances the drive loads. Depending on the size of your backup (current defaults are good for 100 GB), database use can grow to a higher level as the backup continues, so you can then decide whether you prefer slower access on the HDD, or faster access plus SSD writes.

asynchronous-upload-folder can also be moved, but I’m not sure if you’re motivated enough to do that.

Snowman8373 · February 1, 2024, 4:29pm

Thanks for the tip, I will probably do that!

Unfortunately, I don’t really understand what you mean by that. Could you explain further?

ts678 · February 1, 2024, 4:38pm

Especially for the initial backup, the database will grow as it goes. This may mean more accesses (although many of them will probably be reads). If done on HDD, it’s slower. SSD is faster, but subject to wear. During my own backup from an HDD onto a portable HDD that also holds the database, I could see the load shift in Task Manager.

Snowman8373 · February 1, 2024, 4:52pm

Ok, so probably best is to just move the database to the external HDD and see how it performs?

ts678 · February 1, 2024, 5:00pm

What’s best depends on which factor you prefer to optimize.
How big is source? If big enough, blocksize bump may help.
This would require a fresh backup start though.

As a side note, drive letters change. Windows Drive Letters provides special handling for Destination, however the more other things get moved to a drive letter, the more things must change if letter does.

Snowman8373 · February 1, 2024, 5:07pm

Currently my source is 550GB and will probably grow with ~100GB/Year. Right now I am using the default settings for block size and remote volume size.

My external drive is just connected via normal USB.

Regarding the drive letters, I have already thought about that and manually assigned a higher letter, so it will always be free and should not change (at least it has not so far, after a few unplugs/plugs). But in case it changes, I can just change the letter for the drive manually before the backup. Then I would not have to change the configs.

Edit:
Speed is not my main focus, I am probably happier if my SSD just lives longer (although I know that does not matter too much nowadays if you do not write crazy amounts).

Let’s say I move the database to the external HDD and the performance is fine for me. Would you still recommend changing the block size and remote volume size in my case, where the backup is not to a cloud?

ts678 · February 1, 2024, 6:27pm

Block size, yes, to keep database speed up, especially given the expected growth of the source area.

Remote volume size is less clear, and even depends on things like drive format. exFAT (unlike NTFS) does linear search, so slows down with a lot of files. Regardless, yours sounds similar to mine, and I scaled both values up by 10, so blocksize 1 MB and Remote volume size 500 MB (although thinking about it more, it depends on frequency of restore. For disasters, big is probably fine – doing frequent restores might find things like single-file performance better with a smaller Remote volume size). This hasn’t been benchmarked, and in general I would focus on a blocksize boost for a larger backup IMO.
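To put rough numbers on the tenfold scaling, here is a minimal Python sketch of the arithmetic. It assumes the 550 GB source mentioned earlier, treats KB/MB/GB as binary units (an assumption about how the sizes are interpreted), and ignores compression and deduplication, so the counts are only ballpark figures. With the same arithmetic, a 100 GB source at the default blocksize comes out at roughly a million blocks.

```python
KIB = 1024
MIB = 1024 * KIB
GIB = 1024 * MIB


def count_units(total_bytes: int, unit_bytes: int) -> int:
    """Return how many fixed-size units are needed to hold total_bytes (rounded up)."""
    return -(-total_bytes // unit_bytes)


source = 550 * GIB  # source size mentioned in the thread

# Sizes from the thread: defaults (100 KB blocks, 50 MB volumes) vs. both scaled by 10.
for label, blocksize, volume_size in [
    ("default", 100 * KIB, 50 * MIB),
    ("scaled x10", 1 * MIB, 500 * MIB),
]:
    blocks = count_units(source, blocksize)    # blocks tracked in the local database
    volumes = count_units(source, volume_size)  # files at the destination (uncompressed estimate)
    print(f"{label}: ~{blocks:,} blocks, ~{volumes:,} remote volumes")
```

The block count is what drives database size, and the volume count is what drives the number of files the destination filesystem has to list.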

EDIT:

github.com/duplicati/duplicati

File list verify is unnecessarily slow on some filesystems

opened 10:19PM - 18 Nov 23 UTC

Jojo-1000

performance issue

- [x] I have searched open and closed issues for duplicates.
- [x] I have searched the [forum](https://forum.duplicati.com) for related topics.

----------------------------------------

## Environment info

- **Duplicati version**: current master
- **Operating system**: Windows 10
- **Backend**: File

## Description

Listing the backend files for exFAT targets (may also apply to other FAT types) takes much longer than required. It is suspected this is due to unnecessary lookups of files by name. Using `DirectoryInfo` to directly list the folder contents as `FileInfo`, this lookup is avoided and the list operation completes much faster (50 seconds instead of 1 hour for 110000 files). `ISystemIO` should be updated to implement the more efficient method of listing.

## Steps to reproduce

1. Create a folder on exFAT partition with 10000 - 100000 empty files
2. Create new backup with that folder as target
3. Run backup

- **Actual result**: File verify takes a long time (minutes to hours).
- **Expected result**: File verify should not take longer than a minute.

## Test code

This code simulates two different methods of file access. It was discovered that accessing `LastAccessTime` or other metadata is the main reason for the slowdown.

```cs
string path = @"F:\test";
int iterations = 1;

var timeInfo = TimeSpan.Zero;
var timeListNames = TimeSpan.Zero;
var timeLookupNames = TimeSpan.Zero;
var watch = System.Diagnostics.Stopwatch.StartNew();

for (int i = 0; i < iterations; ++i)
{
    // List files directly (current implementation)
    watch.Restart();
    string[] files = System.IO.Directory.GetFiles(path);
    watch.Stop();
    timeListNames += watch.Elapsed;

    watch.Start();
    var accessTimes = (from fileName in files
                       let fi = new System.IO.FileInfo(fileName)
                       select fi.LastAccessTime).ToList();
    watch.Stop();
    timeLookupNames += watch.Elapsed;

    // List by DirectoryInfo
    watch.Restart();
    var accessTimes2 = (from fi in new System.IO.DirectoryInfo(path).GetFiles()
                        select fi.LastAccessTime).ToList();
    watch.Stop();
    timeInfo += watch.Elapsed;
}

Console.WriteLine($"List only names: {timeListNames}\nList + lookup by name: {timeLookupNames}\nList DirectoryInfo: {timeInfo}");
```

### Output with 17000 files

```
List only names: 00:00:00.0264646
List + lookup by name: 00:00:58.3627105
List DirectoryInfo: 00:00:00.0315012
```

The above is the GitHub issue, which you can trace back to the forum thread where we figured out the problem with exFAT. The backup in that case was probably larger than yours, though. Also, FYI, Remote volume size can be changed later (preferably slowly) if you don’t like it, whereas blocksize generally must be set right from the initial backup.
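The access pattern described in that issue (list names, then look up each file's metadata by name, versus reading metadata straight from the directory enumeration) can be sketched in Python as a rough analogue. This is not Duplicati code: `atimes_by_name` and `atimes_by_scandir` are hypothetical helper names, and the benefit of `os.scandir` caching metadata from the directory listing is mainly a Windows behaviour, so timings on other systems or filesystems may show little difference.

```python
import os
import tempfile
import time


def atimes_by_name(path: str) -> list[float]:
    """List names first, then look up each file's metadata by name (one lookup per file)."""
    return [os.stat(os.path.join(path, name)).st_atime for name in os.listdir(path)]


def atimes_by_scandir(path: str) -> list[float]:
    """Enumerate the directory once and read metadata from each returned entry."""
    with os.scandir(path) as entries:
        return [entry.stat().st_atime for entry in entries if entry.is_file()]


def timed(func, path):
    """Run func(path) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(path)
    return result, time.perf_counter() - start


if __name__ == "__main__":
    # Build a throwaway folder of empty files, similar to the issue's repro steps.
    with tempfile.TemporaryDirectory() as folder:
        for i in range(1000):
            open(os.path.join(folder, f"file{i}.tmp"), "w").close()
        for func in (atimes_by_name, atimes_by_scandir):
            result, seconds = timed(func, folder)
            print(f"{func.__name__}: {len(result)} files in {seconds:.4f} s")
```

To see the effect reported in the issue, the folder would need to sit on an exFAT drive and hold tens of thousands of files; the temporary folder here only demonstrates the two code paths.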

kees-z · February 1, 2024, 8:01pm

Before data is uploaded to the backup target, each chunk of data is written multiple times to local storage. Database transactions are probably just a small part of the write activity you see in Task Manager.
Especially during the initial backup, I guess more data is written to local storage than the total size of the source data. This is because quite a lot of operations have to be applied to the source data before the backup data is uploaded:

1. All source data has to be split up into blocks of a fixed size (default is 100 KB). During the initial backup, deduplication does not occur, so all source data will be split up into 100 KB blocks and archived in a Zip file. Before this Zip file is created, raw blocks will be stored in .TMP files. This is the first write operation on local storage and is about the same size as the complete selection of source data. These temporary files will be stored at the location specified by the tempdir setting.
2. The temporary files containing your raw data chunks have to be archived and compressed in .ZIP archives (Remote Volumes). So all data will be sent to a new file where the blocks are stored in a compressed format. I’m not 100% sure if this is a separate operation, or if this is combined with the first step.
3. If encryption is applied (which is the default), the generated .ZIP files (remote volumes) have to be converted to encrypted archives (.ZIP.AES files). This is again a new write operation of the same source data. The encrypted archives will be created at the location specified by the asynchronous-upload-folder setting.
4. In the meantime, additional writing is performed to the local storage. If free internal memory is not sufficient, TEMP files will be created, writing to the local database is performed, etc.

So the total amount of source data will be written to local storage 2 or 3 times before it is uploaded to the backup location.
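As a back-of-the-envelope illustration of that "2 or 3 times" estimate, here is a small Python sketch using the 550 GB source mentioned earlier in the thread. The endurance rating of 300 TB written is purely a hypothetical figure for illustration (check the actual rating of your drive), and compression is ignored, so real numbers will likely be lower.

```python
GB = 1000 ** 3
TB = 1000 ** 4

# Hypothetical SSD endurance rating, for illustration only.
ASSUMED_ENDURANCE_TB = 300


def local_write_estimate(source_bytes: int, passes: int) -> int:
    """Bytes written locally if the source data is rewritten `passes` times."""
    return source_bytes * passes


def share_of_endurance(written_bytes: int, endurance_tb: float) -> float:
    """Written volume as a percentage of the assumed endurance rating."""
    return written_bytes / (endurance_tb * TB) * 100


source = 550 * GB  # source size mentioned in the thread

for passes in (2, 3):
    written = local_write_estimate(source, passes)
    percent = share_of_endurance(written, ASSUMED_ENDURANCE_TB)
    print(f"{passes} passes: {written / TB:.2f} TB written locally, "
          f"about {percent:.2f}% of an assumed {ASSUMED_ENDURANCE_TB} TB rating")
```

Under these assumptions the initial backup is a one-off cost; later incremental backups only process changed data.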

To minimize these write operations, use the tempdir and asynchronous-upload-folder settings to perform these actions on cheap storage.
On my Synology NAS, I have plugged in a 16 GB USB key, created a small folder structure (/temp/duplicati/upload) and pointed the temp and asynchronous-upload folders to /usbshare1/temp/duplicati/ and /usbshare1/temp/duplicati/upload/ respectively (on Windows, use something like X:\temp\duplicati and X:\temp\duplicati\upload).
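As a minimal sketch of how these settings fit together, the following Python helper builds the corresponding option strings for the Windows example paths above. The `relocation_options` helper and the database file name are hypothetical, and `--dbpath` is included on the assumption that you also want to relocate the database as suggested earlier in the thread; the same values can be entered as advanced options in the GUI instead.

```python
from pathlib import PureWindowsPath


def relocation_options(scratch_root: str, database_file: str) -> list[str]:
    """Build option strings that point temp, upload staging, and database at another drive."""
    root = PureWindowsPath(scratch_root)
    return [
        f"--tempdir={root}",
        f"--asynchronous-upload-folder={root / 'upload'}",
        f"--dbpath={PureWindowsPath(database_file)}",
    ]


# Example paths from the post; the database file name is a made-up placeholder.
for option in relocation_options(r"X:\temp\duplicati", r"X:\duplicati\backup.sqlite"):
    print(option)
```

Keep in mind the earlier note about drive letters: every path that points at X: has to be updated if the letter changes.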

Snowman8373 · February 2, 2024, 7:38am

Thanks, maybe I will change that someday.

@kees-z Thank you for your detailed insight! But now I have some more questions:

Good that you can confirm that, because that’s what I also thought yesterday evening after some further thought and research.

Why is this even done on local disk and not in RAM? Of course on the local disk there will be much more space available but that can not be the reason. RAM would be way faster by just writing the end result to disk. And for online backups to a cloud the limiting factor is the internet connection anyway so you do not need to process much data in advance to keep up with the internet connection. Or am I missing something?

Of course, for data like block and file hashes it makes sense to store them on disk, as this will grow with the backup size. (I am referring to this explanation: How the Backup Process Works - Duplicati 2 User's Manual)

Now I will think about mounting a folder/virtual disk in RAM (I am on Windows, but I am sure one can set up something comparable to tmpfs on Linux), because using a USB stick is probably a big bottleneck for backups to an external HDD.

I am not deep into the topic, so what do you think about that? I am just curious and maybe there is also a good reason for all this.

gpatel-fr · February 2, 2024, 9:51am

That’s a good question. IIRC there is a pending PR for that, but it’s complicated code and there is some concern about breaking stuff. More immediately, this is mainly a concern for the initial backup of a huge amount of data, which tends to be done on rather powerful systems that can afford a RAM disk, and it’s quite possible to redirect temp files to this device for the initial backup (or even all backups). That does indeed make the backup faster.

Unless I am (recently) mistaken, you will need a server operating system. On client systems, you have only paid addons, freeware (very limited), and open source addons (not quite as reliable I think as tmpfs or the built in solution in Windows server OS)

kees-z · February 2, 2024, 10:10am

ImDisk Toolkit could do this on Windows. You can even mount a Ramdisk before a backup starts and unmount it after completion using --run-script-before and --run-script-after.

Not sure how a USB stick would interfere with making backups to an external HDD. As long as drive letters don’t change, you can point TEMP and UPLOAD files to the USB stick and upload your backup files to the external HDD. USB3 has enough available bandwidth to send both through the same bus.

gpatel-fr · February 2, 2024, 10:15am

You say ‘could’, not ‘can’. Is that meaningful, that is, do you actually use this tool on a regular basis?

Snowman8373 · February 2, 2024, 10:17am

Nice to know, but I wonder why they went with the disk approach in the beginning, because from a programming perspective IMHO that’s just more work.

Thanks for all your answers! I will try it with this tool and hopefully this will be changed in the future.

kees-z · February 2, 2024, 10:30am

Never used it myself, but read about successful use here.

I’m not a developer, but I guess it’s because Duplicati relies on external tools for compression and encryption, like AESCrypt.

Snowman8373 · February 2, 2024, 10:44am

Have just tried it, works fine at first glance.

Hmm, yeah, that’s probably the reason.

Do you remember the title of it? I did not find anything, but if it exists it would be nice to have a look.

gpatel-fr · February 2, 2024, 11:08am

Yes, actually it does not exist, it was referred to by the author, but it has not been sent as a PR:

You’ll see that it involves change to more than 30 files.

kees-z · February 2, 2024, 11:15am

FYI: This has also been discussed years ago here.