Still Stupidly Slow Backups?

Hi All,

I’ve come back to Duplicati after about a year away because I couldn’t get it to back up VHD files without copying the whole file every time (completely negating why we would use Duplicati… i.e. deduplication), and backups of normal files of over 300GB were taking hours.

I’ve installed 2.0.4.22 on a test system and can see the exact same results. Has anyone looked at getting rid of the slow SQLite database yet, since that was the consensus as to why it was taking forever to back up?

I don’t want to slag off the product, as I think it has some great features, but it seems that the two fundamental requirements for a backup program would be speed and deduplication, so if neither of these has been fixed yet then I guess I’ll take another look back here in a year or two…

Cheers

Good feedback. I ran some quick tests with two .iso files identical except for one byte. Dedup wasn’t effective as far as I could tell. But identical files were dedup’d.

So I’m thinking it is happening at the file level and not any lower.

It is something to look at improving.

Interesting about sqlite… I’ll run some tests on it as well.

What is your --blocksize set to? (Default is 100KB.)
And how large were these ISOs you tested with?

In my experience the deduplication works well. Changing 1 byte in a secondary copy of a large file should only result in 1 extra block on the back end. Yes, it may take Duplicati some time to process that file initially, but the back end storage should only grow by a minuscule amount.
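
To make that concrete, here’s a minimal sketch of fixed-size block dedup (not Duplicati’s actual code; just SHA-256 over 100KB blocks, matching the default blocksize mentioned above). One byte changed in place produces exactly one new block:

```python
import hashlib
import os

BLOCK_SIZE = 100 * 1024  # the 100KB default --blocksize mentioned above

def block_hashes(data: bytes):
    """Split data into fixed-size blocks and hash each one."""
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

# Stand-in for a large file, plus a copy with one byte changed in place.
original = bytearray(os.urandom(10 * 1024 * 1024))   # 10 MB of random data
modified = bytearray(original)
modified[5_000_000] ^= 0xFF                           # flip one byte in the middle

stored = set(block_hashes(bytes(original)))           # blocks already on the back end
new = [h for h in block_hashes(bytes(modified)) if h not in stored]
print(f"blocks in file: {len(stored)}, new blocks to upload: {len(new)}")
# -> 1 new block, because none of the block boundaries moved
```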

My understanding of Duplicati’s dedupe feature is that they have NOT implemented variable chunk sizes. Therefore, if the one byte was put at the end of the ISO file, the result would be as you say: only one additional block saved at the back end. However, if the one byte was inserted (not changed, but added) at the beginning of the file, then the block boundaries would all be out of sync and the whole file would end up on the back end again. So it doesn’t look like they have actually implemented dedupe very well at all.
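
Here’s the counterpart sketch for the insertion case, under the same assumptions (plain fixed-size chunking, SHA-256 block hashes, nothing Duplicati-specific). One byte inserted at the front shifts every boundary, so essentially every block looks new:

```python
import hashlib
import os

BLOCK_SIZE = 100 * 1024  # same 100KB fixed blocks as above

def block_hashes(data: bytes):
    return [hashlib.sha256(data[i:i + BLOCK_SIZE]).hexdigest()
            for i in range(0, len(data), BLOCK_SIZE)]

original = os.urandom(10 * 1024 * 1024)   # 10 MB stand-in for the ISO
inserted = b"\x00" + original             # one byte inserted at the front, not changed

stored = set(block_hashes(original))
new = [h for h in block_hashes(inserted) if h not in stored]
print(f"new blocks after inserting 1 byte at the front: {len(new)} of {len(block_hashes(inserted))}")
# -> essentially every block is new: every boundary shifted by one byte,
#    so no block content lines up with what is already stored
```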

Correct. Duplicati is strictly doing static file chunking.

It’s not advisable to use it to back up large ISOs or VM disks.

Great point! Yes that defeats the current dedupe implementation. I do not know if dynamic chunk size is the only way to solve it. A sliding dedupe window is a solution I’ve seen in other products.

OK. And that’s what I did in my test. I removed the first byte.

And this also means that you shouldn’t use it to back up an Exchange server (large EDB files) or a SQL server (large database files).

So it looks like this will only be useful for backing up file servers or desktops, I guess?

“Shouldn’t” isn’t the right word… You can still back up files that don’t deduplicate well - it’ll just take more storage on the back end.

I think one of the tricky things about changing the chunking algorithm is doing it without screwing up the performance of normal file chunking.

Yeah I don’t know how variable chunk or sliding window dedupe are actually implemented without totally killing performance… I guess there are some really smart people out there that figured it out!
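
For what it’s worth, the usual trick is a rolling hash over the last few bytes, so cut points are chosen by content rather than by absolute offset. Below is a toy sketch loosely in the style of Gear/FastCDC-type content-defined chunking (a hypothetical illustration, not how Duplicati or any particular product implements it). After a one-byte insertion at the front, the boundaries resynchronize and only the chunk around the change is new:

```python
import hashlib
import os
import random

random.seed(42)
GEAR = [random.getrandbits(64) for _ in range(256)]  # one random value per byte value

MASK = (1 << 13) - 1   # cut when the low 13 bits are zero -> ~8 KB average chunks
MIN_CHUNK = 2 * 1024   # don't cut absurdly small chunks
MAX_CHUNK = 64 * 1024  # force a cut eventually

def cdc_chunks(data: bytes):
    """Toy content-defined chunking with a Gear-style rolling hash."""
    chunks, start, h = [], 0, 0
    for i, b in enumerate(data):
        h = ((h << 1) + GEAR[b]) & 0xFFFFFFFFFFFFFFFF
        length = i - start + 1
        if (length >= MIN_CHUNK and (h & MASK) == 0) or length >= MAX_CHUNK:
            chunks.append(hashlib.sha256(data[start:i + 1]).hexdigest())
            start, h = i + 1, 0
    if start < len(data):
        chunks.append(hashlib.sha256(data[start:]).hexdigest())
    return chunks

original = os.urandom(2 * 1024 * 1024)   # 2 MB stand-in file
inserted = b"\x00" + original            # the same 1-byte insertion at the front

stored = set(cdc_chunks(original))
new = [h for h in cdc_chunks(inserted) if h not in stored]
print(f"chunks: {len(stored)}, new chunks after the insertion: {len(new)}")
# -> only the chunk around the insertion point is new; because cut points depend
#    on the last few bytes of content rather than on absolute offsets, the
#    boundaries resynchronize right after the change
```

The per-byte cost is a shift, an add, and a table lookup, which is why this approach can work without totally killing throughput.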

Technically true, however to comply with our industry regulations (and best practice for any business) we need to send a copy of the local backup to another site. We are fortunate to have 100Mb/s private links to other sites, however sending a 900GB Exchange MDB file over them every night is just bonkers.

At the moment we use Veeam, which only saves the changed parts of a file, however the license is costly and I hoped that we could get a more configurable open source product, but maybe not. :frowning:

Is this Hyper-V making the VHD? Is it the exact same size every backup, or does it change? There appear to be several ways that one can set up the format of VHDs, however Microsoft warns against two formats:

VHD-format dynamic virtual hard disks are not recommended for virtual machines that run server workloads in a production environment

Avoid using VHD-format differencing virtual hard disks on virtual machines that run server workloads in a production environment

The recommended formats then appear to be trimmed down to “Fixed hard disk image”, which doesn’t seem like it would get slid around. Sliding data around would hurt the “disk” a great deal even before the file hurt Duplicati a great deal.

The question about “does VHD size change?” is one way to infer format. I don’t “think” Fixed VHD changes.

Are you checking numbers from the backup log of a run when you say “copying the whole file every time”?
Specifically is “BytesUploaded” actually about the same as “SizeOfExaminedFiles”? That would seem odd even if the file “slid” by some offset (though I don’t see why it would, because it’s such a costly thing to do).
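
If it helps, here’s a rough way to eyeball those two numbers, assuming you export the job’s complete log as JSON. The file name and the recursive lookup helper are placeholders I made up; only the two field names come from the actual log output, and the exact nesting may differ:

```python
import json

def find_stat(node, key):
    """Hypothetical helper: dig a named number out of the exported log, wherever it nests."""
    if isinstance(node, dict):
        if key in node:
            return node[key]
        children = node.values()
    elif isinstance(node, list):
        children = node
    else:
        return None
    for child in children:
        found = find_stat(child, key)
        if found is not None:
            return found
    return None

with open("backup_result.json") as f:   # placeholder name for the exported log
    result = json.load(f)

uploaded = find_stat(result, "BytesUploaded")
examined = find_stat(result, "SizeOfExaminedFiles")
print(f"examined: {examined / 1e9:.1f} GB, uploaded: {uploaded / 1e9:.1f} GB")
print(f"upload ratio: {uploaded / examined:.1%}")   # close to 100% would mean no dedup at all
```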

Does the whole-file copy happen even if the VHD is used only briefly, or does that require an extensive run? Even an extensive run that changes everything seems like there ought to be “some” deduplication possible.

How full is the VHD from the guest OS point of view? If not full, deduplication can be encouraged by zeroing free space, for example running sdelete (carefully…) with the -z option “(good for virtual disk optimization)”.

I only ever use fixed VHDs (VHDX actually). If I fire up the virtual server, then shut it down immediately, then use Duplicati from the host to back up the VHDX, it will copy the whole VHDX file. This has nothing to do with how full the VHDX is; starting and stopping a virtual server should only change a few files, but Duplicati doesn’t understand sliding windows or variable-size chunks, so it gets confused and thinks the whole file is different, as per the examples above.
This is a fundamental issue with chunking that hasn’t been addressed correctly.
The other way to do this is to move Duplicati to a block-level scheme, but that would take a complete rewrite of the storage subsystem, which they are not going to do. Essentially, I don’t believe that Duplicati will ever perform correctly.

So “copy the whole VHDX file” is backed up by the numbers I was asking for? Could you supply stats?

Still need to wade through the VHDX Format Specification v1.00 (the spec for Fixed VHD showed it being very easy), BUT if Fixed VHDX is truly fixed size, then wouldn’t lots of zero blocks in the image be deduplicated?

Even if one doesn’t go out of one’s way to help deduplication, one should see something unless the drive is overwritten hugely with random data (encrypted data might look random enough). Whether or not VHDX slides data around in the image is TBD. That’s where variable chunking will help (but only for small slides).

Obviously your results are your results, but I want to see this studied deeply, and see if any aids exist…

No, I am not looking at Bytes Uploaded etc. as I can already see what’s happening.

Using Veeam from the host, the incremental backup takes approx 5 minutes, whilst the Duplicati incremental backup runs for hours.

Backup programs should be “install, configure, set, and forget” - if they aren’t that, then they aren’t ready for production. It’s a shame, as I can see a lot of work has gone into this product, but they don’t seem to have got the fundamentals of a backup application sorted, so I can’t waste any more of my time. I may try again in yet another year, but I doubt it.

Your choice. I wonder how you see bytes uploaded. Slow is clearly not desired, but diagnosis takes time. Because now seems a bad time for you, I’ll have to resort to web research and conjecture about Veeam.

Veeam must have interesting methods (or your hardware is fast). Reading 300 GB in 5 minutes is FAST.
Possibly they understand the VHDX and NTFS file formats, find the changed files, and scan only those. Duplicati has to go completely through the VHDX at a byte level. Even a backup with variable chunking would need to do so.

Sometimes people debate whether VM backups should be done at the host or in the guest. For simple backup programs such as one would run on a desktop, life in the guest is good: reading file timestamps (if using those to find changed files), or the NTFS USN change journal which Duplicati supports, is faster than scanning everything.

OK, maybe this is the “secret sauce” in Veeam. It’s not that Duplicati doesn’t get the fundamentals right, (theoretically at least – there are definitely still beta bugs) but that Veeam goes far beyond fundamentals.

How Backup Works

During incremental job sessions, the source Veeam Data Mover uses CBT to retrieve only those data blocks that have changed since the previous job session. If CBT is not available, the source Veeam Data Mover interacts with the target Veeam Data Mover on the backup repository to obtain backup metadata, and uses this metadata to detect blocks that have changed since the previous job session.

Changed Block Tracking

The CBT mechanism is implemented as a file system filter driver — Veeam CBT driver. The driver is installed on every Microsoft Hyper-V host added to the backup infrastructure.

And if I got that wrong, there’s more. Duplicati has no VM guest agents, and no change tracking installed. Reading through 300 GB at SATA-3 speed could not be done in 5 minutes even ignoring data processing.
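
Quick back-of-envelope using only the numbers already in this thread (300 GB, a 5-minute window, roughly 600 MB/s for SATA-3):

```python
# Numbers from this thread, not measurements.
file_size_gb = 300     # size of the data a full scan would have to read
window_min = 5         # reported Veeam incremental backup time
sata3_mb_s = 600       # rough ceiling of a SATA-3 link (6 Gbit/s)

required_mb_s = file_size_gb * 1000 / (window_min * 60)
full_read_min = file_size_gb * 1000 / sata3_mb_s / 60

print(f"throughput needed to read it all in {window_min} min: {required_mb_s:.0f} MB/s")  # ~1000 MB/s
print(f"time for a full read at SATA-3 speed: {full_read_min:.1f} min")                   # ~8.3 min
# A 5-minute incremental therefore can't be re-reading the whole thing at SATA-3 speed;
# change tracking lets Veeam skip most of it.
```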

If you want speed, Duplicati in the guest might do better. Run it in the Administrators group with --usn-policy in use, which should avoid scanning timestamps, and hopefully it will find small individual files to read for changes.

Or just be happy that you have a premium product that goes the extra mile to work well for your situation. Duplicati, as a general-purpose backup product (at the moment) can’t do that, and probably cannot soon.

EDIT:

The USN_RECORD_V2 structure looks to me like a USN Change Journal doesn’t keep detailed CBT records, meaning Duplicati can’t extend its current USN support into something that would precisely track the changes.

UrBackup seems to have a reasonably priced commercial add-on that might help with the VHDX backup:

Change block tracking for UrBackup Client on Windows 2.x

Without change block tracking all data has to be read and inspected in order to find and transfer the differences during an incremental image or file backup. This can take hours compared to the same taking minutes with change block information.

It looks, though, like the “client” means the system being backed up, i.e. the backup software resides in the guest, as opposed to Veeam, where host software coordinates with additional guest software to get the backup done.

Possibly you’ll find something else. There are certainly other commercial programs aimed at VM backups.

Good luck!

Veeam does have a leg up, as there is a client-side install required. While it is a great use case, this just doesn’t seem to be a use case for coming out of beta.