Duplicati 2 vs. Duplicacy 2

dgcom · September 18, 2017, 7:15pm

Hi!

Yes, there should be no huge difference in performance if the designs are close enough. And again, let me say that this testing is not fully conclusive yet - I need to run similar tests on a larger data source.

This level of compression impact is strange - there was plenty of CPU headroom. Testing was done on an i5-3570 at 3.40GHz; I haven’t seen a single core pegged, and compression should utilize hyper-threading pretty well.
For backups, --zip-compression-level=1 is actually not bad if performance is a concern. And if you are backing up already compressed files, it may even be the recommended setting.
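
To make the trade-off concrete, here is a minimal Python sketch (not Duplicati code) that compares Deflate levels on compressible and incompressible data. It assumes Python's `zlib` as a stand-in for the zip library Duplicati uses, so the absolute numbers will differ; the sample data and the `compressed_sizes` helper are made up for illustration.

```python
import os
import zlib


def compressed_sizes(data: bytes, levels=(0, 1, 6, 9)) -> dict:
    """Return {level: compressed size in bytes} for the given Deflate levels."""
    return {level: len(zlib.compress(data, level)) for level in levels}


# Highly repetitive text stands in for compressible data.
compressible = b"Duplicati backs up blocks of data. " * 20000
# Random bytes stand in for already-compressed data (zip, jpeg, ...).
incompressible = os.urandom(len(compressible))

if __name__ == "__main__":
    for name, data in (("compressible", compressible),
                       ("incompressible", incompressible)):
        print(name, len(data), compressed_sizes(data))
```

On random input every level should produce output about as large as the input, which is the case where spending CPU on a higher level buys nothing.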

The behavior of --synchronous-upload="true" was a surprise for me as well, and I remember looking at the disk I/O chart and seeing better utilization when it was set to true. I am pretty sure I haven’t screwed up the testing, but I will re-test.

I have some ideas on what profiling I can do on this, but first I want to try similar tests on a much bigger set of very large files - 15GB with a Windows Server 2016 iso, some .Net app memory dump, a VirtualBox vmdk file with CentOS and a couple of rar files… The largest file is 5.5GB, and most files are non-compressible.

Will update this thread with the results.

kenkendk · September 18, 2017, 8:50pm

That could explain it perhaps. If the disk is reading two different files (source and compressed file), maybe that reduces the throughput. But it should not matter as much as your results indicate.

I look very much forward to the results of this!

tophee · September 18, 2017, 9:10pm

You’re doing such a great job so I don’t want to ask too much, but if you don’t mind, could you post your updated results directly in the actual post? It would make it much easier for people to find and read those results and they would be preserved as long as this forum exists.

It can be done via copy and paste: Here's a very easy way of creating tables on this forum

JonMikelV · September 18, 2017, 9:28pm

What? Then why am I wasting my time hitting spaces all over the place to get my posts to line up???

Let’s see…

DT	DC	Feature
x	*	Free (*DC command line free for individual users, else per user or computer fee)
x	x	Client side software
		Server side software
…	…	…

Oh, yeahhh…much better…

dgcom · September 19, 2017, 12:48am

@kenkendk - I will try to point the temporary folder to another local drive and see if it changes the behavior… I also have to point out that source and temp are located on a relatively slow disk… This creates a lot of variables and still does not explain why CY is at least twice as fast. We should definitely try to find the bottleneck.

@tophee - I prefer working with a real spreadsheet, of course - it allows much better formatting, calculations, notes, etc…
However, I will see if I can post brief summaries here and link back to the Google Spreadsheet for detailed data.
I do not always record data systematically - quick tests may benefit from embedded tables.

tophee · September 19, 2017, 3:53am

@kenkendk, maybe that could be the default compression level if the majority of files in a block are compressed files such as zip or jpeg or docx?

JonMikelV · September 19, 2017, 4:05am

Are you proposing a per-dblock (archive) compression setting based on the actual archive contents?

tophee · September 19, 2017, 4:15am

Basically, I don’t know what I’m talking about

I was going to propose the lower compression level for certain file types, but then realised that Duplicati is not zipping individual files, and so I came up with this majority thing.

kenkendk · September 19, 2017, 8:46am

Compressed files are not re-compressed. There is a file (default_compressed_extensions.txt) which lists all extensions that are known to be incompressible:

github.com

duplicati/duplicati/blob/master/Duplicati/Library/Main/default_compressed_extensions.txt

# List of compressed archives
.7z #7-Zip Compressed File
.alz #ALZip Archive
.bz #Bzip Compressed File
.bz2 #Bzip2 Compressed File
.cab #Microsoft Cabinet File / InstallShield Cabinet File
.cbr #Comic Book RAR Archive
.cbz #Comic Book Zip Archive
.dar #Disk ARchive Compressed File
.deb #Debian Software Package
.dl_ #Windows Dynamic Link Library, packed
.dsft #Microsoft Application Virtualization package
.ex_ #Windows Executable, packed
.gz #Gnu Zipped Archive
.jar #Java Archive File
.lzma #Lzma Archive File
.mpkg #Meta Package File
.msi #Microsoft Windows Installer Package
.msp #Microsoft Windows Installer Patch
.msu #Microsoft Update Standalone Package

This file has been truncated.

Content from files in that list is sent to the zip archive without compression, so that should not matter.
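
As a rough illustration of that behaviour (not Duplicati's actual C# implementation), the Python sketch below stores entries with a known-compressed extension as-is and deflates everything else. It simplifies by writing whole files as zip entries, whereas Duplicati writes blocks; `COMPRESSED_EXTENSIONS` is a hypothetical subset of the real list, and `add_to_archive` and `build_archive` are illustrative names.

```python
import io
import os
import zipfile

# Hypothetical subset; the real list lives in default_compressed_extensions.txt.
COMPRESSED_EXTENSIONS = {".7z", ".bz2", ".gz", ".jar", ".msi"}


def add_to_archive(archive: zipfile.ZipFile, name: str, data: bytes) -> None:
    """Store known-compressed content as-is and deflate everything else."""
    ext = os.path.splitext(name)[1].lower()
    if ext in COMPRESSED_EXTENSIONS:
        method = zipfile.ZIP_STORED      # no compression attempted
    else:
        method = zipfile.ZIP_DEFLATED    # normal Deflate compression
    archive.writestr(name, data, compress_type=method)


def build_archive(files: dict) -> bytes:
    """Build an in-memory zip archive from a {name: bytes} mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, data in files.items():
            add_to_archive(archive, name, data)
    return buffer.getvalue()
```

With this split, the compression level only affects the CPU cost of entries whose extension is not on the list.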

tophee · September 19, 2017, 9:04am

So the performance differences observed with --zip-compression-level=1 are based only on (attempted) compression of file types not on that list?

kenkendk · September 19, 2017, 9:28am

Yes. I even made a unit test to ensure that the compression excludes non-compressible files.

@dgcom can you confirm that the file extensions are not on the list?

dgcom · September 19, 2017, 1:41pm

The files in my current data set are not on the list of compressed files; moreover, they compress pretty well - approximately 50%.

dgcom · September 20, 2017, 5:22am

Apologies for the delay with testing, but I wanted to make sure that I provide useful data here.

Since we all agree that there is a bottleneck of some sort, I decided that the current data set should be good enough to help identify it. So I automated the test a bit more (after all, it takes a lot of time!).
Initially I thought of using Perfmon, which gives nicer charts but would take more time. I quickly realized that screenshots from Sysinternals Process Explorer should be enough here to pinpoint an area needing improvement.

Writing the entire report in a forum post is also not very feasible, so I used Google Presentation this time.
But to allow better searching and a better forum experience, I’ll add some info here as well.

I ran each test at least 3 times (actually more than that) and captured the timing and parameters (to be sure) every time.
Yes, using --synchronous-upload="true" does indeed pause execution and takes more time - I believe I swapped the false/true values in my previous testing.
I also checked what would happen if I located the temp folder on another drive (BTW, the Windows version uses the TMP environment variable and ignores TEMP/TMPDIR). And, obviously, I tested the lowest compression level, no compression at all, and no encryption in addition to no compression.
The picture from the table below is relatively clear - yes, TI can reach CY performance… if you disable compression… and possibly encryption - it all probably depends on your CPU.
But if you browse through the performance graphs I include in the linked document, you will realize that the actual bottleneck is the less efficient use of CPU, including calculating the hash…
I briefly looked at the results of a google:".net sha256 efficiency" search and found some interesting reading there.

Anyway, I still want to see how large files behave, but this preliminary result will help me restrict and better manage the testing on the larger data set…

| Run 1 (time) | Run 2 (time) | Run 3 (time) | Parameters |
|---|---|---|---|
| | | | *Duplicati* |
| 0:09:07 | 0:08:48 | 0:08:51 | `--synchronous-upload="true"` |
| 0:07:11 | 0:07:47 | 0:07:22 | `--synchronous-upload="false"` |
| 0:06:20 | 0:06:36 | 0:06:21 | `--synchronous-upload="false"`, `Set TMP=D:\Temp` (separate drive) |
| 0:04:27 | 0:04:13 | 0:04:12 | `--synchronous-upload="false" --zip-compression-level=1`, `Set TMP=D:\Temp` |
| 0:03:31 | 0:03:35 | 0:03:37 | `--synchronous-upload="false" --zip-compression-level=0`, `Set TMP=D:\Temp` |
| 0:03:01 | 0:03:09 | 0:03:10 | `--synchronous-upload="false" --zip-compression-level=0 --no-encryption="true"`, `Set TMP=D:\Temp` |
| | | | *Duplicacy* |
| 0:03:09 | 0:03:09 | 0:03:10 | `-threads 1` |
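
For anyone who wants to work with these timings, the Python sketch below averages the three runs per configuration and expresses each average relative to Duplicacy. The run times are copied from the table above; the short labels and the helper names (`to_seconds`, `to_hms`, `average_times`) are only illustrative.

```python
from statistics import mean


def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string to seconds."""
    h, m, s = (int(part) for part in hms.split(":"))
    return h * 3600 + m * 60 + s


def to_hms(seconds: float) -> str:
    """Convert seconds back to h:mm:ss, rounded to the nearest second."""
    total = round(seconds)
    return f"{total // 3600}:{total % 3600 // 60:02d}:{total % 60:02d}"


# Run times copied from the table; labels are shortened for readability.
RUNS = {
    "sync=true": ("0:09:07", "0:08:48", "0:08:51"),
    "sync=false": ("0:07:11", "0:07:47", "0:07:22"),
    "sync=false, TMP on D:": ("0:06:20", "0:06:36", "0:06:21"),
    "level=1, TMP on D:": ("0:04:27", "0:04:13", "0:04:12"),
    "level=0, TMP on D:": ("0:03:31", "0:03:35", "0:03:37"),
    "level=0, no encryption, TMP on D:": ("0:03:01", "0:03:09", "0:03:10"),
    "Duplicacy -threads 1": ("0:03:09", "0:03:09", "0:03:10"),
}


def average_times(runs: dict) -> dict:
    """Return {label: mean run time in seconds}."""
    return {label: mean(to_seconds(t) for t in times)
            for label, times in runs.items()}


if __name__ == "__main__":
    averages = average_times(RUNS)
    baseline = averages["Duplicacy -threads 1"]
    for label, seconds in averages.items():
        print(f"{to_hms(seconds)}  {seconds / baseline:4.2f}x  {label}")
```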

tophee · September 20, 2017, 5:34am

Sorry if this sounds obvious to some, but do I read these results correctly when I say: It’s the compression, stupid? (Although duplicacy still has an edge unless you turn off encryption too… )

kenkendk · September 20, 2017, 6:56am

That is really great work! And I am well aware that such setups take a long time to set up and report on.

That is some comfort to me. It means we need to tweak things to get there, and reconsider the default settings.

There is an issue here that says that the default hashing implementation is inefficient, and that we should use another:

github.com/duplicati/duplicati

Don't use SHA256Managed (use the native one)

opened 12:51PM - 02 Aug 17 UTC

closed 08:28AM - 20 Sep 17 UTC

GSPP

enhancement performance issue

When profiling Duplicati I sometimes see 100% CPU usage and at that time I see a… lot of time spent in SHA256Managed. The .NET Framework also includes a native SHA256 implementation. From earlier experimenting I know that the native version becomes much faster when the data blocks are over a certain size. 4KB would certainly be over the cut-off point. I remember that the threshold was a few hundred bytes with SHA1. So I suggest simply switching to the native algorithm to improve speed. I haven't quite understood the source code yet but from profiling I get the impression that the hash function is invoked with blocks that are not very big. I might be wrong about that. Especially when using the native version it pays off to pass bigger chunks of data per call to TransformBlock. That's why I wanted to mention that. Tested with latest 2.0 version as of today.
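
The chunk-size point in the issue can be illustrated with a small Python sketch. This is not the .NET `TransformBlock` API; it uses Python's `hashlib` only to show that the digest is independent of how the data is sliced, while the number of hash calls drops as chunks grow. `sha256_stream` is an illustrative helper, and the sketch makes no claim about actual speed in .NET.

```python
import hashlib
import io


def sha256_stream(stream, chunk_size: int):
    """Hash a stream, feeding chunk_size bytes per update() call.

    Returns (hex digest, number of update calls).
    """
    digest = hashlib.sha256()
    calls = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        digest.update(chunk)
        calls += 1
    return digest.hexdigest(), calls


if __name__ == "__main__":
    data = bytes(range(256)) * 4096  # 1 MiB of sample data
    for size in (256, 4096, 65536):
        hexdigest, calls = sha256_stream(io.BytesIO(data), size)
        print(size, calls, hexdigest[:16])
```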

It should be trivial to switch over:

It does nothing on Mono, but no ill effects:

github.com

mono/mono/blob/main/mcs/class/System.Core/System.Security.Cryptography/SHA256Cng.cs

Yes, it appears that compression is part of the problem. From what I read, CY uses LZ4, which is considered “fast”.

I can add LZ4 support to Zip:

But unfortunately, writing LZ4 streams inside a zip archive will make it incompatible with most Zip tools.

I will make a version that uses SHA256Cng, to see if that takes the edge off the CPU usage.

I also have the concurrent_processing branch, which I have been working on for improving error handling. With that branch, it should be possible to stream data as fast as the disk can provide it, without any of the “pauses” you see.

We can also consider storing the zip file in memory, instead of writing it to disk (--write-volumes-to-disk ? ).

kenkendk · September 20, 2017, 11:17am

I have put up a new version (2.0.2.8) which now uses SHA256Cng as the hashing implementation: Releases · duplicati/duplicati · GitHub

drakar2007 · September 20, 2017, 1:27pm

Never mind, I figured it out - I had the terminology backwards in my head

original stupid question for posterity

I’m confused - is synchronous-upload disabled by default (i.e. if not using the advanced option setting at all)? Is there any reason?

dgcom · September 20, 2017, 10:04pm

Thank you, glad I can help here.

Oh, I see, glad it was discovered already. Interestingly, I see that .Net Core 2.0 has a better managed implementation as well - I wonder if this can be backported…
Performance Improvements in .NET Core

I got your latest build and tested it - results are below. It clearly shows a very noticeable performance improvement, however there is still room to grow.
Implementing multithreaded processing and upload/download should help to offload those tasks from cores which are busy with hashing, compression and encryption - overall, backup programs should be I/O bound on a sufficiently fast current CPU. But then you need to take care of keeping blocks in memory instead of thrashing the disk - and that is more complex, because memory use can grow significantly depending on the parameters used.
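
A minimal Python sketch of that idea, assuming a simple three-stage pipeline (read, hash and compress, upload) connected by bounded queues so that the number of blocks held in memory stays capped. This is not how Duplicati's concurrent_processing branch works; `reader`, `processor`, `uploader`, `run_pipeline` and `max_pending` are all hypothetical names, and the "upload" is just a list append.

```python
import hashlib
import queue
import threading
import zlib

SENTINEL = None  # marks the end of the stream


def reader(blocks, out_q):
    """Feed raw blocks into the pipeline; put() waits while the queue is full."""
    for block in blocks:
        out_q.put(block)
    out_q.put(SENTINEL)


def processor(in_q, out_q, level=1):
    """Hash and compress each block, then pass it on."""
    while (block := in_q.get()) is not SENTINEL:
        out_q.put((hashlib.sha256(block).hexdigest(), zlib.compress(block, level)))
    out_q.put(SENTINEL)


def uploader(in_q, uploaded):
    """Stand-in for a real upload: just collect the processed blocks."""
    while (item := in_q.get()) is not SENTINEL:
        uploaded.append(item)


def run_pipeline(blocks, max_pending=4):
    """Run the three stages concurrently with at most max_pending items per queue."""
    raw_q = queue.Queue(maxsize=max_pending)
    ready_q = queue.Queue(maxsize=max_pending)
    uploaded = []
    threads = [
        threading.Thread(target=reader, args=(blocks, raw_q)),
        threading.Thread(target=processor, args=(raw_q, ready_q)),
        threading.Thread(target=uploader, args=(ready_q, uploaded)),
    ]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return uploaded
```

The `maxsize` on each queue is what bounds memory: a fast reader simply waits until the slower stage catches up, at the cost of some idle time.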

| 2.0.2.6 | 2.0.2.8 | Improvement | Parameters |
|---|---|---|---|
| 0:08:55 | 0:08:19 | 6.66% | `--synchronous-upload="true"` |
| 0:07:27 | 0:07:03 | 5.39% | `--synchronous-upload="false"` |
| 0:06:26 | 0:06:02 | 5.98% | `--synchronous-upload="false"`, `Set TMP=D:\Temp` |
| 0:04:17 | 0:03:47 | 11.78% | `--synchronous-upload="false" --zip-compression-level=1`, `Set TMP=D:\Temp` |
| 0:03:34 | 0:02:50 | 20.87% | `--synchronous-upload="false" --zip-compression-level=0`, `Set TMP=D:\Temp` |
| 0:03:07 | 0:02:42 | 13.39% | `--synchronous-upload="false" --zip-compression-level=0 --no-encryption="true"`, `Set TMP=D:\Temp` |

CPU was still pegged, but seems to be used more efficiently.

Looking forward to seeing all three major math points (hash, compression and encryption) optimized.
For now, I’ll use --zip-compression-level=1 for the test on the larger source set.

kenkendk · September 21, 2017, 8:57am

The optimization essentially fixes the problem that SHA256.Create() returns the slow implementation in .Net standard profile (which Duplicati uses). The fix I made was to change calls to HashAlgorithm.Create("SHA256") into calling my own method that returns the faster implementation.

The change has impact on Windows only, but I can see that there is supposed to be a faster OpenSSL based version for Linux, which I might be able to load also.

dgcom · September 22, 2017, 4:41am

Thank you for the clarification, @kenkendk
My tests against the 15GB source are finally running; I hope to have results tomorrow.

In the meantime - since sha256 is improved, the other two (encryption and compression) may need a look.
Although disabling encryption when no compression is used does not show a lot of improvement anyway…
But what about zip? Have you thought about different implementations? Since it looks like compression is designed as a module, it should be possible to add a different implementation without removing the original. From my quick research it looks like DotNetZip beats SharpZipLib, and ZipArchive can be even faster… Moreover, DotNetZip can do ParallelDeflateOutputStream…

I am asking because I put together a small table comparing the resulting backup/compression sizes, and even though you mention CY uses “fast” compression, it is on the same level as the TI default and uses much less CPU:

| Test 1 | Size (bytes) |
|---|---|
| InfoZip default | 998,508,177 |
| Default compression | 1,005,213,845 |
| Duplicacy | 1,005,435,292 |
| `--zip-compression-level=1` | 1,054,828,415 |
| `--zip-compression-level=0` | 2,008,793,923 |
| Source | 2,026,607,926 |
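
To put those sizes in relative terms, this short Python sketch expresses each result as a percentage of the source size. The byte counts are copied from the table above; `ratio_percent` is just an illustrative helper.

```python
SOURCE_BYTES = 2_026_607_926

# Byte counts copied from the size table.
SIZES = {
    "InfoZip default": 998_508_177,
    "Default compression": 1_005_213_845,
    "Duplicacy": 1_005_435_292,
    "--zip-compression-level=1": 1_054_828_415,
    "--zip-compression-level=0": 2_008_793_923,
}


def ratio_percent(size: int, source: int = SOURCE_BYTES) -> float:
    """Return the backup size as a percentage of the source size."""
    return 100.0 * size / source


if __name__ == "__main__":
    for label, size in SIZES.items():
        print(f"{ratio_percent(size):6.2f}%  {label}")
```

By this measure the Duplicati default and Duplicacy both land at roughly half the source size, with level 1 a couple of percentage points behind.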