Duplicati 2 vs. Duplicacy 2

I have put up a new version (2.0.2.8) which now uses SHA256Cng as the hashing implementation: Releases · duplicati/duplicati · GitHub


Never mind, I figured it out - I had the terminology backwards in my head :frowning_face:

original stupid question for posterity

I’m confused - is synchronous-upload disabled by default (i.e. if not using the advanced option setting at all)? Is there any reason?

Thank you, glad I can help here.

Oh, I see, glad it was discovered already. What is interesting, I see that .NET Core 2.0 has a better managed implementation as well - I wonder if this can be backported…
Performance Improvements in .NET Core

I got your latest build and tested it - results are below. It clearly shows a very noticeable performance improvement; however, there is still room to grow :slight_smile:
Implementing multithreaded processing and upload/download should help offload those tasks from the cores that are busy with hashing, compression and encryption - overall, a backup program should be I/O bound on a sufficiently fast current CPU. But then you need to take care to keep blocks in memory instead of thrashing the disk - and that is more complex, because memory use can grow significantly depending on the parameters used.
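
To make the idea concrete, here is a minimal producer/consumer sketch (assuming a simple split between CPU-bound block work and I/O-bound uploads; ReadSourceBlocks, ProcessBlock and UploadVolume are hypothetical placeholders, not Duplicati's actual code):

```csharp
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Minimal producer/consumer sketch: the CPU-bound work (hashing, compression,
// encryption) feeds a bounded queue, while a separate task drains the queue
// and performs the I/O-bound uploads. The bounded capacity keeps memory use
// capped instead of letting queued volumes grow without limit.
static async Task RunBackupPipelineAsync()
{
    var readyVolumes = new BlockingCollection<byte[]>(boundedCapacity: 4);

    var uploader = Task.Run(() =>
    {
        foreach (var volume in readyVolumes.GetConsumingEnumerable())
            UploadVolume(volume);                  // hypothetical: network I/O, overlaps with CPU work
    });

    foreach (var block in ReadSourceBlocks())      // hypothetical source enumerator
        readyVolumes.Add(ProcessBlock(block));     // hypothetical: hash + compress + encrypt (CPU-bound)

    readyVolumes.CompleteAdding();                 // signal the uploader that no more data is coming
    await uploader;
}
```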

2.0.2.6   2.0.2.8   Improvement   Parameters
0:08:55   0:08:19   6.66%         --synchronous-upload="true"
0:07:27   0:07:03   5.39%         --synchronous-upload="false"
0:06:26   0:06:02   5.98%         --synchronous-upload="false", Set TMP=D:\Temp
0:04:17   0:03:47   11.78%        --synchronous-upload="false" --zip-compression-level=1, Set TMP=D:\Temp
0:03:34   0:02:50   20.87%        --synchronous-upload="false" --zip-compression-level=0, Set TMP=D:\Temp
0:03:07   0:02:42   13.39%        --synchronous-upload="false" --zip-compression-level=0 --no-encryption="true", Set TMP=D:\Temp
The CPU was still pegged, but it seemed to be used more efficiently.

Looking forward to seeing all three major math-heavy stages (hashing, compression and encryption) optimized :slight_smile:
For now, I'll use --zip-compression-level=1 for the test on a larger source set.

The optimization essentially fixes the problem that SHA256.Create() returns the slow implementation in .Net standard profile (which Duplicati uses). The fix I made was to change calls to HashAlgorithm.Create("SHA256") into calling my own method that returns the faster implementation.
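
A minimal sketch of that kind of helper, assuming .NET Framework on Windows where SHA256Cng is available (the FastHash name is just illustrative, not the actual method in Duplicati):

```csharp
using System;
using System.Security.Cryptography;

// Illustrative replacement for HashAlgorithm.Create("SHA256"): return the
// CNG-backed implementation on Windows and fall back to the default elsewhere.
public static class FastHash
{
    public static HashAlgorithm CreateSHA256()
    {
        if (Environment.OSVersion.Platform == PlatformID.Win32NT)
            return new SHA256Cng();   // fast CNG implementation (Windows only)

        return SHA256.Create();       // default (managed) implementation
    }
}
```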

The change has an impact on Windows only, but I can see that there is supposed to be a faster OpenSSL-based version for Linux, which I might be able to load as well.

Thank you for the clarification, @kenkendk
My tests against the 15 GB source are finally running; I hope to have results tomorrow.

In the meantime - since SHA256 is improved - the other two, encryption and compression, may need a look.
Although disabling encryption when no compression is used does not show a lot of improvement anyway…
But what about zip? Have you thought about different implementations? Since it looks like compression is designed as a module, it should be possible to add a different implementation without removing the original. From my quick research it looks like DotNetZip beats SharpZipLib, and ZipArchive can be even faster… Moreover, DotNetZip can do ParallelDeflateOutputStream…

I am asking because I put together a small table comparing the resulting backup/compression sizes, and even though you mention that CY uses "fast" compression, it is on the same level as the TI default while using much less CPU:

Test 1                       Size (bytes)
InfoZip default               998,508,177
Default compression         1,005,213,845
Duplicacy                   1,005,435,292
--zip-compression-level=1   1,054,828,415
--zip-compression-level=0   2,008,793,923
Source                      2,026,607,926

I finally got the results together for the second test - a larger set of big files.

The source included 5 files with a total size a bit over 15 GB. Some files were compressible, so default zip compression would end up with an approx. 10 GB archive.

So far I tested backup timing only. I plan to check restore speed as well (when I have time).
The results are not bad, but still a bit disappointing. Full details are also available in the Google spreadsheet I shared earlier, but all the numbers are below as well.

Backup time (Run 1 / Run 2 / Run 3)   Destination size   Files   Folders   Parameters

Duplicati 2.0.2.6
1:12:40   1:17:38   1:11:14           10,968,733,985       421         1    Default compression
0:44:34   0:41:00   0:41:38           11,153,579,823       427         1    --zip-compression-level=1
0:56:30   0:51:12   0:49:18           15,801,175,593       605         1    --zip-compression-level=0
0:31:01   0:30:44   0:29:58           15,800,984,736       605         1    --zip-compression-level=0 --no-encryption="true"

Duplicacy 2.0.9
0:27:27   0:27:23   0:27:24           10,931,372,608      2732      2931    -threads 1

Details on the source:

Size File name
5,653,628,928 en_windows_server_2016_x64_dvd_9327751.iso
1,572,864,000 DVD5.part1.rar
1,572,864,000 DVD9.part1.rar
5,082,054,656 disk1.vmdk
1,930,806,927 dotnetapp.exe.1788.dmp

Compressibility

Size             Compression
10,788,967,173   InfoZip default
10,968,733,985   Default compression
10,931,372,608   Duplicacy
11,153,579,823   --zip-compression-level=1
15,801,175,593   --zip-compression-level=0
15,812,218,511   Source

The results are not too bad, but still a bit disappointing - CY is able to maintain the same or better compression/deduplication as the TI defaults while showing a noticeable performance advantage.
Although TI can get closer to those times at the expense of space efficiency, I also have to note that CY can perform even better with multithreaded upload.


Yes, compression is a module, so it is fairly simple to just add in another compression library.

But, unlike for CY, TI needs to store multiple files, so it needs a container/archive format in addition to a compression algorithm.

The zip archive format is quite nice, in that you can compress the individual files with different algorithms. I am not sure SharpCompress supports injecting a custom stream at this point, but I can probably work something in. If we do this, it is trivial to compress blocks in parallel and only synchronize when writing them to the archive.
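
Roughly, the idea could look like this sketch, which uses the built-in ZipArchive purely for illustration (Duplicati actually goes through SharpCompress, and CompressBlock plus the block list are hypothetical helpers):

```csharp
using System.IO;
using System.IO.Compression;
using System.Threading.Tasks;

// Compress blocks in parallel, but serialize only the writes into the archive.
// Note: because each payload is pre-compressed and stored with
// CompressionLevel.NoCompression, a standard zip tool would hand back the raw
// deflated bytes rather than the original block (the same caveat as storing
// LZ4 streams in a zip, mentioned below).
static void WriteVolume(string path, (string Name, byte[] Data)[] blocks)
{
    using (var zip = ZipFile.Open(path, ZipArchiveMode.Create))
    {
        var gate = new object();

        Parallel.ForEach(blocks, block =>
        {
            byte[] compressed = CompressBlock(block.Data);   // CPU-heavy, runs concurrently

            lock (gate)                                      // only the archive write is synchronized
            {
                var entry = zip.CreateEntry(block.Name, CompressionLevel.NoCompression);
                using (var stream = entry.Open())
                    stream.Write(compressed, 0, compressed.Length);
            }
        });
    }
}

// Hypothetical per-block compressor used by the sketch above.
static byte[] CompressBlock(byte[] data)
{
    using (var buffer = new MemoryStream())
    {
        using (var deflate = new DeflateStream(buffer, CompressionLevel.Fastest))
            deflate.Write(data, 0, data.Length);
        return buffer.ToArray();
    }
}
```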

The reason CY is faster is that it uses LZ4, which is a compression algorithm rather than a container format. As I wrote earlier, writing LZ4 streams into a zip archive is possible, but it will not be possible to open such an archive with normal zip tools.

There are some limits to what TI can do with this, as it uploads (larger) volumes. But it is possible to do chunked uploads (to B2, S3 and others), and those chunks can be uploaded in parallel.

I have looked at the design, and I think I can adapt the concurrent_processing branch to support parallel uploads of volumes. It is tricky to get right, because a random upload can fail, and the algorithm needs to work correctly no matter which uploads fail. But with the new design, it is simpler to track the state provided by each volume, and thus also to roll back such changes.
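
As an illustration of the failure handling involved, here is a very simplified sketch of uploading volumes in parallel while tracking each volume's state; the Volume type and the uploadAsync delegate are hypothetical placeholders, not the concurrent_processing branch itself:

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Threading.Tasks;

enum VolumeState { Pending, Uploading, Uploaded, Failed }

// Hypothetical volume descriptor used only for this sketch.
class Volume
{
    public string Name;
    public VolumeState State = VolumeState.Pending;
}

static async Task UploadAllAsync(List<Volume> volumes, Func<Volume, Task> uploadAsync)
{
    var tasks = volumes.Select(async v =>
    {
        v.State = VolumeState.Uploading;
        try
        {
            await uploadAsync(v);              // any individual upload may fail
            v.State = VolumeState.Uploaded;
        }
        catch (Exception)
        {
            v.State = VolumeState.Failed;
        }
    }).ToList();

    await Task.WhenAll(tasks);

    // Whatever did not complete must be retried or rolled back before the
    // backup run can be recorded as finished.
    foreach (var v in volumes.Where(v => v.State == VolumeState.Failed))
        Console.WriteLine($"Volume {v.Name} needs retry or rollback");
}
```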

I took a stab at this, and wrote a new library for it:

The library picks the best hashing implementation on the platform (CNG on Windows, AppleCC on MacOS, OpenSSL on Linux) and provides that instead of the default managed version. Performance measurements show 10x improvement on MacOS and 6x improvement on Linux.

Unfortunately, there is something that triggers an error when I use this in Duplicati:

I can reproduce the error on my machine, but I have not yet figured out what causes it.


You have given good information about the backup source support comparison…
thanks a lot…

Regards,
AMAAN


Hint: If you refer to some other post on this forum, it would be great if you could provide a link to that post.

The link will automatically be visible from both sides (i.e. there will also be a link back from the post you’re linking to). Those links will make it much easier for people to navigate the forum and find relevant information.

So far in this discussion, the focus has been on speed and it looks like we can expect duplicati to eventually catch up with duplicacy in single-threaded uploads and perhaps even in multi-threaded uploads. Good.

Another conclusion I draw from the above discussion is that, compared to duplicacy, duplicati saves a lot of remote storage space (which usually costs money). Also good.

But there is another killer feature on the duplicacy side: cross-source deduplication. Duplicacy can use the same backup archive for backup jobs on different machines. Depending on the scenario, this may well make up for duplicati’s more efficient storage use.

I seem to remember someone (@JonMikelV?) saying that this could become possible in duplicati too. Is that correct and if so, is it on the road map?

I think database use is the weak point of Duplicati. It's a very "sophisticated" approach, and though elegant, it creates several points of failure.

When everything is OK, there is the problem of the occupied space (several hundred megabytes).

When a problem occurs (a corrupted database, e.g.), it is a nightmare. Several hours (sometimes days) to rebuild the database from the backup. Exactly the time you don't have when you need a restore.

OK, just back up the databases after each backup job with another tool, but it's strange to back up the backup's databases (?!). :roll_eyes:


If it was me, that was probably in my early days, so I may not have correctly understood how stuff worked. :blush:

While cross-source deduplication could in theory happen, it would require a huge rewrite so that multiple backups could share the same destination.

For that to happen, a destination-side log / database of some sort would need to be created to handle block tracking across the multiple sources.

For example, if sources A and B both have the same file (or at least a matching block) and it's deleted from source A, something has to stop source A's backup from deleting the still-in-use-at-source-B block.

Similarly, if you set up two jobs sharing the same destination but with different retention schedules, something needs to keep data from being deleted until it's flagged as deletable in all backup jobs.
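
As a toy illustration of the kind of destination-side block tracking that would be needed (purely hypothetical - nothing like this exists in Duplicati today):

```csharp
using System.Collections.Generic;

// Toy model: map each block hash to the set of sources still referencing it.
// A block may only be removed from the destination once no source uses it.
class BlockTracker
{
    private readonly Dictionary<string, HashSet<string>> _refs =
        new Dictionary<string, HashSet<string>>();

    public void AddReference(string blockHash, string sourceId)
    {
        if (!_refs.TryGetValue(blockHash, out var sources))
            _refs[blockHash] = sources = new HashSet<string>();
        sources.Add(sourceId);
    }

    // Returns true only if the block is no longer referenced by any source
    // and can therefore be deleted from the destination.
    public bool RemoveReference(string blockHash, string sourceId)
    {
        if (_refs.TryGetValue(blockHash, out var sources))
        {
            sources.Remove(sourceId);
            if (sources.Count == 0)
            {
                _refs.Remove(blockHash);
                return true;
            }
        }
        return false;
    }
}
```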


Okay, I was indeed wondering how it would be possible, but I guess I was hoping too much to see clearly…

There is actually a pretty good description of the challenges in cross-source deduplication and how duplicacy solves it here. Quite ingenious, actually.

This is starting to dawn on me too. I have had so many issues with duplicati, and basically all of them seem to be related to the database (though I can't really tell whether the "missing 1 file in backend, please run repair" error is a db problem, but since it tells me to repair the database, I guess it is).

Well, I think that is a viable solution (or it could be one, if duplicati took care of that by itself). But I agree, it’s definitely not a plus for duplicati.

So to boil this comparison down even further, I’d say that, at the moment, there are really only three (or maybe four) things that speak for duplicati compared to duplicacy:

  1. It’s entirely open source and free
  2. It uses less storage space in the backend
  3. It supports a much greater number of backends (though if duplicacy supports yours, that is, of course, irrelevant)

Depending on taste, you could add the UI as a fourth advantage. Personally, I really like the Web UI and the way it makes command line options available without forcing you to use the CLI.

On the minus side, duplicati is (currently)

  1. slower,
  2. still in beta (and rightly so),
  3. unable to do cross-source deduplication (which, in my use case, is pretty nifty to have as it allows you to move large files or folders all over the place locally, without having to worry about that making your backend storage explode).

Edit: I just discovered that duplicacy allows you to change the password for your archive. I’m inclined to add this as another plus for duplicacy.

Edit2: I just discovered the very (!) basic restore functionality of duplicacy: it is very cumbersome to search for and restore a file whose exact location you don't know. You can only do it via the CLI (as described here), and even then you have to run a separate command to restore the identified file. Edit: Sorry, it's not as bad. Search also works in the GUI version, but you can't search across revisions. So the difference to duplicati is not as big, but duplicati still seems better here.


Very well summarized.

Just a small note:

The CLI version of Duplicacy is completely free and open source as well.

I personally prefer to use software in its CLI version. Things feel more "under control" and "in hand", and after the initial setup they are "set up and forget".

Rclone is another good example of CLI software (and I use it for some types of files, including for the full backup of the Duplicati databases :wink: :roll_eyes:)

Yes, of course! Hence:

Looks like this is not necessarily the case:

Here are informal results. By informal I mean that I would kill the ongoing backup, then resume at the same spot with a different thread count, wait until it stabilized on a long stretch where there were minimal duplicates, and then record the MB/s. It is not rigorous, but it is at least an indicator.

-threads 1   ~12 MB/s   - not great, but at least it will finish the initial backup in about a day or two
-threads 2   ~1.93 MB/s - very slow, it was slated to take well over a week
-threads 64  ~1.1 MB/s  - extremely slow; earlier on, before it stabilized, it was running at only 100 KB/s
-threads 16  ~619 KB/s  - seems even slower, but I did not let this one run very long, so it may not have stabilized at full speed
-threads 32  ~2.0 MB/s  - this was a longer test than for 16 threads and converged to a somewhat higher rate, though not significantly different from 2 threads

The bottom line is that only one clear winner emerged: a single backup thread.

In thinking about it, this may make more sense than it seems for a spinning hard drive. That's because writing each chunk requires a head seek, and if there are multiple threads, there are a LOT more seeks going on. In fact, the way I pinpointed what was going on was by running a separate speed test on the drive while the multi-threaded processes were running. The array normally writes at 200 MB/s, but with more than 4 backup threads running, it slowed down to less than 9 MB/s.

With only 1 thread running, it is still writing at over 130MB/s for separate processes.

Based upon this, I am guessing that an SSD would actually benefit far more from multiple threads, because there is no seek involved.

From: Duplicacy Issue: optimization of speed (was: one thread for local HD backup)


Nice analytics there!

Were you using a temp folder on the same drive as the source data?

I wonder how much speed difference there would be using a 2nd drive or a ramdisk as the temp location…

I have noticed this also. Theoretically, it should be much faster to run queries against a database, and it is also more crash resistant.

In Duplicati, there are quite a few problems related to the database.

However, many of the issues will not be fixed by simply removing the database.

In some cases there are failures because the remote store reports different contents (for whatever reason). Removing the database here will just make the problem invisible until you really need the files.

Another problem with the database is consistency issues. Again, removing the database will not fix these, just make them show up later on where they are more likely to prevent correct restores. The errors that produce an inconsistent database should really be found and fixed asap.

Then there is the problem of recreating the local database. This is primarily a speed issue (and there are some recreate errors as well). It could be solved by not using a database, or at least by using a much simpler database structure. But it could perhaps also be fixed by storing data in the dindex files that allows a much faster database rebuild. I am more in favor of the latter option here, as the former is already handled by RecoveryTool.exe.

Without implying that there are problems with CY, simply not having a local database means there are a number of checks that CY cannot do. When TI does these checks and they fail, it looks like TI is more fragile than CY.

I really like the simplicity in CY, it contains way less code that needs to be maintained, but I think TI will benefit from the local database in the long run.

The size of the database can also be reduced in the future (path compression will do a great size reduction). But note that CY also stores a large number of files on disk to make a cache of remote blocks.

Actually, it would not require a great deal, if we simply allow the same block to exist twice (well, that is actually already allowed, but not in a useful way).

However, it makes it much harder to detect problems if we suddenly have two sources updating the remote store. In CY this is less of a problem, as the clients can "see" all the blocks that exist, but because TI hides this information inside the zip archives, it needs to download and unpack some files to get the same information.

But, it is not impossible that we can use some of the same strategy as CY uses for concurrently reclaiming old data. It is not on my short-list of things to do :slight_smile:


Hello Kenneth, I agree with the points you mentioned.

In summary, there are two weaknesses related to using databases:

  • occupied space;
  • time for rebuilding (when a problem occurs).

I agree that both aspects can be improved in future releases.

As for the other points you mentioned, I understand that they reflect a difference of philosophy between CY and TI: TI is much more "user-friendly", and CY is more "hands-on". Example: TI checks backups automatically and randomly; in CY you have to do it by hand using the "check" command. But it works very well:

I include the above routine in my backup batches.

Additionally I created a routine to download random files and compare them with local files.

That is, things that TI does automatically I put in the backup batches.

Good point. But I have had duplicati running on my machine for months now, and whenever I look at it, I see something like this:

[screenshot: a long list of Duplicati warnings about missing files]

Its complaints about missing files are just never ending. In the beginning I did run repair, but since the errors kept coming back, and since sometimes they seem to go away without using repair, I've stopped bothering.

In any case, I'd say it's rather unlikely that all those missing files are missing because of some backend problem that duplicati thankfully identified while duplicacy would have missed it. In my eyes, duplicati either identifies problems that aren't really there, or it identifies problems that it created itself during upload.

BTW, I found another big minus for duplicacy, which I added to my list above: search and restore is not so great in duplicacy.