Duplicati 2 vs. Duplicacy 2

dgcom · September 22, 2017, 5:38pm

I finally got the results together for the second test - larger set of big files.

The source included 5 files with a total size of a bit over 15 GB. Some files were compressible, so default zip compression would yield an archive of approx. 10 GB.

So far I tested backup timing only. I plan to check restore speed as well (when I have time).
Results are not bad, but still a bit disappointing. Full details also available in the Google spreadsheet I shared earlier, but below you can find all the numbers as well.

Backup time			Destination			Parameters
Run 1	Run 2	Run 3	Size (bytes)	Files	Folders
*Duplicati 2.0.2.6*
1:12:40	1:17:38	1:11:14	10,968,733,985	421	1	Default compression
0:44:34	0:41:00	0:41:38	11,153,579,823	427	1	--zip-compression-level=1
0:56:30	0:51:12	0:49:18	15,801,175,593	605	1	--zip-compression-level=0
0:31:01	0:30:44	0:29:58	15,800,984,736	605	1	--zip-compression-level=0 --no-encryption="true"
*Duplicacy 2.0.9*
0:27:27	0:27:23	0:27:24	10,931,372,608	2732	2931	-threads 1
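
To make the timings easier to compare, here is a minimal sketch that averages the three runs per configuration and converts them to an effective throughput. It assumes the 15,812,218,511-byte source total listed in the compressibility figures further down; the labels and function names are only illustrative.

```python
# Average the three runs per configuration and derive an effective
# throughput against the source size reported in this post.
SOURCE_BYTES = 15_812_218_511

RUNS = {
    "Duplicati, default compression": ["1:12:40", "1:17:38", "1:11:14"],
    "Duplicati, --zip-compression-level=1": ["0:44:34", "0:41:00", "0:41:38"],
    "Duplicati, --zip-compression-level=0": ["0:56:30", "0:51:12", "0:49:18"],
    "Duplicati, level 0 + --no-encryption": ["0:31:01", "0:30:44", "0:29:58"],
    "Duplicacy, -threads 1": ["0:27:27", "0:27:23", "0:27:24"],
}


def to_seconds(hms: str) -> int:
    """Convert an h:mm:ss string into seconds."""
    hours, minutes, seconds = (int(part) for part in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def mean_seconds(times) -> float:
    """Mean duration of the given runs, in seconds."""
    return sum(to_seconds(t) for t in times) / len(times)


def throughput_mb_s(times) -> float:
    """Source bytes processed per second, in MB/s (1 MB = 10^6 bytes)."""
    return SOURCE_BYTES / mean_seconds(times) / 1_000_000


for label, times in RUNS.items():
    print(f"{label:40s} {mean_seconds(times) / 60:6.1f} min "
          f"{throughput_mb_s(times):5.1f} MB/s")
```

Throughput here is measured against the source size, not the uploaded size, so it reflects end-to-end backup speed rather than raw transfer speed.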

Details on the source:

Size	File name
5,653,628,928	en_windows_server_2016_x64_dvd_9327751.iso
1,572,864,000	DVD5.part1.rar
1,572,864,000	DVD9.part1.rar
5,082,054,656	disk1.vmdk
1,930,806,927	dotnetapp.exe.1788.dmp

Compressibility

Size	Compression
10,788,967,173	InfoZip default
10,968,733,985	Default compression
10,931,372,608	Duplicacy
11,153,579,823	--zip-compression-level=1
15,801,175,593	--zip-compression-level=0
15,812,218,511	Source
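
The sizes above can be turned into a "percent of source" figure with a few lines of arithmetic. This is only a sketch over the numbers in the table; the labels are shortened by me.

```python
# Express each stored size from the table as a percentage of the source size.
SOURCE_BYTES = 15_812_218_511

STORED_BYTES = {
    "InfoZip default": 10_788_967_173,
    "Duplicati default compression": 10_968_733_985,
    "Duplicacy": 10_931_372_608,
    "Duplicati --zip-compression-level=1": 11_153_579_823,
    "Duplicati --zip-compression-level=0": 15_801_175_593,
}


def percent_of_source(stored: int) -> float:
    """Stored size as a percentage of the uncompressed source."""
    return 100.0 * stored / SOURCE_BYTES


# Print smallest result first.
for label, stored in sorted(STORED_BYTES.items(), key=lambda kv: kv[1]):
    print(f"{label:38s} {percent_of_source(stored):6.2f}% of source")
```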

The results are not too bad, but still a bit disappointing: CY matches or beats the compression/deduplication of the TI defaults while being noticeably faster.
TI can get closer to those times, but only at the expense of space efficiency.
I should also note that CY can perform even better with multithreaded upload.

kenkendk · September 23, 2017, 9:15pm

Yes, compression is a module, so it is fairly simple to just add in another compression library.

But, unlike for CY, TI needs to store multiple files, so it needs a container/archive format in addition to a compression algorithm.

The zip archive format is quite nice, in that you can compress the individual files with different algorithms. Not sure SharpCompress supports injecting a custom stream at this point, but I can probably mingle in something. If we do this, it is trivial to compress blocks in parallel and only sync them for writing to the archive.
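
As a rough illustration of "compress in parallel, sync only for writing", here is a sketch in Python rather than C#. It is not how Duplicati or SharpCompress works: the blocks are pre-compressed with zlib in worker threads and then stored as-is (`ZIP_STORED`) by a single writer, which sidesteps the custom-stream question entirely. The function names and block layout are assumptions made for the example.

```python
import io
import zipfile
import zlib
from concurrent.futures import ThreadPoolExecutor


def compress_block(item):
    """Compress one (name, bytes) pair; runs in a worker thread."""
    name, data = item
    return name, zlib.compress(data, 6)


def write_archive(blocks, workers=4):
    """Compress blocks concurrently, then append them to one zip from a single thread."""
    buffer = io.BytesIO()
    with ThreadPoolExecutor(max_workers=workers) as pool, \
            zipfile.ZipFile(buffer, "w", zipfile.ZIP_STORED) as archive:
        # pool.map yields results in input order, so this loop is the only
        # place where the workers are synchronized with the archive writer.
        for name, payload in pool.map(compress_block, blocks.items()):
            archive.writestr(name, payload)
    return buffer.getvalue()
```

Note that entries written this way hold zlib streams, so a normal zip tool would list them but not give back the original data, which is the same trade-off mentioned for LZ4 below.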

CY is faster because it uses LZ4, which is not a container but a compression algorithm. As I wrote earlier, writing LZ4 streams into a zip archive is possible, but it will not be possible to open such an archive with normal zip tools.

There is some limitation to what TI can do with this, as it uploads (larger) volumes. But it is possible to do chunked uploads (to B2, S3 and others), and here chunks can be uploaded in parallel.

I have looked at the design, and I think I can adapt the concurrent_processing branch to support parallel uploads of volumes. It is tricky to get right, because a random upload can fail, and the algorithm needs to work correctly no matter which uploads fail. But with the new design, it is simpler to track the state provided by each volume, and thus also to roll back such changes.

I took a stab at this, and wrote a new library for it:

The library picks the best hashing implementation on the platform (CNG on Windows, AppleCC on MacOS, OpenSSL on Linux) and provides that instead of the default managed version. Performance measurements show 10x improvement on MacOS and 6x improvement on Linux.

Unfortunately, there is something that triggers an error when I use this in Duplicati:

I can reproduce the error on my machine, but I have not yet figured out what causes it.

amaan · November 10, 2017, 11:30pm

You have given good information about the backup source support comparison…
Thanks a lot…

Regards,
AMAAN

tophee · November 11, 2017, 8:31am

Hint: If you refer to some other post on this forum, it would be great if you could provide a link to that post.

The link will automatically be visible from both sides (i.e. there will also be a link back from the post you’re linking to). Those links will make it much easier for people to navigate the forum and find relevant information.

tophee · December 28, 2017, 12:13pm

So far in this discussion, the focus has been on speed and it looks like we can expect duplicati to eventually catch up with duplicacy in single-threaded uploads and perhaps even in multi-threaded uploads. Good.

Another conclusion I draw from the above discussion is that, compared to duplicacy, duplicati saves a lot of remote storage space (which usually costs money). Also good.

But there is another killer feature on the duplicacy side: cross-source deduplication. Duplicacy can use the same backup archive for backup jobs on different machines. Depending on the scenario, this may well make up for duplicati’s more efficient storage use.

I seem to remember someone (@JonMikelV?) saying that this could become possible in duplicati too. Is that correct and if so, is it on the road map?

TowerBR · December 28, 2017, 1:08pm

I think the use of databases is the weak point of Duplicati. It’s a very “sophisticated” approach, and though elegant, it creates several points of failure.

When everything is ok, there is the problem of the occupied space (several hundred megabytes).

When a problem occurs (e.g. a corrupted database), it is a nightmare. It takes several hours (sometimes days) to rebuild the database from the backup. Exactly the time you don’t have when you need a restore.

OK, you could just back up the databases after each backup job with another tool, but it’s strange to back up the backup databases (?!).

JonMikelV · December 29, 2017, 4:04am

If it was me, that was probably in my early days, so I may not have correctly understood how stuff worked.

While in theory cross source deduplication could happen, it would require a huge rewrite so multiple backups could share the same destination.

For that to happen a destination log / database of some sort to handle block tracking across the multiple sources would need to be created.

For example, if sources A and B both have the same file (or at least a matching block) and it’s deleted from source A, something has to stop the source A backup from deleting the still-in-use-at-source-B block.

Similarly, if you set up two jobs sharing the same destination but with different retention schedules, something needs to keep stuff from being deleted until it’s flagged as deletable in all backup jobs.
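
To make the A/B example concrete, here is a tiny sketch of the kind of shared block tracking that would be needed. It is purely hypothetical (the class and method names are mine, and nothing like it exists in Duplicati today): a block only becomes deletable once no source claims it any more.

```python
from collections import defaultdict


class SharedBlockIndex:
    """Hypothetical destination-side index: block hash -> sources that still need it."""

    def __init__(self):
        self._owners = defaultdict(set)

    def add(self, source, block_hash):
        """Record that a source's backup references this block."""
        self._owners[block_hash].add(source)

    def release(self, source, block_hash):
        """Drop one source's claim; return True only if the block is now deletable."""
        owners = self._owners.get(block_hash)
        if owners is None:
            return False  # unknown block, nothing to delete
        owners.discard(source)
        if owners:
            return False  # another source still uses the block
        del self._owners[block_hash]
        return True


index = SharedBlockIndex()
index.add("A", "hash-1")
index.add("B", "hash-1")
print(index.release("A", "hash-1"))  # B still holds the block
print(index.release("B", "hash-1"))  # last claim released
```

The hard part in practice is not this bookkeeping but keeping such an index consistent when several machines update the same destination concurrently.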

tophee · December 29, 2017, 7:05pm

Okay, I was indeed wondering how it would be possible, but I guess I was hoping too much to see clearly…

There is actually a pretty good description of the challenges in cross-source deduplication and how duplicacy solves it here. Quite ingenious, actually.

This is starting to dawn on me too. I have had so many issues with duplicati and basically all of them seem to be related to the database (though I can’t really tell whether the “missing 1 file in backend, please run repair” error is a db problem, but since it tells me to repair the database, I guess it is).

Well, I think that is a viable solution (or it could be one, if duplicati took care of that by itself). But I agree, it’s definitely not a plus for duplicati.

So to boil this comparison down even further, I’d say that, at the moment, there are really only three (or maybe four) things that speak for duplicati compared to duplicacy:

It’s entirely open source and free
It uses less storage space in the backend
It supports a much greater number of backends (though if duplicacy supports yours, that is, of course, irrelevant)

Depending on taste, you could add the UI as a fourth advantage. Personally, I really like the Web UI and the way it makes command line options available without forcing you to use the CLI.

On the minus side, duplicati is (currently)

slower,
still in beta (and rightly so),
unable to do cross-source deduplication (which, in my use case, is pretty nifty to have as it allows you to move large files or folders all over the place locally, without having to worry about that making your backend storage explode).

Edit: I just discovered that duplicacy allows you to change the password for your archive. I’m inclined to add this as another plus for duplicacy.

Edit2: I just discovered the very (!) basic restore functionality of duplicacy: it is very cumbersome to search for and restore a file whose exact location you don’t know. ~~You can only do it via CLI~~ (as described here) and even then, you have to run a separate command to then restore the identified file. Edit: Sorry, it’s not as bad. Search also works in the GUI version, but you can’t search across revisions. So the difference to duplicati is not as big, but duplicati still seems better here.

TowerBR · December 29, 2017, 10:44pm

Very well summarized.

Just a small note:

The CLI version of Duplicacy is completely free and opensource as well.

I personally prefer to use software in its CLI version. Things get more “under control” and “in the hand.” And after the initial setup, they are of the “set up and forget” type.

Rclone is another good example of CLI software (and I use it for some types of files, including for the full backup of the Duplicati databases).

tophee · December 29, 2017, 10:46pm

Yes, of course! Hence:

tophee · January 2, 2018, 5:11pm

Looks like this is not necessarily the case:

Here are informal results. By informal I mean that I would kill the ongoing backup, then resume at the same spot with a different thread count, wait until it stabilized on a long stretch where there were minimal duplicates, and then recorded the MB/s. It is not rigorous, but at least an indicator.

-threads 1 ~12MB/s - Not great, but at least it will finish the initial backup in about a day or two.
-threads 2 ~1.93MB/s - very slow, it was slated to take well over a week
-threads 64 ~1.1MB/s - extremely slow, earlier on before it stabilized, it was running at only 100KB/s
-threads 16 ~619KB/s - seems even slower, but I did not let this one run very long, so it may not have stabilized at full speed
-threads 32 ~2.0MB/s - This was a longer test than for 16 threads, and converged to a somewhat higher rate, though not significantly different than 2 threads
The bottom line is that only one clear winner emerged: 1 single backup thread.

In thinking about it, this may make more sense than it seems for a spinning hard drive. That’s because writing each chunk requires a head seek, and if there are multiple threads, that’s a LOT more seeks going on. In fact, the way I pinpointed what was going on is running a separate speed test on the drive while the multi-threaded processes were running. The array normally writes at 200MB/s, but with more than 4 backup threads running, it slowed down to less than 9MB/s.

With only 1 thread running, it is still writing at over 130MB/s for separate processes.

Based upon this, I am guessing that an SSD would actually benefit far more from multiple threads, because there is no seek involved.

From: Duplicacy Issue: optimization of speed (was: one thread for local HD backup)

JonMikelV · January 3, 2018, 2:06am

Nice analytics there!

Were you using a temp folder on the same drive as the source data?

I wonder how much speed difference there would be using a 2nd drive or a ramdisk as the temp location…

kenkendk · January 3, 2018, 2:45pm

I have noticed this also. Theoretically, it should be much faster to run queries against a database, and it is also more crash resistant.

In Duplicati, there are quite a few problems related to the database.

However, many of the issues will not be fixed by simply removing the database.

In some cases there are failures because the remote store reports different contents (for whatever reason). Removing the database here will just make the problem invisible until you really need the files.

Another problem with the database is consistency issues. Again, removing the database will not fix these, just make them show up later on where they are more likely to prevent correct restores. The errors that produce an inconsistent database should really be found and fixed asap.

Then there is the problem with recreating the local database. This is primarily a speed issue (and there are some recreate errors as well). This could be solved by not using a database, or at least using a much simpler database structure. But it could perhaps also be fixed by storing data in the dindex files that allow a much faster database rebuild. I am more in favor of the latter option here, as the first one is already handled by the RecoveryTool.exe.

Without implying that there are problems with CY, simply not having a local database means there are a number of checks that CY cannot do. When TI does these checks and they fail, it looks like TI is more fragile than CY.

I really like the simplicity in CY, it contains way less code that needs to be maintained, but I think TI will benefit from the local database in the long run.

The size of the database can also be reduced in the future (path compression will do a great size reduction). But note that CY also stores a large number of files on disk to make a cache of remote blocks.

Actually, it would not require a great deal, if we simply allow the same block to exist twice (well, that is actually already allowed, but not in a useful way).

However, it makes it much harder to detect problems, if we suddenly have two sources that update the remote store. In CY this is less of a problem as they can “see” all the blocks that exist, but because TI hides this inside the zip archives, it needs to download and unpack some files to get the same information.

But it is not impossible that we can use some of the same strategy as CY uses for concurrently reclaiming old data. It is not on my short-list of things to do.

TowerBR · January 3, 2018, 6:09pm

Hello Kenneth, I agree with the points you mentioned.

In summary, the weaknesses related to using databases are two:

occupied space;
time for rebuilding (when a problem occurs).

I agree that both aspects can be improved in future releases.

For other points you mentioned, I understand that they reflect a difference of philosophy between CY and TI: TI is much more “user-friendly”, and CY is more “on hand”. Example: TI checks backups automatically and randomly. In CY you have to do it by hand using the “check” command. But it works very well:

I include the above routine in my backup batches.

Additionally I created a routine to download random files and compare them with local files.

That is, things that TI does automatically I put in the backup batches.

tophee · January 6, 2018, 4:32pm

Good point. But I have had duplicati running on my machine for months now, and whenever I look at it, I see something like this:

Its complaints about missing files are just never-ending. In the beginning I did run repair, but since the errors kept coming back and since sometimes they seem to go away without using repair, I’ve stopped bothering.

In any case, I’d say it’s rather unlikely that all those missing files are missing because of some backend problem that duplicati thankfully identified while duplicacy would have missed them. In my eyes, duplicati either identifies problems that are none or it identifies problems that it created itself during upload.

BTW, I found another big minus for duplicacy, which I added to my list above: search and restore is not so great in duplicacy.

TowerBR · January 7, 2018, 12:29pm

How so?

Search to restore or the functions separately?

tophee · January 7, 2018, 4:15pm

I explained in Edit2 here, it turned out not quite as bad as I thought. But Duplicati is clearly superior here. To start with, duplicacy can only restore to the original location. You can’t tell it to restore something to your desktop or something like that. Big minus for me.

Then, you cannot search for a file across revisions using the GUI. If you want to search your entire archive for a particular file, you need to do it via the CLI and once you’ve found it, you need to restore it using an entirely separate command, i.e. you can’t just say: “yes, that one, please restore it.”

And if you don’t want/like/need the GUI, you cannot search to restore at all. You can search using the history command, and if you want, you can take note of what you found and restore it with the restore command. In fact, it may be that you can’t search for a file using the CLI on Windows at all, because the command proposed by the developer includes grep, which is a Linux command and does not exist on the Windows command line. In other words: duplicacy CLI doesn’t really offer any search itself but merely allows you to search the output of the history command using other tools.

TowerBR · January 7, 2018, 11:12pm

Well, it seems strange to discuss the Duplicacy commands in the Duplicati forum, but here we go; I think this is an interesting conversation.

(Remembering that I’m using the CLI version in Windows 10).

This is not completely correct. I’ve set up a script that runs daily with the following steps:

1. checks the integrity of the snapshot chunks (Duplicacy check command);
2. randomly selects X files from local folders to be compared to the backup;
3. creates a new temporary folder to download the selected backup files (not the local original folder);
4. downloads the selected files from the backup to the temporary folder;
5. compares the files downloaded in the temporary folder with the files in the “official” folder via MD5 checksum;
6. records a log and sends an email report with the result of the comparison;
7. erases all files downloaded for testing and the temporary folder;
8. performs new daily backups (new revision);
9. sends an email with the result of the backup (a single email with one line per job).

So by steps 3 and 4 above you can see that it is possible to restore the files to a different folder than the original one.
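
The sampling and MD5 comparison parts of that routine could look roughly like the sketch below. This is not the actual script, just an illustration in Python: the function names are made up, and the restore into the temporary folder (done with the Duplicacy CLI) is assumed to have happened between the two calls.

```python
import hashlib
import random
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """MD5 of a file, read in chunks so large files do not fill memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def pick_sample(local_root, count, seed=None):
    """Randomly pick up to `count` files below local_root."""
    files = sorted(p for p in Path(local_root).rglob("*") if p.is_file())
    return random.Random(seed).sample(files, min(count, len(files)))


def compare_sample(local_root, restored_root, sample):
    """Return the relative paths whose restored copy is missing or differs."""
    mismatches = []
    for local_path in sample:
        relative = local_path.relative_to(local_root)
        restored_path = Path(restored_root) / relative
        if not restored_path.is_file() or md5_of(local_path) != md5_of(restored_path):
            mismatches.append(relative)
    return mismatches
```

An empty list from `compare_sample` means every sampled file matched; anything else would go into the log and the email report.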

In Duplicacy you can save the backup settings to a centralized folder, they do not have to be in the original folders themselves (called repositories). So it’s easy for the script I described above to retrieve these settings.

This is a good example of what I commented some posts above, that some things that Duplicati does automatically, in Duplicacy have to be placed in the scripts.

About:

This is true (about grep), but you can easily send the file name by include pattern or by parameter when calling the command.

All this reinforces what I said above: Duplicati is more user friendly, but if you like to control things by scripts (like me), there are no major complications to using Duplicacy. But I recognize that not everyone likes scripting.

TowerBR · January 7, 2018, 11:21pm

Since we are talking about differences, there is a very useful Duplicacy command, which makes it very similar to Crashplan in terms of version maintenance:

$ duplicacy prune -keep 1:7       # Keep 1 snapshot per day for snapshots older than 7 days
$ duplicacy prune -keep 7:30      # Keep 1 snapshot every 7 days for snapshots older than 30 days
$ duplicacy prune -keep 30:180    # Keep 1 snapshot every 30 days for snapshots older than 180 days
$ duplicacy prune -keep 0:360     # Keep no snapshots older than 360 days

You can even use:

$ duplicacy prune -keep 0:360 -keep 30:180 -keep 7:30 -keep 1:7

(source: Duplicacy guide at GitHub)
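
To show what such a tiered policy does to a series of daily snapshots, here is a simplified model in Python. It follows only my reading of the comments in the commands above ("keep 1 snapshot every n days for snapshots older than m days"); it is not Duplicacy's actual implementation, and the function names are invented.

```python
from datetime import datetime, timedelta


def parse_keep(spec):
    """Split an 'n:m' string into (interval_days, min_age_days)."""
    n, m = (int(part) for part in spec.split(":"))
    return n, m


def select_snapshots(snapshot_times, keep_specs, now):
    """Return the snapshot times a simplified '-keep n:m' policy would retain."""
    # Try the rule with the largest age threshold first.
    rules = sorted((parse_keep(s) for s in keep_specs),
                   key=lambda rule: rule[1], reverse=True)
    kept = []
    last_kept = None
    for taken in sorted(snapshot_times):  # oldest first
        age_days = (now - taken).days
        interval = next((n for n, m in rules if age_days > m), None)
        if interval is None:
            keep = True   # younger than every threshold
        elif interval == 0:
            keep = False  # n == 0: drop everything past this age
        else:
            keep = last_kept is None or (taken - last_kept).days >= interval
        if keep:
            kept.append(taken)
            last_kept = taken
    return kept


if __name__ == "__main__":
    now = datetime(2018, 1, 8)
    daily = [now - timedelta(days=d) for d in range(400)]
    kept = select_snapshots(daily, ["0:360", "30:180", "7:30", "1:7"], now)
    print(len(daily), "snapshots ->", len(kept), "kept")
```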

I know that similar functionality is being developed for Duplicati (and I’m following).

This is an essential point to reduce the space used in remote storage.

tophee · January 8, 2018, 6:33am

The duplicacy developer said:

to restore to a different location you’ll need to initialize a new repository on the new location with the same repository id.

and I guess I overinterpreted that as basically meaning “you can’t restore to a new location”. I suppose that is what your script is doing (initialize a new repository)?

I still think this is a bit of an overkill if I simply want to restore a single file, even for someone who appreciates the benefits of scripting. It may work well as part of your housekeeping script, but if you just want that file?

I see scripting as an additional feature that adds flexibility to the product. When scripting becomes a philosophy of simplicity that actually demands the user to be flexible (and compose the right command to achieve simple things), then I think that scripting approach has gone wrong. Scripting should not mean: “well, then you have to script everything”.

I’m not sure I understand what you mean here.