Duplicati 2 vs. Duplicacy 2

kenkendk · September 13, 2017, 9:36am

Duplicati uses indicators (metadata, filesize, timestamp) to guess if a file has been changed. If one of these indicators has changed, the file contents are scanned.

As noted by @JonMikelV you can use --check-filetime-only to make the timestamp the only factor. You can also use --disable-filetime-check to always check file contents.
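
The decision described above can be sketched roughly as follows. This is only an illustration of the idea, not Duplicati's actual code: `FileRecord`, `snapshot` and `needs_scan` are made-up names, and the two keyword arguments simply mirror the `--check-filetime-only` and `--disable-filetime-check` options mentioned in this post.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class FileRecord:
    """Indicators remembered from the previous backup (hypothetical structure)."""
    size: int
    mtime_ns: int
    mode: int  # stands in for "metadata" in this sketch


def snapshot(path):
    """Read the current indicators of a file from the filesystem."""
    st = os.stat(path)
    return FileRecord(st.st_size, st.st_mtime_ns, st.st_mode)


def needs_scan(previous, current,
               check_filetime_only=False, disable_filetime_check=False):
    """Return True if the file contents should be read and hashed again."""
    if disable_filetime_check:
        return True  # always check file contents
    if previous is None:
        return True  # file was not in the previous backup
    if check_filetime_only:
        return previous.mtime_ns != current.mtime_ns  # timestamp is the only factor
    return previous != current  # any changed indicator triggers a scan
```

With the defaults, a change in any one indicator is enough to trigger a scan; with `check_filetime_only=True`, a size change alone would be ignored in this sketch.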

agrajaghh · September 13, 2017, 5:21pm

I didn’t know Duplicacy before, and am impressed with the simple concept which looks really elegant and fail safe!
I just tested it and was impressed with the performance/speed too. But the GUI is still really simple and a lot of features are missing in it (e.g. just one backup job and one source directory is supported). So maybe we should include this in the table above?

JonMikelV · September 13, 2017, 5:29pm

It sounds like multiple are supported via the CLI?

agrajaghh · September 13, 2017, 5:34pm

yes, I was just talking about the GUI

dgcom · September 13, 2017, 5:37pm

The way things are handled in the Duplicacy CLI aligns more with Unix/Linux-based OSes - you have to choose a single folder, which will be the root of your source… You can do this at the root of the filesystem, or you can create a special folder and then link all your source folders there. This also works in Windows with symlinks or junctions. This requires much more housekeeping compared to the usual way of specifying backup sources.
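
As a rough sketch of the "special folder with links" approach: the helper below creates one folder and fills it with a symlink per source directory. `build_backup_root` and the folder names are hypothetical, and the sketch does not call Duplicacy at all; on Windows, creating symlinks may need extra privileges.

```python
import tempfile
from pathlib import Path


def build_backup_root(root, sources):
    """Create `root` and fill it with one symlink per source folder.

    Hypothetical helper: links are named after the last path component,
    so two sources with the same folder name would collide.
    """
    root = Path(root)
    root.mkdir(parents=True, exist_ok=True)
    links = []
    for src in sources:
        src = Path(src).resolve()
        link = root / src.name
        if link.is_symlink():
            link.unlink()  # refresh a stale link from an earlier run
        link.symlink_to(src, target_is_directory=True)
        links.append(link)
    return links


if __name__ == "__main__":
    # Demo in a throwaway directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        for name in ("docs", "photos"):
            (base / name).mkdir()
        for link in build_backup_root(base / "backup-root",
                                      [base / "docs", base / "photos"]):
            print(link.name, "->", link.resolve())
```

The housekeeping cost mentioned above shows up here too: every new source folder means another link to create and keep up to date.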

tophee · September 13, 2017, 5:44pm

The OP is a wiki. You can edit it yourself.

dgcom · September 14, 2017, 9:49pm

It took me a bit of time to put together some scripts to do consistent testing, but now I have initial results.
Only a local destination has been tested so far, but I plan to test with WebDav against box.com and also with Google Drive.
Depending on the results, I may test with Backblaze B2 as well.

I used mostly defaults for both products, enabled encryption and VSS.
One customization was to enable 4 upload threads for duplicacy - it supports multi-threaded upload and it would be very disadvantageous to restrict it.

I put data into Google Spreadsheet here:

JonMikelV · September 15, 2017, 1:01am

Awesome comparison!

Can you provide a “compressibility” ratio for the source data? I’m curious if we can determine if the lower Duplicati destination space requirements are due to better compression or in how destination files are used.

Of course, since CY apparently doesn’t use a local database like TI does, it would also be interesting to see how much of the TI savings is “lost” to the client side database space needs.

dgcom · September 15, 2017, 5:35am

Using the most standard compression - infozip’s zip 3.0 with all defaults, source compresses to 998,508,177 bytes, which is on par with the first snapshot size of both tools.

I don’t think that compression and DB size matter much… The whole fact of having a separate DB in TI vs using the file system as a DB in CY makes the most interesting difference.

However - so far, results are a bit inconclusive. Obviously, the source data set is relatively small (but large enough for scenario testing).
I realized that CY does not support WebDav or Box - so I can’t test against these.
Instead, I used Google Drive. So far, CY is faster in all cases except restore, but the data calls for more testing - with different providers and a larger data set…
CY did not use more than one upload/download thread for Google Drive - I bet it can improve speed with a different destination.
CY also does not require local DB, which might be beneficial in some recovery cases.

I’ll try to test Backblaze B2 and Wasabi next. If any of these can utilize multithreaded upload, I’ll test them with a larger dataset (>10Gb).

The impression so far is that for most people TI is better due to the free web-based UI, but for people who like scripting and the CLI, CY might be a better option due to the technical design…

jl_678 · September 15, 2017, 9:51am

Hi,

I have been doing something similar although with a relatively underpowered backup server. Perhaps you can incorporate my results in your spreadsheet? I can send the details later today.

Oh and all my tests are based on B2 as the target. My results show consistently faster performance for Cy. Ci shows about a 10% reduction in disk space vs cy.

JonMikelV · September 15, 2017, 12:13pm

I don’t recall, but if those aren’t mentioned in the OP wiki, would you mind adding them?

kenkendk · September 15, 2017, 12:16pm

I noticed that from @dgcom’s results. I think this is caused by the simpler approach in Cy where it just uses the filesystem as a lookup, compared to Duplicati doing a database lookup. I do have an idea for making Ci faster by using a sorted resultset, such that I can avoid making a query per file/block and just stream the expected results.
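
The sorted-resultset idea can be illustrated with a small merge over two sorted streams. This is only a sketch of the general technique, not Duplicati's implementation: `changed_paths` and the `(path, fingerprint)` tuples are assumptions made for the example, and `known` stands in for a single query result that is already ordered by path.

```python
def changed_paths(scanned, known):
    """Yield paths that are new or whose fingerprint differs.

    Both inputs are iterables of (path, fingerprint) sorted by path.
    `known` is consumed once, front to back, so there is no lookup
    per file: the two streams are simply walked in step.
    """
    known_iter = iter(known)
    current = next(known_iter, None)
    for path, fingerprint in scanned:
        # Skip stored entries that sort before the file being examined.
        while current is not None and current[0] < path:
            current = next(known_iter, None)
        if current is None or current[0] != path or current[1] != fingerprint:
            yield path


scanned = [("a.txt", 1), ("b.txt", 2), ("c.txt", 3)]
known = [("a.txt", 1), ("b.txt", 9), ("d.txt", 4)]
print(list(changed_paths(scanned, known)))  # ['b.txt', 'c.txt']
```

The trade-off is that both sides must be produced in the same sort order, which is the precondition the sorted resultset would have to guarantee.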

dgcom · September 15, 2017, 4:33pm

@jl_678 - if you send me something I can put on a separate tab in that spreadsheet - I will do that.

@kenkendk - I wouldn’t call my results very conclusive. Yes, CY is faster in most operations, but TI was able to restore faster, even if you add DB recovery.

But this is on a relatively small data set. I expect that CY will slow down once the number of chunks grows, and it may get more expensive as well with providers who charge for requests. On the other hand, providers which support concurrent upload/download may help CY maintain an edge…

I’ll continue running tests and adding results as my time permits - I now wonder how a larger backup will be affected (>10Gb).

JonMikelV · September 15, 2017, 5:06pm

Not that I expect it matters a whole lot, but the “complexity” of the content being backed up could have an impact on this as well.

For example, if you’re backing up 10G of 1k files all with the exact same content I believe CY will be MUCH faster because it’ll be working with a single hash file while TI (probably with a single archive file) will still have a big DB of file name / hash lookups to process through.

Not very realistic, I know - but the potential is there.
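
A toy version of that thought experiment, assuming a much simpler model than either tool really uses: each file is treated as one chunk and identified by its SHA-256 hash. `store_files` is a made-up name for the sketch.

```python
import hashlib


def store_files(files):
    """Deduplicate whole files by content hash.

    Returns (chunks, index): `chunks` maps hash -> data and holds each
    distinct content once; `index` maps file name -> hash and still
    has one entry per file.
    """
    chunks = {}
    index = {}
    for name, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        chunks.setdefault(digest, data)  # stored only the first time it is seen
        index[name] = digest
    return chunks, index


# 1,000 files of 1 KiB each, all with identical content.
files = {f"file{i:04d}.bin": b"x" * 1024 for i in range(1000)}
chunks, index = store_files(files)
print(len(chunks), "stored chunk(s),", len(index), "index entries")
```

Only one chunk is stored, but the name-to-hash index still grows with the number of files, which is the part that has to be processed regardless of how little data is uploaded.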

dgcom · September 15, 2017, 6:01pm

Just added a comparison of backup destinations support between the two to the first post here.
Obviously, TI has much better coverage. I did not include experimental SIA storage yet.
And I repeat - adding multithreaded upload/download will have huge effect on performance.

@JonMikelV - I agree that complexity matters, but we also need to test on realistic data sets.
My small backup source was mostly a set of emails - something you’d expect when backing up your profile.
I have another real-world example with 10Gb of diverse files - documents, archives and even some already compressed stuff - I’ll try to use that for another round of testing (once I am done with other providers).

JonMikelV · September 15, 2017, 6:04pm

Good point - theoretical performance doesn’t matter nearly as much as real-world.

And thanks for adding the destination summary!

mbrijun · September 15, 2017, 6:43pm

Not sure how well it scales for very large backups. If you end up with a single directory containing hundreds of thousands of blocks, it may have a negative impact on the API’s ability to deal with the content of that directory. For example, the list operation may take too long and result in a timeout.

One thing that should be added to the comparison table is the resiliency of the backup. I believe @kenkendk has made a special effort to add resiliency into CI. CY on the other hand, seems to be lacking in that area.

JonMikelV · September 15, 2017, 6:59pm

Good point - I assume that having the blocks in compressed files adds some extra assurance that:

Errors in the block (due to storage or transfer) would be detected (native CRC in compression?)
They might be more “repairable”???

The link you referenced makes me wonder if it’s been said anywhere whether or not Duplicati can do at least partial restores from a “broken” (bad / missing dblock archive file) backup.

agrajaghh · September 15, 2017, 11:38pm

yeah that’s a good question. During my test with 25GB using another hard drive as backup destination, CY created “some” (~6.000) nested subfolders, like:

chunks/00/0f
chunks/00/1b
chunks/0a/2e
chunks/0b/0a
etc.

So there could be a maximum of 65.536 folders. Getting a “list” for all of them can take some time as well I guess…
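
The upper bound follows from two directory levels, each named by two hexadecimal characters: 16 × 16 = 256 names per level, and 256 × 256 = 65,536 leaf folders. A small sketch of how a chunk ID could map to such a path is below. Only the `chunks/xx/yy` folder shape comes from the listing above; the `chunk_path` helper, the use of SHA-256, and using the rest of the hash as the file name are assumptions for illustration.

```python
import hashlib


def chunk_path(chunk_id, levels=2, width=2):
    """Map a hex chunk ID to a nested path such as chunks/00/0f/<rest>."""
    parts = [chunk_id[i * width:(i + 1) * width] for i in range(levels)]
    rest = chunk_id[levels * width:]  # assumed: remainder becomes the file name
    return "/".join(["chunks", *parts, rest])


# Two levels, each with 16**2 possible two-character hex names.
MAX_FOLDERS = (16 ** 2) ** 2

example_id = hashlib.sha256(b"example chunk").hexdigest()
print(chunk_path(example_id))
print(MAX_FOLDERS)  # 65536
```

With roughly 6,000 folders after 25GB, the test above is still far from that ceiling.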

dgcom · September 16, 2017, 1:28am

But the design of CY is that it probably never needs to list all these files…
Remember - it uses the file system as a DB, and running select * in a DB is never a good idea.
It keeps a list of files with corresponding chunks and uses that to retrieve specific files by name directly.
The only drawback here is that it may need to download more small chunks compared to TI, which may take longer and become more expensive. However, restoring one small file will definitely be faster.
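
A minimal sketch of that lookup-by-name idea, under stated assumptions: the `manifest` dictionary, the chunk names and `restore_file` are invented for the example and say nothing about Duplicacy's real snapshot format. The point is only that a restore asks for known names and never lists the store.

```python
def restore_file(name, manifest, fetch):
    """Rebuild one file by fetching exactly the chunks the manifest names."""
    return b"".join(fetch(chunk_id) for chunk_id in manifest[name])


# Stand-in for remote storage: chunk name -> bytes.
storage = {"c1": b"he", "c2": b"llo", "c3": b"unrelated data"}
# Stand-in for the list of files with their corresponding chunks.
manifest = {"greeting.txt": ["c1", "c2"]}

requests = []


def fetch(chunk_id):
    """Direct get-by-name; each call would be one request to the provider."""
    requests.append(chunk_id)
    return storage[chunk_id]


print(restore_file("greeting.txt", manifest, fetch))  # b'hello'
print(requests)  # ['c1', 'c2']
```

Each entry in `requests` corresponds to one download, which is where the cost of many small chunks would show up with providers that charge per request.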