Duplicati 2 vs. Duplicacy 2

TowerBR · January 8, 2018, 12:15pm

No, it is not necessary to initialize from scratch.

Duplicacy has two ways of storing the settings for each repository (the local folders) and for the storages to which each repository sends its backups:

a “.duplicacy” folder inside each repository (impractical …)
a centralized folder with the configurations of all repositories.

I use the second format. In this format, each repository contains only a small file with a single line: the path of its configuration in the centralized folder.

To make this clearer:

I have the d:\myprojects folder that I want to back up (it’s a “repository”, by the nomenclature of Duplicacy);

I have a centralized folder of settings, in which there is a subfolder for each repository:

\centralized_configs
└── my…
└── my…
└── myprojects
└── my…

In the repository (d:\myprojects) there is only one .duplicacy file with the line:

\centralized_configs\myprojects

(no key, password, nothing, just the path)

So in the script I just create this little file inside the temporary folder into which I’m going to download the test files.
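
The pointer-file step described above could be scripted roughly as follows. This is only a sketch of the layout from this post: the helper name write_pointer_file and the folder name restore_test are made up for illustration, and the only thing taken from the post is that the folder holds a one-line .duplicacy file containing the path of its settings folder.

```python
from pathlib import Path
import tempfile


def write_pointer_file(repository: Path, config_dir: str) -> Path:
    """Create the one-line .duplicacy file inside `repository`.

    Hypothetical helper: it only mirrors the layout described in the post
    (a file that holds the path of the centralized settings folder).
    """
    repository.mkdir(parents=True, exist_ok=True)
    pointer = repository / ".duplicacy"
    # No key, no password: just the path, on a single line.
    pointer.write_text(config_dir, encoding="utf-8")
    return pointer


if __name__ == "__main__":
    # A temporary folder used as the restore target for test files.
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "restore_test"
        pointer = write_pointer_file(target, r"\centralized_configs\myprojects")
        print(pointer, "->", pointer.read_text(encoding="utf-8"))
```

The restore command would then be run from inside that temporary folder.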

You just do this:

duplicacy restore -storage "storage_where_my_file_is" file1.txt

(without worrying about the subfolder, etc)

If you execute this command from the original folder, it will restore the file to the original subfolder (the original location). If you run the command on a temporary folder, it will restore to that temporary folder by rebuilding the folder structure for this file.

Or use patterns (see “Include Exclude Patterns”).

In this case (Duplicacy backups) I’m considering the philosophy of the scripts like this:
“let’s put in a file all the commands that I would have to type, and schedule the execution of this file”

tophee · January 16, 2018, 8:23pm

FYI, the duplicacy developer made a nice summary of this topic. In particular, his statement about trusting the backend is very enlightening:

In Duplicacy we took the opposite way – we assumed that all the cloud storage can be trusted and we based our design on that. In cases where a cloud storage violates this basic assumption (for example, Wasabi, OneDrive, and Hubic), we will work with their developers to fix the issue (unfortunately not everyone is responsive) and in the meantime roll out our own fix/workaround whenever possible.

greg.ewing · January 16, 2018, 8:44pm

I’d be curious to understand the deficiencies that make these specific cloud storage services fall short of that assumption.

TowerBR · January 16, 2018, 9:02pm

I’m following both topics. In the topic of the Duplicacy site you will find the links:

Wasabi issue
OneDrive issue
Hubic issue

greg.ewing · January 16, 2018, 9:20pm

Thanks. A little light evening reading…

kenkendk · January 18, 2018, 2:00pm

Yes, that is actually the situation I was thinking of. The problem here is that the backend (either in Duplicati or the actual destination) is broken, and returns an incorrect result. This really should be fixed, because otherwise you may end up having a partial backup (some files are really missing).

Not checking the destination is really a dangerous workaround.

kenkendk · January 18, 2018, 2:20pm

Thanks for that link!

When I wrote Duplicati 1.x I also had the same approach: of course the storage works, otherwise they will (should?) fix it.

Having seen the error reports from the 1.x users, I know this is just not the case.

One particularly interesting problem is that WebDAV on Apache has a slight race condition, where it will list a file as “existing, downloadable, and correct” and afterwards delete the file:
https://bz.apache.org/bugzilla/show_bug.cgi?id=54793

For this reason, Duplicati uses random filenames, and does not overwrite an existing file. Using the CY approach it is hard to do the same as the hash names are the filenames, but not checking if the remote chunk is really there will cause problems when you need to restore.

The OneDrive issue that @TowerBR links to is another example of a real problem that is not discovered unless you check (it should be fixed by writing a new backend that uses the updated OneDrive API, and as such is easy to fix for CY and TI alike).

But I agree that there are multiple issues that cause the database in TI to break, and they should of course all be fixed asap as they create a really poor user experience.

tophee · January 18, 2018, 9:36pm

Hm, I find it increasingly hard to make a judgement between the DY and DT approach. One reason for this difficulty is that it is really difficult to weigh the different pros and cons against each other. Another reason is that both sides seem to be addressing various issues, so that what is true today may not be true in a few months.

For example, duplicacy seems to be addressing at least some of the trust problem:

github.com/gilbertchen/duplicacy

Adding a feature to help with server side data integrity verification

opened 03:51AM - 16 Jan 18 UTC

kevinvinv

On Dec 19, 2017, at 7:10 PM, Gilbert (Gang) Chen <gchen@acrosync.com> wrote:

Hi, Kevin, Thank you for your support! Yes, I think that is a good idea. The implementation can be very simple -- after the snapshot file, say, 'snapshots/test/1', has been uploaded, upload another file named 'snapshots/test/1.chunks', which contains the names of all referenced chunks by '1'.

On Mon, Dec 18, 2017 at 10:53 PM, Kevin Vannorsdel <kv@vannorsdel.com> wrote:

Hi Gilbert, I think that I really need a way to verify that the storage contains all the necessary chunks for each snapshot. I am operating an SFTP server and about 8 family members backup to me- and I need to know that their data is reliably present. The -check option doesnt work b/c I also dont want to know their passwords. What I am thinking is that I need an un-encrypted list of chunks to be uploaded with each snapshot… so that I can crawl through the list on the server-side and verify the presence of every chunk.

But again, I am not really able to judge what is being fixed here and what problems will remain and perhaps can’t be addressed without a database.
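
To make the proposal in the quoted issue concrete, here is a rough sketch of what such a server-side check could look like. Everything about the file formats is an assumption: the issue only says that a file like 'snapshots/test/1.chunks' would contain the names of the referenced chunks, so the one-name-per-line format, the flat chunk folder and the function name missing_chunks are all hypothetical.

```python
from __future__ import annotations

from pathlib import Path
import tempfile


def missing_chunks(chunk_list: Path, chunk_dir: Path) -> list[str]:
    """Return the names listed in `chunk_list` that have no file in `chunk_dir`.

    Assumed format (not confirmed by the issue): one chunk name per line,
    and one file per chunk directly inside `chunk_dir`.
    """
    lines = chunk_list.read_text(encoding="utf-8").splitlines()
    names = [line.strip() for line in lines if line.strip()]
    return [name for name in names if not (chunk_dir / name).is_file()]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        storage = Path(tmp)
        chunk_dir = storage / "chunks"
        chunk_dir.mkdir()
        # Two chunks are present on the server, a third one is referenced but absent.
        (chunk_dir / "aa11").write_bytes(b"chunk data")
        (chunk_dir / "bb22").write_bytes(b"chunk data")
        listing = storage / "1.chunks"
        listing.write_text("aa11\nbb22\ncc33\n", encoding="utf-8")
        print("missing:", missing_chunks(listing, chunk_dir))
```

A check like this only confirms that the chunk files exist; it needs no password, but it also cannot tell whether their contents are intact.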

TowerBR · January 22, 2018, 12:37am

Continuing the tests …

One of my goals in using Duplicati and Duplicacy is to back up large files (several Gb each) that change only in small parts. Examples: Veracrypt volumes, mbox files, miscellaneous databases, and Evernote databases (basically an SQLite file).

My main concern is the growth of the space used in the backend as incremental backups are performed.

I had the impression that Duplicacy made large uploads for small changes, so I did a little test:

I created an Evernote installation from scratch and then downloaded all the notes from the Evernote servers (about 4,000 notes; the Evernote folder was then 871 Mb, of which 836 Mb was the database).

I did an initial backup with Duplicati and an initial backup with Duplicacy.

Results:

Duplicati:
672 Mb
115 min

and

Duplicacy:
691 Mb
123 min

Pretty much the same, but with some difference in size.

Then I opened Evernote, made a few minor changes to some notes (a few Kb), closed Evernote, and ran backups again.

Results:

Duplicati (from the log):

BeginTime: 21/01/2018 22:14:41
EndTime: 21/01/2018 22:19:30 (~ 5 min)
ModifiedFiles: 5
ExaminedFiles: 347
OpenedFiles: 7
AddedFiles: 2
SizeOfModifiedFiles: 877320341
SizeOfAddedFiles: 2961
SizeOfExaminedFiles: 913872062
SizeOfOpenedFiles: 877323330

and

Duplicacy (from the log):

Files: 345 total, 892,453K bytes; 7 new, 856,761K bytes
File chunks: 176 total, 903,523K bytes; 64 new, 447,615K bytes, 338,894K bytes uploaded
Metadata chunks: 3 total, 86K bytes; 3 new, 86K bytes, 46K bytes uploaded
All chunks: 179 total, 903,610K bytes; 67 new, 447,702K bytes, 338,940K bytes uploaded
Total running time: 01:03:50

Of course it skipped several chunks, but it still uploaded 64 chunks out of a total of 176!

I decided to do a new test: I opened Evernote and changed one letter of the contents of one note.

And I ran the backups again. Results:

Duplicati:

BeginTime: 21/01/2018 23:37:43
EndTime: 21/01/2018 23:39:08 (~1,5 min)
ModifiedFiles: 4
ExaminedFiles: 347
OpenedFiles: 4
AddedFiles: 0
SizeOfModifiedFiles: 877457315
SizeOfAddedFiles: 0
SizeOfExaminedFiles: 914009136
SizeOfOpenedFiles: 877457343

and

Duplicacy (remembering: only one letter changed):

Files: 345 total, 892,586K bytes; 4 new, 856,891K bytes
File chunks: 178 total, 922,605K bytes; 26 new, 176,002K bytes, 124,391K bytes uploaded
Metadata chunks: 3 total, 86K bytes; 3 new, 86K bytes, 46K bytes uploaded
All chunks: 181 total, 922,692K bytes; 29 new, 176,088K bytes, 124,437K bytes uploaded
Total running time: 00:22:32

In the end, the space used in the backend (counting all 3 versions, of course) was:

Duplicati: 696 Mb
Duplicacy: 1,117 Mb

That is, with these few (tiny) changes Duplicati added 24 Mb to the backend and Duplicacy 425 Mb.

One problem: even with such a simple, small backup, Duplicati showed me a “warning” on the second and third runs, but when I checked the log I found:

Warnings: []
Errors: []

This seems to be a known Duplicati behavior (this kind of “false positive”). What worries me is that I will get used to ignoring the warnings and then miss a real one.

Now I’m trying to work out the technical reason for such a big difference, thinking about how Duplicati and Duplicacy backups are structured. Any suggestions?

kenkendk · January 22, 2018, 10:07am

I agree, it is bad to report warnings when none are shown.

The current logging system is a bit split with 3 or 4 different log systems.
Most of these end up in the same place, but there must be at least one case where the warning is hidden from the output.
I hope to rewrite it and use a single logging system, so there is one place where log data is viewed.

I don’t have any ideas here. The algorithm in Duplicati is very simple: fixed offset with fixed size chunks. Any kind of dynamic algorithm should be able to re-use more data than Duplicati (perhaps at the expense of speed).

I see no reason that Duplicacy should not be able to achieve the same upload size.

Best guess is it has to do with compression being more aggressive with Duplicati.
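
As an illustration of what “fixed offset with fixed size chunks” means for re-use, here is a toy sketch. It is not Duplicati’s code: the block size, the SHA-256 hashing and the function names are assumptions chosen only to show the principle on random test data.

```python
from __future__ import annotations

import hashlib
import random


def block_hashes(data: bytes, block_size: int) -> list[str]:
    """Hash consecutive fixed-size blocks that start at fixed offsets."""
    return [
        hashlib.sha256(data[start:start + block_size]).hexdigest()
        for start in range(0, len(data), block_size)
    ]


def count_new_blocks(old: bytes, new: bytes, block_size: int) -> int:
    """Count blocks of `new` whose hash was not already stored for `old`."""
    known = set(block_hashes(old, block_size))
    return sum(1 for h in block_hashes(new, block_size) if h not in known)


if __name__ == "__main__":
    rng = random.Random(42)
    original = bytes(rng.getrandbits(8) for _ in range(4096))

    # Overwrite one byte in place: the file length stays the same.
    overwritten = bytearray(original)
    overwritten[1000] ^= 0xFF
    print("in-place change:", count_new_blocks(original, bytes(overwritten), 256))

    # Insert one byte: everything after the insertion point shifts.
    inserted = original[:1000] + b"!" + original[1000:]
    print("insertion:", count_new_blocks(original, inserted, 256))
```

In this sketch an in-place overwrite dirties only the block that contains the changed byte, while an insertion shifts every later block, so all of them count as new.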

TowerBR · January 22, 2018, 8:42pm

For anyone interested, I’m continuing this test with other parameters (chunk size) and putting the results in this topic.

TowerBR · January 24, 2018, 10:58am

Interestingly, @mdwyer commented on this topic in much the same way I did: “When it’s working, great, but when a problem occurs, it’s a nightmare.”

TowerBR · January 28, 2018, 10:02pm

Well, this is probably my final test with the Evernote folder. Some results were strange, and I really need your help to understand what happened at the end…

Some notes:

Upload and time values were obtained from the log of each software.

Duplicati:
bytes uploaded = “SizeOfAddedFiles”
upload time = “Duration”

Duplicacy:
bytes uploaded = “bytes uploaded” in log line “All chunks”
upload time = “Total running time”

Rclone was used to obtain the total size of each remote backend.
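
For anyone who wants to collect the Duplicacy figure automatically, the “bytes uploaded” value can be pulled out of the “All chunks” log line with a small script. This is a sketch that assumes the line always looks like the samples posted earlier in this topic; the function name uploaded_kbytes is made up.

```python
from __future__ import annotations

import re

# Matches the last "<number>K bytes uploaded" figure on an "All chunks" line.
ALL_CHUNKS = re.compile(r"^All chunks: .* ([\d,]+)K bytes uploaded\s*$")


def uploaded_kbytes(log_line: str) -> int | None:
    """Return the uploaded amount in K bytes, or None for any other line."""
    match = ALL_CHUNKS.match(log_line.strip())
    if match is None:
        return None
    # The log uses thousands separators, e.g. "338,940".
    return int(match.group(1).replace(",", ""))


if __name__ == "__main__":
    sample = (
        "All chunks: 179 total, 903,610K bytes; 67 new, "
        "447,702K bytes, 338,940K bytes uploaded"
    )
    print(uploaded_kbytes(sample))
```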

In the first three days I used Evernote normally (adding a few notes a day, a few kb), and the result was as expected:

graph01

graph02

graph03

BUT, this is the first point I didn’t understand:

1) How can the total size of the Duplicacy backend grow faster if the daily upload is smaller (graph 1 vs. graph 2)?

Then on the 26th I decided to organize my tags in Evernote (something I had been wanting to do). So I standardized the nomenclature, deleted obsolete tags, rearranged, etc. That is, I didn’t add anything (bytes), but probably all the notes were affected.

And the effect of this was:

graph04

graph05

graph06

That is, something similar to the first few days, only larger.

Then, on day 27, I ran an internal Evernote command to “optimize” the database (rebuild the search indexes, etc.) and the result (disastrous in terms of backup) was:

(and at the end there are the other points for which I would like your help to understand)

graph07

graph08

graph09

2) How could Duplicati upload have been so small (57844) if it took almost 3 hours?

3) Should I also consider the “SizeOfModifiedFiles” Duplicati log variable?

4) If the last Duplicati upload was so small, why has the total size of the remote grown so much?

5) Why did Duplicacy’s last upload take so long? Was it Dropbox’s fault?

I would like to clear up all these doubts because this was a test to evaluate the backup of large files with minor daily modifications, and the real objective is to identify the best way to back up a set of 20Gb folders with mbox files, some with several Gb, others with only a few kb.

I look forward to hearing from you all (especially @kenkendk).

(P.S.: Also posted on the Duplicacy forum)

tomcloyd · January 29, 2018, 12:22am

Nice to know, but not easy to find. Not good. Documentation needs to be out front, so to speak, to speed use of the program. Just my opinion. (Elsewhere, I have proposed a use wiki for this very purpose and reason.)

JonMikelV · January 30, 2018, 4:23am

As usual, great work on the research and numbers!

Normally I’d say I suspect Duplicati’s longer times are due to the encryption and compression, but I assume you “equalized” those with Duplicacy… (Sorry if you said so in your post - I’m just not seeing it).

My guess on the higher Duplicati bandwidth usage (was that really JUST uploads?) is that it is due to compacting (download, extract, re-compress, upload), which I don’t believe Duplicacy’s design “needs”. I think you could test that by turning off Duplicati’s auto-compact.

TowerBR · January 30, 2018, 6:37pm

But Duplicati times are smaller …I didn’t understand your point …

I’m not sure I understand what you mean here.

If you’re referring to the size of the chunks, the previous tests above were done with the standard sizes:
TI = 100k (and I set the volumes to 5M)
CY = 4M (variable).

This last test above (the graphs) was done with:
TI = 100k (and I set the volumes to 5M) (same)
CY = 128k (fixed).

And both programs are using their default settings for encryption and compression.

I’m going to do some more tests that Gilbert suggested in the Duplicacy forum post, but only in a few days, as I’m a little busy right now. For now, the two backup jobs are suspended so that no data is collected outside controlled conditions.

Is that a good idea? Won’t it make uploading worse?

JonMikelV · January 31, 2018, 5:04pm

I think I was responding specifically to item #2 “how can 57844 take almost 3 hours”, not a comparison of the two apps.

I meant using things like similar block sizes, which you did do with the last test.

Auto-compact is what will download multiple archives from the destination to be re-compacted into fewer files then re-uploaded - it’s not related to the actual compression of backup data.

So depending on how you’re measuring your times and bandwidth, it’s possible you’re tracking both the backup (same as Duplicacy) AND the auto-compact maintenance (something I think Duplicacy doesn’t need to do). While this is a valid thing to know in terms of total run duration, it can make it difficult to compare the two apps.

Thank you again for all your time comparing these two great tools!

TowerBR · February 6, 2018, 5:54pm

I decided to set up a repository in GitHub to put the results of the tests I’m executing. For anyone interested: link

sylerner · June 18, 2018, 4:49pm

Am I correct in that another difference between DT and DC is that DC supports rolling/variable sized blocks, so that if data is inserted mid-file (that is not a multiple of the block size), DC doesn’t have to back up everything from that point forward, just the new data, while DT would have to back up everything from the insertion point to the end of the file?

Also, does DC support compression like DT?

Thanks!

JonMikelV · June 18, 2018, 4:57pm

I’m not an expert in Duplicacy, but I believe you are correct - it does support variable sized blocks. However, exactly how they’re implemented I couldn’t say.

Oh, and I think Duplicacy DOES support compression (some users have posted compression level tests previously in this thread).
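
To illustrate the general idea behind variable-sized blocks, here is a toy sketch that compares them with fixed-size blocks when one byte is inserted mid-file. It is not Duplicacy’s implementation: the boundary rule (a CRC of the previous 16 bytes), the constants and all function names are assumptions made only for this example.

```python
from __future__ import annotations

import hashlib
import random
import zlib

WINDOW = 16   # bytes examined to decide on a boundary (illustrative value)
DIVISOR = 64  # roughly one boundary every 64 bytes on random data (illustrative)


def content_defined_chunks(data: bytes) -> list[bytes]:
    """Cut `data` wherever a checksum of the preceding WINDOW bytes hits a target.

    Toy boundary rule: boundaries depend only on nearby content,
    not on the absolute offset in the file.
    """
    chunks, start = [], 0
    for end in range(WINDOW, len(data)):
        if zlib.crc32(data[end - WINDOW:end]) % DIVISOR == 0:
            chunks.append(data[start:end])
            start = end
    chunks.append(data[start:])
    return chunks


def fixed_chunks(data: bytes, size: int) -> list[bytes]:
    """Cut `data` into blocks of `size` bytes starting at offset 0."""
    return [data[i:i + size] for i in range(0, len(data), size)]


def count_new(old_chunks: list[bytes], new_chunks: list[bytes]) -> int:
    """Count chunks in `new_chunks` whose content hash is not among `old_chunks`."""
    known = {hashlib.sha256(c).digest() for c in old_chunks}
    return sum(1 for c in new_chunks if hashlib.sha256(c).digest() not in known)


if __name__ == "__main__":
    rng = random.Random(42)
    original = bytes(rng.getrandbits(8) for _ in range(4096))
    inserted = original[:1000] + b"!" + original[1000:]  # one byte added mid-file

    print("fixed-size, new chunks:",
          count_new(fixed_chunks(original, 64), fixed_chunks(inserted, 64)))
    print("content-defined, new chunks:",
          count_new(content_defined_chunks(original), content_defined_chunks(inserted)))
```

With fixed offsets every block after the insertion point shifts and has to be stored again; with content-defined boundaries the cut points move along with the data, so only the chunks near the insertion are new.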