You have given good information about the Backup source support comparison…
thanks a lot…
Hint: If you refer to some other post on this forum, it would be great if you could provide a link to that post.
The link will automatically be visible from both sides (i.e. there will also be a link back from the post you’re linking to). Those links will make it much easier for people to navigate the forum and find relevant information.
So far in this discussion, the focus has been on speed and it looks like we can expect duplicati to eventually catch up with duplicacy in single-threaded uploads and perhaps even in multi-threaded uploads. Good.
Another conclusion I draw from the above discussion is that, compared to duplicacy, duplicati saves a lot of remote storage space (which usually costs money). Also good.
But there is another killer feature on the duplicacy side: cross-source deduplication. Duplicacy can use the same backup archive for backup jobs on different machines. Depending on the scenario, this may well make up for duplicati’s more efficient storage use.
I seem to remember someone (@JonMikelV?) saying that this could become possible in duplicati too. Is that correct and if so, is it on the road map?
I think database use is the weak point of Duplicati. It’s a very “sophisticated” approach, and though elegant, it creates several points of failure.
Even when everything is OK, there is the problem of the space the database occupies (several hundred megabytes).
When a problem occurs (a corrupted database, e.g.), it is a nightmare. It takes several hours (sometimes days) to rebuild the database from the backup. Exactly the time you don’t have when you need a restore.
OK, you can just back up the databases after each backup job with another tool, but it’s strange to back up the backup databases (?!).
If that was me, it was probably in my early days, so I may not have correctly understood how things worked.
While in theory cross-source deduplication could happen, it would require a huge rewrite so that multiple backups could share the same destination.
For that to happen, a destination log / database of some sort would need to be created to handle block tracking across the multiple sources.
For example, if sources A and B both have the same file (or at least a matching block) and it’s deleted from source A, something has to stop the source A backup from deleting the still-in-use-at-source-B block.
Similarly, if you set up two jobs sharing the same destination but with different retention schedules, something needs to keep stuff from being deleted until it’s flagged as deletable in all backup jobs.
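The bookkeeping described above could be sketched as a destination-side reference count per block. This is purely hypothetical (Duplicati has no such shared index today, and all names here are made up for illustration):

```python
# Hypothetical destination-side block index (NOT Duplicati's actual design):
# a block may only be deleted once no source references it.
class BlockIndex:
    def __init__(self):
        self.refs = {}  # block hash -> set of source ids still using it

    def add(self, block_hash, source_id):
        """Record that source_id's backup references block_hash."""
        self.refs.setdefault(block_hash, set()).add(source_id)

    def remove(self, block_hash, source_id):
        """Drop source_id's reference; return True only when the block is
        no longer used by any source and is safe to delete remotely."""
        owners = self.refs.get(block_hash, set())
        owners.discard(source_id)
        if not owners:
            self.refs.pop(block_hash, None)
            return True
        return False
```

With something like this, deleting a file on source A only removes A’s reference; the shared block survives on the destination until source B drops it too.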
Okay, I was indeed wondering how it would be possible, but I guess I was hoping too much to see clearly…
There is actually a pretty good description of the challenges in cross-source deduplication and how duplicacy solves it here. Quite ingenious, actually.
This is starting to dawn on me too. I have had so many issues with duplicati, and basically all of them seem to be related to the database (though I can’t really tell whether the “missing 1 file in backend, please run repair” error is a db problem, but since it tells me to repair the database, I guess it is).
Well, I think that is a viable solution (or it could be one, if duplicati took care of that by itself). But I agree, it’s definitely not a plus for duplicati.
So to boil this comparison down even further, I’d say that, at the moment, there are really only three (or maybe four) things that speak for duplicati compared to duplicacy:
Depending on taste, you could add the UI as a fourth advantage. Personally, I really like the Web UI and the way it makes command line options available without forcing you to use the CLI.
On the minus side, duplicati is (currently)
Edit: I just discovered that duplicacy allows you to change the password for your archive. I’m inclined to add this as another plus for duplicacy.
Edit2: I just discovered the very (!) basic restore functionality of duplicacy: it is very cumbersome to search for and restore a file whose exact location you don’t know.
You can only do it via the CLI (as described here), and even then, you have to run a separate command to restore the identified file. Edit: Sorry, it’s not as bad as I thought. Search also works in the GUI version, but you can’t search across revisions. So the difference from duplicati is not as big, but duplicati still seems better here.
Very well summarized.
Just a small note:
The CLI version of Duplicacy is completely free and opensource as well.
I personally prefer to use software in its CLI version. Things are more “under control” and “in hand.” And after the initial setup, they are of the “set up and forget” type.
Rclone is another good example of CLI software (and I use it for some types of files, including the full backup of the Duplicati databases).
Yes, of course! Hence:
Looks like this is not necessarily the case:
Here are informal results. By informal I mean that I would kill the ongoing backup, resume at the same spot with a different thread count, wait until it stabilized on a long stretch where there were minimal duplicates, and then record the MB/s. It is not rigorous, but at least an indicator.
-threads 1 ~12MB/s - Not great, but at least it will finish the initial backup in about a day or two.
-threads 2 ~1.93MB/s - very slow, it was slated to take well over a week
-threads 64 ~1.1MB/s - extremely slow, earlier on before it stabilized, it was running at only 100KB/s
-threads 16 ~619KB/s - seems even slower, but I did not let this one run very long, so it may not have stabilized at full speed
-threads 32 ~2.0MB/s - This was a longer test than for 16 threads, and converged to a somewhat higher rate, though not significantly different than 2 threads
The bottom line is that only one clear winner emerged: 1 single backup thread.
In thinking about it, this may make more sense than it seems for a spinning hard drive. That’s because writing each chunk requires a head seek, and if there are multiple threads, that’s a LOT more seeks going on. In fact, the way I pinpointed what was going on was to run a separate speed test on the drive while the multi-threaded processes were running. The array normally writes at 200MB/s, but with more than 4 backup threads running, it slowed down to less than 9MB/s.
With only 1 thread running, it is still writing at over 130MB/s for separate processes.
Based upon this, I am guessing that an SSD would actually benefit far more from multiple threads, because there is no seek involved.
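The seek argument can be put into a back-of-the-envelope model. All numbers below (seek time, chunk size, sequential rate) are assumptions for illustration; the model only captures the direction of the effect, not the magnitudes measured above:

```python
# Toy model: each chunk write pays a head seek if another thread wrote in
# between, so the expected seek fraction grows with the thread count.
def throughput_mb_s(threads, seq_mb_s=200.0, seek_ms=12.0, chunk_mb=1.0):
    p_seek = (threads - 1) / threads          # chance the head has moved away
    per_chunk_s = chunk_mb / seq_mb_s + p_seek * seek_ms / 1000.0
    return chunk_mb / per_chunk_s

# 1 thread: no seeks, full sequential rate; more threads: seek time dominates.
```

Setting `seek_ms` to ~0 (an SSD) makes the thread count irrelevant in this model, which matches the guess that SSDs would benefit from multiple threads.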
Nice analytics there!
Were you using a temp folder on the same drive as the source data?
I wonder how much speed difference there would be using a 2nd drive or a ramdisk as the temp location…
I have noticed this also. Theoretically, it should be much faster to run queries against a database, and it is also more crash resistant.
In Duplicati, there are quite a few problems related to the database.
However, many of the issues will not be fixed by simply removing the database.
In some cases there are failures because the remote store reports different contents (for whatever reason). Removing the database here will just make the problem invisible until you really need the files.
Another problem with the database is consistency issues. Again, removing the database will not fix these, just make them show up later on where they are more likely to prevent correct restores. The errors that produce an inconsistent database should really be found and fixed asap.
Then there is the problem with recreating the local database. This is primarily a speed issue (and there are some recreate errors as well). This could be solved by not using a database, or at least by using a much simpler database structure. But it could perhaps also be fixed by storing data in the dindex files that allows a much faster database rebuild. I am more in favor of the latter option here, as the first one is already handled by the
Without implying that there are problems with CY, simply not having a local database means there are a number of checks that CY cannot do. When TI does these checks and they fail, it looks like TI is more fragile than CY.
I really like the simplicity in CY, it contains way less code that needs to be maintained, but I think TI will benefit from the local database in the long run.
The size of the database can also be reduced in the future (path compression will do a great size reduction). But note that CY also stores a large number of files on disk to make a cache of remote blocks.
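To illustrate what “path compression” can buy, here is a sketch of the general idea (not Duplicati’s planned implementation): store each directory prefix once and let file entries reference it by id, so long shared prefixes are not repeated per file.

```python
# Illustrative path table with prefix sharing: directories are interned
# once and files store only (prefix id, file name).
class PathTable:
    def __init__(self):
        self.prefixes = {"": 0}   # directory path -> id
        self.files = []           # list of (prefix_id, file name)

    def add(self, path):
        """Split 'dir/sub/name' and intern the directory part."""
        directory, _, name = path.rpartition("/")
        pid = self.prefixes.setdefault(directory, len(self.prefixes))
        self.files.append((pid, name))

    def full_path(self, i):
        """Reassemble the original path of file entry i."""
        pid, name = self.files[i]
        directory = next(d for d, v in self.prefixes.items() if v == pid)
        return f"{directory}/{name}" if directory else name
```

With thousands of files under a few deep directories, the directory strings are stored once instead of per file, which is where the size reduction comes from.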
Actually, it would not require a great deal, if we simply allow the same block to exist twice (well, that is actually already allowed, but not in a useful way).
However, it makes it much harder to detect problems, if we suddenly have two sources that update the remote store. In CY this is less of a problem as they can “see” all the blocks that exist, but because TI hides this inside the zip archives, it needs to download and unpack some files to get the same information.
But it is not impossible that we can use some of the same strategy as CY uses for concurrently reclaiming old data. It is not on my short-list of things to do.
Hello Kenneth, I agree with the points you mentioned.
In summary, there are two weaknesses related to using databases:
I agree that both aspects can be improved in future releases.
For other points you mentioned, I understand that they reflect a difference of philosophy between CY and TI: TI is much more “user-friendly”, and CY is more “on hand”. Example: TI checks backups automatically and randomly. In CY you have to do it by hand using the “check” command. But it works very well:
I include the above routine in my backup batches.
Additionally I created a routine to download random files and compare them with local files.
That is, things that TI does automatically I put in the backup batches.
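The “download random files and compare them with local files” routine could look roughly like this. A generic sketch only: the actual download step would be a separate duplicacy restore call into `restored_dir`, which is omitted here, and all names are placeholders.

```python
# Sketch of a spot-check: hash a random sample of restored files and
# compare them against the local originals.
import hashlib
import os
import random

def sha256_of(path):
    """Incrementally hash a file so large files don't load into memory."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()

def spot_check(original_dir, restored_dir, sample_size=10, seed=None):
    """Return the names of sampled files whose restored copy differs."""
    rng = random.Random(seed)
    names = [n for n in os.listdir(original_dir)
             if os.path.isfile(os.path.join(original_dir, n))]
    sample = rng.sample(names, min(sample_size, len(names)))
    return [n for n in sample
            if sha256_of(os.path.join(original_dir, n))
            != sha256_of(os.path.join(restored_dir, n))]
```

An empty return list means every sampled file restored byte-identical; anything else names the mismatches for investigation.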
Good point. But I have had duplicati running on my machine for months now, and whenever I look at it, I see something like this:
Its complaints about missing files are just never ending. In the beginning I did run repair, but since the errors kept coming back, and since sometimes they seem to go away without using repair, I’ve stopped bothering.
In any case, I’d say it’s rather unlikely that all those missing files are missing because of some backend problem that duplicati thankfully identified while duplicacy would have missed it. In my eyes, duplicati either identifies problems that aren’t there, or it identifies problems that it created itself during upload.
BTW, I found another big minus for duplicacy, which I added to my list above: search and restore is not so great in duplicacy.
Do you mean search-to-restore, or the two functions separately?
As I explained in Edit2 here, it turned out not quite as bad as I thought. But Duplicati is clearly superior here. To start with, duplicacy can only restore to the original location. You can’t tell it to restore something to your desktop or the like. Big minus for me.
Then, you cannot search for a file across revisions using the GUI. If you want to search your entire archive for a particular file, you need to do it via the CLI and once you’ve found it, you need to restore it using an entirely separate command, i.e. you can’t just say: “yes, that one, please restore it.”
And if you don’t want/like/need the GUI, you cannot search-to-restore at all. You can search using the history command, and if you want, you can take note of what you found and restore it with the restore command. In fact, it may be that you can’t search for a file using the CLI on Windows at all, because the command proposed by the developer includes grep, which is a Linux command and does not exist on the Windows command line. In other words: the duplicacy CLI doesn’t really offer any search itself but merely allows you to search the output of the history command using other tools.
Well, it seems strange to discuss the Duplicacy commands in the Duplicati forum, but here we go; I think this is turning into an interesting conversation.
(Remembering that I’m using the CLI version in Windows 10).
This is not completely correct. I’ve set up a script that runs daily with the following steps:
So by steps 3 and 4 above you can see that it is possible to restore the files to a different folder than the original one.
In Duplicacy you can save the backup settings to a centralized folder, they do not have to be in the original folders themselves (called repositories). So it’s easy for the script I described above to retrieve these settings.
This is a good example of what I commented some posts above, that some things that Duplicati does automatically, in Duplicacy have to be placed in the scripts.
This is true (about grep), but you can easily pass the file name as an include pattern or as a parameter when calling the command.
All this reinforces what I said above: Duplicati is more user friendly, but if you like to control things by scripts (like me), there are no major complications to using Duplicacy. But I recognize that not everyone likes scripting.
Since we are talking about differences, there is a very useful Duplicacy command, which makes it very similar to Crashplan in terms of version maintenance:
$ duplicacy prune -keep 1:7       # Keep 1 snapshot per day for snapshots older than 7 days
$ duplicacy prune -keep 7:30      # Keep 1 snapshot every 7 days for snapshots older than 30 days
$ duplicacy prune -keep 30:180    # Keep 1 snapshot every 30 days for snapshots older than 180 days
$ duplicacy prune -keep 0:360     # Keep no snapshots older than 360 days
You can even use:
$ duplicacy prune -keep 0:360 -keep 30:180 -keep 7:30 -keep 1:7
(source: Duplicacy guide at GitHub)
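My reading of the -keep n:m semantics quoted above (“for snapshots older than m days, keep one every n days; n = 0 keeps none”) can be sketched like this. An illustration of the rules, not Duplicacy’s actual code:

```python
# Sketch of -keep n:m retention rules applied to snapshot ages (in days).
def apply_keep_rules(snapshot_days, rules):
    """rules: list of (n, m) pairs. For a snapshot older than m days,
    keep one per n-day bucket; n == 0 drops all snapshots older than m.
    The rule with the largest matching m wins, as in the -keep examples."""
    rules = sorted(rules, key=lambda r: r[1], reverse=True)
    keep, seen_buckets = [], set()
    for age in sorted(snapshot_days, reverse=True):
        rule = next(((n, m) for n, m in rules if age > m), None)
        if rule is None:
            keep.append(age)          # newer than every rule: always kept
            continue
        n, m = rule
        if n == 0:
            continue                  # drop everything older than m
        bucket = (m, age // n)        # one survivor per n-day bucket
        if bucket not in seen_buckets:
            seen_buckets.add(bucket)
            keep.append(age)
    return sorted(keep)
```

For example, with the four rules above and snapshots aged 400, 200, 190, 10, 10, and 3 days, the 400-day one is dropped entirely, only one of 200/190 survives the 30-day bucketing, and only one 10-day snapshot survives the daily rule.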
I know that similar functionality is being developed for Duplicati (and I’m following).
This is an essential point to reduce the space used in remote storage.
The duplicacy developer said:
to restore to a different location you’ll need to initialize a new repository on the new location with the same repository id.
and I guess I overinterpreted that as basically meaning “you can’t restore to a new location”. I suppose that is what your script is doing (initializing a new repository)?
I still think this is a bit of overkill if I simply want to restore a single file, even for someone who appreciates the benefits of scripting. It may work well as part of your housekeeping script, but what if you just want that one file?
I see scripting as an additional feature that adds flexibility to the product. When scripting becomes a philosophy of simplicity that actually demands that the user be flexible (and compose the right command to achieve simple things), then I think the scripting approach has gone wrong. Scripting should not mean: “well, then you have to script everything”.
I’m not sure I understand what you mean here.
No, it is not necessary to initialize from scratch.
Duplicacy has two ways of storing the settings for each repository (the local folders) and the storages to which each repository sends its backups:
I use the second format. And in this format, there is only a small file in each repository that contains only one line with the configuration path in the centralized folder.
So to get clearer:
I have the d:\myprojects folder that I want to back up (it’s a “repository”, in Duplicacy’s nomenclature);
I have a centralized folder of settings, in which there is a subfolder for each repository:
In the repository (d:\myprojects) there is only one .duplicacy file with the line:
(no key, no password, nothing, just the path)
So in the script I just create this little file inside the temporary folder into which I’m going to download the test files.
You just do this:
duplicacy restore -storage "storage_where_my_file_is" file1.txt
(without worrying about the subfolder, etc)
If you execute this command from the original folder, it will restore the file to the original subfolder (the original location). If you run the command on a temporary folder, it will restore to that temporary folder by rebuilding the folder structure for this file.
Or use patterns: Include Exclude Patterns
In this case (Duplicacy backups) I’m considering the philosophy of the scripts like this:
“let’s put in a file all the commands that I would have to type, and schedule the execution of this file”
FYI, the duplicacy developer made a nice summary of this topic. In particular, his statement about trusting the backend is very enlightening:
In Duplicacy we took the opposite way – we assumed that all the cloud storage can be trusted and we based our design on that. In cases where a cloud storage violates this basic assumption (for example, Wasabi, OneDrive, and Hubic), we will work with their developers to fix the issue (unfortunately not everyone is responsive) and in the meantime roll out our own fix/workaround whenever possible.