(Constructive) Criticism of Duplicati

Hi all,

Long-term user of Duplicati here, but I want to share some observations of faults - the aim is to promote some healthy discussion and hopefully channel some thoughts/energy into improving Duplicati further.

So I was experimenting with Arqbackup recently (gasp!) - while it is a bit rough around the edges and a bit exposed to single-developer risk - I really like a couple of its features:

  1. Ability to get at backups from another computer

So my main file server uses Arq/Duplicati as a backup in case of fire/disaster - but occasionally I need to get at those files from outside home; usually that’s impossible because the file server is just for my LAN, but with Arq I can get to the backup and pull out the files. It doesn’t happen often, but it is helpful to have when I need it (CrashPlan used to allow this as well).

For Arq, this is viable because the configuration is mostly self-contained (I just point to the backup location, put in the password, and can easily see the backup records). I find Duplicati much more trouble for one-off, ad hoc access to backups.

  2. Ability to easily trim the backup

Me being OCD, sometimes I want to optimize the space usage. It is easy in Arq to choose which backups to keep (and discard all older ones), and then, either through the normal periodic clean-up or on my own initiative, I can do a drop-unreferenced-objects and minimize my backend storage use.

It isn’t too clear how to do this easily in Duplicati (at least, I couldn’t find an option for deleting specific backups), and Duplicati’s clean/compress seems unstable/error-prone enough that I have much less comfort using it.

  3. Sort of related to (1) and (2) above, but Arq seems to be a little more graceful when hitting errors - e.g. missing data files and/or errors accessing the backend. On a few occasions I get red error messages about corrupt/missing files, and even if I choose the repair/rebuild the errors keep coming back (in fact, one time I got so frustrated trying to find the right “just go recreate it” option that I deleted the whole backup and redid it - viable for my 100 GB backup, but I have another close to 1 TB which would be really painful).

  4. On a more cosmetic note, Arq makes it really easy to see in the restore menu/file picker, for each backup, which files were added/modified/deleted, and that is rather handy sometimes.

As I said, the aim is healthy discussion - I like Duplicati’s ability to blob up the data files into one big file (Arq uses many little files, and that really cripples the performance of basically everything else when running intensive operations), and Duplicati supports more backends, which is immensely useful. I just hope that some of the blemishes can be addressed :slight_smile:

I am not involved in the development of Duplicati, I am just a private user.

From my point of view, I would not call these viewpoints problems but rather feature requests.

What I see in your numbered topics, especially numbers 2 and 4, is enhancing the use of metadata that is already captured in the local databases and providing optional actions that are not part of the basic task of backing up files. As Duplicati is in beta and still has a lot of problems that need solving, my own priority is to get the core engine working reliably before spending any effort on the useful but non-essential features you are proposing. These might even make deciphering bug reports much harder.

On the other hand, as far as these features are based on data already captured, it could be possible to establish a separate project for implementing your proposals, in the manner that duplicati-client is completely separate. The difference is that duplicati-client only reads data and renders it, but does not modify the local databases or the backup data. Your proposals require such modifications, which makes it much trickier to implement, whether as part of Duplicati project itself or as a separate project.

Personally, I really do not think Duplicati at this point is mature enough for your proposals, but please do record them as feature requests so they won’t be forgotten.

I like this idea, but to be secure I think you’d need 2 passwords. One to actually be able to connect to the remote storage, and another for decryption. It would be nice for Duplicati to store backup job configuration data in the remote location to help facilitate quicker restores, but it’d have to be encrypted.

Not only could you use this for quicker, one-off restores from a different PC, but the feature could be extended to allow restoring the job configuration itself. If your main PC died and you had to set up a new one, you could just point it at the remote storage, provide the access password, provide the decryption password, and then it could restore the job config with all settings.

Right now I accomplish this by exporting all my Duplicati job configs on all my PCs and storing them in a safe place, but it’d be nice if it were a built-in feature.

I have used Cloudberry in the past and it does something like this.

I’m not sure I understand the value of that. It is interesting on some level, but for normal usage I don’t really care about it. (It could just be me.) Duplicati can show you this information, but it requires command line usage.
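For reference, the command line route I mean is (I believe) the compare command, which lists what was added, deleted, or modified between two backup versions. Roughly like this, where the storage URL and passphrase are placeholders and 0 is the newest version (check `help compare` in your version for the exact options):

```
Duplicati.CommandLine.exe compare "<storage-URL>" 1 0 --passphrase=<your-passphrase> --full-result
```

It prints the changes as text rather than flagging them in the restore tree, which I think is what you were after.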

What I’d love to see is for the file restore dialog to prompt you to select the version LAST instead of FIRST - I really like how CrashPlan 4.x worked in this regard.

Duplicati is too. Although it’s more than one volunteer, there aren’t a lot, and this also limits progress.

That’s been the focus of work leading to 2.0.5.1 Beta, which has a lot of fixes compared to 2.0.4.5/23.

Restoring files if your Duplicati installation is lost describes how to do this from another computer that has Duplicati, however it’s not super simple (it needs some info) and not super fast (it must create a temporary database).

This feature is probably easier to do with server-side help, but Duplicati (and Arq) use simple storage.
Restoring encrypted files from backup to an iPhone is a similar feature request with another explanation.

sounds a lot like the info below on “Direct restore from backup files”, provided you have Duplicati installed.

If you don’t have an exported configuration file, you need to know the backend URL, credentials, and the backup passphrase. Once you have entered all the needed information, you can start restoring your files.
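If it is easier to script, roughly the same direct restore can be done from the command line; a sketch with placeholders for the URL, passphrase, and paths (options can differ a bit between versions):

```
Duplicati.CommandLine.exe restore "<storage-URL>" "*/Documents/report.docx" --passphrase=<your-passphrase> --restore-path=C:\RestoreHere --no-local-db
```

`--no-local-db` tells it to build a temporary database from the remote files instead of looking for the job’s usual one, which is essentially what the GUI’s direct restore does.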

Feature request: List all backup sets with their size covers how to do manual deletions exactly as you want; however, there are also five options you can configure on screen 5 (Options) if you will let Duplicati do it.

One of them has been there for a long time: Delete backups that are older than. However, there’s also New retention policy deletes old backups in a smart way, which provides even more custom options.
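If you prefer the command line, the delete command plus the retention options cover both the manual and the automatic style; roughly like this, with the URL and passphrase as placeholders (`help delete` shows the full option list):

```
# remove one specific version (0 = newest); add --dry-run to preview first
Duplicati.CommandLine.exe delete "<storage-URL>" --version=5 --passphrase=<your-passphrase>

# or let a retention rule thin things out on each backup, e.g. keep
# dailies for a week, weeklies for a month, monthlies for a year
Duplicati.CommandLine.exe backup "<storage-URL>" "C:\Data" --passphrase=<your-passphrase> --retention-policy="1W:1D,1M:1W,1Y:1M"
```

Deleting versions only marks blocks as unused; the space actually comes back when compact rewrites or removes the wasted volumes.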

There was at least one compact bug fixed in 2.0.5.1 Beta, and compact also tends to trip over damage of other sorts that one might find if, e.g. one does a restore. There have been many other fixes as well.

seems more valuable to me (for those who don’t know or want to Export their own). Good DIY info is in
How to restore data and then setup backup again?

How much easier than the current Direct restore could having a remote config be? Restore doesn’t need the full configuration. If you know enough to get the remote config, what more does restore need? Here again, services that provide their own storage and servers (e.g. CrashPlan) have a big advantage.

It would be nice for Duplicati to store backup job configuration data in the remote location to help facilitate quicker restores, but it’d have to be encrypted.

Not only could you use this for quicker, one-off restores from a different PC, but the feature could be extended to allow restoring the job configuration itself. If your main PC died and you had to set up a new one, you could just point it at the remote storage, provide the access password, provide the decryption password, and then it could restore the job config with all settings.

I 100% agree with all that. It would be a wonderful feature and make everything much simpler!

Automatic configuration and database backup? sort of gets at this, and links to support requests in the area. There might also be feature requests in Issues on GitHub (along with lots of open bugs to solve).

The database side of the above request is probably harder, because the database can become very large. Fortunately, database Recreate is better in 2.0.5.1, so backing up the database in its entirety has less gain.

Add ability to import backup configurations from command-line #3595 has gotten its developer at least somewhat more familiar with the import code. I don’t know about export, and the web UI side is also an unknown.

I’d note that recent emphasis has been on reliability. If that push ends, maybe features can be added…

(Wow, post something before sleep and wake up to so many detailed replies. The power of the internet.)

Thanks for all the pointers - to one of the posters - it was more of a feature request, given what I have seen as strong points in Arq.

From the various replies, it seems the most popular idea is “putting the database with the backup” - this allows quite a few different things (easier restores to the main computer if it craps out; my request about pulling files out of the backup from a third-party computer).

I wonder what the architectural difference between Arq and Duplicati is - in Arq’s case, the files in the backup store are enough to see the backup history and to pull out files without a large initial ‘build’ step. It seems Duplicati can’t do that without the database.

Is it because Duplicati amalgamates many small files into one large data file? If that is the reason the database is needed, then it seems an irreconcilable choice has to be made - Arq, database-free but with many little data files, vs. Duplicati, with a database but an easier-to-handle backup store since it uses larger blob files.

I think the main reason is the deduplication. If Arq supports deduplication, then I wonder how it does it without a database. There has to be some method to track blocks and hashes when dedupe is used.

I need to research Arq more …

Arq definitely does some deduping - I tested it by making a copy of a 4 GB file and the storage backend didn’t grow that much. I also did some testing where I made a file which is basically two concatenated copies of another file, and the backup was both fast and space-efficient.

The big wish-list item would be for the Arq guy to rethink the data block setup, as right now Arq tends to create a lot of little data block files - it is really weird that a backup of 200k files can result in 450k data block files…

I trust Duplicati to be reliable at restoring my most important data - a “small” dataset of just a few gigabytes. For the 1 TB sized stuff with millions(?) of files, I got too afraid that Duplicati would never even start the restore in my lifetime, just doing database business, so I let someone else do the big stuff.
I tried Arq before, but can’t remember the reason why I jumped ship.

It might be less popular if people understood that it might make backups hugely slower due to the upload. There were once some forum reports that found pretty much the whole DB gets uploaded, because of (probably) the semi-random distribution of updated 4096-byte SQLite pages, which defeats deduplication.

Download of a big database is also going to be slow, and you might prefer the current design, which attempts to download only what it needs for the particular request. The “partial temporary” database unfortunately can only be used once, and a full database Recreate is kind of awkward and risky, since generally it would go on a backup job, and if the user hits “backup” they might wipe out the old backup.

I just did this on a different computer with Duplicati installed. I authenticated with the (very hard to type) AuthID, gave it my encryption password, used the version dropdown and file tree, then restored my file.

How much better is the traditional client version of Arq? The newer cloud version claims to do web restores, probably because it’s an integrated system, not bring-your-own-storage like Duplicati and standard Arq.

Is this just a speed difference? The hard-to-type AuthID that Duplicati uses is for providers using OAuth authentication. Others use other schemes, but any secure scheme will require somewhat long typing…

Recreate on 2.0.5.1 is hugely faster than 2.0.4.5/23 could have been, so be sure to test it when timing.

I don’t think the AuthID piece is a must-have - from my understanding, there is no need to carry around the long AuthID number - you can authenticate on the new computer separately (i.e. have a different, independent AuthID number), and once you have access to the Duplicati backup-store files you can do the rest of the restore steps.

The traditional Arq standalone client works much the same - you add the backup store to the list (i.e. the authentication step), and afterwards you can access backups from other computers on that store.

For the cloud version - I don’t have access (it is chargeable and Mac-only) but I think it would be roughly the same - the integration isn’t black-box tight; part of the Arq marketing story is that you have full access to the storage backend (they use Wasabi) and the format is fully documented, so you can verify the backup independently and theoretically do a restore by hand - their web front end is just to make things easier.

Don’t get me wrong - I think Duplicati is much stronger. As part of reviewing my backup methodology, I am running Arq and Duplicati side by side on both pCloud and OneDrive, and Duplicati really hoses Arq in terms of speed (we are talking half a day for Duplicati vs. almost a week for Arq).

I don’t know the method Arq uses - but having the backup store contain everything seems to make it less error-prone. While it is dumb slow, I feel more confident in remove-unreferenced-objects and validate-objects on Arq than in the similar functions on Duplicati. When I do any backup management tasks on Duplicati, I do seem to regularly hit the ‘red error box’, and it is not easy to resolve…

Frequently so, and that’s even simpler, but some services have a limit and kill older refresh tokens.
Looking it up now, I see that Google Drive will let me get 50 refresh tokens before tossing old ones.

Duplicati’s Direct restore from backup files has a link to make an AuthID, but I didn’t use it, wanting to try my typing skill. As expected, some letters and numbers are too similar in some fonts. An unexpected issue was that the colon in the AuthID turns into a %3A, but you’d better type a colon.

Feel free to post details in a topic or GitHub Issue, but 2.0.5.1 is more interesting than old releases.

I am not a big security guy - but I think the intention with the AuthID tokens is for them to be one-use identifiers and not carried around as an alternative to passwords. I suppose if somehow the limits (e.g. the 50 you mentioned from Google) get in the way, then workarounds like this are needed.

Will do - just waiting for a big job to finish up in a few hours, and then I can start addressing the latest (!) red box error.

AuthID tokens are a Duplicati invention (not mine) explained in How we get along with OAuth. OAuth (technically OAuth 2 here) has more goals and uses than I can describe, but in Duplicati a big one is avoiding the need for Duplicati to ever know your username and password. You might notice that the screen that you type it into is directly at the service provider, such as Google if you use Google Drive.

While a refresh token isn’t “one use”, it’s less powerful than a username & password because it’s tied in various ways, e.g. to Duplicati. The indirect design allows you to change your password without killing existing tokens, to revoke access for a given app without changing your password, to limit the loss if a token becomes known, etc. And the Duplicati AuthID is one step removed from OAuth. It only works in Duplicati.
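For anyone curious about the mechanics: the refresh-token part is standard OAuth 2, where a long-lived refresh token is periodically traded for a short-lived access token. A generic sketch (the endpoint URL and values are placeholders, not Duplicati’s actual OAuth service):

```
curl -X POST https://oauth.example.com/token \
     -d grant_type=refresh_token \
     -d refresh_token=<stored-refresh-token> \
     -d client_id=<app-client-id>
```

The short-lived access token in the response is what actually touches your storage, which is why revoking the refresh token (or the app) cuts off access without a password change.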


There are many ways to improve this. Duplicati has two design goals that make this harder to improve:

  1. Backend compatibility for ā€œdumbā€ backends (list/upload/download/delete only)
  2. Confidentiality from the storage provider (i.e. encryption, trust-no-one)

Most backends deal poorly with millions of files, so we need some way to group them together. Duplicati uses compressed archives for this; other systems use sub-folders, which works great on object-storage backends like S3.
The use of compressed archives reduces the number of remote files (and number of remote calls) while also adding a layer of anonymization (you cannot see if it is many small files, one large file, etc). Encrypting larger blocks with different keys also makes it harder to extract information about the real key, as opposed to encrypting many smaller files with the same key.

The downside to this approach is that you need something that can “look inside” the compressed archives. Since we cannot expect to run anything on the server, we need to download the entire archive, decrypt, extract, and parse the contents. This is not a fast process, but it is required. Duplicati “solves” this by keeping a local database as a replica of the remote data, but this makes it really hard to access from another machine.
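To make that concrete, a Duplicati destination typically ends up holding three kinds of remote files, roughly like this (illustrative names with the random parts shortened; the .aes suffix appears when encryption is on):

```
duplicati-20200101T120000Z.dlist.zip.aes   <- the file/version list for one backup
duplicati-b<random-id>.dblock.zip.aes      <- an archive of raw data blocks
duplicati-i<random-id>.dindex.zip.aes      <- a small index describing one dblock
```

The dindex files are what let a database recreate avoid downloading every dblock just to learn what is inside it.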

One suggestion is to also make a backup of the local database, so you only need to download one file, but it can be quite large.

If there are specific operations you want to support, we can perhaps upload just that metadata.

Again, the grouping into compressed archives makes this harder. You might be able to delete a small chunk that is no longer needed, but since it is located inside a compressed and encrypted file, you need to download the file (or recreate it), remove the unwanted entry, and upload it again.

This is what the compact feature does, and you can control when Duplicati considers it worth the trouble to rewrite a file, based on how much wasted space it takes up.
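For reference, running compact by hand looks roughly like this (URL and passphrase are placeholders; `--threshold` is the wasted-space percentage that makes a volume worth rewriting):

```
Duplicati.CommandLine.exe compact "<storage-URL>" --passphrase=<your-passphrase> --threshold=25
```

The same threshold also governs the automatic compact that Duplicati evaluates after backups and deletes.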

Since Duplicati is using a “shared data” approach, it is vital that the backup is intact. If a single remote file is broken, it might ripple into many files. My strategy is to be as lenient as possible with errors, but always guarantee that the backup is intact or stop. The worst possible scenario for me would be running the backup diligently, only to later find out that I cannot restore due to some error that was just a warning or expected to be a temporary backend error.

The errors are usually caused by an internal consistency check failing. Since the check fails, there is no telling what would happen if you made another backup on top of faulty data, and this is why it prevents you from continuing.

This situation should not happen, and like @ts678 mentions, there has been much progress on the stability in the latest beta release.

While my comments on 1-3 were not offering solutions, merely explaining why the structure and goals of Duplicati prevent the features you want, this request is “easily” doable. All the information is in the local database, so we can simply query it and show it. It is mostly a matter of finding a way to display a lot of information in a compact and concise manner in the UI.

Hope I answered that above.


Quite clear, at least on the Duplicati side. I might have a bit more of a poke around to confirm the Arq dedup behaviour. I vaguely recall testing things like this some time ago:

  1. Big file A
  2. Two copies of Big file A
  3. File B = Two copies of big file A concatenated
  4. File C = File B with some little changes at the start
  5. File D = File B with some little changes at the end

In all cases the storage backend was much the same size, which made me believe that Arq is doing some block-level deduping. I wonder how it does it all without a database…
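One way to get deduplication without a separate database is content-addressed storage: name every chunk after the hash of its contents, and the store itself becomes the index. A toy sketch of the idea (purely conceptual, not how Arq or Duplicati actually store things; real tools also use content-defined chunk boundaries so shifted data still dedupes, plus compression and encryption):

```
# split a file into fixed 1 MiB chunks and store each chunk under its SHA-256
# hash; identical chunks land on the same name, so a second copy of the same
# data costs (almost) nothing
mkdir -p store
split -b 1048576 bigfile chunk_
for c in chunk_*; do
  h=$(sha256sum "$c" | cut -d' ' -f1)
  cp -n "$c" "store/$h"    # -n: skip if a chunk with this hash is already stored
done
ls store | wc -l           # number of unique chunks actually kept
```

A per-backup manifest then only needs to list the hashes in order to reassemble each file.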

I do not have insight into how Arq does things, so I only commented on the Duplicati stuff. Arq looks very Mac-specific, so it probably creates a remote-mounted sparse bundle of sorts, like what Time Machine does.

The file formats are nicely documented on their site, but I haven’t studied them enough to comment…