Should Duplicati 1.3.x still be used? Remote corruption, local database size


#1

Hi guys. I’m glad I found the Duplicati software and this forum, which looks active compared to other open source projects.

I’m starting to install Duplicati on my home and family computers and was wondering whether there is any reason I should use the stable 1.3.x version, or whether I should just trust my backups to the 2.0 beta release.

Honestly, I have only tested version 2.0 for a couple of days. My first test was to set up a backup job, let it run a couple of times, and then simulate corrupted or missing files at the remote destination, as if the remote host had corrupted or lost a file for some reason. I expected Duplicati to identify the missing or corrupt file and perform the appropriate repair (re-upload) to bring everything back to the expected state. Unfortunately, I discovered that this won’t happen: Duplicati can only repair the local database, and even with the local database repaired, the backup job will not run anymore, so I would need to create a new job and upload the whole backup set again. I don’t know if Duplicati 1.3.x behaves the same way in this case.

I expected it to be able to recover from remote corruption, mainly after reading its description/features page:

"Technology: Fail-Safe Design
Duplicati is designed to handle various kinds of issues: Network hiccups, interrupted backups, unavailable or corrupt storage systems. Even if a backup run was interrupted, it can be continued at a later time. Duplicati will then backup everything that was missed in the last backup. And even if remote files get corrupted, Duplicati can try to repair them if local data is still present or restore as much as possible."
Fact Sheet • Duplicati

I researched this and found a couple of commands to list and purge broken files, but in my quick tests it ended up having to purge all (or most) of the files and upload them again. I’m not sure if that was because my test backup set was very small.

Another “sad” thing I noticed (using 2.0) is the large size of the local database. I did an initial job uploading 10GB to Google Drive, and the local database came out at 44MB. That works out to 4.4MB of database (44MB/10GB) for each 1GB uploaded. So if I back up my laptop data (about 100GB), the database would grow to about 440MB, assuming the growth is proportional. And if I set up multiple backup jobs, including my local NAS (~2TB), I would have to set aside 9.2GB just for the local databases. That is a really big space requirement, especially on laptops with disk space constraints.

I believe that currently the only trick to lower this huge size would be to set a larger blocksize (it defaults to 100KB), maybe 500KB or up to 1MB. I don’t know what the resulting local database size would be, but I’m a bit afraid of changing these default settings and having issues in the future, maybe when restoring backup files (from the original source or from another laptop, as I have seen others having difficulties; here).
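To make the scaling explicit, here is the arithmetic as a tiny sketch. It assumes the database grows strictly in proportion to the source size, which is only a rough approximation, since file count and path lengths also matter:

```python
# Rough proportional estimate of Duplicati's local database size,
# extrapolated from one observed data point: a 44 MB database
# after backing up 10 GB of source data.

OBSERVED_DB_MB = 44.0
OBSERVED_SOURCE_GB = 10.0

def estimate_db_mb(source_gb: float) -> float:
    """Scale the observed MB-per-GB ratio linearly to another source size."""
    return source_gb * OBSERVED_DB_MB / OBSERVED_SOURCE_GB

print(estimate_db_mb(100))   # laptop, ~100 GB source -> 440.0 MB
print(estimate_db_mb(2048))  # NAS, ~2 TB source -> 9011.2 MB, i.e. roughly 9 GB
```

The ~9 GB result for 2TB lines up with the ballpark figure above; the exact number depends on whether you count TB/GB in binary or decimal units.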

I realize I used the same topic to talk about different things (which version to use, plus features such as database size and corruption recovery), but I thought it was better to post them together, as they all relate to which version is the better choice.

Thanks in advance for any thoughts.


#2

If Duplicati 1.3 does what you need, then you should feel confident using it - however, there is no longer a developer (and very little support) for it, so you shouldn’t expect any new features (or even bug fixes) to be made for it.


As you said in your post, Duplicati 2 is active both in the forum and in development, so there’s a good chance issues you find can be resolved either through support or code updates.

In your failure simulation (removing / corrupting a destination file), Duplicati’s ability to “recover” depends on what file TYPE was affected. If a dindex or dlist file is lost, it can be recreated - however, if a dblock file goes missing, then that data is gone.

How to handle the “gone” data varies depending on what you’re doing with it. If you’re doing a restore, Duplicati will restore all the file parts it can and fill in the “gone data” blocks with 0.

If you’re trying to continue doing backups, then Duplicati will complain because each new backup assumes previous backups are valid. To get around this, the purge option is used to remove all trace of the corrupt / missing files (and what they may have stored) from the backup. In this instance, you will have lost files from your backup - but new backups can be made without errors caused by the missing dblock files.


Database size is a known “issue” with Duplicati. There are some discussions about how to minimize it (increased block size is one of them) and there are some talks of rewriting the database structure so it’s more efficient.

At present, I think “most” of the unaccounted-for space is used to store the paths associated with backed-up files. If you have 10G of data in a lot of files (or deep folders), this can make for a larger database than if you had only 10 files, each 1G in size.

Since most of the database contents can be estimated, it might be useful to add an “estimated database size” feature to the job edit process that would scan provided source locations, apply filters, then report what the EXPECTED database size could be…
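As an illustration of what such an estimator might look like, here is a hypothetical sketch. This is not an actual Duplicati feature, and the per-path and per-block overhead constants are invented for illustration, not taken from Duplicati’s real schema:

```python
import os

# Hypothetical "estimated database size" sketch: walk a source folder,
# then charge an assumed overhead per stored path and per block reference.
# PATH_OVERHEAD and BLOCK_OVERHEAD are made-up illustrative constants.

PATH_OVERHEAD = 200         # assumed bytes of row overhead per file path
BLOCK_OVERHEAD = 50         # assumed bytes per block hash/reference
BLOCK_SIZE = 100 * 1024     # Duplicati's default blocksize (100 KB)

def estimate_db_bytes(source_root: str) -> int:
    """Very rough guess at local database size for one source folder."""
    total = 0
    for dirpath, _dirs, files in os.walk(source_root):
        for name in files:
            full = os.path.join(dirpath, name)
            try:
                size = os.path.getsize(full)
            except OSError:
                continue  # skip unreadable files
            blocks = max(1, -(-size // BLOCK_SIZE))  # ceiling division
            total += len(full.encode()) + PATH_OVERHEAD + blocks * BLOCK_OVERHEAD
    return total
```

A real estimator would also have to account for file metadata, whole-file hashes, and deduplication across files, which this sketch ignores.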


#3

This would be cool, but it can only be done after the job has already run a file count and updated the metadata.


#4

It is not just storage. There is currently a fairly large overhead for storing paths, so if you have long paths and many small files, this will cause larger databases, and of course the inverse means a smaller database.

I can’t think of any reason to start using 1.3.x. If you are already using it and it works, then fine. But if you are just starting, I would not recommend it, as it is not maintained.

That is very much by design. Duplicati tries very hard to make sure that you can restore your files again when you need them. The worst failure a backup program can make is to make everything seem in order, and then fail when you desperately need your data back.

For this reason, when you mess with the remote storage, Duplicati cannot guarantee that anymore and it stops, and refuses to run more backups until you have fixed whatever the problem is. Note that you can still restore from such a tainted backup, but you cannot continue the backup.

Sometimes such errors happen for legitimate reasons, and you can use the purge-broken-files feature to remove all traces of the missing remote volumes, which will bring Duplicati back in a state where it can guarantee correct operation.

Duplicati 1.3.x relies 100% on the remote file storage, so it just looks at that and continues from there. This is a much weaker guarantee, as some storage providers occasionally lose files, but it does allow you to manually remove a complete set of backups.

That is very likely because of the size. The remote volume design means that files are grouped into volumes; this has the upside that you have fewer remote files, but the downside that you cannot just delete a single piece of data.

It is still labelled “beta” because there are people who report various not-so-nice results.

Generally it works well, and we see a lot of daily activity:
https://usage-reporter.duplicati.com/


#5

Thanks @kenkendk! I have already decided to go with the 2.0 version and started to set up Duplicati on my home Mac and Windows PCs. I’ll be using it mainly with Google Drive and OneDrive (for Business).

Unfortunately, OD4B doesn’t seem to work for me (read error) using the official API, but I managed to use it by mounting it as a network drive (with the free version of CloudMounter), as described here. I believe it’s already fixed in the canary release, but I’m not sure how long it will take to reach the public beta. Meanwhile, I’ll use it as a network drive and, once the fix lands, just switch to the OD4B API connection.

After long indecision, I decided to go with a 500KB block size (and a 200MB dblock volume size) for backups up to 500GB. For backup sets on my NAS and external disks (1TB ~ 2TB), which don’t see many file changes, I will use a 1MB block size with dblock volume sizes of about 500MB to 750MB.
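The blocksize trade-off comes down to how many block references the local database has to track. A quick sketch, assuming for simplicity that every block is unique and nothing deduplicates:

```python
def block_count(source_bytes: int, blocksize: int) -> int:
    """Number of fixed-size blocks needed to cover the source (ceiling division)."""
    return -(-source_bytes // blocksize)

KB = 1024
GB = 1024**3

for bs_kb in (100, 500, 1024):
    n = block_count(500 * GB, bs_kb * KB)
    print(f"{bs_kb:4d} KB blocksize -> {n:,} blocks for 500 GB")
# 100 KB  -> 5,242,880 blocks
# 500 KB  -> 1,048,576 blocks
# 1024 KB ->   512,000 blocks
```

Five times fewer blocks to index at 500KB should mean a correspondingly smaller block table, though it also makes deduplication coarser, so small changes re-upload more data.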

I think these settings will be fine, as I have a 50Mbps/5Mbps broadband connection and speeds to Google Drive and OneDrive are excellent here, mostly 80~100% of my connection capacity.

Thanks again for the awesome software. I’m very excited to see future versions coming, with a better and more polished UI and features. After just a couple of days playing with Duplicati, I feel that it sometimes doesn’t give the user an accurate real-time status, so you can’t tell whether it’s stuck or not. For example, if a job is running and you click “Run backup” on another job, it doesn’t tell you anything, such as whether it has been scheduled to run afterwards. Perhaps its internals are working fine, but some UI tweaks could make it feel more responsive to users.

For example, I’ve seen this once: it was apparently idle, I clicked “Run backup” a couple of times and it stayed silent; then I waited a couple of minutes, tried again, and it started the backup job.


#6

It’s worth noting that at a 500MB volume size, Duplicati could easily spend 15-20 minutes uploading changes even if you change just 1 bit in your source data. And that’s assuming your 5Mbps upload is the bottleneck, and not the disks or the CPU that has to encrypt 500MB.

Also worth noting that at 500MB volume size Duplicati will be using up to 2GB of local disk space to prepare volumes before uploading them.
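Those figures can be sanity-checked with quick arithmetic, assuming the 5Mbps uplink is the only bottleneck and ignoring compression, encryption, and protocol overhead:

```python
def upload_minutes(volume_mb: float, uplink_mbps: float) -> float:
    """Pure transfer time for one volume: MB -> megabits, divided by link speed."""
    return volume_mb * 8 / uplink_mbps / 60

print(upload_minutes(500, 5))  # ~13.3 minutes of raw transfer per 500 MB volume
print(upload_minutes(200, 5))  # ~5.3 minutes for a 200 MB volume
```

Real runs will be somewhat slower than the raw transfer time, which fits the 15-20 minute estimate above; the ~2GB of local scratch space follows from several volumes being staged ahead of the upload (e.g. 4 × 500MB).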

I can confirm that the scheduler works exactly as you expect, but as you note, there is no way to tell in practice other than using debug mode :wink:
There are a couple of GitHub tickets related to this.




#7

Yes, I understand… but I’m thinking of using such a high volume size only for my “archive collection” (photos, etc.) on the NAS, mainly to avoid ending up with about 40,960 dblock files plus 40,960 dindex files (2TB/50MB). With 500MB volumes, the number of files in a single backup destination folder would be more reasonable (2TB/500MB = 4,096 dblock files + 4,096 dindex files).

Just to fully understand: if I have a 2TB backup in Duplicati using a 500MB volume size and delete some files at the source, it won’t need to upload or re-process dblock files, right? It will only mark them for deletion and fully remove them when the retention policy applies. So Duplicati will only have to download dblocks and reprocess them if files are changed, as you noted. I’m asking because I have a NAS and an external HD that rarely see edits/updates; mostly there are new files or file deletions.


#8

Duplicati may re-download and repack volumes that are almost empty (luckily these are small), and it may also repack volumes with a lot of “deleted file” blocks (and these could be 500MB). But if you rarely delete or change existing files, you should not experience this often.