Daily backups without storing all the backup files

xyz · March 28, 2019, 8:31pm

Hello!
I cannot find the answer to what I want to do in the search bar, so I apologize if my question has already been answered.

Is there any way to perform daily backups to the cloud without having all the backup files stored in the cloud, keeping them on a private server instead? That is, the cloud would hold only the list of file hashes, not the files themselves (which occupy more space). If the Duplicati policy is “keep all backups,” there would be no need to delete or modify old data, and therefore no need to store it in the cloud. Only the hashes should be stored in the cloud, right? Am I missing something?

Thanks in advance for your answers!

ts678 · March 29, 2019, 1:36pm

Possibly. Duplicati, as a backup program, considers your data valuable and worth backing up somewhere else at a distance of your choice. Remote gives better protection against local disaster, but can be slower. Duplicati’s hashes and other administrative data locally record what’s stored at the destination, because to access the destination constantly during backup would be slow. Duplicati only uploads changes so it must know what’s already there so it can be referenced, including back-referenced if data did not have changes.

Block-based storage engine is a simple overview, but in terms of stored hashes, the technical details are here.
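To make the "only uploads changes" point concrete, here is a minimal sketch of block-level deduplication driven by a local record of hashes. It is an illustration of the general idea, not Duplicati's actual code: the function names, the tiny block size, and the in-memory `known` set (standing in for the local database) are all assumptions made for the example.

```python
import hashlib

BLOCK_SIZE = 4  # tiny block size for illustration only; real block sizes are far larger


def split_blocks(data: bytes, block_size: int = BLOCK_SIZE) -> list[bytes]:
    """Cut data into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def backup(data: bytes, known_hashes: set[str]) -> tuple[list[str], list[bytes]]:
    """Return (hash list describing this version, blocks that still need uploading).

    known_hashes plays the role of the local database: it records which
    blocks already exist at the destination, so the destination does not
    have to be read to decide what is new.
    """
    version, to_upload = [], []
    for block in split_blocks(data):
        digest = hashlib.sha256(block).hexdigest()
        version.append(digest)          # every version lists all of its blocks
        if digest not in known_hashes:  # only unseen blocks are uploaded
            known_hashes.add(digest)
            to_upload.append(block)
    return version, to_upload


known: set[str] = set()
v1, up1 = backup(b"aaaabbbbcccc", known)  # first run: every block is new
v2, up2 = backup(b"aaaaXXXXcccc", known)  # second run: only the middle block changed
print(len(up1), len(up2))
```

The second version still refers to the unchanged blocks by hash, which is the back-referencing mentioned above: the hash list says what a version consists of, but the block data itself must exist somewhere to restore it.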

Possibly you have a different model in mind? Duplicati’s destination can certainly be a private server, if you have one and it supports one of the storage types Duplicati supports, but how do hashes fit in your model? Some of the comments remind me of the cloud cold storage wish (data in cold storage, but metadata not).

xyz · March 29, 2019, 2:09pm

Thanks for your reply @ts678 . You are right, it’s kind of a cold storage model. I want to use the cloud as an intermediary that contains only the files changed during the last week. In more detail:

I’d like to have a certain level of versioning in my backups. I will use the “keep all backups” option, and I will upload all the versions to the cloud from my PC, but I will keep in the cloud only the last 10 syncs.
I want to perform the first backup locally, from my PC to a NAS. I don’t want to keep the NAS online all the time, so it will wake up automatically only once a week (for security, and to spare the HDDs, which will otherwise be powered off).
I don’t want to store the full backup in the cloud, only the “incremental” changes of the last week. Once every week, the server will wake up, and it will copy the incremental backups of the last 7 days from the cloud to the NAS. It’s worth noting that this way there would be no need to enable remote access to the NAS since it only has to copy files from the cloud to the HDDs.
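The weekly job described in these three points could be sketched roughly as follows. This is only an assumption-laden illustration: it supposes the cloud folder is reachable from the NAS as an ordinary directory (for example through a sync client), and `weekly_sync`, `cloud_dir`, `nas_dir`, and `keep_days` are hypothetical names, not anything Duplicati provides.

```python
import shutil
import time
from pathlib import Path
from typing import Optional


def weekly_sync(cloud_dir: Path, nas_dir: Path, keep_days: int = 7,
                now: Optional[float] = None) -> list[str]:
    """Copy files missing on the NAS, then drop old files from the cloud folder.

    Returns the names of the files removed from the cloud folder.
    """
    now = time.time() if now is None else now
    cutoff = now - keep_days * 86400  # anything modified before this is "old"
    nas_dir.mkdir(parents=True, exist_ok=True)
    removed = []
    for src in sorted(cloud_dir.iterdir()):
        if not src.is_file():
            continue
        dst = nas_dir / src.name
        if not dst.exists():
            shutil.copy2(src, dst)  # copy2 also preserves the modification time
        # Delete from the cloud folder only once the NAS copy matches in size.
        if src.stat().st_mtime < cutoff and dst.stat().st_size == src.stat().st_size:
            src.unlink()
            removed.append(src.name)
    return removed
```

A size comparison is a weak check; comparing hashes before deleting would be safer. Also note that removing files behind Duplicati's back is exactly what makes it report missing files on the next run.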

The problem is that Duplicati shows a “missing files” error as soon as I delete any file from the cloud. In theory, if there are no deletions or modifications of the old files, what I propose can be done by checking only the hash list of the files. In this way, you could record which data to delete for which version (without actually removing it), or add only the modified files, without having all the old files in the same directory (only their hashes).

In this case, it would not be necessary to use a lot of space in the cloud, only the space needed to store the files of the last week, it could be used with any free cloud service, and our data would not entirely depend on the cloud servers (but on the offline copies). AFAIK, it’s a more centralized proposal than fully-cloud (depending on the number of replicas), but more private, secure and cheap. And you can program as many servers doing the same “copy and offline backup” task as you wish.

EDIT: I have fixed the issue by enabling the “no-backend-verification” option. This way it doesn’t show any error, but it depends completely on the local database.

I’d like to have something in between that also verifies the latest dlist and dindex files but doesn’t need the dblock files at all. Is it possible?

ts678 · March 29, 2019, 10:32pm

This can probably be treated like do-it-yourself cold storage, where in the event of a disaster you put the cloud back together with a huge upload from the NAS (assuming the NAS has survived, if a local disaster happens). Meanwhile, you have to keep Duplicati from wanting access to old data, even to sanity-check things. --no-backend-verification can do that, but it will raise the risk. --no-auto-compact should also be used.

Basically, it’s a blind upload of your file changes at the block level, so you probably can’t even restore current files if their blocks are on the NAS. You can certainly try a proof of concept, but this may sacrifice reliability. How would you test the backup if you can’t restore sample source files or even verify a sample of the backend?

Here are ideas of doing a local backup as a quick-restore base, and a more current backup to the cloud:
2 Backups available, which to choose?
Restoring a backup with deletion of files not contained in backup
If you keep fewer cloud versions than local, it would stay smaller, but it would normally back up all the files.

“Is it possible to set a date limit?” is a possible way to back up only files more recent than the local baseline. There might be other posts about this around, for example if one wants to use an offline image as a base.

Further posts on the cold storage idea are also around. Some might refer to a brand name such as Glacier.

I don’t understand all the talk about hashes and hash lists. There are many hashes. Which do you mean?

Remote files are already not deleted when you delete a version. The marking is just done in the database. Eventually (if needed and allowed) a compact is done, which reclaims some storage from the destination.

Similarly I don’t believe (but you could test if you like) that Duplicati needs to have files in the destination to upload changes. It relies on its database, which contains info about destination, and can be rebuilt from it. Recreates can be slow (depending on size and other factors), and your method requires NAS upload too.

Looking at RemoteListAnalysis, it combines verifying what’s there with updating its remote volume states, and you certainly want dblock states tracked properly. VerifyRemoteList runs this, and is run both after the backup, and maybe before (to make sure files look as expected) unless --no-backend-verification was set.

Conclusion is that there doesn’t seem to be an option to set this up. I’m not sure whether developers even would want to. It seems an unusual case. Technically, you have a severely broken backup at that moment. Manually messing with the destination files is considered a bad idea, and I know cold storage does this.

From a high-level view, you have a NAS with files on it that can’t directly be restored, and you have a cloud that’s possibly the same way for files that have some or all of their blocks on the NAS. The NAS is trying to help by reducing cloud storage amounts. How is the cloud helping in this scenario? If you don’t care about possible local disasters, just back up directly to the NAS, and worry less about somebody breaching the cloud. Note, though, that the encryption is done locally at the client, so while the cloud copy can be destroyed, reading the data is tougher.

xyz · March 30, 2019, 4:24pm

Thank you for such a complete response and for your time.

Meanwhile, you have to keep Duplicati from wanting access to old data even to sanity-check things. --no-backend-verification can do that but it will raise risk. --no-auto-compact should also be used.

Thanks for pointing that out!

How would you test the backup if you can’t restore sample source files or even verify sample backend?

With this great script running in the NAS: Complete, incremental or differential - #12 by kenkendk

I don’t understand all the talk about hashes and hash lists. There are many hashes. Which do you mean?

The hashes of the backed data (I think dblock files store the backed data, right?), which allow verifying that the information is the same and that its content has not changed without storing the data itself. Therefore, from the verification point of view, having the hashes is equivalent to having the data (but with much smaller storage of information, and assuming both that the data has been correctly stored and that it matches the hashes).
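As a rough illustration of verifying by hash, the sketch below records a SHA-256 manifest for a folder of backup files and later checks the folder against it. It is a generic example under stated assumptions, not the script linked above and not Duplicati's own verification; `sha256_of`, `build_manifest`, and `verify` are hypothetical helper names.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large volumes need not fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(folder: Path) -> dict[str, str]:
    """Map each file name in the folder to its SHA-256 digest."""
    return {p.name: sha256_of(p) for p in sorted(folder.iterdir()) if p.is_file()}


def verify(folder: Path, manifest: dict[str, str]) -> dict[str, list[str]]:
    """Report files that are missing or whose content no longer matches."""
    missing, changed = [], []
    for name, expected in manifest.items():
        path = folder / name
        if not path.is_file():
            missing.append(name)
        elif sha256_of(path) != expected:
            changed.append(name)
    return {"missing": missing, "changed": changed}
```

A manifest like this can confirm that stored files are intact, but it cannot stand in for them: a hash detects a missing or altered file, it does not let you rebuild one.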

Similarly I don’t believe (but you could test if you like) that Duplicati needs to have files in the destination to upload changes.

If I run Duplicati without --no-backend-verification, it raises a missing-files error, I think because it verifies the content of the destination before uploading further changes.

I’ll tell you about the POC with --no-backend-verification enabled. After uploading a couple of backup files to the cloud, I deleted some of them, and I could continue uploading backups with further changes. Similarly, I was able to restore a previous version only (as you pointed out) if I uploaded the necessary version files from the NAS to the cloud. I repeated the process, deleting all the files from the cloud instead of only some, and I was able to upload and restore backups as well. I haven’t needed the --no-auto-compact option so far in “Keep all backups” mode, but I will enable it anyway.

The NAS is trying to help by reducing cloud storage amounts. How is the cloud helping in this scenario?

Unlike the NAS, the cloud provides high availability with zero maintenance. It also isolates the NAS from the outside world, since the NAS does not need to accept incoming connections as a server (except in case of a disaster); it only has to download the necessary files from the cloud once a week. Finally, it opens the door to 1-to-many backup schemes. I think this is an interesting set of advantages to consider.

ts678 · April 1, 2019, 9:52pm

That can probably work pretty well for integrity checking to make sure the files are as they should be, and it’s not just dblock files you care about – the dlist files are critical. The dindex can probably be rebuilt from the database after using the dblock files to rebuild the database, but that could get quite time-consuming.

--full-remote-verification (which you can’t directly do) would do a more thorough verification than just hash, but the best verification (which Duplicati doesn’t automate) is a test restore, especially with --no-local-blocks.

This section hints at a clearer picture (right or wrong) of the situation being distributed clients, some outside of the NAS LAN, and with the main emphasis being disaster recovery as opposed to ordinary file restores.

That sort of fits with the painful-upload-to-restore being less painful than total loss of source data if a remote disaster occurs. However, depending on locations, carrying over a portable hard drive might take less time.

Less time still can be taken if the NAS site always has the job database matching the latest job, e.g. using --run-script-after to pass it through a different cloud folder using Duplicati.CommandLine.BackendTool.exe.

Assuming you know the encryption passphrase, you then have everything you need for fast file restores to then pass to the client somehow. You also have a wider ability to test, but don’t do a repair with a database older than the remote files on the NAS, because it can reconcile the difference by deleting the excess files.

You could aggressively clean the backend unless you also want to allow self-service restores at the client for files that fit entirely-from-creation into the still-present timeframe of 0 to 7 days. But if you really want to allow self-service restores, it might be easier to back up to client disk and then find some way to copy remotely somewhere else, maybe still taking a just-passing-through style with cloud storage on the way to the NAS.

I just hope the schemes don’t get more interesting-in-a-bad-way than you plan. There’s more to go wrong.

Good luck!

JonMikelV · April 2, 2019, 12:32pm

Do I have this right?

So the cloud is a thin backup destination and the NAS is the cold storage.

Keeping dindex and dlist files on both and dblock only on cold allows general functionality (not restores or dblock checks) to work without complaint.

Restores would have to be done from the NAS or EVERYTHING copied back to the cloud and restore from there.

The cloud restore could get ugly if your NAS has more room than your cloud…
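The split summarized above (dlist and dindex on both sides, dblock only on the NAS) and the size concern can be sketched like this. The example assumes the volume type appears in the file name as `.dblock.`, `.dindex.`, or `.dlist.`; the file names, sizes, and helper functions are made up for illustration, so check them against your own destination.

```python
KINDS = ("dblock", "dindex", "dlist")


def volume_kind(name: str) -> str:
    """Classify a remote file by the type marker in its name (assumed naming)."""
    for kind in KINDS:
        if f".{kind}." in name:
            return kind
    return "other"


def plan_split(sizes: dict[str, int]) -> tuple[list[str], list[str], int]:
    """Return (files kept in the cloud, files kept only on the NAS,
    bytes that must be uploaded again before a restore from the cloud)."""
    keep_in_cloud, nas_only = [], []
    for name in sorted(sizes):
        if volume_kind(name) == "dblock":
            nas_only.append(name)
        else:
            keep_in_cloud.append(name)  # dlist, dindex, and anything unrecognized
    reupload_bytes = sum(sizes[n] for n in nas_only)
    return keep_in_cloud, nas_only, reupload_bytes


def restore_fits(reupload_bytes: int, cloud_free_bytes: int) -> bool:
    """True if the dblock files held only on the NAS would fit back in the cloud."""
    return reupload_bytes <= cloud_free_bytes


# Hypothetical destination listing: name -> size in bytes.
files = {
    "duplicati-20190328T203100Z.dlist.zip": 2_000,
    "duplicati-b01.dblock.zip": 50_000_000,
    "duplicati-i01.dindex.zip": 10_000,
    "duplicati-b02.dblock.zip": 50_000_000,
}
cloud, nas, need = plan_split(files)
print(cloud, nas, need, restore_fits(need, 80_000_000))
```

If `restore_fits` comes back false, a restore through the cloud is not possible in one go, which is the "ugly" case: the restore would have to run against the NAS copy instead.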

xyz · April 8, 2019, 6:32pm

Hi Jon,
That is correct. I want to use the cloud as a temporary and free gateway that is always available.

As you say, I managed to make a successful restore by uploading everything back to the cloud. Nevertheless, you could also restore directly from the NAS, right? I have to try this, but in theory it should be a scenario similar to recovering from a complete loss of data.

xyz · April 8, 2019, 6:37pm

With your answer, it is clear to me how careful I have to be when dealing with the backup. I hope to achieve what I want without breaking anything, and if it goes well, I’ll share my experiments in this thread.

Thanks again for your response!

JonMikelV · April 8, 2019, 9:44pm

Yes and no.

Yes, you can restore directly from the NAS - Duplicati doesn’t really care whether or not the destination files are in the same place to which they were backed up.

As for it being like a total data loss restore, that doesn’t need to be the case. Since you still have the local database, you can use it when restoring from the “alternate” destination location.

Note that in doing restore tests there are some additional parameters you may want to enable - such as the one that tells Duplicati NOT to fetch blocks from the local source files if available…