Real-time backup

Daniel_Gehriger · April 25, 2018, 7:10am

Ken - just a quick question: when using the USN journal, the source size will be displayed incorrectly, as it’s only based on the changed files. I see two work-arounds:

This information can be calculated for a file-set from the DB. If so, how?
I keep track of the source size in the new ChangeJournalData table, updating it based on the previously recorded size and the detected changes.

What’s your advice?

samw · April 25, 2018, 11:57am

@Daniel_Gehriger … I would imagine this would only affect the first “full” backup. The rest are technically incremental. It’s not appropriate to use the USN for a first backup as it’s a rolling log.

Daniel_Gehriger · April 25, 2018, 12:14pm

Sam

This PR uses USN only if an uninterrupted change sequence since the previous (possibly full scan) backup is available, due to the fact that the journal is rolling, as you’ve mentioned. Whenever there is a gap, a full scan will be performed to establish a sane baseline.

Anyway, the size issue is not related to this. It’s due to Duplicati calculating the reported backup size based on the scanned files. But when the journal is used, unmodified files won’t be scanned; their information is carried over from the previous fileset. And since the database doesn’t store the file sizes, we don’t know how much data they represent.

Hence the question of whether I should keep track of the total backup size in the DB.
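
The second work-around (carrying a running total forward) can be sketched as below. This is only an illustration, not Duplicati's code: the `Change` record and `updated_source_size` function are hypothetical names, and the sketch assumes the old size of each changed file can be looked up somewhere.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Change:
    """One file reported as changed (hypothetical record for this sketch)."""
    path: str
    old_size: Optional[int]  # None if the file did not exist before
    new_size: Optional[int]  # None if the file was deleted


def updated_source_size(previous_total: int, changes: List[Change]) -> int:
    """Apply per-file size deltas to the total recorded by the last backup."""
    total = previous_total
    for change in changes:
        total -= change.old_size or 0  # remove the file's old contribution
        total += change.new_size or 0  # add its new contribution
    return total
```

Unmodified files never appear in `changes`, so their sizes stay in the total without being rescanned. After a full scan, the total would simply be recomputed from scratch.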

samw · April 25, 2018, 12:20pm

Yep, you’re 100% spot on. Sorry I misunderstood your first comment. It would seem the only way to pull the accurate file sizes would be from the DB itself as that’s the only place they’re kept locally.

abacus · April 26, 2018, 8:00pm

I’d very much like to encourage you to keep on working on this one. As a long-term CrashPlan user, real-time backup based on file system journaling, without the need to always scan the entire backup set, is for sure one of the very few features I’m missing with Duplicati. I’m very much looking forward to seeing this feature implemented. You’re really awesome, guys.

By the way: I very much like your “safe” approach: never try to work around any issue with the integrity of the journaling information on new or changed files. If in any doubt, just do a full scan. In this case, safety beats performance.

Daniel_Gehriger · May 1, 2018, 8:22am

@abacus: this is already implemented and waiting to be merged into the main code base by @kenkendk. I am successfully running it on several systems in my office. It seems quite stable after about two weeks of testing, although I did have to fix a few issues.

~~One thing that needs improvement, though, is how the backup size is reported.~~ Fixed now.

skidvd · June 1, 2018, 12:59pm

@Daniel_Gehriger … apologies for the basic question, but I am just wanting to confirm that your efforts are targeted at providing real-time/continuous backups for Windows platforms only - correct?

Are there any parallel efforts to support the same on Linux platforms?

Also, I am curious how this (real-time/continuous backup) may impact download requests from cloud storage (as some providers charge for each download - perhaps above a certain threshold?). As Duplicati appears to download with each scheduled backup task to validate some random files, will this “download and validate” behavior essentially be triggered more frequently in response to local file edits (resulting in increased download requests), or can that somehow be throttled?

TIA!

Daniel_Gehriger · June 1, 2018, 2:56pm

@skidvd:

The “real-time” backup feature has been merged into the main branch, and should be part of an upcoming release.
The current implementation is Windows only, and I am not working on extending it to Linux. However, the implementation is generic enough that one can build upon it to add Linux (or rather: a journaled filesystem for Linux, as it’s more file system than OS dependent).
“Real-time” is a bit misleading: this feature does not affect, in any way, the frequency of updates and hence connections to the cloud. What it does is shorten file scanning, reducing the time needed to find all new/modified/deleted files in the backup file set. Because backup time is reduced, one could run backups as often as every five minutes, hence achieving close to real-time backup. But this is not part of the feature, and it would be a conscious decision by the user.

danielharvey · June 24, 2018, 10:33am

Work on this feature sounds excellent. What is involved in achieving this for Mac?

Daniel_Gehriger · August 13, 2018, 2:09pm

@danielharvey: porting this to a different platform basically involves writing a class capable of providing a list of potentially modified files and directories since a previous scan. This is dependent on the file system.
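
As a rough illustration of what such a class might look like, here is a sketch in Python (Duplicati itself is written in C#). All names here (`ChangeSet`, `ChangeProvider`, `changes_since`, `FullScanOnlyProvider`) are hypothetical and do not mirror Duplicati's actual interfaces.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class ChangeSet:
    """Paths that may have changed since the previous scan."""
    files: Set[str] = field(default_factory=set)
    folders: Set[str] = field(default_factory=set)
    checkpoint: str = ""  # opaque marker to store for the next backup


class ChangeProvider(ABC):
    """File-system-specific source of 'potentially modified' paths."""

    @abstractmethod
    def changes_since(self, checkpoint: Optional[str]) -> Optional[ChangeSet]:
        """Return the changes since `checkpoint`, or None when no
        uninterrupted history exists and a full scan is required."""


class FullScanOnlyProvider(ChangeProvider):
    """Fallback for file systems without a usable change journal."""

    def changes_since(self, checkpoint: Optional[str]) -> Optional[ChangeSet]:
        return None  # always ask the caller to do a full scan
```

A port to another platform would add one more subclass that reads that file system's journal, while the rest of the backup logic stays unchanged.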

If you can find documentation about the journaling capabilities of the file system(s) used on Mac, I can guide you.

JonMikelV · August 14, 2018, 5:25pm

Just to be clear (and please correct me if I’m wrong), this feature hasn’t even been rolled into a canary version yet (at least as of 2.0.3.9), so please don’t feel like you’re missing out on the feature just because you’re not running Windows, @danielharvey.

warwickmm · October 13, 2018, 9:48pm

True support for a change journal is one area where Linux is lagging behind, perhaps due to the large number of filesystems supported. inotify (which CrashPlan used) has the problem where one may need to increase the number of watchers, otherwise file change events may be missed. A quick search on Stack Overflow reveals the unreliability and large number of issues that plague the FileSystemWatcher class.

However, it appears that work is being done to provide an API to provide exactly what would be needed here.

It’s not clear what state the work is in and what parts have made it into the mainline kernel, but the various mailing lists do show evidence of related activity.

https://marc.info/?l=linux-fsdevel&m=148224725519139&w=2
https://lwn.net/Articles/716973/
https://marc.info/?l=linux-fsdevel&w=2&r=1&s=fanotify&q=b

solf · October 17, 2018, 12:12pm

Would it be possible to clarify how USN Journal reading interacts with NTFS links (junctions)? I use these extensively and I’d like to know what will/will not be backed up under these conditions.

Also – if my reading of this discussion and relevant documentation in the application is correct – it seems that if the USN option is turned on, it is always used (assuming it is available)? Is this correct? I think it would be much better to have an option somewhere that forces the code path to fall back to standard scanning at e.g. specified times (for example, in the often-mentioned CrashPlan, there’s a specific time when a ‘full rescan’ is done) – this is to ensure that something doesn’t fall through the cracks indefinitely – such as NTFS junctions or something.

I assume this might be possible to implement via an external script that would manipulate USN option before backup execution, but I think UI option would be a much better solution (assuming my proposal even makes sense).

aricamartin · November 12, 2018, 5:22am

I am also having trouble backing up my data while updating Windows 10, and I get the error code 0x80004005. I also searched for Windows error code 0x80004005 for help, but did not get any response.

samw · November 12, 2018, 8:09am

@aricamartin

Where are you getting this error from? This is a generic error code thrown by many Windows applications.

Daniel_Gehriger · December 3, 2018, 10:48am

@solf: The USN code will fall back to full scan in the following cases:

First scan
A gap in the USN is encountered (this happens when the last backup is too far back, because the USN has a limited size)
An error / inconsistency has been encountered
The Duplicati settings have been modified
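
The four cases above could be checked roughly as follows. This is a sketch under stated assumptions, not the actual implementation: `LastBackupState`, `full_scan_reason`, and the idea of comparing a settings hash are hypothetical, and a gap is modeled simply as the stored USN being older than the oldest entry still in the journal.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LastBackupState:
    """Hypothetical record saved after each backup (not Duplicati's schema)."""
    next_usn: int        # first journal entry the next backup must read
    settings_hash: str   # fingerprint of the backup settings


def full_scan_reason(
    last: Optional[LastBackupState],
    oldest_usn_in_journal: int,
    settings_hash: str,
    journal_error: bool = False,
) -> Optional[str]:
    """Return why a full scan is needed, or None if the journal can be used."""
    if last is None:
        return "first scan"
    if journal_error:
        return "error or inconsistency"
    if last.settings_hash != settings_hash:
        return "settings modified"
    if last.next_usn < oldest_usn_in_journal:
        # Entries between the last backup and the oldest surviving
        # record have been overwritten, so changes may have been missed.
        return "gap in journal"
    return None
```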

Furthermore, the algorithm uses the USN only to obtain a list of modified (touched) folders and files. If any folder is found to be modified, its content will be included in a partial scan. So we are not relying on the USN to monitor the exact transactions to files in a modified folder, but do a partial scan.
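
To make the partial scan concrete, here is a minimal sketch. The function name `paths_to_rescan` is hypothetical, and the sketch assumes that a touched folder is rescanned recursively; the post does not say whether only direct children are included.

```python
import os
from typing import Iterable, Set


def paths_to_rescan(
    touched_files: Iterable[str], touched_folders: Iterable[str]
) -> Set[str]:
    """Build the set of files for a partial scan from journal hints."""
    # Touched files that still exist are rescanned directly. Paths that
    # no longer exist are skipped here; handling deletions is out of
    # scope for this sketch.
    result = {path for path in touched_files if os.path.isfile(path)}

    # For touched folders the journal is only a hint: walk the folder
    # itself instead of trusting individual journal records.
    for folder in touched_folders:
        if not os.path.isdir(folder):
            continue
        for root, _dirs, files in os.walk(folder):
            for name in files:
                result.add(os.path.join(root, name))
    return result
```

Everything outside the returned set would be carried over unchanged from the previous fileset.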

Your question regarding hard links / junction points is interesting, though. I don’t know how the change journal records files present under several junction points. Under the first junction point? Or under the path where the modification was performed? I’ve found this information, which suggests that this is indeed an issue, and it may be wise not to use junction points with the USN, or, as you suggest, force a full-scan every “n” backups.
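
Forcing a full scan every “n” backups could be as simple as the counter check below. This is purely illustrative: `use_journal_this_run` and both of its parameters are hypothetical and do not correspond to any existing Duplicati option.

```python
def use_journal_this_run(runs_since_full_scan: int, full_scan_every: int) -> bool:
    """Decide whether this backup may rely on the journal.

    runs_since_full_scan: journal-based backups completed since the
        last full scan.
    full_scan_every: make every n-th backup a full scan; 0 or less
        disables the periodic full scan.
    """
    if full_scan_every <= 0:
        return True
    return runs_since_full_scan + 1 < full_scan_every
```

With `full_scan_every=5`, four journal-based backups are followed by one full scan, so anything the journal missed is picked up within five runs at most.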

kenkendk · December 3, 2018, 11:21am

Is there a PR that I missed? AFAIK all changes related to USN are merged and in the beta release?

solf · December 3, 2018, 12:00pm

So would you consider adding a UI option for this? I think something like setting the time of day when stored USN information is discarded (so it’ll act as on ‘first scan’ and will do a full scan) would be great.

Daniel_Gehriger · December 3, 2018, 12:00pm

Ken

You didn’t miss anything. The post you mention is from May.

Daniel

Daniel_Gehriger · December 3, 2018, 1:24pm

Unfortunately, I’m not familiar with Duplicati’s UI. Someone else would need to implement this.