Real-time backup

JonMikelV · October 25, 2017, 6:42pm

I’ve added a Display Duplicati 2.0.2.1 beta command line help How-To topic that includes the output of the advanced option descriptions for 2.0.2.1 beta.

It’s not ALL the information, but it’s a decent chunk. I’m hoping to get more added.

petertirrell · November 29, 2017, 9:23pm

I came across this and found it interesting as I, too, am coming from CrashPlan and working on migrating my current setup. I’ve been trying to dig further into what feature parity I might need to work out.

Anyway, the original CrashPlan link (Real-Time Backup For Network-Attached Storage - CrashPlan Support) shed a little light on how they are implementing it: they seem to tie into the corresponding OS filesystem hooks to get notified of changes (Spotlight on Mac, NTFS on Windows, a kernel module on Linux).

On the Windows side some more digging led me to the “NTFS Change Journal” (Change Journals (Windows))… at this point I’m not quite clear on its relation to a FileSystemWatcher, but it sounds like it might be lower level? I wonder if that’s how others are accomplishing things like real-time backup implementations. This link describes an interesting process around that, too.

Makes me want to read up on it more and see if there’d be some way to integrate some of those ideas, at least on the Windows side, to get a more real-time backup solution…

JonMikelV · November 29, 2017, 9:49pm

Sounds like a great way to contribute!

kenkendk · December 9, 2017, 9:14pm

There are two ways to get the “continuous backup”.

The first is to use the Windows USN to list files and folders.
I implemented it for Duplicati 1.3.x, but some users reported missing files, so I disabled it. If we can get it fixed, it is really fast, as Windows can return a list of changed files/folders by querying the NTFS metadata in one go. I recall it as throwing paths out at thousands/second.

The other approach is to use the FileSystemWatcher which works on all OS’es. With a watcher it is possible to record what files and folders have changed. The watcher should record from the time the previous backup starts, to make sure it catches files changed during the backup as well.

As mentioned above, Duplicati has the options --changed-files and --deleted-files, which will bypass the normal directory scanning approach and just look at the files supplied through these options. We may need to do something more clever than sending a huge list of paths in, but the basic functionality is there.
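A minimal sketch of the watcher bookkeeping described above, assuming some file system watcher calls `on_changed` / `on_deleted` for each event. The class and method names are hypothetical and not part of Duplicati; the point is only that the recording window is swapped when a backup starts, so changes made during the backup land in the next run.

```python
import threading


class ChangeRecorder:
    """Collects paths reported by a file system watcher between backups.

    Hypothetical helper for illustration; not a Duplicati class.
    """

    def __init__(self):
        self._lock = threading.Lock()
        self._changed = set()
        self._deleted = set()

    def on_changed(self, path):
        # A created or modified path is no longer considered deleted.
        with self._lock:
            self._deleted.discard(path)
            self._changed.add(path)

    def on_deleted(self, path):
        with self._lock:
            self._changed.discard(path)
            self._deleted.add(path)

    def begin_backup(self):
        """Hand the recorded paths to the backup and open a fresh window.

        The swap happens when the backup starts, so anything modified
        while the backup is running is recorded for the next run.
        """
        with self._lock:
            changed, deleted = self._changed, self._deleted
            self._changed, self._deleted = set(), set()
        return sorted(changed), sorted(deleted)

    def backup_failed(self, changed, deleted):
        # Put the paths back so a failed run does not lose them.
        # Events recorded since begin_backup() are newer and win.
        with self._lock:
            for path in changed:
                if path not in self._deleted:
                    self._changed.add(path)
            for path in deleted:
                if path not in self._changed:
                    self._deleted.add(path)
```

The two lists returned by `begin_backup()` are what would be supplied through `--changed-files` and `--deleted-files`; how best to pass a very large list is the open question mentioned above.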

petertirrell · December 9, 2017, 9:32pm

Interesting! I might check out the 1.3 source just to see how it was doing it…maybe there’s something to salvage. Do you have a pointer to where in the source tree that work was being done? Thanks!

kenkendk · December 10, 2017, 11:03am

It is still part of the 2.0 codebase, there is just nothing that calls into it yet:
https://github.com/duplicati/duplicati/blob/master/Duplicati/Library/Snapshots/USNHelper.cs#L207

Daniel_Gehriger · February 12, 2018, 8:29am

Ken - I finally had some time to play with this, and implemented code (actually built upon an open source .cs implementation for the USN API) that returns the full paths of files modified, deleted, or added between a point in the past (given by its USN serial number) and now.

My plan was to update NoSnapshotWindows.ListFiles() / SnapshotWindows.ListFiles() such that they only enumerate those files if a USN journal is available for that folder.

However, I would have to store the USN serial number for each volume once the backup has successfully completed. Where (and how) in the DB would you recommend storing this information on a per-volume basis? It’s important that this is only committed to the database once the backup has completed successfully (no files missing or anything), or otherwise the missing files won’t be backed up the next time.

Could the above be the reason that your implementation was missing files?

Alternatively, the database could also keep track of files that failed to back up, and add them to the ones enumerated by the USN code. Again, where could I store that information?

For your information: enumerating my files on my laptop takes 20 minutes. And most of the time only 10-20 files will be backed up. I cannot reduce the backup set size, and it’s a real problem that Duplicati uses that many resources for so much time.

Daniel_Gehriger · February 12, 2018, 8:33am

I did not realize that you already coded USN journal access. I probably prefer to fix your code then. How do you keep track of the current USN serial number?

kenkendk · February 12, 2018, 9:35am

That code was not updated for 2.0.
In 1.3.x the USN number was written to the manifest files. In 2.0, we should store it in the database as well as in the dlist files.

My strategy was to grab the current USN, then do the listing, such that changes could not be lost (USN may update after grabbing it, but this will only cause the files to be scanned again later).

There is a call in the backup method that creates the new filelist, and I think that would be the best place to record it.
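The ordering can be sketched as follows. This is only an illustration with an in-memory stand-in for the journal (`FakeJournal`, `run_backup` and the `state` dict are hypothetical names, not Duplicati code): the position is read before listing, and it is stored only after the backup has succeeded.

```python
class FakeJournal:
    """In-memory stand-in for a change journal; the USN is just an index."""

    def __init__(self):
        self.entries = []

    def append(self, path):
        self.entries.append(path)

    def current_usn(self):
        return len(self.entries)

    def changes_since(self, usn):
        # Lists everything up to the end of the journal, including entries
        # written after the caller grabbed its position.
        return sorted(set(self.entries[usn:]))


def run_backup(journal, state, back_up):
    """Back up changed paths; `back_up` is assumed to raise on failure."""
    # A real implementation would do a full scan when no USN is stored.
    previous_usn = state.get("usn", 0)
    # Grab the position BEFORE listing. An entry written after this point
    # may be listed now and again next time, but it can never be skipped.
    current_usn = journal.current_usn()
    paths = journal.changes_since(previous_usn)
    back_up(paths)
    # Only reached when the backup succeeded.
    state["usn"] = current_usn
    return paths
```

If `back_up` raises, the stored position is left untouched, so the same range is listed again on the next run.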

Daniel_Gehriger · April 10, 2018, 3:41pm

Ken - I am trying to find my way around the codebase, but without inside knowledge it’s difficult. Would it be possible to provide me with the following information:

- Where exactly in the Sqlite DB (which table, field) can I record the current USN?
- How is the DB accessed in the code?
- At which point should I grab the previous USN from the DB, and at which point is it safe to commit it to the DB?
- Is storing in the DLIST really required? I would assume that if the DB is corrupted, we can just enumerate all files as if there wasn’t a previous USN.

Thank you!

Daniel

Daniel_Gehriger · April 13, 2018, 10:14am

I’m making progress with this, and settled on a DB schema layout to accommodate the USN data. I’ll be submitting a pull-request soon. As for storing in the dlist, I still believe that this is not required, because if the local DB is missing, it’s probably best to simply rescan everything.

Pectojin · April 13, 2018, 10:41am

I completely agree. Better be safe than sorry with DB issues. I imagine it’s the same reasoning behind not being able to repair a DB twice (if the first attempt fails).

I’m super excited to see this feature

JonMikelV · April 13, 2018, 4:44pm

Just out of curiosity, is the USN stuff you’re doing NTFS specific or can it be used with other journaling file systems?

Daniel_Gehriger · April 13, 2018, 5:00pm

I’m only working on NTFS support. But I’ll try to make it generic enough, by defining a suitable interface, to enable adding other journaling file systems later.

Daniel_Gehriger · April 19, 2018, 9:21pm

See pull-request: 3184

First implementation of NTFS USN journal optimization.

- Folder renames are correctly handled (old name is removed, new one added).
- I don’t actually rely on the USN journal to determine if files have changed. Instead, I use it to filter the source list (reduce it) to the set of possibly modified files and folders, which are then scanned recursively using FilterHandler.EnumerateFilesAndFolders(). That way, there is no risk of incorrectly interpreting the journal entries in complex rename / delete / re-create scenarios. Whenever a file / folder is “touched” in the journal, it will be fully scanned.
- The state of the USN journal is recorded in a new table ChangeJournalData in the local database. This table also records a hash value for each fileset, representing the active source files and filter set. An initial full scan is re-triggered whenever the backup configuration is modified.
- A full scan is also triggered if the USN journal is incomplete (has been overwritten / truncated), and of course in case of errors.
- In case of DB loss or recovery, the USN data is not recovered, and a full scan is triggered to establish a sane baseline for subsequent backups.
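As a rough illustration of the points above (not the code in the pull request), the two decisions can be sketched like this. All function names and the `stored` dictionary keys are hypothetical, and SHA-256 is just an assumed choice of hash.

```python
import hashlib
from pathlib import PureWindowsPath


def config_hash(sources, filters):
    """Hash of the source list and filter set (assumed scheme)."""
    digest = hashlib.sha256()
    for item in list(sources) + ["--"] + list(filters):
        digest.update(item.encode("utf-8") + b"\0")
    return digest.hexdigest()


def needs_full_scan(stored, journal_id, first_valid_usn, current_hash):
    """Decide whether the journal can be trusted for this run."""
    if stored is None:
        return True  # no baseline: first run, or local DB lost/recreated
    if stored["journal_id"] != journal_id:
        return True  # journal was deleted and re-created
    if stored["next_usn"] < first_valid_usn:
        return True  # entries since the last backup were overwritten
    return stored["config_hash"] != current_hash  # configuration changed


def is_under(path, folder):
    return path == folder or folder in path.parents


def reduce_sources(sources, touched):
    """Reduce the source list to the touched entries that need a rescan."""
    sources = [PureWindowsPath(s) for s in sources]
    hits = set()
    for path in map(PureWindowsPath, touched):
        if any(is_under(path, s) for s in sources):
            hits.add(path)  # touched entry inside the backup set
        else:
            # A touched ancestor (e.g. a renamed parent folder) affects
            # every source below it.
            hits.update(s for s in sources if is_under(s, path))
    reduced = []
    # Shallow paths first, so a folder is kept before anything inside it.
    for path in sorted(hits, key=lambda p: (len(p.parts), str(p).lower())):
        if not any(is_under(path, kept) for kept in reduced):
            reduced.append(path)  # nested entries are covered recursively
    return [str(p) for p in reduced]
```

Each returned path would then be scanned recursively by the normal enumeration, so the journal only narrows where to look and never decides on its own what has changed.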

TODO:

The USN journal records a limited amount of changes, and if backups are spaced too far apart, full scans are required as the data will be incomplete. This has the following implications:

- Frequent (realtime) backups avoid this problem. If nothing has changed, the fileset will be compacted. Duplicati may need optimizing if this becomes a common scenario (compacting before uploading).
- Frequent backups result in numerous filesets, and this may interfere with retention policy. Maybe for the current day, many more filesets would make sense in the “automatic” retention policy mode, to avoid deleting changing data.
- Less frequent backups with USN support could be made possible by scanning the USN journal at regular intervals, and recording the changes, using a process / thread separate from the backup thread. When the backup is run, this data is then used instead of reading the journal at that time. There is no risk that modifications will be missed during reboots / Duplicati not running, as the USN numbers allow us to ensure that the journal is recorded w/o gaps.
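The gap check in the last item could look roughly like the sketch below. `JournalRecorder` and the `read_range` callable are hypothetical; the only idea taken from the text is that each poll must start exactly where the previous one ended, and that a journal whose oldest surviving entry is already past that point forces a full scan.

```python
class JournalRecorder:
    """Accumulates journal entries between backups (hypothetical helper)."""

    def __init__(self, start_usn):
        self.next_usn = start_usn  # first USN that has not been read yet
        self.paths = set()
        self.has_gap = False

    def poll(self, first_valid_usn, journal_next_usn, read_range):
        """Read everything recorded since the previous poll.

        read_range(start, end) is assumed to return the paths touched by
        journal entries with start <= USN < end.
        """
        if first_valid_usn > self.next_usn:
            # Entries that were never read have already been overwritten.
            self.has_gap = True
        else:
            self.paths.update(read_range(self.next_usn, journal_next_usn))
        self.next_usn = journal_next_usn

    def take(self):
        """Return recorded paths, or None if a full scan is needed."""
        result = None if self.has_gap else sorted(self.paths)
        self.paths = set()
        self.has_gap = False
        return result
```

A background thread would call `poll()` at regular intervals, and the backup would call `take()` instead of reading the journal itself.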

JonMikelV · April 19, 2018, 11:13pm

This all looks great to me, but honestly I know nothing about USN and this level of journaling. With that in mind I’m going to ask what’s likely a stupid question.

Does your implementation support multiple drives having mixed cases of with / without USN support?

Daniel_Gehriger · April 20, 2018, 6:19am

Multiple volumes are supported, and tracked separately.
USN support is only available for NTFS, and the case is handled accordingly for that filesystem (currently not case sensitive).
The actual behavior if USN isn’t available for one of the volumes (file system not NTFS, platform Linux, …) depends on the value of --usn-policy:
- off = USN not used
- auto = Tries to use USN, silently (info message) falls back to a full scan for any volume where USN cannot be used
- on = Tries to use USN, falls back to a full scan for any volume where USN cannot be used, and issues a warning
- required = Tries to use USN, aborts the backup with an error if it fails for any of the volumes
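Read as a decision table, the four values could be modelled like this. It is a sketch of the behavior listed above, with hypothetical names (`UsnPolicy`, `plan_volume`), and not the actual implementation.

```python
from enum import Enum


class UsnPolicy(Enum):
    OFF = "off"
    AUTO = "auto"
    ON = "on"
    REQUIRED = "required"


def plan_volume(policy, usn_available):
    """Return (scan_mode, log_level) for one volume.

    scan_mode is "usn" or "full"; log_level is None, "info" or "warning".
    Raises RuntimeError when the policy demands USN and it is unavailable.
    """
    if policy is UsnPolicy.OFF:
        return ("full", None)
    if usn_available:
        return ("usn", None)
    if policy is UsnPolicy.AUTO:
        return ("full", "info")
    if policy is UsnPolicy.ON:
        return ("full", "warning")
    raise RuntimeError("USN journal required but not available")
```

Because volumes are tracked separately, the function would be evaluated once per volume, so an NTFS volume can use the journal while another volume in the same backup falls back to a full scan.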

JonMikelV · April 20, 2018, 2:04pm

Makes sense, thanks!

Are you thinking auto for the default setting?

Daniel_Gehriger · April 20, 2018, 2:06pm

Jon - I think we need to test this for some time, and when the implementation is found to be stable, it certainly makes sense to default to auto. But I don’t think it’s up to me to decide.

JonMikelV · April 20, 2018, 2:15pm

I suppose that might be a good idea.

I once worked with a person who decided they didn’t need to test because they “wrote it to work”. Spoiler alert - they were wrong.