Real-time backup

Sounds awesome!

The implementation is based on two options, --changed-files and --deleted-files. On the command line, you can set each of them to a list of paths joined by the path separator (; on Windows and : on other systems). Once they are set, the file scanner is not used; Duplicati only examines the paths in --changed-files and removes the paths in --deleted-files.
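As a rough illustration, a wrapper script could assemble the two options like this. This is only a sketch: the helper name `build_change_args` is hypothetical and not part of Duplicati, and the `--option=value` form is assumed. Python's `os.pathsep` happens to be `;` on Windows and `:` elsewhere, which matches the separator described above.

```python
import os


def build_change_args(changed, deleted, sep=os.pathsep):
    """Build the two list options from iterables of paths.

    Hypothetical helper for illustration; Duplicati does not ship it.
    """
    changed = list(changed)
    deleted = list(deleted)
    # A path containing the separator would be split into two entries,
    # so reject it instead of silently passing a corrupted list.
    for path in changed + deleted:
        if sep in path:
            raise ValueError(f"path contains the list separator {sep!r}: {path}")
    args = ["--changed-files=" + sep.join(changed)]
    if deleted:
        args.append("--deleted-files=" + sep.join(deleted))
    return args
```

The returned strings would be appended to the normal backup command line. For very long lists the command-line length limit of the OS becomes a concern, which is the problem raised further down.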

If you are using USN, it should be possible to store the previous USN time-stamp and then query USN to give the change lists. This will work even if you restart Duplicati in the meantime.

For the FileWatcher approach, the problem is that the watcher is not guaranteed to keep running, so you need to handle the case where it has only just been started (and thus cannot help you, so a full scan must run). If it has been running since the last backup, you can use its list and avoid the scan.

The list may be really large (if many files have changed), so perhaps you need something like a database to store the filenames. It is also possible that we need to redesign the way we pass the list of filenames to avoid storing them in memory (i.e. using a file or similar).
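To make the database idea concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The class name `ChangeStore`, the table name `pending_change` and its columns are all assumptions made up for this example; they are not Duplicati's schema.

```python
import sqlite3


class ChangeStore:
    """Keeps changed/deleted paths in SQLite instead of in memory (illustrative only)."""

    def __init__(self, db_path=":memory:"):
        self.conn = sqlite3.connect(db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS pending_change ("
            "path TEXT PRIMARY KEY, deleted INTEGER NOT NULL)"
        )

    def record(self, path, deleted=False):
        # A later event for the same path replaces the earlier one.
        with self.conn:
            self.conn.execute(
                "INSERT OR REPLACE INTO pending_change (path, deleted) VALUES (?, ?)",
                (path, int(deleted)),
            )

    def iter_paths(self, deleted):
        # Iterate over the cursor so the whole list is never built in memory.
        cursor = self.conn.execute(
            "SELECT path FROM pending_change WHERE deleted = ? ORDER BY path",
            (int(deleted),),
        )
        for (path,) in cursor:
            yield path

    def clear(self):
        # Call this only after a backup has completed successfully.
        with self.conn:
            self.conn.execute("DELETE FROM pending_change")
```

Passing a file name instead of `":memory:"` would let the list survive a restart of the watching process.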

I am not sure how best to do this, but perhaps it would make sense to start the FileWatcher when running a backup. This first run will then use the scan but keep the FileWatcher running, so that the next backup uses the data from the FileWatcher.
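The decision between a full scan and the watcher's list could look roughly like the sketch below. `WatcherState` and `choose_source` are hypothetical names, and timestamps are plain numbers for simplicity; none of this is taken from the Duplicati codebase.

```python
from dataclasses import dataclass, field
from typing import Optional, Set


@dataclass
class WatcherState:
    # None means the watcher is not running at all.
    started_at: Optional[float] = None
    changed: Set[str] = field(default_factory=set)


def choose_source(state, last_backup_start):
    """Return ("changes", paths) if the watcher covers the gap, else ("scan", None)."""
    covers_gap = (
        state.started_at is not None
        and last_backup_start is not None
        and state.started_at <= last_backup_start
    )
    if covers_gap:
        return "changes", sorted(state.changed)
    # Watcher missing, restarted after the last backup, or no backup yet:
    # the change list may be incomplete, so fall back to a full scan.
    return "scan", None
```

With this rule the first backup after every (re)start of the watcher is a full scan, and only later runs use the recorded list.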

If this is the “right way”, I think you can inject this into the “Runner.cs” class in the Duplicati.Server project:

Let me know if you have any questions.


Ken,

Thanks for the introduction. I agree that running the FileSystemWatcher in a separate process, and then passing the changed / deleted file on the command line (or using some other mechanism, such as IPC) is risky.

I’d rather favour the approach of an in-process task, with an initial scan and then updating the file list using the watcher thread.

I will look into this over the next weeks.

  • Daniel

But if you do it this way you will

  • have to keep multiple lists for different tasks, and every first scan per task will have to be done from scratch, and
  • on every run after that, you will have to check the change list and double-check that each change is indeed newer than the last backup, because it might have been there from before the last full scan.
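One possible way to address both points, sketched under assumptions: keep a single shared log of the latest change time per path, and let each task remember when its own last backup started. The class `SharedChangeLog` and its methods are invented for this example, and the sketch assumes the log has been recording without gaps since each task's cutoff.

```python
class SharedChangeLog:
    """One change log shared by all backup tasks (illustrative only)."""

    def __init__(self):
        self._last_change = {}   # path -> timestamp of its most recent change
        self._task_cutoff = {}   # task name -> start time of that task's last backup

    def record(self, path, timestamp):
        previous = self._last_change.get(path, timestamp)
        self._last_change[path] = max(previous, timestamp)

    def mark_backup_started(self, task, timestamp):
        self._task_cutoff[task] = timestamp

    def changes_for(self, task):
        """Return changed paths for a task, or None if it needs a full scan."""
        cutoff = self._task_cutoff.get(task)
        if cutoff is None:
            return None  # the task has never run, so there is nothing to compare against
        # Keep only changes at or after the start of the task's last backup.
        return sorted(p for p, t in self._last_change.items() if t >= cutoff)
```

Each task still needs one full scan the first time it runs, but the change data itself is stored once rather than per task.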

Understand what I try to say? :innocent:

I’m really looking forward to this real-time backup thing.

This part is very useful. I was wondering how this worked. I still don’t completely understand how to use it though. I’ve searched the forum for information about --changed-files, but could only find this topic discussion about it so far.

Are there any ideas for other file systems that Duplicati can take advantage of?

I believe --changed-files is a “;” (Windows) or “:” (non-Windows) separated list of file paths previously determined to have been changed. Similarly, --deleted-files is a list of file paths previously determined to have been deleted.

My guess would be that a third party “file watcher” tool can be monitoring the file system between Duplicati runs, then when Duplicati starts a backup, a --run-script-before script queries the file-watcher to provide a list of changed and deleted files which are then passed into Duplicati likely using DUPLICATI__changed_files and DUPLICATI__deleted_files environment variables.


Here are the descriptions of the two file list parameters as of version 2.0.2.12_canary.

Duplicati.CommandLine.exe help changed-files
  --changed-files (Path): List of files to examine for changes
    This option can be used to limit the scan to only files that are known to
    have changed. This is usually only activated in combination with a
    filesystem watcher that keeps track of file changes.

Duplicati.CommandLine.exe help deleted-files
  --deleted-files (Path): List of deleted files
    This option can be used to supply a list of deleted files. This option
    will be ignored unless the option --changed-files is also set.

I wish all of this usage information was easily accessible in one place.

Well, technically it (like the above) IS all in one place - it just happens to be in the command-line help, so it isn’t searchable.

Hang in there - you might see a topic about it soon and there is some wonderful work on an actual manual going on!

Can I get that information on macOS or Linux?

Yep. Just run Duplicati.CommandLine.exe help.

I’ve added a Display Duplicati command line help #howto topic that includes the output of the advanced option descriptions for 2.0.2.1 beta.

It’s not ALL the information, but it’s a decent chunk. I’m hoping to get more added.


I came across this and found it interesting as I, too, am coming from CrashPlan and working on migrating my current setup. I’ve been trying to dig further into what feature parity I might need to work out.

Anyway, the original CrashPlan link (Real-Time Backup For Network-Attached Storage - CrashPlan Support) shed a little light on how they are implementing it: they seem to hook into the corresponding OS filesystem hooks to get notified of changes (Spotlight on Mac, NTFS on Windows, a kernel module on Linux).

On the Windows side, some more digging led me to the “NTFS Change Journal” (Change Journals (Windows))… at this point I’m not quite clear on its relation to a FileSystemWatcher, but it sounds like it might be lower level? I wonder if that’s how others are accomplishing things like real-time backup implementations. This link describes an interesting process around that, too.

Makes me want to read up on it more and see if there’d be some way to integrate some of those ideas, at least on the Windows side, to get a more real-time backup solution…


Sounds like a great way to contribute!


There are two ways to get the “continuous backup”.

The first is to use the Windows USN to list files and folders.
I implemented it for Duplicati 1.3.x, but some users reported missing files, so I disabled it. If we can get it fixed, it is really fast, as Windows can return a list of changed files/folders by querying the NTFS metadata in one go. I recall it as throwing paths out at thousands/second.

The other approach is to use the FileSystemWatcher, which works on all OSes. With a watcher it is possible to record which files and folders have changed. The watcher should record from the time the previous backup starts, to make sure it catches files changed during the backup as well.
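The point about recording from the moment a backup starts can be sketched as a recorder that hands over its current set when a backup begins and immediately starts a fresh one. `ChangeRecorder` is a hypothetical name, and `on_change` stands in for whatever callback the real watcher would invoke; this is not Duplicati code.

```python
import threading


class ChangeRecorder:
    """Collects changed paths reported from a watcher thread (illustrative only)."""

    def __init__(self):
        self._lock = threading.Lock()
        self._pending = set()

    def on_change(self, path):
        # Called from the watcher thread for every change notification.
        with self._lock:
            self._pending.add(path)

    def begin_backup(self):
        """Hand over everything recorded so far and start a fresh set."""
        with self._lock:
            batch, self._pending = self._pending, set()
        return batch

    def restore(self, batch):
        """Merge a batch back in after a failed backup so nothing is lost."""
        with self._lock:
            self._pending |= batch
```

Because the swap happens at the start of the backup, a file modified while the backup is running lands in the new set and is picked up by the next run.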

As mentioned above, Duplicati has the options --changed-files and --deleted-files, which will bypass the normal directory scanning approach and just look at the files supplied through these options. We may need to do something more clever than sending a huge list of paths in, but the basic functionality is there.


Interesting! I might check out the 1.3 source just to see how it was doing it…maybe there’s something to salvage. Do you have a pointer to where in the source tree that work was being done? Thanks!


It is still part of the 2.0 codebase, there is just nothing that calls into it yet:

Ken - I finally had some time to play with this, and implemented code (actually built upon an open source .cs implementation for the USN API) that returns the full paths of modified/deleted/added files between a point in the past (given by its USN serial number) and now.

My plan was to update NoSnapshotWindows.ListFiles() / SnapshotWindows.ListFiles() such that they only enumerate those files if a USN journal is available for that folder.

However, I would have to store the USN serial number for each volume once the backup has successfully completed. Where (and how) in the DB would you recommend storing this information on a per-volume basis? It’s important that this is only committed to the database once the backup has been completed successfully (no files missing or anything), or otherwise the missing files won’t be backed up the next time.

Could the above be the reason that your implementation was missing files?

Alternatively, the database could also keep track of files that failed to back up, and add them to the one enumerated by the USN code. Again, where could I store that information?

For your information: enumerating my files on my laptop takes 20 minutes. And most of the time only 10-20 files will be backed up. I cannot reduce the backup set size, and it’s a real problem that Duplicati uses that many resources for so much time.

I did not realize that you already coded USN journal access. I probably prefer to fix your code then. How do you keep track of the current USN serial number?

That code was not updated for 2.0.
In 1.3.x the USN number was written to the manifest files. In 2.0, we should store it in the database as well as in the dlist files.

My strategy was to grab the current USN, then do the listing, such that changes could not be lost (USN may update after grabbing it, but this will only cause the files to be scanned again later).
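The ordering described here (grab the USN first, list afterwards, persist only on success) can be sketched as follows. `FakeJournal` is an in-memory stand-in invented for the example and does not call the Windows USN API; `run_backup` and its parameters are likewise hypothetical.

```python
class FakeJournal:
    """In-memory stand-in for a change journal; not the Windows API."""

    def __init__(self):
        self.entries = []  # list of (usn, path), with increasing usn

    def append(self, path):
        self.entries.append((len(self.entries) + 1, path))

    def current_usn(self):
        return self.entries[-1][0] if self.entries else 0

    def changes_since(self, usn):
        return [path for number, path in self.entries if number > usn]


def run_backup(journal, saved_usn, backup_files):
    """Return the USN to persist, or raise if the backup fails.

    saved_usn is None when no previous checkpoint exists.
    """
    checkpoint = journal.current_usn()            # 1. grab the checkpoint first
    if saved_usn is None:
        paths = None                              # no history: do a full scan
    else:
        paths = journal.changes_since(saved_usn)  # 2. then list the changes
    backup_files(paths)                           # 3. raises on any failure
    return checkpoint                             # 4. persist only after success
```

A change that arrives between steps 1 and 2 is listed now and listed again on the next run, which costs a redundant check but never loses a file. If `backup_files` raises, the caller keeps the old USN and the same range is retried.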

There is a call in the backup method that creates the new filelist, and I think that would be the best place to record it.

Ken - I am trying to find my way around the codebase, but without inside knowledge it’s difficult. Would it be possible to provide me with the following information:

  1. Where exactly in the Sqlite DB (which table, field) can I record the current USN?
  2. How is the DB accessed in the code?
  3. At which point should I grab the previous USN from the DB; and at which point is it safe to commit it to the DB.
  4. Is storing in the DLIST really required? I would assume that if the DB is corrupted, we can just enumerate all files as if there wasn’t a previous USN.

Thank you!

  • Daniel