Real-time backup

Ken - I am trying to find my way around the codebase, but without inside knowledge it’s difficult. Would it be possible to provide me with the following information:

  1. Where exactly in the SQLite DB (which table, field) can I record the current USN? (A sketch of what I mean by “current USN” follows this list.)
  2. How is the DB accessed in the code?
  3. At which point should I grab the previous USN from the DB, and at which point is it safe to commit it back to the DB?
  4. Is storing it in the DLIST really required? I would assume that if the DB is corrupted, we can just enumerate all files as if there wasn’t a previous USN.
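
For context, by “current USN” I mean the per-volume journal position that Windows reports via FSCTL_QUERY_USN_JOURNAL. Here is a minimal standalone sketch of that raw Win32 call (not Duplicati code) just to show which values would need to be persisted:

```cpp
// Sketch only: query the NTFS change journal state for one volume.
// UsnJournalID and NextUsn are the values that would have to be
// persisted in the local DB between backups.
#include <windows.h>
#include <winioctl.h>
#include <cstdio>

int main()
{
    // Open the volume (requires appropriate privileges, typically admin).
    HANDLE hVol = CreateFileW(L"\\\\.\\C:", GENERIC_READ,
                              FILE_SHARE_READ | FILE_SHARE_WRITE,
                              nullptr, OPEN_EXISTING, 0, nullptr);
    if (hVol == INVALID_HANDLE_VALUE)
        return 1;

    USN_JOURNAL_DATA journal = {};
    DWORD bytes = 0;
    if (DeviceIoControl(hVol, FSCTL_QUERY_USN_JOURNAL, nullptr, 0,
                        &journal, sizeof(journal), &bytes, nullptr))
    {
        // UsnJournalID changes when the journal is recreated;
        // NextUsn is the position to record as the "current USN".
        printf("JournalID=%llu NextUsn=%lld\n",
               (unsigned long long)journal.UsnJournalID,
               (long long)journal.NextUsn);
    }
    CloseHandle(hVol);
    return 0;
}
```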

Thank you!

  • Daniel

I’m making progress with this, and settled on a DB schema layout to accommodate the USN data. I’ll be submitting a pull-request soon. As for storing in the dlist, I still believe that this is not required, because if the local DB is missing, it’s probably best to simply rescan everything.

2 Likes

I completely agree. Better safe than sorry with DB issues. I imagine it’s the same reasoning behind not being able to repair a DB twice (if the first attempt fails).

I’m super excited to see this feature :slight_smile:

Just out of curiosity, is the USN stuff you’re doing NTFS specific or can it be used with other journaling file systems?

1 Like

I’m only working on NTFS support. But I’ll try to make it generic enough, by defining a suitable interface, to enable adding other journaling file systems later.
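
To give an idea of what I mean by a suitable interface, here is a rough sketch; all names are made up and the real code will likely look different:

```cpp
// Hypothetical sketch of a file-system-agnostic change-journal interface.
// None of these names exist in Duplicati; they only illustrate the idea
// that NTFS/USN would be just one implementation behind a common API.
#include <string>
#include <vector>
#include <cstdint>

struct JournalPosition {
    uint64_t journalId;   // identifies the journal instance (detects recreation)
    int64_t  nextUsn;     // next change number to read from
};

struct ChangedPath {
    std::wstring path;    // file or folder that was touched in the journal
};

class IChangeJournal {
public:
    virtual ~IChangeJournal() = default;

    // Current journal state, to be stored in the local DB after a backup.
    virtual JournalPosition GetCurrentPosition() = 0;

    // All paths touched since 'since'; signals an error if the journal was
    // truncated or recreated so the caller can fall back to a full scan.
    virtual std::vector<ChangedPath> GetChangesSince(const JournalPosition& since) = 0;
};

// An NTFS/USN implementation would derive from this; a backend for another
// journaling file system could be added later without touching the callers.
```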

2 Likes

See pull-request: 3184

First implementation of NTFS USN journal optimization.

  • Folder renames are correctly handled (old name is removed, new one added)
  • I don’t actually rely on the USN journal to determine if files have changed. Instead, I use it to filter the source list (reduce it) to the set of possibly modified files and folders, which are then scanned recursively using FilterHandler.EnumerateFilesAndFolders(). That way, there is no risk of incorrectly interpreting the journal entries in complex rename / delete / re-create scenarios. Whenever a file / folder is “touched” in the journal, it will be fully scanned.
  • The state of the USN journal is recorded in a new table ChangeJournalData in the local database. This table also records a hash value for each fileset, representing the active source files and filter set. An initial full scan is re-triggered whenever the backup configuration is modified. (A rough sketch of this table follows the list.)
  • A full scan is also triggered if the USN journal is incomplete (has been overwritten / truncated), and of course in case of errors.
  • In case of DB loss or recovery, the USN data is not restored, and a full scan is triggered to establish a sane baseline for subsequent backups.
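
For the curious, the rough idea behind the ChangeJournalData table is sketched below; the column names are illustrative and may not match the PR exactly:

```cpp
// Illustrative only: what a per-volume journal-state table could look like.
// The actual ChangeJournalData schema in the PR may differ.
#include <sqlite3.h>

static const char* kCreateChangeJournalData =
    "CREATE TABLE IF NOT EXISTS ChangeJournalData ("
    "  FilesetID   INTEGER NOT NULL,"   // fileset this state belongs to
    "  VolumeName  TEXT    NOT NULL,"   // e.g. C:\ -- volumes are tracked separately
    "  JournalID   INTEGER NOT NULL,"   // USN journal instance ID
    "  NextUSN     INTEGER NOT NULL,"   // position to resume reading from
    "  ConfigHash  TEXT    NOT NULL"    // hash of source list + filters;
                                        // a mismatch re-triggers a full scan
    ");";

int CreateChangeJournalTable(sqlite3* db)
{
    // Returns SQLITE_OK on success; error handling omitted for brevity.
    return sqlite3_exec(db, kCreateChangeJournalData, nullptr, nullptr, nullptr);
}
```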

TODO:

The USN journal records a limited number of changes, and if backups are spaced too far apart, full scans are required because the data will be incomplete. This has the following implications:

  • Frequent (real-time) backups avoid this problem. If nothing has changed, the fileset will be compacted. Duplicati may need optimizing if this becomes a common scenario (compacting before uploading).
  • Frequent backups result in numerous filesets, and this may interfere with the retention policy. It might make sense to allow many more filesets for the current day in the “automatic” retention policy mode, to avoid deleting changing data.
  • Less frequent backups with USN support could be made possible by scanning the USN journal at regular intervals and recording the changes, using a process / thread separate from the backup thread. When the backup is run, this data is then used instead of reading the journal at that time. There is no risk that modifications will be missed during reboots or while Duplicati is not running, as the USN numbers allow us to ensure that the journal is recorded without gaps. (See the sketch below.)
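
A sketch of what reading the journal since a stored position involves (plain Win32, simplified, not the PR code):

```cpp
// Read all USN records since a previously stored position and collect the
// touched file names. These names are then used to *filter* the source list;
// the files themselves are still scanned normally.
#include <windows.h>
#include <winioctl.h>
#include <string>
#include <vector>

std::vector<std::wstring> ReadChangesSince(HANDLE hVol, DWORDLONG journalId, USN startUsn)
{
    std::vector<std::wstring> touched;

    // READ_USN_JOURNAL_DATA on older SDKs.
    READ_USN_JOURNAL_DATA_V0 readData = {};
    readData.StartUsn     = startUsn;
    readData.ReasonMask   = 0xFFFFFFFF;   // interested in every change reason
    readData.UsnJournalID = journalId;

    BYTE buffer[64 * 1024];
    DWORD bytesRead = 0;

    // Each call returns the next USN (first 8 bytes) followed by USN_RECORDs.
    while (DeviceIoControl(hVol, FSCTL_READ_USN_JOURNAL,
                           &readData, sizeof(readData),
                           buffer, sizeof(buffer), &bytesRead, nullptr)
           && bytesRead > sizeof(USN))
    {
        readData.StartUsn = *reinterpret_cast<USN*>(buffer);

        BYTE* p   = buffer + sizeof(USN);
        BYTE* end = buffer + bytesRead;
        while (p < end)
        {
            auto* rec = reinterpret_cast<USN_RECORD*>(p);
            touched.emplace_back(
                reinterpret_cast<WCHAR*>(reinterpret_cast<BYTE*>(rec) + rec->FileNameOffset),
                rec->FileNameLength / sizeof(WCHAR));
            p += rec->RecordLength;
        }
    }
    // If DeviceIoControl fails with ERROR_JOURNAL_ENTRY_DELETED, the data
    // since startUsn is gone and a full scan is needed instead.
    return touched;
}
```
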
2 Likes

This all looks great to me, but honestly I know nothing about USN and this level of journaling. With that in mind I’m going to ask what’s likely a stupid question. :blush:

Does your implementation support a mix of drives with and without USN support?

1 Like

  • Multiple volumes are supported, and tracked separately.
  • USN support is only available for NTFS, and file name casing is handled accordingly for that file system (currently not case sensitive).
  • The actual behavior if USN isn’t available for one of the volumes (file system not NTFS, platform Linux, …) depends on the value of -usn-policy (a decision sketch follows the list):
    • off = USN not used
    • auto = Tries to use USN; silently falls back to a full scan (info message only) for any volume where USN cannot be used
    • on = Tries to use USN; falls back to a full scan for any volume where USN cannot be used, and issues a warning
    • required = Tries to use USN; aborts the backup with an error if it fails for any of the volumes
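
Roughly, the per-volume decision implied by these options could look like this (illustrative sketch only, not the actual implementation):

```cpp
// Sketch of the per-volume decision implied by -usn-policy.
// Names (UsnPolicy, usnAvailable, ...) are illustrative, not Duplicati's.
#include <stdexcept>

enum class UsnPolicy { Off, Auto, On, Required };
enum class ScanMode  { FullScan, UsnFiltered };

ScanMode DecideScanMode(UsnPolicy policy, bool usnAvailable,
                        void (*logInfo)(const char*), void (*logWarning)(const char*))
{
    if (policy == UsnPolicy::Off)
        return ScanMode::FullScan;

    if (usnAvailable)
        return ScanMode::UsnFiltered;

    switch (policy) {
    case UsnPolicy::Auto:
        logInfo("USN not available for this volume, falling back to full scan");
        return ScanMode::FullScan;
    case UsnPolicy::On:
        logWarning("USN not available for this volume, falling back to full scan");
        return ScanMode::FullScan;
    case UsnPolicy::Required:
        throw std::runtime_error("USN journal required but not available for this volume");
    default:
        return ScanMode::FullScan;
    }
}
```
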
2 Likes

Makes sense, thanks!

Are you thinking auto for the default setting?

Jon - I think we need to test this for some time, and once the implementation is found to be stable, it certainly makes sense to default to auto. But I don’t think it’s up to me to decide.

I suppose that might be a good idea. :wink:

I once worked with a person who decided they didn’t need to test because they “wrote it to work”. Spoiler alert - they were wrong. :smiley:

1 Like

Ken - just a quick question: when using the USN journal, the source size will be displayed incorrectly, as it’s only based on the changed files. I see two workarounds:

  • Can this information be calculated for a fileset from the DB? If yes: how?
  • I keep track of the source size in the new ChangeJournalData table, updating it based on the previously recorded size and the detected changes (sketched below).
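
To make the second option concrete, the bookkeeping would amount to something like this (names made up, just to illustrate the arithmetic):

```cpp
// Sketch of the "carry the size forward" bookkeeping for option 2.
// All names are illustrative; this only shows the arithmetic.
#include <cstdint>

struct SizeDelta {
    int64_t addedBytes;          // total size of files added since the last backup
    int64_t removedBytes;        // previously recorded size of files that were deleted
    int64_t modifiedOldBytes;    // previously recorded size of modified files
    int64_t modifiedNewBytes;    // newly scanned size of the same files
};

// New total = previous total + additions - removals + growth/shrink of modified files.
int64_t UpdateSourceSize(int64_t previousTotal, const SizeDelta& d)
{
    return previousTotal
         + d.addedBytes
         - d.removedBytes
         + (d.modifiedNewBytes - d.modifiedOldBytes);
}
```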

What’s your advice?

@Daniel_Gehriger … I would imagine this would only affect the first “full” backup. The rest are technically incremental. It’s not appropriate to use the USN for a first backup as it’s a rolling log.

Sam

This PR uses USN only if an uninterrupted change sequence since the previous (possibly full-scan) backup is available, because the journal is rolling, as you’ve mentioned. Whenever there is a gap, a full scan will be performed to establish a sane baseline.
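
For completeness, the gap check amounts to something like this (sketch only, not the PR code):

```cpp
// Sketch of the "uninterrupted sequence" check: the stored journal state is
// only usable if the journal instance is unchanged and no records between
// the stored position and the journal's oldest retained record were lost.
#include <cstdint>

struct StoredJournalState {
    uint64_t journalId;   // journal instance ID recorded at the last backup
    int64_t  nextUsn;     // position recorded at the last backup
};

struct CurrentJournalState {
    uint64_t journalId;   // from FSCTL_QUERY_USN_JOURNAL
    int64_t  firstUsn;    // oldest USN still present in the journal
};

// Returns true if the journal can be used; false means "do a full scan".
bool JournalIsUsable(const StoredJournalState& stored, const CurrentJournalState& current)
{
    if (stored.journalId != current.journalId)
        return false;                  // journal was deleted / recreated
    if (stored.nextUsn < current.firstUsn)
        return false;                  // oldest needed records already overwritten
    return true;
}
```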

Anyway, the size issue is not related to this. It’s due to Duplicati calculating the reported backup size based on the scanned files. But when the journal is used, unmodified files won’t be scanned; their information is carried over from the previous fileset. And since the database doesn’t store the file sizes, we don’t know how much data they represent.

Hence the question whether I should keep track of the total backup size in the DB.

Yep, you’re 100% spot on. Sorry I misunderstood your first comment. It would seem the only way to pull the accurate file sizes would be from the DB itself as that’s the only place they’re kept locally.

I’d very much like to encourage you to keep on working on this one. As a long-term CrashPlan user, real-time backup based on file system journaling, without the need to always scan the entire backup set, is for sure one of the very few features I’m missing in Duplicati. I’m very much looking forward to getting this feature implemented. You’re really awesome, guys.

By the way: I very much like your “safe” approach: never try to work around any issue with the integrity of the journaling information on new or changed files. If in any doubt, just do a full scan. In this case, safety beats performance.

1 Like

@abacus: this is already implemented and waiting to be merged into the main code base by @kenkendk. I am successfully running it on several systems in my office. It seems quite stable after about two weeks of testing, although I did have to fix a few issues.

One thing that needs improvement, though, is how the backup size is reported. :arrow_left: Fixed now.

1 Like

@Daniel_Gehriger … apologies for the basic question, but I am just wanting to confirm that your efforts are targeted at providing real-time/continuous backups for Windows platforms only - correct?

Are there any parallel efforts to support the same on Linux platforms?

Also, I am curious how this (real-time/continuous backup) may impact download requests from cloud storage (as some providers charge for each download - perhaps above a certain threshold?). As Duplicati appears to download with each scheduled backup task to validate some random files, will this “download and validate” behavior essentially be triggered more frequently in response to local file edits (resulting in increased download requests), or can that somehow be throttled?

TIA!

@skidvd:

  • The “real-time” backup feature has been merged into the main branch, and should be part of an upcoming release.
  • The current implementation is Windows only, and I am not working on extending it to Linux. However, the implementation is generic enough that one can build upon it to add Linux support (or rather, support for a journaled file system on Linux, as this is more file-system-dependent than OS-dependent).
  • “Real-time” is a bit misleading: this feature does not affect, in any way, the frequency of updates and hence connections to the cloud. What it does is shorten file scanning, reducing the time needed to find all new/modified/deleted files in the backup file set. Because backup time is reduced, one could increase the backup frequency to as little as every five minutes, hence achieving close to real-time backup. But this is not part of the feature, and it would be a conscious decision by the user.
1 Like

Work on this feature sounds excellent. What is involved in achieving this for Mac?