Moved Duplicati DB + source disk to another host - whole source data got recalculated

Hi,

I’m using Duplicati BETA 2.0.6.1 on host A and Duplicati BETA 2.0.6.3_2021-06-17 on host B. I had planned to put (file server) host A out of service, move the disk “exactly as is” to host B, and take Duplicati (and the existing DB + backup sets) along. I expected my backup job execution on host B to run through (with almost no kilobytes changed) within the normally logged job duration of about 8 minutes. BUT: it took about 5 hours.

Question:

  • Does Duplicati detect a host change, e.g. when the hostname changes from host A to host B?
  • Does it trigger a full re-read and re-calculation of file checksums by itself?
    – Under which preconditions?

So I did the following:

  • stopped all services on host A (iSCSI /datapool, Duplicati BETA 2.0.6.1)
  • unmounted /datapool so nothing could change from that moment on
  • installed fresh services on host B (iSCSI /datapool, Duplicati BETA 2.0.6.3)
  • mounted /datapool on host B (with exactly the same options host A used before)
  • ensured the fresh Duplicati install on host B was stopped (it was, after the install)
  • moved the Duplicati service config files from host A to host B (/etc/default/duplicati, /etc/systemd/system/duplicati.service.d/override.conf)
  • did not have to move my Duplicati .config dir because it lives at “/datapool0/.config/Duplicati” on the moved disk
  • on host B: “systemctl daemon-reload; systemctl enable duplicati; systemctl start duplicati”
  • Duplicati started fine, showed all my existing configuration and jobs
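
Condensed into a shell sketch, the move looked roughly like this (device name, mount options and the copy method are placeholders, not my exact commands):

    # host A: stop services and release the disk
    systemctl stop duplicati
    umount /datapool

    # host B: after the fresh install, attach the disk with the same options
    mount -o <same-options-as-on-host-A> /dev/<iscsi-device> /datapool
    systemctl stop duplicati    # the fresh install must not run yet

    # host B: carry over the service configuration from host A
    # (the Duplicati .config dir lives on the moved disk itself)
    scp hostA:/etc/default/duplicati /etc/default/
    scp hostA:/etc/systemd/system/duplicati.service.d/override.conf \
        /etc/systemd/system/duplicati.service.d/

    # host B: activate the service
    systemctl daemon-reload
    systemctl enable duplicati
    systemctl start duplicati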

What Duplicati did:

  • It saw that the (major file server “all files”) backup job was due and executed it by itself.
  • The job ran through successfully, but took much longer than normal. After /datapool became accessible to users again, only 2 files (a few kilobytes) were changed before the job executed.

Here is my log of the last backup execution on host A (“before moving disk + Duplicati to host B”):

[log screenshot]

Here is my log of the first backup execution on host B (notice the “changed: 5.75 KB”):

[log screenshot]

Here is a log of the next regularly scheduled backup execution on host B, which took the expected duration again:

[log screenshot]

Summary: I observe a backup job taking very long and re-reading all data for (from an admin’s perspective) no obvious reason, and then performing the fast “differential” scan+backup again as normally expected.

Can this behaviour be fixed, or is it intended that everything is re-checked on a “host change”? I double-checked that the chown/chmod/getfacl state of my mount point /datapool0 was not modified during “the move”. I also use Syncthing, which keeps its own database of file permissions and checksums, and it didn’t report a single changed/re-read file after I started it on host B with the existing database from host A.
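
For reference, the check was along these lines (a sketch; the snapshot file names are just examples):

    # snapshot owner/mode and ACLs of the mount point before the move (host A)
    stat -c '%U %G %a %n' /datapool0 > /root/datapool0.stat.before
    getfacl -p /datapool0 > /root/datapool0.acl.before

    # same snapshot after the move (host B), then compare
    stat -c '%U %G %a %n' /datapool0 > /root/datapool0.stat.after
    getfacl -p /datapool0 > /root/datapool0.acl.after
    diff /root/datapool0.stat.before /root/datapool0.stat.after
    diff /root/datapool0.acl.before /root/datapool0.acl.after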

Thanks for your answer.

Kind regards,
Catfriend1

I’ve hit that “full re-read although no files have changed” problem again and think I can shed some light on the cause now. When I moved the ZFS disk from host A to host B, it occurred for the first time. I later verified that I did NOT change any file data, file permissions, or file ACLs.

But: the move from my SAMBA domain member (host A) to the SAMBA AD DC (host B) caused “getfacl /zfs-storage/path/to/folder” on host B to output numeric UID/GID values instead of the expected “DOMAIN\username” and “DOMAIN\groupname” names I previously got on host A.

I then re-mapped the IDs via my SAMBA AD DC’s LDAP using the SAMBA NIS extensions, so that host B shows the “same state and output” again as host A did before the disk move.
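
To check whether the ID mapping is back in shape, output like this can be compared on both hosts (DOMAIN\username is a placeholder; wbinfo ships with Samba’s winbind):

    # resolve the account through winbind / NSS
    wbinfo -n 'DOMAIN\username'    # name -> SID
    wbinfo -i 'DOMAIN\username'    # passwd-style entry incl. the mapped UID
    id 'DOMAIN\username'           # what the system resolves the account to

    # getfacl should show names again instead of raw UID/GID numbers
    getfacl /zfs-storage/path/to/folder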

Now the incremental backups ran fine: very little data was re-read, and only the data that had really changed got read. One day later, I decided to grant another user access to my /storage/path/to/subfolder tree via setfacl, and on the next backup execution Duplicati again started to re-read everything in there.
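
The change itself was a recursive ACL grant along these lines (the user name and exact permissions are examples):

    # grant an additional user read access to the whole subtree; this updates
    # the ACL (and ctime) of every file in it, which Duplicati treats as a
    # metadata change and therefore triggers a full re-read
    setfacl -R -m u:newuser:rX /storage/path/to/subfolder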

I googled and found the Duplicati advanced options page, where it is mentioned that Duplicati re-reads files by default if the ACL or other metadata has changed.

My question now: Can I somehow force Duplicati to only re-read file contents if the file size or the modTime changed?

The documentation contains:
https://duplicati.readthedocs.io/en/latest/06-advanced-options/#skip-metadata

--skip-metadata = false
Use this option to disable the storage of metadata, such as file timestamps. Disabling metadata storage will speed up the backup and restore operations, but does not affect file size much.
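
If I understand the CLI correctly, the option would be passed like any other advanced option (a sketch; the storage URL and source path are placeholders):

    # sketch only: disable metadata storage for a backup run
    duplicati-cli backup "file:///mnt/backup-target" /datapool0 --skip-metadata=true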

Does this completely turn off backing up metadata like ACLs and modTime in Duplicati? Or is the metadata still written to the backup, just not used for deciding whether a file has to be re-read completely?

A quick note: I don’t want to lose ACL metadata when backing up with Duplicati. Is there maybe a better option/way to speed up the backup process after I have just changed ACLs on a whole subtree?

Does Linux have ACL inheritance like NTFS does (probably the typical case), or is it set on every file, meaning setfacl --recursive I suppose? I don’t use Linux ACLs, but if it’s kept per-file, there’s also your request above (not wanting to lose ACL metadata), which I think would be violated if you went the --skip-metadata route.

EDIT:

check-filetime-only

--check-filetime-only = false
This flag instructs Duplicati to not look at metadata or filesize when deciding to scan a file for changes. Use this option if you have a large number of files and notice that the scanning takes a long time with unmodified files.

can reduce the scanning time, but it will defeat the objective of noticing and backing up metadata changes.
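
If you want to try it anyway, it is set like any other advanced option, e.g. on a CLI run (a sketch; the storage URL and source path are placeholders):

    # decide re-scans by file timestamp only, ignoring size and metadata changes
    duplicati-cli backup "file:///mnt/backup-target" /datapool0 --check-filetime-only=true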

BTW, the log at verbose level shows the reason(s) why a file was scanned. An example of such output is:

2021-06-27 08:41:01 -04 - [Verbose-Duplicati.Library.Main.Operation.Backup.FilePreFilterProcess.FileEntry-CheckFileForChanges]: Checking file for changes C:\PortableApps\Notepad++Portable\Data\Config\backup\webpages.txt@2021-06-23_193314, new: False, timestamp changed: True, size changed: True, metadatachanged: True, 6/27/2021 12:31:52 PM vs 6/26/2021 3:48:08 PM

and this probably comes from the FilePreFilterProcess code in the Duplicati source, where you can look at the rest of the decision logic if you like.
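
To capture that kind of output yourself, a verbose log file can be configured on the job or a CLI run (a sketch; the log path is just an example):

    # write the scan decisions (CheckFileForChanges lines) to a verbose log
    duplicati-cli backup "file:///mnt/backup-target" /datapool0 \
        --log-file=/var/log/duplicati-job.log \
        --log-file-log-level=Verbose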


Thank you @ts678, those are very helpful and easy-to-understand insights for a newbie like me :). They definitely help me use the product the right way, as it is.

If it “fits the project / code”, maybe a --decide-changed-by-filetime-only option that still stores metadata for restore could be introduced somewhere in the future. But even without it, Duplicati is great for me!