What files are actually read from destination during a backup?

mihi · April 24, 2022, 9:07pm

I’m currently trying to move the Windows backup of my laptop from Windows File History to Duplicati. (File History forgot one of my target paths during a Windows feature upgrade, so I am no longer confident in using it.)

At first glance, it looks great, I like the UI and the backups are (thanks to deduplication) smaller than File History’s.

However, it does not seem to be possible to run a backup when the destination folder is not available (File History has an option to cache changed files in a cache directory and then upload them when the destination folder is available again). That does happen, since it is a laptop that is not always in my home network and not always online. Still, I’d like to be able to restore file versions that were backed up while I was offline.

My question: From the design of Duplicati, would such a feature be possible? I noticed that the destination folder contains lots of large dblock files (totalling about 100GB after the initial backup, which is larger than the free space on my local drive) and lots of small dindex and dlist files (totalling about 100MB). If the dblock files are not read during a backup, one could try to implement a backend that stores dindex/dlist files in two places and dblock files only once (initially on local disk, but moved to remote storage when it is available again). Having the small files permanently on my disk would not be a problem; it’s the large files that I’d like to have offloaded.
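To check how the space splits between the large and small files, a minimal Python sketch can total the sizes per volume type. It assumes only that the file names contain `dblock`, `dindex` or `dlist` as a dot-separated part; the function names are made up for illustration.

```python
from collections import defaultdict
from pathlib import Path

KINDS = ("dblock", "dindex", "dlist")


def volume_kind(filename):
    """Return 'dblock', 'dindex' or 'dlist' if it is a dot-separated part of the name."""
    parts = filename.split(".")
    for kind in KINDS:
        if kind in parts:
            return kind
    return None


def sizes_by_kind(folder):
    """Sum the sizes (in bytes) of the files directly inside `folder`, per volume kind."""
    totals = defaultdict(int)
    for entry in Path(folder).iterdir():
        if entry.is_file():
            kind = volume_kind(entry.name)
            if kind is not None:
                totals[kind] += entry.stat().st_size
    return dict(totals)
```

Calling `sizes_by_kind` on the destination folder returns a dictionary of byte totals, which shows how much would have to stay on the local disk if only the dindex and dlist files were kept there.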

For restore, it is obviously fine for me to have access to the destination - that is not different with File History, either.

ts678 · April 25, 2022, 12:08am

Welcome to the forum @mihi

You’re very unlikely to hear from the original developer. You might not hear from any. They’re scarce.
If you have any interest in code, test, docs, or forum, volunteers are what keep things in motion here.

Backup only uploads changed blocks, but the database records all blocks and what dblock they’re in.
Blocks are known by their hash. Look in your Block table with an SQLite browser if you want to see it.
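If an SQLite browser is not at hand, a short Python sketch can show what the Block table looks like. It assumes nothing about the columns and simply reports whatever the job database contains; `describe_table` is a made-up helper name, and it is safest to run it on a copy of the database while no backup is in progress.

```python
import sqlite3
from pathlib import Path


def describe_table(db_path, table="Block"):
    """Return (column_names, row_count) for `table`, opening the database read-only."""
    # mode=ro keeps the job database from being modified by accident.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    conn = sqlite3.connect(uri, uri=True)
    try:
        columns = [row[1] for row in conn.execute(f'PRAGMA table_info("{table}")')]
        if not columns:
            raise ValueError(f"no table named {table!r}")
        (count,) = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()
        return columns, count
    finally:
        conn.close()
```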

That the backup is larger than the free space on your local drive is unfortunate, as one easy path is to have a local backup and then copy it to the remote with rclone or something.
Similar solution might be possible using some cloud sync software such as OneDrive or Google Drive.

You can view a live log at About → Show log → Information for a small incremental backup to see the backend activity. Usually there will be a list to see if things look sane, then a series of put of dblock interleaved with their dindex files (one per dblock), then a dlist at the end to show what’s in backup.

There can be some delete operations for dlist files that should be removed due to retention policy, and there’s a final list to make sure things still look sane. There will typically be three get operations to verify actual file contents.

As backup versions get deleted by retention policy, compact may run to download, repack, and upload.

I think it’s possible to disable everything except the core backup which only uploads files, but the other things improve reliability and reduce wasted space, so be careful. I don’t recommend disabling all that.

What current remote backend would this one go with, or is the idea that it could wrap any existing one?
The rclone crypt remote looks like such a wrapper, but I don’t think Duplicati has any such concept in it.

Duplicati allows making of dblocks for upload up to the asynchronous-upload-limit. I’m not sure if it can make use of them in next backup after an interruption. If it can, you might be able to build on that ability because it’s already an upload queue – the difference is that on normal completion, it’s fully emptied…

https://github.com/duplicati/duplicati/blob/de13cbcbd0f85492e8b8603def0ced7d7472a8e4/Duplicati/Library/Main/Operation/BackupHandler.cs#L532-L534

          // Wait for upload completion
          m_result.OperationProgressUpdater.UpdatePhase(OperationPhase.Backup_WaitForUpload);
          var lastVolumeSize = await FlushBackend(m_result, uploadtarget, uploaderTask).ConfigureAwait(false);

ts678 · April 25, 2022, 12:23pm

What languages are you proficient in? While adding good C# developers would greatly help Duplicati, extreme modifications at this point (for any feature) would go against the goal of getting Duplicati to Stable.

You might be better off trying to do this externally (which might make support harder, but at least cannot break Duplicati for regular users). For example, the OneDrive route may keep local space under control through careful use of the attrib command to set everything online-only sometime before your backup:

Query and set Files On-Demand states in Windows

Scripting options can run your scripts before and after backup, but writing the scripts would be up to you.
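As a rough idea of what such a before-backup script could do, here is a minimal Python sketch that marks a OneDrive folder as online-only. It assumes the `attrib` meaning of `+U` (unpinned, online-only) and `-P` (not pinned) from the Files On-Demand article; the function names are made up, and the folder path is whatever your OneDrive backup folder happens to be.

```python
import subprocess
import sys


def online_only_command(folder):
    """Build the attrib call that marks everything under `folder` as online-only."""
    # +U sets "unpinned" (online-only); -P clears "pinned" (always keep on this device).
    # /S recurses into subfolders, /D applies the change to folders as well.
    pattern = folder.rstrip("\\/") + "\\*"
    return ["attrib", "+U", "-P", pattern, "/S", "/D"]


def free_up_space(folder):
    """Run the attrib command; intended to be called from a before-backup script."""
    if sys.platform != "win32":
        raise RuntimeError("attrib is only available on Windows")
    subprocess.run(online_only_command(folder), check=True)
```

A script that calls `free_up_space` on the backup folder could then be hooked in through the run-script-before option, so that files uploaded since the previous run are released from the local disk before new ones are written.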

A more ambitious scripting effort would use the Rclone destination storage type to ideally use an existing rclone feature to achieve this (but I don’t think it’s quite there yet – I could be wrong), or just as an API for some program that you write that supports the little bit of the rclone command syntax that Duplicati needs.

https://github.com/duplicati/duplicati/tree/master/Duplicati/Library/Backend/Rclone

Easiest and most reliable approach might be to get a drive that’s big enough to hold your backups locally. Let someone else’s cloud sync software deal with (and be responsible for) simulating a write-then-list FS. Drawback of not writing directly to the destination is that delayed or partial uploads add some uncertainty.

mihi · April 25, 2022, 8:30pm

I don’t mind. Your answers have been very valuable.

Do you know if these get operations always fetch files from the current backup run, or may they also fetch older ones?

I’ve read that the verify and compact step can be run manually, so it may be an option to exclude those from the default backup and run them manually when needed or storage gets scarce (and destination is available).

[quote]
What current remote backend would this one go with, or is the idea that it could wrap any existing one?[/quote]

Wrapping any backend (i.e. a “decorator”) would be a nice idea, but I assume from the UI that the list of options for a backend needs to be fixed. So one would need to “explode” the number of backends.

For my own usage (backups go to a USB hard disk or to local network via SMB), the File backend would be sufficient.
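To make the idea concrete, here is a minimal Python sketch of such a wrapper around two plain folders. It is not Duplicati code: the class and method names are made up, the real backend would have to be written in C# against Duplicati's interfaces, and the only assumption about file names is that they contain `.dblock.`, `.dindex.` or `.dlist.`.

```python
import shutil
from pathlib import Path

SMALL_KINDS = (".dindex.", ".dlist.")


class SpoolingFolderBackend:
    """Hypothetical wrapper: small files live in two places, dblocks are spooled."""

    def __init__(self, remote_dir, local_dir):
        self.remote = Path(remote_dir)
        self.local = Path(local_dir)
        self.local.mkdir(parents=True, exist_ok=True)

    def remote_available(self):
        return self.remote.is_dir()

    def put(self, source_path):
        """Store a file: always on the remote if reachable, locally if small or offline."""
        source = Path(source_path)
        is_small = any(kind in source.name for kind in SMALL_KINDS)
        online = self.remote_available()
        if online:
            shutil.copy2(source, self.remote / source.name)
        if is_small or not online:
            shutil.copy2(source, self.local / source.name)

    def list(self):
        """Names visible right now; offline this misses dblocks already moved away."""
        names = {p.name for p in self.local.iterdir() if p.is_file()}
        if self.remote_available():
            names |= {p.name for p in self.remote.iterdir() if p.is_file()}
        return sorted(names)

    def flush(self):
        """Copy spooled files to the remote, then drop local dblocks. Returns copies made."""
        if not self.remote_available():
            return 0
        copied = 0
        for path in sorted(self.local.iterdir()):
            if not path.is_file():
                continue
            target = self.remote / path.name
            if not target.exists():
                shutil.copy2(path, target)
                copied += 1
            if ".dblock." in path.name:
                path.unlink()
        return copied
```

The weak spot is `list` while offline: dblock files that were already moved to the remote are no longer visible, so a real implementation would have to remember their names to keep the list checks happy.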

That’s also an interesting idea, but I wouldn’t expect the queue to last across reboots (or forced reboots/crashes due to the laptop’s battery going flat).

C# and Java are the languages I’m most comfortable coding in. If needed, short Perl or Python scripts are also possible.

I was assuming that backends were pluggable (so do not need to be added to the binary at compile time), but from your reply I guess that this assumption was wrong.

That sounds like a nice option, too.

When talking about money vs. time tradeoffs, purchasing a commercial backup software with the right feature set would also be an option. But first I’d like to explore the options that are zero money and only my time (with the added benefit that I might learn something interesting during the process).

Thank you again for your very valuable feedback.

mihi

ts678 · April 25, 2022, 9:24pm

I’m not certain the Verify result is entirely equivalent when run later. You might miss some problems that could have been cleaned up at start of next backup, except you didn’t. Any damage may become worse.

https://github.com/duplicati/duplicati/blob/de13cbcbd0f85492e8b8603def0ced7d7472a8e4/Duplicati/Library/Main/Operation/FilelistProcessor.cs#L34-L39

          /// <summary>
          /// Helper method that verifies uploaded volumes and updates their state in the database.
          /// Throws an error if there are issues with the remote storage
          /// </summary>
          /// <param name="database">The database to compare with</param>
          public static void VerifyLocalList(BackendManager backend, LocalDatabase database)

I think you’ll see this get especially active when a backup is interrupted before it runs to a clean ending.

Manual compact should be fine, I think.

The challenge also shows up in the Target URL used by command line and GUI internally. Some syntax would have to be invented to say have-this-wrap-that. GUI could use same plan. It still takes much work.

Branding and OEM customization talks about custom backend configuration but also says the following:

Since the backends are loaded dynamically this would simply require that the unwanted backend files are deleted before repacking and signing.

I don’t think they’re as easily added as removed. You can look at how Rclone backend was added here.

It’d be interesting to know what (if anything) has this use case covered (and with what usage limitations).

Glad to help.

mihi · April 25, 2022, 10:08pm

That looks quite good. Most changes outside the Rclone dll itself are made to hook up the project in the build process and make sure it is present when debugging. The only real changes in core are in the javascript and html of the web interface (which I probably could get around by importing my configuration from JSON file).

The loader looks at each DLL in the main directory and the backend directory and checks whether any of the DLL’s exported types implements the IBackend interface and has a no-arg constructor. So just dropping a DLL with the correct exported type in there should be sufficient to add a single backend, which makes it really easy to build. The disadvantage of this approach (compared with the more traditional PluginLoader approach, where the exported type is an IPluginLoader implementation that decides what to actually load) is that you cannot dynamically register multiple plugins from a single DLL (which was my initial idea to “wrap each plugin that is already there”). But anyway, I think this approach is still simpler than “hacking up” a fake rclone binary.

ts678 · April 26, 2022, 1:56am

You’re beyond me on the C#, but I’d counter the simplicity claim. Here’s my one-line core:

system('C:\tmp\rclone.exe', @ARGV);

and my full Perl script (a test program) is here, basically just something to do random fails.

rclone storage type may hang when rclone errors unless disable-piped-streaming is used #4328
was a problem that the script found, but I don’t think it needs the script (just needs rclone).

Another mild drawback of going through Rclone backend is that it’s an IBackend but not an IStreamingBackend, which means that its capabilities are a bit less than the usual backend.

Obviously, do it whichever way you like. I’m just claiming a script wrapper might just need to
edit the rclone command line for an upload depending on which location was available then.
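A minimal Python sketch of that kind of wrapper follows. The exact arguments Duplicati passes to rclone are not shown here, so the sketch only assumes that the destination appears as an argument starting with a remote name; `smb:` and `spool:` are made-up rclone remote names, and the rclone path is the one from the one-liner above.

```python
import os
import subprocess

REAL_RCLONE = r"C:\tmp\rclone.exe"  # same path as in the Perl one-liner
PRIMARY = "smb:"     # hypothetical rclone remote for the real destination
FALLBACK = "spool:"  # hypothetical rclone remote pointing at a local spool folder


def rewrite_args(args, primary_online):
    """Swap the primary remote prefix for the fallback when the primary is offline."""
    if primary_online:
        return list(args)
    return [FALLBACK + a[len(PRIMARY):] if a.startswith(PRIMARY) else a for a in args]


def run_wrapped(argv, probe_path):
    """Call the real rclone with the (possibly rewritten) arguments; return its exit code."""
    # Assumption: the destination is reachable exactly when probe_path is a directory.
    online = os.path.isdir(probe_path)
    return subprocess.call([REAL_RCLONE, *rewrite_args(argv, online)])
```

Passing the exit code through matters because an error lets Duplicati retry, whereas a wrapper that swallowed failures would hide them.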

There are lots of things to consider, e.g. what happens if SMB dies halfway through upload?
I’m not sure if that returns an error or a hang. An error would be better, as Duplicati will retry.

ts678 · April 26, 2022, 3:20pm

Continuing the descent into possible oversimplification, you could move a directory symlink:

mklink /d destination smb
rmdir destination
mklink /d destination local

My folder named smb isn’t really smb for this test, but I found documentation saying it will work.
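The same switch can be scripted. This is a minimal Python sketch with made-up function names that repoints the link depending on whether the preferred folder is reachable; on Windows, creating symlinks may need an elevated prompt or Developer Mode.

```python
import os


def repoint(link, target):
    """Make the directory symlink `link` point at `target`, replacing an existing link."""
    if os.path.islink(link):
        os.unlink(link)
    elif os.path.exists(link):
        # Refuse to touch a real folder that happens to sit where the link should be.
        raise FileExistsError(f"{link} exists and is not a symlink")
    os.symlink(target, link, target_is_directory=True)


def choose_destination(link, preferred, fallback):
    """Point `link` at `preferred` if that folder is reachable, otherwise at `fallback`."""
    target = preferred if os.path.isdir(preferred) else fallback
    repoint(link, target)
    return target
```

Run before each backup, `choose_destination("destination", "smb", "local")` would mirror the three commands above, with the caveat that files written to the fallback still have to be moved to the real destination later.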

I sure wouldn’t want to try to fake the list while offline this way, but Perl/Python might be able.
I’ve only used Python for a few days (another test tool…) but I see mine has an sqlite3 module.

You seemed willing to try doing what I call “flying blind” though, so for testing I used this option: