Filtering other jobs from source: a catch-all job

Hi all,

I’m setting up a backup for my hard drive (some 2 TB of data), and I’ve read in many places that a few small jobs are “better” (faster, safer) than a single huge job.

I have folders, such as Music, that hold a lot of data that rarely changes, so I am backing them up in separate jobs.

It would be nice to have a “project” to coordinate those single jobs, but I imagine that would require a lot of work to implement.

So I am manually creating a “catch-all job” that backs up everything the other jobs leave out.

In this job I’m manually excluding each folder through a filter, but it would be easier to just filter out “everything that should be backed up by job XYZ”.
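For reference, the filter list of my catch-all job currently looks something like this (the paths are just examples, written as Duplicati exclude filters):

```
--exclude="D:\Music\"
--exclude="D:\Video\"
--exclude="D:\Pictures\"
```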

If I understand your idea correctly, you’re proposing the ability for job B to be told “don’t back up anything already covered by job A”. Is that correct?

If so, doesn’t that cause a potential “explosion” in the backup size of job B if job A is deleted?

Indeed.

The purpose is to create a “whole disk” backup without ending up with a single 2–3 TB backup job. For example, one could create:

Job 1: Music
Job 2: Video
Job 3: Pictures
Job 4: Everything else that was not backed up by Jobs 1, 2, and 3

That would be equivalent to having a single “main job” that splits the backup into multiple smaller targets.

From my understanding of how Duplicati works (I’m still learning C# so I can contribute), that seems to be the easiest way to implement it: it only requires a (meta-?)filter that reads the other job’s config file and enqueues the appropriate parts into the current job’s filters. All the parsers for config files and filters already exist.

A nice refinement would be to exclude the paths/files included in the other job, but NOT to re-include the paths specifically excluded in the other job. For example, I set up the “Pictures” job to exclude the “.thumbnail” subfolders; those subfolders should not be backed up by the main job either. A rough sketch of what I mean follows below.
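Here is a minimal C# sketch of that logic. The JobConfig and FilterRule types are hypothetical stand-ins, not Duplicati’s actual classes (I don’t know the real API yet):

```csharp
// Minimal sketch of the "meta-filter" idea. JobConfig and FilterRule
// are hypothetical stand-ins, not Duplicati's actual classes.
using System.Collections.Generic;
using System.Linq;

public record FilterRule(bool Include, string Expression);

public record JobConfig(IReadOnlyList<string> SourcePaths,
                        IReadOnlyList<FilterRule> Filters);

public static class CatchAllFilters
{
    // Everything the other job covers becomes an exclude here: its
    // source paths, plus any include-filters it defines. Its
    // exclude-filters are deliberately NOT re-included, so paths like
    // the ".thumbnail" subfolders stay out of the catch-all job too.
    public static IEnumerable<FilterRule> ExcludesFrom(JobConfig otherJob)
    {
        foreach (var path in otherJob.SourcePaths)
            yield return new FilterRule(Include: false, Expression: path);

        foreach (var rule in otherJob.Filters.Where(f => f.Include))
            yield return new FilterRule(Include: false, Expression: rule.Expression);
    }
}
```

The catch-all job would then prepend the result of ExcludesFrom for each of the other jobs to its own filter list.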

What do you think of it?

I guess this is a feature that would be very tricky to implement. In general, I think it’s not recommended to create backup jobs that depend on other backup jobs. I also expect a lot of ambiguous situations in how it would behave:

  • What to do if one of the other backup jobs is deleted? Include all files from that job in the “Everything else” job? That would cost a lot of storage space on the backend, which is probably not intended.
  • What to do with filters? A single backup job can have many filters, each of which can use advanced syntax such as regular expressions. It can be very difficult to calculate and understand which files are actually included or excluded in the “Everything else” backup job.
  • How to handle cross-job filters? In backup job A, a filter that excludes the extensions JPG and PNG could be applied. This filter is probably unintended for job B, which contains your picture collection. On the other hand, you may want to exclude “.thumbnail” folders in all backup jobs.
    And what to do with conflicting filters in individual backup jobs? If backup job A excludes *.jpg and backup job B includes that extension, what should the “Everything else” backup job do? (See the example after this list.)
  • How to handle backup jobs whose source selections overlap but whose filters differ? Which source selection and/or filters should apply to the “Everything else” job?
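To make the conflicting-filters question concrete, here is a hypothetical pair of jobs, reusing the illustrative JobConfig and FilterRule types from the sketch earlier in the thread:

```csharp
// Job A archives a folder but deliberately skips photos.
var jobA = new JobConfig(
    SourcePaths: new[] { @"D:\Archive\" },
    Filters: new[] { new FilterRule(Include: false, Expression: "*.jpg") });

// Job B backs up the picture collection and explicitly includes photos.
var jobB = new JobConfig(
    SourcePaths: new[] { @"D:\Pictures\" },
    Filters: new[] { new FilterRule(Include: true, Expression: "*.jpg") });

// A photo stored under D:\Archive\ is excluded by job A and never seen
// by job B. Whether the "Everything else" job should pick it up has no
// single obvious answer.
```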

I guess this feature would make things more complicated instead of easier in many situations. In most cases, it’s much easier to manually exclude the folders that are defined as sources in the other backup jobs.

That being said, it is generally not recommended to back up a complete hard disk/volume. Instead, select the folders that contain data you don’t want to lose and use those folders as the backup source. Including OS and system files in your backup will cost a lot of backend storage space for files you will never need.

I think I understand why you want such a feature, but I agree with @kees-z that it would be very tricky to implement.

Just out of curiosity, are you thinking of such a feature solely because of performance issues with very large backups?

If so, it might make more sense to spend time fixing that known issue rather than adding functionality solely to “get around” it. However, if there are other use cases beyond that, then it might make sense to look more closely at the idea.