Ignore file like .gitignore

verglor · October 27, 2018, 1:25pm

Hi,

I think it would be great if it were possible to define which files to ignore in a file placed in any folder (e.g. .backupignore).
It could work just like .gitignore works.
It’s quite hard to set up exclusions globally, since in each folder we may need to ignore something else.

Thanks.

JonMikelV · October 28, 2018, 1:38pm

Hi @verglor, welcome to the forum - and thanks for the suggestion!

There are a few similar discussions (such as Support pluggable filter addons to ease implementation of special filtering and [Suggestion] Filter “Exclude folder if it contains a file xy” just like .nomedia) but I believe the main reason there isn’t a .duplicatiignore type feature is due to how Duplicati processes folders.

The way Duplicati currently processes folder structures is that once a folder is ignored, nothing inside that folder is scanned. This is done for performance reasons, but a side effect is that an ignore filter at a higher folder might cause ignore filters at lower folders to never actually be seen.
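To make the side effect concrete, here is a minimal Python sketch of a scan that prunes ignored folders. It is not Duplicati's actual code; `scan` and the `is_ignored` callback are hypothetical names used only for illustration.

```python
import os

def scan(root, is_ignored):
    """Yield file paths under root, never descending into ignored folders."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Pruning dirnames in place stops os.walk from entering those folders,
        # so an ignore file stored inside a pruned folder is never even read.
        dirnames[:] = [d for d in dirnames
                       if not is_ignored(os.path.join(dirpath, d))]
        for name in filenames:
            path = os.path.join(dirpath, name)
            if not is_ignored(path):
                yield path
```

With a rule that ignores a folder named cache, anything below cache (including a .backupignore placed there) would be skipped without being opened.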

It can also start getting messy if people have multiple backup jobs looking at different content in the same sets of folders.

verglor · October 29, 2018, 6:15pm

Hi @JonMikelV,

I think that .gitignore ignores only files since there is not really a concept of folders in git. In that case there should not be a problem with ignored folders since only files are ignored.

However, in Duplicati, if an exclusion ends with / it may be interpreted as ignoring the whole folder. Once the folder is ignored, any rule at a higher level that ignores any part of that folder is irrelevant, since everything in that folder is already ignored. And since a .gitignore can ignore only the content of its folder (not the folder itself), there won’t be any problem with not scanning an ignored folder.

Multiple backup jobs should also not be a problem, since this ignore file would mean that its exclusions apply to any backup - it’s a kind of folder metadata, independent of any specific job (or even of the backup software, if the format is equally understood by all of them - hence my initial suggestion of the software-agnostic file name .backupignore).

The possibility to specify the name of the ignore file (per job?) would also be nice to have: one could reuse .gitignore if they don’t want to back up untracked files, or choose a different name if that is not the case.

My use case is that I need to back up a folder where each subfolder contains application data for a Windows program. Many programs save some kind of cache or temporary files there (with huge size) with completely different naming, so it is quite cumbersome to exclude them at the job level.

JonMikelV · October 29, 2018, 6:34pm

I like the idea of a job-specific .ignore feature, but implementing it could be a pain.

Would your requirements be met if Duplicati paid attention to the contents of .gitignore files themselves?

verglor · October 29, 2018, 8:36pm

Ignoring according to .gitignore (in any folder) would be a huge step forward indeed. However, I think that the possibility to specify the filename instead of hardcoding .gitignore should be quite easy in comparison to that.

JonMikelV · October 29, 2018, 9:03pm

Oh, right - there I go overthinking things again.

OK - are we talking something like this?

If .duplicati-ignore exists, ignore all files AND subfolders
If .duplicati-ignore-files exists, ignore all the files in the folder but continue processing subfolders
If .duplicati-ignore-folder exists, process all the files in the folder, but ignore subfolders
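The three marker-file rules above can be sketched in Python as follows. This is only an illustration of the proposal; the function name `apply_markers` is hypothetical, and whether the marker files themselves end up in the backup is left open here, as it is in the thread.

```python
def apply_markers(files, subdirs):
    """Return the (files, subdirs) that remain after applying marker files."""
    names = set(files)
    if ".duplicati-ignore" in names:
        # Ignore all files AND subfolders.
        return [], []
    if ".duplicati-ignore-files" in names:
        # Ignore the files in this folder but keep processing subfolders.
        files = []
    if ".duplicati-ignore-folder" in names:
        # Process the files in this folder but ignore subfolders.
        subdirs = []
    return list(files), list(subdirs)
```

For example, `apply_markers([".duplicati-ignore-files", "a.txt"], ["sub"])` returns `([], ["sub"])`.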

Questions:

should symlinks be treated as files or folders?
should pointers to cloud storage files be treated as files or folders?
should the .duplicati-ignore* files themselves be included in the backups?

verglor · October 29, 2018, 9:29pm

I hoped that the ignore file (.gitignore or .duplicatiignore or user specified filename) could work really like .gitignore works - there will be only one ignore file in any folder and exclusions will be defined inside the file ideally with syntax compatible with git’s .gitignore.

Maybe my previous post was a little confusing. I meant that once parsing of .gitignore and ignoring according to that is done, then the possibility to specify custom name of ignore file should be the easier part.

should symlinks be treated as files or folders?

I think symlinks should be handled transparently. Folder symlinks handled as folders and file symlinks as files.

should pointers to cloud storage files be treated as files or folders?

Honestly I didn’t know there is such thing in Duplicati, but I guess it could be handled like symlinks.

should the .duplicati-ignore* files themselves be included in the backups

I believe that ignore files should definitely be backed up just like .gitignore is always tracked in git. From my perspective it is folder’s metadata that must be retained.

molecular_eskimo · October 29, 2018, 9:31pm

I think the way .gitignore files work is more along the lines of read through the .gitignore file (only one in any single directory), and apply the filter therein to the current directory.

So if you’re in a directory with .duplicati-ignore, dirA, dirB, fileA, and fileB and the contents of .duplicati-ignore is

#System data I don't want to back up
/dirA/

#Some other file that I don't mind losing
fileB

then you’d add fileA to the list to be backed up and descend into dirB.
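A rough Python sketch of that example follows. It handles only a small subset of the .gitignore syntax (comments, blank lines, a leading /, a trailing / for folders, and simple wildcards) and applies the rules to a single directory; negation with ! and patterns containing inner slashes are not supported. The names `parse_ignore` and `filter_entries` are hypothetical.

```python
from fnmatch import fnmatchcase

def parse_ignore(text):
    """Parse ignore-file text into (pattern, dir_only) rules."""
    rules = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        dir_only = line.endswith("/")  # a trailing slash matches folders only
        rules.append((line.strip("/"), dir_only))
    return rules

def filter_entries(files, dirs, rules):
    """Drop the files and folders of one directory that match any rule."""
    def ignored(name, is_dir):
        return any(fnmatchcase(name, pattern) and (is_dir or not dir_only)
                   for pattern, dir_only in rules)
    kept_files = [f for f in files if not ignored(f, False)]
    kept_dirs = [d for d in dirs if not ignored(d, True)]
    return kept_files, kept_dirs
```

Applied to the directory above, fileB and dirA are dropped, while fileA, dirB and the ignore file itself are kept.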

JonMikelV · October 29, 2018, 10:01pm

So I suppose the easiest way to implement this would be to add it into the initial folder scan and say that if an ignore file exists, parse it into “standard” Duplicati exclude rules then continue with the folder.
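As a sketch of that translation step, the Python below turns the lines of an ignore file into absolute exclude paths for the folder being scanned. The output format (absolute paths, with a trailing slash marking a folder) is an assumption made for illustration, not a statement of how Duplicati stores its filters, and `to_exclude_rules` is a hypothetical name. POSIX-style paths are used to keep the example deterministic.

```python
import posixpath

def to_exclude_rules(folder, ignore_text):
    """Translate ignore-file lines into absolute exclude paths for one folder."""
    rules = []
    for line in ignore_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        is_dir = line.endswith("/")
        path = posixpath.join(folder, line.strip("/"))
        # Assumed convention: a trailing slash marks a folder exclusion.
        rules.append(path + "/" if is_dir else path)
    return rules
```

The resulting list could then be merged into the job's existing exclude rules before the scan continues into the folder.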

That shouldn’t play too much havoc with the progress bar…

Of course that means logging of filter exclusions might start to include ignore exclusions as well…

verglor · October 29, 2018, 10:49pm

This sounds good.
Are you going to implement it?

JonMikelV · October 30, 2018, 3:45am

Erm… It shouldn’t take much of a look around to realize I’m more of an enabler than a doer.

verhoek · October 30, 2018, 12:58pm

Git has file-based configuration, like the .gitignore mentioned here. Duplicati stores its configuration in an SQLite database. Having configurations in two places can make things complicated and is not so nice from a design perspective. What happens if the ignore filter stored in SQLite is not aligned with the .duplicatiignore files? I guess it would make sense to take the union of the ignore rules. I think there could be a lot of subtleties here, also with respect to restoring (previous backup sets without these additional filters) and handling these files.

Though I understand it’s easier, and more dynamic, to just edit some files locally.

molecular_eskimo · October 30, 2018, 2:30pm

What happens if the ignore filter stored in SQLite is not aligned with the .duplicatiignore files?

I’d think this feature should be an additional layer on top of Duplicati’s filter since the .gitignore file never specifies files/directories to include, only ones to filter out of the data set. If I set Duplicati to back up C:\Users\myUser\ but have a .duplicati-ignore file in there that lists /Downloads/ then it feels intuitive to understand that Duplicati shouldn’t back up C:\Users\myUser\Downloads\ or anything in it.

also with respect to restoring (previous backup sets without these additional filters)

I believe the way git handles this would translate to anything backed up that is later added to an ignore file is still valid to restore as long as that backup exists, just stop recording changes if it’s listed in an ignore file.

Come to think of it, would it be easier to add a multi-line input option to the UI when specifying filters? It’d allow someone to keep a personal “.duplicati-ignore”-type file and just copy/paste the list into the existing filter option to minimize impact on the application.

EDIT: This already exists. I guess the question is less about entering filter parameters and more about updating filter parameters (potentially across multiple backup jobs) in a convenient way as new files/directories are created? Or maybe, like myself, @verglor was unaware of this option?

JonMikelV · October 30, 2018, 4:04pm

I think a goofier example would be what happens when a user specifically says to include C:\Users\myUser\Downloads\ but then creates C:\Users\myuser\.duplicatignore that says to IGNORE the Downloads folder.

I’m not sure how Duplicati handles that now with included folders vs. exclude filters, but I would assume the similar .duplicatignore scenario would be handled the same way.

drakar2007 · October 30, 2018, 4:08pm

FWIW: if it were me, i’d make the naming convention slightly more clear on this piece and call it .duplicati-ignore-subfolders.

verglor · October 30, 2018, 6:02pm

I would consider .duplicatiignore file as part of folder’s metadata saying what should not be backed up in any circumstance. From Duplicati’s perspective files and folders ignored in ignore files don’t exist. It is lower level than exclusions configured in job so there is no conflict. Current job level exclusions are also useful for “global” job specific exclusions on top of files that are not already ignored in ignore file (or deleted which should be equivalent for Duplicati).
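One way to picture this layering is the short Python sketch below: entries matched by the ignore file are removed first, as if they did not exist, and only then are the job-level exclusions applied. The function names are hypothetical and simple wildcard patterns stand in for whatever filter syntax would really be used.

```python
from fnmatch import fnmatchcase

def visible_entries(entries, ignore_patterns):
    """Lower layer: entries matched by the ignore file are treated as absent."""
    return [e for e in entries
            if not any(fnmatchcase(e, p) for p in ignore_patterns)]

def select_for_backup(entries, ignore_patterns, job_excludes):
    """Upper layer: job-level exclusions only see what the ignore file left."""
    remaining = visible_entries(entries, ignore_patterns)
    return [e for e in remaining
            if not any(fnmatchcase(e, p) for p in job_excludes)]
```

Since both layers only ever remove entries, the result is the same as taking the union of the two sets of exclusions.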

If you look at ignore file as metadata, it should be as tightly connected to the data as possible. If I reorganize folder structure (or just rename folder) I want metadata to stay with the folder. I want all jobs to ignore according to this metadata. This is very difficult to achieve without ignore files (or other form of metadata connected directly to the folder)

verglor · October 30, 2018, 6:20pm

Theoretically we should scan for ignore files up to the root to find out what should be ignored in the job’s folder. But I think we can do without this complication and just ignore everything outside the job’s context (folder), including ignore files, assuming that if someone configures a job folder that is ignored in an upper context, they know what they are doing (it would be like moving this folder somewhere else - out of reach of ignore files in upper folders).

JonMikelV · October 30, 2018, 7:57pm

Speaking of scanning - this adds complexity to the “real time backup file scanner” USN type stuff. If USN says “file X has changed” we now have to scan the entire path tree for file X to check for .duplicatiignore files with contents that might exclude file X or a parent path of X.
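To show how much extra checking that implies, here is a small Python sketch that lists the ignore files which would have to be consulted for one changed file. It assumes the search stops at the job's source folder, and it uses POSIX-style paths for simplicity; `ignore_files_to_check` is a hypothetical helper, not part of Duplicati.

```python
from pathlib import PurePosixPath

def ignore_files_to_check(changed, job_root, name=".duplicatiignore"):
    """List candidate ignore files from the job root down to the file's folder."""
    root = PurePosixPath(job_root)
    # relative_to raises ValueError if the changed file is outside the job root.
    rel = PurePosixPath(changed).relative_to(root)
    candidates = [root / name]
    current = root
    for part in rel.parts[:-1]:  # every folder between the root and the file
        current = current / part
        candidates.append(current / name)
    return [str(c) for c in candidates]
```

A file three levels below the job root therefore means up to three ignore files to look for and parse on every change notification, unless the results are cached.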

drakar2007 · October 30, 2018, 8:52pm

Would it be possible for duplicati to pick up any .duplicatiignore files at run time and save some back-end data for each which would register similarly to how excluded paths work? It would add trivial extra processing during a first run to notice and do this change for each ignore file, and trivial extra processing during subsequent runs to find and verify that previously-indexed ignore files are still there - but it seems to me that it would allow the problem you mention above to be almost completely sidestepped.
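A minimal sketch of the verification half of that idea, in Python: the index maps each known ignore file to the modification time recorded on the previous run, and each run reports which entries changed or disappeared. The index layout and the `current_mtime` callback are assumptions made for illustration.

```python
def refresh_index(index, current_mtime):
    """Compare a stored {path: mtime} index with the current state.

    current_mtime(path) returns the file's mtime, or None if it is gone.
    Returns (changed, removed) lists of ignore-file paths.
    """
    changed, removed = [], []
    for path, recorded in index.items():
        now = current_mtime(path)
        if now is None:
            removed.append(path)   # ignore file was deleted
        elif now != recorded:
            changed.append(path)   # ignore file was edited, re-parse it
    return changed, removed
```

Note that this only validates ignore files that are already indexed; noticing newly created ones would still need a scan or a change notification.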

JonMikelV · October 30, 2018, 9:10pm

Performance-wise that would make sense, but as with most things that improve performance I think it would make implementation more difficult because now we’re keeping essentially a local copy of .duplicatignore file locations that must be validated / maintained with each run.

So faster, yes - harder to do, also yes.