I think it would be great if it is possible to define which files to ignore in a file in any folder (e.g. .backupignore).
It could work just like .gitignore works.
It’s quite hard to set up exclusions globally, since each folder may need something else ignored.
Currently, the way Duplicati processes folder structures is that once a folder is ignored, nothing inside that folder is scanned. This is done for performance reasons, but a side effect is that an ignore filter at a higher folder might cause ignore filters at lower folders to never actually be seen.
It can also start getting messy if people have multiple backup jobs looking at different content in the same sets of folders.
I think .gitignore ignores only files, since git has no real concept of folders. In that case there should be no problem with ignored folders, since only files are ignored.
However, in Duplicati an exclusion ending with / may be interpreted as ignoring a whole folder. And once a folder is ignored, any rule at a deeper level that ignores part of that folder is irrelevant, since everything in that folder is already ignored. And since a .gitignore can only ignore the contents of its own folder (not the folder itself), there should be no problem with not scanning an ignored folder.
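To illustrate the trailing-slash distinction, here is a minimal Python sketch (hypothetical helper, not Duplicati code) of how a .gitignore-style pattern ending in / would restrict a match to directories, so that matching a directory lets the scanner skip everything inside it:

```python
import fnmatch

def matches(pattern: str, path: str, is_dir: bool) -> bool:
    """Return True if a .gitignore-style pattern excludes the given path.

    A trailing '/' restricts the pattern to directories only; matching a
    directory implicitly excludes everything inside it, so nothing below
    that directory needs to be scanned.
    """
    dir_only = pattern.endswith("/")
    if dir_only:
        pattern = pattern.rstrip("/")
        if not is_dir:
            return False
    # Match against the last path component, as git does for bare names.
    return fnmatch.fnmatch(path.rstrip("/").split("/")[-1], pattern)

# "cache/" excludes a directory named cache (and thus its contents),
# but not a plain file that happens to be named cache.
assert matches("cache/", "app/cache", is_dir=True)
assert not matches("cache/", "app/cache", is_dir=False)
assert matches("*.tmp", "app/data.tmp", is_dir=False)
```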
Multiple backup jobs should also not be a problem, since this ignore file should mean that the exclusions apply to any backup. It’s a kind of metadata of the folder, independent of any specific job (or even of the backup software, if the format is equally understood by all of them - hence my initial suggestion of the software-agnostic file name .backupignore).
The possibility to specify the name of the ignore file (per job?) would also be nice to have: one could reuse .gitignore if they don’t want to back up untracked files, while still being allowed a different name when that is not the case.
My use case is that I need to back up a folder where each subfolder contains application data for a Windows program. Many programs save some kind of cache or temporary files there (with huge size) under completely different names, so it is quite cumbersome to exclude them at the job level.
Ignoring according to .gitignore (in any folder) would be a huge step forward indeed. However, I think the possibility to specify the filename instead of hardcoding .gitignore should be quite easy in comparison.
I hoped that the ignore file (.gitignore, .duplicatiignore or a user-specified filename) could work just like .gitignore does: there would be only one ignore file per folder and the exclusions would be defined inside the file, ideally with syntax compatible with git’s .gitignore.
Maybe my previous post was a little confusing. I meant that once parsing of .gitignore files and ignoring according to them is done, the possibility to specify a custom name for the ignore file should be the easier part.
should symlinks be treated as files or folders?
I think symlinks should be handled transparently: folder symlinks handled as folders and file symlinks as files.
should pointers to cloud storage files be treated as files or folders?
Honestly, I didn’t know there was such a thing in Duplicati, but I guess it could be handled like symlinks.
should the .duplicati-ignore* files themselves be included in the backups?
I believe that ignore files should definitely be backed up, just like .gitignore is always tracked in git. From my perspective it is part of the folder’s metadata and must be retained.
So I suppose the easiest way to implement this would be to add it to the initial folder scan and say that if an ignore file exists, parse it into “standard” Duplicati exclude rules, then continue with the folder.
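A rough Python sketch of that approach (the file name, rule format, and glob matching here are simplifying assumptions, not Duplicati’s actual API): on entering each folder, look for an ignore file, translate its patterns into exclude rules scoped to that folder, prune excluded subfolders, and keep scanning:

```python
import fnmatch
import os

IGNORE_FILENAME = ".duplicati-ignore"  # hypothetical default name

def load_ignore_rules(folder: str) -> list[str]:
    """Read the ignore file in `folder` (if any) and return glob-style
    exclude rules scoped to that folder, skipping comments and blanks."""
    path = os.path.join(folder, IGNORE_FILENAME)
    rules = []
    try:
        with open(path, encoding="utf-8") as f:
            for line in f:
                line = line.strip()
                if line and not line.startswith("#"):
                    rules.append(os.path.join(folder, line.lstrip("/")))
    except FileNotFoundError:
        pass
    return rules

def scan(root: str):
    """Walk `root`, accumulating scoped exclude rules as folders are entered.

    Yields the paths that would be backed up; the ignore files themselves
    are yielded too, so they end up in the backup."""
    excludes: list[str] = []
    for folder, dirs, files in os.walk(root):
        excludes.extend(load_ignore_rules(folder))
        # Prune excluded subfolders so nothing inside them is scanned.
        dirs[:] = [d for d in dirs
                   if not any(fnmatch.fnmatch(os.path.join(folder, d), r.rstrip("/"))
                              for r in excludes)]
        for name in files:
            path = os.path.join(folder, name)
            if not any(fnmatch.fnmatch(path, r) for r in excludes):
                yield path
```

Because excluded directories are pruned from the walk, this keeps the existing “don’t descend into ignored folders” behavior while still feeding everything through one rule list.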
That shouldn’t play too much havoc with the progress bar…
Of course that means logging of filter exclusions might start to include ignore exclusions as well…
Git has file-based configuration, like the .gitignore mentioned here. Duplicati stores its configuration in an SQLite database. Having configuration in two places can make things complicated and is not so nice from a design perspective. What happens if the ignore filter stored in SQLite is not aligned with the .duplicatiignore files? I guess it would make sense to take the union of the ignore rules. I think there could be a lot of subtleties here, also with respect to restoring (previous backup sets without these additional filters) and handling these files.
Though I understand it’s easier to just edit some files locally, and to do so more dynamically at that.
What happens if the ignore filter stored in SQLite is not aligned with the .duplicatiignore files?
I’d think this feature should be an additional layer on top of Duplicati’s filter since the .gitignore file never specifies files/directories to include, only ones to filter out of the data set. If I set Duplicati to back up C:\Users\myUser\ but have a .duplicati-ignore file in there that lists /Downloads/ then it feels intuitive to understand that Duplicati shouldn’t back up C:\Users\myUser\Downloads\ or anything in it.
also with respect to restoring (previous backup sets without these additional filters)
I believe the way git handles this would translate well: anything backed up that is later added to an ignore file remains valid to restore as long as that backup exists; Duplicati would just stop recording changes once it’s listed in an ignore file.
Come to think of it, would it be easier to add a multi-line input option to the UI when specifying filters? It’d allow someone to keep a personal “.duplicati-ignore”-type file and just copy/paste the list into the existing filter option to minimize impact on the application.
EDIT: This already exists. I guess the question is less about entering filter parameters and more about updating filter parameters (potentially across multiple backup jobs) in a convenient way as new files/directories are created? Or maybe, like myself, @verglor was unaware of this option?
I think a goofier example would be what happens when a user specifically says to include C:\Users\myUser\Downloads\ but then creates C:\Users\myUser\.duplicatiignore saying to IGNORE the Downloads folder.
I’m not sure how Duplicati handles that now with included folders vs. exclude filters, but I would assume the similar .duplicatiignore scenario would be handled the same way.
I would consider the .duplicatiignore file part of the folder’s metadata, saying what should not be backed up under any circumstances. From Duplicati’s perspective, files and folders ignored in ignore files don’t exist. It is lower level than the exclusions configured in a job, so there is no conflict. Current job-level exclusions are still useful for “global”, job-specific exclusions on top of the files that are not already ignored by an ignore file (or deleted, which should be equivalent for Duplicati).
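To make that precedence concrete, here is a tiny sketch (hypothetical logic, not Duplicati’s actual filter engine) where ignore-file rules are applied before the job-level include/exclude filters ever see the path:

```python
def visible_to_job(path: str, ignored: set[str],
                   job_includes: list[str], job_excludes: list[str]) -> bool:
    """Ignore-file rules act first: an ignored path simply does not exist
    for the job, so a job-level include cannot resurrect it."""
    if any(path == p or path.startswith(p + "/") for p in ignored):
        return False
    if any(path.startswith(e) for e in job_excludes):
        return False
    return not job_includes or any(path.startswith(i) for i in job_includes)

ignored = {"/home/user/Downloads"}
# Even an explicit job-level include cannot override the ignore file.
assert not visible_to_job("/home/user/Downloads/big.iso", ignored,
                          job_includes=["/home/user/Downloads"], job_excludes=[])
assert visible_to_job("/home/user/doc.txt", ignored,
                      job_includes=["/home/user"], job_excludes=[])
```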
If you look at the ignore file as metadata, it should be as tightly connected to the data as possible. If I reorganize the folder structure (or just rename a folder), I want the metadata to stay with the folder, and I want all jobs to ignore according to this metadata. This is very difficult to achieve without ignore files (or another form of metadata connected directly to the folder).
Theoretically we should scan for ignore files all the way up to the root to find out what should be ignored in the job’s folder. But I think we can go without this complication and just ignore everything outside the job’s context (folder), including ignore files, assuming that someone who configures a job folder that is ignored in an upper context knows what they are doing (it would be like moving this folder somewhere else, out of reach of the ignore files in upper folders).
Speaking of scanning: this adds complexity to the “real time backup file scanner” USN-type stuff. If USN says “file X has changed”, we now have to scan the entire path tree of file X to check for .duplicatiignore files whose contents might exclude file X or a parent path of X.
Would it be possible for duplicati to pick up any .duplicatiignore files at run time and save some back-end data for each which would register similarly to how excluded paths work? It would add trivial extra processing during a first run to notice and do this change for each ignore file, and trivial extra processing during subsequent runs to find and verify that previously-indexed ignore files are still there - but it seems to me that it would allow the problem you mention above to be almost completely sidestepped.
Performance-wise that would make sense, but as with most things that improve performance, I think it would make implementation more difficult because now we’re essentially keeping a local copy of the .duplicatiignore file locations that must be validated / maintained with each run.