Back up only the last 2 years of files

Hi there,
I have a client with about 12 TB of data, but that data spans more than a decade.
I will be migrating to Duplicati, and I would like to back up only the last 2 years of files from the Windows share, because above all I need to guarantee the most recent files.
I searched the forums but didn’t find an exact answer.
Can you help me?

Welcome to the forum @jeferson

This worries me. Is this the only backup (risky)? Is it local or remote (remote is slow)? Is a long disaster-recovery downtime OK?

Also not advised, as it prevents use of VSS (a partial solution for locked files) and USN (which can speed up the backup instead of a slow scan over the network). Any idea how many files there are in that rather large 12 TB?
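(If the backup ran on the file server itself, the relevant advanced options would, I believe, be --snapshot-policy for VSS and --usn-policy for USN; check the current option list to confirm.)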

I don’t think there’s any such feature, but the Features category can be used to request one (it can take a while).

is a workaround, but I’m not sure how reliable use of an archive bit is. Again, how critical (or not) is this?

Do you have long experience with it? Although it’s best to back up on the source system, what are the backup system’s OS and capacity? You might do better putting a drive on that system and doing some date-filtering during a sync, thereby allowing Duplicati to back up a local drive. It “looks” like robocopy and rclone permit age filtering.

You also need to think about how to restore, and how long to keep Duplicati versions (2 years as well?).

Hi there!

I already have a safe backup of the data that is 3+ years old.

This client’s data structure is very complicated, and for now we can’t simply remove old files. They need to stay on the file servers until I can negotiate with the directors. The total is 12 TB, but recent data (less than 3 years old) is only 2.5 TB.

I will start the first backup with Duplicati this weekend, so it would be very useful not to spend a lot of time on data for which I already have a backup.

That’s fine, but your wish was to NOT back those up. I gave you a couple of (workaround) ideas.
There was no suggestion made to go and delete them, but you did explain why they’re still there.

You can certainly see if anybody has better ideas in the few hours before the weekend. It might not happen.

You’re not commenting on much of what I said, but note that Duplicati at its default settings slows down after about 100 GB.
Scaling up the blocksize can help, but not if there are millions of files, as those drive the block count up.
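For rough scale, assuming the old 100 KiB default: 2.5 TB works out to roughly 25 million blocks to track, while a 1 MiB blocksize brings that down to about 2.5 million; but since every file needs at least one block, millions of tiny files keep the count high no matter what.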

For disaster recovery, keep in mind that there’s a database, and database recreate time can slow recoveries.
Cloud storage, if used, can slow them too. How long can you stand to be out if a disaster takes place?

EDIT 1:

TBH your schedule sounds pretty rushed, but here are some more details on the age-filtered sync idea:

robocopy (see /maxage)
rclone (see max-age)

You would have to sync to several TB of space, but drives are quite cheap these days if all else fails.
You would probably also want to sync deletions from source, and not copy unchanged source again.
This means you’d have to do some testing in order to make the file sync work the way you want it to.
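
For example, a rough sketch of what a nightly age-filtered sync could look like (the paths are placeholders, and how /MIR or --delete-excluded treat files that age out of the 2-year window is exactly the sort of thing to test first):

    # Sketch only: mirror files modified in the last ~2 years (730 days) to a local drive.
    # /MAXAGE skips older source files; check what /MIR then does with their old destination copies.
    robocopy \\fileserver\share D:\SyncForBackup /MIR /MAXAGE:730 /R:1 /W:1

    # Roughly the same idea with rclone; --delete-excluded also removes destination copies
    # of files that have aged out of the window, so use it deliberately.
    rclone sync \\fileserver\share D:\SyncForBackup --max-age 2y --delete-excluded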

EDIT 2:

Although I was concerned about the Archive attribute plan, there are various other attributes you could set on a complete (lots of TB, though) sync of the original. Set them in a script like the one cited, then exclude with:

exclude-files-attributes

In exchange for having a large drive, you get a convenient local copy of the original files for fast access. You might want to undo whatever changes you make to the files to mark them as excluded, though. If the Archive attribute is chosen, I’d consider it pretty safe in a limited environment such as your sync of the originals…
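
As a rough sketch of those mechanics, assuming --exclude-files-attributes accepts attribute names such as Archive (and remembering that Windows sets Archive on most new or modified files anyway, so you may need to clear it on the recent files first, or pick a different attribute):

    # Sketch only: flag files older than ~2 years in the local synced copy, then exclude
    # that attribute in the Duplicati job, e.g. --exclude-files-attributes=Archive.
    $cutoff = (Get-Date).AddYears(-2)
    Get-ChildItem 'D:\SyncForBackup' -Recurse -File |
        Where-Object { $_.LastWriteTime -lt $cutoff } |
        ForEach-Object { $_.Attributes = $_.Attributes -bor [IO.FileAttributes]::Archive }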

Getting further into the test-it-well ideas (previous ones, especially the date-limited sync, seem tame):

cp(1) - Linux man page

-l, --link
    link files instead of copying 

-s, --symbolic-link
    make symbolic links instead of copying 

is a trick sometimes also used on Windows, e.g. to make a series of backups take less actual space.
A script could be created that enforces the maximum age of the link view, and Duplicati would back up by way of the links.

  --symlink-policy (Enumeration): Symlink handling
    Use this option to handle symlinks differently. The "store" option will
    simply record a symlink with its name and destination, and a restore will
    recreate the symlink as a link. Use the option "ignore" to ignore all
    symlinks and not store any information about them. The option "follow"
    will cause the symlinked target to be backed up and restored as a normal
    file with the symlink name. Early versions of Duplicati did not support
    this option and behaved as if "follow" was specified.
    * values: Store, Follow, Ignore
    * default value: Store

would probably need adjusting; then the question is, if it’s set to follow, does that apply to just the first link or to every one?
Windows also has a more complex view of linking (more varieties, more capabilities), as mentioned in:

“directory junction” vs “directory symbolic link”?

but if you can get all the stuff to land correctly, a tree of links or junctions might be a very compact view, thereby converting storage costs into labor costs. It might make more sense for a home user to attempt.
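
If anyone does try it, a sketch of the “link view” could look like this (placeholders throughout; creating symlinks on Windows usually requires admin rights or Developer Mode, and the symlink-policy and restore behavior would need real testing):

    # Sketch only: build a compact tree of symlinks pointing at the recent originals,
    # which Duplicati could then back up with symlink-policy set to follow.
    $source = '\\fileserver\share'   # placeholder
    $view   = 'D:\RecentView'        # placeholder
    $cutoff = (Get-Date).AddYears(-2)

    Get-ChildItem $source -Recurse -File |
        Where-Object { $_.LastWriteTime -ge $cutoff } |
        ForEach-Object {
            $relative = $_.FullName.Substring($source.Length).TrimStart('\')
            $link     = Join-Path $view $relative
            New-Item -ItemType Directory -Path (Split-Path $link) -Force | Out-Null
            if (-not (Test-Path $link)) {
                New-Item -ItemType SymbolicLink -Path $link -Target $_.FullName | Out-Null
            }
        }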

An even more questionable tactic is to write one’s own scanner to try to feed Duplicati files using option:

  --changed-files (Path): List of files to examine for changes
    This option can be used to limit the scan to only files that are known to
    have changed. This is usually only activated in combination with a
    filesystem watcher that keeps track of file changes.
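
Purely as an illustration of that tactic, a scanner might look like the sketch below; the Duplicati.CommandLine.exe path, the target URL, and especially how the --changed-files list has to be delimited (a ';'-joined value is assumed here) all need to be verified against the documentation before relying on it:

    # Sketch only: find files modified since the last run and hand just those to the
    # command-line backup via --changed-files (list format is an assumption, verify it).
    $duplicati = 'C:\Program Files\Duplicati 2\Duplicati.CommandLine.exe'  # assumed install path
    $target    = 'file://D:/DuplicatiTarget'                               # placeholder destination
    $source    = 'D:\SyncForBackup'                                        # placeholder source
    $stamp     = 'D:\duplicati-last-run.txt'                               # marker file for the last run

    $since = if (Test-Path $stamp) { (Get-Item $stamp).LastWriteTime } else { (Get-Date).AddYears(-2) }
    $changed = Get-ChildItem $source -Recurse -File |
        Where-Object { $_.LastWriteTime -gt $since } |
        ForEach-Object { $_.FullName }

    if ($changed) {
        & $duplicati backup $target $source "--changed-files=$($changed -join ';')"
    }
    New-Item -ItemType File -Path $stamp -Force | Out-Null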