Retention question

Hi all,

I am just in the process of evaluating Duplicati, and I have a question about how retention works.

For example, if retention is set to 30 versions: after the initial upload, changes to a file would be backed up daily. Once the 30-version retention cap has been met, what happens then?

Does it remove all references to the file and re-upload it afresh?

or

Does it remove the oldest version, which would then mean it has to be merged with the following version?

I have huge datasets, hence the question; I would not want to be re-backing up terabytes of data every 30 days.

Also, what kind of space impact does the versioning have? Is the metadata associated with the versions excessive?

Any help/advice is appreciated. Thanks in advance
Mat.

Hello @mshillam and welcome to the forum!

When a backup version is deleted, that “view” of all the files (at some point in time) is no longer available for restore; however, more recent views (and their files) remain. Duplicati’s block-based storage engine doesn’t do merges, because any version of a file is represented as a list of blocks, and that list may change over time. When a block is no longer in use at all, its storage at the destination is reclaimed by automatic compaction.

Basically you’re uploading file deltas, but a block might not be uploaded at all if it’s already there from some other file.
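
To make that concrete, here is a minimal sketch (my own toy model, not Duplicati’s actual code) of block-based storage with deduplication: a version is just a map from file paths to lists of block hashes, deleting a version only drops those references, and compaction reclaims blocks that no remaining version points at.

```python
import hashlib

BLOCK_SIZE = 100 * 1024  # illustrative; roughly the classic 100 KB default


class BlockStore:
    """Toy model of block-based backup storage (not Duplicati's real code)."""

    def __init__(self):
        self.blocks = {}    # block hash -> block bytes ("uploaded" once)
        self.versions = {}  # version id -> {file path: [block hashes]}

    def backup(self, version_id, files):
        """files: {path: bytes}. Only blocks not already stored get 'uploaded'."""
        listing = {}
        for path, data in files.items():
            hashes = []
            for i in range(0, len(data), BLOCK_SIZE):
                block = data[i:i + BLOCK_SIZE]
                h = hashlib.sha256(block).hexdigest()
                if h not in self.blocks:      # deduplication: known blocks are skipped
                    self.blocks[h] = block    # "upload" only the new block
                hashes.append(h)
            listing[path] = hashes
        self.versions[version_id] = listing

    def delete_version(self, version_id):
        """Retention: drop one version's references; versions are never merged."""
        del self.versions[version_id]

    def compact(self):
        """Reclaim storage for blocks that no remaining version references."""
        in_use = {h for listing in self.versions.values()
                  for hashes in listing.values()
                  for h in hashes}
        for h in list(self.blocks):
            if h not in in_use:
                del self.blocks[h]
```

The point being: when the oldest version ages out, nothing gets re-uploaded; only blocks that no other version still references become eligible for reclamation.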

A version is basically a list of all your files, plus for every file a list of all its blocks (which are known by their hash). How the backup process works discusses this, and you can also see How the restore process works, and Choosing sizes in Duplicati. Files that are massively changed tend to defeat block-based deduplication and the delta-upload plan, whereas files that change repeatedly can find their blocks scattered, requiring a download of many dblock files (maybe containing irrelevant blocks) to collect the blocks for the restored file.
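
To illustrate the scattered-blocks point: restoring a file means collecting all of its blocks, and each block lives in whichever remote dblock volume it was originally uploaded in. A rough sketch, with made-up volume names (the real restore logic lives inside Duplicati):

```python
def volumes_needed(file_block_hashes, block_to_volume):
    """Return the set of remote dblock files that must be downloaded to
    rebuild one file, given a block-hash -> volume index."""
    return {block_to_volume[h] for h in file_block_hashes}

# A file whose blocks ended up scattered across three different volumes:
file_block_hashes = ["h1", "h2", "h3", "h4"]
block_to_volume = {"h1": "dblock-001.zip", "h2": "dblock-017.zip",
                   "h3": "dblock-017.zip", "h4": "dblock-042.zip"}
print(volumes_needed(file_block_hashes, block_to_volume))
# e.g. {'dblock-001.zip', 'dblock-017.zip', 'dblock-042.zip'}
```

The more a file has churned over time, the more volumes that set tends to span, and each volume downloaded may carry blocks you don’t actually need.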

How huge are the datasets? Duplicati is sometimes not so fast on huge datasets because (the theory goes) the local SQLite database that tracks all the pieces slows down (e.g. on inserts) as the tables become huge. Losing the database (e.g. by disk crash) can mean a lengthy rebuild from the information at the destination.
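
For a feel of the scale: assuming a 100 KB blocksize (the long-standing default discussed in Choosing sizes in Duplicati; your version may differ), a back-of-the-envelope block count looks like this:

```python
# Back-of-the-envelope block counts, assuming a 100 KB blocksize.
# Real numbers vary with deduplication, compression and file sizes; this only
# shows why the local SQLite database grows with the size of the source data.
blocksize = 100 * 1024  # bytes

for label, size in (("100 GB", 100 * 1024**3), ("1 TB", 1024**4)):
    print(f"{label}: ~{size // blocksize:,} blocks to track")
# 100 GB: ~1,048,576 blocks to track
# 1 TB: ~10,737,418 blocks to track
```

Every one of those blocks is something the local database has to track, which is why raising the blocksize is often suggested for very large sources.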

Hearing TB of data makes me a bit nervous, although you can read Best practice for large data set (18TB)?

Hi @ts678

Many thanks for your response. This is making more sense now; I will read up further at the links you have provided.

The datasets are anything from a couple of gigabytes to 1 TB per machine that needs backing up.

Regards
Mat.