MySQL and deduplication

So, I’ve got a production system I’m testing Duplicati on.
Essentially the client has a very large MySQL database. The application itself handles backup, properly dumping the MySQL databases for us/Duplicati to pick up. [We don’t have to deal with backing up in-use DB files, etc.]

So, we’re just backing up the “backups” the application makes.

The files we get are a mix of .tar.gz and plain .tar files.

The “problem” is deduplication.
The size of the “backup” before we hand it to Duplicati is ~100GB. A daily delta is about 50GB - so dedup is doing something, but not nearly as much as I’d expect.

So, some questions about dedup.
Is tar going to impact deduplication substantially?
I assume tar.gz may well do so: as the data changes, compression will produce [I think] substantially different output files, and that will make dedup a lot harder. [And Duplicati, IIRC, uses a fairly simple dedup process that likely won’t “catch” this.]

So, suggestions on how to maximize deduplication would be helpful.


Tar may impact deduplication. It depends on how your data changes, but essentially, if anything is added near the beginning of the tar, it shifts how Duplicati cuts the rest of the file into blocks, so fewer blocks come out as “duplicates”.
You’ll get more consistent block deduplication with many smaller .sql files (if your application can dump to separate sql files anyway).
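A minimal sketch of why that happens, assuming fixed-offset chunking (Duplicati’s real block size is much larger; the tiny blocks and fake data here are just for illustration):

```python
import hashlib

def block_hashes(data: bytes, block_size: int) -> list[str]:
    # Hash fixed-size blocks, the way offset-based dedup sees a file.
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

data = bytes(range(256)) * 4           # 1 KiB standing in for a tar archive
a = block_hashes(data, 64)             # original archive
b = block_hashes(b"!" + data, 64)      # one byte inserted at the front
shared = len(set(a) & set(b))
print(f"{shared} of {len(a)} blocks still deduplicate")  # -> 0 of 16
```

One byte at the front shifts every block boundary, so none of the old blocks reappear, even though 1023 of the 1024 bytes are unchanged.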

Yup. gzip completely screws over dedup because any tiny change to the input can change the entire compressed output. Additionally, Duplicati won’t recompress .gz files, so by compressing to .gz you’ve forced Duplicati to store the raw gzip data inside its zip archives.
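You can see the effect with a quick sketch (the table name and row format are made up): change one row of a fake dump and compare how many fixed-size blocks still line up, raw vs. gzipped.

```python
import gzip
import hashlib

def block_hashes(data: bytes, block_size: int = 512) -> list[str]:
    return [hashlib.sha256(data[i:i + block_size]).hexdigest()
            for i in range(0, len(data), block_size)]

def positional_matches(x: bytes, y: bytes) -> int:
    # Count blocks that are identical at the same offset in both streams.
    return sum(a == b for a, b in zip(block_hashes(x), block_hashes(y)))

# Fake dump: 2000 fixed-width rows; then change exactly one of them.
rows = b"".join(b"INSERT INTO t VALUES (%06d);\n" % i for i in range(2000))
edited = rows.replace(b"(000005)", b"(999999)", 1)

raw = positional_matches(rows, edited)
# mtime=0 keeps the gzip header deterministic between runs
comp = positional_matches(gzip.compress(rows, mtime=0),
                          gzip.compress(edited, mtime=0))
print(f"raw blocks unchanged: {raw} of {len(block_hashes(rows))}")
print(f"gzipped blocks unchanged: {comp}")
```

Uncompressed, the one-row edit dirties a single block and everything else dedups; gzipped, the change ripples through the rest of the compressed stream and almost nothing lines up.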

I would dump each table into its own .sql file and leave them uncompressed in a folder.
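Something along these lines (a sketch only - the database name, credentials handling, and dump path are placeholders you’d adapt): loop over the tables and run mysqldump once per table, uncompressed.

```python
import subprocess
from pathlib import Path

DUMP_DIR = Path("/backup/sql")   # placeholder path

def dump_command(db: str, table: str, outfile: Path) -> list[str]:
    # --single-transaction: consistent InnoDB snapshot without locking
    # --skip-dump-date: omit the timestamp comment, so an unchanged table
    #                   produces a byte-identical dump (better dedup)
    return ["mysqldump", "--single-transaction", "--skip-dump-date",
            f"--result-file={outfile}", db, table]

def dump_all_tables(db: str) -> None:
    # -N drops the column-name header from the table listing
    tables = subprocess.run(["mysql", "-N", "-e", f"SHOW TABLES FROM {db}"],
                            capture_output=True, text=True,
                            check=True).stdout.split()
    DUMP_DIR.mkdir(parents=True, exist_ok=True)
    for t in tables:
        subprocess.run(dump_command(db, t, DUMP_DIR / f"{t}.sql"),
                       check=True)
```

The `--skip-dump-date` flag matters here: without it every dump differs in at least the trailing comment, so even a table that hasn’t changed produces a new block.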