Disable deduplication

Hi,

Is it possible to disable deduplication when performing backup jobs?

thanks

No, this is not possible. Duplicati’s core logic is built around deduplication.
But why on earth would someone disable this feature? I can’t think of any reason to do this.


Why would you disable this feature?

Simply to optimize backup times.

Re-uploading all data every time you run a backup job would heavily increase backup time. You would also soon run out of backend storage space or lose versioning.
Disabling deduplication is not possible, but you will not benefit from it in any way.

If you want to decrease backup times, try playing with compression (set the compression level to 1 or disable it completely) and disable encryption if you back up to local storage.
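For example, a rough sketch of what that could look like from a script. The duplicati-cli command name and the exact option spellings are assumptions based on the Linux package and the advanced-options list, so check them against your install:

```python
import subprocess

# Hypothetical example: back up to local storage with compression turned down
# and encryption off. Target URL, source path and exact option names are
# illustrative; verify them against your Duplicati version before relying on them.
cmd = [
    "duplicati-cli", "backup",
    "file:///mnt/backup-disk",        # local storage target
    "/home/user/data",                # source folder
    "--zip-compression-level=1",      # low compression effort
    "--no-encryption=true",           # skip encryption for local targets
]
subprocess.run(cmd, check=True)
```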


It’s not deduplication in the classical sense. It doesn’t put a dedup table in memory. It’s just that, because of the way the data is stored, the same data can’t be written to your backup twice; it’s already there.

It’s kind of like putting things into a hash table. You just can’t create duplicates because identical data has to be stored in the same spot. Which, I just realized, is probably the worst way to explain this to someone who hasn’t had a class on algorithms and data structures… Sorry.
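Roughly this idea, as a toy sketch of the hash-table analogy (not Duplicati’s actual code):

```python
import hashlib

# Toy content-addressed block store: blocks are keyed by their own hash,
# so writing the same data twice lands in the same "spot" and adds nothing.
class BlockStore:
    def __init__(self):
        self.blocks = {}                     # hash -> block bytes

    def put(self, data: bytes) -> str:
        key = hashlib.sha256(data).hexdigest()
        if key not in self.blocks:           # already stored? then skip the write
            self.blocks[key] = data
        return key

store = BlockStore()
first = store.put(b"same data")
second = store.put(b"same data")
assert first == second and len(store.blocks) == 1   # the duplicate never got stored
```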


I can think of one: functional isolation testing. But it’s not that relevant for real-world backups.

Given the way Duplicati uses the local database, it’s possible that any benefit gained by skipping the deduplication calculations would be lost to decreased performance from a larger database.

But that’s just a guess, I don’t know this for sure.

Sorry for digging up this thread.

Exactly, just use tar instead with no compression :wink:

However, there is another reason you’d want to do this, and that’s to limit CPU utilization.

Welcome to the forum @jordantrizz

Are you making a point, or looking for solutions? As noted earlier, deduplication is quite central to Duplicati’s design.

The backup process explained
How the backup process works
How the restore process works
A block based storage model for remote online backups in a trust no one environment
Block-based storage engine
Developer documentation
Documentation for the local database format

If you prefer a non-block-based file copier, Duplicati is not the program you should use.
Overview in the manual describes the intended usage. Possibly your need is different?

Just making a point, that is all. Either use another backup solution or cpulimit.

This looks like it uses SIGSTOP and SIGCONT to limit CPU use. I’m not sure how “bumpy” that feels.
Duplicati has a thread-priority option. For an ambitious setup, you could try systemd resource control.
Either way, the CPU gets used. It’s just a question of how much it gets in the way of other CPU need.
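For the curious, the “bumpy” part is roughly this pattern. A toy Unix-only sketch of the SIGSTOP/SIGCONT approach, assuming you already have the target PID; cpulimit itself is more sophisticated:

```python
import os, signal, time

def throttle(pid: int, duty_cycle: float = 0.3, period: float = 0.1):
    """Crudely cap a process near duty_cycle of one CPU by pausing/resuming it."""
    run_for = period * duty_cycle
    stop_for = period - run_for
    try:
        while True:
            os.kill(pid, signal.SIGCONT)   # let it run for part of the period...
            time.sleep(run_for)
            os.kill(pid, signal.SIGSTOP)   # ...then freeze it for the rest
            time.sleep(stop_for)
    except ProcessLookupError:
        pass  # target process exited

# throttle(12345, duty_cycle=0.3)  # hypothetical PID
```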

I sure don’t follow everything Linux can run, but several popular ones I know of run deduplication too.
That tends to go with block-oriented backups. There might still be some file-oriented backups around.
On Windows, Cobian Backup and now Cobian Reflector can do that. Here’s a page on Linux options:

Cobian Reflector Alternatives for Linux

The post above actually said something I was about to say: maybe a fancier sync program will do, depending on your versioning needs. For low-versioning CLI usage, even rclone --backup-dir may do.

IBM TSM used to do deduplication at the storage level, so the clients simply streamed the data and the deduplication was offloaded to the backup server. You could do the same with Duplicati, and have a dedicated backup server that pulls down and compares over ssh.

One of the main design goals of Duplicati was to have no server side component and to be able to utilize dumb storage. As such all processing has to be done client-side.

Some features are sacrificed by not having any server side processing, of course. Which is better depends on the features you need.

Which is awesome, and I have no problem with this in certain circumstances. But on servers that run a high load it does cause issues, hence my looking for an alternative setup.

I wasn’t sure if these were really slow small systems, or just really busy ones. Sounds like the latter.

This might not go anywhere, but now that you talk of a secondary system, what do you have lots of?
Somebody with no constraints on network or storage has it easy, but most environments are limited.

Do you think you need deduplication, but want it on some other system, or do you not need dedup?
Client-side deduplication reduces the loading on everything downstream, but it does occupy a client.

Time of day might work to your advantage, unless there is no client idle time to use cycles for backup.
You might also be able to do some hybrid approach, but that’s trading cycles for design time and risk.

Yeah, that’s a staging server. They come standard with big-boy systems such as Veeam or Amanda. No one would support such complex business software without either making it proprietary or making it impossible for small teams to install and manage by themselves (arcane procedures, cryptic documentation, …). Just fork out some dough to Veeam if that’s what you want.

… which Duplicati is not. This is spare-time (which seems super-rare at the moment) volunteer coding.
I was kind of wondering what the budget for this is when I read comparison to Tivoli Storage Manager.

At a technical level, I’m wondering how well established it is that local hash + lookup is costlier than file transfer.
The former is basically the deduplication cost. You can make it faster, at the cost of memory, with use-block-cache. Using a larger blocksize can speed up large backups. Try not to have more than a few million blocks…
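As a back-of-the-envelope check on the “few million blocks” point (the sizes are just examples; 100KB was the long-time default blocksize, so treat the numbers as illustrative):

```python
# Back-of-the-envelope: how many blocks does a given source size produce?
def block_count(source_bytes: int, blocksize: int) -> int:
    return -(-source_bytes // blocksize)   # ceiling division

KB, MB, TB = 1024, 1024 ** 2, 1024 ** 4

print(block_count(1 * TB, 100 * KB))   # ~10.7 million blocks -- too many
print(block_count(1 * TB, 1 * MB))     # ~1.05 million blocks -- more comfortable
```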

If you’re backing up to storage that knows hard links, some people do file-level deduplication that way.
If you want to try for block-level deduplication, some filesystems (e.g. ZFS) do that, at the cost of memory. Maybe other backup software could use less CPU than Duplicati, which is written in C#, not native code.
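A toy sketch of the hard-link idea, roughly the rsync --link-dest concept in miniature; the paths are hypothetical and this is not anything Duplicati does:

```python
import os, shutil

def backup_with_hardlinks(src, dst, prev=None):
    """Copy src into dst; hard-link files that look unchanged since the prev backup.
    Cheap heuristic: same size and mtime counts as unchanged (like rsync's default).
    dst and prev must be on the same filesystem for hard links to work."""
    for root, _dirs, files in os.walk(src):
        rel = os.path.relpath(root, src)
        os.makedirs(os.path.join(dst, rel), exist_ok=True)
        for name in files:
            s = os.path.join(root, name)
            d = os.path.join(dst, rel, name)
            p = os.path.join(prev, rel, name) if prev else None
            if p and os.path.exists(p) \
                    and os.path.getsize(p) == os.path.getsize(s) \
                    and int(os.path.getmtime(p)) == int(os.path.getmtime(s)):
                os.link(p, d)        # unchanged: new name, same data, no extra space
            else:
                shutil.copy2(s, d)   # new or changed: store a real copy (keeps mtime)

# backup_with_hardlinks("/data", "/backups/2024-01-02", "/backups/2024-01-01")  # hypothetical paths
```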

UrBackup FAQ sounds like it transfers first and processes slowly on server. See forum comments too.
Possibly you run a larger scale operation where it would be worth setting up a more complex backup?

This is kind of foreign to me, so I’m glad to hear some input from someone who knows costly systems.
As a side-note, Duplicati is probably not reliable enough to bet any business on it as a solitary backup.

Yes, CPU intensive at times.

Backblaze, Wasabi, Cloudflare R2, Synology, Minio on commodity storage running ZFS or Btrfs

Not really; having the option to turn it off would simply be helpful, while keeping all the other features.

Unfortunately, the time of day won’t work due to the load being consistent 24x7.

TSM is a big-boy system, and super complex, but it worked well. Hence it’s proprietary and hard to manage unless you understand it inside and out. Veeam is great. You could still operate a staging server using Duplicati; the setup would simply be different, that’s all.

I have experience with TSM, but this is mostly for a low-cost non-profit project :slight_smile: So the budget is small.

Will try this out.

I’ve tried it, but I love Duplicati :wink: This isn’t large; it’s small and unfortunately just pegs the CPU all the time. So I might just do tar backups or snapshots.

I’ve had to deal with Backup Exec, which was a nightmare, and TSM, which had a massively steep learning curve but was bulletproof once in operation. It requires a full FTE to deploy, manage and maintain once you get over 500TB of data.

The first three are pure storage (no compute) in cloud. The others might be nearby, with compute.
One question would be if there’s enough storage to sync server files to for Duplicati to do backup.
Non-disaster-recovery restore would be awkward copying back to the original CPU-scarce server.
I’m not sure how often a copy-back-to-source-server would run though, short of disaster recovery.
Duplicati could disaster-recover right onto the new source server, but DB building can slow things.

ZFS and Btrfs can (I think) deduplicate, but I’m not sure how bad the cost in memory (etc.) will be.
But you said “not really” to needing deduplication, so maybe the job is just to get the unique files copied:
every now and then a file appears or changes hugely, so just notice that and stash it in the backup.

I can see how you don’t want to steal cycles from it for other purposes, such as doing the backup.

I’m not sure if that’s just grab-everything (whether it’s changed or not), but it might use some CPU.
If you find it CPU-effective to just copy everything, that’s certainly an easy way for limited backups.
If storage is ample, then the need to squeeze goes away. I pay for cloud storage, so like it small…

If a smallish portion of files changed each time, I suppose find could find them and feed into tar.
This would probably get back into old-school differential and incremental backup management.
I never had the joy (?) of managing one of them, although I know a few of the practices employed.
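To make the find-plus-tar idea concrete, a toy sketch with hypothetical paths; a timestamp file stands in for “since the last backup”:

```python
import os, tarfile

def incremental_tar(src, archive, stamp_file):
    """Archive only files modified since the last run, then refresh the timestamp."""
    since = os.path.getmtime(stamp_file) if os.path.exists(stamp_file) else 0.0
    with tarfile.open(archive, "w") as tar:            # plain tar, no compression
        for root, _dirs, files in os.walk(src):
            for name in files:
                path = os.path.join(root, name)
                if os.path.getmtime(path) > since:
                    tar.add(path)
    open(stamp_file, "w").close()                      # touch the stamp for next time

# incremental_tar("/data", "/backups/incr-2024-01-02.tar", "/backups/.last-run")  # hypothetical paths
```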

Thanks for the kindness, but lots of what Duplicati offers to typical backups doesn’t fit well here.
You can turn off encryption and compression if you like, but deduplication goes with a block design.
A big file is represented in the backup as a list of its blocks, every one in one known remote dblock.
It’s a little like saving space by using hard links instead of file copies when the file has not changed.

If you think the blocks are quite unique within and between files, I suppose you could set blocksize
very high to basically get little deduplication, and you could set Remote volume size higher as well.
This pushes Duplicati into being a full-file copier-and-tracker, but compact might still get you, as the
mostly-now-wasted volume that used to hold a deleted file might be read for a few still-used blocks.

I think I’ve given about all the ideas I have, so good luck to you in figuring out how to do the backup.


For example, FreeFileSync to one of your local systems, set versioning up, and do periodic trimming.
That would probably get to your seeming goal of transferring files without doing the block dedup part.

UrBackup still looks interesting if you prefer a client agent with a server to help with the heavier lifting.
This is the only thing I’ve seen that talks about CPU cycles, which apparently are scarce on the client.

Bacula might be another option for a client/server setup (if you like that), but I’ve looked at it very little.

Any of the above might leave you with no offsite backup. There’s an advantage to going right to cloud.

I “think” those both support replication. I don’t know if it’s CPU-efficient, or if the source servers have it.

Duplicati does have several other advanced options to make it less disruptive, but it’s basically slowing processing. Eventually the same amount of work gets done, but backup will disrupt other activities less.

The topic Possible to disable deduplication? might be you? Duplicacy is compiled Go, as is restic. Still high CPU?