Server-side implementation (for SSH backends) for CLI delete & compact when using low-bandwidth links

Hi Everyone,

I have recently started using Duplicati as a replacement for CrashPlan. I have found it to be robust and easy enough to use.

The client environments are Windows/Linux, with the server environment running Linux.
The Duplicati version used in all environments is the latest 2.0-branch beta.

As most of the data I’m backing up is sent over an ADSL2+ upstream, I’m limited to about 40 Kbit/s in order not to impact general internet connectivity. However, once an initial backup is completed (for sites with many machines, sometimes via seeded backups), the deltas are low and manageable.

However, there are two scenarios where a lot of data will need to be sent again, when:

  • delete is run to purge old versions
  • compact is run to compact after the above operation

If I have missed any other operations that use significant upstream traffic, please call them out.

Long story short: since I’m pushing the data to equipment I own and trust (the encryption password can be stored under another user account, outside the jailed SFTP context), I can’t see a reason why someone couldn’t use duplicati-cli to implement server-side delete & compact operations.

I see the following steps would be required:

  • clients would need to be configured with a --keep-time that’s arbitrarily long (I haven’t been able to confirm whether 0 simply disables deletion) and the --no-auto-compact flag set.
  • the server would need a cron job that walks the backups weekly, running the following (a rough sketch is shown after this list):
    ** duplicati-cli delete /path/to/backup --keep-time=6M
    ** duplicati-cli compact /path/to/backup (possibly also tweaking other options, since throughput is no object)
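
Something like the following is what I have in mind for the weekly server-side job - a rough sketch only; the backup root, the passphrase file location and the option values are placeholders, and I’m assuming each backup may also want its own --dbpath:

```sh
#!/bin/sh
# Sketch of the weekly server-side maintenance cron job (placeholders throughout).
BACKUPS=/srv/backups                        # one sub-folder per client backup
PASSFILE=/root/duplicati-passphrase         # stored outside the jailed sftp tree

for b in "$BACKUPS"/*; do
    [ -d "$b" ] || continue
    # a per-backup --dbpath may also be needed so each run keeps its own local DB
    duplicati-cli delete  "$b" --keep-time=6M --passphrase="$(cat "$PASSFILE")"
    duplicati-cli compact "$b"                --passphrase="$(cat "$PASSFILE")"
done
```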

Can someone please provide some feedback on the above - am I on the right track?
Can anyone fill in the gaps on:

  • the methodology
  • recommendations on which options/flags should be used when executing on the client or server
  • the best way to handle passing the encryption password (my guess is to just pipe it straight into stdin).

JonathanM, I’m no expert in how the underlying machinery works in Duplicati, but it sounds to me like you wouldn’t necessarily even need such a complicated server-side maintenance job as you describe.

All of this assumes (and I could be wrong here) that the client side has all the historical information at its disposal…

What if the client side still did all the deciding about what needs to happen and merely asked the server side to take care of it? For example, the client side says “hey, it’s time to verify a backup file… let’s pick file number 12345” and, instead of downloading and verifying it as is currently done, the client instead says to the server “please do a compression validation on file number 12345 using this encryption key (if necessary)”.

This allows for keeping configs and the like local but still passing off general maintenance to the remote destination.

For your two examples I envision:

  • delete AND compact = “Hey server, could you please re-compress these archives, leaving out the following files {a, b, c, d}, and let me know when you’re done (or perhaps I’ll check in every minute until you’re done)”

Of course this completely breaks the current philosophy of Duplicati - that it doesn’t need a smart destination (and thus can work with lots of different backends) - however, as an OPTION for those who do have a smart backend, it might make sense.

On a side note, I don’t think delete requests take much bandwidth. They’re just calling APIs and not downloading anything. As for compacting, here’s how to disable it.
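
(I believe the switch involved is the --no-auto-compact option mentioned above; something like the following, with the URL and source as placeholders:)

```sh
# assumption: passing --no-auto-compact on the backup job stops the automatic compacting
duplicati-cli backup <storage-url> <source-folder> --no-auto-compact=true
```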

Yes, this is quite a big break from the design idea. If you do have the option to run code server-side (and don’t mind doing so), there are other solutions, e.g. Duplicacy and similar: GitHub - gilbertchen/benchmarking: A performance comparison of Duplicacy, restic, Attic, and duplicity

If you prefer the Duplicati approach with as few downloads as possible, you can set the option --upload-verification-file, which will place a JSON file next to the uploaded files. The file contains a list of all the files that Duplicati thinks should be there, along with their hashes. If you can run a server-side check that verifies the files, you can disable the remote verification as well.

The script is here:
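
As a rough illustration only (not that script), a minimal server-side check could look something like this - assuming the verification file sits next to the uploaded volumes and is a JSON list of entries each carrying a Name field; the file-name pattern and field names here are assumptions, so treat the shipped script as the authoritative version:

```sh
#!/bin/sh
# Illustrative sketch only - just checks that every file named in the
# verification manifest is present; the real script also verifies hashes.
BACKUP=/path/to/backup

jq -r '.[].Name' "$BACKUP"/*verification.json | while read -r f; do
    [ -e "$BACKUP/$f" ] || echo "MISSING: $f"
done
```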

Interestingly, Duplicacy does not have a server-side component either… Its free CLI version is quite similar to Duplicati, I’d say.

But regardless of other tools - shouldn’t it be really easy to set up an appropriate command line on the server to run compact/purge server-side, locally to the backup? Would it require rebuilding the local DB before each run?

The server-side verification script looks like a great option for people who might need it.

Unless it turns out that a lot of people need this type of functionality, it makes sense to have it available as an advanced feature.

Perhaps somebody using it will put a How-To post in place. :slight_smile:

I will have to look at it some more then! I was under the impression that it used the same approach as Attic, where it has a server component but can run somewhat without it.

At the very least, it would require that the server side knows the backup password. It would also require some kind of re-sync mechanism; otherwise Duplicati would not be able to figure out whether files just dropped out of the backend or whether a compact had occurred.

Generally, Duplicati has a “do not touch the storage folder” approach, and enforces/checks this to ensure that the backup is fully working.

Yes, it would be great if you could check Duplicacy again - it is a very interesting competing tool.
It does have several drawbacks and also some wins compared to Duplicati, and one of its best features is lock-free, multi-source, single-target de-duplication:

On the cons side, besides the paid UI tool, it has some issues with the way the source is configured and with the handling of VSS on Windows (only a single drive works).

I feel that Duplicati is the much better tool as it stands right now, and if you eventually decide to add similar de-duplication it would be great (yes, I know you have no current plans to implement it).

As for the server side - I agree with you on the approach in general, but having the (advanced) ability to implement at least hash verification is a good option.

BTW, you may also want to look at adding hash verification at remote storage, similar to what rclone does - it can verify files against Backblaze B2 without downloading data…
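
For example (just a sketch - the bucket and path names are placeholders), rclone can do that comparison using the hashes the B2 API reports, without pulling the data down:

```sh
# compares sizes and hashes between the local backup folder and B2, no downloads
rclone check /path/to/backup b2:my-bucket/backup
```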

dgcom, you seem to know a bit about Duplicacy - how would you feel about starting a Comparison category Topic comparing it to Duplicati?

Interesting idea, of course, but I am only familiar with the Duplicacy CLI version, which I evaluated just recently.
I’ll see how much useful info I can gather and post it if I decide that it might be interesting to others :slight_smile:

Yes, I read it and it seems clever. However, I imagine that it fills the remote destination with millions of blocks? Am I missing some important detail here?

dgcom, well if you change your mind feel free to add your input to the (now) existing Comparison topic. :slight_smile:

kenkendk, yes - I suppose if you have a 100 GB source location with a BEST-case scenario of a single 100 GB file and a 100 KB block size, you’d end up with 1,048,576 individual block files on the destination (assuming there was nothing to de-dupe).

And that’s just the starting file count with no history!
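
(For reference, the arithmetic behind that count, treating the figures as binary units:)

```
100 GiB / 100 KiB = (100 × 2^30 B) / (100 × 2^10 B) = 2^20 = 1,048,576 block files
```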

Yes, it creates smaller blocks, but the size is configurable. Are you thinking about the performance impact? From my (very quick and dirty) testing with a network share, I did not see a problem.

Yes, I have not seen many filesystems that like millions of files. Performance really starts to degrade when there are many files or folders.

That is what I am wondering about: there are people reporting slow file listings (say with B2, which lists 200 files per call), and they are using 100 KB blocks wrapped in 50 MB volumes; it must be really bad if the blocks are all just stored as plain files.

I haven’t actually tried Duplicacy yet, but I know that when I’ve dealt with lots of files with structured names I end up building a folder structure to break them up - perhaps that’s how they’re implementing it.

But that’s only when you have all the files in the same folder, no? If you use a folder structure to index your blocks, processing can be pretty fast.
I was planning to do a bit more performance comparison - now I am even more curious…

Me too, but I don’t have the time, so I hope to hear what you find out.
We have discussed splitting Duplicati backups into sub-folders, but have not pursued it because, with subfolders, you need to send multiple list requests to get the full file list.

That is not a problem with a local/network disk, but really sucks on some high-latency WebDAV interface.

Now I feel challenged :slight_smile:
I was reading the Duplicacy forums and saw people successfully backing up 500 GB to Wasabi, so it is doable.
But I guess restore might be more challenging.
I’ll see if I can test 10 GB against B2 and compare timings.

P.S. I just remembered that I saw a good discussion on B2 efficiency in the Duplicacy GitHub… The question was not about performance but about the number of requests - and, hence, the cost:

I’m not sure how they do the exploration (figuring out that another client has uploaded a chunk), but it does look like they use subfolders to store all the chunks as files:

Yes, they do… But I think they flatten this for some cloud providers…
I am working on some tests to compare both approaches, and I shall see how the folder looks on B2 and similar…
One drawback of the Duplicacy approach is that you cannot use a 3rd-party tool (like rclone) to copy backup sets between providers - but this is solved with the tool’s copy option.