On a side note, I don’t think delete requests take much bandwidth. They’re just calling APIs and not downloading anything. As for compacting, here’s how to disable it.
Server-side implementation (for SSH backends) - for CLI delete & compact - when using low-bandwidth links
Yes, this is quite a big break from the design idea. If you do have the option to run code server-side (and don’t mind doing so), there are other solutions, e.g. Duplicacy and similar: GitHub - gilbertchen/benchmarking: A performance comparison of Duplicacy, restic, Attic, and duplicity
If you prefer the Duplicati approach with as few downloads as possible, you can set the option --upload-verification-file, which will place a JSON file next to the uploaded files. The file contains a list of all files that Duplicati thinks should be there, along with their hashes. If you can run a check server-side that verifies the files, you can disable the remote verification as well.
The script is here:
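In case the linked script isn't handy, here is a minimal sketch of what such a server-side check could look like. It assumes a simplified manifest format (a JSON list of entries with a file name and a base64-encoded SHA-256 hash); the actual layout of Duplicati's verification file may differ, so treat this as illustrative only:

```python
# Hypothetical server-side check against an upload-verification manifest.
# ASSUMPTION: the manifest is a JSON list of {"Name": ..., "Hash": ...}
# entries with base64-encoded SHA-256 hashes; Duplicati's real file
# layout may differ.
import base64
import hashlib
import json
import os

def verify(manifest_path, storage_dir):
    """Return a list of (name, reason) tuples for files that fail the check."""
    with open(manifest_path) as f:
        entries = json.load(f)
    failures = []
    for entry in entries:
        path = os.path.join(storage_dir, entry["Name"])
        if not os.path.exists(path):
            failures.append((entry["Name"], "missing"))
            continue
        h = hashlib.sha256()
        with open(path, "rb") as f:
            # Read in 1 MiB chunks so large volumes don't fill memory.
            for block in iter(lambda: f.read(1 << 20), b""):
                h.update(block)
        if base64.b64encode(h.digest()).decode() != entry["Hash"]:
            failures.append((entry["Name"], "hash mismatch"))
    return failures
```

Running this from cron on the storage host would catch missing or corrupted volumes without any download traffic.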
Interestingly, Duplicacy does not have a server-side component either… Its free CLI version is quite similar to Duplicati, I’d say.
But regardless of other tools - shouldn’t it be really easy to set up an appropriate command line on the server to run compact/purge locally, next to the backup? Would it require rebuilding the local DB before each run?
The server side verification script looks like a great option for people that might need it.
Unless it turns out that a lot of people need this type of functionality, it makes sense to have it available as an advanced feature.
Perhaps somebody using it will put a How-To post in place.
I will have to look at it some more then! I was under the impression that it used the same methods as Attic, where there is a server component but it can run somewhat without it.
It would require that the server side knows the backup password at least. It would also require some kind of re-sync mechanism, otherwise Duplicati would not be able to figure out if files just dropped out of the backend, or if the compact occurred.
Generally, Duplicati has a “do not touch the storage folder” approach, and enforces/checks this to ensure that the backup is fully working.
Yes, it would be great if you could check Duplicacy again - a very interesting, competing tool.
It does have several drawbacks and also some wins compared to Duplicati - one of its best features is lock-free, multi-source, single-target de-duplication:
On the cons side, besides the paid UI tool, it has some issues with the way source is configured and handling of VSS on Windows (only single drive works).
I feel that Duplicati is a much better tool as it stands right now, and if you eventually decide to add similar de-duplication, that would be great (yes, I know you have no current plans to implement it).
As for the server side - I agree with you on the approach in general, but having the (advanced) ability to implement at least hash verification is a good option.
BTW, you may also want to look at adding hash verification at remote storage, similar to what rclone does - it can verify files against Backblaze B2 without downloading data…
dgcom, you seem to know a bit about Duplicacy - how would you feel about starting a Comparison category Topic comparing it to Duplicati?
Interesting idea, of course, but I am only familiar with the Duplicacy CLI version, which I evaluated just recently.
I’ll see how much useful info I can gather and post it if I decide it might be interesting to others.
Yes, I read it and it seems clever. However, I imagine that it fills the remote destination with millions of blocks? Am I missing some important detail here?
dgcom, well if you change your mind feel free to add your input to the (now) existing Comparison topic.
kenkendk, yes - I suppose if you have a 100GB source location with a BEST-case scenario of a single 100GB file and a 100KB block size, you’d end up with 1,048,576 individual block files on the destination (assuming there was nothing to de-dupe).
And that’s just the starting file count with no history!
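The arithmetic above, spelled out (using binary units, which is what makes the numbers come out to exactly 1,048,576):

```python
# Back-of-envelope block count for the worst case described above:
# 100 GiB of unique data stored as individual 100 KiB block files.
source_size = 100 * 1024**3   # 100 GiB in bytes
block_size = 100 * 1024       # 100 KiB in bytes
blocks = source_size // block_size
print(blocks)  # number of block files before any version history
```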
Yes, it creates smaller blocks, but size is configurable. Are you thinking about performance impact? From my (very quick and dirty) testing with network share I did not see a problem.
Yes, I have not seen many filesystems that like millions of files. Performance really starts to degrade when there are many files or folders.
That is what I am wondering: there are people reporting slow file listings (say with B2, which lists 200 files per call), and they are using 100KB blocks wrapped in 50MB volumes - it must be really bad if they are all just stored plain.
I haven’t actually tried Duplicacy yet but I know when I’ve dealt with lots of files with structured names I end up building a folder structure to break them up - perhaps that’s how they’re implementing it.
But that’s when you have all files in the same folder, no? If you use folder structure to index your blocks, processing can be pretty fast.
I was planning to do a bit more performance comparison - now I am even more curious…
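A common way to get that kind of folder-based indexing is to shard hash-named blocks into subfolders by hash prefix. The two-hex-character scheme below is just an illustration of the idea, not necessarily Duplicacy's actual layout:

```python
# Sketch: fan hash-named blocks out into subfolders by hash prefix so
# no single directory holds millions of files. A two-character hex
# prefix gives 256 subfolders, each holding roughly 1/256 of the chunks.
# ASSUMPTION: this layout is illustrative, not Duplicacy's real one.
import hashlib

def chunk_path(data: bytes) -> str:
    digest = hashlib.sha256(data).hexdigest()
    # e.g. "ab/abcdef..." - the full hash stays in the file name
    return f"{digest[:2]}/{digest}"

print(chunk_path(b"example chunk"))
```

With ~1M chunks, each subfolder would then hold only a few thousand files, which most filesystems handle comfortably.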
Me too, but I don’t have the time, so I hope to hear what you find out.
We have discussed splitting Duplicati backups into sub-folders, but not pursued it because with subfolders, you need to send multiple list requests to get the full file list.
That is not a problem with a local/network disk, but really sucks on some high-latency WebDAV interface.
Now I feel challenged
I was reading the Duplicacy forums and saw people successfully backing up 500GB to Wasabi, so it is doable.
But I guess restore might be more challenging.
I’ll see if I can test 10GB against B2 and compare timing.
P.S. Just remembered that I saw a good discussion on B2 efficiency in the Duplicacy GitHub… The question was not about performance, but about the number of requests - and, hence, the cost:
Not sure how they do exploration (figuring out another client has uploaded a chunk), but it does look like they use subfolders to store all chunks as files:
Yes, they do… But I think they flatten this for some cloud providers…
I am working on some tests to compare both approaches and shall see how folder looks on B2 and similar…
One drawback of the Duplicacy approach is that you cannot use a 3rd-party tool (like rclone) to copy backup sets between providers - but this is solved with the tool’s copy option.
Ok, I think I understand the process now. But how does it know which blocks are there? If you have 1 million blocks, listing them with a page size of at most 1000 still takes 1000 HTTP requests. Even at 1 sec per request, that is more than 15 minutes. And since it supports multiple writers, it pretty much needs to do this on every backup?
I haven’t checked the code, but I’d name each file with its hash - then you just need to check whether it exists at the destination to decide whether to upload it, and you do not need the entire list ahead of time.
Here is the quote from the Duplicacy design document:
Store each chunk in the storage using a file name derived from its hash, and rely on the file system API to manage chunks without using a centralized indexing database
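That quoted idea can be sketched in a few lines: derive the name from the chunk's hash and probe the destination for that one name instead of listing the whole store. The storage class here is a toy in-memory stand-in; against a real backend, exists() would be a HEAD request or a single-file stat:

```python
# Sketch of content-addressed chunk storage per the quoted design idea.
# ASSUMPTION: DictStorage is a stand-in for a real backend (filesystem,
# S3/B2, etc.), where exists() maps to a cheap per-file existence check.
import hashlib

class DictStorage:
    """Toy in-memory backend used only to illustrate the protocol."""
    def __init__(self):
        self.files = {}
    def exists(self, name):
        return name in self.files
    def upload(self, name, data):
        self.files[name] = data

def store_chunk(storage, chunk: bytes) -> bool:
    """Upload a chunk unless an identical one is already present.
    Returns True if an upload actually happened."""
    name = hashlib.sha256(chunk).hexdigest()
    if storage.exists(name):
        return False  # de-duplicated: another writer already uploaded it
    storage.upload(name, chunk)
    return True
```

This is also how multiple writers can share one target without locks: two clients uploading the same chunk produce the same name, so the second upload is either skipped or a harmless overwrite with identical content.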