Benchmarking different storage providers

hosting-services

#21

I guess it falls between the two. You are only storing the difference, but you do not have a dependency. “Deconstructed backup” maybe?


#22

You raise some great points and I really appreciate your thoughtful approach. In enterprise data protection solutions, dynamic block sizes are common. This is because data is typically laid down as a stream, so finding that one-byte change can make a huge difference: a one-byte offset could potentially impact current and future streams. The data is laid down this way due to the history of backup applications writing to streaming media, a.k.a. tape.

The benefit of your product is that you are focused more on a file-based strategy, which makes more sense, and I completely agree about the stagnant nature of data. Using a fixed-block strategy on a file-based backup makes sense, especially since the one-byte offset will really only affect one file, which is not a huge penalty in the grand scheme of things.
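(A minimal sketch of that point, in case it helps anyone following along: it only illustrates why one inserted byte shifts every subsequent fixed-size block, and is not either product's actual chunking code.)

```python
import hashlib

def fixed_blocks(data: bytes, size: int = 8):
    """Split data into fixed-size blocks and hash each block."""
    return [hashlib.sha256(data[i:i + size]).hexdigest()[:8]
            for i in range(0, len(data), size)]

original = b"The quick brown fox jumps over the lazy dog."
edited   = b"XThe quick brown fox jumps over the lazy dog."  # one byte inserted up front

before, after = fixed_blocks(original), fixed_blocks(edited)
reusable = sum(1 for h in before if h in after)
print(f"{len(before)} blocks before, {reusable} reusable after a 1-byte insert")
# Fixed blocks: every boundary shifted, so almost nothing is reusable.
# Content-defined (variable) chunking derives boundaries from the data itself,
# so only the block containing the insertion would change.
```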

The thing that I personally am trying to get my head around is the performance difference between the different algorithms (specifically Duplicacy and Duplicati). Naturally, the products are different and have different defaults, but given that I am using a low-end single-board computer as my “backup server” and that cloud storage is relatively cheap, I would be willing to trade off storage efficiency for speed.

My personal sense is that the default parameters for Duplicacy favor speed over efficiency, whereas Duplicati's are the opposite. One example is that Duplicacy leaves encryption off by default, which is a problem, IMO.

Anyway, I plan to continue tests and would welcome any suggested Duplicati parameters.

Thank you!


#23

Interesting thought - have you tried a performance comparison with Duplicati configured as similarly to Duplicacy as possible?


#24

I am trying to do some of that now. I am not going to mess with the deduplication settings; they are what they are. However, Duplicacy had an unfair advantage with encryption turned off, so I am running a backup as we speak on the same data set with encryption enabled.

The other setting I am thinking about is threading. My SBC is a quad-core, and I am debating enabling the multi-threading option in Duplicacy to see what it does. Does Duplicati support multi-threading? (I think the Duplicacy multi-threading parameter refers to the number of simultaneous uploads.)

To be clear, in case anyone wonders, I am backing up about 2.5GB of file data residing on a NAS over a GigE LAN. These are typical office docs and my backup target is B2. The source server is a 2GB Pine64 connected to the NAS via a GigE switch.
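In case anyone wants to reproduce this kind of run, the timing harness I have in mind is roughly the sketch below; the commands are just placeholders (echo), not verified Duplicati or Duplicacy invocations.

```python
import csv, shlex, subprocess, time

# Placeholder commands: substitute the actual Duplicati / Duplicacy invocations.
JOBS = {
    "duplicati-encrypted": "echo simulated duplicati run",
    "duplicacy-encrypted": "echo simulated duplicacy run",
}

def time_backup(command: str) -> float:
    """Run one backup job and return wall-clock seconds."""
    start = time.monotonic()
    subprocess.run(shlex.split(command), check=True)
    return time.monotonic() - start

with open("backup_times.csv", "a", newline="") as f:
    writer = csv.writer(f)
    for name, cmd in JOBS.items():
        elapsed = time_backup(cmd)
        writer.writerow([name, f"{elapsed:.1f}"])
        print(f"{name}: {elapsed:.1f}s")
```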


#25

Not sure how Duplicacy handles compression, but you can also set --zip-compression-level=0 to disable the compression part if that gives you a speedup.
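A rough way to see why the "store only" level can help on slow hardware is to time zlib at level 0 versus the default level; this is only a stand-in for what the zip writer inside a backup run does, not Duplicati's code.

```python
import os, time, zlib

# A mix of incompressible and repetitive data, standing in for a backup volume.
data = os.urandom(512 * 1024) + b"typical office document text " * 50_000

for level in (0, 6):   # 0 = store only (no compression), 6 = zlib's default
    start = time.perf_counter()
    out = zlib.compress(data, level)
    elapsed = (time.perf_counter() - start) * 1000
    print(f"level {level}: {len(out) / len(data):.0%} of original size in {elapsed:.1f} ms")
```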

Multi-core support has been developed for Duplicati, but I do not yet have a public build with it enabled.


#26

Yes, if you do raw block-level backup from disk (or other medium) I guess this can make a big difference.


#27

Which part(s) of the process have you worked on? Areas I can think of that could benefit from multi-threading include:

  • source file scanning
  • compression
  • uploading a single archive (dblock?) in multiple pieces at the same time
  • uploading multiple archives (dblock?) at the same time

#28

The whole process (listing files, checking files, extracting blocks, hashing blocks, hashing files, compression, encryption, etc.) is done in multiple threads. It does not upload multiple files concurrently, because there is some logic that is hard to get right if one file has been uploaded and another fails.

In other words everything is running concurrently, but the uploads are still sequential.
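For anyone trying to picture that, here is a rough sketch of the shape described (the names, queue sizes and block sizes are made up, not Duplicati's internals): several worker threads hash and compress blocks concurrently, while a single thread drains the finished volumes in order.

```python
import hashlib, queue, threading, zlib

work_q, upload_q = queue.Queue(), queue.Queue()

def block_worker():
    """Several of these run at once: hash and compress blocks concurrently."""
    while True:
        block = work_q.get()
        if block is None:
            break
        digest = hashlib.sha256(block).hexdigest()
        upload_q.put((digest, zlib.compress(block)))

def uploader():
    """A single sequential uploader, which keeps failure handling simple."""
    while True:
        item = upload_q.get()
        if item is None:
            break
        digest, payload = item
        print(f"uploaded volume {digest[:8]} ({len(payload)} bytes)")

workers = [threading.Thread(target=block_worker) for _ in range(4)]
up = threading.Thread(target=uploader)
for t in workers + [up]:
    t.start()

for i in range(8):                  # pretend these are blocks extracted from files
    work_q.put(bytes([i]) * 64_000)
for _ in workers:                   # one stop signal per worker
    work_q.put(None)
for t in workers:
    t.join()
upload_q.put(None)                  # then stop the uploader
up.join()
```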


#29

Sounds good, thanks!

I recall now you mentioning the single upload thread here.


#30

If Duplicati uses a single thread for upload, what does this parameter do then:

  --asynchronous-upload-limit = 4
    When performing asynchronous uploads, Duplicati will create volumes that can be uploaded. To prevent Duplicati
    from generating too many volumes, this option limits the number of pending uploads. Set to zero to disable the
    limit

Is this just the number of volumes to prepare ahead of time for the single upload thread?


#31

I believe that sets how many archive files "ahead" of the current upload are created. For example, if your archive (dblock) size is set to 100M and the upload limit parameter is set to 4, then Duplicati will queue up no more than 4 archive files awaiting their turn to be sequentially uploaded.

I recall at least one post where a user questioned why Duplicati needed 2 gigs of local temp storage until they realized they had the upload limit set to 4 and archive size set to 500M.
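A small sketch of that interpretation, purely illustrative: volume creation blocks as soon as the configured number of pending volumes is waiting, which is also why temp usage is roughly the limit times the volume size (the ~2 GB case above).

```python
import queue, threading, time

UPLOAD_LIMIT = 4      # mirrors --asynchronous-upload-limit=4
VOLUME_MB = 500       # mirrors a 500M dblock size

# put() blocks once UPLOAD_LIMIT volumes are waiting, capping temp usage
# at roughly UPLOAD_LIMIT * VOLUME_MB (about 2 GB in the example above).
pending = queue.Queue(maxsize=UPLOAD_LIMIT)

def upload_sequentially():
    """Single upload thread draining the queue in order."""
    while True:
        vol = pending.get()
        if vol is None:
            break
        time.sleep(0.2)                   # simulate a slow upload
        print(f"uploaded {vol}")

uploader = threading.Thread(target=upload_sequentially)
uploader.start()

for i in range(8):
    pending.put(f"dblock-{i:03d}")        # waits here when the queue is full
    print(f"queued dblock-{i:03d} (cap: {UPLOAD_LIMIT * VOLUME_MB} MB of temp)")

pending.put(None)
uploader.join()
```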


#32

Ok, thank you for re-interpreting this… I read it as 4 concurrent uploads on my initial read, but after learning a bit more about how Duplicati works, I realized that this is just an upload queue size… I think the parameter name is more confusing than its description :slight_smile:


#33

Yes, that is a good description.

We can rename it, and provide an alias for the old value.
Feel free to add a PR for it :smile:


#34

Well, since this is OSS, fixing bugs is a self-service, I guess :slight_smile:
I’ll see if I can put together a PR for this later.


#35

Updated the spreadsheet with some new data points (and some tidy-ups, since it is now external).

https://f000.backblazeb2.com/file/backblaze-b2-public/Duplicati_Stats.xls

Some observations:

Overall Speed Results

  • Onedrive/Google still the champ: solid averages with no outliers.
  • pCloud still strong, but with a couple of (not so bad) outliers.
  • Backblaze is good, but there was one major outlier (more on that below).
  • Box over WebDAV looking pretty good (but over a slightly smaller sample size, since it started later).

Outliers

I examine averages as well as variability of the backup time. Jotta is a fail (and likely to be dropped from the test suite soon) since there are so many outliers.

For one particular run with Box over WebDAV I was actually watching the status bar, and the outlier for Box (which has otherwise been very good and consistent) was due to the 'compacting data' operation (and post-backup verification). Not sure what happens in that operation, but it sure is very costly, and it might be worth the devs looking into how to optimize those transfers to reduce the impact.
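For reference, the summary figures in the table below (average, standard deviation, 95th percentile) can be computed roughly along these lines; the m:ss parsing and the 2-sigma outlier rule here are just illustrative assumptions, not exactly what the spreadsheet does.

```python
import statistics

def to_seconds(mmss: str) -> int:
    """Parse a duration string, assumed to be in m:ss form."""
    m, s = mmss.split(":")
    return int(m) * 60 + int(s)

runs = ["0:31", "0:29", "0:35", "0:30", "2:55", "0:33"]   # made-up run times, one outlier
secs = [to_seconds(r) for r in runs]

avg = statistics.mean(secs)
sd = statistics.stdev(secs)
p95 = statistics.quantiles(secs, n=20)[-1]                # 95th percentile cut point
outliers = [s for s in secs if s > avg + 2 * sd]          # a simple 2-sigma rule

print(f"avg {avg:.0f}s, stdev {sd:.0f}s, 95th pct {p95:.0f}s, outliers: {outliers}")
```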

Further Research

Obviously the best result is local storage, which is much better than all cloud providers. One particular feature of pCloud is that it provides a virtual drive, which may give the best of both worlds: the speed of local storage, with a later sync to the cloud for safety, and without the cost of local storage space.

Need more research, since the pCloud virtual drive is (transparently) cached and so does consume some local storage. However, assuming pCloud's cache strategy is sound, the impact on speed may not be high, and the local storage management is otherwise invisible to the user. Will update that in the next round.


#36

You’re doing such a great job so I don’t want to ask too much, but if you don’t mind, could you post your updated results directly in the actual post? It would make it much easier for people to find and read those results and they would be preserved as long as this forum exists.

It can be done via copy and paste: Here's a very easy way of creating tables on this forum


#37

No problem - inline table as noted.

Pretty much done with the testing now.

Conclusions (with commentary on cost side)

  • If backing up moderate data sets (less than 1 TB), B2 would be the most economical. Speed is not bad (aside from one major blow-up in my testing) with the pay-as-you-use model. Need to be careful with the restore cost, though.

  • For large data sets, the winner is… Onedrive. With the MS Family Pack mentioned in the post (Experiences with OneDrive as a backup target), it is very economical for anything up to 5TB, with good performance (and you get MS Office as a freebie).

  • If speed is the only priority, then pCloud with its virtual local drive is also worth considering: the speed of a local backup, with the cloud syncing handled separately. The firm is running an interesting 2TB-for-life plan (a rather high upfront cost, but it breaks even after a couple of years compared to other paid services; see the rough break-even sketch below).
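That "breaks even after a couple of years" remark is just simple arithmetic; here is a sketch with made-up placeholder prices, not actual pCloud or competitor pricing.

```python
# Made-up numbers for illustration only; plug in the real prices before deciding.
lifetime_price = 350.00       # hypothetical one-off "2 TB for life" plan
monthly_alternative = 10.00   # hypothetical comparable 2 TB monthly subscription

breakeven_months = lifetime_price / monthly_alternative
print(f"break-even after {breakeven_months:.0f} months (~{breakeven_months / 12:.1f} years)")
```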

[[ Update ]]

Further testing was done on S3 - see the link below.

In terms of conclusions - the above holds unless cost is not an issue and speed is the highest priority; in that case S3 is the clear winner (with caveats as noted in the post).

Excluding outliers

| | Local | HubiC | Mega | Onedrive | Google | pcloud | Backblaze B2 | Box (WebDav) | pcloud (virtual) | Jotta | Box |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Count | 36 | 35 | 35 | 35 | 35 | 35 | 35 | 32 | 18 | 30 | 9 |
| Avg | 0:03 | 1:24 | 0:41 | 0:31 | 0:32 | 0:30 | 0:34 | 0:21 | 0:02 | 1:28 | 2:43 |
| Rank (Avg) | 2 | 9 | 8 | 5 | 6 | 4 | 7 | 3 | 1 | 10 | 11 |
| StdDev | 0:03 | 0:47 | 0:20 | 0:21 | 0:22 | 0:30 | 0:20 | 0:07 | 0:03 | 0:33 | 0:30 |
| 95% Perc | 0:11 | 2:57 | 1:20 | 1:14 | 1:16 | 1:30 | 1:14 | 0:36 | 0:09 | 2:34 | 3:43 |
| Rank (95%) | 2 | 10 | 7 | 4 | 6 | 8 | 5 | 3 | 1 | 9 | 11 |

Including outliers

| | Local | HubiC | Mega | Onedrive | Google | pcloud | Backblaze B2 | Box (WebDav) | pcloud (virtual) | Jotta | Box |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Count | 36 | 36 | 37 | 36 | 35 | 38 | 38 | 35 | 18 | 37 | 9 |
| Avg | 0:03 | 1:28 | 0:50 | 0:34 | 0:32 | 0:43 | 1:17 | 0:31 | 0:02 | 2:55 | 2:43 |
| Rank (Avg) | 2 | 9 | 7 | 5 | 4 | 6 | 8 | 3 | 1 | 11 | 10 |
| StdDev | 0:03 | 0:50 | 0:44 | 0:25 | 0:22 | 0:53 | 3:40 | 0:35 | 0:03 | 3:39 | 0:30 |
| 95% Perc | 0:11 | 3:06 | 2:18 | 1:23 | 1:16 | 2:29 | 8:29 | 1:41 | 0:09 | 10:05 | 3:43 |
| Rank (95%) | 2 | 8 | 6 | 4 | 3 | 7 | 10 | 5 | 1 | 11 | 9 |

#38

I don’t suppose in all your testing with the above providers you happened to have run across which ones provide per-file hashes?

Also, I’m assuming it’s not worth trying to do benchmarks with TahoeLAFS due to the variances in the underlying destinations…
https://forum.duplicati.com/t/setting-up-tahoelafs/587/3


#39

Hi Jon,

Not too sure about the providers exposing hashes as part of their API. Is there any way I can find out / test for that?

Right now I am pretty much retiring the testing, as the results are pretty stable and conclusive now. I would consider testing again if there is a new backend (anyone want to give me an invite for free storage somewhere :slight_smile: :slight_smile: :slight_smile: ). However, Tahoe looks a little too advanced/manual to set up, so I haven't tested it.


#40

Here is one list that shows which providers support local hashes:
https://rclone.org/overview/
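To make the per-file hash question concrete: if a backend reports a hash for each stored file, an upload can be verified without re-downloading it. A sketch of the idea, where fetch_remote_hash() is a hypothetical placeholder for whatever file-info call a given provider's API actually offers (the hash algorithm also differs per provider, as the rclone overview shows).

```python
import hashlib

def local_sha1(path: str) -> str:
    """Hash a local file in chunks so large archives don't have to fit in RAM."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()

def fetch_remote_hash(remote_path: str) -> str:
    """Hypothetical placeholder: return the hash the provider's API reports
    for the stored file. Not a real client library."""
    raise NotImplementedError

def uploaded_file_is_intact(local_path: str, remote_path: str) -> bool:
    """Compare the local hash against the provider-reported one."""
    return local_sha1(local_path) == fetch_remote_hash(remote_path)
```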