Benchmarking different storage providers

hosting-services

#25

Not sure how Duplicacy handles compression, but you can also set --zip-compression-level=0 to disable the compression part if that gives you a speedup.
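To illustrate what level 0 means in a zip/deflate stream (generic zlib behavior, not Duplicati's code): level 0 stores the data without compressing it, trading space for CPU time, which is where the potential speedup comes from.

```python
import zlib

data = b"example payload " * 4096  # highly compressible sample input

stored = zlib.compress(data, 0)   # level 0: store only, no compression work
packed = zlib.compress(data, 9)   # level 9: maximum compression effort

# Stored output is slightly *larger* than the input (block headers/checksum),
# while level 9 shrinks repetitive data dramatically.
print(len(data), len(stored), len(packed))
```

So level 0 only pays off when the CPU is the bottleneck or the data is already compressed (media files, archives).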

Multi-core support is being developed for Duplicati, but I do not yet have a public build with it enabled.


#26

Yes, if you do raw block-level backup from disk (or other medium) I guess this can make a big difference.


#27

Which part(s) of the process have you worked on? Areas I can think of that could benefit from multi-threading include:

  • source file scanning
  • compression
  • uploading a single archive (dblock?) in multiple pieces at the same time
  • uploading multiple archives (dblock?) at the same time

#28

The whole process (listing files, checking files, extracting blocks, hashing blocks, hashing files, compression, encryption, etc.) is done in multiple threads. It does not upload multiple files concurrently, because there is some logic that is hard to get right if one file uploads successfully while another fails.

In other words, everything runs concurrently, but the uploads are still sequential.
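The "concurrent stages, sequential upload" pattern described above can be sketched with a producer/consumer queue (this is an illustrative Python sketch, not Duplicati's actual code; `volume-for-*` names are made up stand-ins for finished dblock files):

```python
import queue
import threading

def process_blocks(files, volume_queue):
    # In the real pipeline, scanning/hashing/compression/encryption each
    # run concurrently; here they are collapsed into one producer stage.
    for f in files:
        volume = f"volume-for-{f}"  # stand-in for a finished archive volume
        volume_queue.put(volume)
    volume_queue.put(None)  # sentinel: no more volumes coming

def upload_sequentially(volume_queue, uploaded):
    # Single consumer: volumes leave one at a time, in creation order,
    # so a failed upload never races a successful one.
    while True:
        v = volume_queue.get()
        if v is None:
            break
        uploaded.append(v)

uploaded = []
q = queue.Queue(maxsize=4)  # cf. --asynchronous-upload-limit=4
producer = threading.Thread(target=process_blocks, args=(["a", "b", "c"], q))
consumer = threading.Thread(target=upload_sequentially, args=(q, uploaded))
producer.start(); consumer.start()
producer.join(); consumer.join()
print(uploaded)
```

The bounded queue is what keeps the fast producer stages from running arbitrarily far ahead of the single upload thread.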


#29

Sounds good, thanks!

I recall now you mentioning the single upload thread here.


#30

If Duplicati uses a single thread for upload, then what does this parameter do:

  --asynchronous-upload-limit = 4
    When performing asynchronous uploads, Duplicati will create volumes that can be uploaded. To prevent Duplicati
    from generating too many volumes, this option limits the number of pending uploads. Set to zero to disable the
    limit

Is this just the number of volumes to create ahead of time for the single upload thread?


#31

I believe that sets how many archive files "ahead" of the current upload are created. For example, if your archive (dblock) size is set to 100M and the upload limit parameter is set to 4, then Duplicati will queue up no more than 4 archive files awaiting their turn to be uploaded sequentially.

I recall at least one post where a user questioned why Duplicati needed 2 GB of local temp storage, until they realized they had the upload limit set to 4 and the archive size set to 500M.
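That temp-space arithmetic is just the queue depth times the volume size (`pending_temp_mb` is a hypothetical helper for illustration, not a Duplicati function):

```python
def pending_temp_mb(volume_size_mb, upload_limit):
    """Worst-case local temp space consumed by volumes queued for upload."""
    return volume_size_mb * upload_limit

# dblock size 500M with --asynchronous-upload-limit=4:
print(pending_temp_mb(500, 4))  # 2000 MB, i.e. the ~2 GB that user saw
```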


#32

Ok, thank you for re-interpreting this… On my initial read I took it to mean 4 concurrent uploads, but after learning a bit more about how Duplicati works, I realized that this is just an upload queue size… I think the parameter name is more confusing than its description :slight_smile:


#33

Yes, that is a good description.

We can rename it, and provide an alias for the old value.
Feel free to add a PR for it :smile:


#34

Well, since this is OSS, fixing bugs is self-service, I guess :slight_smile:
I’ll see if I can get a PR in for this later.


#35

Updated the spreadsheet with some new data points (and some tidy-ups, since it is now external).

https://f000.backblazeb2.com/file/backblaze-b2-public/Duplicati_Stats.xls

Some observations:

Overall Speed Results

  • Onedrive/Google still the champs: solid averages with no outliers.
  • pCloud still strong, but with a couple of (not so bad) outliers.
  • Backblaze is good, but there was one major outlier (more on that below).
  • Box over WebDAV looking pretty good (but with a slightly smaller sample size, since it started later).

Outliers

I examine averages as well as the variability of the backup time. Jotta is a fail (and likely to be dropped from the test suite soon) since there are so many outliers.

For one particular run with Box over WebDAV I was actually watching the status bar, and the outlier (Box has otherwise been very good and consistent) was due to the ‘compacting data’ operation (and post-verification). I am not sure what happens in that operation, but it is clearly very costly, and it may be worth the devs looking into how to optimize the transfers to reduce the impact.
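A rough cost model suggests why compacting can blow out a run: as I understand it, compacting downloads under-filled volumes, repacks the surviving blocks, and re-uploads them, so the transfer cost is roughly download plus upload of everything touched. A hypothetical back-of-envelope sketch (`compact_transfer_mb` is a made-up helper, and the model is an assumption, not Duplicati's documented behavior):

```python
def compact_transfer_mb(volumes_touched, volume_size_mb, wasted_fraction):
    """Rough transfer cost of a compact pass under a simple model:
    download every touched volume, then re-upload only the surviving data."""
    download = volumes_touched * volume_size_mb
    upload = download * (1 - wasted_fraction)
    return download + upload

# e.g. 10 half-wasted 100 MB volumes: 1000 MB down + 500 MB up = 1500 MB
print(compact_transfer_mb(10, 100, 0.5))
```

On a slow backend, that extra traffic dwarfs the incremental backup itself, which would explain the outlier.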

Further Research

Obviously the best result is local storage, which is much better than all cloud providers. One particular feature of pCloud is that it provides a virtual drive, which may give the best of both worlds: the speed of local storage, with a later sync to the cloud for safety, and without the cost of local storage space.

More research is needed, since the pCloud virtual drive is (transparently) cached and so does consume some local storage. But assuming pCloud's cache strategy is sound, the impact on speed may not be high, and it is otherwise invisible to the user in terms of managing local storage. Will update that in the next round.


#36

You’re doing such a great job so I don’t want to ask too much, but if you don’t mind, could you post your updated results directly in the actual post? It would make it much easier for people to find and read those results and they would be preserved as long as this forum exists.

It can be done via copy and paste: Here's a very easy way of creating tables on this forum


#37

No problem - inline table as noted.

Pretty much done with the testing now.

Conclusions (with commentary on cost side)

  • If backing up moderate data sets (less than 1 TB), then B2 would be the most economical. Speed is not bad (aside from one major blow-up in my testing) with a pay-as-you-use model. Need to be careful with the restore costs though.

  • For large data sets, the winner is… Onedrive. With the MS Family Pack mentioned in the post (Experiences with OneDrive as a backup target), it is very economical for anything up to 5 TB, with good performance (and you get MS Office as a freebie).

  • If speed is the only priority, then also consider pcloud using the virtual local drive: the speed of a local backup, with separate cloud syncing happening in the background. The firm is running an interesting 2 TB for life plan (a rather high upfront cost, but it breaks even after a couple of years compared to other paid services).

[[ Update ]]

Further testing was done on S3; see the link below.

In terms of conclusions, the above holds unless cost is no issue and speed is the highest priority; in that case S3 is the clear winner (with caveats as noted in the post).

Exclude Outliers

| Provider | Local | HubiC | Mega | Onedrive | Google | pcloud | Backblaze B2 | Box (WebDav) | pcloud (virtual) | Jotta | Box |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Count | 36 | 35 | 35 | 35 | 35 | 35 | 35 | 32 | 18 | 30 | 9 |
| Avg | 0:03 | 1:24 | 0:41 | 0:31 | 0:32 | 0:30 | 0:34 | 0:21 | 0:02 | 1:28 | 2:43 |
| Rank | 2 | 9 | 8 | 5 | 6 | 4 | 7 | 3 | 1 | 10 | 11 |
| StdDev | 0:03 | 0:47 | 0:20 | 0:21 | 0:22 | 0:30 | 0:20 | 0:07 | 0:03 | 0:33 | 0:30 |
| 95% Perc | 0:11 | 2:57 | 1:20 | 1:14 | 1:16 | 1:30 | 1:14 | 0:36 | 0:09 | 2:34 | 3:43 |
| Rank | 2 | 10 | 7 | 4 | 6 | 8 | 5 | 3 | 1 | 9 | 11 |

Include Outliers

| Provider | Local | HubiC | Mega | Onedrive | Google | pcloud | Backblaze B2 | Box (WebDav) | pcloud (virtual) | Jotta | Box |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Count | 36 | 36 | 37 | 36 | 35 | 38 | 38 | 35 | 18 | 37 | 9 |
| Avg | 0:03 | 1:28 | 0:50 | 0:34 | 0:32 | 0:43 | 1:17 | 0:31 | 0:02 | 2:55 | 2:43 |
| Rank | 2 | 9 | 7 | 5 | 4 | 6 | 8 | 3 | 1 | 11 | 10 |
| StdDev | 0:03 | 0:50 | 0:44 | 0:25 | 0:22 | 0:53 | 3:40 | 0:35 | 0:03 | 3:39 | 0:30 |
| 95% Perc | 0:11 | 3:06 | 2:18 | 1:23 | 1:16 | 2:29 | 8:29 | 1:41 | 0:09 | 10:05 | 3:43 |
| Rank | 2 | 8 | 6 | 4 | 3 | 7 | 10 | 5 | 1 | 11 | 9 |

#38

I don’t suppose in all your testing with the above providers you happened to have run across which ones provide per-file hashes?

Also, I’m assuming it’s not worth trying to do benchmarks with TahoeLAFS due to the variance in the underlying destinations…
https://forum.duplicati.com/t/setting-up-tahoelafs/587/3


#39

Hi Jon,

Not too sure about the providers having hashes as part of the API. Is there any way I can find out / test for that?

Right now I am pretty much retiring the testing, as the results are pretty stable and conclusive now. I would consider testing again if there is a new backend (anyone want to give me an invite for free storage somewhere :slight_smile: :slight_smile: :slight_smile: ). However, Tahoe just looks a little too advanced/manual to set up, so I haven’t tested it.


#40

Here is one list which shows which providers support local hashes:
https://rclone.org/overview/