Recommended block settings & more

I’m another CrashPlan refugee, and I’m just getting serious about building out a new solution.

I have a large amount of data (over 4 TB) that I will be backing up, but much of it is very static (mostly archived video files and such). I keep it all on a dual-parity ZFS array on a headless Ubuntu box that I control entirely from the CLI.

I keep local backups of all my home systems on the Ubuntu box, but I really want something cost-effective for off-site “disaster level” scenarios. After researching cloud solutions that support Linux and can handle the amount of data I have, I decided, instead of a conventional cloud service, to set up a small system at a family member’s house ~10 miles away. This leads to several important considerations.

I plan to use SSH across the WAN. That gives me a secure connection for data in transit and makes firewalling/NAT easy. (Correct me if there’s an easier way to accomplish everything below with my own equipment.)
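For reference, this is roughly how I was planning to lock down the account on the receiving box. It’s only a sketch; the user name and paths are placeholders, and it assumes OpenSSH on a Debian/Ubuntu-style system:

```
# Create a dedicated backup account and restrict it to SFTP only,
# chrooted to the backup disk. The chroot directory itself must be
# root-owned; backups go into a writable subfolder.
sudo adduser --disabled-password backupuser
sudo tee -a /etc/ssh/sshd_config <<'EOF'
Match User backupuser
    ChrootDirectory /srv/backups
    ForceCommand internal-sftp
EOF
sudo systemctl restart ssh
sudo mkdir -p /srv/backups/duplicati
sudo chown backupuser: /srv/backups/duplicati
```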

#1 I need to be considerate of bandwidth. This affects not only my own 2 TB monthly limit (of which I’m already using 1.5 TB) but also my family member’s bandwidth.

Keeping in mind that 90% of my data is static and cannot be compressed, what are the optimal settings to limit bandwidth in both directions? I ask this doubly because, as I understand it, the block/volume size settings can affect this through scanning of existing files (reads from the target backup destination). I don’t want to be reading every file on every backup; that would destroy our bandwidth and likely put us over our monthly quota.

#1(a) I will have a dedicated Linux server accepting the SSH connections, but it won’t be beefy (just dropping in an old dual-core system with a couple of big drives). What would the recommended destination file sizes be to limit bandwidth without adversely impacting the remote system?

#2 I would like to avoid having to set up NAT/port forwarding on the remote end if possible, since I won’t have regular access to or control of their (cheap, consumer-grade) home network equipment. To that end, is there any built-in way to initiate the connection from the remote system that avoids NAT? (I have considered implementing an IPsec tunnel; just looking for other options as well.)

#3 Would it be possible to populate the first backup within my local network and then move the system off-site? Or does Duplicati tie information about the backups to the hostname?

Duplicati will look at every local file that it’s backing up, but it shouldn’t be looking at any files on the backup destination beyond validating files and compacting/cleaning periodically.
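If you want hard limits on top of that, there are a few advanced options you can set on the job. The values below are only placeholders and I’m going from memory on the exact names, so double-check them against the options list:

```
duplicati-cli backup ssh://remote.example/duplicati /data \
    --auth-username=backupuser \
    --throttle-upload=2MB \
    --throttle-download=1MB \
    --backup-test-samples=1
```

`--backup-test-samples` controls how many remote volumes get downloaded and spot-checked after each backup, which is the main thing that reads back from the destination.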

I don’t think the size matters much on the receiving end. The receiving end just stores the files; it doesn’t do anything with them.

This one is a bit tricky, as Duplicati doesn’t provide any tools for it itself. In general it’s best to use NAT, a VPN, or to back up to your local network and publish that backup somewhere the remote system can download it from, although that requires twice the space :wink:

Edit: maybe you can do some reverse port forwarding to bypass the destination firewall: Bypassing corporate firewall with reverse ssh port forwarding - think shell - | toic.org
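Something along these lines, roughly (host name and ports are made up; you’d run this from the remote box, which only needs outbound access):

```
# Run on the remote box; it dials out, so the family member's router
# needs no port forwarding. The -R flag publishes the remote box's SSH
# port back on your home system as localhost:2222.
ssh -N -R 2222:localhost:22 tunneluser@home.example.com

# At home, Duplicati then connects to ssh://localhost:2222/... instead
# of the remote address. (autossh can keep the tunnel up automatically.)
```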

Yes. Definitely do a full backup before moving it off site when looking at such large data sizes. Duplicati won’t care how the backup files got there, just that they are there.

I think the biggest issue I’ve heard of with very large datasets is large, slow local databases. Increasing your volume and block sizes will improve this at the cost of additional network overhead. I haven’t played with this a lot, but maybe someone else can chime in with some settings they’ve tested?

There are some considerations when changing block and volume size, such as how much local disk space is used during the backup. For example, about 4 times the volume size is usually staged locally while the system waits for volumes to upload.

Very large block sizes will, as mentioned, increase overhead for small file changes: the bigger the block, the more data has to be re-uploaded when even one bit of it changes. So bigger blocks and volumes generally increase network usage to some extent, but improve database speed by keeping the database smaller.
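I haven’t benchmarked these either, but as a starting point the relevant knobs look something like this (values are examples only, not tested recommendations):

```
duplicati-cli backup ssh://remote.example/duplicati /data \
    --blocksize=1MB \
    --dblock-size=200MB \
    --asynchronous-upload-limit=2

# --blocksize can't be changed after the first backup, so pick it up front.
# --asynchronous-upload-limit caps how many volumes are staged locally at
#   once, which is where the "4 times the volume size" of temp space comes from.
```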

Excellent information, thank you.

I had planned to put the SQLite DB onto my ZFS array. It has great IOPS and async write speeds, so hopefully that will mitigate things a bit.
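Something like this is what I have in mind, for anyone following along (the pool path is just an example):

```
# Point the job's local database at the array instead of the default
# location under ~/.config/Duplicati.
duplicati-cli backup ssh://remote.example/duplicati /data \
    --dbpath=/tank/duplicati/offsite.sqlite
```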

You can seed the backup if you want. Just create the backup job to copy to an external device like a portable disk. Take it to your family member’s house and copy it to the location you will be backing up to over SSH, then reconfigure the job to back up to the SSH target. After that Duplicati will only send over the changes, which, depending on your rate of data change, shouldn’t be a lot.
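Roughly like this if you’re driving it from the CLI (URLs and paths are placeholders, and I’m going from memory on the exact syntax):

```
# 1) Seed to a portable disk while everything is still on the local network.
duplicati-cli backup file:///mnt/portable/offsite /data \
    --dbpath=/tank/duplicati/offsite.sqlite

# 2) Copy the contents of /mnt/portable/offsite into the SSH destination
#    folder at the remote site, then point the same job (same --dbpath)
#    at the new target and run it again.
duplicati-cli backup ssh://remote.example/duplicati /data \
    --dbpath=/tank/duplicati/offsite.sqlite
```

As long as the files end up in the destination folder unchanged, Duplicati doesn’t care how they got there.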