Hi everyone,
I am running Duplicati on a Windows host and I also want to back up data from 2 Linux machines. To keep everything centralized and to run only one instance of Duplicati, I have written a pre-script for Duplicati on the Windows machine that logs in via SSH to those two Linux machines, creates a gzip tarball of all the data to back up there, and copies it over to the Windows machine. Those two copied tarballs are then part of the regular Duplicati backup. A rough sketch of what the pre-script does is below.
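This is just a minimal Python sketch to illustrate the setup, not my actual script; the hostnames, paths, and the use of the OpenSSH `ssh`/`scp` clients on the Windows host are assumptions:

```python
# Sketch of the pre-backup script run on the Windows host before Duplicati starts.
# Assumes ssh/scp are on PATH and key-based login to the Linux machines is set up.
import os
import subprocess

HOSTS = ["linux1.example.com", "linux2.example.com"]  # hypothetical hostnames
REMOTE_SRC = "/srv/data"                              # hypothetical directory to back up
REMOTE_TARBALL = "/tmp/backup.tar.gz"                 # temporary tarball on the Linux side
LOCAL_DIR = r"C:\Backups\linux-tarballs"              # folder included in the Duplicati source set

for host in HOSTS:
    # Create a gzip'ed tarball of the source directory on the Linux machine.
    subprocess.run(
        ["ssh", host, f"tar -czf {REMOTE_TARBALL} -C {REMOTE_SRC} ."],
        check=True,
    )
    # Copy the tarball to the Windows host; Duplicati then backs up this local copy.
    dest = os.path.join(LOCAL_DIR, f"{host}.tar.gz")
    subprocess.run(["scp", f"{host}:{REMOTE_TARBALL}", dest], check=True)
```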
While I am quite happy with that approach, I realized that it might prevent Duplicati from efficiently reusing blocks in the source data across consecutive backup snapshots, which could result in larger storage usage on the backend.
I am afraid that even small changes in the source data on the Linux machines might lead to binary-wise very different tarballs. And when the tarballs differ, Duplicati might have trouble finding common binary blocks, so the opportunities to reuse data would be missed.
So, the question is: is this concern justified?
And if yes, what would be a better approach? I thought about extracting the tarballs on the Windows machine so that Duplicati actually sees the raw source files and can more easily find reusable blocks that did not change compared to the last snapshot.
But the problem is that I would lose the Linux file permission information by extracting the tarballs to the Windows NTFS file system, which is not acceptable.
Another idea would be to ditch the single-instance approach and install Duplicati directly on those two Linux machines as well, so I would end up running 3 Duplicati instances, one on each machine. Every instance would then have direct access to the source files (no tarballs involved).
Doesn't sound too stupid either, right?
So, what is your opinion on this? Thanks guys!