How we are migrating (many of) our servers from Linux to FreeBSD - Part 2 - Backups and Disaster Recovery

After my post on why we’re migrating (most of) our servers from Linux to FreeBSD, I’ve started to write about how we’re doing it. After covering a basic installation (we make heavy use of jails), I’ll now describe how we perform backups.

Backup is not a tool. Backup is not software you can buy. Backup is a strategy you need to study and implement to solve your specific problems. You need to understand what you’re doing, otherwise you’ll always have a Schrödinger’s Backup - it may or may not work, and if you don’t test it well enough (i.e. by actually restoring it) you’ll find out when it’s too late.

We perform backups in many different ways but, for our physical and virtual FreeBSD servers, we take a dual approach. We need both a “ready to use” backup (described here, useful for fast disaster recovery or a prompt restore of specific jails) and a “colder”, more space-efficient backup that can be kept for months (or years), similar to the borg approach described in previous posts. Generally speaking, we store our OS (and jails) on ZFS, so that is the approach I’ll describe here.

Disaster recovery backup - ZFS send/receive

BastilleBSD creates its datasets and mounts them on /usr/local/bastille. There are no databases: all the jails’ configurations live inside that mountpoint, so it’s quite easy to back up and restore all the jails, or any single jail, in one go. Moreover, the “everything-in-a-jail” approach simplifies the restore process, as you don’t need to restore the entire host OS: just install an empty FreeBSD server, install BastilleBSD and restore the jails. Or add the jails to an already existing FreeBSD system.

We normally use FreeBSD (or Linux with ZFS) backup servers, well protected and encrypted at rest. For the ZFS send/receive approach, our servers are NOT reachable from the outside. We can ssh into them only using a VPN - they’re too precious to be exposed on the World Wild Web - or, if strictly needed, we expose ssh only using keys, no passwords. We perform the backups using a pull strategy: the backup server connects to the production servers, gets the data, disconnects. The production servers have NO ACCESS to the main backup server. Should they ever be seriously compromised, the backup is safe.

There are many tools that can help to set up this kind of configuration. I’ve tried many of them and found that they all have some good and bad points. The one I decided to use for our servers is zfs-autobackup. It’s easy to use, everything can be set on the command line, and it produces clean output for cron (or Jenkins), which makes it easy to see whether everything went well.

Let’s consider two servers, one called “ProdA” and the other called “Bck” - we obviously want to back up ProdA to Bck.

Installing ProdA was covered in a previous post; setting up Bck is quite simple and outside the scope of this one. We just need a protected FreeBSD (or Linux) server with ZFS. That’s all. Let’s assume that ProdA has a BastilleBSD zfs dataset (and child datasets), with jails and everything needed, as configured in the last post. We now need to install the required software. On Bck:

pkg install py311-zfs-autobackup mbuffer

On ProdA:

pkg install mbuffer

mbuffer will be used as a RAM buffer to smooth out read/write spikes (or slowdowns) while sending/receiving the snapshots.

It’s time to prepare the destination dataset. Assuming that Bck has a zroot base dataset, we’ll create (as root) the zroot/backups/ProdA dataset:

zfs create -p zroot/backups/ProdA

Ok, let’s now go to ProdA.

We want to create an unprivileged user that will send the data. We don’t want to allow Bck to connect as root, even if it’s trusted and secure. Let’s create a user called “backupper”. Then, we need to give backupper the right permissions:

zfs allow -u backupper send,snapshot,hold,mount,destroy zroot

Note: if you want Bck to be able to delete the snapshots on ProdA, backupper needs the destroy permission. That means this user can destroy the whole system, as it can ALSO destroy zroot (or any source dataset you decide). If you’re afraid of this, a different approach must be used (e.g. local root performing snapshots/cleanups and Bck only transferring them, which is not hard to achieve with zfs-autobackup). Considering that Bck is safe, secure and protected, we can tolerate this weakness. Just be sure nobody can break into the “backupper” user. Do not use a password, use ssh keys, and treat this user with the same care you'd use with root.
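The creation of backupper is not spelled out above. A minimal sketch of how it could be done on ProdA, before the zfs allow step, follows. Only the user name comes from this post: the login shell, the /home location, the PUBKEY_FILE path and the run helper are assumptions made for illustration. The script prints the commands by default, so they can be reviewed before anything is changed.

```shell
#!/bin/sh
# Sketch for ProdA, to be run as root before the "zfs allow" step.
# DRY_RUN=1 (the default here) only prints the commands; DRY_RUN=0 executes them.
# Assumptions: home directories live under /home, and Bck's public key has
# already been copied to PUBKEY_FILE (a hypothetical path).
BACKUP_USER="backupper"
PUBKEY_FILE="${PUBKEY_FILE:-/root/bck_key.pub}"
SSH_DIR="/home/$BACKUP_USER/.ssh"

run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Create the account with a home directory and password login disabled
run pw useradd -n "$BACKUP_USER" -m -s /bin/sh -w no -c "zfs-autobackup pull user"

# Install Bck's public key as the only way in
run install -d -o "$BACKUP_USER" -g "$BACKUP_USER" -m 700 "$SSH_DIR"
run install -o "$BACKUP_USER" -g "$BACKUP_USER" -m 600 "$PUBKEY_FILE" "$SSH_DIR/authorized_keys"

# Once "zfs allow -u backupper ..." has been issued, print the delegated
# permissions on zroot to double check them
run zfs allow zroot
```

The key pair itself would be generated on Bck (for example with ssh-keygen) by the user that runs zfs-autobackup; only the public half ever needs to reach ProdA.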

Now, as root on ProdA:

zfs set autobackup:bck_server=true zroot

We’re setting a custom property, called “autobackup:bck_server”, that allows the zroot dataset (and its children) to be backed up by zfs-autobackup. zfs-autobackup will search for all datasets with that property set to “true” (also on different pools) and back them up. If there’s a specific dataset you don’t want to back up, just set the property to “false” on it. Or, if you don’t want to back up the entire zroot but, for example, only “zroot/bastille” (and children), just set autobackup:bck_server=true on that dataset instead.
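Before the first run it can be useful to check which datasets actually carry the property. The following is a small sketch, not part of zfs-autobackup: selected_datasets is a hypothetical helper that filters the tab-separated output of zfs get and keeps only the datasets whose value is “true”, as described above.

```shell
#!/bin/sh
# Sketch: list the datasets selected for the "bck_server" backup.
# selected_datasets is a hypothetical helper; it reads "name<TAB>value" lines
# on stdin and prints the names whose value is exactly "true".
selected_datasets() {
    awk -F '\t' '$2 == "true" { print $1 }'
}

# Only query ZFS where the zfs command exists (i.e. on ProdA).
# -H gives tab-separated output without headers, -r walks the children.
if command -v zfs >/dev/null 2>&1; then
    zfs get -H -r -t filesystem,volume -o name,value autobackup:bck_server zroot \
        | selected_datasets
fi
```

Children show up in the list because they inherit the property from zroot; a dataset explicitly set to “false” disappears from it.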

ssh config

zfs-autobackup connects via ssh and will try to connect as root, so we need to tell ssh to use backupper instead. Moreover, even after exchanging the ssh key, Bck will connect many times to ProdA to send its zfs commands (one connection per command). Setting up an ssh session takes a while, so all those connections add noticeable latency. The zfs-autobackup wiki suggests a way to (greatly) reduce it:

“You can make your ssh connections persistent and greatly speed up zfs-autobackup:

On the server that initiates the backup add this to your ~/.ssh/config:”

Host ProdA
    User backupper
    ControlPath ~/.ssh/control-master-%r@%h:%p
    ControlMaster auto
    ControlPersist 3600

(Taken from https://github.com/psy0rz/zfs_autobackup/wiki/Performance)

It's now time to go back to Bck and issue a command like this (one line):

/usr/local/bin/zfs-autobackup --ssh-source ProdA bck_server zroot/backups/ProdA --zfs-compressed --no-progress --verbose --buffer 32M --keep-source 0 --no-holds  --set-properties readonly=on --clear-refreservation --keep-target 1d1w,1w1m,1m6m  --destroy-missing 30d --clear-mountpoint

Bck will connect to ProdA, perform the snapshots and start transferring. The most interesting options I used here are:

 --keep-source 0 (only the last snapshot will be kept on ProdA)
 --set-properties readonly=on (keep the copy on Bck read-only, so nothing modifies it between runs and the next incremental backup can be applied on top of it)
 --keep-target 1d1w,1w1m,1m6m (keep one backup per day for one week, one per week for one month, one per month for six months)
 --destroy-missing 30d (if we've deleted a dataset, keep it for 30 days before removing it from Bck)
 --clear-mountpoint (do not mount the datasets on Bck, as mounting them will cause problems sooner or later)
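The --keep-target value is a comma-separated list of rules, each one made of a period followed by a retention time. To make the notation easier to read, here is a small illustrative sketch that expands such a schedule into plain English. describe_schedule and unit_name are hypothetical helpers, not part of zfs-autobackup, and they only understand rules with single-letter units (h, d, w, m, y).

```shell
#!/bin/sh
# Illustration only: describe_schedule is a hypothetical helper, not part of
# zfs-autobackup. It handles rules of the form <n><unit><n><unit> with
# single-letter units (h, d, w, m, y).

unit_name() {
    case "$1" in
        h) echo "hour(s)" ;;
        d) echo "day(s)" ;;
        w) echo "week(s)" ;;
        m) echo "month(s)" ;;
        y) echo "year(s)" ;;
        *) echo "?" ;;
    esac
}

describe_schedule() {
    # One rule per line, then split each rule into its four fields
    echo "$1" | tr ',' '\n' | while read -r rule; do
        fields=$(echo "$rule" | sed -E 's/^([0-9]+)([hdwmy])([0-9]+)([hdwmy])$/\1 \2 \3 \4/')
        if [ "$fields" = "$rule" ]; then
            # sed left the rule untouched, so it did not match the pattern
            echo "$rule: not a <period><ttl> rule"
            continue
        fi
        # Intentional word splitting: fields become $1..$4
        set -- $fields
        echo "$rule: one snapshot per $1 $(unit_name "$2"), kept for $3 $(unit_name "$4")"
    done
}

describe_schedule "1d1w,1w1m,1m6m"
```

For the schedule used here it prints one line per rule, e.g. "1m6m: one snapshot per 1 month(s), kept for 6 month(s)".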

The first copy will be slow as it'll need to send all the data. Subsequent runs will be quite fast, as only the differences since the previous snapshot will be transferred.

How to perform a disaster recovery

Ok, your dataset (or datasets) is gone. You need to replace it with the last external backup, so you have to transfer the copy back to ProdA (or to another FreeBSD host, it makes no difference). Connect to Bck and look for the snapshot you want to restore (zfs list -t snapshot will help). Once identified (one line):

zfs send -R zroot/backups/ProdA/zroot/bastille/bastille/jails/t1@bck_server-20220528005830 | mbuffer -4 -s 128k -m 32M | ssh root@ProdA "zfs receive -F -x canmount -x readonly zroot/bastille/bastille/jails/t1"

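Note the -x canmount -x readonly flags. Remember that the backup changed the canmount and readonly properties of the transferred datasets (via --clear-mountpoint and --set-properties readonly=on), so we exclude them from the receive to get the restored dataset back to a normal, mountable and writable state.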

Once finished, ProdA (or the other, restored host) will show t1 as an available jail and you'll be able to start it.
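Typing that pipeline by hand during an emergency is error-prone, so it may help to have it assembled from the snapshot name. The sketch below only prints the command, so it can be reviewed before being pasted into a shell on Bck. BACKUP_ROOT and restore_cmd are illustrative names, and it assumes the layout used in this post: backups stored under zroot/backups/ProdA, and snapshots named after the backup name followed by a timestamp.

```shell
#!/bin/sh
# Sketch: print (not run) the restore pipeline for one backed-up snapshot.
# BACKUP_ROOT and restore_cmd are illustrative names, not part of zfs-autobackup.
BACKUP_ROOT="zroot/backups/ProdA"

restore_cmd() {
    snap="$1"    # full snapshot name on Bck
    host="$2"    # host that will receive the dataset
    dataset="${snap%%@*}"                 # drop the @snapshot suffix
    target="${dataset#"$BACKUP_ROOT"/}"   # drop the backup prefix: original dataset name
    printf 'zfs send -R %s | mbuffer -4 -s 128k -m 32M | ssh root@%s "zfs receive -F -x canmount -x readonly %s"\n' \
        "$snap" "$host" "$target"
}

restore_cmd "zroot/backups/ProdA/zroot/bastille/bastille/jails/t1@bck_server-20220528005830" ProdA
```

Restoring to a different host only means changing the second argument; the target dataset name is derived from the snapshot path either way.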
