Please note: This article has been automatically translated and adapted. There could be some errors.
Backup: why
Everything can be lost. Let's keep that in mind. Whether by accident or by deliberate action, anything can disappear unexpectedly or be deleted by mistake. How many times have we lost an object and never found it again?
And while losing a physical object takes some effort, deleting a file or any other piece of information from a computer is far easier; sometimes we simply want or need to restore a previous version of a specific file.
Many people think it is enough to make the storage medium redundant and you are covered. WRONG: a RAID undoubtedly helps you avoid losing everything if a disk fails, but what happens when data is accidentally deleted, when it is compromised by a virus or some external attacker, or when the computer (RAID or not) is stolen, catches fire, or suffers any other unexpected failure?
I have collected experiences of all kinds over the years. Just a few examples:
- Flooded server rooms
- Servers destroyed by an earthquake, that is, by collapsing walls
- Ransomware of every kind, which lately strikes far more often than in the past
- Damage caused deliberately by someone with an interest in creating problems (e.g. IT companies that cause damage to create work for themselves. Yes, I have seen this too, and not just once, unfortunately. I'm dealing with a situation exactly like that right now).
- Mistakes made by the administrator (it can happen to anyone)
If we consider servers exposed to the Internet (e.g. e-commerce sites, e-mail servers, etc.), the situation becomes even more critical, because in addition to the integrity of the data it is also important to ensure the operational continuity of the service.
The best solution, therefore, is to always have backups available. But what characteristics should these backups have?
Backup: how to do it
There are many backup tools, each focused on a specific area: from very complete suites like Bacula and Amanda down to small tools built for one specific need.
All in all, it is not easy to find your ideal, perfect tool, open source or proprietary, so the first thing to do is ask yourself a series of questions:
"How much am I willing to risk? What do I want to preserve? How much down can I tolerate in case of data loss? How much and what kind of space do I have?"
The first question is the most delicate, and sometimes it is both a cause and a consequence of technical choices already made. Some people believe it is sufficient to keep a backup copy inside the very machine they want to "back up". The choice may be simple and practical, but what happens if the machine fails? The classic USB drive that stays plugged in, and onto which files are copied every day, is just as exposed to failure as the rest of the hardware. And no, don't tell me that the uninterruptible power supply guarantees there will be no major surges. I have seen UPSes costing thousands of euros burn out and take everything behind them with them, ruining the day (or the week, or the career) of the administrator who felt safe. If you want, you can go ahead and claim the damages from the insurance company: the money will probably arrive, but your precious data certainly will not.
The first step, then, is to always have a management plan: decide beforehand whether the scales should tip more towards safety or towards saving money.
The safest backup, in fact, is the one as far away as possible from the machine you want to secure.
But this creates two problems: the more data we want to keep safe, the greater the need for space and bandwidth. If we want to store the backups on a separate device, it must be connected (via some kind of network) to our main hardware and must be able to hold all the data we need to preserve. If all of this can be done on a LAN, there are no major problems; if we want to keep our backups off our own network, we also have to deal with connectivity. So we might decide to store less data, in order to have higher operational speed both when backing up and, especially, when recovering.
Safer, in fact, does not mean more practical. If I had a 7 Mbit/s connection and 30 GB of data in the backup, how long would it take to recover everything after a failure? Can I afford that much downtime? If we are talking about vacation photos from 2000, probably yes (unless you are strongly nostalgic), but if we are talking about important data that blocks a company's productivity, are we sure we can be so patient? And what if we are talking about a medical record?
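A quick back-of-the-envelope calculation makes the point: 30 GB is roughly 240,000 Mbit, and at 7 Mbit/s that means about 34,000 seconds of transfer, i.e. around nine and a half hours, assuming the line is fully dedicated to the restore and ignoring protocol overhead.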
Exactly for this reason, we have to work out the backup policy best suited to our needs, remembering that there is no "perfect" solution.
Backup the entire disk or individual files?
This is one of the first questions we need to ask ourselves. Both solutions have advantages and disadvantages; I will list the most important ones:
Entire disk (or storage)
Advantages
- Easy full recovery in case of data loss. Just restore the entire backup to the original disk, and everything will be back exactly as before.
- Often the solution integrated into virtualization systems (e.g. Proxmox), easy to manage both from the command line and from the web interface
- Still in the virtualization world, there are products (e.g. Veeam Backup) that also allow the recovery of individual files, giving you the best of both worlds.
Disadvantages
- On physical machines, you will practically always need to turn off the machine to make a backup of this type, interrupting its operation for the entire processing time.
- The space occupied can be quite large, since data we may not care about gets copied as well.
- The operation can be slow, since the disk has to be read bit by bit. Alternatively, with backup programs that analyze the file system to optimize the time, the procedure may fail if the file system layout is not standard (e.g. I had a customer with a disk formatted directly, without a partition table: everything works, but because of that configuration choice Veeam is unable to perform a reliable backup, let alone a recovery).
In many cases this may still be the best solution, or it can serve as a starting point: one complete backup first, followed by lighter ones. One of the tools I use in these cases is the excellent Clonezilla.
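For virtual machines on a hypervisor such as Proxmox, mentioned above, a full backup of this kind can also be driven from the command line with vzdump. A minimal sketch, assuming a VM with ID 100 and a storage named "backup" (both hypothetical; check the vzdump man page for the options available in your version):

```bash
# Full backup of VM 100 while it keeps running, using Proxmox's snapshot mode.
# "backup" is a hypothetical storage defined in /etc/pve/storage.cfg.
vzdump 100 --mode snapshot --storage backup --compress zstd

# The same job can cover every VM on the node:
# vzdump --all --mode snapshot --storage backup --compress zstd
```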
Single files
Here things get more complicated. In theory this would seem the simplest and most convenient solution, but that is not always the case.
Advantages
- Also possible with basic system utilities (tar, cp, rsync, etc.), as shown in the sketch after this list.
- Greater granularity: individual files can be backed up and compared against previous backups.
- The possibility of backing up only deltas, copying just the modified parts of files, which reduces both the storage space and the amount of data to transfer.
- Portability: files can be moved individually from one medium to another.
- Easy partial recovery: You can choose what to restore and where.
- Possibility of compression/deduplication at the file or block level.
- Ability to back up and restore without shutting down the machine.
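As a taste of what the basic utilities mentioned above can already do, here is a minimal sketch of an incremental file-level backup with GNU tar; the paths are hypothetical:

```bash
# Level-0 (full) backup of /home; the snapshot file records what has been copied.
tar --listed-incremental=/var/backups/home.snar \
    -czf /var/backups/home-full.tar.gz /home

# Later runs against the same snapshot file pick up only new or changed files.
tar --listed-incremental=/var/backups/home.snar \
    -czf /var/backups/home-$(date +%F).tar.gz /home
```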
Disadvantages
- The simplest solutions may require a lot of storage space.
- For an efficient full backup, you should take a snapshot of the file system (VSS, in Microsoft terminology) before starting to copy; see the sketch after this list.
- There can be pitfalls, and they may remain hidden until the day you actually need the backup.
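On Linux, an LVM snapshot plays roughly the same role as VSS on Windows: it freezes a point-in-time view of the volume, which can then be copied calmly while the system keeps running. A minimal sketch, assuming a volume group vg0 with a logical volume data (hypothetical names):

```bash
# Create a temporary snapshot of the volume (5 GB reserved for changes made meanwhile).
lvcreate --snapshot --name data_snap --size 5G /dev/vg0/data

# Mount the frozen view read-only and copy from it instead of from the live data.
mkdir -p /mnt/data_snap
mount -o ro /dev/vg0/data_snap /mnt/data_snap
rsync -a /mnt/data_snap/ /backup/data/

# Clean up once the copy is done.
umount /mnt/data_snap
lvremove -f /dev/vg0/data_snap
```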
Backup: how do I do it
Generally speaking, I tend to use both solutions, i.e. backups of complete machines and backups of individual files, depending on needs and situations. My choices over the years have been quite consistent: I believe that having maximum granularity in backups is the best choice, partly because in many situations I have had to recover a handful of files, or a series of e-mails, mistakenly deleted by some distracted customer.
Specifically, I believe that a good backup should have some basic features:
- Near-instant recovery capability, and sufficiently high processing speed.
- It must be external to the machine being protected.
- Security: no, I would never place a backup on Dropbox, Google Drive or the like.
- Efficient space management.
- Compression and deduplication, ideally done off-line or at least reasonably fast.
- It must be as minimally invasive as possible and should not require the installation of too many components.
There are various schools of thought: some say the machine being backed up should have direct access to the backup server, others say it is the backup server that should contact the systems to be secured. Both approaches have their advantages and disadvantages, but in my case I prefer the server to connect to the clients, for two very specific reasons: 1) in my opinion it is easier to keep a single server "hidden" and secure than to leave access ports open to it from all the clients, and 2) this way I can schedule the backups according to a precise logic (e.g. when the first one finishes, move on to the second). Otherwise it becomes more difficult, and there is a risk that too many backups will overlap, saturating the machine's resources.
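Just to make the second point concrete, a pull-based backup server can serialize its jobs with something as simple as a loop; a minimal sketch with hypothetical host names, using rsync over ssh:

```bash
#!/bin/bash
# Runs on the backup server: it pulls each client in turn, so jobs never overlap.
CLIENTS="web01 db01 mail01"   # hypothetical host names
DEST=/backup                  # local backup storage

for host in $CLIENTS; do
    # Each backup starts only when the previous one has finished.
    mkdir -p "${DEST}/${host}"
    rsync -a --delete "root@${host}:/etc" "${DEST}/${host}/"
done
```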
Pure rsync
Historically I used (and for some servers still use) a script of my own based on rsync and hard links. Basically, each backup starts from the previous one: if a file has not changed, a hard link is simply created and no additional space is used; if it has changed, only the difference is transferred (thanks to rsync), but a new file (with the same name) is written to the file system. And so on, day after day, for all servers.
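My script is more elaborate than this, but the core idea can be sketched in a few lines using rsync's --link-dest option (host and paths here are hypothetical):

```bash
#!/bin/bash
# Hard-link-based daily backups: unchanged files become hard links to yesterday's copy.
HOST=web01                      # hypothetical client
DEST=/backup/${HOST}
TODAY=$(date +%F)

mkdir -p "${DEST}/${TODAY}"

# Files identical to those in "latest" are hard-linked instead of being copied again.
rsync -a --delete \
      --link-dest="${DEST}/latest" \
      "root@${HOST}:/etc" "root@${HOST}:/home" \
      "${DEST}/${TODAY}/"

# Point "latest" at the backup just taken, ready for tomorrow's run.
ln -sfn "${TODAY}" "${DEST}/latest"
```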
Advantages
- I always have a complete and immediately usable copy of the files, so the backup is "ready to use": restorable at any time, onto any medium.
- The space used is not the sum of all the backups, but the sum of the first backup plus the size of the modified files. Warning: the full size of those files, NOT just their differences.
- Easy to make
- Requires nothing more than rsync and access to the machine (normally, ssh)
Disadvantages
- If no snapshot system is in place, the files are copied on the fly. For a database under heavy use this makes the backup useless, as the restore will produce inconsistent files that do not work (see the sketch after this list).
- Space inefficient: unless you use a file system with integrated deduplication (e.g. ZFS), any minor change to a file will require storage space equal to the size of the whole file. E.g. if I add one line to a 10 GB database, the next backup will occupy 10 GB more than the previous one, since an entire copy of the database file has to be saved.
- Unless you use a compressed file system, all files are stored uncompressed.
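A common workaround for the database problem mentioned above (apart from file system snapshots) is to dump the database to a file before the file-level backup runs, so that what gets copied is a consistent export. A minimal sketch with a hypothetical PostgreSQL database called shopdb:

```bash
# Produce a consistent, compressed dump that the file-level backup can then pick up.
pg_dump --format=custom --file=/var/backups/shopdb-$(date +%F).dump shopdb

# MySQL/MariaDB equivalent:
# mysqldump --single-transaction shopdb > /var/backups/shopdb-$(date +%F).sql
```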
This is therefore a very good technique if you need to back up a few machines, or machines that are not huge, and if you do not need to keep a long history.
My current favorite choices: Borg, Restic and BURP Backup
When I got to the point of having hundreds of servers (plus my own PCs) to back up every day, some of them even several times a day, I necessarily had to find a more complete and "professional" alternative.
I have already written about Borg and Restic (and I suggest reading that article, as it also contains good hints on how to snapshot a live file system). Yet I am also still using BURP Backup: after more than five years, it has proved to be reliable.
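For reference, the day-to-day use of both tools boils down to a couple of commands; a minimal sketch with hypothetical repository paths (encryption passphrases and retention policies are left out for brevity):

```bash
# Borg: initialize a repository once, then create a dated archive on every run.
borg init --encryption=repokey /backup/borg-repo
borg create --stats --compression lz4 /backup/borg-repo::'{hostname}-{now}' /etc /home

# Restic: the same idea, deduplicated snapshots stored in a repository.
restic -r /backup/restic-repo init
restic -r /backup/restic-repo backup /etc /home
```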
For about a year I used the excellent storeBackup, which I integrated with some scripts of my own to centralize the backups, but I discarded it because it requires backups to be started from the client itself, not from the server. It is very efficient, however, and a solution I can recommend.
I then ran some tests with other products (Obnam (now retired), Attic, etc.), but for various reasons they were discarded, either for the same reason I abandoned storeBackup or for performance reasons. I also started using BackupPC, a solution I still prefer when I want to give a customer a turnkey backup system: convenient to use via the web, and easy to forget about, because in case of problems the customer will contact us and tell us. BackupPC is an excellent solution, and I have been using it for more than 10 years, but it has the drawback of requiring a "full" backup followed by a series of incrementals, so every now and then it has to take another complete full backup. The result is that it can be heavy on the network, or even overload it, so I decided to stop using it for full or remote servers.
BURP, an excellent integrated system, met almost all my expectations, satisfying the requirements I listed above:
- There is a server, which coordinates, and there are clients. The communication keys are generated at the first contact, on the basis of a password, and are then kept.
- Clients can contact the server at any time, but it is the server that decides if and when the backup can take place, and how.
- The software is small and lightweight, and I was able to install it without problems on all my servers and their operating systems, including embedded ones.
- It has an intelligent transfer system: using the same library as rsync (librsync), it copies only the differences within individual files (not whole files). Unlike rsync, however, it is able to store only the "deltas", i.e. the differences, between generations. To return to the earlier example: if you add a line to a 10 GB database, the space occupied by the next backup will be roughly that of the line.
- "Off-line" backup optimization: when the client connects, it sends its file list. The server compares it with what it already has and asks the client to transfer only the differences or the new files. At the end, once the connection is closed, the server optimizes everything and generates all the necessary structures, links, etc. The client, at that point, has already finished its work.
- All data can be compressed and deduplicated: in version 1.x through an external utility shipped with the software (which, in my case, runs once a week); from version 2.x (in development, not yet stable) this will be done automatically at the end of each backup.
- A convenient ncurses interface (and a web one, although I use it less) to keep an eye on everything.
Since I installed BURP, my backups have become very fast and light, with very good granularity. The system manages itself, adding a client is very easy, and I receive convenient e-mails that keep me up to date on the situation.
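To give an idea of how little is involved in adding a client, here is a minimal sketch; the client name, server address and password are hypothetical, and a real setup needs a few more options (TLS/CA settings, include lists), for which the BURP documentation is the reference:

```bash
# On the server: declare a new client (hypothetical name "web01") and its password.
cat > /etc/burp/clientconfdir/web01 <<'EOF'
password = a_long_random_password
EOF

# On the client: the essential lines in /etc/burp/burp.conf
# (the packaged default file contains further required settings):
#   server = backup.example.com
#   cname = web01
#   password = a_long_random_password

# The client then only needs a frequent cron entry; the server decides
# if and when a backup is actually allowed to run:
# */20 * * * * /usr/sbin/burp -a t
```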
Windows support is also excellent: the client creates the VSS snapshots of the drives by itself and, if you want, auto-updates the (Windows) application. Thanks to the guides available on the project site, you can recover an entire Windows installation from a backup and get it to boot again, which is not a given when you decide to back up individual files.
To sum up, there is no such thing as THE perfect backup system, but BURP (along with Borg and Restic) has been, at least for now and for years already, undoubtedly among my favorite choices. It is important to always remember one general rule: better one backup too many than one too few.