Servers, both physical and virtual, can sometimes become unresponsive, particularly when they are overloaded, leaving you unable to connect to them. In such cases, a watchdog can help.
A watchdog device, assisted by a watchdog application (a daemon), monitors the server to ensure it is active and healthy. Every 30 seconds (this interval can be adjusted), the daemon checks whether everything is functioning correctly. If a check fails, the watchdog device can take a predefined action. I usually have the device perform a hard reboot of the server so that it becomes reachable again.
Proxmox allows the installation and configuration of a watchdog device, enabling you to specify what actions to take when problems arise.
The easiest way to enable it is as follows: on the Proxmox server, navigate to /etc/pve/qemu-server/ (if no cluster has been configured) and edit the VM config file.
Add a watchdog device by appending this line to the VM definition:
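The line itself is missing from the text above. A plausible form, assuming the i6300esb model mentioned later in this post and a reset action matching the "hard reset" described next, would be:

```
watchdog: model=i6300esb,action=reset
```

Treat both values as assumptions and adapt them to your setup; for example, a different action can be chosen if a hard reset is too aggressive.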
This instructs Proxmox to perform a hard reset of the VM if it becomes unresponsive. Then shut the VM down completely and start it again.
This step is necessary because the watchdog device is only created at the next “start” of the VM; a simple reboot from inside the guest will not suffice.
The next step is to install and configure the watchdog daemon inside the VM. Be cautious, as some GNU/Linux distributions (e.g., Ubuntu) may blacklist the watchdog kernel module. To find out, check /etc/modprobe.d/blacklist-watchdog.conf (if it exists). In my case, I removed i6300esb from the blacklist and added it to /etc/modules so that it would load at boot.
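The blacklist change can be sketched as a small shell helper. This is only an illustration: the function name enable_watchdog_module is hypothetical, it assumes GNU sed (for -i), and it assumes the blacklist entry has the usual "blacklist i6300esb" form. It comments out the entry instead of deleting it, and it adds the module to the modules file only if it is not already listed.

```shell
#!/bin/sh
# Hypothetical helper: un-blacklist i6300esb and register it for loading at boot.
# $1 = blacklist file (e.g. /etc/modprobe.d/blacklist-watchdog.conf)
# $2 = modules file   (e.g. /etc/modules)
enable_watchdog_module() {
    blacklist_file="$1"
    modules_file="$2"
    module="i6300esb"

    # Comment out the blacklist entry, if the file exists (GNU sed assumed).
    if [ -f "$blacklist_file" ]; then
        sed -i "s/^[[:space:]]*blacklist[[:space:]]\{1,\}$module[[:space:]]*\$/# &/" "$blacklist_file"
    fi

    # Append the module name only if it is not already listed on its own line.
    grep -qx "$module" "$modules_file" 2>/dev/null || echo "$module" >> "$modules_file"
}
```

Run as root inside the VM, it would be called as enable_watchdog_module /etc/modprobe.d/blacklist-watchdog.conf /etc/modules, followed by modprobe i6300esb to load the module without waiting for a reboot.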
After installing the daemon, configure it as desired.
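As a rough starting point, and assuming the daemon is the common "watchdog" package that reads /etc/watchdog.conf (the post does not name the package), a minimal configuration might look like the sketch below. The load threshold is an illustrative value, not a recommendation.

```
# /etc/watchdog.conf (sketch; values are assumptions)
# Device node exposed by the watchdog driver inside the VM
watchdog-device = /dev/watchdog
# Treat a 1-minute load average above this value as a failure
max-load-1 = 24
```

The daemon must be restarted after the file is changed for the new settings to take effect.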
To test the entire setup, you can intentionally crash the kernel by running the following command as root inside the VM:
echo c > /proc/sysrq-trigger
After waiting for a few seconds, the VM should automatically restart.