
Proxmox - Enable and use Watchdog to reboot stuck servers

Sometimes servers, both physical and virtual, become unresponsive, and you may be unable to connect to them, especially when they are overloaded. A watchdog can be a solution.

A watchdog device, together with a watchdog daemon, checks that the server is alive and healthy. At a regular interval (every 30 seconds, but this can be changed), the daemon checks whether everything is fine and, if it is, refreshes the watchdog device. If the device is not refreshed in time (for example, because the system has hung), it performs the configured action. In my setup, I have the device perform a hard reboot of the server to bring it back to a working state.

Proxmox lets you add a watchdog device to a VM and configure what it should do when things go wrong.

The easiest way to enable it is to edit the VM's configuration file on the Proxmox server. If no cluster has been configured, the file is in /etc/pve/qemu-server/ and is named after the VM ID (for example, 100.conf).

Add a watchdog device by adding this line to the VM definition (above any [snapshot] sections, if the VM has snapshots):

watchdog: model=i6300esb,action=reset

This tells Proxmox to emulate an Intel 6300ESB watchdog and to perform a hard reset of the VM when the watchdog fires, i.e. when the VM is stuck.
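The same option can also be set from the host shell with `qm set <vmid> --watchdog model=i6300esb,action=reset`, letting Proxmox write the file for you. If you would rather script the file edit itself, here is a minimal sketch; the function name `add_watchdog_line` is hypothetical. It assumes the usual layout where snapshot sections start with a `[name]` header, adds the line only if the live config (the part before the first header) has no watchdog yet, and places it above the first header so it is not read as part of a snapshot.

```shell
#!/bin/sh
# Hypothetical helper: add a watchdog line to a Proxmox VM config file
# (e.g. /etc/pve/qemu-server/100.conf) unless the VM already has one.
add_watchdog_line() {
    conf="$1"
    line="${2:-watchdog: model=i6300esb,action=reset}"

    # Only the part before the first "[section]" header is the live config;
    # awk exits 0 if a watchdog line is already present there.
    if awk '/^\[/ { exit } /^watchdog:/ { found = 1 } END { exit !found }' "$conf"; then
        return 0
    fi

    # Build the new content in a scratch file, inserting the line just before
    # the first section header (or at the end if there is none), then copy it
    # back over the original file.
    tmp=$(mktemp) || return 1
    awk -v wd="$line" '
        !done && /^\[/ { print wd; done = 1 }
        { print }
        END { if (!done) print wd }
    ' "$conf" > "$tmp" && cat "$tmp" > "$conf"
    status=$?
    rm -f "$tmp"
    return "$status"
}
```

Usage would be `add_watchdog_line /etc/pve/qemu-server/100.conf`, where 100 is just an example VM ID.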

Shut down the VM and start it again. This is necessary because the watchdog device is only created when the VM is started; a reboot won't be enough.
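On the Proxmox host, the stop/start cycle can be done with `qm shutdown` and `qm start`. The sketch below (the name `apply_watchdog` is hypothetical) first checks with `qm config` that the VM actually has a watchdog line, so you don't restart a VM for nothing.

```shell
#!/bin/sh
# Hypothetical helper, run on the Proxmox host: apply a newly added watchdog
# by fully shutting down and starting the VM (a reboot is not enough).
apply_watchdog() {
    vmid="$1"

    # Refuse to restart a VM whose config has no watchdog line.
    if ! qm config "$vmid" | grep -q '^watchdog:'; then
        echo "VM $vmid has no watchdog configured" >&2
        return 1
    fi

    # Clean guest shutdown, then a fresh start so the device gets created.
    qm shutdown "$vmid" && qm start "$vmid"
}
```

For example, `apply_watchdog 100` for a VM with ID 100.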

The next step is to install and configure the watchdog daemon inside the VM. Be careful: some GNU/Linux distributions (Ubuntu, for example) blacklist watchdog kernel modules, so check /etc/modprobe.d/blacklist-watchdog.conf (if present). In my case, I removed i6300esb from the blacklist and added it to /etc/modules so that it gets loaded at boot.
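The blacklist change and the /etc/modules entry can be scripted as below. This is a sketch assuming the Debian/Ubuntu file layout mentioned above; `enable_watchdog_module` is a hypothetical name, and it comments the blacklist line out rather than deleting it, so the change is easy to spot and revert.

```shell
#!/bin/sh
# Hypothetical helper: un-blacklist a watchdog module and list it in the
# modules file so it is loaded at boot (Debian/Ubuntu layout assumed).
enable_watchdog_module() {
    module="${1:-i6300esb}"
    blacklist="${2:-/etc/modprobe.d/blacklist-watchdog.conf}"
    modules="${3:-/etc/modules}"

    # Comment out a "blacklist <module>" line, if the blacklist file exists.
    if [ -f "$blacklist" ]; then
        sed -i "s/^blacklist[[:space:]]\{1,\}$module[[:space:]]*\$/# &/" "$blacklist" || return 1
    fi

    # Add the module to the modules file unless it is already listed.
    if ! grep -qxF "$module" "$modules" 2>/dev/null; then
        echo "$module" >> "$modules" || return 1
    fi
}
```

Run it as root; `modprobe i6300esb` then loads the module right away instead of waiting for the next boot.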

After installing the daemon, configure it to suit your needs (on Debian and Ubuntu, its configuration file is /etc/watchdog.conf).
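On Debian and Ubuntu the daemon comes from the `watchdog` package (`apt install watchdog`). A minimal configuration could look like the one written by the sketch below. The option names come from watchdog.conf(5), but the values are only examples and `write_watchdog_conf` is a hypothetical helper. The load check is optional: drop it if a busy VM should never be treated as stuck.

```shell
#!/bin/sh
# Hypothetical helper: write a minimal watchdog daemon configuration.
# Values are examples only; adjust them to your needs. Overwrites the target.
write_watchdog_conf() {
    target="${1:-/etc/watchdog.conf}"
    cat > "$target" <<'EOF'
# Device node exposed once the i6300esb driver is loaded
watchdog-device = /dev/watchdog
# Seconds between two writes to the device (example value)
interval = 10
# Example health check: a 1-minute load average above 24 counts as a failure
max-load-1 = 24
EOF
}
```

Since it overwrites the target, try it on a scratch file first. After that, `systemctl enable --now watchdog` starts the daemon and enables it at boot.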

If you want to test the whole setup, crash the kernel (as root) this way:

echo c > /proc/sysrq-trigger

and wait. After a few seconds, the VM should restart.
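Before running the crash test, it may be worth checking from inside the VM that the device exists and the daemon is running; otherwise the test just leaves you with a hung VM. The sketch below assumes the daemon's process is named `watchdog` (as with the Debian/Ubuntu package) and that `pgrep` is available; `watchdog_ready` is a hypothetical name.

```shell
#!/bin/sh
# Hypothetical pre-flight check before the crash test.
watchdog_ready() {
    dev="${1:-/dev/watchdog}"

    # The driver must be loaded, otherwise the device node does not exist.
    if [ ! -c "$dev" ]; then
        echo "no watchdog device at $dev" >&2
        return 1
    fi

    # Linux watchdog drivers normally arm the timer only when a process opens
    # the device, so the daemon has to be running for the VM to be reset.
    if ! pgrep -x watchdog >/dev/null; then
        echo "watchdog daemon is not running" >&2
        return 1
    fi
}
```

Chained with the test, this becomes `watchdog_ready && echo c > /proc/sysrq-trigger`, so the kernel is only crashed when the watchdog is actually in place.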

Stefano Marinelli