Healthchecks for the Network

Over the Christmas holidays, we spent a couple of weeks abroad with family. Even while away, I like to have access to my network, both for the self-hosted services I run and so I can play around with new ideas during the downtime.

Unfortunately, about halfway through our holiday, I lost access. I couldn’t tell whether this was because my DNS Updater script had failed to set the correct dynamic WAN IP on my DNS record, the router had locked up, or my servers were down. I was relatively sure that they couldn’t all be down, as I lost access to the VPN as well as SSH and they run on separate Raspberry Pis, but without visibility into the network, I couldn’t tell [1].

So I resolved to put some sort of heartbeat in place, so that I get a signal every few minutes while a server is still alive. I found a service that did exactly what I wanted. You set up a check with a unique ID and specify both a timeout and a grace period. Then you just ping a specific URL containing the unique ID and the server is marked as up. If no ping arrives within the timeout, the service waits out the grace period and marks the server as down if there is still no signal.
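
To make the timeout and grace period concrete, here is a minimal sketch of how a check's state could be derived from the time of its last ping. The function name `check_state` and the intermediate `late` state are assumptions made for illustration; the service's actual logic is not shown in this post.

```shell
#!/bin/bash
# Hypothetical helper: classify a check from the time of its last ping.
# Arguments: last_ping now timeout grace (all in seconds).
check_state() {
  local last_ping=$1 now=$2 timeout=$3 grace=$4
  local elapsed=$(( now - last_ping ))
  if [ "$elapsed" -le "$timeout" ]; then
    echo "up"      # a ping arrived within the timeout
  elif [ "$elapsed" -le $(( timeout + grace )) ]; then
    echo "late"    # inside the grace period, not yet reported as down
  else
    echo "down"    # timeout and grace period have both passed
  fi
}

# A check with a 5 minute timeout and a 2 minute grace period,
# last pinged at second 0 and evaluated at three later moments:
check_state 0 200 300 120
check_state 0 400 300 120
check_state 0 500 300 120
```

With a job that pings every five minutes, a grace period like this absorbs a single slow or slightly delayed run before anything is reported as down.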

I of course set all this up with my Ansible playbooks and it couldn’t have been easier. I added a new variable called healthcheck_id for each host in its host_vars/hostname.yml file and then set up the following tasks:

  - name: Ensure healthcheck script exists
    template:
      src: ''
      dest: '/home/ansible/'
      mode: '0755'

  - name: Ensure healthcheck job is added to cron
    cron:
      name: "healthcheck"
      minute: "*/5"
      job: "/home/ansible/ > /dev/null"

The script is very straightforward, but I always prefer calling a simple script to putting a bare command in the crontab. The script simply does a curl on the URL with the relevant ID:

#! /bin/bash

curl -fsS --retry 3{{ healthcheck_id }} > /dev/null
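
If the script ever needs to grow, one option is to keep the URL construction separate from the request so it can be checked without touching the network. The sketch below assumes a placeholder base URL (`https://healthcheck.example.invalid/ping`) and hypothetical function names `ping_url` and `send_heartbeat`; none of these come from the actual setup, and `--max-time` is an addition not present in the script above.

```shell
#!/bin/bash
# Hypothetical base URL; the real ping endpoint is not shown in this post.
HEALTHCHECK_BASE_URL="${HEALTHCHECK_BASE_URL:-https://healthcheck.example.invalid/ping}"

# Build the ping URL for a given check ID, dropping one trailing slash
# from the base URL so the result never contains "//" before the ID.
ping_url() {
  local id=$1
  printf '%s/%s\n' "${HEALTHCHECK_BASE_URL%/}" "$id"
}

# Send one heartbeat. -f fails on HTTP errors, -sS stays quiet but still
# prints errors, --retry 3 retries transient failures, and --max-time 10
# caps each attempt at ten seconds.
send_heartbeat() {
  curl -fsS --retry 3 --max-time 10 "$(ping_url "$1")" > /dev/null
}

# Offline demonstration: print the URL that would be pinged for a dummy ID.
ping_url "00000000-0000-0000-0000-000000000000"
```

In a templated version, the dummy ID would be replaced by the `{{ healthcheck_id }}` variable and the last line by a call to `send_heartbeat`.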

I then integrated this with my [Pushbullet][pushbullet] setup so I get notified as soon as a server is down and when it comes back up again. Additionally, I get these badges, which I can show in my internal dashboards and on this page!

Badge: Shield

Hopefully ☝️ says up!!!

  1. Yeah, it was the router that needed a restart.