So you have your own Linux router, and two separate internet connections, and you’d like to have your router switch to the failover one when the main one is acting up.

Good news, you’re in the right place :)

In this guide we’ll go through what needs to be done to have your box automatically switch to the failover interface and back. We’ll also talk about how you could send updates / notifications to your system of choice.

Let’s take a look at a summary of what we need to get done:

  • decide which interface is going to be our main one, and which one is going to be the failover
  • adjust the route metric of our secondary interface
  • add iptables rules for our secondary interface
  • create and customize our failover script
  • add a systemd service that makes sure our script runs all the time
  • test!

Grab a drink and let’s get going!

Figure out our interfaces

In my case, enp1s0 is my main interface, and enp3s0 is my LTE failover interface.

Adjust the failover route metric

When several default routes exist, Linux uses the one with the lowest metric.

We can leverage that to dynamically configure which route to use at any given time.

To make sure that our main interface is used by default, we set the metric of our main route to something low, and the metric of our failover route to something really high.

I use dhcpcd to set up my interfaces. To change my metric, I added this to dhcpcd.conf:

interface enp1s0
metric 100

interface enp3s0
metric 99999
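
To double-check which default route wins after this change, a small helper along these lines might be useful. It is only a sketch: it assumes the usual "ip route list" output format, and active_default_dev is a made-up name, not something that ships with dhcpcd or iproute2.

```shell
# Print the device of the default route with the lowest metric.
# Reads `ip route list` style output on stdin. A default route printed
# without an explicit metric is treated as metric 0.
# (active_default_dev is a hypothetical helper name.)
active_default_dev() {
  awk '
    $1 == "default" {
      dev = ""; metric = 0
      for (i = 2; i < NF; i++) {
        if ($i == "dev")    dev = $(i + 1)
        if ($i == "metric") metric = $(i + 1)
      }
      if (best == "" || metric + 0 < best_metric) {
        best = dev
        best_metric = metric + 0
      }
    }
    END { if (best != "") print best }
  '
}
```

On the router itself you would run: ip route list | active_default_dev. With the metrics above it should print the main interface as long as the main route is present.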

iptables

Hopefully if you’re adding failover support to an existing router, you already have some rules in place. To make sure traffic leaving through the failover interface gets NATed correctly, I also added a masquerade rule for it:

iptables -t nat -A POSTROUTING -o enp3s0 -j MASQUERADE

Create and customize our failover script

This is what you’ve been waiting for!

Copy this script to /usr/local/bin/failover.sh, make it executable (chmod +x /usr/local/bin/failover.sh), and customize the variables!

Read the comments for more details:

#!/bin/bash
# failover.sh
# v2.1 2022-01-26
# * fixed hardcoded interface and incorrect 0/1 values
# v2.0 2022-01-16
# Alex Alexander <alex.alexander@gmail.com>

# Your main internet interface
IF_MAIN="enp1s0"
# The interface you want to enable if IF_MAIN is not working
IF_FAILOVER="enp3s0"

# the metric to set the FAILOVER to when disabled
METRIC_FAILOVER_OFF="99999"
# the metric to set the FAILOVER to when ENABLED
METRIC_FAILOVER_ACTIVE="10"

# this number of pings has to fail for us to change state
FAILOVER_PING_THRESHOLD=2

# the hosts we ping to figure out if internet is alive.
# order matters: we alternate between two separate providers, so that one
# provider having trouble on its end is not mistaken for our link being down
HOSTS_TO_PING=(
  "1.1.1.1"
  "8.8.8.8"
  "1.0.0.1"
  "8.8.4.4"
)

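# how long to wait (in seconds) for a reply when testing the main interface
PING_WAIT_MAIN=2
# how long to wait (in seconds) for a reply when testing the failover interface
PING_WAIT_FAILOVER=5
# how many pings to send to each host
PING_LOOPS=1

# how often (in seconds) we check the main interface
CHECK_MAIN_INTERVAL=10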

# check whether IF_FAILOVER is working every X seconds
CHECK_FAILOVER_INTERVAL=600
# also check on start
CHECK_FAILOVER_COUNTER=${CHECK_FAILOVER_INTERVAL}
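# my failover interface is a little unstable, so each failover check gets
# this many retries before we report the interface as unavailable
CHECK_FAILOVER_THRESHOLD=1
# retry counters for the failover route check and the failover ping check
CHECK_FAILOVER_ROUTE=0
CHECK_FAILOVER_PING=0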

DEBUG=false
if [[ "$1" == "-d" ]]; then
  DEBUG=true
fi

# did we fail because the route was missing?
FAILOVER_DUE_TO_MISSING_ROUTE=false

LAST_STATE=
FAILOVER=false
PINGS_FAILED=0
PINGS_PASSED=0

# We use this method to update some external service.
function update_ha() {
  echo "New State: ${@}"

  # To notify Home Assistant (or any other webhook), uncomment and adapt:
  # echo "Sending state to Home Assistant: ${@}"
  # curl --header "Content-Type: application/json" \
  #  --request POST -o /dev/null -s \
  #  --data "{\"state\": \"${@^}\"}" \
  #  http://<some-host>/api/webhook/failover-status >/dev/null

  LAST_STATE="${@}"
}

# This function knows how to check if pings work over an interface.
# It exports results to PINGS_PASSED and PINGS_FAILED
function check_pings() {
  IF_TYPE=${1} # MAIN, FAILOVER
  IF_NAME="IF_${IF_TYPE}"
  IF=${!IF_NAME}
  if [[ -z ${IF} ]]; then
    echo "[EEE] Could not deduce IF from ${IF_TYPE}"
    exit 1
  fi
  PING_WAIT_NAME="PING_WAIT_${IF_TYPE}"
  PING_WAIT=${!PING_WAIT_NAME}
  PINGS_FAILED=0
  PINGS_PASSED=0
  for ip in "${HOSTS_TO_PING[@]}"; do
    # send stdout to /dev/null first, then point stderr at it, so nothing leaks into the logs
    ping -c "${PING_LOOPS}" -W "${PING_WAIT}" -I "${IF}" "${ip}" >/dev/null 2>&1
    PING_RESULT=$?
    if [[ ${PING_RESULT} -eq 0 ]]; then
      PINGS_PASSED=$(( PINGS_PASSED + 1 ))
      PINGS_FAILED=0
      if [[ "${FAILOVER}" == true ]] || [[ "${DEBUG}" == true ]]; then
        echo "[I] (failover: ${FAILOVER}) CHECKING ${IF_TYPE} IF: Ping to ${ip}/${IF} succeeded!"
      fi
    else
      PINGS_PASSED=0
      PINGS_FAILED=$(( PINGS_FAILED + 1 ))
      echo "[E] (failover: ${FAILOVER}) CHECKING ${IF_TYPE} IF: Ping to ${ip}/${IF} FAILED"
    fi
    [[ ${PINGS_PASSED} -ge ${FAILOVER_PING_THRESHOLD} ]] && break
    [[ ${PINGS_FAILED} -ge ${FAILOVER_PING_THRESHOLD} ]] && break
  done
}

# Our main check function
function check() {
  # first, check if our main interface route even exists
  # if not, we can't really do anything, but we can update our state
  if ! ip route list | grep default | grep -q ${IF_MAIN}; then
    if [[ "${FAILOVER_DUE_TO_MISSING_ROUTE}" == false ]]; then
      echo "[E] Could not find route for main interface (${IF_MAIN})"
      FAILOVER_DUE_TO_MISSING_ROUTE=true
      update_ha "Active (no route)"
    fi
    return
  fi
  
  # then, check if our failover interface route even exists
  # we can't failover if there's no failover route ;)
  # this is cheap, so we do it every time
  if ! ip route list | grep default | grep -q ${IF_FAILOVER}; then
    if [[ ${CHECK_FAILOVER_ROUTE} -lt ${CHECK_FAILOVER_THRESHOLD} ]]; then
      echo "[W] Could not find route for failover interface, will retry (${IF_FAILOVER})"
      CHECK_FAILOVER_ROUTE=$(( CHECK_FAILOVER_ROUTE + 1 )) 
      return
    fi
    echo "[E] Could not find route for failover interface (${IF_FAILOVER})"
    update_ha "Unavailable (no route)"
    return
  fi

  CHECK_FAILOVER_ROUTE=0

  CHECK_FAILOVER_COUNTER=$(( CHECK_FAILOVER_COUNTER + CHECK_MAIN_INTERVAL ))
  CHECK_FAILOVER_WAS_DONE=false
  # every ~10m, send some pings over the failover interface to make sure it's
  # actually working. If it's not, we can't do much to fix it automatically,
  # but at least we can send out a notification to investigate, so we are not
  # surprised later!
  if [[ ${CHECK_FAILOVER_COUNTER} -ge ${CHECK_FAILOVER_INTERVAL} ]]; then
    echo "Verifying Failover Internet is reachable"

    check_pings FAILOVER

    if [[ ${PINGS_FAILED} -ge ${FAILOVER_PING_THRESHOLD} ]]; then
      if [[ ${CHECK_FAILOVER_PING} -lt ${CHECK_FAILOVER_THRESHOLD} ]]; then
        echo "[W] Failover interface check pings failed, will retry (${IF_FAILOVER})"
        CHECK_FAILOVER_PING=$(( CHECK_FAILOVER_PING + 1 )) 
        return
      fi
      update_ha "Unavailable (no ping)"
      return
    else
      CHECK_FAILOVER_COUNTER=0
    fi
    CHECK_FAILOVER_WAS_DONE=true
  fi

  CHECK_FAILOVER_PING=0

  STATE=

  METRIC=$(ip route list | grep "^default" | grep "${IF_FAILOVER}" | sed "s:.*metric \([0-9]*\).*:\1:")
  [[ ${METRIC} -eq ${METRIC_FAILOVER_OFF} ]] &&
    FAILOVER=false || FAILOVER=true
  if [[ "${FAILOVER}" == true ]]; then
    DEFAULT_GW=$(ip route list | grep "^default" | grep "${IF_MAIN}" | sed "s:.*via \([.0-9]*\).*:\1:")
    VIA="via ${DEFAULT_GW}"
  else
    VIA=""
  fi

  # we made it here, all routes seem to be present, let's check our main interface
  check_pings MAIN

  if [[ ${PINGS_FAILED} -lt ${FAILOVER_PING_THRESHOLD} ]]; then
    STATE="Ready"
    if [[ "${FAILOVER}" == true ]]; then
      echo "[CHANGE] Ping through main IF {$IF_MAIN} worked, RESTORING"
      # we need to re-write the route so it raises the metric back to its "off" value
      FAILOVER_GW=$(ip route list | grep "^default" | grep "${IF_FAILOVER}" | sed "s:.*via \([.0-9]*\).*:\1:")
      ip route del default via ${FAILOVER_GW}
      ip route add default via ${FAILOVER_GW} dev ${IF_FAILOVER} metric ${METRIC_FAILOVER_OFF}
      FAILOVER_DUE_TO_MISSING_ROUTE=false
    fi
    if [[ "${FAILOVER_DUE_TO_MISSING_ROUTE}" == true ]]; then
      echo "[CHANGE] Main IF ${IF_MAIN} route came back, RESTORING"
      FAILOVER_DUE_TO_MISSING_ROUTE=false
    fi
  else
    STATE="Active (no ping)"
    if [[ "${FAILOVER}" == true ]]; then
      [[ "${DEBUG}" == true ]] &&
        echo "(failover: true) Pings failed, but we've already failed over."
    else
      echo "[CHANGE] At least ${FAILOVER_PING_THRESHOLD} pings failed in a row, FAILING OVER"
      # we need to re-write the route so it lowers the metric
      FAILOVER_GW=$(ip route list | grep "^default" | grep "${IF_FAILOVER}" | sed "s:.*via \([.0-9]*\).*:\1:")
      ip route del default via ${FAILOVER_GW}
      ip route add default via ${FAILOVER_GW} dev ${IF_FAILOVER} metric ${METRIC_FAILOVER_ACTIVE}
    fi
  fi

  if [[ "${STATE}" != "${LAST_STATE}" ]] || [[ "${CHECK_FAILOVER_WAS_DONE}" == true ]]; then
     update_ha "${STATE}"
  fi
}

echo "Internet Failover Script"
echo "---"
echo "Main Interface: ${IF_MAIN}"
echo "-   Main Check: ${CHECK_MAIN_INTERVAL}s"
echo "Failover Interface: ${IF_FAILOVER}"
echo "-   Failover Check: ${CHECK_FAILOVER_INTERVAL}s"
echo "==="

while true; do
  check
  sleep ${CHECK_MAIN_INTERVAL}
done
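
The streak logic in check_pings can be a little hard to follow inside the full script, so here it is in isolation. This is only a sketch of the same rule (stop at the first run of consecutive passes or failures that reaches the threshold). ping_verdict is a hypothetical helper that takes pre-recorded results instead of actually pinging, which makes it easy to play with.

```shell
# Decide the outcome of a sequence of ping results the way check_pings does.
# Arguments: the threshold, then one "ok" or "fail" per host, in order.
# Prints "down" if the failure streak reached the threshold, "up" otherwise
# (an inconclusive sequence counts as "up", like in the script).
# Note: uses plain global variables to stay portable across shells.
ping_verdict() {
  threshold=$1; shift
  passed=0; failed=0
  for result in "$@"; do
    if [ "$result" = ok ]; then
      passed=$(( passed + 1 )); failed=0
    else
      failed=$(( failed + 1 )); passed=0
    fi
    # stop early once either streak is long enough
    [ "$passed" -ge "$threshold" ] && break
    [ "$failed" -ge "$threshold" ] && break
  done
  if [ "$failed" -ge "$threshold" ]; then echo down; else echo up; fi
}
```

For example, ping_verdict 2 ok fail fail reports the link as down, while ping_verdict 2 fail ok ok does not, because a single failure followed by passes never builds a streak of two.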

Whew :)
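
If you uncomment the curl call in update_ha, the state string ends up inside a JSON body. Here is a minimal sketch of building that body with quotes and backslashes escaped. It assumes the same {"state": ...} shape as the commented-out webhook call. json_state is a hypothetical helper, it does not capitalize the state like the original line does, and it does not handle control characters.

```shell
# Build the JSON body for the webhook from a state string.
# Escapes backslashes first, then double quotes, so the result stays valid
# JSON for states such as: Active (no ping)
# (json_state is a hypothetical helper name.)
json_state() {
  escaped=$(printf '%s' "$1" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"state": "%s"}\n' "$escaped"
}
```

The output could then be passed to curl with --data "$(json_state "${STATE}")" instead of the hand-escaped string.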

systemd service

We’re getting there! Next up we need to set up a systemd service, which makes running our script easier.

Create /etc/systemd/system/failover.service:

[Unit]
Description=Failover

[Service]
User=root
WorkingDirectory=/usr/local/bin
ExecStart=/usr/local/bin/failover.sh
Restart=always

[Install]
WantedBy=multi-user.target

Make sure to edit the working directory and script name if you saved the script somewhere else, then enable the service:

systemctl daemon-reload
systemctl enable failover
systemctl start failover

Check that things worked:

# systemctl status failover
● failover.service
     Loaded: loaded (/etc/systemd/system/failover.service; enabled; vendor preset: disabled)
     Active: active (running) since Sun 2022-01-16 18:29:19 PST; 9s ago
   Main PID: 275142 (failover.sh)
      Tasks: 2 (limit: 9467)
     Memory: 852.0K
        CPU: 168ms
     CGroup: /system.slice/failover.service
             ├─275142 /bin/bash /usr/local/bin/failover.sh
             └─275162 sleep 10

Jan 16 18:29:19 systemd[1]: Started Failover
Jan 16 18:29:19 failover.sh[275142]: Internet Failover Script
Jan 16 18:29:19 failover.sh[275142]: ---
Jan 16 18:29:19 failover.sh[275142]: Main Interface: enp1s0
Jan 16 18:29:19 failover.sh[275142]: -   Main Check: 10s
Jan 16 18:29:19 failover.sh[275142]: Failover Interface: enp3s0
Jan 16 18:29:19 failover.sh[275142]: -   Failover Check: 600s
Jan 16 18:29:19 failover.sh[275142]: ===
Jan 16 18:29:19 failover.sh[275142]: Verifying Failover Internet is reachable
Jan 16 18:29:19 failover.sh[275142]: Sending state to Home Assistant: Ready

Tests!

At this point you should be ready to test! The script should be testing your internet already, so hopefully you’re not seeing errors already :P

To make sure things are actually working, let’s simulate some failures.

Hint: run journalctl -t "failover.sh" -f to keep track of the failover logs.

  • The easiest tests are unplugging cables. If you can do this, I recommend it.
    • First, unplug your FAILOVER cable. This check runs every 10 seconds, so hopefully after ~20 seconds you should see logs mentioning the FAILOVER route is gone.
[E] Could not find route for failover interface (enp3s0)
Sending state to Home Assistant: Unavailable (no route)
  • Plug the cable back in, and a few seconds later you should see another log saying failover is ready.
Sending state to Home Assistant: Ready
  • Then, unplug your MAIN internet cable. Internet should switch to your failover and a log should show up.
  • Finally, plug the MAIN cable back in.
  • Internet might be failing even if the route is there. Let’s test that too.
    • To simulate the MAIN internet not working, run this on your router:
      • iptables -A OUTPUT -o enp1s0 -j DROP (replace enp1s0 with your main interface)
      • After a few seconds, you should see ping errors in the logs and the script should switch to FAILOVER
[E] (failover: false) CHECKING MAIN IF: Ping to 1.1.1.1/enp1s0 FAILED
[E] (failover: false) CHECKING MAIN IF: Ping to 8.8.8.8/enp1s0 FAILED
[CHANGE] At least 2 pings failed in a row, FAILING OVER
[I] Sending state to Home Assistant: Active (no ping)
  • To undo the above, run iptables -D OUTPUT -o enp1s0 -j DROP.
[I] (failover: true) CHECKING MAIN IF: Ping to 1.1.1.1/enp1s0 succeeded!
[I] (failover: true) CHECKING MAIN IF: Ping to 8.8.8.8/enp1s0 succeeded!
[CHANGE] Ping through main IF {enp1s0} worked, RESTORING
[I] Sending state to Home Assistant: Ready
  • To test the FAILOVER check, run
    • iptables -A OUTPUT -o enp3s0 -j DROP (replace enp3s0 with your failover interface)
    • FAILOVER checks run every 10 minutes, so either wait or run systemctl restart failover
[E] (failover: false) CHECKING FAILOVER IF: Ping to 1.1.1.1/enp3s0 FAILED
[E] (failover: false) CHECKING FAILOVER IF: Ping to 8.8.8.8/enp3s0 FAILED
Sending state to Home Assistant: Unavailable (no ping)
  • To restore:
    • iptables -D OUTPUT -o enp3s0 -j DROP
Verifying Failover Internet is reachable
Sending state to Home Assistant: Ready
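
Since the block and restore commands above differ only in -A versus -D, it is easy to mistype one and leave a DROP rule behind. One way to avoid that is a tiny helper that prints the command for review before you run it. This is just a sketch, and outage_cmd is a made-up name:

```shell
# Print (not run) the iptables command that blocks or unblocks outgoing
# traffic on an interface, using the same rules as the tests above.
# Usage: outage_cmd block|unblock <interface>
# (outage_cmd is a hypothetical helper name.)
outage_cmd() {
  case "${1:-}" in
    block)   action="-A" ;;
    unblock) action="-D" ;;
    *) echo "usage: outage_cmd block|unblock <interface>" >&2; return 1 ;;
  esac
  if [ -z "${2:-}" ]; then
    echo "missing interface" >&2
    return 1
  fi
  printf 'iptables %s OUTPUT -o %s -j DROP\n' "$action" "$2"
}
```

Once the printed command looks right, you can run it as root, for example with outage_cmd block enp1s0 | sh, and later undo it with outage_cmd unblock enp1s0 | sh.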

That’s all!

Hopefully everything’s working great and this post was helpful :) Feel free to comment if you have any questions. Cheers!