High Availability of Home IoT System

A smart home improves control over home appliances, and integrating different types of devices enables automation that makes life easier and cuts down on daily chores. For those benefits, I started converting our home infrastructure. Everyone at home was happy with the improvement until the system suffered a service degradation.

How can we make a smart-home system highly available?

First, we need to know what kind of control plane will be used, how clients access it, and what the base requirements are for running it successfully. These range from physical connectivity up to the application level and include DHCP, DNS, and NTP.
By default, each brand has its own mobile application and integrations with Google Home or Apple Home. To remove the dependency on the internet and keep traffic local, which reduces latency, I chose Home Assistant as the smart home control plane. It also supports a wide variety of devices.

[Figure: ha-home-control-plane]

Choosing the Physical Connectivity Method

For IoT, Bluetooth, Zigbee, and Wi-Fi are common protocols. I chose Wi-Fi for interactive smart devices because it does not require an additional device such as a Zigbee hub, and multiple clients (a phone, for example) can reach the same smart device directly. The only drawback is that battery-powered Wi-Fi devices drain quickly, but all of my home devices are wall-powered, so that is not a concern. For read-only sensors such as temperature monitors, Bluetooth Low Energy (BLE) is another cheap option, and most SBCs now support BLE.

To achieve high availability (HA), I placed two wireless access points (APs) at home. They are in different locations but close enough to cover each other's clients in case of failure. With this configuration, physical connectivity is active-active. If one AP fails, the clients connected to it migrate to the other one.

IoT devices such as the ESP8266 or ESP32 have limited CPU and memory. Because of those limits, their Wi-Fi stack has no advanced management daemon, so selecting the best AP or switching between two APs is not as fast as on phones or laptops. For example, if the manufacturer's firmware connects to the first AP that appears in a scan, clients will join that AP without considering signal level. Likewise, if the two APs use different Wi-Fi beacon intervals, the AP with the lower interval will end up with more connected devices. I therefore recommend using the same beacon interval on all your APs.
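To make the "first match versus best signal" difference concrete, here is a minimal sketch of signal-based selection. The input format ("bssid rssi ssid" per line), the function name, and the sample data are all assumptions made for illustration; this is not how any particular firmware implements it.

```shell
#!/usr/bin/env bash
# pick_strongest_ap <ssid>
# Reads scan results on stdin, one per line: "<bssid> <rssi_dbm> <ssid>"
# (hypothetical format; SSIDs containing spaces are not handled) and prints
# the BSSID with the strongest signal for the requested SSID.
pick_strongest_ap() {
    awk -v ssid="$1" '
        $3 == ssid && (best == "" || ($2 + 0) > best_rssi) {
            best = $1
            best_rssi = $2 + 0
        }
        END { if (best != "") print best }
    '
}

# Made-up sample data: two APs on the same SSID plus an unrelated network.
scan_results='aa:aa:aa:aa:aa:01 -78 home-iot
aa:aa:aa:aa:aa:02 -52 home-iot
aa:aa:aa:aa:aa:03 -40 guest'

printf '%s\n' "$scan_results" | pick_strongest_ap "home-iot"
```

A firmware that takes the first scan result would pick the -78 dBm AP here, while the sketch selects the -52 dBm one.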

High Availability of Home Assistant

Home Assistant does not support redundancy out of the box or through a plugin. To achieve it, there are several problems we have to solve.

The first is entity state synchronization. For example, suppose a button sends an event to the control plane to toggle a lamp. If the lamp state is not synced between the two control planes when you press the button, the target state and the actual state do not match, and you have to press several times to reach the desired state. Second, each device must accept control from multiple local clients at the same time. Otherwise, the second control plane cannot control the device until the first session drops or the device is paired again.

I have WiZ lamps at home, and they communicate over plain UDP packets. The lamps can notify registered local clients, so a successful state change is reported from the lamp directly to both Home Assistant instances.

I built my light switches with cheap, minimal ESP-01 modules placed behind the wall switches. The ESPHome protocol lets multiple Home Assistant instances connect at once and reports state changes to every connected client.

Other devices such as TVs and TV boxes are already designed to be controlled by everyone in the household; even LG Netcast works with multiple clients without any issue. The only concern is Tuya devices. They are well secured but hard to manage from an external application.

Action Duplication

With this configuration, our devices can be managed by both Home Assistant instances, but this redundant, replicated system brings another issue. When an event is physically fired from the light switch, it can take a different amount of time to reach each Home Assistant instance because of latency spikes or expired ARP table entries. In that case, the first instance toggles the lamp off and the state change reaches the second instance, but the second instance then receives the delayed button event and toggles the lamp again, returning it to its original state. In the same way, an automation could water your flowers twice.

The secondary Home Assistant must not execute automations unless the primary is degraded. This is easy to arrange in Home Assistant: automations can be conditioned on entity state, so we can add a virtual switch that indicates whether the current instance is primary or backup.
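The effect of the virtual switch can be simulated in a few lines of shell. This is only a toy model under my own assumptions: the `toggle` and `handle_press` helpers are hypothetical, and the `is_master` argument stands in for the state of the virtual switch on each instance.

```shell
#!/usr/bin/env bash
# toggle <state>: prints the opposite of "on"/"off".
toggle() {
    if [ "$1" = "on" ]; then echo "off"; else echo "on"; fi
}

# handle_press <is_master> <lamp_state>
# Only an instance whose is_master flag is "on" acts on the button event;
# any other instance leaves the lamp state unchanged.
handle_press() {
    if [ "$1" = "on" ]; then toggle "$2"; else echo "$2"; fi
}

# Without gating: both instances act on the same press,
# so the lamp ends up where it started.
lamp="on"
lamp="$(handle_press on "$lamp")"
lamp="$(handle_press on "$lamp")"
echo "ungated: $lamp"

# With gating: the backup (is_master=off) ignores the event.
gated_lamp="on"
gated_lamp="$(handle_press on "$gated_lamp")"
gated_lamp="$(handle_press off "$gated_lamp")"
echo "gated:   $gated_lamp"
```

In the ungated run the lamp returns to "on" after a single press, which is exactly the double toggle described above; in the gated run it stays "off".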

Before executing an automation, each instance needs to know whether it is the primary or the backup, and whether the primary is healthy. To resolve this, I use the system's hostname, and a second sensor performs a health check on the primary Home Assistant.

You can place the lines below in your configuration.yaml.

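command_line:
  # Name of this host, used to tell the primary from the backup.
  - sensor:
      name: Hostname
      command: 'hostname'
      scan_interval: 100
  # HTTP status code returned by the primary; curl prints 000 when it cannot connect.
  - sensor:
      name: MainState
      command: 'curl -s -o /dev/null --max-time 3 -w "%{http_code}" http://{IPaddress of primary}:8123'
      scan_interval: 5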

An automation on both the primary and the backup Home Assistant controls the virtual switch. At startup the automation is triggered and the switch is turned on. When the backup can reach the primary, the backup turns its switch off. The primary never turns its switch off, because that branch only runs when the hostname does not match the primary hostname.

[Figure: homeasistant-is-main]

alias: MainControl
description: ""
trigger:
  - platform: state
    entity_id:
      - sensor.hostname
  - platform: state
    entity_id:
      - sensor.mainstate
  - platform: homeassistant
    event: start
condition: []
action:
  # This instance is active when it is the primary host (homeas-1)
  # or when the primary does not answer with HTTP 200.
  - if:
      - condition: or
        conditions:
          - condition: not
            conditions:
              - condition: state
                entity_id: sensor.mainstate
                state: "200"
          - condition: state
            entity_id: sensor.hostname
            state: homeas-1
    then:
      - service: homeassistant.turn_on
        metadata: {}
        data: {}
        target:
          entity_id: input_boolean.is_master
    else:
      - service: homeassistant.turn_off
        metadata: {}
        data: {}
        target:
          entity_id: input_boolean.is_master
mode: single
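
The decision this automation makes reduces to a single boolean expression, which is easy to check outside Home Assistant. The sketch below mirrors it in shell; the function name and the backup hostname `homeas-2` are hypothetical, while `homeas-1` and the `200` status come from the automation above.

```shell
#!/usr/bin/env bash
# should_be_active <hostname> <primary_http_code> [primary_hostname]
# Prints "on" when this instance should run automations:
# either it is the primary itself, or the primary did not answer with HTTP 200.
should_be_active() {
    local primary="${3:-homeas-1}"
    if [ "$1" = "$primary" ] || [ "$2" != "200" ]; then
        echo "on"
    else
        echo "off"
    fi
}

# Truth table for the four combinations (homeas-2 is a made-up backup name).
for sample in "homeas-1 200" "homeas-1 000" "homeas-2 200" "homeas-2 000"; do
    set -- $sample
    printf '%-9s %s -> %s\n' "$1" "$2" "$(should_be_active "$1" "$2")"
done
```

The only combination that yields "off" is the backup seeing a healthy primary, which is the case where duplicated actions would otherwise occur.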

To prevent duplicated actions, add the condition below to every automation.

[Figure: homeasistant-is-main-2]

condition: state
entity_id: input_boolean.is_master
state: "on"

Keeping Critical Network Components Alive

Enterprise equipment supports redundant configurations, but home-grade devices do not have such features. I therefore prefer to disable those services on the device and serve them from Raspberry Pis, over which I have more control.

Since Home Assistant uses Wi-Fi to communicate with devices, the base network services must stay alive so that devices get an IP address and DNS resolution. ISC DHCP supports a failover configuration, but a few years ago I switched to Dnsmasq for both DNS and DHCP, and it does not support active-passive redundancy.

A basic bash script can start or stop Dnsmasq on the backup system.

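#!/usr/bin/env bash
# Runs on the backup server: starts Dnsmasq while the primary is down
# and stops it again once the primary is back.

PRI=10.0.1.2 # Primary server main interface IP address.

ON_BACKUP=unknown # unknown until the first check, then true or false
while true; do
    # Is the primary host reachable? (1 packet, 2 s timeout)
    ping -c 1 -W 2 "$PRI" > /dev/null 2>&1
    PING_STAT=$?
    # Probe the DHCP port (UDP 67) on the primary.
    netcat -zuvn -w 2 "$PRI" 67 > /dev/null 2>&1
    PORT_STAT=$?
    if [[ $PING_STAT -eq 0 ]] && [[ $PORT_STAT -eq 0 ]]; then
        # Primary is healthy: make sure the backup Dnsmasq is stopped.
        if [[ $ON_BACKUP != false ]]; then
            echo "Primary available, stopping Dnsmasq"
            ON_BACKUP=false
            service dnsmasq stop > /dev/null 2>&1
        fi
    else
        # Primary is unreachable: take over DNS and DHCP.
        if [[ $ON_BACKUP != true ]]; then
            echo "Primary unavailable, starting Dnsmasq"
            ON_BACKUP=true
            service dnsmasq start > /dev/null 2>&1
        fi
    fi
    sleep 10
done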
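
One possible refinement, which is not part of my running setup, is to wait for several consecutive failed probes before taking over, so that a single lost ping does not start a second DHCP server. The sketch below shows only the counting logic on simulated probe results; the threshold value and the function names are assumptions.

```shell
#!/usr/bin/env bash
FAIL_THRESHOLD=3 # hypothetical: consecutive failed probes before taking over

# next_fail_count <current_count> <probe_exit_status>
# Resets the counter on success, increments it on failure.
next_fail_count() {
    if [ "$2" -eq 0 ]; then echo 0; else echo $(( $1 + 1 )); fi
}

# primary_is_down <fail_count>: succeeds once the threshold is reached.
primary_is_down() {
    [ "$1" -ge "$FAIL_THRESHOLD" ]
}

# Simulated probe results: 0 = primary answered, 1 = probe failed.
fails=0
for status in 1 1 0 1 1 1; do
    fails="$(next_fail_count "$fails" "$status")"
    if primary_is_down "$fails"; then
        echo "fails=$fails: start dnsmasq"
    else
        echo "fails=$fails: keep waiting"
    fi
done
```

With a 10-second loop and a threshold of 3, the backup would take over roughly 30 seconds after the primary disappears, so the threshold trades failover speed against false positives.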

Floating IP for Accessing the Home Assistant Web Interface

The Home Assistant mobile app supports multiple servers, but to switch between servers automatically and keep the API available at the same IP, we will use the Keepalived service.

Keepalived is responsible for checking the availability of Home Assistant and assigning the virtual IP according to priority.

Below is an example /etc/keepalived/keepalived.conf. Keepalived runs on both the primary and the secondary; the only difference is the priority of the VRRP instance. A higher number means higher priority, so set a lower number on the secondary system than on the primary.

vrrp_script homeas_status {
    # -f makes curl exit non-zero on HTTP errors, not only on connection failures
    script "/usr/bin/curl -sf -o /dev/null --max-time 1 http://127.0.0.1:8123"
    interval 2 # check every 2 seconds
}
vrrp_instance HOMEAS_EXT {
    state BACKUP
    interface lan1 # your interface name, e.g. eth0, wlan0 or br0
    virtual_router_id 50
    priority 100 # use a lower number on the backup system
    advert_int 1
    virtual_ipaddress {
        10.0.1.5 dev lan1 label lan1:5 # replace lan1 with your interface name
    }
    track_script {
        homeas_status
    }
}
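
The health check can also live in a small script, so that its exit code reflects the HTTP status Home Assistant returns and not just whether curl ran. This is a sketch under my own assumptions: the function names are hypothetical, and the demonstration loop uses fixed status codes so it runs without any network access.

```shell
#!/usr/bin/env bash
# is_healthy_code <http_status>: succeeds only for HTTP 200.
is_healthy_code() {
    [ "$1" = "200" ]
}

# probe_status <url>: prints the HTTP status code,
# or 000 when nothing answers within 2 seconds.
probe_status() {
    curl -s -o /dev/null --max-time 2 -w '%{http_code}' "$1"
}

# check_homeas [url]: exit status 0 means Home Assistant answered with 200.
check_homeas() {
    is_healthy_code "$(probe_status "${1:-http://127.0.0.1:8123}")"
}

# Demonstration on fixed status codes (no network access needed).
for code in 200 503 000; do
    if is_healthy_code "$code"; then
        echo "$code -> healthy"
    else
        echo "$code -> unhealthy"
    fi
done
```

To use it with Keepalived, drop the demonstration loop, end the file with a call to `check_homeas` so that its exit status becomes the script's, and point the `script` line of `vrrp_script` at wherever you saved the file.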

In conclusion, with a low budget and a simple setup, we get a reliable system for managing a smart home.

