High Availability at homes

Fabrizio Waldner
8 min read · Mar 9, 2019

https://github.com/fabrizio2210/High-Availability-at-homes

Introduction

I like to play at home with Raspberry Pis. At work I am a system administrator, and our first goal is to serve a service without interruptions. This fascinates me, and I wanted to reproduce that condition at home. This project is a proof of concept to verify whether what I have in mind can work.

Definitions

Before starting the project I had to settle on some definitions. At work there are contracts with customers that define services and boundaries. My project at home is a hobby, so I have to define the goals myself.

What is a service?

In this project I define a service as a web service: it can be a simple page or a Content Management System (like WordPress or Grav), in other words a service exposed by an HTTP server. These days we tend to think of a service as a web service, but in the strict IT definition a service is whatever helps a customer reach their goals. At first I would like to expose stateless services (simple) and then stateful ones (complex).

What is a failure?

Normally you want to deploy a service that is resilient to problems (hardware or software). This property is what fascinates me in my work. The definition of a failure depends on the infrastructure where the service is running. In my case I planned to provide the service using a Raspberry Pi at my home with a fiber connection. One could think that the development boards should be made redundant in the same location, maybe putting a load balancer like HAProxy in front of them. But my experience tells me that the most frequent problem is an outage of the connection. How many times has my mom unplugged the modem with the broom? How many times have household appliances caused a blackout? So the infrastructure should be resilient to the disruption of a single connection. After this thought, I believe that I have to use multiple Raspberry Pis located in different houses with different connections. The name comes from this infrastructure: High Availability at Homes.

Goal

To achieve results, it is important to keep the objective in mind. The main goal of this project is to provide a service (see the definition above) that is tolerant to failures (again, as defined above).

How?

Since I identified the failure as the interruption of the internet connection, I have to spread the service across multiple internet connections. This means replicating the service in multiple sites (homes, in my case) and using a strategy to direct the client connections toward the available node(s).

I know that there are network tricks to advertise an IP in multiple places (see anycast DNS infrastructures like Google's 8.8.8.8), but that is not my case; I will use a simpler method: announcing the available IPs in DNS. After some research, I found Dynu Systems (https://www.dynu.com/), which offers the possibility to update, but above all to create, DNS records through an API. And I was impressed by the low cost of the service: free up to 4 records and $9/year for more records.
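In practice, updating a record boils down to an authenticated HTTP call. Here is a minimal sketch with curl; the endpoint follows the generic dynamic DNS update protocol that Dynu supports, and the hostname and credentials are placeholders, so check the Dynu API documentation for the exact parameters:

# Discover the current public IP (any "what is my IP" service works here)
MYIP=$(curl -s https://api.ipify.org)
# Update the A record of this node (hypothetical hostname, Dynu username/password as basic auth)
curl -s -u 'dynu_username:dynu_password' \
  "https://api.dynu.com/nic/update?hostname=grav.example.dynu.net&myip=${MYIP}"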

Unit of release

I spoke about the service, but I did not say how to deploy it. For this project I found Docker very useful, for these reasons:

  • it gives me an isolated environment for each web server (containers)
  • it allows me to focus, and operate, on the data to store (volumes)
  • it allows me to install multiple web servers and use Traefik as a dynamic reverse proxy

The unit of release is a Docker stack. A stack is a group of Docker services with their variables and configurations. More information here: https://docs.docker.com/get-started/part5/
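Under the hood, deploying a stack on a Swarm manager is a single command; a minimal sketch, with hypothetical file and stack names:

# Deploy (or update) the services described in stack.yml as a stack called "grav"
docker stack deploy --compose-file stack.yml grav
# List the services created by the stack
docker stack services grav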

To deploy a stack I used Ansible because:

  • it is agent-less
  • it allows me to perform operations, not only to define a state
  • it speaks YAML, just like a Docker stack

Scheme

Fig.1 — The general scheme of the project
  • Dynu client: the client that updates the RR records on the DNS
  • Traefik: I use it as a dynamic reverse proxy to direct connections toward the right container
  • volumes: if you are part of the Docker world, you know they are essentially persistent storage
  • web-server: the actual service that I want to provide
  • sync volumes: a component that keeps the volumes synced with the other nodes

Steps

Before implementing the project, it is important to define the path to reach the goal. Sometimes it can be cumbersome, but it is important to keep in mind which steps to follow. When a project is carried out over a long period, it is easy to lose sight of the objectives. Moreover, it is satisfying to see how many steps you have completed. In any case, there is no problem with redefining the steps along the way.

For this project I defined these steps:

  • Installation of Docker Swarm on a single node
  • Deployment of Traefik as a service on Swarm
  • Creation of a role to deploy stacks
  • Manipulation of a stack, inserting some key/value pairs
  • Deployment of a web-service without volumes on a single node
  • Deployment of a web-service without sync on a single node, but with local volumes
  • Manipulation of a stack, reading some key/value pairs
  • Deployment of a web-service on multiple nodes with a volume, but without sync
  • Creation of a sync mechanism aware of multiple nodes
  • Deployment of a web-service on multiple nodes with a volume and with sync

Synchronization

To keep the data in the Docker volumes in sync, I used Csync2 together with Lsyncd. Csync2 (https://github.com/LINBIT/csync2) was born to synchronize multiple nodes independently of the source of the modification; in other words, it allows a multi-master synchronization. It works similarly to rsync, but it is smarter. Lsyncd (https://github.com/axkibe/lsyncd) can be used to trigger Csync2 or rsync. I like that, as is usual in the Unix universe, each component does only one thing and does it well. I found the combination of the two very powerful, but not very flexible, especially Csync2. Csync2 is not designed to work in a container, nor to work behind a NAT. It wants all the nodes to be identified by a hostname, and the name used by the client must be resolvable to the IP the server sees the connection coming from. To overcome these inflexibilities I had to use some workarounds.
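In practice the combination looks roughly like this; it is only a sketch with hypothetical paths and hostnames, the real setup lives in the sync server container described below:

# Receive changes from the other nodes: run Csync2 as a stand-alone server on TCP 30865
csync2 -ii -N node1.example.dynu.net &
# Propagate local changes: Lsyncd watches the volume and, through its config file,
# runs "csync2 -x" whenever files change
lsyncd /etc/lsyncd/lsyncd.conf
# "csync2 -x" compares the watched directories and pushes the differences to the peers
csync2 -x -N node1.example.dynu.net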

Integration

A big aspect of my job (the real one, as a system administrator) is the integration between different components, and between the components and your environment. I am not a developer, and I prefer to use tools that someone else developed rather than developing them myself. This is because we live in a world where a lot of people share their code, and I think that someone has probably addressed the problem better than I would.

The main effort in this project is the integration of the various services and components.

Components

In this section I will describe the scripts that I developed to reach the goal. Each script runs in its own container.

Dynu client

https://github.com/fabrizio2210/HA_dynu_client

This is the main part of the mechanism. If the service does not need files to be synced, this component alone can manage the high availability.

The role of the script is to keep the DNS record updated with the IP of the node. It also checks that the other nodes are available; if one is not, it disables the corresponding record.
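A rough sketch of that check, with hypothetical hostnames (the real logic is in the repository linked above):

# For every other node, probe the exposed service; if it does not answer,
# disable its DNS record so that clients stop resolving to a dead IP.
for peer in node2.example.dynu.net node3.example.dynu.net; do
  if ! curl -s --max-time 5 -o /dev/null "http://${peer}/"; then
    echo "peer ${peer} unreachable, disabling its record"
    # here the script would call the Dynu API to disable the record of ${peer}
  fi
done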

Sync server

https://github.com/fabrizio2210/HA_volume_sync_server

The container that hosts this script mounts the volumes to sync. The script launches Lsyncd to trigger the synchronization and propagate the modifications from this node to the others. The script also launches Csync2 to receive the synchronizations from the other nodes. In this container there is also an HTTP tunnel provided by Chisel (https://github.com/jpillora/chisel). This HTTP tunnel is fundamental for the project: I use it to convert the TCP connections of Csync2 into HTTP connections that Traefik can manage.
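Inside the sync server container the tunnel endpoint is simply a Chisel server exposed over HTTP, so Traefik can route to it like to any other web service. A minimal sketch, with a hypothetical authentication pair:

# Start Csync2 listening on its usual TCP port (30865) inside the container
csync2 -ii &
# Start the Chisel server on an HTTP port reachable by Traefik;
# remote peers will open tunnels through it towards localhost:30865
chisel server --port 8080 --auth 'user:secret'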

Sync proxy

https://github.com/fabrizio2210/docker-HA_sync_proxy

This container makes it possible to contact the other nodes. In fact, Csync2 cannot connect directly to the other nodes, which are behind a NAT and, above all, behind a reverse proxy. Every sync proxy container is linked to one node of the group.

This container is a wrapper around Chisel, because I found an ARM build of Chisel, but not an ARM Docker image of the HTTP tunnel.

Fig.2 — Integration of Csync2 with HTTP tunneling
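In other words, each sync proxy runs a Chisel client that reaches the remote node through Traefik over HTTP and re-exposes the remote Csync2 port locally; a sketch, with hypothetical hostnames and credentials:

# Connect to the Chisel server of the remote node (published by Traefik over HTTP)
# and forward local port 30865 to localhost:30865 on the remote side,
# i.e. to the Csync2 daemon running in that node's sync server container
chisel client --auth 'user:secret' http://sync-node2.example.dynu.net 30865:localhost:30865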

Sync deployer

https://github.com/fabrizio2210/HA_volume_sync_deployer
As you can see from the scheme (fig.2), there needs to be a proxy for each node. Moreover, the configuration of the Sync server depends on the nodes to contact. So this container mounts the Docker socket, and a bash script keeps the Sync server and Sync proxy Docker services updated.

But how can this container be aware of the other nodes? It makes DNS queries for the TXT records created by the Dynu client. This mechanism is illustrated in fig.3.

Fig.3 — How to keep each node aware of other nodes
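For example, the discovery illustrated in fig.3 can be as simple as a DNS lookup followed by a refresh of the affected services; a sketch, where the TXT record name and the service names are assumptions:

# Ask the DNS for the list of nodes published by the Dynu clients
dig +short TXT nodes.example.dynu.net
# If the list changed, refresh the sync services so they pick up the new peers
docker service update --force mystack_sync-server
docker service update --force mystack_sync-proxy-node2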

Operations

Infrastructure

In the project you can find two playbooks. One creates the base infrastructure where the web-service will be deployed; the other one is the tool that deploys the web-service.

setHA_at_home.yml prepares the nodes to host the web-service, so it:

  • performs basic settings on the machine (password, no IPv6…)
  • installs Docker and sets up Swarm
  • deploys Traefik as a service in Docker Swarm
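A run of the playbook looks like this (the inventory file name is an assumption):

# Prepare all the nodes listed in the inventory: Docker, Swarm and Traefik
ansible-playbook -i inventory setHA_at_home.yml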

Deployment

I created the deployStack.yml playbook to deploy the web-service and the components that are needed to achieve high availability. In fact, for each service, these components are released:

  • the web-service (e.g. Grav)
  • the Dynu client (to keep the DNS records updated)
  • the volume sync deployer (if needed to manage the volume sync)

Then the volume sync deployer will spawn instances of sync server and sync proxy.

The deployment starts from a YAML file that represents the web-service to release. Basically, the YAML describes the Docker image, the hostname of the service, the volumes to sync and the credentials for the Dynu domain. The tasks in the playbook take this YAML and manipulate it.
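A deployment is then launched by pointing the playbook at that YAML; a sketch, where passing the stack file as an extra variable is an assumption:

# Deploy the stack described in grav.yml on all the nodes
ansible-playbook -i inventory deployStack.yml -e stack_file=grav.yml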

Example of a stack to deploy Grav (https://getgrav.org/):

version: '3.3'
services:
  grav:
    image: fabrizio2210/grav:armv7hf
    environment:
      VIRTUAL_HOST: 'grav.*******.dynu.net'
    volumes:
      - data-grav:/usr/html/user
    restart: always
volumes:
  data-grav:
HA_at_homes:
  domain_name: '*******.dynu.net'
  use_traefik: True
  node_name: 'grav'
  https: False
  exposed_service: 'grav'
  exposed_port: '80'
  ################
  # Dynu credentials
  dynu_user: '****************'
  dynu_password: '****************'
  dynu_secret: '********************'
  dynu_username: '********@*****.**'
  ####
  # Auth and keys of the infrastructure
  csync2_key: '*************'
  proxy_auth: 'user:*****************'

Example of a simple HTTP service:

##############################################################
############# Example: HTTP #################################
version: '3.3'
services:
  http:
    image: fabrizio2210/passwordchart

HA_at_homes:
  domain_name: '******.dynu.net'
  use_traefik: True
  node_name: 'pswchart'
  https: False
  exposed_service: 'http'
  exposed_port: '80'
  dynu_user: '***********************'
  dynu_password: '******************'
  dynu_secret: '******************'
  dynu_username: '*****@*****.**'
###################

Considerations

HA vs load balancing

Sometimes high availability is confused with load balancing. A system can ensure high availability using an active/passive approach. In my case I have an active/active configuration, so I do both. There is no problem with a stateless web service, but with an application that needs a session, like Grav (e.g. during admin editing), there are some problems. The load balancing achieved by DNS round robin does not take session affinity into account, so at the next resolution of the service DNS record the browser may go to a different node that does not recognize the client.
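You can see the problem directly in the DNS answers: with both nodes active the record resolves to two addresses, and the client is free to switch between them at every new resolution (hypothetical hostname and example addresses):

$ dig +short grav.example.dynu.net A
203.0.113.10
198.51.100.20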

To avoid this behavior, I should change the logic of the Dynu client script to keep only one record active at a time.

Test

In this project I understood the importance of tests. Before pushing a commit that triggers a container build, it is important to be sure that your code is not broken. For each component I developed a use case set up by a test script.

Hands on

If you want to try my project on your PC, without Raspberry Pis, I developed a test script that uses Vagrant and libvirt/KVM to simulate a simple scenario.

The requirements for testing are:

  • have an account on www.dynu.com
  • have a Linux operating system (I actually tested on Ubuntu 16.04)
  • have Vagrant with libvirt/KVM installed

Follow the instructions in the Readme file of the Git project.
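The general shape of a test run is something like the following, assuming the libvirt provider is set up; the authoritative steps are the ones in the Readme:

# Install the libvirt provider plugin for Vagrant (once)
vagrant plugin install vagrant-libvirt
# Clone the project and bring up the simulated nodes
git clone https://github.com/fabrizio2210/High-Availability-at-homes
cd High-Availability-at-homes
vagrant up --provider=libvirt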


Fabrizio Waldner

Site Reliability Engineer at Google. This is my personal blog, thoughts and opinions belong solely to me and not to my employer.