DockerLab proxy performance and IPVS

Using a single node as the reverse proxy is a bottleneck. By introducing IPVS I can improve the performance of a website.

Fabrizio Waldner
8 min read · Jan 2, 2021

Today I want to talk about how I spotted a performance bottleneck in my infrastructure (DockerLab) and found a way to remove it.

It all started during a performance test on a website hosted on DockerLab. Let me present the scenario: a static website is served by four Nginx instances that run on four Raspberry Pis. The Nginx instances do not run on bare metal: they live in Docker containers managed by the Docker swarm. In addition, in front of every Nginx there is a Traefik reverse proxy, but only one reverse proxy at a time receives the traffic, through a VIP that can move to another node in case of failover (more information in DockerLab).

It may seem a complex and redundant structure, but it comes for free once you have a Docker swarm.

Scheme of the infrastructure

The Docker stacks deployed

version: '3.2'
services:
  traefik:
    image: arm32v6/traefik
    command:
      - "--api"
      - "--api.dashboard=true"
      - "--api.insecure=true"
      - "--entrypoints.web.address=:80"
      - "--entrypoints.websecure.address=:443"
      - "--providers.docker=true"
      - "--providers.docker.watch=true"
      - "--providers.docker.swarmMode=true"
      - "--providers.docker.exposedByDefault=false"
      - "--accessLog"
      - "--providers.docker.network=Traefik_backends"
      - "--certificatesresolvers.myresolver.acme.storage=/etc/traefik/acme/acme.json"
      - "--certificatesresolvers.myresolver.acme.email=xx@xx.com"
      - "--certificatesresolvers.myresolver.acme.httpchallenge.entrypoint=web"
      - "--certificatesresolvers.myresolver.acme.httpchallenge=true"
      - "--metrics"
      - "--metrics.prometheus"
    volumes:
      - /var/run/docker.sock:/var/run/docker.sock
      - acme_data:/etc/traefik/acme/
    networks:
      - backends
    ports:
      - target: 80
        published: 80
        mode: host
      - target: 443
        published: 443
        mode: host
      - target: 8080
        published: 8888
        mode: host
    deploy:
      labels:
        traefik.http.routers.dashboard.service: "api@internal"
      mode: global
      placement:
        constraints:
          - node.role == manager
      update_config:
        parallelism: 1
        delay: 10s
      restart_policy:
        condition: on-failure
networks:
  backends:
    driver: overlay
volumes:
  acme_data:
    driver: 'local'
    driver_opts:
      type: 'none'
      o: 'bind'
      device: "/mnt/cluster/traefik/acme"

Brief translation: create a Traefik instance on every manager node of the Docker swarm. Make them listen on TCP ports 80 and 443 of each node, bypassing the Docker routing mesh. Enable the Let’s Encrypt automation to obtain TLS certificates. Auto-configure the reverse proxy based on the services running on the Docker swarm. Create a backends network that the backend services have to attach to.
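
For reference, this is how the stack can be deployed from a manager node. I assume the file is saved as traefik.yml; the stack name Traefik is what makes the backends network appear as Traefik_backends:

# Deploy the stack; swarm prefixes the network name with the stack name
docker stack deploy -c traefik.yml Traefik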

Here is the website stack (I prefer to keep the stacks separated):

version: '3.3'
services:
  website:
    image: fabrizio2210/ervisa-makeup-it
    deploy:
      mode: global
      labels:
        traefik.enable: "true"
        traefik.http.routers.my-router.rule: "Host(`example.com`)"
        traefik.http.routers.my-router.entrypoints: websecure
        traefik.http.routers.my-httprouter.entrypoints: web
        traefik.http.routers.my-httprouter.rule: "Host(`example.com`)"
        traefik.http.routers.my-httprouter.middlewares: my-redirectscheme
        traefik.http.routers.my-router.tls: "true"
        traefik.http.routers.my-router.tls.certresolver: myresolver
        traefik.http.services.my-service.loadbalancer.server.port: 80
        traefik.http.middlewares.my-redirectscheme.redirectscheme.scheme: https
        traefik.http.middlewares.my-redirectscheme.redirectscheme.permanent: "true"
    restart: always
    networks:
      - Traefik_backends
      - default
networks:
  Traefik_backends:
    external: true

Brief translation: create a fabrizio2210/ervisa-makeup-it container (an Nginx image crafted by me with the static content inside) on each node. Instruct Traefik to forward requests when the example.com host is matched. Set up Traefik to redirect requests on port 80 to port 443 (HTTP -> HTTPS). Use the backends network created in the Traefik stack (the one above).
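
Once both stacks are running, the HTTP to HTTPS redirect can be quickly verified from any host (a sketch; example.com stands for the real domain):

# Expect a 301 Moved Permanently with Location: https://example.com/
curl -sI http://example.com | head -n 3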

The test

The load test is very rough: it is meant to give an idea of how the website behaves rather than to measure its exact performance. The load is fixed, and the number of concurrent clients is a consequence of it.

for i in $(seq 100); do
  wget --level=inf --recursive \
    --page-requisites --user-agent=Mozilla \
    --no-parent -e robots=off \
    -P /run/user/1000 \
    --reject-regex "(.*)\?(.*)" -q \
    https://example.com &
  sleep 1
done

Brief translation: once per second, 100 times, start downloading the entire website while avoiding some nasty links. Obviously a client is not able to download the whole website in one second, so after a while there will be several concurrent clients. The number of concurrent clients depends on how fast the website returns the data.
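
The number of concurrent clients shown in the graphs below can be sampled on the tester with something as simple as this sketch (it assumes no unrelated wget processes are running):

# Print a timestamp and the number of running wget processes every 2 seconds
while true; do
  echo "$(date +%s) $(pgrep -c wget)"
  sleep 2
done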

The outcome

To measure the performance I will take into consideration the following parameters:

  1. The total time of the test
  2. The maximum number of concurrent clients
  3. The highest response time during the test
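
For the third parameter, the response time comes from my monitoring, which probes the website every minute with a 15-second timeout. A minimal stand-in for such a probe could be:

# Print the total time of one request, giving up after 15 seconds
curl -o /dev/null -s -w '%{time_total}\n' --max-time 15 https://example.com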

Here are the results:

  1. It took 6 minutes and 36 seconds to serve all the clients.
  2. There were at maximum 96 concurrent clients:
Y = number of clients; X = sample every 2 seconds

  3. The highest recorded response time was 13.87 seconds. But to be honest, some data points in between are missing even though samples are taken every minute, which means the response time went over the timeout (15 s):

Zoom of 15 minutes containing the test

The Bottleneck

To understand where there is room for improvement, I looked at some basic indicators:

Utilized Network I/O

Obviously there is a big peak in bandwidth utilization. It is worth noticing that one node (raspberrypi3) had the highest IN/OUT traffic, because it acts as the reverse proxy. It is not alarming, because the maximum available throughput is around 90 Mbps.

Utilized RAM

The utilization of the RAM was pretty stable.

Utilized Disk I/O

Number of I/O
Disk Read Throughput

There was just an initial read from the disk; afterwards the website was served from the cache (i.e. from RAM). Therefore the disk did not have to work hard.

Utilized CPU

CPU computed as total minus idle

The CPU of raspberrypi3 was at 100% for the whole test, whereas the CPUs of the other nodes were quite stable. This is probably the bottleneck.

The improvement

From the graphs I understand that the CPU is the bottleneck. Why the CPU is so busy is immediately explained: the process keeping it busy is the reverse proxy, namely Traefik. Indeed, in this configuration, although there are four reverse proxies distributed over the four nodes, only the one pointed to by the VIP is actually used. The reverse proxy is in charge of terminating the TLS connection and choosing the right backend container (i.e. one of the four Nginx instances).

Therefore, the optimization is to distribute the traffic over all four reverse proxies using a Layer 4 load balancer (i.e. a TCP balancer). Since an L4 balancer is normally much cheaper than an L7 load balancer in terms of CPU, I should achieve a better result without creating a new bottleneck.

Scheme of the new infrastructure

IPVS: what is it?

Doing some research I discovered that the Linux kernel already contains (since version 2.6.x) a simple Layer 4 load balancing feature. It is called IP Virtual Server (IPVS) and it is configured with the ipvsadm command. The concept is to map a virtual server, defined by an IP:port pair, to several real backend servers, defined by other IP:port pairs. It is very simple and lacks health checks. The interesting thing is that it operates directly in the kernel, and the configuration is similar to netfilter/iptables. More information here.
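
On a Debian-based system, the user-space tool and the kernel module can be prepared like this (a sketch; package names may differ on other distributions):

# Install the user-space configuration tool
apt-get install ipvsadm
# Load the kernel module and verify that IPVS is available
modprobe ip_vs
cat /proc/net/ip_vs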

New setup

To understand the commands for the new configuration I have to show the IPs of the servers:

NAME          IP            VIP
bananam2u1    192.168.1.2   192.168.1.200
raspberrypi0  192.168.1.10
raspberrypi1  192.168.1.11
raspberrypi2  192.168.1.12
raspberrypi3  192.168.1.13

On another node (bananam2u1) I set up the IP Virtual Server:

root@bananam2u1:~# ifconfig eth0:0 192.168.1.200 up
root@bananam2u1:~# ipvsadm -A -t 192.168.1.200:443 -s rr
root@bananam2u1:~# ipvsadm -a -t 192.168.1.200:443 -g -r 192.168.1.10
root@bananam2u1:~# ipvsadm -a -t 192.168.1.200:443 -g -r 192.168.1.11
root@bananam2u1:~# ipvsadm -a -t 192.168.1.200:443 -g -r 192.168.1.12
root@bananam2u1:~# ipvsadm -a -t 192.168.1.200:443 -g -r 192.168.1.13
root@bananam2u1:~# ipvsadm -l -n
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.1.200:443 rr
  -> 192.168.1.10:443             Route   1      0          0
  -> 192.168.1.11:443             Route   1      0          0
  -> 192.168.1.12:443             Route   1      0          0
  -> 192.168.1.13:443             Route   1      0          0
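
Keep in mind that these rules live only in the kernel’s memory and are lost at reboot; ipvsadm-save and ipvsadm-restore can persist them:

root@bananam2u1:~# ipvsadm-save -n > /etc/ipvsadm.rules
root@bananam2u1:~# ipvsadm-restore < /etc/ipvsadm.rules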

The configuration used is direct routing. This mode is designed to expose the virtual server on a network and contact the real backend servers on the same network (i.e. the same subnet). In this case the L2 (MAC) address is used to deliver the packet to the backend server, and the balancer handles only one half of the connection. In other words, the destination IP of the packet that reaches the backend server is still the Virtual Server IP (i.e. 192.168.1.200), but the destination MAC address is the one of the backend server’s interface. The source IP is kept as the original client IP, so the backend responds directly to the client.
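
As a side note, the classic LVS direct-routing recipe to make a backend accept packets addressed to the VIP is to add the VIP on a loopback alias and suppress ARP replies for it. This would be an alternative to the iptables rules shown below; on each Raspberry Pi it would look like this:

root@raspberrypi0:~# ip addr add 192.168.1.200/32 dev lo
root@raspberrypi0:~# sysctl -w net.ipv4.conf.all.arp_ignore=1
root@raspberrypi0:~# sysctl -w net.ipv4.conf.all.arp_announce=2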

I would have liked to use the masquerading configuration to simplify things, but having all the IPs on the same network leads to a problem where the packets are not NATted (probably a little bug). The destination IP of the packet that reaches the backend server should be rewritten to the real IP of the backend server, but it is not: the packet arrives with the Virtual Server IP (i.e. 192.168.1.200) as destination.

In both cases, I can adjust the destination address using iptables on the backend servers:

root@raspberrypi0:~# iptables -t nat -A PREROUTING -d 192.168.1.200/32 -j DNAT --to-destination 192.168.1.10
root@raspberrypi1:~# iptables -t nat -A PREROUTING -d 192.168.1.200/32 -j DNAT --to-destination 192.168.1.11
root@raspberrypi2:~# iptables -t nat -A PREROUTING -d 192.168.1.200/32 -j DNAT --to-destination 192.168.1.12
root@raspberrypi3:~# iptables -t nat -A PREROUTING -d 192.168.1.200/32 -j DNAT --to-destination 192.168.1.13

Finally, I changed the /etc/hosts of the tester to use the new Virtual Server (192.168.1.200).
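
The entry on the tester looks like this (example.com stands for the real domain):

# /etc/hosts on the tester: resolve the website to the Virtual Server
192.168.1.200   example.com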

The Second outcome

  1. It took 5 minutes and 2 seconds.
  2. There were at maximum 88 concurrent clients:
Y = number of clients; X = sample every 2 seconds

  3. There was no response time spike. The response time stayed pretty much the same (about 1.2 seconds):

This is the latest hour. The test started at 10:45 and finished at 10:50

The Result

The outcome of the test is brilliant, especially for the response time: it was not affected by the load. It means that the infrastructure, under the same load, did not reach its breaking point.

For completeness, here are the graphs of the basic indicators.

The network utilization, as expected, is more balanced (the peak at 10:40 is not related to the test)
Memory stayed stable
At the beginning of the test the files were read from the disk, then served from memory
The CPU stayed above 50% utilization

Conclusion

From the point of view of load balancing, the design is totally fine. The CPU load increases linearly and has not reached its limit. To scale further we should probably increase the number of nodes, and this design allows exactly that (horizontal scaling).

On the other hand, this design ignores the problem of high availability. The IP Virtual Server is blind to the health status of the backend nodes, so it could forward a TCP connection to an unhealthy node. To overcome this, we can choose between two approaches:

  • creating a health check script on the node that acts as Virtual Server. The script should verify the availability of the backend nodes and update the Virtual Server configuration (a minimal sketch follows this list);
  • using VIPs attached to the backend nodes. Each backend node would expose a VIP managed by a keepalived daemon, which moves it to a healthy node when there is a problem. In this configuration the Virtual Server would no longer point to the real backend IPs, but to the backend VIPs.
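
To give an idea of the first option, here is a minimal sketch of such a health check. The VIP, backend IPs and port come from the setup above; the check command, interval and error handling are my own assumptions:

#!/bin/bash
# Minimal IPVS health check (a sketch, not production-ready).
# Removes a backend from the virtual server when it stops answering
# on port 443 and adds it back when it recovers.
VIP=192.168.1.200:443
BACKENDS="192.168.1.10 192.168.1.11 192.168.1.12 192.168.1.13"

while true; do
  for ip in $BACKENDS; do
    if curl -ks --max-time 5 -o /dev/null "https://${ip}/"; then
      # Healthy: make sure it is in the pool (ignore "already exists" errors)
      ipvsadm -a -t "$VIP" -g -r "$ip" 2>/dev/null
    else
      # Unhealthy: take it out of the pool (ignore "not found" errors)
      ipvsadm -d -t "$VIP" -r "$ip" 2>/dev/null
    fi
  done
  sleep 10
done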

I hope this little journey has aroused some interest!


Fabrizio Waldner

Site Reliability Engineer at Google. This is my personal blog, thoughts and opinions belong solely to me and not to my employer.