Last week Vitaly and I migrated BotBot.me to new servers and also launched a redesign of the user account section. You can now support us by becoming a subscriber for $3/month and even log personal channels for $2/month. If you are curious, check it out here.

In this post I'll share the tactics we used to migrate the service with minimal interruption.

BotBot.me receives less traffic than most of our customers' sites, but it is made up of multiple services and collects hundreds of messages per minute, which leads to some interesting challenges. The main services that make up BotBot.me are:

  • Redis
  • PostgreSQL
  • Go IRC client
  • Python plugins
  • Django web interface

A quick read of the architecture docs will help you understand how all these services plug together. We use SaltStack to configure our servers. I am not going to present the detailed configuration of each service but instead explain the strategy we followed for the migration.

Reverse Proxy

We use Nginx for TLS termination, serving static assets, and finally reverse proxying to the Django app.

To securely connect our legacy system to the new server, we used autossh to create SSH tunnels between the two. Here is an example for the Redis service:

# /etc/init/autossh_redis.conf
# autossh startup Script
description "autossh daemon startup"
start on net-device-up IFACE=eth0
stop on runlevel [01S6]
respawn
respawn limit 5 60 # respawn max 5 times in 60 seconds
env AUTOSSH_PIDFILE=/var/run/autossh_redis.pid
env AUTOSSH_POLL=60
env AUTOSSH_FIRST_POLL=30
env AUTOSSH_GATETIME=0
env AUTOSSH_DEBUG=1
exec autossh -2 -M 20000 -C -N autossh@legacy_server -L 6379:localhost:6379 -i /root/autossh_id_rsa

The Django web app is served by uWSGI. Here is its autossh config:

# /etc/init/autossh_uwsgi.conf
# autossh startup Script
description "autossh daemon startup"
start on net-device-up IFACE=eth0
stop on runlevel [01S6]
respawn
env AUTOSSH_PIDFILE=/var/run/autossh_uwsgi.pid
env AUTOSSH_POLL=60
env AUTOSSH_FIRST_POLL=30
env AUTOSSH_GATETIME=0
env AUTOSSH_DEBUG=1
exec autossh -2 -M 30000 -C -N autossh@legacy_server -L 8080:localhost:8080 -i /root/autossh_id_rsa
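
Before relying on tunnels like these, it can help to confirm that their local ends accept connections. The following is a minimal sketch, not part of our actual setup: the helper name `port_is_open` is hypothetical, and the ports are the ones forwarded in the two Upstart jobs above. Note that it only confirms something is listening on the local end; it does not prove the service on legacy_server is healthy.

```python
import socket


def port_is_open(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and unreachable hosts.
        return False


if __name__ == "__main__":
    # Local ends of the tunnels defined in the two Upstart jobs above.
    for name, port in (("redis", 6379), ("uwsgi", 8080)):
        state = "up" if port_is_open("127.0.0.1", port) else "DOWN"
        print(f"{name} tunnel on port {port}: {state}")
```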

Then we configured Nginx on new_server to run in read-only mode (i.e., only accept GET requests) and pointed it to the SSH-tunneled uWSGI instances running on legacy_server.
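
We enforced read-only mode in Nginx, but the same idea can be expressed at the application layer. The sketch below is a hypothetical illustration, not the configuration we ran: `ReadOnlyMiddleware` is an invented name, and the choice of a 405 response is an assumption. It wraps any WSGI app and rejects every method except GET.

```python
class ReadOnlyMiddleware:
    """WSGI middleware that only lets through the allowed HTTP methods."""

    def __init__(self, app, allowed=("GET",)):
        self.app = app
        self.allowed = frozenset(allowed)

    def __call__(self, environ, start_response):
        method = environ.get("REQUEST_METHOD", "GET")
        if method not in self.allowed:
            # Assumption: answer with 405 and advertise what is allowed.
            body = b"The site is temporarily read-only.\n"
            start_response(
                "405 Method Not Allowed",
                [
                    ("Content-Type", "text/plain"),
                    ("Content-Length", str(len(body))),
                    ("Allow", ", ".join(sorted(self.allowed))),
                ],
            )
            return [body]
        # Allowed methods are passed to the wrapped application untouched.
        return self.app(environ, start_response)
```

In a Django project this would wrap the WSGI application object; doing it in Nginx instead has the advantage that write requests never reach the app server at all.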

DNS

We lowered the DNS TTL to the minimum our provider allows (5 minutes) a few days before D-day to ensure changes would propagate as quickly as possible. When we switched the DNS to point to new_server, traffic started to flow to it without engaging the full stack (Redis and uWSGI were still tunneled to legacy_server). The tunnels introduced marginal extra latency, but it was barely noticeable since the two servers were not too far apart geographically. Using the new Nginx instance to switch traffic between the legacy and new servers gave us full control over exactly which server users would hit, rather than waiting on DNS changes to propagate.

Database

When we made the switch, we needed to ensure that no processes were writing to our legacy database. For BotBot.me, we stopped logging into PostgreSQL but continued to collect the logs, temporarily storing them in our message bus, Redis. Due to the way BotBot.me is architected, the web UI was still functional in read-only mode and people continued to receive real-time updates in their browser. With database writes on hold, we could safely dump the legacy database and restore it on the other side of the fence.
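
The pattern of pausing database writes while still collecting messages can be sketched as follows. This is a simplified illustration, not BotBot.me's actual code: `BufferedLogger` is a hypothetical name, an in-memory `deque` stands in for the Redis queue, and a plain list stands in for PostgreSQL.

```python
from collections import deque


class BufferedLogger:
    """Queue incoming messages and write them to storage only when enabled."""

    def __init__(self, write_to_db):
        self.queue = deque()            # stand-in for a Redis list
        self.write_to_db = write_to_db  # callable that persists one message
        self.writes_paused = False

    def receive(self, message):
        """Always enqueue first, so nothing is lost while writes are paused."""
        self.queue.append(message)
        if not self.writes_paused:
            self.drain()

    def drain(self):
        """Flush queued messages in arrival order; return how many were written."""
        drained = 0
        while self.queue:
            self.write_to_db(self.queue.popleft())
            drained += 1
        return drained


if __name__ == "__main__":
    db = []                             # stand-in for the database
    logger = BufferedLogger(db.append)
    logger.receive("before migration")
    logger.writes_paused = True         # dump/restore window starts
    logger.receive("during migration")
    print(len(db), len(logger.queue))
    logger.writes_paused = False        # new database is ready
    logger.drain()
    print(len(db), len(logger.queue))
```

Because every message goes through the queue first, pausing the consumer only delays writes; nothing is dropped as long as the queue itself stays up.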

Finalizing the switch over

With the data moved over to the new infrastructure, we configured Nginx to send traffic to the uWSGI instance on new_server and started up the plugins to drain the logs accumulated on the legacy_server Redis instance. Once that process completed, we simultaneously:

  • shut down the IRC client on legacy_server
  • started the IRC client on new_server
  • killed the Redis tunnel on new_server
  • started the local Redis instance on new_server

Et voilà, we moved a live site with minimal service interruption (only a short period of read-only mode). Due to IRC network flood control, we lost a minute or two of IRC messages while the bots PART/JOIN over a hundred channels, but this was an acceptable loss for us. If you haven't already, check out BotBot.me and let us know what you think.