Building realtime applications is a big change from how we've built websites in the past. Typically, realtime websites require each client to hold open a long-running connection to the server so updates can be pushed down to the client immediately. This is in stark contrast to traditional websites, where the aim of the server is to serve a page and quickly free up the connection for the next request. Taking a server or application that is optimized for short-lived connections and slapping on a realtime component simply doesn't work (for reasons I'll explain below).

We built and maintain two real-world realtime websites: our IRC logger and our team discussion tool, Ginger. We intentionally took two very different approaches in building them. Ginger was built using a two-process approach: one process handles all of the long-running connections while another serves the API and the pages that require server-generated HTML. The IRC logger, on the other hand, does everything in a single Django process. After having them both in production for a while, we've learned a few things about what works and what doesn't.

The Single Server Approach

I'll start with the simpler approach. The IRC logger's realtime connections are handled by the same process that generates the server-side HTML pages. We're using Server-Sent Events via django-sse. SSE is a standardized approach to long-polling and part of the HTML5 spec. Native support in browsers is very good, and unsupported browsers can use it via a JavaScript polyfill. Coupled with XHR, SSE gives you a realtime bi-directional communication channel without having to mess with websockets. There are a number of reasons why you might not want to use websockets (see notes below).
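To make the SSE wire format concrete, here's a framework-free sketch (the `format_sse` helper is mine, not django-sse's API). A Django view would wrap a generator like this in a `StreamingHttpResponse` with `content_type="text/event-stream"`:

```python
import json
import time

def format_sse(data, event=None):
    """Serialize a payload into the SSE wire format: optional
    'event:' line, then 'data:' line, terminated by a blank line."""
    message = f"event: {event}\n" if event else ""
    message += f"data: {json.dumps(data)}\n\n"
    return message

def event_stream():
    # A real stream would block waiting for new messages (e.g. from a
    # queue or Redis pub/sub); this stand-in just emits a heartbeat.
    while True:
        yield format_sse({"time": time.time()}, event="heartbeat")
        time.sleep(5)

# In Django, the view would return:
#   StreamingHttpResponse(event_stream(), content_type="text/event-stream")
```

On the browser side, `new EventSource("/events/")` (or the polyfill) consumes this stream and fires a DOM event per message.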

This setup is beautiful because it makes development simple: only one process to start up and only one codebase to maintain. You start running into problems, however, when you try to take this app from toy code on your laptop to production-ready. If you're thinking about taking this approach, here are a few issues you'll encounter.

You're Going to Run out of Workers

A common way to serve dynamic applications is with Nginx proxying to a handful of workers (WSGI, Rack, etc.) serving your application. Common practice is to run only 4-8 workers behind the proxy. Those workers are often capable of serving a surprising number of requests because they are constantly being freed up and reused. That all changes when you introduce long-running connections. Now you can only handle 4-8 clients using your app at any given time. Not exactly a high-traffic setup.

"No problem" you might say, "I'll use !". Something like Gevent or EventMachine will let you serve more requests with that handful of workers, but introduces new problems.

You're Going to Run out of Database Connections

You've now blown past 4-8 clients and, because most of these connections just sit idle, you can now handle more than 100 at any given time. Is your app ready for that, though? Can your database handle 100 simultaneous connections? Probably not. Now you need to set up a database connection pool. Hopefully you've done that before. If not, maybe you can pull one off PyPI that works (we've had good luck with django-db-geventpool).
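To illustrate what a pool buys you, here's a minimal hand-rolled sketch (not django-db-geventpool's implementation): connections are checked out of a fixed-size queue and returned when done, instead of each request opening a fresh database connection:

```python
import contextlib
import queue

class ConnectionPool:
    """Minimal connection pool sketch: a fixed set of connections
    shared among many clients, bounding load on the database."""

    def __init__(self, connect, size=10):
        # `connect` is any zero-argument factory returning a connection.
        self._connections = queue.Queue(maxsize=size)
        for _ in range(size):
            self._connections.put(connect())

    @contextlib.contextmanager
    def connection(self):
        conn = self._connections.get()  # blocks if the pool is exhausted
        try:
            yield conn
        finally:
            self._connections.put(conn)  # return it for the next caller
```

A production pool also has to handle dead connections, transaction state, and timeouts, which is exactly why pulling in a maintained library beats rolling your own.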

Going Async isn't Free

In Python, switching to gevent is a pip install and a gunicorn flag away. It seems so simple on the surface. But wait: our database driver, psycopg2, isn't green-thread-safe. If you want to reuse connections, you now need to add psycogreen to your stack and make sure it does its monkey-patching early on. Are you sure the rest of your stack works seamlessly with gevent? By going async, you've also made debugging considerably more difficult. I think everybody I've met with real-world gevent experience has a war story about trying to solve some strange deadlock or a traceback being swallowed somewhere in the stack.
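As a sketch of the moving parts, that bootstrap might look something like the following gunicorn config fragment (assuming gevent and psycogreen are installed). The key point is ordering: the patching has to run before psycopg2 or your application code is imported:

```python
# gunicorn_conf.py -- run with: gunicorn -c gunicorn_conf.py myproject.wsgi
# Monkey-patch as early as possible, before anything else is imported.
from gevent import monkey
monkey.patch_all()

from psycogreen.gevent import patch_psycopg
patch_psycopg()  # make psycopg2 cooperate with gevent's green threads

worker_class = "gevent"
workers = 4
worker_connections = 1000  # concurrent connections per worker
```

Get the ordering wrong and things mostly work, until a blocking call quietly stalls an entire worker, which is the kind of bug that's miserable to track down.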

Your Processes Need to Know the Difference

On a traditional web server, you want to kill connections that don't finish within a certain amount of time. This keeps your worker pool available to respond to other requests and protects you from Slowloris attacks. Your realtime connections are exactly the opposite: they need to be held open indefinitely, and if they drop, they should be re-opened immediately. This means that, even though your code is all in the same package, you need to manage different configurations in Nginx, and possibly in your application server as well, to make sure each type of connection is handled correctly.
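In Nginx, that split might look something like this (the upstream name and the `/events/` path are placeholders for your own setup):

```nginx
# Normal pages: short timeouts keep workers free and blunt Slowloris.
location / {
    proxy_pass http://django_upstream;
    proxy_read_timeout 30s;
}

# Realtime endpoint: never buffer, never time out mid-stream.
location /events/ {
    proxy_pass http://django_upstream;
    proxy_buffering off;
    proxy_cache off;
    proxy_read_timeout 24h;
    proxy_http_version 1.1;
    proxy_set_header Connection '';
}
```

Two configurations for one codebase is exactly the kind of incidental complexity this approach keeps accumulating.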

You're Putting a Square Peg in a Round Hole

There are lots of great ways to handle many concurrent long-running connections. Node.js and Go were built from the ground-up with this scenario in mind. In the Python world, we have Tornado, Twisted, and others that are much better suited for this role. Django, Rails, and other traditional web frameworks weren't built for this type of workload. While it may seem like the easy route at first, it tends to make your life harder later.

How about the alternative?

A Separate Realtime Process

The approach we took with Ginger separates concerns. Traditional web requests are routed to our Django server while long-running realtime connections are routed to a small Node.js script which holds those connections open. They communicate over a Redis pub/sub channel. This approach is great because it solves most of the issues presented in the single-process approach. Our realtime endpoint is optimized to handle lots of long connections, it doesn't need to talk to the main database, and it uses software designed to make this sort of stuff easy. Unfortunately, it too has a few issues.
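As a sketch of the Django side of that pipe, here's roughly what publishing an update over Redis pub/sub looks like (the channel name and message fields are hypothetical; the client is injected so any redis-py-style object with a `publish` method works):

```python
import json

def publish_update(client, channel, event_type, payload):
    """Publish an update for the realtime process to fan out to
    its connected clients.

    `client` is anything with a redis-py style
    publish(channel, message) method, e.g. redis.Redis().
    """
    message = json.dumps({"type": event_type, "payload": payload})
    client.publish(channel, message)
    return message
```

On the other end, the Node.js script subscribes to the same channel and writes each message out to every open connection; neither process needs to know anything else about the other.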

You Need to Do More Work Upfront

If you're just building a toy app for yourself, this is going to be overkill. Splitting the components up requires a little more thought and planning upfront. It also means getting your development environment up and running is more of a hassle. Tools like foreman or its Python counterpart honcho make this easier, but, again, it's one more thing to manage.

You (Might) Need to Learn Something New

If you've been building traditional websites, chances are you'll be picking up a new framework, or even a new language, to build your realtime endpoint. It will require a basic understanding of asynchronous programming (callbacks, coroutines, etc.). Choosing the right toolkit can make this an easy transition, but it will still be more work than the "just throw gevent at it" solution. For inspiration, read how Disqus replaced four maxed-out Python/gevent servers with one mostly idle Go server in one week.

Your Auth Story Just Got More Complicated

With one process, all your sessions are in the same place. Knowing which requests come from authenticated clients and what resources those clients have access to is all baked in. When you add a different process to the mix, it may not have access to your session storage, or even be able to talk directly to the primary database to check permissions. You either need to do the work to make those possible or come up with an alternate authentication scheme. For Ginger, we generate short-lived tokens in Redis and pass them securely from the server to the client. The client then passes the token back to the realtime endpoint for authentication. See Pusher's docs for another example of how to handle this.
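Here's a simplified sketch of that token flow (the key prefix, TTL, and function names are illustrative, and `store` stands in for a redis-py client, which provides `setex`, `get`, and `delete`):

```python
import secrets

TOKEN_TTL = 30  # seconds; tokens are deliberately short-lived

def issue_token(store, user_id):
    """Django side: mint a random token and stash the user id
    under it with a TTL, then hand the token to the client."""
    token = secrets.token_urlsafe(32)
    store.setex(f"realtime-token:{token}", TOKEN_TTL, user_id)
    return token

def redeem_token(store, token):
    """Realtime endpoint: look up the token and burn it so it
    can't be replayed. Returns the user id, or None if invalid."""
    key = f"realtime-token:{token}"
    user_id = store.get(key)
    if user_id is not None:
        store.delete(key)  # single-use
    return user_id
```

Because both processes already share Redis for pub/sub, this adds no new infrastructure; the realtime endpoint never needs the session store or the primary database.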

What About Single Process, Realtime First?

There's another option here, which basically flips the approach we used for the IRC logger on its head. Instead of trying to cram asynchronous bits (the realtime stuff) into your synchronous framework, you could put your synchronous bits into an asynchronous framework. Options here include Node.js (frameworks: Express, Meteor, Hapi) or Go (frameworks: Revel, Gorilla).

I'm excited about the possibilities here, but the maturity and usability of these libraries isn't anywhere near what you get with a framework like Django or Rails. In addition, their ecosystems are years behind those of more mature languages. You'll be writing more code from scratch without the benefit of the massive open source ecosystems built around Python, Ruby, etc. I don't doubt they'll get there eventually, but for now, I think the trade-off is too great.


If you made it this far, it's probably clear which option I prefer. If I had to do it again, I would take the approach we used with Ginger: separate processes optimized for separate concerns. It may be more work upfront, but it makes everything easier down the road, especially when you need to grow from one server (in the physical or VPS sense) to multiple servers.

How about you? If you're running sites with realtime components in production, I'd love to hear your thoughts and how you manage the processes.

Thanks to Armin Ronacher for reviewing this post for me.

Recommended Reading