The Road Ahead

Once your site is launched, it’s definitely time to celebrate. That’s a huge accomplishment! Congratulations!

But now you need to make sure it stays up and running.

There are a few forces fighting against you in this battle:

Your users (via traffic spikes)
Your software (via bit rot)
You (via poor decisions)

The first one is no surprise. After launch, though, the last two are more likely to catch you off guard and take your site down.

Traffic Spikes

During “normal” operation, your site shouldn’t be utilizing 100% of the resources at any level of the stack. Anything that regularly sits above 70% (CPU, RAM, disk, etc.) is something that should be optimized or given additional resources. While the extra resources are essentially wasted in day-to-day operation, you’ll be glad you have the extra buffer when a traffic spike hits.
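As a rough illustration of the 70% guideline, the sketch below flags any resource whose utilization readings sit above the threshold most of the time. The function name, the sample data, and the interpretation of “regularly” as “in at least half of the readings” are all assumptions made for this example; in practice the readings would come from whatever monitoring system you already run.

```python
def resources_needing_attention(samples, threshold=70.0, min_fraction=0.5):
    """Return the names of resources that are regularly above the threshold.

    `samples` maps a resource name to a list of utilization percentages.
    "Regularly" is assumed here to mean at least `min_fraction` of readings.
    """
    flagged = []
    for name, readings in samples.items():
        if not readings:
            continue  # nothing measured, nothing to report
        over = sum(1 for value in readings if value > threshold)
        if over / len(readings) >= min_fraction:
            flagged.append(name)
    return sorted(flagged)


# Hypothetical readings collected during "normal" operation.
utilization = {
    "cpu": [65, 82, 91, 78],
    "ram": [40, 45, 72, 50],
    "disk": [71, 75, 73, 74],
}
needs_attention = resources_needing_attention(utilization)
```

With these made-up numbers, CPU and disk would be candidates for optimization or extra capacity, while the single RAM reading over 70% is treated as noise.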

While some traffic spikes happen out of the blue, others are very predictable. A big marketing push or a strategic partnership, for example, might drive a major influx of traffic to your site, so make sure your business/marketing team is communicating these events to your developers. For the first few, it’s a good idea to have all hands on deck just like you did on launch day. After weathering a few good bursts of traffic, you’ll gain confidence with the platform and be able to predict the outcome of future events without disrupting regular development.

Withstanding your first traffic spike is the true test that you’ve built a system that can scale. If, however, you do run into issues, it’s time to revisit Your War Room: Monitoring the Launch. For sites with very “bursty” traffic patterns, you may also want to go back to A Word on Auto Scaling for a cost-efficient way to handle the peaks.

Bit Rot

As you’ve learned, high-performance sites are built on a towering stack of different services. The ability to pull mature software off the shelf and plug it into your infrastructure is both a blessing and a curse. On one hand, you’re standing on the shoulders of giants, benefiting from years of development and testing. On the other hand, all that software needs to stay patched and up-to-date to avoid security holes or being left behind on unsupported versions. Don’t be surprised if, between the time you started development and the time you launch, at least one part of your stack is already outdated.

Part of the problem is that it’s so easy to postpone this sort of housekeeping. It’s usually work with no immediate benefit to your end-users or your bottom line. Teams tend to get an “if it ain’t broke, don’t fix it” mentality around large pieces of software, but it’s a dangerous path to follow.

While it may be OK to skip a minor version here and there, you don’t want to fall too far behind. If you wait too long, a few small upgrade tasks can pile up into an insurmountable mountain of work, grinding all regular development to a halt. Save your team grief by scheduling regular upgrade cycles in which your dependencies (your OS, major services, and Python libraries) are reviewed and updated.

Tip

When reviewing upgrades, we usually avoid brand new major releases. If any bugs made their way past the release process, you don’t want to be the first one to find them. Give new releases a couple of months to “mature”. Upgrade on the first point release or when enough time has passed for you to be confident there aren’t any major problems.

Poor Decisions

As the developer and maintainer of a large site, you are effectively in constant contact with a loaded pistol pointed at your foot. In other words, you are your own worst enemy when it comes to keeping the site alive and healthy. It’s easy to get lulled into a false sense of security once the system is finely tuned and you are deploying new code frequently. Here are a few common scenarios where you might inadvertently pull the trigger:

Accidentally Flushing the Cache

Restarting your cache or web accelerator during a traffic peak can be catastrophic. It’s like dropping your shield and taking off your helmet in the heat of battle. All that traffic now runs straight to the slow services at the bottom of your stack. If they can’t withstand the abuse while your caches warm back up, the whole site will topple over. This is commonly referred to as dog-piling or a cache stampede [1].

Thankfully, Varnish has a reload operation (service varnish reload on Ubuntu) so you can load a new configuration without losing the in-memory cache. Make sure your deployment tool uses this operation when you deploy configuration changes instead of opting for a hard restart.

For your other caches (memcached/Redis), you can configure them so that portions of the cache can be cleared without a full-fledged flush. The simplest way to do this is to have multiple entries in your CACHES dictionary for the different types of cache (templates, database, sessions, etc.). You use the same server instances for every one, but adjust the KEY_PREFIX [2] to logically separate them. Now you can invalidate an individual cache (all your template fragments, for example) by incrementing the VERSION [3] for that cache.
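A minimal sketch of that settings layout might look like the following. The server address, the helper function, and the choice of the PyLibMCCache backend are assumptions for illustration; substitute whatever backend and servers you actually run. The alias template_fragments is the name Django’s {% cache %} template tag looks for before falling back to default.

```python
# settings.py (sketch): several logical caches sharing the same servers.
MEMCACHED_SERVERS = ["127.0.0.1:11211"]  # assumed address, adjust to your setup


def cache_alias(prefix, version=1):
    """Hypothetical helper that builds one CACHES entry on the shared servers."""
    return {
        "BACKEND": "django.core.cache.backends.memcached.PyLibMCCache",
        "LOCATION": MEMCACHED_SERVERS,
        "KEY_PREFIX": prefix,  # keeps each logical cache in its own namespace
        "VERSION": version,    # bump to orphan every key stored in this cache
    }


CACHES = {
    "default": cache_alias("default"),
    # Raised from 1 to 2: old template fragments are no longer looked up.
    "template_fragments": cache_alias("templates", version=2),
    # Point SESSION_CACHE_ALIAS at this entry to use it for sessions.
    "sessions": cache_alias("sessions"),
}
```

Because Django builds each final key from the prefix, the version, and the key itself, raising VERSION on one entry leaves the other caches untouched. The stale entries are never deleted explicitly; they simply stop being read and are evicted over time.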

These techniques should make the need for a restart a very rare event. If a restart is absolutely necessary, during a software upgrade for example, plan it during a historically low-traffic time frame. Slowly rolling the restarts across each server will help too, ensuring there is only one cold cache at any given time.
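The rolling approach can be sketched in a few lines. Everything here is illustrative: the function name, the five-minute warm-up default, and the restart callable (which would wrap whatever your deployment tool uses to restart a service on a host) are assumptions, not part of any particular tool.

```python
import time


def rolling_restart(hosts, restart, warmup_seconds=300, sleep=time.sleep):
    """Restart one host at a time, pausing so each cache can warm back up.

    `restart` is any callable that restarts the cache on a single host.
    """
    restarted = []
    for index, host in enumerate(hosts):
        restart(host)
        restarted.append(host)
        # Wait before touching the next host; no wait is needed after the last.
        if index < len(hosts) - 1:
            sleep(warmup_seconds)
    return restarted
```

A fixed pause is the simplest possible policy. If your monitoring exposes a cache hit rate, waiting until it recovers before moving on would be a more reliable signal than a timer.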

1: http://en.wikipedia.org/wiki/Cache_stampede
2: https://docs.djangoproject.com/en/dev/ref/settings/#key-prefix
3: https://docs.djangoproject.com/en/dev/ref/settings/#version

Locking the Database

Database locks are a necessary evil, but as your database gets bigger and your site gets more traffic, locks will start taking longer to release and the ramifications (temporarily blocking writes) become more problematic. Two common culprits for long database locks are schema migrations and backups.

During development, South and the built-in migrations in Django 1.7+ make changing your database schema a trivial task. Adding and removing columns only requires a couple of simple commands and might take less than a second on your development machine.

In production, however, you need to be very wary of migrations. A migration on a table with millions of rows could hold a lock for minutes while it adjusts the internal schema. This is bad news if you have users trying to write to that table at the same time.

One of the drawbacks of MySQL is that even a simple ADD COLUMN operation requires a full data copy of the table, completely locking it in the process. Version 5.6 has improved some operations, but this is one of the areas where PostgreSQL beats it hands-down. If you’re stuck on MySQL and running into this issue, read Taking the Pain Out of MySQL Schema Changes [4] by Basecamp, or do what Lanyrd did and make the switch to PostgreSQL [5].

No matter what database you’re on, migrations should always be reviewed and tested on a recent replica of your live data prior to going to production.

Backups are another cause of long database locks. Performing a dump on a sufficiently large database is going to be a time-consuming process. It’s best to take snapshots from a read-only replica of your live database to mitigate this issue.
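As one possible shape for this, assuming PostgreSQL, the sketch below builds a pg_dump command that connects to the replica rather than the primary. The function names, the backup user, and the host and path values are hypothetical; only the pg_dump options themselves (--host, --username, --format, --file) are standard.

```python
import subprocess


def build_replica_dump_command(replica_host, database, output_path, user="backup"):
    """Build a pg_dump invocation that reads from the replica, not the primary."""
    return [
        "pg_dump",
        "--host", replica_host,   # connect to the read-only replica
        "--username", user,       # assumed backup role
        "--format", "custom",     # compressed archive usable by pg_restore
        "--file", output_path,
        database,
    ]


def dump_from_replica(replica_host, database, output_path, run=subprocess.run):
    """Run the dump and raise if pg_dump exits with a non-zero status."""
    command = build_replica_dump_command(replica_host, database, output_path)
    return run(command, check=True)
```

The primary never sees the dump’s long-running read, so writes from your users are not competing with the backup.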

4: http://signalvnoise.com/posts/3174-taking-the-pain-out-of-mysql-schema-changes
5: http://www.aeracode.org/2012/11/13/one-change-not-enough/

Mass Cache Invalidation

While not as bad as flushing the cache altogether, changing the prefix of an important set of cache keys or mass editing objects in the database will trigger a large number of cache misses and can create an influx of queries to the database. Knowing where your most frequently used or “hot” code paths are, and taking extra care with that code, will help you avoid this issue. If you’re unsure about a specific change, do your deploy during a low-traffic period to avoid taking your site down.

Expensive Admin Views

It’s easy to put all your focus on the user-facing side of the site and forget about your administration interface. Building an unoptimized admin view that makes a few thousand database queries isn’t hard to do. But if you have a team of admins hammering on that view when your database is already hot, it can be the straw that breaks the camel’s back.

If you are using a query cache such as johnny-cache, each save in the admin will invalidate all the cached queries for the given table. A flurry of admin activity can trigger mass cache invalidation that will put heavy pressure back on the database as the cache warms up again.

If you find yourself in a situation where admin activity is causing problems, treat it like you would any other Django view as discussed in Where to Optimize.

Expensive Background Tasks

We already discussed pushing slow tasks to background workers or performing them periodically via a cron job. But just because they operate outside of the request/response cycle doesn’t mean you don’t have to worry about them. Poorly optimized, database-intensive tasks can cause unexpected but regular spikes in load. In the worst case, they exceed their scheduled time window and begin to stack on top of each other. Apply the same database optimization techniques we used on your views (Database Optimization) to your background tasks and keep an eye on their performance just like the rest of your stack.
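One common guard against tasks stacking up is to have each run take a non-blocking lock and skip its work if the previous run still holds it. The sketch below does this with an advisory file lock from the standard library. It assumes a Unix-like host, and the helper name and lock path are hypothetical. It prevents overlap only; it does nothing to make the task itself faster.

```python
import fcntl
import os
from contextlib import contextmanager


@contextmanager
def single_instance(lock_path):
    """Yield True if this run holds the lock, False if another run already does."""
    fd = os.open(lock_path, os.O_CREAT | os.O_RDWR, 0o644)
    try:
        try:
            # LOCK_NB makes flock fail immediately instead of waiting in line.
            fcntl.flock(fd, fcntl.LOCK_EX | fcntl.LOCK_NB)
        except BlockingIOError:
            yield False
            return
        try:
            yield True
        finally:
            fcntl.flock(fd, fcntl.LOCK_UN)
    finally:
        os.close(fd)
```

A cron-driven task would wrap its body in `with single_instance("/tmp/nightly_report.lock") as acquired:` and return early when `acquired` is false. Skipped runs are worth logging, since frequent skips mean the task has outgrown its schedule.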

Gradual Degradation

Gradual performance degradation is a silent but swift assassin. Without keeping an eye on your metrics and being diligent about your site’s performance characteristics as you add new features, you can chip away at a once well-tuned site until it topples over.

Part of your regular release process should be to watch your performance metrics (discussed in Instrumentation) and look for regressions. If you see your response times or load creeping up, it’s much easier to handle it immediately than try to sift through months of commits to figure out where the problem stems from.
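To make that check concrete, here is a small sketch that compares response-time samples from before and after a release and reports the views that got slower. The function names, the 95th-percentile choice, the 10% tolerance, and the sample data are all assumptions for illustration; the real numbers would come from your instrumentation.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def find_regressions(baseline, current, tolerance=0.10, pct=95):
    """Return views whose pct-th percentile latency grew by more than `tolerance`.

    Both arguments map a view name to a list of response times in milliseconds.
    """
    regressions = {}
    for view, old_samples in baseline.items():
        new_samples = current.get(view)
        if not old_samples or not new_samples:
            continue  # no data to compare for this view
        old = percentile(old_samples, pct)
        new = percentile(new_samples, pct)
        if new > old * (1 + tolerance):
            regressions[view] = (old, new)
    return regressions


# Hypothetical response times (ms) before and after a release.
before = {"home": [100] * 20, "search": [200] * 20}
after = {"home": [105] * 20, "search": [260] * 20}
slower = find_regressions(before, after)
```

Here the search view would be reported, because its 95th percentile moved from 200 ms to 260 ms, while the 5% drift on the home page stays within tolerance.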

Complexity Creep

If you’ve followed along so far, you’re doing a good job of keeping unnecessary complexity out of your software. You’ve deferred many “hard” programming problems to your database, caches, and web accelerators. As your site grows and you encounter new scaling problems, it’s easy to get in the mindset that your site is a unique flower and requires equally unique solutions to keep it alive. It’s fun to build your own tools, but that “not invented here” [6] attitude is dangerous in the long run. You’re better off learning how to leverage Varnish more effectively now than scrapping it in favor of your own tool. The ramifications of that decision are vast:

Training new developers is more costly. You can find a developer with experience using a service like Varnish, but your custom solution will require training every developer that walks in the door.
Developing low-level infrastructure tools pulls development time away from your core product. With a well-supported set of open source services, you effectively have a whole team of developers improving your infrastructure for free.
Writing your own software is not a one-time cost. All software requires ongoing development, testing, documentation, etc.

6: http://en.wikipedia.org/wiki/Not_invented_here

While not as glamorous as the initial build-out, the process of maintaining software, keeping technical debt in check, and maximizing uptime is even more important. It’s a task that favors discipline and patience, so take it as an opportunity to settle in and get your Zen on.