When you develop a sizable, content-heavy web site you quickly learn, hopefully not the hard way, that caching is a very important piece of your infrastructure. The database servers are the typical bottleneck in a high-volume web site.
Common wisdom in such cases is to reduce database queries with caching instead of hitting the database on every request. This is a good approach, and even seems relatively simple on the surface, but you will quickly discover that the devil is hidden in the details. There is no one right way to do it.
One of the most famous quotes [1] about Computer Science states it clearly:
There are only two hard things in Computer Science: cache invalidation and naming things.
The best approach is very application dependent; one size does not fit all. What we are going to describe is the result of a succession of trade-offs that worked well for us on a high-traffic online magazine using a large number of open source reusable applications. The challenge was to use the canonical packages without having to fork each of them in order to improve their scalability for our specific needs.
There are many reusable applications that add a caching abstraction layer on top of the Django ORM (johnny-cache, django-cachebot, django-cache-machine). After a fairly complete analysis of most of the contenders we selected johnny-cache.
Johnny-cache acts globally at the project level without having to modify each individual application and it always returns “fresh content”. It particularly shines on web sites that have many more reads than writes and on projects that use many external reusable applications.
This sounds great, doesn’t it? There are downsides, however:
- It works at the level of the request / response cycle, so it is not aware of anything happening outside of it (cron jobs, tasks, etc.).
- It is not clever when it invalidates content: every time you change or add an instance in a model it will invalidate all the queries related to this model [2].
Most of the time this is not as bad as it sounds, but you need to keep it in mind. On a busy web site you’ll need to build some countermeasures to avoid devastating effects. We’ll explore the issues of dog-piling and the thundering herd problem in the next post.
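To make the second downside more concrete, here is a minimal sketch of coarse, per-model invalidation. It is not johnny-cache's code: the class `TableGenerationCache`, the `fake_db` function and the in-memory dictionary standing in for memcached are all hypothetical names made up for this illustration. It assumes one simple way of getting this behavior, a per-table "generation" number baked into every cache key, so that bumping the number orphans every cached query for that table at once.

```python
import hashlib


class TableGenerationCache:
    """Toy query cache that invalidates per table, not per row (illustration only)."""

    def __init__(self):
        self._store = {}        # stands in for memcached
        self._generations = {}  # table name -> current generation number

    def _key(self, table, sql):
        # The generation is part of the key, so old entries become unreachable
        # as soon as the generation changes.
        generation = self._generations.get(table, 0)
        digest = hashlib.sha256(sql.encode("utf-8")).hexdigest()
        return f"{table}:{generation}:{digest}"

    def get_or_compute(self, table, sql, run_query):
        key = self._key(table, sql)
        if key not in self._store:
            self._store[key] = run_query(sql)  # cache miss: hit the database
        return self._store[key]

    def invalidate_table(self, table):
        # Any write to the table bumps its generation.
        self._generations[table] = self._generations.get(table, 0) + 1


db_hits = []


def fake_db(sql):
    # Hypothetical database call that records every query it receives.
    db_hits.append(sql)
    return f"rows for {sql}"


cache = TableGenerationCache()
query = "SELECT * FROM article WHERE id = 1"

cache.get_or_compute("article", query, fake_db)  # miss, goes to the database
cache.get_or_compute("article", query, fake_db)  # served from the cache
hits_before_write = len(db_hits)

cache.invalidate_table("article")                # e.g. someone saved article 2
cache.get_or_compute("article", query, fake_db)  # miss again, although article 1 is unchanged
hits_after_write = len(db_hits)
```

The last call is the interesting one: a write to an unrelated row still sends the query for article 1 back to the database. On a read-heavy site that cost is usually small, but a burst of writes on a busy model empties a lot of cache at once.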
[1] Tim Bray, quoting Phil Karlton
[2] You can find more details about the invalidation policy in the documentation