Brian Luft <brian@lincolnloop.com>
DjangoCon 2010 Portland, OR
Working with Django for a few years.
With Lincoln Loop since 2008.
Clients include National Geographic, PBS, Nasuni, and redbeacon.
Companies and organizations that have been operating for at least a few years have amassed large amounts of content and data.
For most organizations that also means systems and tools that have been around for a while:
--database option
Natural Keys - new feature in Django 1.2
Helps make fixtures much more portable.
Example:
shell_plus, part of django-extensions, autoloads your Django models into the interpreter namespace.
python bin/manage.py shell_plus
From 'auth' autoload: Permission, Group, User, Message
From 'contenttypes' autoload: ContentType
From 'sessions' autoload: Session
From 'sites' autoload: Site
From 'admin' autoload: LogEntry
From 'redirects' autoload: Redirect
From 'south' autoload: MigrationHistory
From 'categories' autoload: Category
From 'content' autoload: Page, Content, Guide, WikiName, WikiLink, Template, Attraction
From 'menus' autoload: Menu, MenuItemGroup, MenuItem
The input cache lets you access and re-evaluate previously entered commands in a flexible manner.
In [5]: _i3
The output cache lets you access the results of previous statements.
In [23]: range(5)
Out[23]: [0, 1, 2, 3, 4]
...
In [30]: _23
Out[30]: [0, 1, 2, 3, 4]
You can suppress output using a ';' at the end of a line.
The output cache prevents Python's garbage collector from reclaiming previous results, which can quickly use up memory. Use the cache_size setting to grow or shrink the cache (0 disables it entirely).
You can easily use the %macro feature to capture previous input lines into a single command. %hist is handy for this.
You can also %store your macros so they are available in the IPython namespace across sessions.
Ever seen this?
SQL Error: 1452: Cannot add or update a child row:
a foreign key constraint fails (`myapp`.`categories`,
CONSTRAINT `fk_categories_categories`
FOREIGN KEY (`parent_id`) REFERENCES `categories` (`id`)
ON DELETE NO ACTION ON UPDATE NO ACTION)
Mix shell commands and Python:
In [1]: !echo "set foreign_key_checks=0; drop table customers" | manage.py dbshell
In [2]: Accounts.objects.all().delete()
In [3]: import sys
In [4]: print "I'm getting distracted"
In [5]: reload(sys.modules['apps.tickets'])
In [6]: %macro resetaccts 1-2
In [7]: %store resetaccts
...
In [45]: resetaccts
In [46]: migrate
Moving data from a big ol' crufty database to your shiny new Django application can be a wedge in agile development processes.
Not all data is created alike.
# migration_settings.py
from settings import *

DATABASES = {}
DATABASES['default'] = {
    'NAME': 'newscompany',
    'ENGINE': 'django.db.backends.mysql',
    'USER': 'joe',
    'PASSWORD': 'schmoe',
}

DATABASES['legacy'] = {
    'NAME': 'newscompany_legacy',
    'ENGINE': 'django.db.backends.mysql',
    'USER': 'root',
    'PASSWORD': 'youllneverguess',
}
DATABASE_ROUTERS = ['apps.legacy.db_router.LegacyRouter',]

INSTALLED_APPS += (
    'apps.legacy',
)

DEBUG = False
Wait...
1 DEBUG = False
??? (With DEBUG = True, Django stores every executed query in django.db.connection.queries, which will steadily eat memory over a long migration run.)
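The settings above reference apps.legacy.db_router.LegacyRouter. The talk doesn't show that class, so here is a minimal sketch of what such a router could look like, assuming the simple rule "models in the legacy app live in the legacy database":

```python
# Hypothetical sketch of apps/legacy/db_router.py. Only the class name
# comes from the settings above; the routing rule is an assumption:
# anything in the 'legacy' app reads from and writes to the legacy DB.
class LegacyRouter(object):

    def db_for_read(self, model, **hints):
        if model._meta.app_label == 'legacy':
            return 'legacy'
        return None  # fall through to the 'default' database

    def db_for_write(self, model, **hints):
        if model._meta.app_label == 'legacy':
            return 'legacy'
        return None

    def allow_relation(self, obj1, obj2, **hints):
        # Don't allow relations that span the two databases.
        return obj1._state.db == obj2._state.db
```

Returning None means "no opinion", so other routers (or the default behavior) take over for non-legacy apps.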
What about South?
"I wouldn't use South for this type of project." - Andrew Godwin, author of South
We'd like to start pulling legacy Articles into the new application. Running inspectdb has given us this:
# apps/legacy/models.py

class News_Section(models.Model):
    name = models.CharField(max_length=50)
    ...

class News_Topic(models.Model):
    title = models.CharField(max_length=50)
    ...

class News_Article(models.Model):
    section = models.ForeignKey(News_Section)
    topic = models.ForeignKey(News_Topic)
    ...
We've decided that the legacy schema works pretty well here, so we'll only make minor modifications.
# apps/content/models.py

class Section(models.Model):
    name = models.CharField(max_length=50)
    ...

class Topic(models.Model):
    name = models.CharField(max_length=50)
    ...

class Article(models.Model):
    section = models.ForeignKey(Section)
    topic = models.ForeignKey(Topic)
    ...
First we'll migrate Sections
# apps/legacy/migrations/sections.py
from apps.legacy import models as legacy_models
from apps.content.models import Section

for s in legacy_models.News_Section.objects.all():
    section = Section(name=s.name)
    section.save()
Next we handle Topics:
# apps/legacy/migrations/topics.py
from apps.legacy import models as legacy_models
from apps.content.models import Topic

for t in legacy_models.News_Topic.objects.all():
    topic = Topic()

    # map to the new field name
    topic.name = t.title

    topic.save()
We're on a roll now!
For these types of one-to-one mappings it is easy to generalize the operation and reduce a bunch of repetitive code. (Declarative code FTW!)
def map_table(field_map, src, dst):
    for src_col, dst_col in field_map:
        setattr(dst, dst_col, getattr(src, src_col))
    return dst

MAP = (('name', 'name'),
       ('title', 'headline'),
       ('create_date', 'date_created'))

for some_old_object in legacy_models.SomeOldModel.objects.all():
    map_table(MAP, some_old_object, ShinyNewObject()).save()
A pattern like this should work well on tables that aren't focal entities in the schema.
Now, time for the Article.
In this case we have foreign keys to fill so our table-mapping pattern won't get us all the way there. We'll need to account for the relations manually.
for a in legacy_models.News_Article.objects.all():
    article = Article()
    article.headline = a.headline

    # get the Section
    Section.objects.get(...)  # OOPS! How do we know which one?
A snag. We need to look up the Section in the new database that corresponds to the old Article's News_Section. How can we reliably tell which one it is?
We need to be able to tell which row in the legacy DB an object in the new system came from. In addition to keeping the original ID, you may want to preserve other fields even if you don't have a definite plan for them. Reasons in support of:
I'm in favor of sticking these directly on the models, unless there are more than a few extra fields.
Our new Section, Topic, and Article models:
# apps/content/models.py

class Section(models.Model):
    legacy_id = models.IntegerField()
    ...

class Topic(models.Model):
    legacy_id = models.IntegerField()
    ...

class Article(models.Model):
    section = models.ForeignKey(Section)
    topic = models.ForeignKey(Topic)
    legacy_id = models.IntegerField()
    ...
for a in legacy_models.News_Article.objects.all():
    article = Article()
    article.headline = a.headline

    # get the Section
    section = Section.objects.get(legacy_id=a.section.id)
    article.section = section

    # get the Topic
    topic = Topic.objects.get(legacy_id=a.topic.id)
    article.topic = topic
    article.save()
OK, so we're getting close. Now we just rinse and repeat with the rest of the tables in the DB.
Turns out the legacy system contains 75,000 Articles.
By the way, we haven't put much attention into the 3,000,000 user comments, the 700,000 user accounts, the user activity stream, the media assets, and a few other things.
Also, the Articles mapping is going to need work because there are different article "types" that were shoehorned into the system.
Let's look at our Article migration again:
for a in legacy_models.News_Article.objects.all():
    ...
How big a QuerySet can we actually handle? Beyond a few thousand objects, things might get dicey. (Don't try this at home.)
We can switch to QuerySet.iterator(), which avoids caching all the results in memory:
for a in legacy_models.News_Article.objects.all().iterator():
    ...
Memory crisis averted! (Not really...)
But wait, now we're making at least one query per News_Article:
for a in legacy_models.News_Article.objects.all().iterator():
    ...
75,000 article queries
for a in legacy_models.News_Article.objects.all().iterator():
    article = Article()
    article.headline = a.headline

    # get the Section
    section = Section.objects.get(legacy_id=a.section.id)
    article.section = section
75,000 section queries
for a in legacy_models.News_Article.objects.all().iterator():
    article = Article()
    article.headline = a.headline

    # get the Section
    section = Section.objects.get(legacy_id=a.section.id)
    article.section = section

    # get the Topic
    topic = Topic.objects.get(legacy_id=a.topic.id)
    article.topic = topic
    article.save()
75,000 topic queries
At 5 ms of connection latency, those 150,000 lookups cost 750 sec = 12.5 minutes in network latency alone.
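The back-of-the-envelope math behind that figure, counting one Section lookup and one Topic lookup per article:

```python
articles = 75000
lookups_per_article = 2   # one Section query + one Topic query
latency = 0.005           # 5 ms of round-trip latency per query

total_seconds = articles * lookups_per_article * latency
total_minutes = total_seconds / 60
print(total_seconds, total_minutes)  # 750.0 12.5
```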
Neither option is very attractive as we deal with large tables.
What should guide our decision making?
Don't worry about getting too exotic until you've maxed out other options. A well designed job system will give you a ton of mileage.
Don't Work Blind. Make sure you know how to:
Also helps:
Grab some handy tools:
The ability to cancel the process and leave data in a consistent state.
The ability to restart the process from a specific point.
The ability to record how long a job takes.
The ability to record what was done, and what went wrong.
The ability to run a job against a single row, a range of rows, or a single table.
The ability to have the migration ignore errors (logging them, of course) or stop on any exception.
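A minimal sketch of a job runner with the cancel/restart properties above. Every name here is illustrative (the talk doesn't show this code); the checkpoint is just a JSON file mapping job name to the last completed offset:

```python
import json
import os

class ResumableJob(object):
    """Run work in batches, recording the last completed offset so the
    job can be cancelled and restarted from where it left off."""

    def __init__(self, name, state_path='job_state.json'):
        self.name = name
        self.state_path = state_path

    def last_position(self):
        if os.path.exists(self.state_path):
            with open(self.state_path) as f:
                return json.load(f).get(self.name, 0)
        return 0

    def checkpoint(self, position):
        state = {}
        if os.path.exists(self.state_path):
            with open(self.state_path) as f:
                state = json.load(f)
        state[self.name] = position
        with open(self.state_path, 'w') as f:
            json.dump(state, f)

    def run(self, rows, process, batch_size=1000):
        # Resume after the last committed batch; a fresh run starts at 0.
        start = self.last_position()
        for i in range(start, len(rows), batch_size):
            for row in rows[i:i + batch_size]:
                process(row)
            self.checkpoint(i + batch_size)
```

Timing and error logging would hang off the same loop; the key property is that killing the process never loses more than one batch of work.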
One more look at our naive first stab at it:
for a in legacy_models.News_Article.objects.all():
    article = Article()
    article.headline = a.headline

    # get the Section
    section = Section.objects.get(legacy_id=a.section.id)
    article.section = section

    # get the Topic
    topic = Topic.objects.get(legacy_id=a.topic.id)
    article.topic = topic
    article.save()
Problems:
We can mediate between the all-at-once and one-at-a-time extremes of all() and iterator() by fetching rows in fixed-size batches.
Take a guess at a reasonable batch size; 1,000 rows is a reasonable starting point for most situations.
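Sketched in plain Python (with a list standing in for the queryset), the batching idea looks like this; against a real queryset the equivalent would be slicing, e.g. qs[start:start + batch_size]:

```python
def batched(items, batch_size=1000):
    """Yield successive batch_size-sized slices of items."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]

# Each batch is small enough to hold in memory, and there is one
# fetch per batch instead of one per row or one for everything.
batches = list(batched(list(range(10)), batch_size=4))
```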
We can move the work of mapping rows out to a runner script.
from apps.content.models import Topic
from apps.legacy.models import News_Topic
from legacy.migration.runner import MigrationBase

class Migration(MigrationBase):
    model = Topic
    legacy_model = News_Topic

    # legacy field -> new field
    MAP = (('title', 'name'),)
This lets us vary the batch size and resource usage independently of the individual jobs.
from apps.content.models import Topic
from apps.legacy.models import News_Topic
from legacy.migration.runner import MigrationBase

class Migration(MigrationBase):
    model = Topic
    legacy_model = News_Topic

    # legacy field -> new field
    MAP = (('title', 'name'),)

    def process_row(self, row):
        return (row.title.upper(),)
Generalize construction of multi-value INSERT statements
class MigrationBase(object):

    @property
    def column_list(self):
        return ','.join(self.MAP)

    @property
    def values_placeholder(self):
        return 'DEFAULT,' + ','.join(['%s'] * len(self.MAP))

    @property
    def insert_stmt(self):
        return "INSERT INTO %s VALUES (%s)" % (self.model._meta.db_table,
                                               self.values_placeholder)
A little extra work since we need to collect the legacy IDs.
Remember how we needed to look up the related model in the new database using the legacy ID?
for a in legacy_models.News_Article.objects.all():
    article = Article()
    article.headline = a.headline

    # get the Section
    Section.objects.get(...)  # OOPS! How do we know which one?
class Migration(MigrationBase):
    model = Article
    legacy_model = News_Article
    related = [{'model': Section,
                'map_index': 1}]

    MAP = ('title',
           'section_id',
           'legacy_id')

    def process_row(self, row):
        return {'values': [row.title,
                           None,
                           row.id],
                'Section': row.section_id}
Now we can grab the related objects in a batch (one query), apply the correct new IDs in bulk, and preserve our batch INSERT for the new objects.
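The bulk lookup is just a dictionary built from a single query. A plain-Python sketch of the idea, with hardcoded pairs standing in for a values_list() query against the new database:

```python
# One query fetches (legacy_id, new_id) pairs for every Section we need;
# in Django this would be something like
# dict(Section.objects.filter(...).values_list('legacy_id', 'id')).
section_id_map = {10: 1, 11: 2, 12: 3}

# Rows as produced by process_row: new section_id left as None for now.
rows = [{'values': ['Headline A', None, 100], 'Section': 10},
        {'values': ['Headline B', None, 101], 'Section': 12}]

# Apply the correct new IDs in bulk, with no per-row queries.
for row in rows:
    row['values'][1] = section_id_map[row['Section']]
```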
In a real-world example, a job that was taking a few minutes was reduced to less than a second.
The advantage of the atomic-style jobs is that they can run independently.
This means we can use Queue from multiprocessing and run jobs in parallel.
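A minimal sketch of that parallel pattern using a process pool (multiprocessing.Pool rather than a hand-rolled Queue; the worker just stands in for a real migration job):

```python
from multiprocessing import Pool

def run_job(batch_start):
    """Stand-in for one independent migration job: fetch the rows in
    [batch_start, batch_start + 1000), map them, bulk-INSERT them."""
    return batch_start  # report which batch finished

if __name__ == '__main__':
    pool = Pool(processes=4)
    # Fan 10 batches of 1,000 rows out across 4 worker processes.
    finished = pool.map(run_job, range(0, 10000, 1000))
    pool.close()
    pool.join()
```

Because the jobs share no state, swapping Pool for Celery workers later changes the dispatch code, not the jobs.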
If you need big-league performance, replace the local Queue with Celery.
Now we can also run jobs on multiple network nodes and even use multiple copies of the legacy DB for improved read throughput.
We can also write to multiple application DBs for increased write throughput. Merge them at the end.
Use cloud servers (EC2 / Rackspace)
Other lovely things you'll run into:
Don't get stuck in the Django ORM tunnel. This is a very appropriate domain for using alternative approaches.
Turns out the solution is to Talk To Your Teammates!
ChronicDB is a new product with an innovative approach to schema migrations.
Built in Python and C. A free version is available for small databases.
Thank you for your attention.