table indexes in rails

Table indexes are sometimes neglected in the rush to fancy RESTful web 2.0 double chocolate milkshake apps. But they are critical for performance. This post is MySQL-oriented, but the ideas apply to other databases too.

let’s take a ride, on the country side

Consider this ordered query. We will alter the index structure both times to make sure the query cache is clear:

>> Topic.count # => 357383
>> ActiveRecord::Migration.add_index :topics, :title
>> benchmark { Topic.find(:all, :limit => 1000, :order => "title asc"); nil }
0.049945 seconds

Pretty quick for my Mac Mini. Now, we’ll drop the index:

>> ActiveRecord::Migration.remove_index :topics, :title
>> benchmark { Topic.find(:all, :limit => 1000, :order => "title asc"); nil }
1.185544 seconds (2273% change)

Clearly this is something you might want (note, benchmark method is here). But before you go adding an index to every possible column, it is important to understand what they do.

what is an index?

A table index is like a real index in a book or a magazine. Imagine you have a copy of Cosmopolitan, and you want to know which articles were written by “Justin Timberlake”. You have some choices: if you will never need to find out this information again, you can just flip through every page, in order, looking at the byline. But if this is a common search (of course it is!), you can write down on a separate piece of paper the bylines of every article in alphabetical order, and note what page each article is on. Then you can just use that paper the next time the search comes up, and if you just need to know the page you don’t even have to open the real magazine itself.

What’s the cost to this? Each time another article is added to the magazine, in this universe of dynamic magazines, you can’t just slip it into the right spot. You also have to add its byline to the big list of bylines off to the side. As you can imagine, though, since the list of bylines is already in order it’s not really a big deal. You do have to have enough desk space to keep the extra list, though.

what they can’t do

An index won’t help you when you need to know partial fields—usually. If you want articles by “Timberlake” and your index is organized like “Timberlake, Justin”, then you can just glance at half of the information and the index still helps. But you want articles by “Justin” then you have to look at every article again, unless you have a separate index for first names. So a regular index won’t help for fulltext searches unless you index every word separately (don’t do that; it would massively slow down your inserts).

rules of thumb

Primary keys are indexed by default, because indexing is also (usually) how the database enforces UNIQUE constraints. But for other fields, you usually want to index:

  • Any field used in a JOIN—usually foreign keys (for Rails, this means fields ending in _id). The :include ActiveRecord key generates JOIN clauses.
  • Any field used in an ORDER BY clause, for instance, with an ActiveRecord :order key.
  • Any field used in a GROUP BY clause, for instance, with the ActiveRecord :group key.
  • Any field used in a validates_uniqueness_of check, unless it’s really big.

What about multi-column indexes? An index is like a treasure map to a row, and sometimes you need to take multiple steps. If you have a query that uses multiple fields to hone in on some particular set of records, you can add a multi-column index:

add_index :tagging, [:item_id, :item_type]

In this example from real life, we have a polymorphic association. An optimized join to the target table, such as that generated by has_many_polymorphs, will hinge on two columns. Indexing both at once will gain us some speed, because MySQL can only use one index per query. In creating this index, I started with the column that is most unique. But if you are also :order-ing a lot based on the other column, you could start with that.

have some explaining to do

A multi-column index can also be used in place of partial, less specific indexes, to avoid duplication. For example, an index that goes [item_id, item_type, item_pos] can be used if we just need to ORDER or SELECT based on the item_id and then the item_type, or just the item_id. But it cannot be used if we need to select on the item_type but not the item_id. The specificity has an order. If you’re not sure whether or not a particular query can use your index or not, add the index and then use the MySQL EXPLAIN command:

>> ActiveRecord::Base.logger = Logger.new(STDOUT)
>> Topic.find(:all, :limit => 1000, :order => "title asc"); nil
  Topic Load (0.003195)
  SELECT * FROM topics ORDER BY title asc LIMIT 1000

There’s the generated SQL, so we can now ask for an explanation even from within the Rails console:

>> s = "SELECT * FROM topics ORDER BY title asc LIMIT 1000"
>> puts `echo "EXPLAIN #{s}" | mysql -u root app_development`
id select_type table  type  possible_keys key  key_len ref  rows   Extra
1  SIMPLE      topics ALL   NULL          NULL NULL    NULL 357383 Using filesort

And again, but with the index added:

>> ActiveRecord::Migration.add_index :topics, :title
>> puts `echo "EXPLAIN #{s}" | mysql -u root app_development`
id  select_type table  type  possible_keys key                     key_len ref  rows   Extra
1   SIMPLE      topics index NULL          index_topics_on_title   257     NULL 357383

See that key column? For MySQL, key and index mean the same. In the first EXPLAIN, we see that no key is being used. Then we add the index, and yes, our index gets used. Good news.

The behavior of the possible_keys column doesn’t seem to match what I read in the docs. Maybe someone can clear this up for me.

fly away

That’s about the size of it. To find potential index points, watch your MySQL slow query log or your Rails production.log for slow requests. Benchmark when you’re unsure of a decision, and beware the query cache when you do. And be ready for some pleasant performance gains.

4 responses

  1. If you’re interested in seeing which queries use an index, and what index they are using you should check out the query_analyzer plugin.

    It’ll run an EXPLAIN on each query after its run and display a chart of what tables were involved in the queries, and what indexes from each table were used.

    In MySQL you can also use the log-slow-queries configuration variable, which will tell you what queries take longer than 10 seconds to run. I often pair this with long_query_time, which I set to a low value, like 1 or 2 to log all the queries that take longer than 1 or 2 seconds.

    Using log-queries-not-using-indexes configuration variable you can also log queries that, while they may usually take less time than long_query_time, don’t use any indexes at all.

    Logging slow and non-indexed queries is a great way to identify candidate queries in need of indexes.

  2. One of the greatest joys (and greatest self-pants-kickings) is when you remember you don’t have an index on your table and then watch your page load times drop from 10 seconds to .10 seconds.

  3. Thank you for the great article, it helped me achieve a similar speed-up Yurek mentions.

    Evan, I guess the strange behavior of the possible_keys column in the last case might be explained by this:

    “It is possible that key will name an index that is not present in the possible_keys value. This can happen if none of the possible_keys indexes are suitable for looking up rows, but all the columns selected by the query are columns of some other index.”

    So maybe that even goes for having a NULL value in the possible_keys column. I don’t fully understand myself how this is possible but at least the doc covers this case, I think.

    (taken from http://dev.mysql.com/doc/refman/5.0/en/using-explain.html)