table indexes in rails

Table indexes are sometimes neglected in the rush to fancy RESTful web 2.0 double chocolate milkshake apps. But they are critical for performance. This post is MySQL-oriented, but the ideas apply to other databases too.

let’s take a ride, on the country side

Consider this ordered query. We will alter the index structure both times to make sure the query cache is clear:

>> Topic.count # => 357383
>> ActiveRecord::Migration.add_index :topics, :title
>> benchmark { Topic.find(:all, :limit => 1000, :order => "title asc"); nil }
0.049945 seconds

Pretty quick for my Mac Mini. Now, we’ll drop the index:

>> ActiveRecord::Migration.remove_index :topics, :title
>> benchmark { Topic.find(:all, :limit => 1000, :order => "title asc"); nil }
1.185544 seconds (2273% change)

Clearly this is something you might want (a stand-in for the benchmark helper is sketched below). But before you go adding an index to every possible column, it is important to understand what they do.
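
A minimal sketch of such a helper, assuming nothing beyond the Ruby standard library:

require 'benchmark'

# Time a block and print the elapsed wall-clock seconds.
def benchmark
  puts "#{Benchmark.realtime { yield }} seconds"
end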

what is an index?

A table index is like a real index in a book or a magazine. Imagine you have a copy of Cosmopolitan, and you want to know which articles were written by “Justin Timberlake”. You have some choices: if you will never need to find out this information again, you can just flip through every page, in order, looking at the byline. But if this is a common search (of course it is!), you can write down on a separate piece of paper the bylines of every article in alphabetical order, and note what page each article is on. Then you can just use that paper the next time the search comes up, and if you just need to know the page you don’t even have to open the real magazine itself.

What’s the cost to this? Each time another article is added to the magazine, in this universe of dynamic magazines, you can’t just slip it into the right spot. You also have to add its byline to the big list of bylines off to the side. As you can imagine, though, since the list of bylines is already in order it’s not really a big deal. You do have to have enough desk space to keep the extra list, though.

what they can’t do

An index won’t help you when you need to match partial fields—usually. If you want articles by “Timberlake” and your index is organized like “Timberlake, Justin”, then you only have to glance at the first half of each entry, and the index still helps. But if you want articles by “Justin”, you have to look at every article again, unless you have a separate index for first names. So a regular index won’t help for fulltext searches unless you index every word separately (don’t do that; it would massively slow down your inserts).
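
In Rails terms, the same limitation looks like this (hypothetical Article model):

# Can use a plain index on byline: the pattern anchors at the start.
Article.find(:all, :conditions => ["byline LIKE ?", "Timberlake%"])

# Cannot use it: the leading wildcard forces a full scan.
Article.find(:all, :conditions => ["byline LIKE ?", "%Justin"])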

rules of thumb

Primary keys are indexed by default, because indexing is also (usually) how the database enforces UNIQUE constraints. But for other fields, you usually want to index the following (a sample migration appears after the list):

  • Any field used in a JOIN—usually foreign keys (for Rails, this means fields ending in _id). The :include ActiveRecord key generates JOIN clauses.
  • Any field used in an ORDER BY clause, for instance, with an ActiveRecord :order key.
  • Any field used in a GROUP BY clause, for instance, with the ActiveRecord :group key.
  • Any field used in a validates_uniqueness_of check, unless it’s really big.
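
For instance (a sketch; the tables and columns are made up):

class AddCommonIndexes < ActiveRecord::Migration
  def self.up
    add_index :comments, :post_id # foreign key used in JOINs
    add_index :topics, :title     # used in ORDER BY
    add_index :users, :login      # backs validates_uniqueness_of :login
  end

  def self.down
    remove_index :comments, :post_id
    remove_index :topics, :title
    remove_index :users, :login
  end
end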

What about multi-column indexes? An index is like a treasure map to a row, and sometimes you need to take multiple steps. If you have a query that uses multiple fields to home in on some particular set of records, you can add a multi-column index:

add_index :taggings, [:item_id, :item_type]

In this example from real life, we have a polymorphic association. An optimized join to the target table, such as that generated by has_many_polymorphs, will hinge on two columns. Indexing both at once will gain us some speed, because MySQL can only use one index per query. In creating this index, I started with the most selective column (the one with the most distinct values). But if you are also :order-ing a lot based on the other column, you could start with that.

have some explaining to do

A multi-column index can also be used in place of partial, less specific indexes, to avoid duplication. For example, an index on [item_id, item_type, item_pos] can be used if we just need to ORDER or SELECT based on the item_id and then the item_type, or on the item_id alone. But it cannot be used if we need to select on the item_type but not the item_id: the specificity has an order, left to right.
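
Concretely, with that index in place (hypothetical finds; only the first can use it):

Tagging.find(:all, :conditions => { :item_id => 1, :item_type => "Post" }) # leftmost prefix: usable
Tagging.find(:all, :conditions => { :item_type => "Post" }) # skips the leftmost column: full scan

If you’re not sure whether a particular query can use your index, add the index and then use the MySQL EXPLAIN command: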

>> ActiveRecord::Base.logger = Logger.new(STDOUT)
>> Topic.find(:all, :limit => 1000, :order => "title asc"); nil
  Topic Load (0.003195)
  SELECT * FROM topics ORDER BY title asc LIMIT 1000

There’s the generated SQL, so we can now ask for an explanation even from within the Rails console:

>> s = "SELECT * FROM topics ORDER BY title asc LIMIT 1000"
>> puts `echo "EXPLAIN #{s}" | mysql -u root app_development`
id select_type table  type  possible_keys key  key_len ref  rows   Extra
1  SIMPLE      topics ALL   NULL          NULL NULL    NULL 357383 Using filesort

And again, but with the index added:

>> ActiveRecord::Migration.add_index :topics, :title
>> puts `echo "EXPLAIN #{s}" | mysql -u root app_development`
id  select_type table  type  possible_keys key                     key_len ref  rows   Extra
1   SIMPLE      topics index NULL          index_topics_on_title   257     NULL 357383

See that key column? For MySQL, key and index mean the same. In the first EXPLAIN, we see that no key is being used. Then we add the index, and yes, our index gets used. Good news.

The behavior of the possible_keys column doesn’t seem to match what I read in the docs. Maybe someone can clear this up for me.

fly away

That’s about the size of it. To find potential index points, watch your MySQL slow query log or your Rails production.log for slow requests. Benchmark when you’re unsure of a decision, and beware the query cache when you do. And be ready for some pleasant performance gains.

log system security events to twitter

Ok, so cdcarter’s server got compromised just recently. The attacker deleted a bunch of logs, so we don’t really know what went down.

you have got to be kidding

No, I’m not. There’s a really nice twitter gem all ready to go. So, here’s the core of our app, just 13 lines:

twit = Twitter::Base.new config[:user], config[:password] # config comes from /etc/twist.yml, below
tail = "tail -n 200 /var/log/auth.log | grep -v '(pam_unix)'" # optionally filter some events

while (sleep 1)
  if @last_msg
    msgs = `#{tail}`.split("\n")
    msgs = msgs[msgs.index(@last_msg)+1..-1] if msgs.include? @last_msg
    msgs.each {|msg| twit.update msg }
    @last_msg = msgs.last unless msgs.empty?
  else
    twit.update "Twist rebooted at #{Time.now}"
    @last_msg = `#{tail}`.split("\n").last
  end
end

twitter + system = twist

Check out the code from the Fauna repository:

svn co svn://rubyforge.org/var/svn/fauna/twist

Create a configuration file, /etc/twist.yml:

---
:sysuser: localuser
:user: twitteruser
:password: twittersecret

If the :sysuser is not root, you will have to make sure /var/log/auth.log as well as /etc/twist.yml are readable by the :sysuser. I recommend using a non-privileged :sysuser and adding it to a group that has permissions to read those two files.
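
For example, on Ubuntu /var/log/auth.log is already readable by the adm group, so one hypothetical setup is:

sudo adduser localuser adm
sudo chgrp adm /etc/twist.yml
sudo chmod 640 /etc/twist.yml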

Symlink twist.rb into /etc/init.d, set its permissions, and install it as a boot service (Ubuntu specific, your requirements may vary):

cd /etc/init.d/
sudo ln -s /home/you/twist/twist.rb twist.rb
sudo chown root twist.rb
sudo chgrp root twist.rb
sudo /usr/sbin/update-rc.d twist.rb defaults

Run /etc/init.d/twist.rb start to start it immediately.

but this isn’t secure!

It’s only insecure if you leave your twitterings set to public, since you can use the friend system as a privacy control. Also, it is certainly possible for an attacker to notice your Twist setup, see the password, and go into Twitter and start deleting entries. But if you have your personal Twitter user “follow” the server user, and receive updates by SMS, then you will have a permanent, real-time log on your phone.

It’s not supposed to be the end-all of server monitoring. But it’s pretty fun.

how to find the most popular tags

As a faithful follower of the growing up your acts_as_taggable article (below), you now have a pretty nice custom tagging system for your app. But how do you find which tags are the most popular? After all, your tag cloud is waiting!

late for the sky

Here’s a find_popular class method for you:

class Tag < ActiveRecord::Base
  # [snip]
  def self.find_popular(args = {})
    find(:all, :select => 'tags.*, count(*) as popularity',
      :limit => args[:limit] || 10,
      :joins => "JOIN taggings ON taggings.tag_id = tags.id",
      :conditions => args[:conditions],
      :group => "taggings.tag_id",
      :order => "popularity DESC"  )
  end
end

usage note

Note that because of the count(*) as popularity clause, the returned tags will have an extra field with the number of tagged objects:

>> Tag.find_popular[0]
=> #<Tag:0x3790d18 @attributes={
  "name"=>"love",
  "id"=>"2",
  "popularity"=>"3"}
>> _.popularity
=> "3"

Pretty neat.

conclusion

Implementing this will become easier once I add custom finders to the polymorphic collection, but for now, it’s not so painful to just drop to SQL.

Thanks to defunkt for working some of this out a while back and to dbii for bugging me about an explanation.

making a web app rewrite work

One of my friends was recently involved in a very large Rails rewrite project that failed pretty abjectly. Meanwhile, at CNET, we just deployed a total overhaul of a crufty old Rails app, replacing it with a new, pleasant, better-designed app. And Chad Fowler has been writing about big rewrites for a while. It must be in the water.

move move move

It is critical, during a rewrite, to get new code into production as fast as possible. Plan for this. Immediate, visible results are more persuasive and motivating than any projected maintenance benefit or performance increase. And deploying code will expose bad design decisions right away, rather than months down the road.

But this means we have to suffer some pain up front to keep the project flying, because the new application will have to work in parallel with the old one for quite some time. Our new app will have to maintain data compatibility with the old. We will have to connect to the same database as the old…and use the same schema.

noooo, our fancy new domain model, noooo

RDBMSes are separate from applications for a reason. We need to use this separation to our advantage. Instead of trying to maintain API compatibility while fragmenting the data store, we can use the database as the interoperability layer, and focus on deploying new features. Old site written in PHP? Doesn’t matter. The database still talks only SQL.

Using the old schema will not break our fancy new domain model. It is relatively straightforward, if tedious, to graft a new domain model onto a partially orthogonal legacy schema.

models, unite!

Say we run EBay (hey, we’re rich!). But money != developer_happiness, and our old domain model, from 1896, has buyers and sellers. We just want users now. How do we treat buyers and sellers as a single model? They might even live in separate databases!

How about:

class Buyer < ActiveRecord::Base
  set_table_name "tbuytslq"
  alias_attribute :name, :buyer_name
end

class Seller < ActiveRecord::Base
  # MySQL can jump across databases, see below
  set_table_name "salsdb.seljhnlp"
  alias_attribute :name, :nm
end

What’s going on? Besides some legacy table configuration, we have to first encapsulate the old domain so that we can merge it into the new one. The alias calls make the legacy models expose a unified API. We can then work with them in our new model without pain. (If the tables contained the same information, but organized it differently, e.g., separate first_name and last_name fields, some helper methods would take care of the conversion in and out.)
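
For instance, if the seller table split the name into two columns, the alias_attribute would give way to a pair of conversion methods (hypothetical column names):

class Seller < ActiveRecord::Base
  set_table_name "salsdb.seljhnlp"

  # present the two legacy columns as a single name field
  def name
    "#{first_name} #{last_name}".strip
  end

  def name=(value)
    self.first_name, self.last_name = value.to_s.split(" ", 2)
  end
end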

pure magic?

Now for something cool:

class User # no inheritance

  def self.method_missing *args
    sources = [Buyer, Seller]
    if args.first.to_s =~ /^(find|create|new)/
      begin
        self.new(sources.shift.send(*args)) # try each legacy model in turn
      rescue
        retry unless sources.empty?
        raise
      end
    else
      raise "Not supported: #{args.first}"
    end
  end

  attr_accessor :proxy

  def initialize(_proxy)
     @proxy = _proxy
  end

  def method_missing *args
    proxy.send(*args)
  end

  # new business logic here

end

We can delegate the ActiveRecord operations of our new model to multiple old models, and keep the new business logic unified. Craziness! (We might want to add some id munging, so that for example find("b10") returns a Buyer-User, and find("s10") returns a Seller-User.)
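
A sketch of that munging, replacing the generic find dispatch above (hypothetical):

class User
  # "b10" is buyer 10; "s10" is seller 10
  def self.find(munged_id)
    case munged_id.to_s
    when /^b(\d+)$/ then new(Buyer.find($1))
    when /^s(\d+)$/ then new(Seller.find($1))
    else raise ActiveRecord::RecordNotFound
    end
  end
end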

If we need to add fields to User which are not present in both Buyer and Seller, we will have to give it a table (probably in yet another database) and descend from ActiveRecord. The proxy trick will need updating:

class User < ActiveRecord::Base

  belongs_to :buyer
  belongs_to :seller

  def method_missing *args
    (buyer or seller).send(*args)
  end

  # new business logic here

end

Some method chaining and other hackery will be required to make find() work when we haven’t yet built the aggregate instance for the legacy instance we are interested in, but that will depend on the specifics of the schemas.

Regarding those associated legacy instances, MySQL can make table joins and generally act sane across disparate databases, as long as they run in the same server. I don’t know about Postgres. Oracle can make table joins across disparate servers (damn; something good in Oracle), but might require some fudging to get working. If it’s impossible to get the right joins together to make an :include work in our new aggregate model, we will have to rely on aggressive caching to handle performance issues.

Of course, you can go the other way, too, and split an overly complicated model into multiple simpler ones. Just set the table name and use undef_method to mask off irrelevant fields.
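
A sketch of that splitting direction (hypothetical table and models; here the mask overrides the accessor to raise, which works regardless of when Rails generates the attribute methods that undef_method would remove):

class BillingProfile < ActiveRecord::Base
  set_table_name "legacy_accounts"

  # mask off a field that belongs to the other model
  def shipping_address
    raise NoMethodError, "masked off in BillingProfile"
  end
end

class ShippingProfile < ActiveRecord::Base
  set_table_name "legacy_accounts"

  def card_number
    raise NoMethodError, "masked off in ShippingProfile"
  end
end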

back to work

Now we can set up our Apache routes to forward the appropriate requests to the new app, copy over our CSS, and start rolling out new features while leaving the old app running and in place. When we have completely replaced some functionality in the old app, we can flip a switch in Apache and let the new app take over.

And if we need to use business logic in the old app from the new app, we can make POST requests to our (old) self. I’m not even kidding. This is what the web is for.
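
Something like this, with a hypothetical endpoint in the old app:

require 'net/http'
require 'uri'

# let the legacy app keep handling order closing for now
Net::HTTP.post_form(URI.parse("http://localhost/legacy/orders/close"),
  "order_id" => "10")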

Of course a real-life scenario will be much more complicated than this, and the code above is somewhere between pseudo-code and something we can deploy. But we are on our way.

Will some annoying changes have to be made to both apps for interoperability purposes? Probably. Will it be slower than a pristine, solo deployment? Definitely. But it will work, and it will work now, and someday when the old application is completely replaced, the database can be migrated and the legacy schema interactions can be dropped. The users won’t notice a thing—they’ve been using the rewrite code since forever.

Baby steps; always baby steps.

update

There are some more obscure features in Rails that may help you with your legacy mapping. Check the composed_of field aggregator as well as delegate.
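
Quick examples of both (names made up):

Address = Struct.new(:street, :city) # plain value object

class User < ActiveRecord::Base
  belongs_to :buyer

  # aggregate two legacy columns into one value object
  composed_of :address, :class_name => "Address",
    :mapping => [%w(addr_street street), %w(addr_city city)]

  # forward name straight to the legacy model
  delegate :name, :to => :buyer
end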

polymorphs 21 harder faster stronger

update

old content

You can now skip class reloading in model classes referenced by has_many_polymorphs in development mode.

Normally if you modify a model, Rails will unload and reload the file. This means the parent model has to build the polymorphic relationship all over again, which can be slow. But if the child classes aren’t unloaded in the first place, we avoid this problem. Of course while you are working on the models themselves you need to switch the caching off.

how to

In config/environments/development.rb:

config.after_initialize do
  config.has_many_polymorphs_cache_classes = true
end

That’s all.

preloading

The instructions in this section are unnecessary as of version 27.1 of the plugin.

Also at the top of app/controllers/application.rb, you need:

require 'model_that_contains_the_polymorphs_call'

This makes sure the parent model is always loaded at least once. Note that STI parent classes will also get ejected from the reloadable lists, since they need to be kept around for their children to function.

If you are working in the console or migrations and need to access the methods added by the plugin, you still need to manually reference your polymorphic parent class.
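
For example, in the console (assuming a Tag parent model like the one in the tagging writeup below):

>> Tag # reference the parent class first, so the plugin builds its methods
>> Tag.find(:first).taggables # the polymorphic collection now works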

download

The usual place. Rails 1.2.0 or later is required for the new features.

polymorphs 20

Release 20 of the has_many_polymorphs plugin is available.

new features

  • clear method on the polymorphic collection
  • push, <<, delete, and clear on all individual subcollections

An undocumented API was changed.

With double polymorphic relationships (say, eaters and foods) instead of using .corn_as_eater to access the Corn subcollection, you now need to use .food_corn. Referencing the join tables still uses as. This better parallels the general collection API.
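
For reference, a double relationship like the eaters/foods example is declared on the join model with the plugin’s acts_as_double_polymorphic_join macro (a sketch; the class and collection names are made up):

class Eating < ActiveRecord::Base
  belongs_to :eater, :polymorphic => true
  belongs_to :food, :polymorphic => true

  acts_as_double_polymorphic_join :eaters => [:dogs, :cats],
                                  :foods => [:corns, :bones]
end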

reconnaissance

Also, goodenough appeared on IRC and made a report:

class Hamster < ActiveRecord::Base
  has_many_polymorphs :toys,
    :from => [:wheels, :balls],
    :through => :hamsters
end

No join table! More to come.

web2nam.es

I am pleased to release Web2Names, a little toy to help you find a domain name. It does some tricky stemming on your search term to help find nearby words. Try searching on iphone (I kinda like “iphoni.es” and “iphoned.org”.)

implementation details

The site caches results very aggressively, because querying nameservers for scores of domains at a time is slow. It uses dig ns for the lookups, which means it occasionally returns false negatives (saying a domain is available when it really isn’t) if a site’s DNS configuration is completely hosed.
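
The core of such a check is tiny; roughly (a hypothetical helper, not the site’s actual code):

# a domain with no NS records looks available (or has hosed DNS)
def available?(domain)
  `dig ns #{domain} +short`.strip.empty?
end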

Web2Names is a Camping app. It is deployed behind Apache 2.2 with multiple load-balanced mongrels. It doesn’t use a database. Instead it uses Manfred Stienstra’s page cache library.

growing up your acts_as_taggable

The acts_as_taggable plugin, although built into Rails, is basically deprecated. And the gem version is old and requires a separate join table for every taggable model, which is silly.

update

Please see the updated has_many_polymorphs documentation, which has instructions for using the new, built-in tagging generator.

install

Uninstall whatever acts_as_taggable version you have, to prevent fighting among the plugins. Then, install has_many_polymorphs:

script/plugin install -x svn://rubyforge.org/var/svn/fauna/has_many_polymorphs/trunk

models

We will assume you are using the default acts_as_taggable models. You should have a Tag model, with a name field, as well as a Tagging join model. However, these models are part of the acts_as_taggable plugin source, whereas their migrations have to be part of your app (which is confusing). Instead, with has_many_polymorphs, we have to make the models part of the app, too:

class Tag < ActiveRecord::Base
  has_many_polymorphs :taggables,
    :from => [:books, :magazines],
    :through => :taggings,
    :dependent => :destroy
end

class Tagging < ActiveRecord::Base
  belongs_to :tag
  belongs_to :taggable, :polymorphic => true

  def before_destroy
    # disallow orphaned tags
    tag.destroy_without_callbacks if tag.taggings.count < 2
  end
end

See the line [:books, :magazines]? This line replaces all the acts_as_taggable macro calls strewn through your models. Simply list in the array the models that you want to be able to tag. (You can even tag Tags, see below.)

migrations

We need to make sure your existing tag schema fits this. It should be something like:

class AddTagSupport < ActiveRecord::Migration
  def self.up
    create_table :tags do |t|
      t.column :name, :string, :null => false
    end
    add_index :tags, :name, :unique => true

    create_table :taggings do |t|
      t.column :tag_id, :integer, :null => false
      t.column :taggable_id, :integer, :null => false
      t.column :taggable_type, :string, :null => false
    end
    add_index :taggings, [:tag_id, :taggable_id, :taggable_type], :unique => true
  end

  def self.down
    drop_table :tags
    drop_table :taggings
  end
end

If your schema isn’t like this already, you have two choices. You can add more options to the association macros in your models, to show them how to relate to your schema, or you can write a migration to convert your schema into canonical form. The first option is possibly quicker to implement, but the second one will be easier to maintain.
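
The second option might look like this, for an app coming from a per-model join table (a sketch; assumes the old table was tags_books):

class CanonicalizeTagSchema < ActiveRecord::Migration
  def self.up
    rename_table :tags_books, :taggings
    rename_column :taggings, :book_id, :taggable_id
    add_column :taggings, :taggable_type, :string
    execute "UPDATE taggings SET taggable_type = 'Book'"
  end

  def self.down
    remove_column :taggings, :taggable_type
    rename_column :taggings, :taggable_id, :book_id
    rename_table :taggings, :tags_books
  end
end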

api

Hey, you’re done! Well, not really. The API is different, which is the biggest sticking point. We’ll write some convenience methods to mimic the old way. You should put the methods in RAILS_ROOT/lib/tag_extensions.rb or similar.

class ActiveRecord::Base
  def tag_with tags
    tags.split(" ").each do |tag|
      Tag.find_or_create_by_name(tag).taggables << self
    end
  end

  def tag_list
    tags.map(&:name).join(' ')
  end
end

Pretty straightforward, which makes it easy to modify. For example, to make creating a model from params more transparent, we could add:

alias :tags= :tag_with

Or to enforce lowercase tags, just do:

def tag_with tags
  tags.downcase.split(" ").each do |tag|
    Tag.find_or_create_by_name(tag).taggables << self
  end
end

If you want to allow tags with spaces in them, we can accept an array instead of a string:

def tag_with *tags
  tags.flatten.each do |tag|
    Tag.find_or_create_by_name(tag).taggables << self
  end
end

To delete tags (don’t forget to downcase the tag_string if you need to):

def tag_delete tag_string
  split = tag_string.split(" ")
  tags.delete tags.select{|t| split.include? t.name}
end

To get all models for a tag (and all in a single SQL query, thanks to the plugin):

Tag.find_by_name("artichoke").taggables

To get only a specific model for a tag:

Tag.find_by_name("artichoke").books

Easy and powerful. And if you need to find which tags are most popular, see the find_popular method above.

Make sure that you require the file containing your API methods in environment.rb, since Rails won’t load it automatically:

require 'tag_extensions'

performance note

It is more efficient if we add :skip_duplicates => false to the has_many_polymorphs :taggables call. Then the taggables for each Tag won’t get loaded at all during the <<, because there is no reason to check them.

If we do this, though, and use database constraints to enforce uniqueness, we need to manually rescue assignment errors in our tag_with method:

def tag_with tags
  tags.split(" ").each do |tag|
    begin
      Tag.find_or_create_by_name(tag).taggables << self
    rescue ActiveRecord::StatementInvalid => e
      raise unless e.to_s[/Duplicate entry/]
    end
  end
end

self-referential tagging

What if you want to be able to tag tags? This is more useful than might first appear. It lets you create an ad-hoc non-exclusive hierarchy of categories without extra models. This is much more maintainable than one based on hard-coded models for each level. But you need some way to distinguish “the tags tagged by a tag” and “the tags a tag is tagged by”—a directed graph. So we will rename the parent relationship:

class Tag < ActiveRecord::Base
  has_many_polymorphs :taggables,
    :from => [:books, :magazines, :tags],
    :through => :taggings,
    :dependent => :destroy,
    :as => :tagger
end

class Tagging < ActiveRecord::Base
  belongs_to :tagger,
             :class_name => "Tag",
             :foreign_key => "tagger_id"
  belongs_to :taggable,
             :polymorphic => true
end

Modify the migration accordingly. Now, you can use some_tag.taggables to get the targets of some_tag, and some_tag.taggers to get the tags for which some_tag is itself a target.

Note that your tag_delete method might have to be more longwinded due to this long-outstanding Rails bug. (Rails 1.2.3 might be ok now; it’s unclear.)

wrapping up

You should find, once you complete the move, that tag lookup is faster, that the API is more intuitive and in line with regular ActiveRecord, that database compatibility is improved, and that the subsystem is more extensible. Quite a benefit!

Since has_many_polymorphs is basically a huge superset of the acts_as_taggable features, there is not much that isn’t supported. Many people use has_many_polymorphs in production for tagging already. Join them!

if you don’t want to hoe… echoe

Hoe, a gem used to build gems, injects itself as a dependency into every gem it creates. To many people, this is unnecessary, since hardly anybody wants to perform meta-operations on a gem itself. Mainly they just want to use the included library.

update

The latest documentation for Echoe is here.

update 2

I don’t recommend bothering with the conditional require anymore. People can figure out a LoadError easily enough, and it really clutters up the Rakefile.

the fork of destiny

But we don’t want unused libraries on the production server. What to do? Forget Hoe, and write the Rakefile by hand? I’ve certainly seen people do that. But remember, Hoe is open source. So let’s fork it:

sudo gem install echoe

Echoe is part of the Snax Fauna, little beasts running all over the place.

what’s changed?

Echoe 1.0.0 is based on Hoe 1.1.6. However, it does not inject itself as a dependency. If you still want to publish meta tasks in a Rakefile, you could do as follows:

require 'rubygems'
require 'rake'
begin
  require 'echoe'

  # all your regular Echoe/Hoe config

rescue LoadError => boom
  puts "You are missing a dependency required for meta-operations on this gem."
  puts "#{boom.to_s.capitalize}."

  desc 'No effect.'
  task :default

  # if you still want tests when Echoe is not present
  desc 'Run the test suite.'
  task :test do
    system "ruby -Ibin:lib:test some_tests_test.rb" # or whatever
  end
end

Then, people can still run the tests, and in the unlikely event that they want to mess with the gem itself, they can install Echoe. But nothing is forced on you. No opinions. No vinegar. No badgers.

Maybe badgers.