snax

searching the world in 231 seconds

update

Please refer to the official documentation here, rather than this post.

launch’d

I’m pleased to release the new CHOW search:


no delta, no no no

Sphinx is so fast that we don’t run an index queue. Re-indexing all of CHOW takes just under 4 minutes in production:

$ time indexer --config sphinx.production.conf complete
Sphinx 0.9.7
Copyright (c) 2001-2007, Andrew Aksyonoff

collected 405482 docs, 1095.1 MB
sorted 199.0 Mhits, 100.0% done
total 405482 docs, 1095082069 bytes
total 228.087 sec, 4801154.00 bytes/sec, 1777.75 docs/sec

real    3m51.321s
user    3m24.121s
sys     0m21.656s

That’s crazy.

why

Even though Solr/Lucene was available as an in-house CNET product, we dropped it in favor of Sphinx’s simplicity.

Sphinx accesses MySQL directly, so the interoperability happens at that level rather than in-app. This means you don’t need any indexing hooks in your models. Their lifecycle doesn’t affect the search daemon.

Plus, our old indexing daemon would mysteriously die and not restart. The Sphinx indexer just runs on a cronjob.

free codes

Kent Sibilev released acts_as_sphinx a while back, but I had already started on my implementation of a Rails Sphinx plugin. As it turned out, our needs were more sophisticated, so it’s good I did.

Mine is called Ultrasphinx, and features:

Of course it inherits from Sphinx itself:

Downsides of my plugin are:

The biggest benefit, really, is the SQL generation and index merging, which are related. The SQL generation lets you configure Sphinx via:

is_indexed(
  :fields => ["title", {:field => "post_last_created_at", :as => "published_at"}, "board_id"],
  :includes => [{:model => "Board", :field => "name", :as => "board"}],
  :concats => [{:model => "Post", :field => "content",
                :conditions => "posts.state = 0", :as => "body"}],
  :conditions => "topics.state = 0")

That is, you can use :includes to pull in fields from unary associations (here, the topic’s board), and :concats to combine fields from n-ary associations (here, the topic’s posts). In this case, we are indexing all replies to a topic as part of that topic’s body.
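To make the unary versus n-ary distinction concrete, here is a rough sketch of the kind of SQL a declaration like the one above could turn into. This is not Ultrasphinx's actual generator: the helper name sketch_index_sql, the naive table pluralization, and the Rails-style key names (topics.board_id, posts.topic_id) are all assumptions made for illustration, and the GROUP BY relies on MySQL's permissive grouping.

```ruby
# Hypothetical sketch, NOT the plugin's real code. Table and key names are
# assumed to follow plain Rails conventions ("Board" -> boards, topics.board_id,
# posts.topic_id), and pluralization is just "append an s".
def sketch_index_sql(table, options)
  singular = table.chomp("s")
  selects = ["#{table}.id AS id"]
  joins = []

  # Plain fields come straight from the model's own table.
  options.fetch(:fields, []).each do |field|
    field = {:field => field} if field.is_a?(String)
    selects << "#{table}.#{field[:field]} AS #{field[:as] || field[:field]}"
  end

  # Unary association: at most one joined row per record, so a LEFT JOIN is enough.
  options.fetch(:includes, []).each do |inc|
    joined = inc[:model].downcase + "s"
    joins << "LEFT JOIN #{joined} ON #{joined}.id = #{table}.#{inc[:model].downcase}_id"
    selects << "#{joined}.#{inc[:field]} AS #{inc[:as]}"
  end

  # N-ary association: many joined rows per record, collapsed with GROUP_CONCAT.
  options.fetch(:concats, []).each do |con|
    joined = con[:model].downcase + "s"
    on = "#{joined}.#{singular}_id = #{table}.id"
    on += " AND #{con[:conditions]}" if con[:conditions]
    joins << "LEFT JOIN #{joined} ON #{on}"
    selects << "GROUP_CONCAT(#{joined}.#{con[:field]} SEPARATOR ' ') AS #{con[:as]}"
  end

  sql = "SELECT #{selects.join(', ')} FROM #{table} #{joins.join(' ')}"
  sql += " WHERE #{options[:conditions]}" if options[:conditions]
  sql + " GROUP BY #{table}.id"
end

topic_sql = sketch_index_sql("topics",
  :fields => ["title", {:field => "post_last_created_at", :as => "published_at"}, "board_id"],
  :includes => [{:model => "Board", :field => "name", :as => "board"}],
  :concats => [{:model => "Post", :field => "content",
                :conditions => "posts.state = 0", :as => "body"}],
  :conditions => "topics.state = 0")
puts topic_sql
```

The point of the sketch is only the shape: each :includes entry adds one join and one column, while each :concats entry adds a join plus an aggregate, so every topic still comes out as a single row for the indexer.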

Because the SQL is generated, we can pay careful attention to Sphinx’s field expectations and build a merged index, which lets us rank totally orthogonal models against each other by relevance.

Sphinx does require a unique ID for every indexed record. We work around this by using the alphabetical index of the model class as a modulus in an SQL function.
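One way to picture the ID workaround is the following sketch. It assumes the scheme is "primary key times the number of models, plus the model's alphabetical index", so that the remainder modulo the model count identifies the class. The MODELS list and the helper names global_id, locate, and sql_id_expression are hypothetical, and the plugin's real SQL function may differ.

```ruby
# Hypothetical model list, sorted so each class has a stable alphabetical index.
MODELS = %w[Topic User Post].sort.freeze

# Combine a per-table primary key with the model's alphabetical index.
# IDs from different tables can no longer collide in the merged index.
def global_id(model_name, record_id)
  index = MODELS.index(model_name) or raise ArgumentError, "unknown model: #{model_name}"
  record_id * MODELS.size + index
end

# Recover the model name and the original primary key from a merged-index ID.
def locate(global_id)
  record_id, index = global_id.divmod(MODELS.size)
  [MODELS.fetch(index), record_id]
end

# The same arithmetic written as a SQL expression for the indexer's SELECT.
def sql_id_expression(model_name, table)
  "(#{table}.id * #{MODELS.size} + #{MODELS.index(model_name)}) AS id"
end

p global_id("Topic", 42)
p locate(global_id("Topic", 42))
puts sql_id_expression("Topic", "topics")
```

With three models sorted as Post, Topic, User, topic 42 maps to 42 * 3 + 1 = 127, and dividing 127 by 3 gives back 42 with remainder 1, which points at Topic again.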

download

script/plugin install -x svn://rubyforge.org/var/svn/fauna/ultrasphinx/trunk

Documentation is here.

to the future

I don’t have much time to support this outside of our needs at CNET. So if you need something Certified and Enterprise Ready, I guess use Lucene, or maybe that French one I can’t spell.

If you need something faster, simpler, and more interesting, Sphinx + Ultrasphinx will be awesome.

Patches welcome; just ask if you want to be a committer. The support forum is here.

postscript

Who just searched for “señor fish”? Not kidding:

[eweaver@cnet search]$ rake ultrasphinx:daemon:tail
Tailing /opt/sphinx/var/log/query.log
  whole wheat pasta
  senor fish

Hopefully it’s better than the shrimp burrito I made the other day. That was kind of gross.