echoe 3

Echoe 3 is out, right on the heels of Rubygems 1.2. It supports the new runtime vs. development dependencies, and works correctly with the Rubyforge 1.0.0 gem.

It still supports all the usual features like certificate chains, RDoc upload, changeset parsing, manifest building, and cross-packaging. Documentation is here.

By the way, Rubygems 1.2 seems pretty great.

xapian search plugin

Francis Irving sent me a note about his work on a new Rails search plugin, acts_as_xapian. It uses the Xapian engine, which is a C++ indexer similar to Lucene. A particularly neat feature is built-in spellcheck.

I still plan to benchmark all these plugins on the Wikipedia dataset…it’s been delayed by the new job. If anyone has a big piece of iron I could use for a couple of weeks I would appreciate it (16GB RAM, hundreds of GB of free disk space, no production load).

rubygems memory patch

This patch against RubyGems 1.1.1 improves memory usage by not keeping every unused gemspec permanently in memory. It should have low CPU impact as long as you do your gem requires up-front.

For MacPorts:

cd /opt/local/lib/ruby/site_ruby/1.8
curl https://evanweaver.files.wordpress.com/2010/12/rubygems-memory-1_1_1.diff \
  | sudo patch -p0

Incidentally, I used BleakHouse to track down which references were getting retained.

twitter

Right now, Twitter is suffering slowdowns. Earlier today it was down again. :(

There are an excessive number of single points of failure in the current system, and through developer error* and external circumstance we have managed to hit quite a few of them in the last week. I am sorry and embarrassed.

In particular, we mis-estimated the impact of some cache policy changes. If your site runs so hot that it can’t function without memcached, you’d better understand exactly how much buffer capacity you actually have.

We’re working on fixing it all, but it takes a long time…

* I ain’t sayin’ it was me, but I ain’t sayin’ it wasn’t.

sweeper

Automatically tag your music collection with metadata from Last.fm.

what it is

A while back Last.fm released a command line tool to retrieve metadata for an arbitrary mp3 from their new fingerprint database. I tried it yesterday and it seemed way better than MusicBrainz. So, as a person with a lot of random mp3s, I cooked up a script for retagging entire folders of songs.

Some neat things used in the script:

  • id3lib-ruby for handling mp3 tags
  • Text for calculating Levenshtein distance to the nearest correct genre name (amatch is a compiled version of the same thing, but not Windows-compatible)
  • the incredibly comprehensive Last.fm API
  • XSD::Mapping for parsing the XML responses (better than Hpricot for small, well-formed documents)
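
To make the genre-matching idea concrete, here is a minimal sketch of picking the nearest genre name by Levenshtein distance. It is not sweeper's actual code: the distance function is written in plain Ruby so the example has no gem dependencies, and the genre list and the `nearest_genre` helper are hypothetical, with "Psychadelic" spelled as it appears in the demo output.

```ruby
# Classic dynamic-programming Levenshtein distance, case-insensitive.
# Keeps only the previous row of the edit-distance table.
def levenshtein(a, b)
  a = a.downcase
  b = b.downcase
  previous = (0..b.length).to_a
  a.each_char.with_index(1) do |char_a, i|
    current = [i]
    b.each_char.with_index(1) do |char_b, j|
      cost = char_a == char_b ? 0 : 1
      current << [
        previous[j] + 1,        # deletion
        current[j - 1] + 1,     # insertion
        previous[j - 1] + cost  # substitution (or match)
      ].min
    end
    previous = current
  end
  previous.last
end

# Hypothetical list of accepted genre names (an assumption for this sketch).
GENRES = ["Rock", "Psychadelic", "Pop", "Jazz"]

# Returns the accepted genre with the smallest edit distance to the tag.
def nearest_genre(tag, genres = GENRES)
  genres.min_by { |genre| levenshtein(tag, genre) }
end

puts nearest_genre("psychedelic")
```

A free-form tag such as "psychedelic" is one edit away from the accepted "Psychadelic", so that is the genre the sketch selects.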

A handy feature in the script is the ability to add the top 10 tagged genres to the comment field, so you can use iTunes or Foobar smart playlists for fancier multi-genre sorting. This is similar to lastfmtagger, but not Mac-specific.

demo

Before running sweeper --genre:

$ id3info 1_001.mp3
*** Tag information for 1_001.mp3
*** mp3 info
MPEG1/layer III
Bitrate: 128KBps
Frequency: 44KHz

After:

$ id3info 1_001.mp3
*** Tag information for 1_001.mp3
=== TPE1 (Lead performer(s)/Soloist(s)): Photon Band
=== TIT2 (Title/songname/content description): To Sing For You
=== WORS (Official internet radio station homepage): http://www.last.fm/music/Ph
oton+Band/_/To+Sing+For+You
=== TCON (Content type): Psychadelic
=== COMM (Comments): ()[]: rock, psychedelic, mod, Philly
*** mp3 info
MPEG1/layer III
Bitrate: 128KBps
Frequency: 44KHz

quickstart

Documentation is here, but for OS X:

sudo port install id3lib
sudo gem install sweeper
sweeper --help

Linux is similar to the above, depending on your distribution.

On Windows, you can just download a zipfile from the Rubyforge page and extract sweeper.exe to somewhere in your path.

I expect this to be eventually replaced by an official Last.fm tool, but for now, patches are welcome. It would be especially nice if someone could write a tutorial to help non-Ruby people install the script.

If you are going to contribute some code, grab the SVN checkout from Fauna, since the gem doesn’t ship with the test mp3s.

SVN, I know—how embarrassing!

bleakhouse 4

BleakHouse 4 came to life this weekend.

new implementation

BleakHouse now tracks the spawn points of every object on the heap, somewhat like Valgrind and somewhat like Dike.

This means there is no framing necessary, and the analysis task runs in seconds instead of hours. On top of that, the pure-C instrumentation means it’s fast enough to run in production, won’t introduce new leaks in your app, and can track T_NODE and other Ruby internals.

sample

After exactly 2000 requests:

$ bleak /tmp/bleak.13795.0.dump
1334329 total objects
Final heap size 1334329 filled, 1132647 free
Displaying top 100 most common line/class pairs
408149 __null__:__null__:__node__
273858 (eval):3:String
135304 __null__:__null__:String
29998 /opt/local/lib/ruby/gems/1.8/gems/mongrel-1.1.4/lib/mongrel.rb:122:String
14000 /rails/activesupport/lib/active_support/core_ext/hash/keys.rb:8:String
11825 /rails/actionpack/lib/action_controller/base.rb:1215:String
7022 /opt/local/lib/ruby/site_ruby/1.8/rubygems/specification.rb:557:Array
5995 /rails/actionpack/lib/action_controller/session/cookie_store.rb:145:String
4524 /opt/local/lib/ruby/gems/1.8/specifications/gettext-1.90.0.gemspec:14:String
4000 /opt/local/lib/ruby/1.8/cgi/session.rb:299:Array
4000 /rails/actionpack/lib/action_controller/response.rb:10:Array
...

Somebody’s got an eval leak, for sure. And those session.rb counts are pretty suspicious.
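
If you want to slice a report like this further, a small script can total the counts per class. This is only a sketch: the `count location:line:class` layout is inferred from the sample output above rather than from any documented format, the `totals_by_class` helper is hypothetical, and the sample data is a handful of lines copied from that report.

```ruby
# A few lines copied from the sample report above.
SAMPLE_REPORT = <<~DUMP
  1334329 total objects
  408149 __null__:__null__:__node__
  273858 (eval):3:String
  135304 __null__:__null__:String
  4000 /opt/local/lib/ruby/1.8/cgi/session.rb:299:Array
  4000 /rails/actionpack/lib/action_controller/response.rb:10:Array
DUMP

# Sums object counts per class. Lines that do not look like
# "count location:line:class" (headers, "...") are skipped.
def totals_by_class(report)
  totals = Hash.new(0)
  report.each_line do |line|
    match = line.strip.match(/\A(\d+) (.+):([^:]+):([^:]+)\z/)
    next unless match
    totals[match[4]] += match[1].to_i
  end
  totals
end

totals_by_class(SAMPLE_REPORT).sort_by { |_, count| -count }.each do |klass, count|
  puts "#{count} #{klass}"
end
```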

The BleakHouse docs are here. The codebase is very solid and I look forward to adding some neat things in 4.1 and 4.2.

credit where it’s due

Part of the development of BleakHouse 4 was sponsored by a Rails company you have definitely heard of.

rails search benchmarks

I put together some benchmarks for the three main Rails fulltext search solutions: Sphinx/Ultrasphinx, Ferret/acts_as_ferret, and Solr/acts_as_solr. The book Advanced Rails Recipes was a big help in getting Ferret and Solr running quickly.

dataset

The dataset is the entire KJV Bible, indexed by verse and also by book. This gives us 31,102 smallish records and 66 large ones. Ferret and Solr both use a Ruby method for loading the per-book contents (since they traverse a Rails association), while Sphinx (with Ultrasphinx) uses :concatenate to generate a GROUP_CONCAT MySQL query.

You can check out or browse the benchmark app yourself from here. Especially note the model configurations. The app should be runnable; the migrations include the dataset load.

performance results

These results exercise some basic queries of varying sizes. Some things are not covered; I may update the benches in the future for facet, filter, phrase, and operator usage.

I search 300 times each for a common short word in verses, the same word in books, the same word in all classes, then a rare pair of words in all classes, and finally a very rare long phrase in all classes. All engines were configured for no stopwords.
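
The shape of such a run can be sketched with Ruby's standard Benchmark library. This is not the benchmark app's actual rake task: `run_benchmark` and `fake_search` are hypothetical names, and because the full query strings are not shown in this post, the sketch only reuses the scope and word from each result label.

```ruby
require 'benchmark'

RUNS = 300

# Labels copied from the result tables (scope, word).
QUERIES = [
  %w[verse god],
  %w[book god],
  %w[all god],
  %w[all calves],
  %w[all moreover]
]

# Runs every query `runs` times against `search`, a callable taking
# (scope, query), and prints one timing row per label.
def run_benchmark(search, runs = RUNS)
  Benchmark.bm(20) do |bench|
    QUERIES.each do |scope, query|
      bench.report("#{scope}:#{query}") do
        runs.times { search.call(scope, query) }
      end
    end
  end
end

# Hypothetical stand-in for a real engine call; swap in the plugin's
# own search method to time an actual engine.
fake_search = lambda { |scope, query| "#{scope}:#{query}".length }
run_benchmark(fake_search)
```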

$ INDEX=1 rake benchmark

Sphinx
                               user     system      total        real
reindex                    0.000000   0.010000   2.310000 (  8.323945)
verse:god                  1.090000   0.160000   1.250000 ( 31.020551)
verse:god:no_ar            0.720000   0.080000   0.800000 ( 27.364780)
book:god                   0.980000   0.100000   1.080000 ( 26.839016)
all:god                    0.970000   0.100000   1.070000 ( 20.297412)
all:calves                 1.030000   0.110000   1.140000 ( 22.806805)
all:moreover               0.980000   0.120000   1.100000 ( 27.763920)
result counts: [3595, 64, 3659, 5, 2]
index size: 7.6M        total
memory usage in kb: {:virtual=>35356, :total=>35688, :real=>332}

Solr
                               user     system      total        real
reindex                  403.500000   4.650000 408.150000 (500.704153)
verse:god                  2.530000   0.500000   3.030000 ( 30.330766)
book:god                   1.910000   0.280000   2.190000 ( 30.164732)
all:god                    2.940000   0.360000   3.300000 ( 30.864319)
all:calves                 2.250000   0.330000   2.580000 ( 19.039895)
all:moreover               1.860000   0.300000   2.160000 ( 23.407134)
result counts: [4077, 64, 4141, 5, 2]
index size: 7.8M        total
memory usage in kb: {:virtual=>219376, :total=>298644, :real=>79268}

Ferret
                               user     system      total        real
reindex                    0.830000   2.130000   2.960000 (512.818894)
verse:god                  0.760000   0.210000   0.970000 (  2.557016)
book:god                   0.740000   0.030000   0.770000 (  1.914840)
all:god                  144.460000   4.430000 148.890000 (602.861915)
all:calves                 1.010000   0.050000   1.060000 (  3.033010)
all:moreover               0.710000   0.060000   0.770000 (  4.185469)
result counts: [3893, 64, 3957, 7, 2]
index size: 13M         total
memory usage in kb: {:virtual=>47272, :total=>112060, :real=>64788}

The horrible Ferret performance for “all:god” happened consistently. The log suggests that it does not use any kind of limit in multi-model search in order to ensure the relevance order is correct. This is a big fail.

The “real” column times for Sphinx and Solr are roughly the same, which suggests that the bulk of it is socket overhead. The dataset is too small for the actual query time to have an effect. However, it looks like Ferret is reusing the socket (via DRb), which is a point in its favor. Sphinx currently does not support persistent connections.

It is important to realize this does not mean that Ferret is fastest overall. It means that Ferret is fastest for small datasets where the constant socket overhead dwarfs the logarithmic actual lookup overhead.
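
To put the per-query cost in perspective, the “real” times for verse:god can be divided by the 300 repetitions. The figures are copied from the tables above; the `per_query_ms` helper is just illustrative arithmetic.

```ruby
REPETITIONS = 300

# "real" seconds for the verse:god row of each engine, from the tables above.
VERSE_GOD_REAL_SECONDS = {
  "Sphinx" => 31.020551,
  "Solr"   => 30.330766,
  "Ferret" => 2.557016
}

# Average wall-clock milliseconds per query, rounded to one decimal place.
def per_query_ms(total_seconds, repetitions = REPETITIONS)
  (total_seconds / repetitions * 1000).round(1)
end

VERSE_GOD_REAL_SECONDS.each do |engine, seconds|
  puts "#{engine}: #{per_query_ms(seconds)} ms per query"
end
```

That works out to roughly 100 ms per query for Sphinx and Solr against under 10 ms for Ferret, a gap that has more to do with connection handling than with lookup speed on a dataset this small.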

Do note that the “total” time spent (i.e., time spent in Ruby, instead of waiting on IO) is much lower for Sphinx than for Solr.

It would help the benchmark validity to run many query containers in parallel on separate machines, and to use a much larger dataset.

Other people’s benchmarks suggest that Sphinx starts to scale really well as query volume increases. Solr is likely to be within the same order of magnitude.

quality results

An extremely crude evaluation of search quality: which result set for the word “God” has the word repeated the most times in the records?

# Sphinx
Ultrasphinx::Search.new(:class_names => 'Verse', :query => "God",
:per_page => 10).run.map(&:content).join(" ").split(" ").
select{|s| s[/god/i]}.size
=> 45

# Solr
Verse.find_by_solr("God", :limit => 10).docs.map(&:content).join(" ").
split(" ").select{|s| s[/god/i]}.size
=> 30

# Ferret
Verse.find_by_contents("God", :limit => 10).map(&:content).join(" ").
split(" ").select{|s| s[/god/i]}.size
=> 26

Not much, but it’s something.

thoughts on usage

It’s interesting how similar the acts_as_ferret and acts_as_solr query interfaces are, and how different Ultrasphinx’s is. Multi-model search is an afterthought in Ferret and Solr, and it shows. (No other Rails Sphinx plugin supports multi-model search.)

The configuration interfaces are pretty similar until you start to get into engine-specific stuff like Ultrasphinx’s :association_sql, or Ferret’s analysis modules. Solr has its scary schema.xml but acts_as_solr hides that from you.

Ultrasphinx has some initialization annoyances which acts_as_solr doesn’t suffer from.

Ferret acted weird alongside Ultrasphinx unless I specifically required some acts_as_ferret files in environment.rb. Ferret also will index your entire table when you first reference the constant, which was a big surprise. In general Ferret is overly coupled. Solr is better and acts_as_solr does an especially nice job of hiding the Java from you.

I didn’t test any faceting ability. Solr probably has the best facet implementation. Ferret doesn’t seem to support facets at all.

on coupling

Ferret is unstable under load, and due to its very tight coupling, takes down your Rails containers along with it. Solr is pretty stable, but suffers from the opposite problem—if something goes wrong in a Rails container, and a record callback doesn’t fire, that record will never get indexed.

Sphinx avoids both these problems because it integrates through the database, not through the application. This is what databases are for. Sphinx is incredibly stable, but even if something happens to it, the loose coupling means that the only thing that fails in your app is search. And since it doesn’t rely on container callbacks, your index is always correct. This is the main reason I wrote Ultrasphinx.

Both Solr and Ferret are too slow to reindex on a regular basis. They could be much, much faster if they didn’t have to roundtrip every record through Ruby, but that’s how they’re designed.

Takeaway lesson—be deliberate about your integration points.

delta indexing support in ultrasphinx

Ahead of schedule, Ultrasphinx 1.9 is out with delta indexing, ERB support in the .base files, and official compatibility with Sphinx 0.9.8-rc1.

what it is

Delta indexing speeds up your updates by not reindexing the entire dataset every time. Instead, it keeps a main index which is updated rarely, and a delta index, which is updated frequently and only contains recently changed records.

Of course, your records need timestamps for this to work.
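
The idea can be modeled in a few lines of plain Ruby. This is a conceptual sketch only, not how Ultrasphinx or Sphinx implement it: the `Record` struct, `build_index`, `build_delta`, and `search` are hypothetical names, and an "index" here is just a hash from record id to content.

```ruby
Record = Struct.new(:id, :content, :updated_at)

# An "index" in this toy model is a snapshot hash of id => content.
def build_index(records)
  records.each_with_object({}) { |record, index| index[record.id] = record.content }
end

# The delta only covers records whose timestamp is newer than the last
# main index build, which is why the records need timestamps.
def build_delta(records, main_built_at)
  build_index(records.select { |record| record.updated_at > main_built_at })
end

# Delta entries override stale main entries for the same id.
def search(main, delta, term)
  main.merge(delta).select { |_, content| content.include?(term) }.keys.sort
end

main_built_at = Time.at(1000)
main = build_index([
  Record.new(1, "in the beginning", Time.at(500)),
  Record.new(2, "and god said", Time.at(600))
])

# Later: record 2 was edited and record 3 was added.
current = [
  Record.new(1, "in the beginning", Time.at(500)),
  Record.new(2, "let there be light", Time.at(1500)),
  Record.new(3, "god saw the light", Time.at(1600))
]
delta = build_delta(current, main_built_at)

p search(main, delta, "light")
p search(main, delta, "god")
```

Only the small delta has to be rebuilt after each change; the main snapshot stays untouched until the next full reindex.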

See the documentation for more details. There is also an explanation of the implementation on the forum.

gotchas

Note that there are some gotchas surrounding Sphinx and index merges, mainly that facet counts and text sorting may not be perfectly accurate. In an append-rich environment (most web apps) these tend not to matter.