ree

We recently migrated Twitter from a custom Ruby 1.8.6 build to a Ruby Enterprise Edition release candidate, courtesy of Phusion. Our primary motivation was the integration of Brent’s MBARI patches, which increase memory stability.

Some features of REE have no effect on our codebase, but we definitely benefit from the MBARI patchset, the Railsbench tunable GC, and the various leak fixes in 1.8.7p174. These are difficult to integrate and Phusion has done a fine job.
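For context, the Railsbench-derived GC in REE is configured entirely through environment variables read at interpreter startup. A minimal sketch — the values below are illustrative placeholders, not our production settings:

```shell
# Railsbench GC tunables read by REE at interpreter startup.
# These numbers are illustrative placeholders, not production values.
export RUBY_HEAP_MIN_SLOTS=500000        # preallocate a larger heap up front
export RUBY_HEAP_SLOTS_INCREMENT=250000  # grow the heap in fixed steps
export RUBY_HEAP_SLOTS_GROWTH_FACTOR=1   # linear growth instead of doubling
export RUBY_GC_MALLOC_LIMIT=50000000     # raise the malloc threshold that triggers a GC run
```

A Mongrel started from the same shell (or init script) inherits these settings.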

testing notes

I ran into an interesting issue. Ruby is faster if compiled with -Os (optimize for size) than with -O2 or -O3 (optimize for speed). Hongli pointed out that Ruby has poor instruction locality and benefits most from squeezing tightly into the instruction cache. This is an unusual phenomenon, although probably more common in interpreters and virtual machines than in “standard” C programs.
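For anyone wanting to reproduce the comparison, here is a sketch of the two builds, assuming an unpacked 1.8.7-p174 source tree; the prefixes are illustrative:

```shell
# Build sketch: Ruby optimized for size vs. for speed.
# Assumes an unpacked ruby-1.8.7-p174 source tree; prefixes are illustrative.
cd ruby-1.8.7-p174
CFLAGS="-Os" ./configure --prefix=/opt/ruby-os && make && make install
make distclean
CFLAGS="-O2" ./configure --prefix=/opt/ruby-o2 && make && make install
```

Comparing the two is then just a matter of pointing the app servers at one prefix or the other and re-running the benchmark.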

I also tested a build that included Joe Damato’s heaped thread frames, but it would hang Mongrel in rb_thread_schedule() after the first GC run, which is not exactly what we want. Hopefully this can be integrated later.

benchmarks

I ran a suite of benchmarks via Autobench/httperf and plotted them with Plot. The hardware was a 4-core Xeon machine with RHEL5, running 8 Mongrels balanced behind Apache 2.2. I made a typical API request that is answered primarily from composed caches.
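The individual httperf runs looked roughly like this; the hostname, URI, and rates are placeholders rather than our actual endpoints:

```shell
# One step of an autobench/httperf sweep; all parameters are placeholders.
httperf --server api.example.com --port 80 \
        --uri /typical/api/request.json \
        --rate 100 --num-conns 5000 --timeout 5
```

Autobench steps `--rate` across a range and collates the replies/sec and error counts that feed the plots.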

As usual, we see that tuning the GC parameters has the greatest impact on throughput, but there is a definite gain from switching to the REE bundle. It’s also interesting how much the standard deviation is improved by the GC settings. (Some data points are skipped due to errors at high concurrency.)

upgrading

Moving from 1.8.6 to REE 1.8.7 was trivial, but moving to 1.9 will be more of an ordeal. It will be interesting to see which patches are still necessary on 1.9. Many of them are being upstreamed, but some things (such as tcmalloc) will probably remain available only from third parties.

All in all, good times in MRI land.

24 responses

  1. Thanks for the great writeup, Evan, and for sharing the behind-the-scenes data. We used your earlier writeup about GC tuning to double the performance of Weplay, and it’s great to see REE 1.8.7 getting a Twitter-sized workout.

  2. This is an unusual phenomenon…

    More like the standard way things work these days. Every time I have tested the various settings with real, non-trivial programs, -Os wins, often by a huge margin. The speed difference between the L1 instruction cache and L2 is just massive for instruction fetches.

  3. Back in my C++ programming days, it was common wisdom that -Os produces code that actually runs faster than -O2. Since CPU frequencies are very high, machine code execution has been dominated by cache misses for at least a decade; making sure your hot execution path fits nicely into the CPU cache, with some space to spare for the data, gives you the best performance.

  4. It would be nice to see how much RAM each of those processes is using. Also, if you’re looking for speed, some suggestions might be:
    Use mysqlplus so that the mysql driver doesn’t force a GC every 20 queries (which it currently does).
    Compile with -march=native.

    Though I haven’t benchmarked these.

  5. As part of the MacRuby team, it would be my pleasure to help them experiment with MacRuby 0.5. However MacRuby isn’t running Rails yet and unless Twitter would want to switch from Scala to MacRuby for their queuing mechanism (using GCD for instance), I don’t really see any major interest for them to invest too much time in that yet.

    Maybe once we run Rails very well, though. ;)

  6. I understand the early-stage nature of MacRuby 0.5.

    I was just curious whether Evan’s people had done any work to look at 0.5. If what you’re seeing in your test cases is reflected in any real-world application, and there’s no reason to suppose it wouldn’t be, that would be a major step forward.

    While not wanting to speak for them, I’d expect it to be enough of a major deal for Twitter that they might consider devoting some resources (e.g. manpower) to help get MacRuby 0.5 to a point where it would be good enough for production, and maybe even ported to a system that supports GNUstep.

  7. john: We have not tried MacRuby. You seem to overestimate the resources available at a product company to devote to experimental system building.

    There are many reasons to suppose that test cases are not representative of real-world performance; in my experience, that is the norm.

    That said, if MacRuby ran Rails and worked on Linux, I would try it.

  8. The LLVM version of MacRuby is something you really should be paying attention to if Ruby performance is important to you. See here for more detail. You’ll see that the test cases to which I refer are in fact the standard Ruby benchmark set, for which speedups of up to 37X over 1.8 and 7.8X over 1.9 have been seen, even at this early stage.

    In my experience (which includes 25 years of performance engineering across a bunch of startups, in large-scale HPC, data center, real-time, and IO-oriented applications), those kinds of speedups in low-level benchmarks tend to make a significant dent in real-world applications.

    It’s not all peaches and cream: there are a bunch of major backwards regressions too, and the project is still at an early stage, with a ways to go. But overall I can say, without hype, that things look encouraging.

    Now let’s say, for the sake of argument, that someone had a Ruby implementation, arriving in a year, that would give you on average a 10X speedup. What would that be worth to you and your business? Would you consider it a competitive advantage or an enabling technology? And if so, what would you do to get that advantage 3 or 6 months ahead of time? Think about it.

  9. John, you might want to note that as MacRuby does not yet support many features of Ruby proper, it’s quite unrealistic to rely on those benchmarks. As more of Ruby is supported, it will become harder and harder for the MacRuby team to keep the performance so high. I certainly hope they do, but they have some way to go yet. Let’s wait and see before speculating.

    Here’s a good post explaining why the benchmarks shouldn’t be taken at face value.

  10. You’re missing my point. I understand almost exactly what stage the MacRuby project is at and what the benchmarks represent, and while I don’t mean to speak for Evan or Twitter, it seems to me that the benchmarks show enough promise to be worth their investigating, even at this stage. I’m trying to encourage someone who may have resources to help that project to actually do so, and help make the project happen. My point to Evan is that the specific payoff to him and his company should be calculable, and might be substantial, consequently justifying an allocation of resources.

    What’s motivating me here is that I would dearly like to see MacRuby both make it to production, and also be ported to Linux. And while the MacRuby guys have my confidence, I think they need major help for those two things to happen anytime soon.

  11. We didn’t have enough time on our own to integrate the MBARI patchset into 1.8.7 and had to wait for REE to do it. We don’t have enough time to port our own app to Ruby 1.9. There is absolutely no way we have time to work on MacRuby in the foreseeable future, especially if it needs “major help”.

    There may be an opportunity cost to ignoring it right now, but it is extremely small compared to everything else we can spend our resources on.

    That said, I also think LLVM makes a good potential compilation host for dynamic runtimes.

  12. Seems like I owe you an apology for getting on your case. I mistook you for the other Evan at Twitter. Things now make sense.