I’d like to call out something important from my QCon slides: the Railsbench GC settings.
quick study
In my experience, a typical production Rails app on Ruby 1.8 can recover 20% to 40% of user CPU by applying Stefan Kaes’s Railsbench GC patch to the Ruby binary, and using the following environment variables:
RUBY_HEAP_MIN_SLOTS=500000
RUBY_HEAP_SLOTS_INCREMENT=250000
RUBY_HEAP_SLOTS_GROWTH_FACTOR=1
RUBY_GC_MALLOC_LIMIT=50000000
In return, you get a minor increase in peak memory consumption. Not too shabby.
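If you want to try them, the variables just need to be in the environment of the patched Ruby before your app server starts. A minimal wrapper sketch (the commented-out mongrel_rails line is a placeholder for whatever start command you actually use):

```shell
#!/bin/sh
# Railsbench GC tuning for the patched Ruby binary.
export RUBY_HEAP_MIN_SLOTS=500000
export RUBY_HEAP_SLOTS_INCREMENT=250000
export RUBY_HEAP_SLOTS_GROWTH_FACTOR=1
export RUBY_GC_MALLOC_LIMIT=50000000

# exec mongrel_rails start -e production   # placeholder: your server command here
```

The unpatched Ruby silently ignores these variables, so the wrapper is harmless if you roll back the patch.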
gc behavior
Ruby’s GC is non-generational, which means that the GC looks at every object every time it runs. Java, OCaml, and other static languages have a generational GC, which means that only recent allocations are frequently checked (the older an object is, the less likely it is to be freeable). But Ruby pays a high cost each GC run because it looks at all old objects too; for example, the Rails code itself stored in the AST.
Writing a generational GC is quite difficult, but there is one thing we can do: run the GC less frequently. This is what Stefan’s patches allow.
how to tune your app
If you install this patch from Sylvain Joyeux, you can add object allocation deltas to your Rails log, as well as use Stefan’s GC tracking methods. This gives you visibility into exactly when the GC runs and how much useful work it does. Spin up ab and script a representative page; also start top in another shell to watch total memory in the Rails process.
Now, do a simulated annealing-like search through the available environment settings.
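A crude version of that search can be scripted. This sketch just steps RUBY_GC_MALLOC_LIMIT through a few candidate values; the server restart and the ab invocation are placeholders (commented out) that you would fill in for your own app:

```shell
# Coarse one-dimensional sweep; repeat for the other variables,
# keeping the best combination found so far.
for limit in 8000000 16000000 32000000 50000000 64000000; do
  export RUBY_GC_MALLOC_LIMIT=$limit
  echo "=== RUBY_GC_MALLOC_LIMIT=$limit ==="
  # Placeholder steps for a real run:
  #   1. restart your app server so the new setting takes effect
  #   2. ab -n 500 -c 5 http://localhost:3000/representative_page | grep 'Time per request'
  #   3. note the gccpu totals in the Rails log and the RSS shown by top
done
```

This is not true simulated annealing, but for four variables with roughly monotonic effects, a greedy sweep like this converges on a good neighborhood quickly.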
example
Here is one of our test runs with Ruby’s default GC settings (we use a custom logger at Twitter, but you should be able to arrive at similar output):
tcpu:0.154 alloc:63926 delta:-66290 gccpu:0.078
tcpu:0.067 alloc:63640 delta:63645 gccpu:0.000
tcpu:0.146 alloc:63788 delta:-68896 gccpu:0.078
tcpu:0.060 alloc:63779 delta: 63784 gccpu:0.000
tcpu:0.138 alloc:63787 delta:-75152 gccpu:0.072
tcpu:0.059 alloc:63779 delta: 63784 gccpu:0.000
tcpu:0.138 alloc:63787 delta:-77591 gccpu:0.072
tcpu:0.062 alloc:63779 delta: 63784 gccpu:0.000
With the conservative Ruby defaults, the GC runs every other request, and takes 0.075 seconds, giving us a per-request GC cost of 0.038 seconds—40% of the entire request time.
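That per-request figure is just the mean of the gccpu fields. An awk sketch over a two-line excerpt of the log above (the gc.log file is created inline purely so the example is self-contained):

```shell
# Average the gccpu field across requests to get the per-request GC cost.
cat > gc.log <<'EOF'
tcpu:0.146 alloc:63788 delta:-68896 gccpu:0.078
tcpu:0.060 alloc:63779 delta: 63784 gccpu:0.000
EOF

avg=$(awk '{ for (i = 1; i <= NF; i++)
               if ($i ~ /^gccpu:/) { split($i, f, ":"); sum += f[2]; n++ } }
           END { printf "%.3f", sum / n }' gc.log)
echo "per-request GC cost: ${avg}s"
# prints "per-request GC cost: 0.039s"
```

Run it over a full benchmark's worth of log lines rather than two, of course; the excerpt is just to show the arithmetic.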
This is excessive. But if you explore a bit, you will quickly arrive at something like RUBY_HEAP_MIN_SLOTS=500000 RUBY_HEAP_SLOTS_INCREMENT=250000 RUBY_HEAP_SLOTS_GROWTH_FACTOR=1 RUBY_GC_MALLOC_LIMIT=50000000, which is what we use at Twitter.
These particular settings mean:
- Start with enough memory to hold Rails (Ruby’s default is practically nothing)
- Increase it linearly if you need more (Ruby’s default is exponential increase)
- Only garbage-collect every 50 million malloc calls (Ruby’s default is 6x smaller)
Here are the GC timings for the same ab run with these settings applied:
tcpu:0.181 alloc:63829 delta:-763708 gccpu:0.118
tcpu:0.067 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63777 delta: 63782 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.058 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.063 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.058 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.063 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.059 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.185 alloc:63828 delta:-761854 gccpu:0.119
tcpu:0.069 alloc:63777 delta: 63782 gccpu:0.000
tcpu:0.065 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.058 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.061 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.059 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63775 delta: 63780 gccpu:0.000
Woah! Now the GC runs only every 13 requests, at a slightly higher cost, for a per-request cost of 0.009 seconds. This translates to a general speedup of 34%. The frequency of GC calls corresponds quite directly to the change in RUBY_GC_MALLOC_LIMIT, but if we increase it much more the memory usage balloons.
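To catch that ballooning, sample the server’s resident set size while the benchmark runs. A sketch (the PID here points at the current shell purely so the example runs standalone; substitute your Rails process’s PID, e.g. from pgrep):

```shell
PID=$$   # placeholder: your mongrel/Rails PID

# Sample RSS (in KB) a few times; a steadily climbing number after the
# heap has warmed up means the malloc limit is set too high.
for i in 1 2 3; do
  rss=$(ps -o rss= -p "$PID" | tr -d ' ')
  echo "rss_kb=$rss"
  sleep 1
done
```

Watching the steady-state value, rather than the peak during the first few requests, is what matters: the tuned settings trade a one-time bump in heap size for less frequent collection.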
further reading
Unfortunately, there’s not much clear information on the Ruby garbage collector. But here are a few resources:
- Optimizing Rails. Second-hand notes on an old presentation of Stefan’s.
- Acunote team blog. Has some good but scattered GC information.
- REE documentation. Phusion NL discusses Stefan’s patch.
- Brent Roman’s MBARI explanations. Detailed discussion of various leaks and optimizations.
We are now experimenting with Brent’s MBARI branch of Ruby 1.8.6, kindly sponsored by EngineYard. So far it looks excellent, especially in combination with Stefan’s patches. I will publish some results soon.
Thanks for the real-world example; as you said, it is very difficult to find information on this.
Thanks for the post. I’ve included some of the information in the Ruby Enterprise Edition documentation if you don’t mind.
That’s fine, Hongli; just put an attribution if it’s significant.
Incidentally, if I were you, I would hardcode sane defaults into REE itself so people don’t have to worry about it.
Nice! Thanks for sharing your settings. This was on my todo list, but I had no idea that such big savings were possible until you posted. Here is our before/after (running with Passenger and REE).
Also – there was a corresponding improvement in average response time – roughly 33%.
Have you also tried the 37signals settings? Seems like the biggest change is that you’re using linear memory expansion. How do the settings measure up?
Yeah, I’d be very much interested in how the Twitter settings compare to the 37signals settings.
I tested similar settings and there was no significant difference, except that RSS growth is a bit unpredictable due to the exponential heap expansion.
I’m curious. Have you thought about giving JRuby a go? From the benchmarks I’ve seen it’s significantly faster than MRI, and you guys are already using Scala, so… thoughts?
We have considered it; it doesn’t offer enough benefit versus cost of integration for us right now.
It may never: the majority of the critical path is already in C.