ruby gc tuning

I’d like to call out something important from my QCon slides: the Railsbench GC settings.

quick study

In my experience, a typical production Rails app on Ruby 1.8 can recover 20% to 40% of user CPU by applying Stefan Kaes’s Railsbench GC patch to the Ruby binary, and using the following environment variables:

RUBY_HEAP_MIN_SLOTS=500000
RUBY_HEAP_SLOTS_INCREMENT=250000
RUBY_HEAP_SLOTS_GROWTH_FACTOR=1
RUBY_GC_MALLOC_LIMIT=50000000

In return, you get a minor increase in peak memory consumption. Not too shabby.
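The patch reads these variables once, when the interpreter boots, so they have to be in the environment before the Ruby process starts. One way to do that is a small launcher script; this is only a sketch, and the mongrel_rails invocation is a placeholder for however you actually start your app servers:

#!/usr/bin/env ruby
# Hypothetical launcher: export the GC settings, then exec the real server
# command. The mongrel_rails invocation below is only a placeholder.
ENV["RUBY_HEAP_MIN_SLOTS"]           = "500000"
ENV["RUBY_HEAP_SLOTS_INCREMENT"]     = "250000"
ENV["RUBY_HEAP_SLOTS_GROWTH_FACTOR"] = "1"
ENV["RUBY_GC_MALLOC_LIMIT"]          = "50000000"

exec "mongrel_rails", "start", "-e", "production", "-p", "8000"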

gc behavior

Ruby’s GC is non-generational, which means it looks at every object every time it runs. Java, OCaml, and many other runtimes have generational GCs, which check only recent allocations frequently (the older an object is, the less likely it is to be freeable). Ruby pays a high cost on every GC run because it also walks all the old objects, such as the Rails code itself, which is stored in the AST.
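You can see the effect of heap size on collection time even with an unpatched Ruby. Here is a rough illustration that times a full collection with a small and then a large number of live objects:

require 'benchmark'

# Rough illustration: mark-and-sweep cost grows with the number of live
# objects, whether or not any of them are actually garbage.
def time_gc_with(live_object_count)
  retained = Array.new(live_object_count) { "long-lived object" }
  GC.start                                    # settle the heap first
  seconds = Benchmark.realtime { GC.start }   # time one full collection
  puts "#{live_object_count} live objects: #{'%.4f' % seconds} s per GC"
  retained                                    # return so the caller keeps them alive
end

small = time_gc_with(10_000)
large = time_gc_with(1_000_000)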

Writing a generational GC is quite difficult, but there is one thing we can do: run the GC less frequently. This is what Stefan’s patches allow.

how to tune your app

If you install this patch from Sylvain Joyeux, you can add object allocation deltas to your Rails log, as well as use Stefan’s GC tracking methods. This gives you visibility into exactly when the GC runs and how much useful work it does. Spin up ab and hit a representative page repeatedly. Also start top in another shell so you can watch the total memory of the Rails process.
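With those pieces in place, a per-request log line like the ones below can come from an around filter. Treat this as a sketch only: method names like ObjectSpace.allocated_objects, ObjectSpace.live_objects, GC.enable_stats, and GC.time are assumptions about what the combined patches expose, and the units of GC.time vary between patch versions, so check yours.

class ApplicationController < ActionController::Base
  # Sketch of per-request GC logging. The ObjectSpace/GC methods used here
  # are assumed to come from the allocation and railsbench GC patches.
  GC.enable_stats if GC.respond_to?(:enable_stats)

  around_filter do |controller, action|
    cpu    = Process.times.utime              # user CPU so far
    allocd = ObjectSpace.allocated_objects    # cumulative allocation counter
    live   = ObjectSpace.live_objects         # objects currently on the heap
    gc     = GC.time                          # cumulative GC time (microseconds here)

    action.call

    controller.logger.info("tcpu:%.3f alloc:%d delta:%d gccpu:%.3f" % [
      Process.times.utime - cpu,
      ObjectSpace.allocated_objects - allocd,
      ObjectSpace.live_objects - live,
      (GC.time - gc) / 1_000_000.0
    ])
  end
end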

Now, do a simulated annealing-like search through the available environment settings.
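A crude version of that search is just a loop that restarts the app under each candidate value and measures an ab run. The restart command and URL below are placeholders, and the exported variable only takes effect if the restarted servers actually inherit it (for example, via the launcher script above):

# Hypothetical sweep over one dimension of the search space; a real search
# would also vary the heap-slot settings.
candidates = [8_000_000, 16_000_000, 32_000_000, 50_000_000, 64_000_000]

candidates.each do |limit|
  ENV["RUBY_GC_MALLOC_LIMIT"] = limit.to_s
  system("cap deploy:restart")        # placeholder: restart your app servers
  sleep 30                            # let them warm up
  output = `ab -n 500 -c 1 http://localhost:3000/representative_page`
  mean   = output[/Time per request:\s+([\d.]+)/, 1]
  puts "RUBY_GC_MALLOC_LIMIT=#{limit}: #{mean} ms per request"
end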

example

Here is one of our test runs with Ruby’s default GC settings (we use a custom logger at Twitter, but you should be able to arrive at similar output):

tcpu:0.154 alloc:63926 delta:-66290 gccpu:0.078
tcpu:0.067 alloc:63640 delta: 63645 gccpu:0.000
tcpu:0.146 alloc:63788 delta:-68896 gccpu:0.078
tcpu:0.060 alloc:63779 delta: 63784 gccpu:0.000
tcpu:0.138 alloc:63787 delta:-75152 gccpu:0.072
tcpu:0.059 alloc:63779 delta: 63784 gccpu:0.000
tcpu:0.138 alloc:63787 delta:-77591 gccpu:0.072
tcpu:0.062 alloc:63779 delta: 63784 gccpu:0.000

With the conservative Ruby defaults, the GC runs every other request, and takes 0.075 seconds, giving us a per-request GC cost of 0.038 seconds—40% of the entire request time.
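That bookkeeping is easy to script against log lines in this format (the fields come from our custom logger, so adjust the regexes to whatever you log); the same script applies to the tuned run further down:

# Summarize GC frequency and per-request GC cost from tcpu/gccpu log lines.
lines = STDIN.readlines.grep(/gccpu:/)
tcpu  = lines.map { |l| l[/tcpu:([\d.]+)/, 1].to_f }
gc    = lines.map { |l| l[/gccpu:([\d.]+)/, 1].to_f }

runs     = gc.select { |t| t > 0 }.size
total_gc = gc.inject(0)   { |sum, t| sum + t }
total    = tcpu.inject(0) { |sum, t| sum + t }

puts "GC runs every #{lines.size / runs} requests" if runs > 0
puts "per-request GC cost: #{'%.3f' % (total_gc / lines.size)} seconds"
puts "GC share of user CPU: #{'%.0f' % (100 * total_gc / total)}%"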

This is excessive. But if you explore a bit, you will quickly arrive at something like RUBY_HEAP_MIN_SLOTS=500000 RUBY_HEAP_SLOTS_INCREMENT=250000 RUBY_HEAP_SLOTS_GROWTH_FACTOR=1 RUBY_GC_MALLOC_LIMIT=50000000, which is what we use at Twitter.

These particular settings mean:

  • Start with enough memory to hold Rails (Ruby’s default is practically nothing)
  • Increase it linearly if you need more (Ruby’s default is exponential increase)
  • Only garbage-collect after 50 million bytes have been malloc’d since the last GC (Ruby’s default is 6x smaller)

Here are the GC timings for the same ab run with these settings applied:

tcpu:0.181 alloc:63829 delta:-763708 gccpu:0.118
tcpu:0.067 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63777 delta: 63782 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.058 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.063 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.058 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.063 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.059 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.185 alloc:63828 delta:-761854 gccpu:0.119
tcpu:0.069 alloc:63777 delta: 63782 gccpu:0.000
tcpu:0.065 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.058 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.061 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.059 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63775 delta: 63780 gccpu:0.000

Woah! Now the GC runs only every 13 requests, at a slightly higher cost, for a per-request cost of 0.009 seconds. This translates to a general speedup of 34%. The frequency of GC calls corresponds quite directly to the change in RUBY_GC_MALLOC_LIMIT, but if we increase it much more, memory usage balloons.

further reading

Unfortunately, there’s not much clear information on the Ruby garbage collector. But here are a few resources:

We are now experimenting with Brent’s MBARI branch of Ruby 1.8.6, kindly sponsored by Engine Yard. So far it looks excellent, especially in combination with Stefan’s patches. I will publish some results soon.

9 responses

  1. That’s fine, Hongli; just put an attribution if it’s significant.

    Incidentally, if I were you, I would hardcode sane defaults into REE itself so people don’t have to worry about it.

  2. Nice! Thanks for sharing your settings. This was on my todo list, but I had no idea such big savings were possible until you posted. Here is our before/after (running with Passenger and REE).

    Also, there was a corresponding improvement in average response time of roughly 33%.

  3. I tested similar settings and there was no significant difference, except that RSS growth is a bit unpredictable due to the exponential heap expansion.

  4. I’m curious. Have you thought about giving JRuby a go? From the benchmarks I’ve seen, it’s significantly faster than MRI, and you guys are already using Scala, so…thoughts?

  5. We have considered it; it doesn’t offer enough benefit versus cost of integration for us right now.

    It may never: the majority of the critical path is already in C.