snax

let me hit you with some knowledge

In which I demonstrate my awesome debugging workflow, and ramble about BleakHouse and Ruby’s memory behavior.

how to debug

I’ve been running BleakHouse in actual, deployed production on CHOW, and have learned some interesting things. The ruby-bleak-house build (which is really Eric Hodel’s work) is rock stable. Every deployed Ruby process on CHOW runs the patched build. BleakHouse itself is not enabled—the point is to have the symbols in place for emergencies.

Now say I notice a problem in a particular process. For example, monit shows this one using 300MB of RAM:


Let’s see what’s up. Using Mauricio’s gdb scripts, we can freeze the live process and inject ourselves into it:

chloe:~ eweaver$ ssh xx-xx.cnet.com
[eweaver@xx-xx.cnet.com ~]$ sudo -u app gdb ruby-bleak-house 20798
Attaching to program: /opt/ruby/1.8.6/bin/ruby-bleak-house, process 20798
(gdb) source ~/tools/ruby-gdb
(gdb) redirect_stdout
$1 = 2
(gdb) 

Some poor user just got their request hung. Hopefully they’ll refresh it and get shifted to another mongrel.

Now, we need to tail the process log so that we can see standard output:

^Z
[1]+  Stopped sudo -u app gdb ruby-bleak-house 20798
[eweaver@xx-xx.cnet.com ~]$ tail -f /tmp/ruby-debug.20798 &
[2] 27095
[eweaver@xx-xx.cnet.com ~]$ fg %1
sudo -u app gdb ruby-bleak-house 20798

Back to our gdb prompt. Let’s look around, get our bearings:

(gdb) eval "caller"
["gems/mongrel-1.0.1/lib/mongrel/configurator.rb:274:in `run'",
"gems/mongrel-1.0.1/lib/mongrel/configurator.rb:274:in `loop'",
...]

Ah, there wasn’t a request after all. We’re just in the mongrel wait loop. Now we know it’s not some particular action jammed up and eating memory. More likely, it’s a slow leak.

Well, what does the heap look like?

(gdb) eval "require '~/chow/vendor/plugins/bleak_house/lib/bleak_house/c'"
nil

Hmm, something went wrong.

(gdb) eval "begin;
  require '~/chow/vendor/plugins/bleak_house/lib/bleak_house/c';
rescue Object => e;
  puts e.inspect;
end"
#<MissingSourceFile: no such file to load -- ~/chow/vendor/plugins/bleak_house/lib/bleak_house/c>

Oops.

(gdb) eval "require '/home/eweaver/chow/vendor/plugins/bleak_house/lib/bleak_house/c'"
["BleakHouse"]

Ok. But wait:

Program received signal SIGPIPE, Broken pipe.
0x0000003ab5bb7ee2 in __write_nocancel () from /lib64/tls/libc.so.6

Crap! Someone sent us a request. Uh…carrying on:

Program received signal SIGPIPE, Broken pipe.
0x0000003ab5bb7ee2 in __write_nocancel () from /lib64/tls/libc.so.6
The program being debugged was signaled while in a function called from GDB.
Evaluation of the expression containing the function (rb_p) will be abandoned.
(gdb) eval "'test'"
warning: Unable to restore previously selected frame.
"test"
(gdb) eval "BleakHouse::CLogger.new.snapshot('/tmp/objs', 'gdb', true)"
Detaching after fork from child process 29366.
(gdb)

Fork? Ok, whatever. Is the objs file there?

[eweaver@xx-xx.cnet.com ~]$ ls -l /tmp/objs
-rw-rw-rw-  1 app app 22865 May 11 22:57 /tmp/objs

It is! Done with gdb:

(gdb) quit
The program is running.  Quit anyway (and detach it)? (y or n) y
Detaching from program: /opt/ruby/1.8.6/bin/ruby-bleak-house, process 20798

Let’s look at that file:

[eweaver@xx-xx.cnet.com ~]$ head /tmp/objs
---
- - 1178949448
  - :"memory usage/swap": 48608
    :"memory usage/real": 296612
    :"heap usage/filled slots": 670252
    :"heap usage/free slots": 1796759
    :"gdb::::String": 138879
    :"gdb::::_node": 480398
    :"gdb::::Regexp": 2281
    :"gdb::::Array": 20747

Our resident set size is very large. But there really aren’t that many filled slots in the Ruby heap. Maybe the usage shot up at one point, and then dropped back down. Maybe there’s a single array of immediate types that grows and grows, or something.

Let’s compare to a young, small process:

[eweaver@xx-xx.cnet.com ~]$ head /tmp/objs2
---
- - 1178950158
  - :"memory usage/swap": 48664
    :"memory usage/real": 90396
    :"heap usage/filled slots": 524930
    :"heap usage/free slots": 227851
    :"gdb::::String": 63716
    :"gdb::::_node": 443825
    :"gdb::::Regexp": 1593
    :"gdb::::Array": 2815

Well, the node count is about the same, which means there isn’t any leak in the AST itself. And the filled vs. free looks much more normal. So…we didn’t learn that much.

(At one point, I managed to get my gdb macros confused, so I killed gdb, leaving myself with an unkillable mongrel in the ‘T’ state. I had to send a -CONT signal and then a -KILL signal to kill the mongrel for real.)

What to do? Let’s make some heapspace fireworks.

bleak_house in real life

BleakHouse is pretty fast. So rather than simulate some crappy usage scenario, I shut down monit and swapped out one of the regular mongrels for one with BLEAK_HOUSE=true. Yeah—BleakHouse itself running in production mode on the live site. It worked great.

I restarted monit and eventually the process got big enough that monit killed it. Then, I copied bleak_house_production.yaml.log (all 147MB) to my local machine, and analyzed it. Let’s look at the graphs.

First, we have the root graph:


Wow. A couple things instantly stand out here:

The boards controller looks pretty suspicious. Before we look at that, though, let’s look at the heap in detail:


What a beautiful graph! And it verifies my guess from above—the filled slots stay within the same general bracket the whole time, but around 22:05 there is a huge, temporary need for lots of objects. Those slots then persist for the rest of the life of the process.

We’re running recent edge, so we should make sure that Rails itself isn’t leaking due to an introduced bug:


Looks pretty much fine. Although it shouldn’t really have an /unknown path…that’s a minor BleakHouse::Analyze bug.

The graph does have a slight upward trend, but basically stays within a consistent bracket after an initialization period. The continuing increase is probably some application code leaking objects, which then get misidentified the next time core is logged and tagged.

Now on to that crazy bastard, the boards controller:


Clearly the show action is to blame. Let’s look at that:


Hrm, ok, well that’s an awful lot of strings. The boards are one of the most heavily-trafficked sections of CHOW, so a leak there will eventually bring down the whole process.

Incidentally, CHOW gets about a 1/2 million page views per day.

onward, ever onward

Of course, now that we have a hypothesis about the action causing the leak, we have to find the exact code from which it springs. The general procedure for this is:

And then…kill the fucker. But that will have to wait for another day.

community response to bleak_house

BleakHouse is getting a lot of attention. There’s an InfoQ article, and one on Ruby Inside. Lourens Naudé wrote some notes about memory usage and used some BleakHouse code. And the Solaris DTrace team sent me a mail.

a detour about profiling

Lourens mentions in his article some tools like a String counter and a debug flag for mongrel. Don’t use those. They use ObjectSpace. ObjectSpace is worse than nothing, because if you try to count references you end up preserving references by mistake. Misinformation is worse than no information. (I’ve removed the old ObjectSpace counter from BleakHouse for this reason.)

Incidentally Lourens references the same critical thread that I had been discussing with various people last week.

Also, the Ruby heap has a tendency to request memory from the OS and never give it back, even if the need goes away. This is normally fine, because unused portions of the space get swapped out. However, if your system is underloaded, there might not be any need to swap. This can give false impressions of the physical memory footprint of your app.

On the other hand, it would be nice if Ruby would just give the space back.

update

I gave BleakHouse its own documentation page.