In which I demonstrate my awesome debugging workflow, and ramble about BleakHouse and Ruby’s memory behavior.
how to debug
I’ve been running BleakHouse in actual, deployed production on CHOW, and have learned some interesting things. The ruby-bleak-house build (which is really Eric Hodel’s work) is rock stable. Every deployed Ruby process on CHOW runs the patched build. BleakHouse itself is not enabled; the point is to have the symbols in place for emergencies.
Now say I notice a problem in a particular process. For example, monit shows this one using 300MB of RAM:
Let’s see what’s up. Using Mauricio’s gdb scripts, we can freeze the live process and inject ourselves into it:
chloe:~ eweaver$ ssh xx-xx.cnet.com
[eweaver@xx-xx.cnet.com ~]$ sudo -u app gdb ruby-bleak-house 20798
Attaching to program: /opt/ruby/1.8.6/bin/ruby-bleak-house, process 20798
(gdb) source ~/tools/ruby-gdb
(gdb) redirect_stdout
$1 = 2
(gdb)
Some poor user just got their request hung. Hopefully they’ll refresh it and get shifted to another mongrel.
Now, we need to tail the process log so that we can see standard output:
^Z
[1]+ Stopped sudo -u app gdb ruby-bleak-house 20798
[eweaver@xx-xx.cnet.com ~]$ tail -f /tmp/ruby-debug.20798 &
[2] 27095
[eweaver@xx-xx.cnet.com ~]$ fg %1
sudo -u app gdb ruby-bleak-house 20798
Back to our gdb prompt. Let’s look around, get our bearings:
(gdb) eval "caller"
["gems/mongrel-1.0.1/lib/mongrel/configurator.rb:274:in `run'",
"gems/mongrel-1.0.1/lib/mongrel/configurator.rb:274:in `loop'",
...]
Ah, there wasn’t a request after all. We’re just in the mongrel wait loop. Now we know it’s not some particular action jammed up and eating memory. More likely, it’s a slow leak.
Well, what does the heap look like?
(gdb) eval "require '~/chow/vendor/plugins/bleak_house/lib/bleak_house/c'"
nil
Hmm, something went wrong.
(gdb) eval "begin;
require '~/chow/vendor/plugins/bleak_house/lib/bleak_house/c';
rescue Object => e;
puts e.inspect;
end"
#<MissingSourceFile: no such file to load -- ~/chow/vendor/plugins/bleak_house/lib/bleak_house/c>
Oops. require doesn’t expand ~, so let’s use the absolute path:
(gdb) eval "require '/home/eweaver/chow/vendor/plugins/bleak_house/lib/bleak_house/c'"
["BleakHouse"]
Ok. But wait:
Program received signal SIGPIPE, Broken pipe.
0x0000003ab5bb7ee2 in __write_nocancel () from /lib64/tls/libc.so.6
Crap! Someone sent us a request. Uh…carrying on:
Program received signal SIGPIPE, Broken pipe.
0x0000003ab5bb7ee2 in __write_nocancel () from /lib64/tls/libc.so.6
The program being debugged was signaled while in a function called from GDB.
Evaluation of the expression containing the function (rb_p) will be abandoned.
(gdb) eval "'test'"
warning: Unable to restore previously selected frame.
"test"
(gdb) eval "BleakHouse::CLogger.new.snapshot('/tmp/objs', 'gdb', true)"
Detaching after fork from child process 29366.
(gdb)
Fork? Ok, whatever. Is the objs file there?
[eweaver@xx-xx.cnet.com ~]$ ls -l /tmp/objs
-rw-rw-rw- 1 app app 22865 May 11 22:57 /tmp/objs
It is! Done with gdb:
(gdb) quit
The program is running. Quit anyway (and detach it)? (y or n) y
Detaching from program: /opt/ruby/1.8.6/bin/ruby-bleak-house, process 20798
Let’s look at that file:
[eweaver@xx-xx.cnet.com ~]$ head /tmp/objs
---
- - 1178949448
  - :"memory usage/swap": 48608
    :"memory usage/real": 296612
    :"heap usage/filled slots": 670252
    :"heap usage/free slots": 1796759
    :"gdb::::String": 138879
    :"gdb::::_node": 480398
    :"gdb::::Regexp": 2281
    :"gdb::::Array": 20747
Our resident set size is very large. But there really aren’t that many filled slots in the Ruby heap. Maybe the usage shot up at one point, and then dropped back down. Maybe there’s a single array of immediate types that grows and grows, or something.
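To make that last hypothesis concrete, here’s a minimal sketch (my illustration, not CHOW code): a single Array of immediate values occupies just one slot in the Ruby object heap, while its ever-growing backing buffer is plain malloc’d memory, so it can inflate the resident set size without budging the filled-slot count.

leak = []                            # one Array object, one filled heap slot
10_000_000.times { |i| leak << i }   # Fixnums are immediates: no new slots,
                                     # but the Array's C buffer keeps growing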
Let’s compare to a young, small process:
[eweaver@xx-xx.cnet.com ~]$ head /tmp/objs2
---
- - 1178950158
  - :"memory usage/swap": 48664
    :"memory usage/real": 90396
    :"heap usage/filled slots": 524930
    :"heap usage/free slots": 227851
    :"gdb::::String": 63716
    :"gdb::::_node": 443825
    :"gdb::::Regexp": 1593
    :"gdb::::Array": 2815
Well, the node count is about the same, which means there isn’t any leak in the AST itself. And the filled vs. free looks much more normal. So…we didn’t learn that much.
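Eyeballing head output works, but the two snapshots can also be diffed in a few lines of Ruby. This is just my sketch, assuming each file parses as an array of [timestamp, counts-hash] pairs, which is what the heads above suggest.

require 'yaml'

old_counts = YAML.load_file('/tmp/objs').first.last    # counts from the bloated process
new_counts = YAML.load_file('/tmp/objs2').first.last   # counts from the young process

# Biggest increases first.
keys = old_counts.keys | new_counts.keys
deltas = keys.map { |key| [key, old_counts[key].to_i - new_counts[key].to_i] }
deltas.sort_by { |pair| -pair.last }.first(10).each do |key, delta|
  puts "#{key}: #{delta}"
end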
(At one point, I managed to get my gdb macros confused, so I killed gdb, leaving myself with an unkillable mongrel in the ‘T’ state. I had to send a -CONT signal and then a -KILL signal to kill the mongrel for real.)
What to do? Let’s make some heapspace fireworks.
bleak_house in real life
BleakHouse is pretty fast. So rather than simulate some crappy usage scenario, I shut down monit and swapped out one of the regular mongrels for one with BLEAK_HOUSE=true. Yeah, BleakHouse itself running in production mode on the live site. It worked great.
I restarted monit and eventually the process got big enough that monit killed it. Then, I copied bleak_house_production.yaml.log (all 147MB) to my local machine, and analyzed it. Let’s look at the graphs.
First, we have the root graph:
Wow. A couple things instantly stand out here:
- memory use, the squiggly red line, is growing linearly
- the total heap size stays low, then leaps to a new plateau
- the heap jump correlates with a bizarre request in the boards controller that spikes way out of range
- sometime before that, the boards controller suddenly went from being in the middle of the object-count pack to consistently the largest
The boards controller looks pretty suspicious. Before we look at that, though, let’s look at the heap in detail:
What a beautiful graph! And it verifies my guess from above—the filled slots stay within the same general bracket the whole time, but around 22:05 there is a huge, temporary need for lots of objects. Those slots then persist for the rest of the life of the process.
We’re running recent edge, so we should make sure that Rails itself isn’t leaking due to an introduced bug:
Looks pretty much fine. Although it shouldn’t really have an /unknown path…that’s a minor BleakHouse::Analyze bug.
The graph does have a slight upward trend, but basically stays within a consistent bracket after an initialization period. The continuing increase is probably some application code leaking objects, which then get misidentified the next time core is logged and tagged.
Now on to that crazy bastard, the boards controller:
Clearly the show action is to blame. Let’s look at that:
Hrm, ok, well that’s an awful lot of strings. The boards are one of the most heavily-trafficked sections of CHOW, so a leak there will eventually bring down the whole process.
Incidentally, CHOW gets about half a million page views per day.
onward, ever onward
Of course, now that we have a hypothesis about the action causing the leak, we have to find the exact code from which it springs. The general procedure for this is:
- thrash only that particular action with a single, deterministic request, and make sure you can consistently reproduce the problem
- sprinkle the action with BleakHouse::CLogger snapshots, tagged with something to identify their source (you could do this automatically with set_trace_func, perhaps, or manually pass __FILE__ and __LINE__, as in the sketch below the list)
- look for large deltas
- narrow it down
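Here’s a rough sketch of the second step. The controller body is invented for illustration, and the snapshot call just mirrors the one from the gdb session above, third argument and all; only the tags matter.

class BoardsController < ApplicationController
  def show
    snap = BleakHouse::CLogger.new

    snap.snapshot('/tmp/boards_show.yaml', "show:#{__LINE__}", true)
    @board = Board.find(params[:id])   # hypothetical: whatever the action actually does

    snap.snapshot('/tmp/boards_show.yaml', "show:#{__LINE__}", true)
    @posts = @board.posts              # hypothetical association load

    snap.snapshot('/tmp/boards_show.yaml', "show:#{__LINE__}", true)
  end
end

Each tag carries its line number, so a large delta between two adjacent tags points at the code in between.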
And then…kill the fucker. But that will have to wait for another day.
community response to bleak_house
BleakHouse is getting a lot of attention. There’s an InfoQ article, and one on Ruby Inside. Lourens Naudé wrote some notes about memory usage and used some BleakHouse code. And the Solaris DTrace team sent me a mail.
a detour about profiling
Lourens mentions in his article some tools like a String counter and a debug flag for mongrel. Don’t use those. They use ObjectSpace. ObjectSpace is worse than nothing, because if you try to count references you end up preserving references by mistake. Misinformation is worse than no information. (I’ve removed the old ObjectSpace counter from BleakHouse for this reason.)
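To be clear about the failure mode, here’s a hypothetical sketch of the kind of counter being warned against (not code from those tools): the moment you stash the objects you’re counting, you’ve pinned them, and part of the ‘leak’ you measure next time is your own bookkeeping.

count  = 0
pinned = []

ObjectSpace.each_object(String) do |s|
  count  += 1   # counting alone is mostly harmless...
  pinned << s   # ...but holding references like this pins every String,
end             # so the next sweep can't collect any of them

puts "#{count} strings"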
Incidentally Lourens references the same critical thread that I had been discussing with various people last week.
Also, the Ruby heap has a tendency to request memory from the OS and never give it back, even if the need goes away. This is normally fine, because unused portions of the space get swapped out. However, if your system is underloaded, there might not be any need to swap. This can give false impressions of the physical memory footprint of your app.
On the other hand, it would be nice if Ruby would just give the space back.
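A tiny sketch of that behavior (mine; the rss_kb helper that shells out to ps is hypothetical): allocate a burst of objects, drop them, force a GC, and watch the resident set size stay put.

def rss_kb
  `ps -o rss= -p #{Process.pid}`.to_i    # resident set size in KB, via ps
end

puts "baseline:         #{rss_kb} KB"

junk = Array.new(500_000) { 'x' * 50 }   # a burst of half a million Strings
puts "after allocation: #{rss_kb} KB"

junk = nil
GC.start                                 # the Strings are garbage now...
puts "after GC:         #{rss_kb} KB"    # ...but the RSS barely moves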
update
I gave BleakHouse its own documentation page.
Evan, what a great article. I really look forward to seeing a writeup on what the bug is and how you tracked it down all the way; these are great examples.
I would also love to hear what the DTrace team had to say.
MacPorts users, note that Mauricio’s scripts don’t work on the Ports build of Ruby; it’s missing debugging symbols or something. The scripts work with the BleakHouse Ruby binary, though. For ultimate convenience, rename the Ports ruby and symlink /opt/local/bin/ruby to ruby-bleak-house instead.

I’ll just say this is probably the most useful damn thing for Ruby in a long time. I really wish I’d had it when I was ghetto trolling through ObjectSpace tracking down that memory leak that supposedly didn’t exist in Threads and Arrays.