hello heroku world

I’ve been investigating various platform-as-a-service providers, and did some basic benchmarking on Heroku.

I deployed a number of HTTP hello-world apps on the Cedar stack and hammered them via autobench. The results may be interesting to you if you are trying to maximize your hello-world dollar.

setup

Each Heroku dyno is an lxc container with 512MB of RAM and an unclear amount of CPU. The JVM parameters were -Xmx384m -Xss512k -XX:+UseCompressedOops -server -d64.

The driver machine was an EC2 m1.large in us-east-1a, running the default 64-bit Amazon Linux AMI. A single httperf process could successfully generate up to 25,000 rps with the given configuration. Timeouts were set high enough to allow any intermediate queues to stay flooded.

throughput results

In the graphs below, the response rate is the solid line ━━ and the left y axis; connection errors as a percentage are the dashed line ---- and the right y axis. The graphs are heavily splined, as suits a meaningless micro-benchmark.

Note that as the response rates fall away from the gray, dashed x=y line, the server is responding increasingly late and thus would shed sustained load, regardless of the measured connection error rate.

Finagle and Node made good throughput showings—Node had the most consistent performance profile, but Finagle’s best case was better. Sinatra (hosted by Thin) and Tomcat did OK. Jetty collapsed when pushed past its limit, and Bottle (hosted by wsgiref) was effectively non-functional. Finally, the naive C/Accept “stack” demonstrated an amusing combination of poor performance and good scalability.

As the number of dynos increases, the best per-dyno response rate declines from 2500 rps to below 1000 rps, and the implementations become less and less differentiated. This suggests that there is a non-linear bottleneck in the routing layer. There also appears to be a per-app routing limit around 12,000 rps that no number of dynos can overcome (data not shown). For point of reference, 12,000 rps is the same velocity as the entire Netflix API.
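To put the two figures above side by side, here is a rough sketch of the best case they allow. It assumes dynos scale linearly up to the routing ceiling, which the measurements show is optimistic; the constant and method names are mine, not Heroku's.

```ruby
# Rough upper bound on app throughput, using only the figures quoted above.
# Assumption: linear scaling per dyno until the per-app ceiling is hit.
BEST_SINGLE_DYNO_RPS = 2_500   # best per-dyno rate seen with one dyno
PER_APP_CEILING_RPS  = 12_000  # approximate per-app routing limit

def throughput_upper_bound(dynos)
  [dynos * BEST_SINGLE_DYNO_RPS, PER_APP_CEILING_RPS].min
end

def per_dyno_upper_bound(dynos)
  throughput_upper_bound(dynos).to_f / dynos
end

[1, 4, 8, 16].each do |n|
  puts format('%2d dynos: <= %5d rps total, <= %.0f rps per dyno',
              n, throughput_upper_bound(n), per_dyno_upper_bound(n))
end
```

Even in this generous model the per-dyno share has to fall once the ceiling is reached, so adding dynos past that point only buys headroom, not throughput.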

The outcome demonstrates the complicated interaction between implementation scalability, per-dyno scalability, and per-app scalability, none of which are linear constraints.

latency results

Latency was essentially equivalent across all the stacks, C/Accept and Bottle excepted. Again, we see a non-linear performance falloff as the number of dynos increases.

The latency ceiling at 100ms in the first two graphs is caused by Heroku’s load-shedding 500s; ideally httperf would exclude those from the report.

using autobench

Autobench is a tool that automates running httperf at increasing rates. I like it a lot, despite (or because of) its ancient Perl-ness. A few modifications were necessary to make it more useful.

  • Adding a retry loop around the system call to httperf, because sometimes it would wedge, get killed by my supervisor script, and then autobench would return empty data.
  • The addition of --hog to the httperf arguments.
  • Fixing the error percentage to divide by connections only, instead of by HTTP responses, which makes no sense when issuing multiple requests per connection.
  • Only counting HTTP status code 200 as a successful reply.
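The modifications above are small enough to sketch. This is Ruby rather than autobench's Perl, and every helper name here is hypothetical; it only illustrates the retry loop, the per-connection error percentage, and the 200-only success count.

```ruby
# Hypothetical helpers; autobench itself is Perl and is not structured this way.

# Run the block up to `attempts` times, returning the first non-empty output.
# Returns nil if every attempt came back empty (e.g. httperf was killed).
def with_retries(attempts)
  attempts.times do
    output = yield
    return output unless output.nil? || output.empty?
  end
  nil
end

# Error rate as a percentage of attempted connections, not of HTTP responses.
def connection_error_percentage(connections, connection_errors)
  return 0.0 if connections.zero?
  100.0 * connection_errors / connections
end

# Only HTTP 200 counts as a successful reply.
def successful_replies(status_counts)
  status_counts.fetch(200, 0)
end

# Usage with a fake runner that returns nothing on its first attempt.
calls = 0
output = with_retries(3) { (calls += 1) < 2 ? '' : 'fake httperf output' }
puts output
puts connection_error_percentage(200, 50)
puts successful_replies(200 => 90, 500 => 10)
```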

I am not sure what happens that wedges httperf. Bottle was particularly nasty about this. At this point I should probably rewrite autobench in Ruby and merge it with my supervisor script, and also have it drive hummingbird which appears to be more modern and controllable than httperf.

using httperf

Httperf is also old and crufty, but in C. In order to avoid lots of fd-unavail errors, you have to make sure that your Linux headers set FD_SETSIZE large enough. For the Amazon AMI:

/usr/include/bits/typesizes.h:#define __FD_SETSIZE 65535
/usr/include/linux/posix_types.h:#define __FD_SETSIZE 65535

Fix the --hog bug in httperf itself, and also drop the sample rate to 0.5:

src/httperf.c:#define RATE_INTERVAL 0.5

You can grab the pre-compiled tools below if you don’t want to bother with updating the headers or source manually.

Finally, you need to make sure that your ulimits are ok:

/etc/security/limits.conf:* hard nofile 65535
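A quick way to confirm the limit actually took effect in the shell that launches the benchmark is to ask from Ruby. This is a minimal sketch assuming a Unix-like system; the helper name is mine, and 65535 is just the value from the limits.conf line above.

```ruby
# Returns true when the soft open-file limit covers the descriptors we need.
# The default argument reads the current process's soft limit.
def enough_descriptors?(needed, soft_limit = Process.getrlimit(Process::RLIMIT_NOFILE).first)
  soft_limit >= needed
end

soft, hard = Process.getrlimit(Process::RLIMIT_NOFILE)
puts "nofile soft=#{soft} hard=#{hard}"
puts 'raise the limit before benchmarking' unless enough_descriptors?(65_535, soft)
```

Note that limits.conf sets the hard limit; the soft limit may still need a ulimit -n in the launching shell.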

sources

My benchmark supervisor, pre-compiled tools, and charting script, as well as all the raw data generated by the test, are on Github. Please pardon my child-like R.

Hello-world sources are here:

All the testing ended up costing me $48.52 on Heroku and $23.25 on AWS. I would advise against repeating it to avoid troubling the Heroku ops team, but maybe if you have a real application to test…

ideal hdtv settings for xbox 360

My XBox 360 broke, and since my new one supported HDMI, I reworked the connection to the TV (a Samsung PN50A450 plasma). It’s tricky to get the best performance out of the combination so I wanted to mention it here.

scalers

Even though the HDMI connection is digital, both the XBox and the TV have hardware scalers that degrade the signal. The conversion chain works like this:

Game resolution (for Battlefield 3, 704p) → XBox HD resolution (for standard HD, 720p) → TV native resolution (for this Samsung, 768p)

Remember the XBox is essentially a Windows PC and games can choose whatever resolution they please. Now, in a normal PC, the resolution requested by the game would be transmitted directly to the monitor, and the monitor’s scaler would scale it. If the game chooses the monitor’s native resolution then there is no scaling.

Also remember that a 720p-labeled TV doesn’t necessarily mean the TV’s native resolution is 720p. It’s just 720p “class”.

We can’t eliminate scaling entirely because we can’t change the game’s resolution, but we can still remove one scaler from the chain by having the XBox scale to the native TV resolution. When that happens, the Samsung shuts off its internal scaler (which also handles some post-processing effects). This gives us much sharper detail and reduces the notorious HDTV display latency.
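The goal can be stated as counting the steps in the chain where the resolution changes. A tiny sketch, using the vertical resolutions quoted above (the method name is made up for illustration):

```ruby
# Count the scaling steps in a chain of vertical resolutions.
# A step only costs a scaler when the resolution actually changes.
def scaling_steps(chain)
  chain.each_cons(2).count { |from, to| from != to }
end

default_chain = [704, 720, 768] # game -> XBox output -> panel
native_chain  = [704, 768, 768] # XBox outputs the panel's native resolution

puts "default: #{scaling_steps(default_chain)} scalers"
puts "native:  #{scaling_steps(native_chain)} scaler"
```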

How to do it:

  1. Connect the XBox to the TV via HDMI on HDMI channel 2.
  2. Go into Settings → Console Settings → Display → HDTV Settings on the XBox and choose 1360 x 768. If the setting isn’t available, it means you’re connected to the wrong HDMI port.

Crispy pixels! This configuration also works for VGA output if your XBox doesn’t support HDMI.

You can also tell that you’re running at the native resolution via the TV’s menu, because the Detailed Settings option will be grayed out. This is because the TV’s scaler is not running.

colors

Now we need to open up the color response range of the TV while it’s in native resolution in order to return to high contrast. Since the scaler/post-processor is not running, my TV at least can’t do the usual 16-235 levels remapping for video signals.
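For reference, the 16-235 remapping is just a linear stretch of video levels onto the full 0-255 range. The sketch below shows that arithmetic in general terms; it is not the TV's actual processing, and the names are mine.

```ruby
VIDEO_BLACK = 16   # limited-range black level
VIDEO_WHITE = 235  # limited-range white level

# Stretch a limited-range (16-235) level to full range (0-255),
# clamping anything below black or above white.
def expand_level(value)
  scaled = ((value - VIDEO_BLACK) * 255.0 / (VIDEO_WHITE - VIDEO_BLACK)).round
  [[scaled, 0].max, 255].min
end

[0, 16, 128, 235, 255].each do |level|
  puts "#{level} -> #{expand_level(level)}"
end
```

When nothing in the chain performs this stretch, both ends have to agree on full-range levels instead, which is what the settings below do.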

Go into the service menu on the Samsung. (Dangerous! Stay away from anything that says “calibration” or your TV can become unusable).

  1. With the TV off, press MUTE 1 8 2 POWER on the remote.
  2. Using the up and down menu keys, choose ADC Target.
  3. Use the following settings for 1st PC, 2nd PC, and 2nd HDMI:
    1. Low: 0
    2. High: 255
    3. Delta: 0
  4. Press MUTE MUTE POWER on the remote to save your settings.

Go into Settings → Console Settings → Display → Reference Levels on the XBox and set it to Expanded. Also set HDMI Color Space to RGB.

references

Also note that the Samsung service menu will display the resolution the TV is running at, which is handy.

memcached gem performance across VMs

Thanks to Evan Phoenix, memcached.gem 1.3.2 is compatible with Rubinius again. I have added Rubinius to the release QA, so it will stay this way. 

The master branch is compatible with JRuby, but a JRuby segfault (as well as a mkmf bug) prevents it from working for most people.

vm comparison

Memcached.gem makes an unusual benchmark case for VMs. The gem is highly optimized in general, and specially optimized for MRI. This means it will tend to not reward speedups of “dumb” aspects of MRI because it doesn’t exercise them—contrary to many micro-benchmarks.

                                          user     system      total        real
JRuby-head
set: libm:ascii                       2.440000   1.760000   4.200000 (  8.284000)
get: libm:ascii                       [SEGFAULT]

RBX-head
set: libm:ascii                       1.387198   1.590912   2.978110 (  6.576674)
get: libm:ascii                       2.076829   1.705302   3.782131 (  7.237497)

REE 1.8.7-2011.03
set: libm:ascii                       1.130000   1.530000   2.660000 (  6.331992)
get: libm:ascii                       1.250000   1.540000   2.790000 (  6.142529)

Ruby 1.9.2-p290
set: libm:ascii                       0.860000   1.490000   2.350000 (  5.917467)
get: libm:ascii                       1.030000   1.580000   2.610000 (  6.238965)

JRuby’s performance is surprisingly OK, but only once Hotspot has been convinced to JIT the function to native code (which the benchmark does ahead of time). Rubinius’s performance is good. Ruby 1.9.2 is the fastest.
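The warm-up step matters enough to sketch. This is not the gem's benchmark script, just a minimal illustration of timing a block only after it has run enough times for a JIT to compile it; the Hash stands in for a memcached client and the iteration counts are arbitrary.

```ruby
# Time a block after a warm-up phase so a JIT can compile the hot path first.
# Returns elapsed wall-clock seconds for the measured iterations only.
def timed_after_warmup(warmup_iterations, iterations)
  warmup_iterations.times { yield }
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  iterations.times { yield }
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

cache = {} # stand-in for a memcached client
elapsed = timed_after_warmup(1_000, 10_000) { cache['key'] = 'value' }
puts format('set: %.6f seconds', elapsed)
```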

jruby client comparison

Curiously, memcached.gem is the fastest Ruby memcached client on every VM including JRuby. It is 70% faster than jruby-memcache-client, which wraps Whalin’s Java client via JRuby’s Java integration:

memcached 1.3.3
remix-stash 1.1.3
jruby-memcache-client 1.7.0
dalli 1.1.2
                                          user     system      total        real
set: dalli:bin                       10.720000   7.250000  17.970000 ( 17.859000)
set: libm:ascii                       2.440000   1.760000   4.200000 (  8.284000)
set: libm:bin                         2.280000   1.960000   4.240000 (  8.600000)
set: mclient:ascii                    4.150000   3.010000   7.160000 ( 11.879000)
set: stash:bin                        5.870000   2.970000   8.840000 ( 13.677000)
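The 70% figure can be checked against the total CPU column of the table. I am assuming here that the mclient rows are jruby-memcache-client and the libm rows are memcached.gem.

```ruby
# Total CPU seconds (user + system) from the table above.
libm_ascii_total    = 4.20 # memcached.gem, ascii protocol (assumed label mapping)
mclient_ascii_total = 7.16 # jruby-memcache-client (assumed label mapping)

# How much longer the slow client takes, as a whole-number percentage.
def percent_faster(fast, slow)
  ((slow / fast - 1) * 100).round
end

puts "#{percent_faster(libm_ascii_total, mclient_ascii_total)}% faster"
```

The gap is smaller on wall-clock time, since both clients spend part of each call waiting on the server.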

conclusion

This is great performance for C extensions in JRuby and Rubinius both. It’s handy that MRI’s extension interface is so simple.

One possible performance improvement remains in memcached.gem itself: rewriting the bundled copy of libmemcached to talk directly to Ruby instead of going through SWIG, which introduces memory copy overhead.

Also, someone needs to write a faster client for JRuby; there’s no reason why binding to a good native library like Whalin’s or xmemcached should be slow. It should be possible to equal the speed of memcached.gem on Ruby 1.9.

simplicity

Maximizing simplicity is the only guaranteed way to minimize software maintenance. Other techniques exist, but are situational. No complex system will be cheaper to maintain than a simple one that meets the same goals.

‘Simple’, pedantically, means ‘not composed of parts’. However! Whatever system you are working on may already be a part of a whole. Your output should reduce the number and size of parts over all, not just in your own project domain.

Electra at the Tomb of Agamemnon, Frederic Leighton

I’ve started asking myself, “does this add the least amount of new code?” A system in isolation may be quite simple, but if it duplicates existing functionality, it has increased complexity. The ideal change is subtractive, reducing the total amount of code: by collapsing features together, removing configuration, or merging overlapping components.

Better to put your configuration in version control you already understand, than introduce a remote discovery server. Better to use the crufty RPC library you already have, than introduce a new one with a handy feature—unless you entirely replace the old one.

Beware the daughter that aspires not to the throne of her mother.

performance engineering at twitter

A few weeks ago I gave a performance engineering talk at QCon Beijing/Tokyo. The abstract and slides are below.

abstract

Twitter has undergone exponential growth with very limited staff, hardware, and time. This talk discusses principles by which the wise performance engineer can make dramatic improvements in a constrained environment. Of course, these apply to any systems architect who wants to do more with less. Principles will be illustrated with concrete examples of successes and lessons learned from Twitter’s development and operations history.

slides

Performance Engineering at Twitter on Prezi

This is the first time I’ve used Prezi; the non-linear flow is compelling.

see it again sam

I will be giving the same talk this fall at QCon São Paulo and QCon San Francisco, so you can catch it there, and I think eventually the video will be online. This was also my first time speaking publicly in two years. Tons of new things to share with the world!

distributed systems primer, updated

Well, it’s been a long time. But! I have five papers to add to my original distributed systems primer:

coordination

CRDTs: Consistency Without Concurrency Control, Mihai Letia, Nuno Preguiça, and Marc Shapiro, 2009.

Guaranteeing eventual consistency by constraining your data structure, rather than adding heavyweight distributed algorithms. FlockDB works this way.
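To make the idea concrete, here is about the simplest CRDT there is: a grow-only counter. It is not the structure from the paper, and I am not claiming FlockDB is built this way; it only shows how constraining the data type makes merges safe in any order.

```ruby
# Grow-only counter: each node increments its own slot, and merging takes
# the per-node maximum. Merge is commutative, associative, and idempotent,
# so replicas converge without any coordination.
class GCounter
  attr_reader :counts

  def initialize(counts = {})
    @counts = counts.dup
  end

  def increment(node, by = 1)
    @counts[node] = @counts.fetch(node, 0) + by
    self
  end

  def value
    @counts.values.inject(0, :+)
  end

  def merge(other)
    GCounter.new(@counts.merge(other.counts) { |_node, a, b| [a, b].max })
  end
end

a = GCounter.new.increment(:node_a, 2)
b = GCounter.new.increment(:node_b, 3)
puts a.merge(b).value
puts b.merge(a).value
```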

partitioning

The Little Engines That Could: Scaling Online Social Networks, Josep M. Pujol, Vijay Erramilli, Georgos Siganos, Xiaoyuan Yang, Nikos Laoutaris, Parminder Chhabra, and Pablo Rodriguez, 2010.

Optimally partitioning overlapping graphs through lazy replication. Think of applying this technique at a cluster level, not just a server level.

Feeding Frenzy: Selectively Materializing Users’ Event Feeds, Adam Silberstein, Jeff Terrace, Brian F. Cooper, and Raghu Ramakrishnan, 2010.

Judicious session management and application of domain knowledge allow for optimal high-velocity mailbox updates in a memory grid. Twitter’s timeline system works this way.

systems integration

Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag, 2010.

Add a transaction-tracking, sampling profiler to a reusable RPC framework and get full stack visibility without performance degradation.

Forecasting MySQL Scalability with the Universal Scalability Law, Baron Schwartz and Ewen Fortune, 2010.

An example of data-driven scalability modeling in a concurrent system, via a least-squares regression approach.
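For anyone who has not met it, the Universal Scalability Law models relative capacity at concurrency N as C(N) = N / (1 + σ(N − 1) + κN(N − 1)), where σ is contention and κ is coherency delay. A small sketch of evaluating it follows; the coefficient values are invented for illustration, not taken from the paper.

```ruby
# Universal Scalability Law: relative capacity at concurrency n.
# sigma models contention (serialization), kappa models coherency delay.
def usl_capacity(n, sigma, kappa)
  n / (1.0 + sigma * (n - 1) + kappa * n * (n - 1))
end

# Concurrency at which modeled capacity peaks (requires kappa > 0).
def usl_peak_concurrency(sigma, kappa)
  Math.sqrt((1.0 - sigma) / kappa)
end

sigma = 0.05   # illustrative value only
kappa = 0.001  # illustrative value only
[1, 8, 32, 64].each do |n|
  puts format('N=%2d  C(N)=%.2f', n, usl_capacity(n, sigma, kappa))
end
puts format('peak near N=%.0f', usl_peak_concurrency(sigma, kappa))
```

In practice the two coefficients come from fitting measured throughput, which is the regression step the paper walks through.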

Happy scaling. Make sure to read the original post if you haven’t.

object allocations on the web

How many objects does a Rails request allocate? Here are Twitter’s numbers:

  • API: 22,700 objects per request
  • Website: 67,500 objects per request
  • Daemons: 27,900 objects per action

I want them to be lower. Overall, we burn 20% of our front-end CPU on garbage collection, which seems high. Each process handles ~29,000 requests before getting killed by the memory limit, and the GC is triggered about every 30 requests.
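Combining those numbers gives a feel for the churn. This is plain arithmetic on the figures above, and it assumes a process serves only one kind of request, which is a simplification.

```ruby
REQUESTS_PER_PROCESS = 29_000 # approximate lifetime before the memory limit
REQUESTS_PER_GC      = 30     # approximate requests between collections

OBJECTS_PER_REQUEST = { api: 22_700, website: 67_500 }.freeze

# Number of GC runs over a process lifetime, rounded to the nearest run.
gc_runs = (REQUESTS_PER_PROCESS.to_f / REQUESTS_PER_GC).round
puts "about #{gc_runs} GC runs per process"

# Objects allocated between two collections for each request type.
OBJECTS_PER_REQUEST.each do |kind, objects|
  puts "#{kind}: about #{objects * REQUESTS_PER_GC} objects allocated between collections"
end
```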

In memory-managed languages, you pay a performance penalty at object allocation time and also at collection time. Since Ruby lacks a generational GC (although there are patches available), the collection penalty is linear with the number of objects on the heap.

a note about structs and immediates

In Ruby 1.8, Struct instances use fewer bytes and allocate fewer objects than Hash and friends. This can be an optimization opportunity in circumstances where the Struct class is reusable.

A little bit of code shows the difference (you need REE or Sylvain Joyeux’s patch to track allocations):

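GC.enable_stats
# Report the allocation count and byte size of cloning obj.
# Stats are cleared before each clone so only the clone itself is measured.
def sizeof(obj)
  GC.clear_stats
  obj.clone
  puts "#{GC.num_allocations} allocations"
  GC.clear_stats
  obj.clone
  puts "#{GC.allocated_size} bytes"
end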

Let’s try it:

>> Struct.new("Test", :a, :b, :c)
>> struct = Struct::Test.new(1,2,3)
=> #<struct Struct::Test a=1, b=2, c=3>
>> sizeof(struct)
1 allocations
24 bytes

>> hash = {:a => 1, :b => 2, :c => 3}
>> sizeof(hash)
5 allocations
208 bytes

Watch out, though. The Struct class itself is expensive:

>> sizeof(Struct::Test)
29 allocations
1216 bytes

In my understanding, each key in a Hash is a VALUE pointer to another object, while each slot in a Struct is merely a named position.
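If you don't have a patched interpreter handy, a similar measurement can be approximated on a stock MRI. This sketch assumes Ruby 2.2 or later, where GC.stat(:total_allocated_objects) and ObjectSpace.memsize_of are available; the numbers will not match the 1.8 figures above, and memsize_of reports only the object's own footprint.

```ruby
require 'objspace'

# Count objects allocated while the block runs.
def allocations
  before = GC.stat(:total_allocated_objects)
  yield
  GC.stat(:total_allocated_objects) - before
end

Point  = Struct.new(:a, :b, :c) # hypothetical struct for the comparison
struct = Point.new(1, 2, 3)
hash   = { a: 1, b: 2, c: 3 }

puts "struct: #{allocations { struct.clone }} allocations, " \
     "#{ObjectSpace.memsize_of(struct)} bytes"
puts "hash:   #{allocations { hash.clone }} allocations, " \
     "#{ObjectSpace.memsize_of(hash)} bytes"
```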

Immediate types (Fixnum, nil, true, false, and Symbol) don’t allocate, except for Symbol. Symbol is interned and keeps its string representations on a special heap that is not garbage-collected.

your turn

If you have allocation counts from a production web application, I would be delighted to know them. I am especially interested in Python, PHP, and Java.

Python should be about the same as Ruby. PHP, though, discards the entire heap per-request in some configurations, so collection can be dramatically cheaper. And I would expect Java to allocate fewer objects and have a more efficient collection cycle.

scribe client

I’ve released Scribe 0.1, a Ruby client for the Scribe remote log server.

sudo gem install scribe

Usage is simple:

client = Scribe.new
client.log("I'm lonely in a crowded room.", "Rails")

Documentation is here.

about scribe

The primary benefit of Scribe over something like syslog-ng is increased scalability, because of Scribe’s fundamentally distributed architecture. Scribe also does away with the legacy syslog alert levels, and lets you define more application-appropriate categories on the fly instead.

Dmytro Shteflyuk has a good article about installing the Scribe server itself on OS X. It would be nice if someone would put it in MacPorts, but it may be blocked on the release of Thrift.

ree

We recently migrated Twitter from a custom Ruby 1.8.6 build to a Ruby Enterprise Edition release candidate, courtesy of Phusion. Our primary motivation was the integration of Brent’s MBARI patches, which increase memory stability.

Some features of REE have no effect on our codebase, but we definitely benefit from the MBARI patchset, the Railsbench tunable GC, and the various leak fixes in 1.8.7p174. These are difficult to integrate and Phusion has done a fine job.

testing notes

I ran into an interesting issue. Ruby is faster if compiled with -Os (optimize for size) than with -O2 or -O3 (optimize for speed). Hongli pointed out that Ruby has poor instruction locality and benefits most from squeezing tightly into the instruction cache. This is an unusual phenomenon, although probably more common in interpreters and virtual machines than in “standard” C programs.

I also tested a build that included Joe Damato’s heaped thread frames, but it would hang Mongrel in rb_thread_schedule() after the first GC run, which is not exactly what we want. Hopefully this can be integrated later.

benchmarks

I ran a suite of benchmarks via Autobench/httperf and plotted them with Plot. The hardware was a 4-core Xeon machine with RHEL5, running 8 Mongrels balanced behind Apache 2.2. I made a typical API request that is answered primarily from composed caches.

As usual, we see that tuning the GC parameters has the greatest impact on throughput, but there is a definite gain from switching to the REE bundle. It’s also interesting how much the standard deviation is improved by the GC settings. (Some data points are skipped due to errors at high concurrency.)

upgrading

Moving from 1.8.6 to REE 1.8.7 was trivial, but moving to 1.9 will be more of an ordeal. It will be interesting to see what patches are still necessary on 1.9. Many of them are getting upstreamed, but some things (such as tcmalloc) will probably remain only available from 3rd parties.

All in all, good times in MRI land.

memcached gem release

One of the hardest gems to install is no more. It’s now easy to install!

Memcached 0.15 features:

  • Update to libmemcached 0.31.1
  • Bundle libmemcached itself with the gem (antifuchs)
  • UDP connection support
  • Unix domain socket support (hellvinz)
  • AUTO_EJECT_HOSTS bugfixes (mattknox)

Install with gem install memcached. Since libmemcached is bundled in, there are no longer any dependencies.

on coordination

Andreas Fuchs suggested several months ago that I include libmemcached itself in the gem, but at the time I resisted. I was wrong.

My opposition was based on the idea that libmemcached itself would be an integration point, so running multiple versions on a system would be bad.

In real life, the hash algorithm became the integration point, not the library itself. And since the library’s ABI kept changing, the gem always required a very specific custom build. This annoyed the public and caused extra work for my operations team, who had to make sure to upgrade both the library and the gem at the same time.
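Why the hash algorithm is the real integration point is easy to show. In this sketch two clients map keys to servers by hashing modulo the server count; they agree on where a key lives only if they use the same hash. The server names and the two hash choices are invented for illustration and are not libmemcached's actual distribution code.

```ruby
require 'zlib'
require 'digest/md5'

SERVERS = %w[cache1:11211 cache2:11211 cache3:11211].freeze # hypothetical hosts

# Two stand-in hash functions that turn a key into an integer.
HASHES = {
  crc32: ->(key) { Zlib.crc32(key) },
  md5:   ->(key) { Digest::MD5.hexdigest(key)[0, 8].to_i(16) }
}.freeze

# Pick the server for a key under the given hash algorithm.
def server_for(key, algorithm)
  SERVERS[HASHES.fetch(algorithm).call(key) % SERVERS.size]
end

keys = (1..1000).map { |i| "user:#{i}" }
disagreements = keys.count { |k| server_for(k, :crc32) != server_for(k, :md5) }
puts "#{disagreements} of #{keys.size} keys land on different servers"
```

Two library versions with the same hash can share a cache pool; two clients with different hashes cannot, whatever library they link.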

Updates can come thick and fast now because I don’t have to worry about publishing custom builds or waiting for the libmemcached developers to merge my patches.

In retrospect it seems obvious—it’s always a win to remove coordination from a system.

linker woes

Unfortunately, it was easier to make that decision than it was to implement it. Linux and OS X link libraries differently, and I had a lot of trouble making sure that no system-installed version of libmemcached would get linked, instead of the custom one built during gem install.

When you link a shared object, OS X seems to maintain a reference to the original .dylib. Linux does not, and depends on ldconfig and LD_LIBRARY_PATH to find the object at runtime. Since you can’t modify the shell environment from within a running process, there’s no way to override LD_LIBRARY_PATH, so I needed to statically link libmemcached into the gem’s own .so or .bundle.

The only way I could do this on both systems was to configure libmemcached with CFLAGS=-fPIC --disable-shared, rename the libmemcached.* static object files to libmemcached_gem.*, and pass -lmemcached_gem to the linker rather than -lmemcached. Otherwise the linker would prefer the system-installed dynamic objects, even with the correct paths and -static option set.
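The renaming step is the part that is easy to get subtly wrong, so here is a sketch of just that mapping. It is not the gem's extconf.rb; the helper name and the example paths are hypothetical.

```ruby
# Map a static library path to its gem-private name, so that
# -lmemcached_gem can never resolve to a system-installed libmemcached.
# Only files named exactly libmemcached.<ext> are renamed.
def gem_private_name(path)
  dir, base = File.split(path)
  File.join(dir, base.sub(/\Alibmemcached\./, 'libmemcached_gem.'))
end

LINKER_FLAG = '-lmemcached_gem'.freeze

%w[lib/libmemcached.a lib/libmemcached.la lib/libmemcachedutil.a].each do |path|
  puts "#{path} -> #{gem_private_name(path)}"
end
puts "link with #{LINKER_FLAG}"
```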

Note that you can check what objects a binary has linked to via otool -L on OS X, and ldd on Linux.

Feel free to look at the extconf.rb source and let me know if there’s a better way to do this.
