distributed systems primer

I’ve been reading a bunch of papers about distributed systems recently, in order to help systematize for myself the thing that we built over the last year. Many of them were originally passed to me by Toby DiPasquale. Here is an annotated list so everyone can benefit.

It helps if you have some algorithms literacy, or have built a system at scale, but don’t let that stop you.

prelude

The Death of Architecture, Julian Browne, 2007.

First, a reminder of what it means to build a system that solves a business problem. Browne built the space-based billing system at Virgin Mobile, one of the most well-known SBAs outside the financial and research industries.

That lovely diagram showing clean service-oriented interfaces, between unified systems of record, holding clean data, performing well-defined business processes is never going to be….You have to roll up your sleeves, talk to the business analysts, developers, operations and make a contribution that makes those boxes and arrows real.

System failures in the web world are most often due to inflated technical requirements and integration risks, not an incorrect decision to skip two-phase commit.

constraints

The Case for Shared Nothing, Michael Stonebraker, 1985.

The source of the shared-nothing paradigm, and importantly, its alternatives. Shared-nothing is a nice hammer, but not every problem is a nail.

Harvest, Yield and Scalable Tolerant systems, Eric Brewer, 1999.

The CAP theorem. Sometimes you just can’t get what you want. (This is related to the Dynamo work, below.)

Distributed Computing Economics, Jim Gray, 2003.

How to predict the cost of the thing you want to build. Via some napkin math, Gray shows why making the cloud cost-efficient for current problems continues to be a struggle.

coordination

Guardians and Actions: Linguistic Support for Robust, Distributed Programs, Barbara Liskov and Robert Scheifler, 1983.

Two-phase commit. Making sure what you plan to do will get done.

Time, Clocks and the Ordering of Events in a Distributed System, Leslie Lamport, 1978.

Distributed systems are inherently relative; there is no privileged position that can describe all events exactly. Lamport clocks (and the closely related vector clocks) let participants agree on the order of events in the world, if you need to care about that.

Paxos Made Simple, Leslie Lamport, 2001.

The consensus problem: how can potentionally faulty processes agree about an element of global state? The Paxos algorithm guarantees correctness during a failure of a minority of nodes. This paper is difficult, but important for the subtleties it reveals.

Also see Paxos Made Live for a discussion of the implementation in Google’s Chubby v2 coordination server. Real life introduces many unfortunate kinks.

encapsulation

Generative Communication in Linda, David Gelernter, 1985.

Tuple spaces, a.k.a. the blackboard pattern, a.k.a. spaced-based architecture. Coordinating a system through the data, not through the behaviors. I will be writing a lot more about this in the future.

partitioning

Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, Ion Stoica, et al., 2001.

The original distributed hash table paper. Introduced consistent hashing for robust pool rebalancing.

Frangipani: A Scalable Distributed File System, Chandramohan A. Thekkath, Timothy Mann and Edward K. Lee, 1998.

Classic paper regarding modern-style distributed filesystems.

systems integration

Dynamo: Amazon’s Highly Available Key-Value Store, Giuseppe DeCandia, et al., 2007.

A key-value store that spawned numerous clones, it integrated many fundamental ideas from the above works into an actual running system. Cassandra is the most featureful open-source version.

(Also see Tokyo Cabinet. Not a Dynamo clone, per se, but it’s the next most practical alternative aside from MySQL. Tokyo is a networked BerkeleyDB replacement, so the domain code must handle the distributed aspects.)

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services, Matt Welsh, David Culler, and Eric Brewer, 2001.

Great paper on managing the interactions between individual components, and ways to build a well-conditioned service under unpredictable load.

conclusion

That’s all really. Focus on the ideas, not the implementations. Try to see the patterns in existing systems, rather than imagining how they “should” work if everything was perfect. Then you’ll be able to scale to the moon.

The moon is above the cloud and therefore obviously more scaleable.

Update: five additional papers are here.

peeping into memcached

Memcached is generally treated as a black box. But what if you really need to know what’s in there? Not for runtime purposes, but for optimization and capacity planning?

demo

$ sudo peep --pretty 2479

time | exptime | nbytes | clsid |                 key | exprd | flushd
8658 |  613458 |    272 |     5 |      "c3RhdH:171:5" | false |  false
8658 |       0 |      6 |     1 |  "current_c3RhdH:3" | false |  false
8658 |  613458 |    281 |     5 |     "c3RhdH:171:26" | false |  false
8678 |   95078 |      6 |     1 |  "User:1:auth:m4Uq" | false |  false
8658 |       0 |      8 |     2 |   "user_dGltZWxp:4" | false |  false
8686 |  613486 |   1278 |     9 |          "User:1:6" | false |  false
8658 |  613458 |   1286 |     9 |          "User:1:4" | false |  false
...
8658 |  613458 |    283 |     5 |     "c3RhdH:171:28" | false |  false
8658 |  613458 |    277 |     5 |     "c3RhdH:171:30" | false |  false

Peep uses ptrace to freeze a running memcached server, dump the internal key metadata, and return the server to a running state. If you have a good host ejection mechanism in your client, such as in the Twitter libmemcached builds, you won’t even have to change the production server pool. The instance is not restarted, and no data is lost.

Installation instructions are here. Requires Linux.

a note about heap management

Memcached has two separate memory management strategies:

  • On read, if a key is past its expiry time, return NOT FOUND.
  • On write, choose an appropriate slab class for the value; if it’s full, replace the oldest-used (read or written) key with the new one.

Note that the second strategy, LRU eviction, does not check the expiry time at all. This makes eviction an O(1) operation, because the tail of the LRU linked list is immediately available; checking within that list for evicted keys first would raise the order to O(N). This optimization has subtle implications.

the problem

In the bad old days, we added a page cache to Twitter via memcached. We used a generational key scheme: a piece of user metadata was incremented on every change, and used to construct dependent keys. This eliminated direct invalidation concerns.

However, because of the orthogonal expiry and LRU strategies, we knew that the generational keys were affecting our evictions of direct keys. The continual massive onslaught of new generational keys would sweep out older direct keys, even when the generational keys were expired, but the direct keys were fresh.

But how could we prove it was happening? Eventually we ran peep on a representative server. Aggregate results showed that 75% of our cache pool was spent on expired page caches, and our cache horizon was only 5 hours. Moving the generational keys to a separate pool raised our cache horizon to two days, a 10x improvement.

aggregating results

The best way to aggregate peep results is to use MySQL’s excellent grouping abilities. Run peep with the --ugly (unformatted) switch, and pipe it to a file. Then you can load it into an approprate MySQL schema:

CREATE TABLE entries (
  lru_time int(11) default NULL,
  expires_at_time int(11) default NULL,
  value_size int(11) default NULL,
  suffix_size int(11) default NULL,
  it_flag varchar(255) default NULL,
  class_id int(11) default NULL,
  key_size int(11) default NULL,
  key_name varchar(255) default NULL,
  is_expired varchar(255) default NULL,
  is_flushed varchar(255) default NULL
) ENGINE=InnoDB DEFAULT CHARSET=utf8;

LOAD DATA LOCAL INFILE '/tmp/peep.txt' INTO TABLE entries
  FIELDS TERMINATED BY ' | ' LINES TERMINATED BY '\n';

This will load very fast, even for millions of keys on a laptop; LOAD LOCAL INFILE is an optimized pipeline.

Here are some handy queries for revealing cache properties:

fresh vs. expired values, by count and size

SELECT
  is_expired,
  COUNT(*),
  SUM(value_size)
FROM entries
GROUP BY is_expired;

+------------+----------+-----------------+
| is_expired | COUNT(*) | SUM(value_size) |
+------------+----------+-----------------+
|  false     |   688883 |       759401296 |
|  true      |   472755 |      2430092660 |
+------------+----------+-----------------+

histogram of values by size

SELECT
  ROUND(ROUND(LOG10(value_size) * 5, 0) / 5, 1) AS log,
  ROUND(AVG(value_size), 0) AS size,
  COUNT(*),
  RPAD('', COUNT(*) / 6500, '*') AS bar
FROM entries
GROUP BY log
ORDER BY log DESC;

+------+--------+----------+----------------------------------------+
| log  | size   | COUNT(*) | bar                                    |
+------+--------+----------+----------------------------------------+
|  5.6 | 422238 |        3 |                                        |
...
|  1.6 |     39 |     3652 | *                                      |
|  1.4 |     23 |   103434 | ************************************** |
|  1.2 |     14 |     6369 | *                                      |
|  1.0 |     11 |    36588 | ******                                 |
|  0.8 |      6 |    55795 | *********                              |
|  0.6 |      4 |    90332 | **************                         |
|  0.4 |      2 |      239 |                                        |
+------+--------+----------+----------------------------------------+

histogram of values by LRU age

SELECT
  ROUND(ROUND(LOG10(
    (SELECT MAX(lru_time) FROM entries)
    - lru_time) * 5, 0) / 5, 1) AS log,
  ROUND(AVG(
    (SELECT MAX(lru_time) FROM entries)
    - lru_time), -2) AS lru_age,
  COUNT(*),
  RPAD('', COUNT(*) / 6500, '*') AS bar
FROM entries
GROUP BY log
ORDER BY log DESC;

+------+-----------+----------+-------------------------------------+
| log  |   lru_age | COUNT(*) | bar                                 |
+------+-----------+----------+-------------------------------------+
|  4.6 |     34800 |    18064 | ***                                 |
|  4.4 |     24200 |    96739 | ***************                     |
|  4.2 |     15700 |   212865 | *********************************   |
|  4.0 |     10200 |   224703 | *********************************** |
|  3.8 |      6500 |   158067 | ************************            |
|  3.6 |      4100 |   108034 | *****************                   |
...
|  0.4 |         0 |      672 |                                     |
|  0.0 |         0 |      319 |                                     |
| NULL |         0 |    13400 | **                                  |
+------+-----------+----------+-------------------------------------+

histogram of values by size for a namespace

SELECT
  key_name,
  ROUND(ROUND(LOG10(value_size) * 5, 0) / 5, 1) AS log,
  ROUND(AVG(value_size), 0) AS size,
  COUNT(*),
  RPAD('', COUNT(*) / 10000, '*') AS bar
FROM entries
WHERE key_name LIKE '%chunk:%'
GROUP BY log
ORDER BY log desc;

+----------------------+------+-------+----------+----------------------+
| key_name             | log  | size  | COUNT(*) | bar                  |
+----------------------+------+-------+----------+----------------------+
|  reply_chunk:69913   |  3.4 |  2511 |     1032 | **                   |
|  user_chunk:1868495  |  3.2 |  1530 |      972 | **                   |
...
|  user_chunk:1405137  |  1.6 |    40 |     2822 | ******               |
|  reply_chunk:2084579 |  1.4 |    24 |     4361 | *********            |
|  user_chunk:2455162  |  1.2 |    14 |     6141 | ************         |
|  user_chunk:5989656  |  1.0 |    12 |     2477 | *****                |
|  user_chunk:2268781  |  0.8 |     6 |    16527 | ******************** |
+----------------------+------+-------+----------+----------------------+

slab allocation counts by class

SELECT
  class_id AS slab_class,
  MAX(value_size) AS slab_size,
  CEIL(COUNT(*) / ((1024 * 1024) / MAX(value_size))) AS allocated,
  (
    (SELECT COUNT(*) FROM entries
     WHERE class_id = slab_class AND is_expired = "true")
   * 100
   /
    (SELECT COUNT(*)
     FROM entries
     WHERE class_id = slab_class)
  ) AS `%_expired`
FROM entries
GROUP BY class_id
ORDER BY class_id;

+------------+-----------+-----------+-----------+
| slab_class | slab_size | allocated | %_expired |
+------------+-----------+-----------+-----------+
|          1 |         8 |         1 |   99.8938 |
|          2 |        32 |         5 |    6.0320 |
|          3 |        71 |         9 |   36.4859 |
...
|         34 |    187783 |        28 |   61.2903 |
|         35 |    234205 |       130 |    1.8998 |
|         36 |    286110 |         7 |    0.0000 |
|         37 |    309832 |         2 |   75.0000 |
|         38 |    397500 |         1 |  100.0000 |
|         39 |    496014 |         1 |  100.0000 |
+------------+-----------+-----------+-----------+

This is comparable to the memcached stats slabs command, but lets you investigate things like expired ratio by slab, etc.

Play around and see what else you can come up with.

conclusion

Generational keys are often presented as the easy way to cache, but consider their implications. How else are you abusing your cache? Some suggested things to check:

  • Keys without expiry times (which will bite you when you alter the server pool)
  • Excessive use of very large slabs
  • Uneven LRU age by slab (indicating poor slab allocation)
  • Unexpected namespaces (revealing invalidation bugs)

The only unfortunate thing about peep right now is that it blocks the memcached server. I could do an incremental version, but it would be quite slow. The current version is already CPU-intensive, taking 15 minutes to dump a 4GB memcached on modern hardware (turn off your client-facing processes if you run mixed front-ends!). A version in C would be much quicker, if someone is up for it, but for now the Ruby version is sufficient.

Happy peeping.

ruby gc tuning

I’d like to call out something important from my QCon slides: the Railsbench GC settings.

quick study

In my experience, a typical production Rails app on Ruby 1.8 can recover 20% to 40% of user CPU by applying Stefan Kaes’s Railsbench GC patch to the Ruby binary, and using the following environment variables:

RUBY_HEAP_MIN_SLOTS=500000
RUBY_HEAP_SLOTS_INCREMENT=250000
RUBY_HEAP_SLOTS_GROWTH_FACTOR=1
RUBY_GC_MALLOC_LIMIT=50000000

In return, you get a minor increase in peak memory consumption. Not too shabby.

gc behavior

Ruby’s GC is non-generational, which means that the GC looks at every object every time it runs. Java, OCaml, and other static languages have a generational GC, which means that only recent allocations are frequently checked (the older an object is, the less likely it is to be freeable). But Ruby pays a high cost each GC run because it looks at all old objects too; for example, the Rails code itself stored in the AST.

Writing a generational GC is quite difficult, but there is one thing we can do: run the GC less frequently. This is what Stefan’s patches allow.

how to tune your app

If you install this patch from Sylvain Joyeux, you can add object allocation deltas to your Rails log, as well as use Stefan’s GC tracking methods. This gives you visibility into exactly when the GC runs and how much useful work it does. Spin up ab and script a representative page. Also start top in another shell to watch total memory in the Rails process.

Now, do a simulated annealing-like search through the available environment settings.

example

Here is one of our test runs with Ruby’s default GC settings (we use a custom logger at Twitter, but you should be able to arrive at similar output):

tcpu:0.154 alloc:63926 delta:-66290 gccpu:0.078
tcpu:0.067 alloc:63640 delta:63645 gccpu:0.000
tcpu:0.146 alloc:63788 delta:-68896 gccpu:0.078
tcpu:0.060 alloc:63779 delta: 63784 gccpu:0.000
tcpu:0.138 alloc:63787 delta:-75152 gccpu:0.072
tcpu:0.059 alloc:63779 delta: 63784 gccpu:0.000
tcpu:0.138 alloc:63787 delta:-77591 gccpu:0.072
tcpu:0.062 alloc:63779 delta: 63784 gccpu:0.000

With the conservative Ruby defaults, the GC runs every other request, and takes 0.075 seconds, giving us a per-request GC cost of 0.038 seconds—40% of the entire request time.

This is excessive. But if you explore a bit, you will quickly arrive at something like RUBY_HEAP_MIN_SLOTS=500000 RUBY_HEAP_SLOTS_INCREMENT=250000 RUBY_HEAP_SLOTS_GROWTH_FACTOR=1 RUBY_GC_MALLOC_LIMIT=50000000, which is what we use at Twitter.

These particular settings mean:

  • Start with enough memory to hold Rails (Ruby’s default is practically nothing)
  • Increase it linearly if you need more (Ruby’s default is exponential increase)
  • Only garbage-collect every 50 million malloc calls (Ruby’s default is 6x smaller)

Here are the GC timings for the same ab run with these settings applied:

tcpu:0.181 alloc:63829 delta:-763708 gccpu:0.118
tcpu:0.067 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63777 delta: 63782 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.058 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.063 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.058 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.063 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.059 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.185 alloc:63828 delta:-761854 gccpu:0.119
tcpu:0.069 alloc:63777 delta: 63782 gccpu:0.000
tcpu:0.065 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.058 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.061 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.059 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.060 alloc:63776 delta: 63781 gccpu:0.000
tcpu:0.062 alloc:63775 delta: 63780 gccpu:0.000

Woah! Now the GC runs only every 13 requests, at a slightly higher cost, for a per-request cost of 0.009 seconds. This translates to a general speedup of 34%. The frequency of GC calls corresponds quite directly to the change in RUBY_GC_MALLOC_LIMIT, but if we increase it much more the memory usage balloons.

further reading

Unfortunately, there’s not much clear information on the Ruby garbage collector. But here are a few resources:

We are now experimenting with Brent’s MBARI branch of Ruby 1.8.6, kindly sponsored by EngineYard. So far it looks excellent, expecially in combination with Stefan’s patches. I will publish some results soon.

qcon presentation

My QCon presentation is available.

Improving Running Components at Twitter

Some choice Tweets:

  • philwills: Evan Weaver on scaling twitter at #qcon was full of
    interesting stuff and good questions from audience.
  • markhneedham: fascinating reading these stats about #twitter
    from Evan Weaver’s talk #qcon
  • jurgenonland: sitting at a presentation from Evan Weaver @
    #qcon, wow he must be verry unhappy at his work
  • szegedi: Listening to Evan Weaver talking about Twitter system
    architecture & tuning. Getting to learn from these experiences is priceless.
  • oudenampsen: Was just by Evan Weaver of twitter. Gave the impression that any time he could commit suicide. However interesting.

My presentation abilities have gone from “bad” to “tolerable”, so I’m relatively satisfied with the situation. Clearly I need to be more engaging.

secret codes

Here are some secret codes I am involved with. They are some of the best codes recently coded.

kestrel

A replacement for Starling, the distributed message queue. Written on the JVM (Scala) because of the mature garbage collector. Has a constant performance profile regardless of the size of the queue.

memcached/libmemcached pre-builds

The C library and the Ruby client as a matched pair. Much improved failure handling over the public builds, and Ketama hashing is built-in as always. Switching from ruby-memcache to memcached at Twitter effectively halved our cluster CPU load. (These changes are getting upstreamed, so you can also just wait.)

cache-money

A Rails object cache layer for ActiveRecord. It can do primary key lookups and single index queries solely out of cache. Especially nice if you have replication lag.

peep

A heap inspector for live memcached server instances. They said it couldn’t be done.

thinking sphinx

Pat Allan has been working hard on his Sphinx plugin. He’s adding the missing enterprisey features from Ultrasphinx, so that it can finish its years sleeping gently in legacy apps. (Facets, and non-invasive delta indexing.)

bybusyness

Apache 2.2.10 got a new fair balancing algorithm. Smooths out latencies introduced by single-threaded backends with high standard deviation, such as Mongrel/Rails.

postscripts

Speaking of, I’m making sporadic progress on Mongrel 2, but don’t hold your breath. The main improvement will be Rack support.

A downside of my position at Twitter is meager open-source progress. But hey, we’re hiring (must be in SF or willing to move).

My RSI is slowly going away. The only real solution is to type less. And make sure to keep totally blasting your pecs.

if you’re going to san francisco

As some people already know, I’m moving to San Francisco in early September. One thing I had trouble finding was a map of the San Francisco microclimates. But here’s one:

It’s from this book.

I’ll be living on the border of North Beach and Russian Hill (which is in the warm zone, 6). No place in San Francisco is really “hot”, despite what that map says. But the patterns of fog and relative coolness are correct.

postscript

Sorry that my blogging and open-source output has been light; I’m recovering from serious RSI problems. The fact that you see this post at all is a sign that things are getting better, though.

echoe 3

Echoe 3 is out, right on the heels of Rubygems 1.2. It supports the new runtime vs. development dependencies, and works correctly with the Rubyforge 1.0.0 gem.

It still supports all the usual features like certificate chains, RDoc upload, changeset parsing, manifest building, and cross-packaging. Documentation is here.

By the way, Rubygems 1.2 seems pretty great.

fauna projects on github

Fauna projects are now on GitHub. Go to it.

The Subversion repos are going away, but the gems and forums will remain on RubyForge.

You can also follow Snax/Fauna activity on Twitter.

xapian search plugin

Francis Irving sent me a note about his work on a new Rails search plugin, acts_as_xapian. It uses the Xapian engine, which is a C++ indexer similar to Lucene. A particularly neat feature is built-in spellcheck.

I still plan to benchmark all these plugins on the Wikipedia dataset…it’s been delayed by the new job. If anyone has a big piece of iron I could use for a couple weeks I would appreciate it (16GB ram, hundreds of GB of free diskspace, no production load).

rubygems memory patch

This patch against RubyGems 1.1.1 improves memory usage by not keeping every unused gemspec permanently in memory. It should have low CPU impact as long as you do your gem requires up-front.

For MacPorts:

cd /opt/local/lib/ruby/site_ruby/1.8
curl http://evanweaver.files.wordpress.com/2010/12/rubygems-memory-1_1_1.diff \
  | sudo patch -p0

Incidentally, I used BleakHouse to track down which references were getting retained.

Follow

Get every new post delivered to your Inbox.

Join 66 other followers