distributed systems primer

I’ve been reading a bunch of papers about distributed systems recently, in order to help systematize for myself the thing that we built over the last year. Many of them were originally passed to me by Toby DiPasquale. Here is an annotated list so everyone can benefit.

It helps if you have some algorithms literacy, or have built a system at scale, but don’t let that stop you.

prelude

The Death of Architecture, Julian Browne, 2007.

First, a reminder of what it means to build a system that solves a business problem. Browne built the space-based billing system at Virgin Mobile, one of the most well-known SBAs outside the financial and research industries.

That lovely diagram showing clean service-oriented interfaces, between unified systems of record, holding clean data, performing well-defined business processes is never going to be….You have to roll up your sleeves, talk to the business analysts, developers, operations and make a contribution that makes those boxes and arrows real.

System failures in the web world are most often due to inflated technical requirements and integration risks, not an incorrect decision to skip two-phase commit.

constraints

The Case for Shared Nothing, Michael Stonebraker, 1985.

The source of the shared-nothing paradigm, and importantly, its alternatives. Shared-nothing is a nice hammer, but not every problem is a nail.

Harvest, Yield and Scalable Tolerant systems, Eric Brewer, 1999.

The CAP theorem. Sometimes you just can’t get what you want. (This is related to the Dynamo work, below.)

Distributed Computing Economics, Jim Gray, 2003.

How to predict the cost of the thing you want to build. Via some napkin math, Gray shows why making the cloud cost-efficient for current problems continues to be a struggle.

coordination

Guardians and Actions: Linguistic Support for Robust, Distributed Programs, Barbara Liskov and Robert Scheifler, 1983.

Two-phase commit. Making sure what you plan to do will get done.

Time, Clocks and the Ordering of Events in a Distributed System, Leslie Lamport, 1978.

Distributed systems are inherently relative; there is no privileged position that can describe all events exactly. Lamport clocks (and the closely related vector clocks) let participants agree on the order of events in the world, if you need to care about that.

Paxos Made Simple, Leslie Lamport, 2001.

The consensus problem: how can potentionally faulty processes agree about an element of global state? The Paxos algorithm guarantees correctness during a failure of a minority of nodes. This paper is difficult, but important for the subtleties it reveals.

Also see Paxos Made Live for a discussion of the implementation in Google’s Chubby v2 coordination server. Real life introduces many unfortunate kinks.

encapsulation

Generative Communication in Linda, David Gelernter, 1985.

Tuple spaces, a.k.a. the blackboard pattern, a.k.a. spaced-based architecture. Coordinating a system through the data, not through the behaviors. I will be writing a lot more about this in the future.

partitioning

Chord: A Scalable Peer-to-peer Lookup Service for Internet Applications, Ion Stoica, et al., 2001.

The original distributed hash table paper. Introduced consistent hashing for robust pool rebalancing.

Frangipani: A Scalable Distributed File System, Chandramohan A. Thekkath, Timothy Mann and Edward K. Lee, 1998.

Classic paper regarding modern-style distributed filesystems.

systems integration

Dynamo: Amazon’s Highly Available Key-Value Store, Giuseppe DeCandia, et al., 2007.

A key-value store that spawned numerous clones, it integrated many fundamental ideas from the above works into an actual running system. Cassandra is the most featureful open-source version.

(Also see Tokyo Cabinet. Not a Dynamo clone, per se, but it’s the next most practical alternative aside from MySQL. Tokyo is a networked BerkeleyDB replacement, so the domain code must handle the distributed aspects.)

SEDA: An Architecture for Well-Conditioned, Scalable Internet Services, Matt Welsh, David Culler, and Eric Brewer, 2001.

Great paper on managing the interactions between individual components, and ways to build a well-conditioned service under unpredictable load.

conclusion

That’s all really. Focus on the ideas, not the implementations. Try to see the patterns in existing systems, rather than imagining how they “should” work if everything was perfect. Then you’ll be able to scale to the moon.

The moon is above the cloud and therefore obviously more scaleable.

Update: five additional papers are here.

8 responses

  1. Great collection of papers. However, a tiny nitpick. Lamport clocks are not vector clocks—Lamport clocks are single valued and allow you to infer the causal order of messages. However, Lamport clocks induce a total order on messages which actually gives less information.

    Vector clocks can also tell you if two messages are not causally related as the ordering is not total – it is possible that a message can be ordered neither before not after another. It is easy to move from vector clocks to a total order (based, for example, on node ids), but not the other way around.

  2. Thanks for omitting the Bigtable paper from the list. It contributes little useful information relative to the amount of attention it receives elsewhere.

  3. Good selection of papers. Most of those I haven’t read so that’s this weekend sorted. Impeccable choice of intro too :) (wondered where all that extra traffic was coming from).

  4. Thanks for the list. I’ve just returned from “Death of Architecture” and can only congratulate its author (and evan for choosing this as a prelude).