Well, it’s been a long time. But! I have five papers to add to my original distributed systems primer:
coordination
CRDTs: Consistency Without Concurrency Control, Mihai Letia, Nuno Preguiça, and Marc Shapiro, 2009.
Guaranteeing eventual consistency by constraining your data structure, rather than adding heavyweight distributed algorithms. FlockDB works this way.
partitioning
The Little Engines That Could: Scaling Online Social Networks, Josep M. Pujol, Vijay Erramilli, Georgos Siganos, Xiaoyuan Yang Nikos Laoutaris, Parminder Chhabra, and Pablo Rodriguez, 2010.
Optimally partitioning overlapping graphs through lazy replication. Think of applying this technique at a cluster level, not just a server level.
Feeding Frenzy: Selectively Materializing Users’ Event Feeds, Adam Silberstein, Jeff Terrace, Brian F. Cooper, and Raghu Ramakrishnan, 2010.
Judicious session management and application of domain knowledge allow for optimal high-velocity mailbox updates in a memory grid. Twitter’s timeline system works this way.
systems integration
Dapper, a Large-Scale Distributed Systems Tracing Infrastructure, Benjamin H. Sigelman, Luiz André Barroso, Mike Burrows, Pat Stephenson, Manoj Plakal, Donald Beaver, Saul Jaspan, and Chandan Shanbhag, 2010.
Add a transaction-tracking, sampling profiler to a reusable RPC framework and get full stack visibility without performance degradation.
Forecasting MySQL Scalability with the Universal Scalability Law, Baron Schwartz and Ewen Fortune, 2010.
An example of data-driven scalability modeling in a concurrent system, via a least-squares regression approach.
Happy scaling. Make sure to read the original post if you haven’t.