snax

up and running with cassandra

Cassandra is a hybrid non-relational database in the same class as Google’s BigTable. It is more featureful than a key/value store like Riak, but supports fewer query types than a document store like MongoDB.

Cassandra was started by Facebook and later transferred to the open-source community. It is an ideal runtime database for web-scale domains like social networks.

This post is both a tutorial and a “getting started” overview. You will learn about Cassandra’s features, data model, API, and operational requirements—everything you need to know to deploy a Cassandra-backed service.

April 28, 2011: post updated for Cassandra gem 0.10 and Cassandra version 0.7.

features

There are a number of reasons to choose Cassandra for your website. Compared to other databases, three big features stand out:

Some other features that help put Cassandra above the competition :

installation

You need a Unix system. If you are using Mac OS 10.5, all you need is Git. Otherwise, you need to install Java 1.6, Git 1.6, Ruby, and Rubygems in some reasonable way.

Start a terminal and run:

sudo gem install cassandra

If you are using Mac OS, you need to export the following environment variables:

export JAVA_HOME="/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home"
export PATH="/System/Library/Frameworks/JavaVM.framework/Versions/1.6/Home/bin:$PATH"

Now you can build and start a test server with cassandra_helper:

cassandra_helper cassandra

It runs! In another terminal run:

cassandra_helper data:load

Now your schema is loaded too.

live demo

To insert some data and make some queries, open another terminal window and start irb, the Ruby shell:

irb

In the irb prompt, require the Ruby client library:

require 'rubygems'
require 'cassandra'
include SimpleUUID

Now instantiate a client object:

twitter = Cassandra.new('Twitter')

Let’s insert a few things:

user = {'screen_name' => 'buttonscat'}
twitter.insert(:Users, '5', user)

tweet1 = {'text' => 'Nom nom nom nom nom.', 'user_id' => '5'}
twitter.insert(:Statuses, '1', tweet1)

tweet2 = {'text' => '@evan Zzzz....', 'user_id' => '5', 'reply_to_id' => '8'}
twitter.insert(:Statuses, '2', tweet2)

Notice that the two status records do not have all the same columns. Let’s go ahead and connect them to our user record:

twitter.insert(:UserRelationships, '5', {'user_timeline' => {UUID.new => '1'}})
twitter.insert(:UserRelationships, '5', {'user_timeline' => {UUID.new => '2'}})

The UUID.new call creates a collation key based on the current time; our tweet ids are stored in the values.

Now we can query our user’s tweets:

timeline = twitter.get(:UserRelationships, '5', 'user_timeline', :reversed => true)
timeline.map { |time, id| twitter.get(:Statuses, id, 'text') }
# => ["@evan Zzzz....", "Nom nom nom nom nom."]

Two tweet bodies, returned in recency order—not bad at all. In a similar fashion, each time a user tweets, we could loop through their followers and insert the status key into their follower’s home_timeline relationship, for handling general status delivery.

the data model

Cassandra is best thought of as a 4 or 5 dimensional hash. The usual way to refer to a piece of data is as follows: a keyspace, a column family, a key, an optional super column, and a column. At the end of that chain lies a single, lonely value.

Let’s break down what these layers mean.

After the column family level, the organization can diverge—this is a feature unique to Cassandra. You can choose either:

Super columns and standard columns cannot be mixed at the same (4th) level of dimensionality. You must define at startup which column families contain standard columns, and which contain super columns with standard columns inside them.

Super columns are a great way to store one-to-many indexes to other records: make the sub column names TimeUUIDs (or whatever you’d like to use to sort the index), and have the values be the foreign key. We saw an example of this strategy in the demo, above.

If this is confusing, don’t worry. We’ll now look at two example schemas in depth.

twitter schema

Here is the schema definition we used for the demo, above. It is based on Eric Florenzano’s Twissandra, but updated for 0.7:

{"Twitter":{
    "Users":{
      "comparator_type":"org.apache.cassandra.db.marshal.UTF8Type",
      "column_type":"Standard"},
    "Statuses":{
      "comparator_type":"org.apache.cassandra.db.marshal.UTF8Type",
      "column_type":"Standard"},
    "StatusRelationships":{
      "subcomparator_type":"org.apache.cassandra.db.marshal.TimeUUIDType",
      "comparator_type":"org.apache.cassandra.db.marshal.UTF8Type",
      "column_type":"Super"},
}}

You can load a schema with this command (replace schema.json with your own filename):

bin/cassandra-cli --host localhost --batch < schema.json

The server must be running; as of version 0.7, Cassandra supports updating the schema at runtime.

What could be in StatusRelationships? Maybe a list of users who favorited the tweet? Having a super column family for both record types lets us index each direction of whatever many-to-many relationships we come up with.

Here’s how the data is organized:

Cassandra lets you distribute the keys across the cluster either randomly, or in order, via the Partitioner option in the storage-conf.xml file.

For the Twitter application, if we were using the order-preserving partitioner, all recent statuses would be stored on the same node. This would cause hotspots. Instead, we should use the random partitioner.

Alternatively, we could preface the status keys with the user key, which has less temporal locality. If we used user_id:status_id as the status key, we could do range queries on the user fragment to get tweets-by-user, avoiding the need for a user_timeline super column.

multi-blog schema

Here’s a another schema, suggested to me by Jonathan Ellis, the primary Cassandra maintainer. It’s for a multi-tenancy blog platform:

{"Multiblog":{
    "Blogs":{
      "comparator_type":"org.apache.cassandra.db.marshal.TimeUUIDType",
      "column_type":"Standard"},
    "Comments":{
      "comparator_type":"org.apache.cassandra.db.marshal.TimeUUIDType",
      "column_type":"Standard"}
  },}
Imagine we have a blog named 'The Cutest Kittens'. We will insert a row when the first post is made as follows:
require 'rubygems'
require 'cassandra/0.7'
include SimpleUUID

multiblog = Cassandra.new('Multiblog')

multiblog.insert(:Blogs, 'The Cutest Kittens',
  { UUID.new =>
    '{"title":"Say Hello to Buttons Cat","body":"Buttons is a cute cat."}' })

UUID.new generates a unique, sortable column name, and the JSON hash contains the post details. Let’s insert another:

multiblog.insert(:Blogs, 'The Cutest Kittens',
  { UUID.new =>
    '{"title":"Introducing Commie Cat","body":"Commie is also a cute cat"}' })

Now we can find the latest post with the following query:

post = multiblog.get(:Blogs, 'The Cutest Kittens', :reversed => true).to_a.first

On our website, we can build links based on the readable representation of the UUID:

guid = post.first.to_guid
# => "b06e80b0-8c61-11de-8287-c1fa647fd821"

If the user clicks this string in a permalink, our app can find the post directly via:

multiblog.get(:Blogs, 'The Cutest Kittens', :start => UUID.new(guid), :count => 1)

For comments, we’ll use the post UUID as the outermost key:

multiblog.insert(:Comments, guid,
  {UUID.new => 'I like this cat. - Evan'})
multiblog.insert(:Comments, guid,
  {UUID.new => 'I am cuter. - Buttons'})

Now we can get all comments (oldest first) for a post by calling:

multiblog.get(:Comments, guid)

We could paginate them by passing :start with a UUID. See this presentation to learn more about token-based pagination.

We have sidestepped two problems with this data model: we don’t have to maintain separate indexes for any lookups, and the posts and comments are stored in separate files, where they don’t cause as much write contention. Note that we didn’t need to use any super columns, either.

storage layout and api comparison

The storage strategy for Cassandra’s standard model is the same as BigTable’s. Here’s a comparison chart:

multi-file per-file intra-file
Relational server database table* primary key column value
BigTable cluster table column family key column name column value
Cassandra, standard model cluster keyspace column family key column name column value
Cassandra, super column model

* With fixed column names.

Column families are stored in column-major order, which is why people call BigTable a column-oriented database. This is not the same as a column-oriented OLAP database like Sybase IQ—it depends on whether your data model considers keys to span column families or not.

In row-orientation, the column names are the structure, and you think of the column families as containing keys. This is the convention in relational databases.

In column-orientation, the column names are the data, and the column families are the structure. You think of the key as containing the column family, which is the convention in BigTable. (In Cassandra, super columns are also stored in column-major order—all the sub columns are together.)

In Cassandra’s Ruby API, parameters are expressed in storage order, for clarity:

Relational SELECT `column` FROM `database`.`table` WHERE `id` = key;
BigTable table.get(key, "column_family:column")
Cassandra: standard model keyspace.get("column_family", key, "column")
Cassandra: super column model

Note that Cassandra’s internal Thrift interface mimics BigTable in some ways, but this is being changed.

going to production

Cassandra is an alpha product and could, theoretically, lose your data. In particular, if you change the schema specified in the storage-conf.xml file, you must follow these instructions carefully, or corruption will occur (this is going to be fixed). Also, the on-disk storage format is subject to change, making upgrading a bit difficult.

The biggest deployment is at Facebook, where hundreds of terabytes of token indexes are kept in about a hundred Cassandra nodes. However, their use case allows the data to be rebuilt if something goes wrong. Proceed carefully, keep a backup in an unrelated storage engine…and submit patches if things go wrong. (Some other production deployments are listed here.)

That aside, here is a guide for deploying a production cluster:

Note that your client machines (not servers!) must have accurate clocks for Cassandra to resolve write conflicts properly. Use NTP.

conclusion

There is a misperception that if someone advocates a non-relational database, they either don’t understand SQL optimization, or they are generally a hater. This is not the case.

It is reasonable to seek a new tool for a new problem, and database problems have changed with the rise of web-scale distributed systems. This does not mean that SQL as a general-purpose runtime and reporting tool is going away. However, at web-scale, it is more flexible to separate the concerns. Runtime object lookups can be handled by a low-latency, strict, self-managed system like Cassandra. Asynchronous analytics and reporting can be handled by a high-latency, flexible, un-managed system like Hadoop. And in neither case does SQL lend itself to sharding.

I think that Cassandra is the most promising current implementation of a distributed OLTP database, but much work remains to be done.

Cassandra has excellent performance. There some benchmark results for version 0.5 at the end of the Yahoo performance study.

further resources

At this point, there are many better resources around the web than above. Check the official Cassandra website.