Cassandra - a decentralized structured storage system
2010 • Big Data • Distributed Systems • Software Engineering • Apache • CAP • Data • Database • DBMS • Engineering • Scale • Software • Systems
14 Dec 2020
Introduction
- Cassandra is a distributed storage system that runs over cheap commodity servers and handles high write throughput while maintaining low latency for read operations.
- At the time the paper was written, Cassandra was used to power Inbox Search at Facebook.
- Link to the paper
- Link to the implementation
Data Model
- A table is a distributed multidimensional map.
- The key is a string (generally 16-36 bytes long), while the value is a structured object.
- Every operation under a single row key is atomic per replica.
- Columns are grouped together into sets called column families.
- There are two types of column families:
- Simple column families.
- Super column families: can be visualized as a column family within a column family.
- Columns within a column family can be sorted by name or by time (time sorting is used to display results in time-sorted order).
- The API supports insert, get, and delete operations (the data model is sketched below).
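A rough mental model of this layout, using plain Python dictionaries as stand-ins for the distributed map; the row keys, column names, and values here are made up for illustration.

```python
# Hypothetical sketch: a Cassandra table viewed as a nested map.
# table -> row key -> column family -> (super column ->) column -> value

# Simple column family: row key -> {column name -> value}
simple_cf = {
    "row-key-1": {
        "col-a": "value-1",
        "col-b": "value-2",
    }
}

# Super column family: row key -> {super column -> {column name -> value}}
super_cf = {
    "row-key-1": {
        "super-col-x": {
            "col-a": "value-1",
            "col-b": "value-2",
        },
    }
}

# The per-row API is roughly insert(table, key, mutation),
# get(table, key, column_name), and delete(table, key, column_name).
```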
System Architecture
Handling Requests
- A read/write request can be routed to any node in the cluster; that node determines the replicas for the given key and routes the request to them.
- For a write, the system routes the request to the replicas and waits for a quorum of them to acknowledge the write's completion.
- For a read, the system either routes the request to the closest replica (which might return stale results) or routes it to all replicas and waits for a quorum of responses (see the sketch below).
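A minimal sketch of this quorum logic, assuming a hypothetical `send(node, message)` helper that returns a replica's response (or `None` on failure) and a `replicas` list already sorted by proximity; this is not the actual coordinator code.

```python
def quorum(n):
    # A majority of the replicas, e.g. 2 out of 3.
    return n // 2 + 1

def write(key, mutation, replicas, send):
    """Route the write to all replicas; succeed once a quorum acknowledges."""
    acks = sum(1 for node in replicas if send(node, ("write", key, mutation)))
    return acks >= quorum(len(replicas))

def read(key, replicas, send, closest_only=True):
    """Either read from the closest replica (possibly stale), or wait for a
    quorum of responses from all replicas."""
    if closest_only:
        return send(replicas[0], ("read", key))   # replicas sorted by proximity
    responses = [send(node, ("read", key)) for node in replicas]
    responses = [r for r in responses if r is not None]
    return responses if len(responses) >= quorum(len(replicas)) else None
```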
Partitioning
- Cassandra partitions data across the cluster using consistent hashing with an order-preserving hash function.
- The hash function’s output range is treated as a fixed circular ring, and each node is assigned a random position on the ring.
- An incoming request specifies a key; the key is hashed to a position on the ring, and the request is routed to the first node found by walking the ring clockwise from that position (see the toy ring sketched after this list).
- One benefit of this approach is that the addition/removal of a node only affects its immediate neighbors.
- However, randomly assigning nodes leads to non-uniform data and load distribution.
- Cassandra analyzes load information on the ring and moves lightly loaded nodes along the ring to relieve heavily loaded ones.
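A toy consistent-hashing ring to illustrate the routing rule; the MD5-based hash and node names are placeholders (Cassandra's actual hash function is order preserving), so this is only a sketch of the idea.

```python
import bisect
import hashlib

RING_SIZE = 2**32

def ring_position(value):
    # Placeholder hash; Cassandra uses an order-preserving hash function.
    digest = hashlib.md5(value.encode()).digest()
    return int.from_bytes(digest[:4], "big") % RING_SIZE

class Ring:
    def __init__(self, nodes):
        # Each node is assigned a position on the fixed circular ring.
        self.positions = sorted((ring_position(n), n) for n in nodes)

    def coordinator_for(self, key):
        """Walk clockwise from the key's position to the first node."""
        pos = ring_position(key)
        idx = bisect.bisect_right([p for p, _ in self.positions], pos)
        return self.positions[idx % len(self.positions)][1]

ring = Ring(["node-a", "node-b", "node-c"])
print(ring.coordinator_for("user:42"))   # routes to one of the three nodes
```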
Replication
- Each data item is replicated at N hosts, where N is the per-instance replication factor.
- Cassandra supports the following replication policies: Rack Unaware, Rack Aware (within a datacenter), and Datacenter Aware (the simplest of these is sketched after this list).
- For the “Rack Aware” and “Datacenter Aware” strategies, Cassandra elects a leader among its nodes using Zookeeper; the leader assigns ranges to nodes, and the metadata about which ranges a node is responsible for is cached locally at each node and stored in Zookeeper.
- In the presence of node failures and network partitions, the quorum requirements are relaxed to preserve durability guarantees.
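Under the simplest “Rack Unaware” policy, the replicas beyond the coordinator are just the next distinct nodes on the ring. A minimal, self-contained sketch of that placement rule (the ring positions below are invented for the example):

```python
import bisect

def replicas_for(positions, key_position, n):
    """Rack Unaware placement: the coordinator plus the next N-1 distinct
    nodes walking clockwise. `positions` is a sorted list of
    (ring_position, node) pairs."""
    idx = bisect.bisect_right([p for p, _ in positions], key_position)
    replicas = []
    for i in range(len(positions)):
        node = positions[(idx + i) % len(positions)][1]
        if node not in replicas:
            replicas.append(node)
        if len(replicas) == n:
            break
    return replicas

# Three nodes on a tiny ring of size 100; a key hashing to position 55
# is coordinated by node-c, with node-a holding the second replica.
print(replicas_for([(10, "node-a"), (40, "node-b"), (75, "node-c")], 55, 2))
# -> ['node-c', 'node-a']
```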
Membership
- Cluster membership is based on Scuttlebutt, a very efficient anti-entropy Gossip-based mechanism.
- Cassandra uses a modified version of the $\phi$ Accrual Failure Detector: instead of a boolean up/down verdict, it provides a suspicion level (of failure) for each monitored node (see the sketch below).
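A rough sketch of the accrual idea: the suspicion level grows the longer a node goes without sending a heartbeat, relative to the inter-arrival times observed so far. The exponential model below follows the paper's observation that it fits better than a Gaussian; the window size and class/method names are arbitrary choices for this sketch.

```python
import math
from collections import deque

class AccrualFailureDetector:
    def __init__(self, window=1000):
        self.intervals = deque(maxlen=window)   # recent heartbeat gaps
        self.last_heartbeat = None

    def heartbeat(self, now):
        if self.last_heartbeat is not None:
            self.intervals.append(now - self.last_heartbeat)
        self.last_heartbeat = now

    def phi(self, now):
        """Suspicion level; grows the longer we go without a heartbeat."""
        if not self.intervals:
            return 0.0
        mean = sum(self.intervals) / len(self.intervals)
        elapsed = now - self.last_heartbeat
        # Assuming exponentially distributed inter-arrival times:
        # P(no heartbeat yet after `elapsed`) = exp(-elapsed / mean), so
        # phi = -log10(P) = elapsed / (mean * ln(10)).
        return elapsed / (mean * math.log(10))
```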
Bootstrapping
- A node, starting for the first time, chooses a random position in the ring.
- This information is persisted on the local disk and in Zookeeper, and is gossiped around the cluster (so any node can route a query for any key in the cluster).
- During bootstrapping, the newly joined node reads a list of contact points within the cluster (the cluster's seeds) from a configuration file.
Local Persistence
- Generally, a write operation involves a write into a commit log (for durability and recoverability), followed by a write into the in-memory data structures.
- A read operation first queries the in-memory data structure and then looks into the on-disk files, from newest to oldest.
- To avoid looking into files that cannot contain the key, each data file carries a bloom filter summarizing its keys, which is also kept in memory.
- Column indices are maintained so that a lookup can jump directly to the relevant columns instead of scanning every column of the row on disk (the read/write path is sketched below).
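A minimal sketch of this write and read path, with invented stand-in classes for the memtable and on-disk files; the “bloom filter” here is a plain Python set, just to show where the filter sits in the path.

```python
class SSTable:
    """Stand-in for an immutable on-disk data file."""
    def __init__(self, rows):
        self.rows = dict(rows)
        self.bloom = set(self.rows)     # real files store a bloom filter of keys

    def lookup(self, key):
        return self.rows.get(key)       # real lookups use block and column indices

class Store:
    def __init__(self, sstables=()):
        self.commit_log = []            # append-only log for durability/recovery
        self.memtable = {}              # in-memory structure, written after the log
        self.sstables = list(sstables)  # on-disk files, newest first

    def write(self, key, columns):
        self.commit_log.append((key, columns))              # 1. commit log first
        self.memtable.setdefault(key, {}).update(columns)   # 2. then memory

    def read(self, key):
        if key in self.memtable:                 # 1. in-memory data first
            return self.memtable[key]
        for table in self.sstables:              # 2. then files, newest to oldest
            if key in table.bloom:               # skip files that can't hold the key
                row = table.lookup(key)
                if row is not None:
                    return row
        return None
```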
Implementation Details
- All components are implemented in Java.
- System control messages use UDP, while messages for replication and request routing use TCP.
- A new commit log is rolled out after the older one exceeds 128 MB in size.
- All the data is indexed using a primary key.
- Data on the disk is chunked into sequences of blocks. Each block contains at most 128 keys and is demarcated by a block index.
- When the data is written to disk, a block index is generated and kept in memory for faster access.
- A compaction process merges multiple data files on disk into one, essentially the merge step of merge sort (sketched below).
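A sketch of that merge step over key-sorted files; the in-memory list-of-pairs “file format” is invented for the example, and duplicate keys are resolved in favour of the newest file.

```python
import heapq

def compact(data_files):
    """Merge several key-sorted data files into one.
    `data_files` is a list of sorted [(key, row), ...] lists, newest first."""
    merged = {}
    for key, row in heapq.merge(*data_files, key=lambda kv: kv[0]):
        merged.setdefault(key, row)     # first occurrence = row from the newest file
    return sorted(merged.items())

old = [("a", {"v": 1}), ("c", {"v": 3})]
new = [("a", {"v": 9}), ("b", {"v": 2})]
print(compact([new, old]))
# -> [('a', {'v': 9}), ('b', {'v': 2}), ('c', {'v': 3})]
```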
Practical Experience
- Data from MySQL servers is added to Cassandra using MapReduce processes.
- Although Cassandra is a completely decentralized system, adding some coordination (via Zookeeper) is helpful.
- For Inbox Search, a per-user index is maintained for all the messages.
- For “term search”, the key is the userid, and the words that make up the message become the super columns.
- For searching all the messages ever sent/received by a user, the key is the userid, and the recipient ids are the super columns (both schemas are sketched below).
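A sketch of these two per-user index schemas as nested maps, in the same spirit as the data-model sketch above; the user ids, words, and message ids are made up.

```python
# Hypothetical sketch of the Inbox Search schemas.

# Term search: user id -> {word -> {message id -> ...}}
term_search = {
    "user:42": {
        "hello": {"msg:1001": "", "msg:2113": ""},
    }
}

# All messages exchanged with a user: user id -> {recipient id -> {message id -> ...}}
interactions = {
    "user:42": {
        "user:77": {"msg:1001": "", "msg:3420": ""},
    }
}
```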