Cassandra

Apache Cassandra is a highly scalable, high-performance, distributed NoSQL database designed to handle massive amounts of data across many commodity servers, providing high availability with no single point of failure.

Cheat Sheet

Prime Use Case

Use Cassandra when you have massive write-heavy workloads, require linear scalability, and need multi-region availability where 'always-on' write capability is more critical than immediate global consistency.

Critical Tradeoffs

  • AP over CP (Availability/Partition Tolerance over Consistency)
  • Write-optimized storage (LSM-trees) vs read latency (compaction overhead)
  • Query-first modeling vs Flexible relational modeling
  • Eventual consistency vs Strong consistency (Tunable)

Killer Senior Insight

Cassandra isn't just a database; it's a distributed storage engine that forces you to design your data schema around your specific query patterns (Query-Driven Modeling) rather than your data relationships.
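To make query-driven modeling concrete, here is a minimal sketch (a hypothetical user-posts example, with plain dicts standing in for Cassandra tables): each query gets its own denormalized table, and every write fans out to all of them.

```python
# Query-first modeling sketch: one "table" per query, same data denormalized
# into each. Plain dicts stand in for Cassandra tables; the dict key plays
# the role of the partition key. All names here are hypothetical.
from collections import defaultdict

# Query 1: "fetch a user's posts" -> partition by user_id
posts_by_user = defaultdict(list)
# Query 2: "fetch posts in a topic" -> partition by topic
posts_by_topic = defaultdict(list)

def insert_post(user_id: str, topic: str, text: str) -> None:
    """Write the same post into both tables (deliberate denormalization)."""
    row = {"user_id": user_id, "topic": topic, "text": text}
    posts_by_user[user_id].append(row)    # serves query 1
    posts_by_topic[topic].append(row)     # serves query 2

insert_post("alice", "databases", "LSM-trees are neat")
insert_post("bob", "databases", "Quorum reads explained")

# Each query is now a single-partition lookup -- no joins required.
print(len(posts_by_user["alice"]))       # 1
print(len(posts_by_topic["databases"]))  # 2
```

The design choice to trade write amplification (two inserts per post) for single-partition reads is exactly the relational instinct Cassandra asks you to invert.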

Recognition

Common Interview Phrases

  • Requirements for 'infinite' horizontal scalability
  • High-velocity time-series data (IoT, metrics, logs)
  • Global multi-datacenter replication with low-latency local writes
  • Need for high availability where even a few seconds of downtime is unacceptable

Common Scenarios

  • Activity feeds and social media timelines
  • IoT sensor data ingestion and monitoring
  • E-commerce product catalogs and shopping carts
  • Messaging systems and chat history storage

Anti-patterns to Avoid

  • Applications requiring ACID transactions across multiple tables
  • Small datasets that fit on a single large relational instance
  • Use cases requiring frequent ad-hoc queries or complex JOINs
  • Workloads with high update/delete frequency (leading to tombstone issues)

The Problem

The Fundamental Issue

The 'Write Wall' and Single Point of Failure inherent in traditional RDBMS master-slave architectures when scaling to petabytes of data.

What breaks without it

  • Master nodes become a bottleneck for write throughput
  • Manual sharding of SQL databases becomes an operational nightmare
  • Failover mechanisms in RDBMS often lead to minutes of downtime
  • Cross-region replication latency kills write performance in CP systems

Why alternatives fail

  • Relational databases (PostgreSQL/MySQL) struggle with horizontal write scaling without complex middleware
  • MongoDB (in default configurations) prioritizes consistency, which can lead to write unavailability during leader elections
  • Standard key-value stores lack the structured wide-column querying capabilities needed for complex time-series data

Mental Model

The Intuition

Imagine a circular ring of lockers. Instead of one manager holding all the keys, every locker owner knows the 'Gossip' about who owns which locker. When you want to store something, you can hand it to any owner, and they'll make sure it gets to the right locker and its neighbors for safekeeping.

Key Mechanics

1. Consistent Hashing: Determines data placement across the cluster ring
2. Gossip Protocol: Peer-to-peer communication for cluster state and health
3. LSM-Trees (Log-Structured Merge-Trees): Convert random writes into sequential I/O via Memtables and SSTables
4. Bloom Filters: Probabilistic data structures used to skip SSTables that don't contain a specific key
5. Hinted Handoff: Temporarily stores writes for a downed node to ensure eventual consistency
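The ring intuition behind consistent hashing can be sketched in a few lines. This is a simplified model (real Cassandra uses Murmur3 tokens and virtual nodes, neither of which is shown here); the node names and key are made up for illustration.

```python
# Simplified consistent-hashing sketch: hash nodes and keys onto one ring,
# then walk clockwise to find the replicas for a key. Assumptions: MD5 as
# the hash, no virtual nodes, hypothetical node names.
import hashlib
from bisect import bisect_right

NODES = ["node-a", "node-b", "node-c"]

def token(value: str) -> int:
    """Hash a key or node name onto a 2^32-position ring."""
    return int(hashlib.md5(value.encode()).hexdigest(), 16) % (2**32)

# Each node owns the arc of the ring that ends at its token.
ring = sorted((token(n), n) for n in NODES)

def replicas(key: str, rf: int = 2) -> list:
    """Walk clockwise from the key's token; the first `rf` nodes hold copies."""
    tokens = [t for t, _ in ring]
    start = bisect_right(tokens, token(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(rf)]

print(replicas("sensor-42", rf=2))  # two distinct nodes, stable for this key
```

Note the operational payoff: adding a node only reassigns the arcs adjacent to its token, rather than reshuffling every key.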

Framework

When it's the best choice

  • When sustained write volume exceeds roughly 100,000 operations per second
  • When the system must survive the loss of an entire data center
  • When data access patterns are well-defined and static

When to avoid

  • When you need to perform 'GROUP BY' or 'JOIN' operations at the database level
  • When you have a 'heavy update' workload that modifies the same records repeatedly
  • When you lack the engineering resources to manage compaction and JVM tuning

Fast Heuristics

  • If you need joins → PostgreSQL/Spanner
  • If you need sub-millisecond key-value lookups → Redis
  • If you need massive writes + multi-region → Cassandra
  • If you need flexible document schemas + secondary indexes → MongoDB

Tradeoffs

Strengths

  • Linear Scalability: Adding nodes increases capacity linearly
  • No Single Point of Failure: Peer-to-peer architecture
  • Tunable Consistency: Choose between ONE, QUORUM, or ALL for each query
  • High Write Throughput: LSM-tree architecture is optimized for sequential disk writes

Weaknesses

  • Tombstones: Deletes don't remove data immediately, causing read performance degradation
  • Compaction Debt: Background merging of SSTables can consume significant CPU/IO
  • No Joins/Aggregations: Must be handled at the application layer or via denormalization
  • JVM Garbage Collection: Can cause 'stop-the-world' pauses affecting latency

Alternatives

ScyllaDB

When it wins

When you need Cassandra compatibility but want higher performance and lower latency without JVM overhead.

Key Difference

Written in C++ with a shared-nothing architecture, eliminating GC pauses and improving resource utilization.

DynamoDB

When it wins

When you want a serverless, fully managed experience and don't want to manage infrastructure.

Key Difference

Proprietary AWS service with a different pricing model (RCU/WCU) and stricter item size limits.

Google Cloud Spanner

When it wins

When you need global scale but also require strict ACID transactions and relational semantics.

Key Difference

Uses TrueTime (atomic clocks) to provide external consistency across regions, which is much more expensive.

Execution

Must-hit talking points

  • Explain the difference between Partition Key (data distribution) and Clustering Key (on-disk sorting)
  • Discuss the 'Read Repair' and 'Anti-Entropy (Manual Repair)' mechanisms
  • Mention the 'Write Path': CommitLog -> Memtable -> SSTable
  • Explain how Quorum (R + W > N) ensures strong consistency
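The quorum arithmetic from the last point is worth being able to do on the spot. A minimal sketch of the R + W > N overlap argument:

```python
# Quorum arithmetic sketch: with replication factor N, a read touching R
# replicas and a write touching W replicas must overlap on at least one
# replica whenever R + W > N -- that shared replica has the latest write.
def quorum(n: int) -> int:
    """Smallest majority of n replicas (Cassandra's QUORUM level)."""
    return n // 2 + 1

N = 3                    # replication factor
R = W = quorum(N)        # QUORUM reads and writes on RF=3 -> 2 each
overlap = R + W - N      # replicas guaranteed to be in both sets
print(R, W, overlap)     # 2 2 1
```

With weaker levels such as ONE/ONE (1 + 1 = 2, not > 3), no overlap is guaranteed, which is exactly where the "tunable" in tunable consistency bites.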

Anticipate follow-ups

  • Q: How do you handle hot partitions in Cassandra?
  • Q: What is the impact of a high 'Replication Factor' on write latency?
  • Q: How would you implement a secondary index, and why is it often discouraged?
  • Q: Explain 'LWT' (Lightweight Transactions) and the Paxos protocol in Cassandra.
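For the hot-partition follow-up, one standard answer is bucketing: fold a time bucket into the partition key so one logical stream spreads across many partitions. A hedged sketch (the sensor naming and day-granularity bucket are illustrative choices, not a universal rule):

```python
# Hot-partition mitigation sketch: compose the partition key from the
# entity ID plus a time bucket, so a single busy sensor's data is split
# across one partition per day instead of one unbounded partition.
from datetime import datetime, timezone

def partition_key(sensor_id: str, ts: datetime) -> tuple:
    """Bucket by day: (sensor_id, 'YYYY-MM-DD') instead of sensor_id alone."""
    return (sensor_id, ts.strftime("%Y-%m-%d"))

ts = datetime(2024, 5, 1, 12, 30, tzinfo=timezone.utc)
print(partition_key("sensor-42", ts))  # ('sensor-42', '2024-05-01')
```

The bucket size is a tradeoff: smaller buckets spread load better but force readers to query more partitions per time range.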

Red Flags

Using Cassandra like a relational database (Normalizing data).

Why it fails: Leads to application-side joins and massive performance bottlenecks because Cassandra is designed for denormalized, single-table queries.

Creating 'Unbounded' partitions (e.g., partitioning by a single 'status' column).

Why it fails: Causes 'Hot Partitions' where a single node holds too much data, leading to skewed load and eventual node failure.

Frequent deletes or updates to the same row.

Why it fails: Creates 'Tombstones' which the read scanner must skip, significantly increasing read latency and disk pressure.
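The tombstone failure mode can be sketched with a toy LSM read path. This is a deliberately simplified model (real SSTable merging, timestamps, and gc_grace_seconds are omitted): deletes append a marker, and reads must merge stale versions until compaction purges them.

```python
# Toy LSM read-path sketch showing tombstones. A delete is just another
# write (a marker); until compaction, reads merge SSTables newest-first
# and must skip tombstoned versions. All data here is illustrative.
TOMBSTONE = "<deleted>"

# Newest SSTable first; each is an immutable key -> value snapshot.
sstables = [
    {"row1": TOMBSTONE},            # recent delete of row1
    {"row1": "v2", "row2": "b"},    # earlier update
    {"row1": "v1"},                 # original write
]

def read(key):
    """Return (value, sstables_scanned); newest version wins."""
    scanned = 0
    for table in sstables:
        scanned += 1
        if key in table:
            value = table[key]
            return (None if value == TOMBSTONE else value), scanned
    return None, scanned

print(read("row1"))  # (None, 1) -- deleted, yet three versions still sit on disk
```

The point of the sketch: the delete made the row invisible, but the disk still holds three versions of it, and any scan that touches those SSTables pays for them until compaction runs.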