The Question

Scalable Distributed Key-Value Store

Design a highly available and scalable distributed key-value store capable of supporting millions of QPS with sub-10ms latency. The system must handle petabytes of data, remain operational during network partitions (AP focused), and provide tunable consistency. Discuss the data distribution strategy, replication mechanisms, conflict resolution, and the underlying storage engine architecture.

LSM-Tree

Consistent Hashing

Gossip Protocol

Vector Clocks

Merkle Trees

Quorum

gRPC

WAL

Bloom Filters

Questions & Insights

Clarifying Questions

What is the primary data model? (e.g., Key-Value, Relational, Document, or Graph?)

Assumption: We will design a Distributed Key-Value Store (similar to Dynamo or Cassandra) as it represents the core challenges of robustness and scale.

What are the scale and performance requirements? (QPS, Latency, Data Volume?)

Assumption: 10M+ QPS (Read/Write combined), <10ms P99 latency, and Petabyte-scale storage.

What is the priority in the CAP theorem? (Consistency vs. Availability?)

Assumption: High Availability (AP system) with Tunable Consistency to ensure "robustness" even during network partitions.

What is the durability requirement?

Assumption: Data must be persisted to disk and replicated across multiple availability zones.

Thinking Process

Core Bottleneck: Maintaining high availability and performance while scaling to millions of requests without a single point of failure.

Progressive Logic:

How do we distribute data across thousands of nodes? (Consistent Hashing)

How do we ensure data isn't lost if a node dies? (Replication & Quorum)

How do we handle write-heavy workloads with low latency? (LSM-Trees)

How do we keep nodes in sync and detect failures? (Gossip Protocol & Merkle Trees)

Bonus Points

Vector Clocks: Used for conflict resolution in an eventually consistent system to track causality.

Hinted Handoff: Ensuring write availability even when target nodes are temporarily down by storing "hints" on healthy neighbors.

LSM-Tree Bloom Filters: Reducing disk I/O for non-existent keys by checking a probabilistic data structure in memory first.

Merkle Tree Anti-Entropy: Efficiently identifying differences between replicas during background synchronization without transferring entire datasets.

Design Breakdown

Functional Requirements

Core Use Cases:

put(key, value): Store a value associated with a unique key.

get(key): Retrieve the value associated with a key.

Scope Control:

In-Scope: Distributed storage, replication, partitioning, and local persistence.

Out-of-Scope: SQL support, complex transactions (ACID across keys), and secondary indexing for the MVP.

Non-Functional Requirements

Scale: Horizontal scalability to handle 1PB+ of data and 10M QPS.

Latency: Sub-10ms for both reads and writes at the node level.

Availability: 99.999% availability (Highly Available).

Consistency: Tunable consistency (Eventual by default, can be Strong via Quorum).

Fault Tolerance: No single point of failure; system must survive the loss of multiple nodes or an entire AZ.

Estimation

Traffic: 10M QPS. Assuming 50/50 Read/Write ratio = 5M Write QPS, 5M Read QPS.

Storage: 1KB average value size * 10 billion keys = 10TB unique data. With 3x replication = 30TB.

Bandwidth: 10M QPS * 1KB = 10GB/s total ingress/egress. This requires a large cluster of nodes with 10Gbps or 25Gbps NICs.

Blueprint

Concise Summary: A peer-to-peer distributed hash table (DHT) using consistent hashing for data distribution and LSM-Trees for local high-performance storage.

Major Components:

Gossip Service: Manages cluster membership and failure detection in a decentralized manner.

Storage Engine (LSM-Tree): Handles high-throughput writes via a Write-Ahead Log (WAL) and sorted string tables (SSTables).

Replication Manager: Coordinates data duplication across N nodes using a Quorum-based consistency model.

Simplicity Audit: This design avoids a centralized master (like HDFS NameNode), eliminating the primary bottleneck and single point of failure for the MVP.

Architecture Decision Rationale:

Why?: A Leaderless architecture (Dynamo-style) is superior for "robustness" because any node can handle any request, providing maximum uptime.

Functional: Supports basic KV operations efficiently.

Non-functional: Scalability is linear; adding nodes increases both capacity and throughput.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling: Stateless coordination logic allows any node to act as a "coordinator." Nodes are arranged in a logical ring (Consistent Hashing).

API Schema Design:

put(key, value, context): gRPC, returns 201 Created. Context includes vector clock metadata.

get(key): gRPC, returns value and context.

Resilience & Reliability:

Quorum (R + W > N): Ensuring consistency by requiring a majority of nodes to acknowledge operations.

Sloppy Quorum: Allowing writes to "handoff" nodes if primary replicas are down.

Storage

Access Pattern: Write-heavy (logging, state tracking). Needs high-speed ingestion.

Database Table Design:

Internal Key: Hash(PartitionKey) + SortKey.

Value: Blob.

Metadata: Timestamp, TTL, Vector Clock.

Technical Selection: LSM-Tree (Log-Structured Merge-Tree).

Rationale: Writes are sequential (appending to WAL and MemTable), which is much faster than B-Tree random I/O.

Distribution Logic: Consistent Hashing with Virtual Nodes.

Prevents "hot spots" by mapping one physical node to multiple points on the hash ring.

Cache

Purpose: Reduce disk read latency for frequently accessed keys (Hot keys).

Key-Value Schema: Key -> Value.

Technical Selection: In-memory Buffer Pool / Block Cache.

Rationale: Integrated directly into the storage engine (like RocksDB's Block Cache) to ensure cache-to-disk consistency.

Failure Handling: LRU eviction; cache is warmed up from SSTables upon node restart.

Data Processing

Processing Model: Compaction (Background Batch).

Processing DAG:

Merge multiple SSTables.

Remove deleted records (Tombstones).

Sort and write new SSTables.

Technical Selection: Custom background threads managed by the storage engine.

Rationale: Standard LSM-Tree maintenance is required to prevent "Read Amplification."

Wrap Up

Advanced Topics

Trade-offs (AP vs CP): We choose Availability (AP). During a network split, the database remains writable. We use Read Repair and Vector Clocks to resolve conflicts later.

Reliability:

Gossip Protocol: Uses periodic, pairwise communication to spread state. If a node stops responding, the cluster marks it "suspect" and then "down."

Anti-Entropy: Using Merkle Trees to compare data ranges between replicas with minimal network overhead.

Bottleneck Analysis:

Hot Keys: Solved via Virtual Nodes and potential application-side caching.

Write Stall: Occurs if compaction cannot keep up with ingestion. Managed by rate-limiting ingress or increasing compaction threads.

Distinguishing Insights: To handle Multi-Region, we implement "Local Quorum" for low-latency writes within a region, while asynchronously replicating to other regions to prevent cross-continent latency from blocking the client.