The Question

Distributed Key-Value Store

Design a highly available, horizontally scalable distributed key-value store. The system must support basic CRUD operations (Put/Get) for small to medium-sized values. Focus on the mechanics of data partitioning, replication strategies, consistency models (CAP trade-offs), and fault tolerance in a leaderless architecture. Assume a scale of 10TB of data and 50k peak QPS.

LSM-Tree

Consistent Hashing

Gossip Protocol

Vector Clocks

Merkle Tree

Quorum

RocksDB

gRPC

Questions & Insights

Clarifying Questions

Scale and Performance: What is the target scale for total data volume and peak throughput (QPS)?

Consistency vs. Availability: Does the system prioritize strict consistency (CP) or high availability under partition (AP)?

Data Size: What is the average size of the key and the value? Are we supporting large blobs or just small metadata?

Persistence: Is this an in-memory store with snapshots (like Redis) or a disk-heavy persistent store (like Cassandra/RocksDB)?

Assumptions for MVP:

Scale: 100M Daily Active Users, ~10k-50k QPS, ~10TB total storage.

Model: AP system (High Availability) with Tuneable Consistency (Quorum-based).

Data Size: Keys < 250 bytes, Values < 10KB.

Persistence: Durable disk storage required.

Thinking Process

How do we scale horizontally without a single point of failure? We use Consistent Hashing to distribute keys across a ring of nodes, minimizing reshuffling during scaling.

How do we ensure high availability even if nodes fail? We implement Replication (N nodes) and use Quorum-based reads/writes (R+W > N) to balance consistency and speed.

How does the cluster maintain state without a master? We use a Gossip Protocol for decentralized membership discovery and failure detection.

How do we handle write-heavy workloads efficiently? We utilize an LSM-Tree (Log-Structured Merge-Tree) storage engine to turn random writes into sequential disk I/O.

Bonus Points

Merkle Trees for Anti-Entropy: Use Merkle (hash) trees to compare data between replicas efficiently, transferring only the delta during synchronization.

Sloppy Quorum & Hinted Handoff: Enhance availability by allowing writes to "neighbor" nodes if the primary replica is down, with a promise to hand the data back once the primary recovers.

Vector Clocks: Implement logical clocks to detect and resolve concurrent write conflicts in an AP system where physical clocks might drift.

Zero-Copy Serialization: Use Protobuf or FlatBuffers for the wire protocol to minimize CPU overhead during high-throughput serialization/deserialization.

Design Breakdown

Functional Requirements

Core Use Cases:

put(key, value): Store a value associated with a unique key.

get(key): Retrieve the value associated with a unique key.

Scope Control:

In-scope: Partitioning, Replication, Consistency management, Persistence.

Out-of-scope: Complex transactions (multi-key), SQL-like querying, TTL (for MVP), and complex data structures (Lists/Sets).

Non-Functional Requirements

Scale: Horizontal scalability to hundreds of nodes.

Latency: Sub-10ms for get and put operations at the 99th percentile.

Availability & Reliability: 99.99% availability; no data loss on single node failure.

Consistency: Eventual consistency with support for "Read-your-writes."

Fault Tolerance: Automatic detection of node failure and data re-balancing.

Estimation

Traffic:

10k QPS average, 50k QPS peak.

Read/Write ratio: 1:1 for a general-purpose store.

Storage:

100M keys * 10KB/value = 1TB raw data.

With 3x replication = 3TB total disk usage.

Bandwidth:

10k QPS * 10KB = 100MB/s ingress/egress. Standard 1Gbps/10Gbps NICs can handle this.

Blueprint

Concise Summary: A decentralized, leaderless distributed KV store using consistent hashing for partitioning and a Quorum-based protocol for replication.

Major Components:

Client SDK: Handles request routing to the correct coordinator node based on the consistent hashing ring.

Coordinator Node: A stateless entry point (any node in the cluster) that manages the replication factor and quorum logic for a specific request.

Storage Node: Houses the LSM-Tree engine for persistent storage and participates in the Gossip protocol for cluster health.

Simplicity Audit: This design avoids the complexity of a centralized "Master" (like HDFS or GFS), removing a single point of failure and complex leader election for the MVP.

Architecture Decision Rationale:

Why leaderless?: It provides the highest write availability and simplest scaling model for an MVP KV store.

Functional Satisfaction: put/get are natively supported by the hashing/replication logic.

Non-functional Satisfaction: Consistent hashing ensures scalability; LSM-trees ensure low-latency writes.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling: Each node is identical (homogeneous). Scaling is achieved by adding nodes to the consistent hashing ring.

API Schema Design:

POST /v1/kv/{key} (gRPC preferred for performance):

Request: byte[] value, Metadata context

Response: Status OK/Error, VectorClock v

GET /v1/kv/{key}:

Request: Key k

Response: byte[] value, VectorClock v

Resilience & Reliability:

Quorum: N=3, W=2, R=2.

Hinted Handoff: If a replica is down, the coordinator writes to a temporary node with a "hint" tag.

Observability: Prometheus metrics for P99 latency and "Gossip convergence time."

Storage

Access Pattern: Write-heavy or balanced; requires high-speed sequential writes.

Database Table Design (Internal LSM):

Memtable: In-memory sorted buffer (Skip-list).

SSTable: Sorted String Table on disk (Immutable).

Commit Log: WAL (Write Ahead Log) for crash recovery.

Technical Selection: LSM-Tree (RocksDB-style).

Rationale: High write throughput. Deletes are handled via "tombstones" to maintain sequentiality.

Distribution Logic:

Consistent Hashing: Use 256 virtual nodes per physical node to prevent "hot spots."

Replication: Clockwise walk on the ring to pick the first N unique physical nodes.

Infrastructure (Optional)

Distributed Coordination:

Gossip Protocol: Nodes exchange state (heartbeats, versioning) every 1s with a random set of peers.

Failure Detector: Phi Accrual Failure Detector (suspicion-based) to handle network jitters.

Wrap Up

Advanced Topics

Trade-offs (CAP): This is an AP system. During a network partition, we allow writes to ensure availability, accepting that reads might be briefly stale until Read Repair or Anti-Entropy runs.

Reliability:

Merkle Trees: Used during background synchronization to reduce network bandwidth.

Checksums: Every data block in the SSTable has a CRC checksum to detect bit-rot.

Bottleneck Analysis:

Compaction Storms: LSM-tree compaction can consume high CPU/IO. Mitigation: Limit compaction threads or use tiered compaction strategies.

Hot Keys: A single key getting millions of hits. Mitigation: Client-side caching or further sub-partitioning (though sub-partitioning adds significant complexity for an MVP).

Security:

TLS 1.3 for inter-node communication.

Simple ACLs based on key prefixes.