The Question
Design

Distributed Key-Value Store

Design a highly available, horizontally scalable distributed key-value store. The system must support basic CRUD operations (Put/Get) for small to medium-sized values. Focus on the mechanics of data partitioning, replication strategies, consistency models (CAP trade-offs), and fault tolerance in a leaderless architecture. Assume a scale of 10TB of data and 50k peak QPS.
LSM-Tree
Consistent Hashing
Gossip Protocol
Vector Clocks
Merkle Tree
Quorum
RocksDB
gRPC
Questions & Insights

Clarifying Questions

Scale and Performance: What is the target scale for total data volume and peak throughput (QPS)?
Consistency vs. Availability: Does the system prioritize strict consistency (CP) or high availability under partition (AP)?
Data Size: What is the average size of the key and the value? Are we supporting large blobs or just small metadata?
Persistence: Is this an in-memory store with snapshots (like Redis) or a disk-heavy persistent store (like Cassandra/RocksDB)?
Assumptions for MVP:
Scale: 100M Daily Active Users, ~10k-50k QPS, ~10TB total storage.
Model: AP system (High Availability) with Tuneable Consistency (Quorum-based).
Data Size: Keys < 250 bytes, Values < 10KB.
Persistence: Durable disk storage required.

Thinking Process

How do we scale horizontally without a single point of failure? We use Consistent Hashing to distribute keys across a ring of nodes, minimizing reshuffling during scaling.
How do we ensure high availability even if nodes fail? We implement Replication (N nodes) and use Quorum-based reads/writes (R+W > N) to balance consistency and speed.
How does the cluster maintain state without a master? We use a Gossip Protocol for decentralized membership discovery and failure detection.
How do we handle write-heavy workloads efficiently? We utilize an LSM-Tree (Log-Structured Merge-Tree) storage engine to turn random writes into sequential disk I/O.

Bonus Points

Merkle Trees for Anti-Entropy: Use Merkle (hash) trees to compare data between replicas efficiently, transferring only the delta during synchronization.
Sloppy Quorum & Hinted Handoff: Enhance availability by allowing writes to "neighbor" nodes if the primary replica is down, with a promise to hand the data back once the primary recovers.
Vector Clocks: Implement logical clocks to detect and resolve concurrent write conflicts in an AP system where physical clocks might drift.
Zero-Copy Serialization: Use Protobuf or FlatBuffers for the wire protocol to minimize CPU overhead during high-throughput serialization/deserialization.
Design Breakdown

Functional Requirements

Core Use Cases:
put(key, value): Store a value associated with a unique key.
get(key): Retrieve the value associated with a unique key.
Scope Control:
In-scope: Partitioning, Replication, Consistency management, Persistence.
Out-of-scope: Complex transactions (multi-key), SQL-like querying, TTL (for MVP), and complex data structures (Lists/Sets).

Non-Functional Requirements

Scale: Horizontal scalability to hundreds of nodes.
Latency: Sub-10ms for get and put operations at the 99th percentile.
Availability & Reliability: 99.99% availability; no data loss on single node failure.
Consistency: Eventual consistency with support for "Read-your-writes."
Fault Tolerance: Automatic detection of node failure and data re-balancing.

Estimation

Traffic:
10k QPS average, 50k QPS peak.
Read/Write ratio: 1:1 for a general-purpose store.
Storage:
100M keys * 10KB/value = 1TB raw data.
With 3x replication = 3TB total disk usage.
Bandwidth:
10k QPS * 10KB = 100MB/s ingress/egress. Standard 1Gbps/10Gbps NICs can handle this.

Blueprint

Concise Summary: A decentralized, leaderless distributed KV store using consistent hashing for partitioning and a Quorum-based protocol for replication.
Major Components:
Client SDK: Handles request routing to the correct coordinator node based on the consistent hashing ring.
Coordinator Node: A stateless entry point (any node in the cluster) that manages the replication factor and quorum logic for a specific request.
Storage Node: Houses the LSM-Tree engine for persistent storage and participates in the Gossip protocol for cluster health.
Simplicity Audit: This design avoids the complexity of a centralized "Master" (like HDFS or GFS), removing a single point of failure and complex leader election for the MVP.
Architecture Decision Rationale:
Why leaderless?: It provides the highest write availability and simplest scaling model for an MVP KV store.
Functional Satisfaction: put/get are natively supported by the hashing/replication logic.
Non-functional Satisfaction: Consistent hashing ensures scalability; LSM-trees ensure low-latency writes.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling: Each node is identical (homogeneous). Scaling is achieved by adding nodes to the consistent hashing ring.
API Schema Design:
POST /v1/kv/{key} (gRPC preferred for performance):
Request: byte[] value, Metadata context
Response: Status OK/Error, VectorClock v
GET /v1/kv/{key}:
Request: Key k
Response: byte[] value, VectorClock v
Resilience & Reliability:
Quorum: N=3, W=2, R=2.
Hinted Handoff: If a replica is down, the coordinator writes to a temporary node with a "hint" tag.
Observability: Prometheus metrics for P99 latency and "Gossip convergence time."

Storage

Access Pattern: Write-heavy or balanced; requires high-speed sequential writes.
Database Table Design (Internal LSM):
Memtable: In-memory sorted buffer (Skip-list).
SSTable: Sorted String Table on disk (Immutable).
Commit Log: WAL (Write Ahead Log) for crash recovery.
Technical Selection: LSM-Tree (RocksDB-style).
Rationale: High write throughput. Deletes are handled via "tombstones" to maintain sequentiality.
Distribution Logic:
Consistent Hashing: Use 256 virtual nodes per physical node to prevent "hot spots."
Replication: Clockwise walk on the ring to pick the first N unique physical nodes.

Infrastructure (Optional)

Distributed Coordination:
Gossip Protocol: Nodes exchange state (heartbeats, versioning) every 1s with a random set of peers.
Failure Detector: Phi Accrual Failure Detector (suspicion-based) to handle network jitters.
Wrap Up

Advanced Topics

Trade-offs (CAP): This is an AP system. During a network partition, we allow writes to ensure availability, accepting that reads might be briefly stale until Read Repair or Anti-Entropy runs.
Reliability:
Merkle Trees: Used during background synchronization to reduce network bandwidth.
Checksums: Every data block in the SSTable has a CRC checksum to detect bit-rot.
Bottleneck Analysis:
Compaction Storms: LSM-tree compaction can consume high CPU/IO. Mitigation: Limit compaction threads or use tiered compaction strategies.
Hot Keys: A single key getting millions of hits. Mitigation: Client-side caching or further sub-partitioning (though sub-partitioning adds significant complexity for an MVP).
Security:
TLS 1.3 for inter-node communication.
Simple ACLs based on key prefixes.