The Question
DesignDistributed Key-Value Store Design
Design a highly available and scalable distributed key-value storage system capable of handling terabytes of data with low-latency writes. The system should support tunable consistency levels and ensure data durability even in the event of hardware failure.
LSM-Tree
Consistent Hashing
Gossip Protocol
Write-Ahead Log
Bloom Filter
Questions & Insights
Clarifying Questions
Scale & Performance: What is the expected Write/Read throughput (QPS) and the target latency for P99?
Data Size: What is the total dataset size, and what are the average/maximum sizes for keys and values?
Consistency vs. Availability: In the event of a network partition, should the system prioritize Strong Consistency or High Availability (CAP Theorem)?
Persistence: Is this an in-memory store with optional snapshots (like Redis) or a durable-first store (like Cassandra/LevelDB)?
Assumptions for MVP:
Scale: 100k+ QPS, 10TB total storage.
Data: Key size ~100 bytes, Value size ~10 KB.
Consistency: Tunable consistency (eventual by default, support for Quorum).
Persistence: Highly durable, using an LSM-Tree based storage engine.
Thinking Process
The core challenge of a distributed KV store is balancing write performance with data durability and horizontal scalability.
How to handle high-velocity writes? Use an LSM-Tree (Log-Structured Merge-Tree) to turn random writes into sequential I/O.
How to scale horizontally without a single point of failure? Implement Consistent Hashing with virtual nodes to distribute data across a peer-to-peer ring.
How to ensure data durability and availability? Use a Write-Ahead Log (WAL) and synchronous/asynchronous replication across Nodes.
How to resolve conflicts in a distributed environment? Utilize Vector Clocks or "Last Write Wins" (LWW) with NTP-synchronized clocks.
Bonus Points
Bloom Filters: Use Bloom Filters in the storage engine to prevent unnecessary disk seeks for non-existent keys.
Hinted Handoff: Implement hinted handoff to handle temporary node failures, ensuring high availability even when nodes are flapping.
Phi Accrual Failure Detector: Instead of a simple timeout, use an adaptive failure detector that adjusts to network congestion.
SSTable Compaction Strategies: Discuss the trade-offs between Size-Tiered Compaction (better for writes) and Leveled Compaction (better for reads/space efficiency).
Design Breakdown
Functional Requirements
Put(key, value): Store a value associated with a unique key.
Get(key): Retrieve the latest value associated with the key.
Delete(key): Mark a key for deletion (using tombstones).
Non-Functional Requirements
Scalability: Horizontal scaling by adding nodes to the cluster.
High Availability: Continuous operation even if multiple nodes fail.
Durability: Data must be persisted to non-volatile storage before acknowledging a "Put".
Low Latency: Sub-10ms response times for Get/Put operations.
Estimation
Writes: 20k QPS. Daily: 20,000 \times 86,400 \approx 1.7B writes.
Storage: 1.7B \times 10KB \approx 17TB per day (uncompressed).
Bandwidth: 20k \times 10KB \approx 200MB/s ingress.
Memory: To keep indexes and Bloom filters for 10TB of data, we need roughly 1-2% of total data size in RAM (\approx 100-200GB across the cluster).
Blueprint
Concise Summary: A distributed, peer-to-peer Key-Value store using Consistent Hashing for sharding and an LSM-Tree engine for high-performance persistent storage.
Major Components:
Client SDK: Handles routing logic and load balancing based on the hash ring.
Storage Node: The core unit containing the Storage Engine (LSM-Tree) and Replication Manager.
Gossip Protocol: Maintains cluster state and membership without a centralized coordinator.
Simplicity Audit: This design avoids complex centralized lock managers or heavy orchestration (like Zookeeper) by using decentralized membership and consistent hashing.
Architecture Decision Rationale:
LSM-Tree over B+ Tree: Better write throughput for the 10TB scale by leveraging sequential disk I/O.
Consistent Hashing: Minimizes data movement during cluster resizing.
Quorum-based replication: Provides tunable consistency to meet various application needs.
High Level Architecture
Sub-system Deep Dive
Service
Topology: Peer-to-peer (P2P) ring architecture. Any node can act as a coordinator for a specific request.
API Spec:
POST /v1/kv/{key}: Body: value, Headers: X-Consistency-Level: Quorum.GET /v1/kv/{key}: Returns value and version_tag.Consistent Hashing: The hash space is a circle (0 to 2^{32}-1). Nodes and keys are mapped to this circle. "Virtual Nodes" (vnodes) are used to ensure uniform data distribution and prevent hotspots.
Storage
Data Model: Simple Byte Array for values; UTF-8 strings for keys.
Storage Engine (LSM-Tree):
WAL: Every write is first appended to a log for crash recovery.
Memtable: An in-memory sorted structure (SkipList or Red-Black Tree). When full, it is flushed to disk.
SSTables: Sorted String Tables on disk. Once written, they are immutable.
Compaction: Background threads merge smaller SSTables into larger ones, removing deleted keys (tombstones).
Indexing: A Sparse Index is kept in memory to point to blocks within SSTables.
Wrap Up
Advanced Topics
Monitoring:
Metrics: Write Latency, Read Latency, Compaction Backlog, Disk Usage, Gossip Convergence Time.
Tools: Prometheus for metrics, Jaeger for distributed tracing of requests across nodes.
Trade-offs:
Consistency vs. Latency: Using R=1, W=1 (Eventual Consistency) provides lowest latency but risks stale reads. R+W > N (Quorum) ensures strong consistency but increases latency.
Bottlenecks: Compaction can cause I/O spikes. We implement "Compaction Throttling" to limit disk I/O used by background tasks during peak traffic.
Failure Handling:
Node Down: Gossip detects failure; Hinted Handoff stores data on a neighbor until the node returns.
Permanent Loss: Merkle Trees (Anti-entropy) are used to synchronize data between replicas in the background to fix inconsistencies.
Alternatives:
B-Tree: Better for read-heavy workloads with very little data change, but poor performance for the high-volume write requirement of this system.
FoundationDB (Shared-Disk): Uses a decoupled transaction log and storage, but much more complex to implement for an MVP.