The Question

Distributed Key-Value Store Design

Design a highly available and horizontally scalable distributed Key-Value store. The system must support millions of requests per second, handle petabytes of data, and provide tunable consistency models. Detail how you would handle data partitioning, replication across multiple nodes, conflict resolution for concurrent writes, and decentralized failure detection without a single point of failure.

Consistent Hashing

LSM-tree

RocksDB

Gossip Protocol

Vector Clocks

Quorum

Bloom Filters

Merkle Tree

Hinted Handoff

Questions & Insights

Clarifying Questions

What is the typical size of the values?

Assumption: Values are relatively small (average 1KB, max 1MB) to fit a standard Key-Value use case.

What is the required consistency model?

Assumption: Tunable consistency is required. Clients choose per request: quorum settings with R + W > N give strongly consistent reads, while smaller R or W values trade that guarantee for lower latency and eventual consistency.

What is the expected scale in terms of throughput and data volume?

Assumption: 1 Million QPS (Read-heavy, 9:1 ratio) and 10TB total data.

Is multi-region availability a requirement for the MVP?

Assumption: No, the MVP will focus on a single multi-AZ cluster with a leaderless architecture for high availability.

Thinking Process

To design a distributed Key-Value store that scales horizontally, we must solve for data partitioning, availability, and conflict resolution.

How do we distribute data without creating hotspots? We use Consistent Hashing with virtual nodes to ensure even distribution and minimize remapping during scale-out.

How do we ensure the system stays available if nodes fail? We implement a leaderless replication model (Dynamo-style) where data is replicated across N nodes, allowing any node to handle requests.

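How do we handle concurrent writes to the same key? We use Vector Clocks to track causality between versions. When two versions turn out to be concurrent, the MVP falls back to "Last Write Wins" (LWW) by timestamp as a simplified conflict resolution strategy, accepting that one of the writes is discarded.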

How do we detect and recover from node failures? We use a Gossip Protocol for decentralized membership and failure detection, combined with Hinted Handoff for temporary failures and Merkle-tree anti-entropy to repair replicas that have diverged after longer or permanent failures.
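The gossip-based membership mentioned above can be sketched roughly as follows. This is a minimal illustration, not a full protocol: the `Heartbeat` record, `merge_views`, and `suspected` are hypothetical names, and a fixed timeout stands in for a real failure detector.

```python
from dataclasses import dataclass


@dataclass
class Heartbeat:
    counter: int      # incremented by the owning node on every gossip tick
    last_seen: float  # local time at which this counter value was first observed


def merge_views(local, remote, now):
    """Merge a peer's digest (node -> counter) into our view, keeping the highest counter."""
    for node, remote_counter in remote.items():
        known = local.get(node)
        if known is None or remote_counter > known.counter:
            local[node] = Heartbeat(remote_counter, now)
    return local


def suspected(local, now, timeout):
    """Nodes whose heartbeat counter has not advanced within `timeout` seconds."""
    return sorted(n for n, hb in local.items() if now - hb.last_seen > timeout)


# Node A's view before and after receiving a digest from a peer.
view_a = {"a": Heartbeat(5, 100.0), "b": Heartbeat(3, 90.0)}
digest_from_peer = {"a": 2, "b": 4, "c": 7}
merge_views(view_a, digest_from_peer, now=101.0)
```

Stale information (the peer's older counter for "a") is ignored, while newer counters refresh `last_seen`, so a node is only suspected when nobody in the cluster has heard a fresh heartbeat from it.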

Bonus Points

LSM-tree Storage Engine: Utilizing an LSM-tree (Log-Structured Merge-tree) engine such as RocksDB for the underlying storage layer to optimize for high write throughput and SSD endurance.

Merkle Tree Anti-Entropy: Implementing Merkle trees to detect data inconsistencies between replicas efficiently by comparing hash trees instead of entire datasets.

Vector Clocks vs. HLC: Discussing the trade-offs between Vector Clocks (causal consistency) and Hybrid Logical Clocks (HLC) for providing a more user-friendly timestamp-based ordering while maintaining causality.

Zero-Copy Transfers: Using zero-copy system calls such as sendfile for bulk data replication to avoid copying buffers between kernel and user space, and direct I/O where bypassing the page cache helps, maximizing network throughput.

Design Breakdown

Functional Requirements

Core Use Cases:

put(key, value): Stores a value associated with a key.

get(key): Retrieves the value associated with a key.

Scope Control:

In-scope: Distributed storage, partitioning, replication, conflict resolution, and failure detection.

Out-of-scope: Complex transactions (ACID), secondary indexing, and SQL-like querying.

Non-Functional Requirements

Scale: Support for 1M+ QPS and 10TB+ of data.

Latency: P99 read/write latency < 10ms.

Availability & Reliability: 99.99% availability (High availability via replication).

Consistency: Tunable consistency (CAP trade-off: AP by default, CP optional).

Fault Tolerance: No single point of failure; must survive node and rack failures.

Security: Basic AuthN/AuthZ at the coordinator level and encryption at rest.

Estimation

Traffic Estimation:

Total QPS: 1,000,000.

Read QPS (90%): 900,000.

Write QPS (10%): 100,000.

Storage Estimation:

10TB total data.

With 3x replication: 30TB total raw storage.

Average key-value pair size: 1KB.

Bandwidth Estimation:

Read Bandwidth: 900k * 1KB = 900 MB/s.

Write Bandwidth: 100k * 1KB * 3 (replication) = 300 MB/s.

Total: ~1.2 GB/s.
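The arithmetic above can be reproduced with a short script. It assumes decimal units (1KB = 1,000 bytes), which is what the rounded figures imply; the variable names are illustrative.

```python
TOTAL_QPS = 1_000_000
READ_PERCENT = 90
VALUE_BYTES = 1_000   # 1KB average pair size, decimal units assumed
REPLICATION = 3
DATA_TB = 10

read_qps = TOTAL_QPS * READ_PERCENT // 100
write_qps = TOTAL_QPS - read_qps

read_mb_s = read_qps * VALUE_BYTES / 1e6
write_mb_s = write_qps * VALUE_BYTES * REPLICATION / 1e6   # each write goes to 3 replicas
total_gb_s = (read_mb_s + write_mb_s) / 1e3

raw_storage_tb = DATA_TB * REPLICATION

print(read_qps, write_qps, read_mb_s, write_mb_s, total_gb_s, raw_storage_tb)
```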

Blueprint

Concise Summary: A leaderless, Dynamo-style distributed Key-Value store utilizing consistent hashing for partitioning and a Quorum-based protocol for tunable consistency.

Major Components:

Client Library: Handles request routing using a hash ring and provides retry logic.

Coordinator Node: Any node in the cluster that receives a request and manages the Quorum (R/W) across replicas.

Storage Engine (LSM-tree): Local per-node persistence layer optimized for high-speed writes.

Gossip Service: Decentralized background service for cluster membership and health monitoring.

Simplicity Audit: This design avoids complex leader election (Paxos/Raft) in favor of a simpler "Any-Node-is-Coordinator" model, which is easier to scale and maintain for a basic Key-Value store.

Architecture Decision Rationale:

Why leaderless?: It maximizes availability and write throughput, as any node can accept writes, avoiding the bottleneck of a single leader.

Functional Requirement Satisfaction: put and get are handled via direct hash-based routing.

Non-functional Requirement Satisfaction: Scalability is achieved via consistent hashing; availability via replication; and latency via local LSM-tree writes.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling

Leaderless Topology: Every node can act as a coordinator. Nodes are arranged in a logical ring (Consistent Hashing).

Scaling Signals: Scale-out is triggered by disk utilization or CPU saturation. New nodes join the ring and take over a share of virtual nodes from existing nodes.

Failure Domains: Replicas are placed across different Availability Zones (AZs) using a "Rack Aware" placement policy.
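One way to picture the rack-aware policy: walk the candidate nodes in ring order and prefer nodes from zones not used yet. The sketch below is an assumption about how such a policy could work; `place_replicas`, `ring_order`, and `zone_of` are hypothetical names.

```python
def place_replicas(ring_order, zone_of, n):
    """Pick n replicas from nodes listed in clockwise ring order, spreading across zones first."""
    chosen, used_zones = [], set()

    # First pass: take at most one node per availability zone.
    for node in ring_order:
        if len(chosen) == n:
            break
        if node not in chosen and zone_of[node] not in used_zones:
            chosen.append(node)
            used_zones.add(zone_of[node])

    # Second pass: if there are fewer zones than replicas, fill with the next nodes on the ring.
    for node in ring_order:
        if len(chosen) == n:
            break
        if node not in chosen:
            chosen.append(node)

    return chosen


zones = {"n1": "az-a", "n2": "az-a", "n3": "az-b", "n4": "az-c"}
replicas = place_replicas(["n1", "n2", "n3", "n4"], zones, n=3)
```

Here `n2` is skipped on the first pass because it shares a zone with `n1`, so losing a whole zone removes at most one replica of the key.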

API Schema Design

put(key, value, context): REST/gRPC. Context contains vector clock info. Returns 201 Created.

get(key): REST/gRPC. Returns Value + Context.

Idempotency: Provided by unique keys and versioning (Vector Clocks).

Resilience & Reliability

Quorum: Configurable N (replicas), W (write quorum), R (read quorum).
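The reason R + W > N matters is that any read quorum then shares at least one replica with any write quorum. A small brute-force check, written here purely as an illustration with hypothetical function names, confirms that the simple inequality matches the set-overlap property:

```python
from itertools import combinations


def quorums_overlap(n, r, w):
    """The usual rule of thumb for strongly consistent reads."""
    return r + w > n


def every_read_meets_every_write(n, r, w):
    """Brute force: does every R-sized replica set intersect every W-sized replica set?"""
    replicas = range(n)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(replicas, w)
        for read_set in combinations(replicas, r)
    )


print(quorums_overlap(3, 2, 2))   # typical strong setting
print(quorums_overlap(3, 1, 1))   # typical low-latency, eventually consistent setting
```

With N = 3, choosing R = W = 2 guarantees overlap, while R = W = 1 favors latency and leaves reads free to hit a replica that has not yet seen the latest write.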

Hinted Handoff: If a replica is down, the coordinator writes the update to another healthy node together with a "hint" naming the intended replica. The hint is replayed once that replica recovers.
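A minimal sketch of hinted handoff, assuming in-memory dictionaries stand in for replicas and a hypothetical `HintStore` represents the hints held on the fallback node. Whether a hinted write counts toward W (a sloppy quorum) is deliberately not modelled here.

```python
from collections import defaultdict


class HintStore:
    """Hints kept on a healthy node on behalf of unreachable replicas."""

    def __init__(self):
        self._hints = defaultdict(list)   # target node -> list of (key, value)

    def add(self, target, key, value):
        self._hints[target].append((key, value))

    def replay(self, target, deliver):
        """Deliver stored hints once `target` is reachable; returns how many were sent."""
        pending = self._hints.pop(target, [])
        for key, value in pending:
            deliver(key, value)
        return len(pending)


def write(key, value, replicas, alive, stores, hints):
    """Write to every live replica and store a hint for each one that is down."""
    acks = 0
    for node in replicas:
        if node in alive:
            stores[node][key] = value
            acks += 1
        else:
            hints.add(node, key, value)
    return acks


stores = {"a": {}, "b": {}, "c": {}}
hints = HintStore()
acks = write("cart:7", "v1", ["a", "b", "c"], alive={"a", "b"}, stores=stores, hints=hints)

# Later, gossip reports that "c" is back, so the hint is replayed to it.
delivered = hints.replay("c", stores["c"].__setitem__)
```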

Observability

Metrics: Request latency (P99), Quorum failures, Gossip convergence time.

Security

mTLS: Required for all node-to-node communication (Replication, Gossip).

Storage

Access Pattern

High-frequency random reads and sequential writes. Write-ahead logging (WAL) for durability.
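To illustrate why the WAL gives durability, here is a toy write path: append to the log and fsync before updating the in-memory table, then rebuild the table from the log on restart. `TinyStore` is a hypothetical teaching aid, not how RocksDB is implemented, and it omits SSTable flushes and compaction.

```python
import json
import os
import tempfile


class TinyStore:
    def __init__(self, wal_path):
        self.wal_path = wal_path
        self.memtable = {}
        self._replay()
        self._wal = open(wal_path, "a", encoding="utf-8")

    def _replay(self):
        """Rebuild the memtable from the log after a restart."""
        if not os.path.exists(self.wal_path):
            return
        with open(self.wal_path, encoding="utf-8") as log:
            for line in log:
                record = json.loads(line)
                self.memtable[record["k"]] = record["v"]

    def put(self, key, value):
        # Sequential append plus fsync first, so the write survives a crash.
        self._wal.write(json.dumps({"k": key, "v": value}) + "\n")
        self._wal.flush()
        os.fsync(self._wal.fileno())
        self.memtable[key] = value

    def get(self, key):
        return self.memtable.get(key)

    def close(self):
        self._wal.close()


wal_file = os.path.join(tempfile.mkdtemp(), "wal.log")
store = TinyStore(wal_file)
store.put("user:1", "alice")
store.put("user:1", "alice-v2")
store.close()

recovered = TinyStore(wal_file)   # simulates a process restart
```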

Database Table Design

Local Schema (SSTables):

Key: Binary/String (Primary Key)

Value: Binary Blob

Timestamp/VectorClock: Metadata for conflict resolution.

Technical Selection

RocksDB: Used as the local storage engine. Rationale: High performance, flexible compaction strategies, and a proven track record as a building block for K-V stores (e.g., used in TiKV, and in CockroachDB before its move to Pebble).

Distribution Logic

Consistent Hashing: Keys are hashed to a 128-bit ring space. Each node is assigned multiple "virtual nodes" to prevent data skew.
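A compact sketch of a ring with virtual nodes follows. MD5 is chosen here only because its digest happens to be 128 bits wide, matching the ring space described above; the `HashRing` class, the `node#i` vnode labels, and the default of 64 vnodes per node are all assumptions for illustration.

```python
import bisect
import hashlib


def ring_position(label):
    # 128-bit position on the ring; MD5 is used for its width, not for security.
    return int.from_bytes(hashlib.md5(label.encode()).digest(), "big")


class HashRing:
    def __init__(self, vnodes_per_node=64):
        self.vnodes_per_node = vnodes_per_node
        self._positions = []   # sorted virtual-node positions
        self._owner = {}       # position -> physical node

    def add_node(self, node):
        for i in range(self.vnodes_per_node):
            pos = ring_position(f"{node}#{i}")
            bisect.insort(self._positions, pos)
            self._owner[pos] = node

    def preference_list(self, key, n):
        """First n distinct physical nodes found walking clockwise from the key."""
        start = bisect.bisect_right(self._positions, ring_position(key))
        result = []
        for step in range(len(self._positions)):
            pos = self._positions[(start + step) % len(self._positions)]
            node = self._owner[pos]
            if node not in result:
                result.append(node)
                if len(result) == n:
                    break
        return result


ring = HashRing()
for name in ("node-a", "node-b", "node-c"):
    ring.add_node(name)
replicas_for_key = ring.preference_list("user:42", 3)
```

When a fourth node joins, a key's primary owner either stays the same or becomes the new node, which is the "minimal remapping" property that makes scale-out cheap.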

Reliability & Recovery

Merkle Trees: Generated per partition. Nodes compare trees during background "Anti-entropy" rounds to sync missing data.
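The comparison step can be sketched as below, assuming each partition is split into a power-of-two number of key buckets and that both replicas bucket keys the same way. `build_tree` and `diff_buckets` are hypothetical helpers; a real implementation would exchange hashes over the network level by level rather than hold both trees locally.

```python
import hashlib


def _h(data):
    return hashlib.sha256(data).digest()


def build_tree(buckets):
    """buckets: list of dicts (key -> value). Returns hash levels, leaves first, root last."""
    level = [_h(repr(sorted(b.items())).encode()) for b in buckets]
    levels = [level]
    while len(level) > 1:
        level = [_h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
        levels.append(level)
    return levels


def diff_buckets(tree_a, tree_b):
    """Indices of leaf buckets that differ, descending only into mismatching subtrees."""
    top = len(tree_a) - 1
    if tree_a[top][0] == tree_b[top][0]:
        return []          # roots match: replicas are in sync for this partition
    mismatched = [0]
    for depth in range(top - 1, -1, -1):
        next_level = []
        for idx in mismatched:
            for child in (2 * idx, 2 * idx + 1):
                if tree_a[depth][child] != tree_b[depth][child]:
                    next_level.append(child)
        mismatched = next_level
    return mismatched


replica_a = [{"k1": "v1"}, {"k2": "v2"}, {"k3": "v3"}, {"k4": "v4"}]
replica_b = [{"k1": "v1"}, {"k2": "v2"}, {"k3": "stale"}, {"k4": "v4"}]
out_of_sync = diff_buckets(build_tree(replica_a), build_tree(replica_b))
```

Only the buckets listed in `out_of_sync` need to be streamed between replicas, which is what keeps anti-entropy cheap when most data already matches.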

Wrap Up

Advanced Topics

Trade-offs (PACELC): In the event of a partition (P), we prioritize Availability (A) over Consistency (C). Else (E), under normal operation, we prioritize Latency (L) over Consistency (C).

Conflict Resolution:

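Vector Clocks: Essential for detecting causal relationships. If two versions are concurrent (neither is an ancestor of the other), causality alone cannot pick a winner, so the system either applies a server-side policy such as LWW or returns both versions and lets the client resolve the conflict (e.g., merging shopping carts). Read repair then writes the resolved version back to stale replicas.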
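The ancestor/descendant test can be sketched with clocks represented as dictionaries from node ID to counter. The function names are illustrative, and missing entries are treated as zero.

```python
def increment(clock, node):
    """Return a copy of `clock` with `node`'s counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def compare(a, b):
    """Return 'equal', 'before', 'after', or 'concurrent' for vector clocks a and b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"      # a is an ancestor of b
    if b_le_a:
        return "after"       # a is a descendant of b
    return "concurrent"      # neither dominates: a genuine conflict


def merge(a, b):
    """Element-wise maximum, used as the clock of a resolved version."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}


v1 = increment({}, "A")    # first write, coordinated by node A
v2 = increment(v1, "B")    # update based on v1, coordinated by B
v3 = increment(v1, "C")    # another update based on v1, coordinated by C
resolved = merge(v2, v3)
```

Here `v2` and `v3` both descend from `v1` but not from each other, so they are reported as concurrent; the merged clock dominates both and marks the conflict as resolved.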

Bottleneck Analysis:

Hot Keys: A single key receiving millions of hits will overwhelm the replicas that own its hash range. Mitigation: Client-side caching or "sharding" hot keys by appending a random suffix.

Gossip Storms: In very large clusters, gossip traffic can consume significant bandwidth. Mitigation: Increase gossip interval or move to a hierarchical gossip structure.
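Returning to the hot-key mitigation above, one possible reading of the random-suffix idea for a read-heavy key is sketched here: writes fan out to every suffixed copy and each read picks one copy at random, so the copies land on different hash ranges. This is an assumed variant with hypothetical helper names; a write-heavy key would instead write to one random suffix and aggregate on read.

```python
import random


def shard_keys(key, shards):
    """All suffixed copies of a hot key, e.g. 'k#0' .. 'k#3'."""
    return [f"{key}#{i}" for i in range(shards)]


def put_hot(store, key, value, shards):
    # Fan the write out so every copy holds the current value.
    for shard_key in shard_keys(key, shards):
        store[shard_key] = value


def get_hot(store, key, shards, rng=random):
    # Spread reads by choosing one copy at random.
    return store.get(f"{key}#{rng.randrange(shards)}")


store = {}
put_hot(store, "celebrity:1", "profile-v1", shards=4)
value = get_hot(store, "celebrity:1", shards=4)
```

The cost is write amplification by the shard count, which is acceptable under the 9:1 read-heavy ratio assumed earlier but should be limited to keys known to be hot.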

Distinguishing Insights:

Phi Accrual Failure Detector: Instead of a simple binary "up/down" heartbeat, use a suspicion-based detector (as Cassandra does) that adapts its threshold to observed network conditions and so reduces flapping.

Bloom Filters: Every node maintains Bloom Filters for its SSTables to skip disk I/O for keys that definitely do not exist, significantly boosting get() performance for missing keys.
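A Bloom filter of the kind described above can be sketched in a few lines. The sizes and the salted SHA-256 probing scheme are assumptions chosen for readability; production engines use faster hashes and size the filter from the expected key count.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=4):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, key):
        # Derive each probe position by salting the hash input with the probe index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, key):
        """False means the key is definitely absent; True means it may be present."""
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


sstable_filter = BloomFilter()
sstable_filter.add("user:1")
sstable_filter.add("user:2")
```

On a get(), the node consults the filter first and only reads the SSTable when `might_contain` returns True, so lookups for missing keys usually avoid disk I/O at the price of occasional false positives.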