The Question
Design

Scalable Distributed Sharded KV Store

Design a distributed data storage system that can handle petabyte-scale datasets and millions of requests per second. Explain how you would implement sharding, ensure data durability during node failures, and handle 'hot spots' in the data distribution while maintaining low-latency access.
LSM-Tree
Raft
Consistent Hashing
etcd
Redis
SSTables
Bloom Filter
mTLS
gRPC
Questions & Insights

Clarifying Questions

What is the scale of the data and throughput?
Assumption: We need to support 100TB+ of data with 1 million+ Queries Per Second (QPS), with an 80/20 read/write ratio.
What are the consistency and availability requirements (CAP)?
Assumption: We prioritize Availability and Partition Tolerance (AP) but provide Tunable Consistency (similar to DynamoDB) where single-key operations can be configured for Strong Consistency.
What is the primary access pattern?
Assumption: Point lookups and small range scans based on a partition key.
How should the system handle growth and rebalancing?
Assumption: The system must support horizontal scaling without downtime, using consistent hashing to minimize data movement during reshards.

Thinking Process

Partitioning Strategy: How do we map a key to a specific node? We will use Consistent Hashing with Virtual Nodes to ensure even distribution and minimize remapping during cluster membership changes.
Physical Storage Engine: How does a single shard handle high-speed writes? We will use LSM-Trees (Log-Structured Merge-trees) to turn random writes into sequential I/O.
Request Routing: How does a client find the data? We will implement a Metadata Service (backed by Etcd/Zookeeper) to store the shard map, which is cached by clients for direct-to-node routing.
Availability & Replication: How do we handle node failure? We will use Synchronous Replication within a "Replication Group" (using a consensus protocol like Raft) to ensure no data loss on leader failure.

Bonus Points

Hot Partition Mitigation: Implementation of "Dynamic Splitting" where heavily accessed shards are automatically split into smaller ranges based on real-time QPS metrics.
Checksumming & Entropy Reduction: Using Merkle Trees for background anti-entropy processes to detect and repair data inconsistencies between replicas without scanning the entire dataset.
Write-Optimized Path: Using a Commit Log (WAL) that bypasses the file system cache (O_DIRECT) to ensure durability with minimal latency jitter.
Zero-Copy Transfers: Utilizing sendfile or io_uring for shard migration to saturate network bandwidth while minimizing CPU overhead.
Design Breakdown

Functional Requirements

Store and retrieve key-value pairs at petabyte scale.
Support atomic "Put", "Get", and "Delete" operations on a single key.
Provide a "Scan" operation for keys within a specific shard.
Support horizontal scaling (adding nodes) and high availability (handling node failures).

Non-Functional Requirements

Low Latency: Sub-10ms P99 for reads and writes.
High Scalability: Linear scaling as more storage nodes are added.
Durability: No data loss after a write is acknowledged.
Availability: 99.99% uptime (Multi-AZ deployment).

Estimation

Data Volume: 100 TB.
Node Capacity: 2 TB per node \rightarrow 50 nodes (minimum).
Redundancy: 3x replication \rightarrow 150 nodes.
Throughput: 1M QPS.
Network: If avg. object size is 1KB, 1M QPS = 1GB/s total traffic. Distributed across 150 nodes, each node handles ~6.7MB/s, which is well within 10Gbps NIC limits.

Blueprint

Concise Summary: A distributed, sharded key-value store utilizing Consistent Hashing for data distribution and a Raft-based consensus group per shard for high availability.
Major Components:
Client SDK: Handles request routing by localizing the shard map and performing smart retries.
Metadata Service: A strongly consistent store (Etcd) that maintains the mapping of key ranges to storage nodes.
Storage Node: The workhorse that manages local data using LSM-Trees and participates in replication groups.
Simplicity Audit: This architecture avoids complex global transactions or multi-shard joins, focusing purely on single-key performance which is the "M" in MVP for a sharded structure.
Architecture Decision Rationale:
Why this architecture?: Consistent hashing handles the dynamic nature of cloud environments (node failures/additions) better than static sharding.
Functional Satisfaction: Direct-to-node access via SDK ensures functional requirements are met with minimal overhead.
Non-functional Satisfaction: LSM-Trees provide the required write-throughput, while Raft ensures the required durability and availability.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling
Stateless/Stateful: Storage nodes are stateful; Metadata service is stateful (Etcd).
Scaling is triggered by disk utilization (>70%) or CPU load.
Rollout: Canary deployment per shard group to ensure no data corruption occurs during upgrades.
API Schema Design
get(key): REST/gRPC. Returns Value + Version.
put(key, value): REST/gRPC. Idempotent via ClientID + SequenceNumber.
delete(key): REST/gRPC. Logical deletion via Tombstones.
Resilience & Reliability
Circuit Breaker: SDK trips if a specific node times out, falling back to replicas for read-only operations.
Backoff: Exponential backoff with jitter for transient network errors.

Storage

Access Pattern
High write volume (LSM-tree optimized) and point-read volume (Bloom filters + Index optimized).
Database Table Design
Local SSTable format: [Key, Value, Timestamp, Tombstone_Flag].
Keys are sorted within SSTable files to support range scans within a shard.
Technical Selection
LSM-Tree: Chosen for write performance.
S3 (Cold Tiering): Optional for long-term archival of old SSTable files.
Distribution Logic
ShardID = Hash(PartitionKey) % TotalVirtualNodes.
Virtual Nodes: Each physical node manages ~100-200 virtual nodes to ensure balanced distribution.
Reliability & Recovery
RPO = 0: Achieved via Raft log persistence before write acknowledgment.
RTO < 30s: Leader election time in case of a node failure.

Cache

Purpose & Justification: Mitigate "Hot Key" issues where a single key exceeds the QPS capacity of a single Storage Node.
Key-Value Schema: Key -> Value.
Technical Selection: Redis.
Failure Handling: Cache-aside pattern. If Redis is down, traffic falls back to Storage Nodes (which have internal Block Caches).

Messaging

Purpose & Decoupling: Used for the Raft WAL (Write Ahead Log) to ensure ordered replication.
Delivery Semantics: Exactly-once processing within the Raft state machine.
Technical Selection: Custom internal implementation or NATS/gRPC streams for low-latency log replication.

Infrastructure (Optional)

Observability
Prometheus/Grafana: Monitoring "Shard Heatmaps" to identify imbalance.
Distributed Tracing: Jaeger for tracing a get request from SDK to specific Storage Node.
Wrap Up

Advanced Topics

Trade-offs (PACELC): The system is PA/EC. In the event of a partition, we prioritize Availability. Under normal operation, we trade some latency for Consistency (via Raft).
Optimization (LSM Compaction): To prevent "Write Stalls" during heavy compaction, we implement Leveled Compaction which bounds the number of files in each level, keeping read amplification low.
Bottleneck Analysis: The Metadata Service could become a bottleneck if every request queries it.
Fix: The Client SDK caches the shard map and only refreshes it on a MOVED error (similar to Redis Cluster).
Security: Data is encrypted at rest using AES-256. mTLS is used for all node-to-node communication (Raft heartbeat and log replication).