DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
Design

Scalable Distributed Key-Value Store

Design a globally distributed, sharded data store capable of handling hundreds of terabytes of data and millions of concurrent requests. The system must support tunable consistency, automatic rebalancing as nodes are added or removed, and maintain high availability even during regional network partitions. Detail your strategy for data partitioning, conflict resolution, and intra-cluster coordination.
Consistent Hashing
LSM-Tree
RocksDB
Gossip Protocol
Merkle Tree
Quorum
Vector Clocks
mTLS
SSTable
Questions & Insights

Clarifying Questions

What is the primary access pattern and data type? (e.g., Simple Key-Value pairs, Document-based, or Time-series?)
Assumption: We are building a high-performance Distributed Key-Value Store (like DynamoDB or Cassandra) to handle arbitrary byte arrays.
What are the scale requirements for storage and throughput?
Assumption: Support 100 TB of data with 1 Million+ QPS (Read/Write ratio 80/20).
What consistency model is required?
Assumption: Tunable consistency (Strong for writes, Eventual or Strong for reads) to satisfy CAP theorem trade-offs depending on use-case.
How should the system handle rebalancing and node failures?
Assumption: Automatic rebalancing using consistent hashing with virtual nodes to prevent "hot shards."

Thinking Process

Core Bottleneck: Hot spotting and data skew in a massive distributed environment.
Key Focus Areas:
Partitioning Strategy: How to map keys to physical nodes efficiently.
High Availability: Replication strategies (Quorum-based).
Membership & Health: How nodes know about each other without a single point of failure.
Progressive Questions:
How do we map a key to a specific server without losing data when a server is added? (Consistent Hashing).
How do we ensure data isn't lost if that server dies? (N-way Replication).
How do we ensure all replicas eventually agree on the data? (Gossip Protocol & Merkle Trees).
How does a client find the right node without a centralized bottleneck? (Metadata Service or Client-side Routing).

Bonus Points

LSM-Tree Storage Engine: Use Log-Structured Merge-trees (like RocksDB) for the local storage on each shard to optimize for high-write throughput.
Merkle Tree Anti-Entropy: Use Merkle trees to detect inconsistencies between replicas quickly during background synchronization, minimizing network bandwidth.
Vector Clocks / Version Stamps: Implement logical clocks to resolve "last-write-wins" conflicts in an eventually consistent multi-master setup.
Rack-Aware Placement: Ensure replicas are placed across different failure domains (racks/availability zones) to survive data center-level outages.
Design Breakdown

Functional Requirements

Core Use Cases:
Put(key, value): Store a value associated with a unique key.
Get(key): Retrieve the value associated with a unique key.
Delete(key): Remove the key-value pair.
Scope Control:
In-Scope: Horizontal scaling, partitioning, replication, and fault tolerance.
Out-of-Scope: Complex queries (Joins), secondary indexing (MVP), and multi-item transactions.

Non-Functional Requirements

Scale: Horizontal scalability to 1000+ nodes.
Latency: Sub-10ms P99 for Get operations; sub-20ms for Put.
Availability & Reliability: 99.99% availability (High Availability via replication).
Consistency: Tunable (Eventual by default, Strong via Quorum).
Fault Tolerance: No single point of failure; automatic recovery from node crashes.
Security: TLS for data-in-transit and basic RBAC for API access.

Estimation

Traffic Estimation:
Total QPS: 1,000,000.
Read QPS (80%): 800,000.
Write QPS (20%): 200,000.
Storage Estimation:
Avg Object Size: 1 KB.
Total Data: 100 TB.
With 3x Replication: 300 TB.
Bandwidth Estimation:
Ingress (Writes): 200k * 1KB = 200 MB/s.
Egress (Reads): 800k * 1KB = 800 MB/s.

Blueprint

Concise Summary: A decentralized distributed hash table (DHT) using consistent hashing to partition data across a cluster of commodity servers, utilizing a leaderless replication model for high availability.
Major Components:
Client Library: Handles request routing by calculating the key's hash and identifying target nodes.
Coordinator Node: Any node in the cluster can act as a coordinator to manage a specific request's lifecycle.
Partition Manager: Manages the consistent hashing ring and virtual node assignments.
Storage Engine (LSM): Local node persistence layer optimized for high-speed writes.
Simplicity Audit: This design avoids a centralized "Master" node for the data path, preventing a common bottleneck and simplifying the scaling process.
Architecture Decision Rationale:
Why this architecture?: Consistent hashing is the industry standard for sharded data structures as it minimizes data movement during rebalancing.
Functional Satisfaction: Directly addresses Put/Get operations via key-based routing.
Non-functional Satisfaction: Peer-to-peer nature ensures high availability and no single point of failure.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:
Stateless Coordinators: Any node can receive a request. If it doesn't own the data, it acts as a proxy (Coordinator) to the correct shard.
Scaling: Adding a node triggers a "hinted handoff" or background transfer of virtual nodes.
API Schema Design:
POST /v1/data/{key}: Write value. Returns 201 Created.
GET /v1/data/{key}: Read value. Returns 200 OK + Body.
Consistency Level: Header param X-Consistency-Level: ONE | QUORUM | ALL.
Resilience & Reliability:
Retries: Exponential backoff for 5xx errors.
Circuit Breaker: Trips if a specific shard node times out repeatedly.

Storage

Access Pattern: Write-heavy (LSM-friendly) with high-volume random reads.
Database Table Design (Local):
Each node runs a local instance of RocksDB.
Key: PartitionKey + SortKey (if applicable).
Value: Blob + VersionTag.
Technical Selection: RocksDB.
Rationale: Efficient memtable/SSTable architecture allows for sequential writes to disk, maximizing IOPS on SSDs.
Distribution Logic:
Consistent Hashing: Keyspace [0, 2^160 - 1] treated as a ring.
Virtual Nodes: Each physical node manages ~256 virtual nodes to ensure uniform distribution.
Replication: Write to N successive nodes on the ring.
Reliability & Recovery:
Hinted Handoff: If a node is down, replicas store a "hint" and deliver it when the node returns.
Anti-Entropy: Weekly Merkle tree comparisons to fix silent bit rot.

Cache

Purpose & Justification: Reduce disk I/O for frequently accessed "hot" keys.
Key-Value Schema:
Key: Key_String, Value: Blob.
TTL: 3600s.
Technical Selection: In-memory LRU Cache (within the node process).
Failure Handling: Cache misses fall back to the local LSM-Tree.

Infrastructure (Optional)

Distributed Coordination:
Gossip Protocol: Nodes exchange heartbeats and "Ring State" every 1s.
Phi Accrual Failure Detector: Adaptive algorithm to determine if a node is "down" based on network jitter.
Wrap Up

Advanced Topics

Trade-offs (CAP Theorem): This is primarily an AP (Available/Partition-Tolerant) system. It favors availability over strong consistency during a network partition, but provides "Tunable Consistency" to act as a CP system if required (Wait for Quorum).
Reliability:
Quorum formula: R + W > N. If N=3, then R=2, W=2 ensures a read always sees the latest write.
Bottleneck Analysis:
Hot Shards: If one key is accessed 1M times/sec, consistent hashing alone fails.
Optimization: Implement "Request Splitting" or front-side caching (Redis/Memcached) specifically for celebrity keys.
Security:
All node-to-node communication is encrypted via mTLS.
Data is encrypted at rest using AES-256 (provided by the storage engine).