The Question

Design a High-Performance Distributed In-Memory Cache

Design a distributed in-memory key-value caching system capable of handling 1 million queries per second with sub-millisecond latency. The system must scale horizontally to accommodate 100TB of data, provide high availability in the event of node failures, and implement an efficient eviction strategy. Discuss your approach to data partitioning, handling hot keys, and maintaining cluster membership in a dynamic environment.

Consistent Hashing

LRU Cache

Zookeeper

gRPC

mTLS

Slab Allocation

TCP/IP

Redis

Questions & Insights

Clarifying Questions

What is the primary goal? (e.g., General-purpose cache like Redis/Memcached or an application-specific cache?)

What is the scale? (What is the target QPS, total data size, and number of concurrent connections?)

What are the consistency requirements? (Is eventual consistency acceptable, or do we need strong consistency across replicas?)

What eviction policies are required? (LRU, LFU, or just TTL-based?)

What is the read/write ratio? (Usually, caches are read-heavy, but some use cases like session management involve frequent writes.)

Assumptions for MVP:

Scale: 1M+ QPS, 100TB total data across the cluster.

Consistency: Eventual consistency (performance is prioritized over strict consistency).

Eviction: LRU (Least Recently Used) is the default.

Availability: 99.99% availability with automatic failover.

Latency: Sub-millisecond for hits.

Thinking Process

The core challenge of a distributed cache is horizontal scaling and data locality.

How to partition data efficiently? We use Consistent Hashing to minimize data movement when nodes are added or removed.

How to handle high availability? We implement Primary-Backup Replication within hash rings.

How to find where data lives? Clients use a Configuration Service (Control Plane) to discover the cluster topology.

How to manage memory? Each node implements an internal LRU Cache using a Doubly Linked List and a Hash Map.

Bonus Points

Hot Key Mitigation: Implementation of "Dynamic Replication" or "Local L1 Caching" on the client-side for keys that exceed a specific QPS threshold.

Zero-Copy Serialization: Using Protobuf or FlatBuffers to minimize CPU overhead during network I/O.

Memory Management: Mentioning Slab Allocation to prevent memory fragmentation, similar to Memcached’s architecture.

Thundering Herd Protection: Using Request Collapsing (Singleflight) or "Promise-based" fetching to ensure only one request goes to the database for a cache miss.

Design Breakdown

Functional Requirements

Core Use Cases:

PUT(key, value, ttl): Store data with an optional expiration.

GET(key): Retrieve data by key.

DELETE(key): Explicitly remove data.

Scope Control:

In-Scope: Distributed sharding, replication, LRU eviction, and health monitoring.

Out-of-Scope: Complex data structures (Sets, Hashes - MVP focuses on String K/V), Persistence/Snapshots (RDB/AOF), and Multi-region replication.

Non-Functional Requirements

Scale: Must scale horizontally to hundreds of nodes.

Latency: Sub-millisecond response time for GET requests.

Availability & Reliability: Highly available; failure of a single node should not cause system-wide downtime.

Consistency: Eventual consistency between replicas to maximize throughput.

Fault Tolerance: Automatic rebalancing and partition handling via consistent hashing.

Security: Basic Auth and TLS for inter-node and client-server communication.

Estimation

Traffic Estimation:

Total QPS: 1,000,000.

Read/Write: 9:1 (900k Read, 100k Write).

Storage Estimation:

Avg Object Size: 1 KB.

100M keys = 100 GB.

With 2x replication = 200 GB.

Bandwidth Estimation:

1M QPS * 1 KB = 1 GB/s total throughput.

Blueprint

The design follows a decentralized architecture where clients are "smart" and perform sharding via consistent hashing. A central Configuration Service manages the cluster membership.

Major Components:

Client Library: Handles consistent hashing logic and connection pooling.

Cache Node: Stores data in memory with an LRU eviction policy.

Configuration Service: Tracks healthy nodes and distributes the "Hash Ring" to clients.

Simplicity Audit: This is the simplest design because it avoids a proxy layer (reducing hop latency) and uses proven consistent hashing for scaling.

Architecture Decision Rationale:

Why?: A decentralized client-side sharding model offers the lowest possible latency and removes the bottleneck of a centralized proxy.

Functional Satisfaction: Meets all basic K/V requirements.

Non-functional Satisfaction: Scalable via adding nodes; highly available via replication slots.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Not applicable for a backend distributed cache; traffic is internal to the VPC.

Security & Perimeter: Internal traffic uses mTLS and Security Groups to restrict access to the cache port (e.g., 6379).

Service

Topology & Scaling: Stateless cache nodes deployed in an Auto-Scaling Group (ASG). Scaling is triggered by memory utilization.

API Schema Design:

GET /v1/cache/{key} -> Response: {value: bytes, ttl: long}

PUT /v1/cache/{key} -> Request: {value: bytes, ttl: long}

Protocol: gRPC (for low-latency binary serialization).

Resilience & Reliability:

Client-side Retries: With exponential backoff for GET requests.

Consistent Hashing: Uses virtual nodes (e.g., 200 per physical node) to ensure even distribution and prevent "hot spots" during reshuffling.

Storage

Access Pattern: In-memory, high-frequency read/write.

Database Table Design:

LRU Structure:

HashMap<Key, Node>: For O(1) lookup.

DoublyLinkedList: To maintain access order for O(1) eviction.

Technical Selection: Custom In-memory Engine (or Redis for MVP).

Distribution Logic:

Sharding: Key-based consistent hashing.

Replication: Asynchronous replication to a "Slave" node in a different Availability Zone (AZ).

Reliability & Recovery: For MVP, if a node restarts, it starts cold. Data is repopulated from the source database (Cache-Aside pattern).

Infrastructure (Optional)

Distributed Coordination: Zookeeper or Etcd is used for "Service Discovery" and "Health Monitoring." It maintains the authoritative list of active Cache Nodes. When a node fails, Zookeeper updates the "Cluster Map," which clients pull to update their local consistent hashing ring.

Wrap Up

Advanced Topics

Trade-offs:

CAP Theorem: This system prioritizes Availability and Partition Tolerance (AP). During a partition, clients might read stale data from a replica if the primary is unreachable.

Reliability & Failure Handling:

Cache Stampede: Clients implement "Lease-based" or "Locking" mechanisms to ensure only one client refreshes a specific key from the DB at a time.

Bottleneck Analysis:

Hot Keys: If one key receives 50% of traffic, consistent hashing alone fails. Optimization: Clients detect local frequency and cache "hot keys" in-process for 5-10 seconds (L1 Cache).

Security & Privacy: All data in transit is encrypted via TLS. Data at rest (if using persistence later) is encrypted via KMS.

Distinguishing Insights:

Connection Management: Using an "Epoll" based event loop (like Nginx or Redis) on the cache nodes allows a single node to handle 10k+ concurrent connections efficiently without thread-per-connection overhead.