The Question

Scalable Distributed In-Memory Cache Design

Design a high-performance distributed cache system capable of handling 10M+ queries per second with sub-millisecond latency. The system must support horizontal scaling, handle node failures gracefully, and provide a strategy for mitigating hot keys and cache stampedes. Focus on the sharding mechanism, replication strategy, and how the system maintains high availability while ensuring efficient memory management.

Consistent Hashing

ZooKeeper

LRU Eviction

Gossip Protocol

Leader-Follower Replication

mTLS

MurmurHash3

L1 Cache

Questions & Insights

Clarifying Questions

What is the target scale? (Assumption: 10M+ QPS, 5-10 TB of total cached data).

What is the typical payload size? (Assumption: Average 1KB, max 100KB).

What are the latency requirements? (Assumption: Sub-millisecond p99 for reads).

What is the consistency model? (Assumption: Eventual consistency is sufficient for an MVP cache; availability is prioritized).

Should the cache handle its own persistence? (Assumption: No, this is an in-memory look-aside cache MVP. Persistence is out of scope).

Thinking Process

Core Bottleneck: How to distribute keys across a dynamic set of nodes without massive re-sharding during scale-out or failure?

Key Strategy: Implement Consistent Hashing with virtual nodes to ensure uniform distribution and minimal data movement.

Node Discovery: Use a centralized Configuration Service (e.g., ZooKeeper/Etcd) to maintain the membership list and hash ring state.

High Availability: Use a Leader-Follower Replication model within shard groups to ensure the cache remains "warm" even if a primary node fails.

Bonus Points

Hot-key Mitigation: Implement a "Local L1 Cache" on the client-side or use "Bounded Load" consistent hashing to prevent single-node saturation for viral keys.

Lease-based Consistency: Use leases to prevent the "Thundering Herd" problem and ensure that only one client updates a cache entry from the database at a time.

Zero-Copy Networking: Use DPDK or RDMA for high-performance data transfer to bypass kernel overhead in ultra-low latency environments.

Memory Management: Discuss custom memory allocators (like Jemalloc) or Slab Allocation to prevent memory fragmentation in long-running processes.

Design Breakdown

Functional Requirements

Core Use Cases:

Put(key, value, ttl): Store data with an expiration.

Get(key): Retrieve data by key.

Delete(key): Explicitly remove data.

Scope Control:

In-Scope: Sharding, Eviction (LRU), Replication, and Service Discovery.

Out-of-Scope: Persistent storage (AOF/RDB), complex data structures (sets/hashes), and multi-region synchronization.

Non-Functional Requirements

Scale: Horizontal scaling to hundreds of nodes; support for 10M+ QPS.

Latency: Sub-millisecond response time for standard GET operations.

Availability & Reliability: 99.99% availability; handle node failures without data loss for "warm" items.

Consistency: Eventual consistency; update-to-read delay should be minimal.

Fault Tolerance: Automatic failover for shard leaders.

Estimation

Traffic Estimation:

10M QPS total.

90% Reads (9M QPS), 10% Writes (1M QPS).

Storage Estimation:

100 million active keys * 1KB avg size = 100GB per hour of active data.

5TB total memory for a 48-hour TTL window (considering some churn).

Bandwidth Estimation:

10M QPS * 1KB = 10GB/s total egress. This requires a 100Gbps network backbone or multiple 10Gbps NICs distributed across nodes.

Blueprint

Concise Summary: A peer-to-peer distributed cache where clients use a consistent hashing algorithm to locate the correct shard. A configuration service manages node health and the hash ring.

Major Components:

Client Library: Responsible for consistent hashing, connection pooling, and local L1 caching.

Configuration Service: Acts as the source of truth for the hash ring and node health.

Cache Shard (Node): In-memory storage engine using LRU eviction and leader-follower replication.

Simplicity Audit: This design avoids complex "master" nodes that can become bottlenecks, relying on client-side smarts for routing.

Architecture Decision Rationale:

Why this architecture?: Consistent hashing allows for seamless scaling. Client-side routing reduces internal network hops (no proxy overhead).

Functional Satisfaction: Supports basic K/V operations with high throughput.

Non-functional Satisfaction: Scalable via sharding, available via replication, and fast via in-memory storage.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling: Shards are deployed in Leader-Follower pairs across Availability Zones. Scaling is achieved by adding new nodes to the Hash Ring in the Configuration Service.

API Schema Design:

GET /v1/key/{key}: REST or gRPC. Returns value + TTL metadata.

PUT /v1/key/{key}: Body: binary_data, Header: X-TTL-Seconds.

Resilience & Reliability:

Retries: Client performs 1 retry on a different replica if the leader is unreachable.

Circuit Breaker: If a shard is consistently slow, the client library marks it "unhealthy" and fails fast to prevent cascading application latency.

Observability:

Metrics: Cache Hit/Miss ratio, Memory usage, Eviction rate, P99 Latency.

Security: Service-to-service Auth via IAM or mTLS.

Storage

Access Pattern: 90:10 Read:Write. O(1) lookups.

Database Table Design (In-Memory Map):

Key: String/Hash (Primary Key).

Value: Byte Array.

ExpiresAt: Timestamp.

LRU_Pointers: Doubly linked list pointers for eviction.

Technical Selection: Custom In-Memory Store (similar to Redis/Memcached).

Distribution Logic:

Consistent Hashing: Use MurmurHash3 for key distribution.

Virtual Nodes: Assign 256 virtual nodes per physical node to prevent "lumpy" data distribution.

Reliability & Recovery:

Replication is asynchronous for performance.

On leader failure, the configuration service promotes a follower to leader and updates the ring.

Messaging

Purpose & Decoupling: Used for Invalidation Propagation.

Event Schema: {"action": "invalidate", "key": "user_123"}.

Technical Selection: Internal Gossip Protocol or a lightweight Bus.

Failure Handling: If invalidation fails, TTL ensures eventually consistent cleanup.

Infrastructure (Optional)

Distributed Coordination:

ZooKeeper: Stores the "Ring State" (which physical nodes map to which hash ranges).

Clients watch ZooKeeper nodes for changes to update their local hash ring copy immediately.

Wrap Up

Advanced Topics

Trade-offs (AP vs CP): This is an AP system. In a partition, we prefer returning stale data or a cache miss rather than blocking.

Reliability:

Cache Stampede: When a hot key expires, multiple clients might hit the DB. Solution: Soft TTLs (refresh in background before expiry) or Wait-on-First-Query logic.

Bottleneck Analysis:

Hot Shards: If one key is accessed 1M times/sec, one shard will die.

Optimization: The client library detects high-frequency keys and promotes them to a "Local L1 Cache" (In-process memory) for 1-5 seconds.

Security: Data is encrypted in transit using TLS.

Distinguishing Insights: To handle TCP Incast, we use jumbo frames or tuned TCP stacks to prevent congestion when many shards respond to a multi-get request simultaneously.