The Question
DesignDistributed In-Memory Cache
Design a distributed in-memory caching system similar to Redis or Memcached. The system should provide low-latency key-value access, support horizontal scaling across multiple nodes, handle cache eviction and invalidation, and remain available during node failures.
Consistent Hashing
LRU Eviction
etcd
HashMap
Doubly Linked List
Questions & Insights
Thinking Process
To design a production-grade distributed cache, we must solve for data locality, horizontal scalability, and high availability. The strategy focuses on minimizing network hops and ensuring uniform load distribution.
How do we distribute data across nodes to minimize reshuffling during scaling? Use Consistent Hashing with virtual nodes to ensure minimal key movement when nodes are added or removed.
How do we handle memory exhaustion on a single node? Implement a pluggable eviction policy (e.g., LRU - Least Recently Used) to maintain a fixed memory footprint.
How does a client know which node to talk to without a bottleneck proxy? Use a "Smart Client" approach where the client fetches a membership map from a Configuration Service and performs hashing locally.
How do we maintain availability if a node crashes? Implement primary-secondary replication for each shard (partition) within the hash ring.
Bonus Points
Hot Key Mitigation: Implement "Bounded Loads" or "Request Hedging" where frequently accessed keys are replicated across multiple adjacent nodes in the hash ring to prevent a single node from being overwhelmed.
Zero-Copy Serialization: Use Protobuf or FlatBuffers for the wire protocol to minimize CPU overhead during serialization/deserialization, as distributed caches are often CPU-bound by networking.
Memory Fragmentation Management: Utilize a Slab Allocator (similar to Memcached) to pre-allocate memory chunks of fixed sizes, preventing OS-level heap fragmentation.
Probabilistic Eviction: Use TinyLFU (Least Frequently Used) rather than strict LRU to improve hit rates by tracking frequency sketches with minimal memory overhead.
Design Breakdown
Functional Requirements
Put(key, value, ttl): Store a value with an optional expiration.Get(key): Retrieve a value by key.Delete(key): Manually remove a key.Node Discovery: Automatic detection of new or failed cache nodes.
Non-Functional Requirements
Ultra-Low Latency: Sub-millisecond response times for
Get operations.Scalability: Linear scaling as more nodes are added.
Availability: High availability via replication (survive N-1 node failures in a shard).
Consistency: Eventual consistency for the MVP to prioritize performance.
Estimation
Total Data: 10 TB.
Node RAM: 128 GB.
Required Nodes: ~100 nodes (including 2x replication + 20% overhead).
Throughput: 1 Million+ Queries Per Second (QPS).
Network: 10 Gbps NICs are standard; serialization is the likely bottleneck before bandwidth.
Blueprint
Concise Summary: A peer-to-peer distributed hash table (DHT) using consistent hashing. Clients act as "smart" participants, routing requests directly to the correct shard.
Major Components:
Smart Client Library: Performs consistent hashing and maintains a local cache of the cluster membership map.
Cache Node: An in-memory storage engine using a Hash Map and an LRU doubly-linked list for eviction.
Configuration Service (etcd/Zookeeper): Acts as the source of truth for node health and the current hash ring state.
Simplicity Audit: This architecture removes the need for a centralized load balancer/proxy, eliminating a single point of failure and reducing latency by one network hop.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling: Sharded peer-to-peer topology. Scaling is achieved by adding nodes to the hash ring. The
etcd service triggers a watch event to the Smart Client Lib when the ring changes.API Spec:
gRPC/TCP: Used for high-performance binary communication.
rpc Get(KeyRequest) returns (ValueResponse)rpc Put(PutRequest) returns (Empty)Storage
Data Model: Key-value pairs stored in a concurrent Hash Map.
Database Logic:
Hash Ring: Use MurmurHash3 or Ketama for uniform distribution.
Virtual Nodes: Each physical node is assigned 100-200 virtual nodes on the ring to balance load.
Eviction: A Doubly Linked List + Hash Map implementation of LRU. When memory limit is reached, the tail of the list is evicted.
Cache
Data Structures: ConcurrentHashMap for O(1) lookups.
TTL Support: A Min-Heap or a background "sweeper" thread to clear expired keys that haven't been evicted by the LRU logic.
Write Policy: Write-around (Client writes to cache, then to DB) or Write-through (Cache handles DB update). For MVP, we assume a simple cache-aside pattern managed by the Client.
Wrap Up
Advanced Topics
Trade-offs:
Consistency vs. Latency: We chose Eventual Consistency for replication. A "Put" might return success after writing to the primary, while secondaries catch up asynchronously to keep latency low.
Bottlenecks:
Configuration Service: If
etcd is down, new clients cannot join, and membership changes aren't propagated. However, existing clients can continue using their cached ring.Hot Keys: A single key requested by millions (e.g., a celebrity profile) will hit one shard.
Failure Handling:
Replication: Each shard has a Primary and N Secondaries.
State Transfer: When a new node joins, it takes over a portion of the ring from its neighbor. A background process migrates relevant keys to the new node.
Alternatives & Optimization:
Proxy Mode: Instead of a Smart Client, use a proxy (like Twemproxy or Envoy). This simplifies client logic but adds a hop (latency) and a management layer.
UDP: For
Get operations where some packet loss is acceptable, UDP can be used to reduce TCP handshake/congestion control overhead.