The Question
DesignDesign a Distributed Streaming Platform
Design a distributed, horizontally scalable, and persistent message streaming system capable of handling millions of messages per second with low-latency delivery. The system must support high durability, strictly ordered delivery within partitions, and allow multiple independent consumers to read the same data stream without interference. Address how you would optimize for high throughput on commodity hardware, manage cluster metadata without external dependencies, and handle broker failures without data loss.
Sequential I/O
Zero-Copy
Raft
Append-Only Log
Page Cache
Partitioning
ISR
CDC
Questions & Insights
Clarifying Questions
What is the scale of throughput and storage?
Assumption: We need to support 1 million messages per second (Write QPS) with 1KB average message size, and data retention for 7 days.
What are the durability and consistency requirements?
Assumption: We require high durability (no data loss) and "at-least-once" delivery guarantees by default, with support for "exactly-once" via idempotent producers.
Is strict ordering required?
Assumption: Strict ordering is required at the partition level, but not globally across the entire topic.
How many consumers/producers should the system handle?
Assumption: Thousands of concurrent producers and consumers grouped into consumer groups for load balancing.
Thinking Process
Core Bottleneck: Disk I/O and Network saturation are the primary constraints for a distributed log.
Key Strategy: Leverage sequential write patterns (Append-only logs), Page Cache, and Zero-copy techniques to bypass application-layer overhead.
Progressive Questions:
How do we store data to achieve high throughput on commodity hardware? (Sequential Log).
How do we scale a single topic beyond one machine? (Partitioning).
How do we ensure high availability when a broker fails? (Leader-Follower Replication).
How do we manage cluster state and metadata efficiently? (KRaft/Consensus protocol).
Bonus Points
Zero-Copy Optimization: Explain the use of
sendfile() syscall to transfer data from Page Cache directly to the Network Socket, skipping the user-space buffer.Tiered Storage: Discuss offloading older log segments to S3/Object Storage to provide "infinite" retention while keeping the local disk footprint small and cost-effective.
Batching & Compression: Implement end-to-end compression (Producer compresses, Broker stores as-is, Consumer decompresses) to save bandwidth and IOPS.
Controller Quorum (KRaft): Move away from external dependencies like ZooKeeper to an internal Raft-based metadata quorum for faster failover and simplified operations.
Design Breakdown
Functional Requirements
Core Use Cases:
Producers can publish messages to specific topics.
Consumers can subscribe to topics and read messages from a specific offset.
Support for Consumer Groups to parallelize message processing.
Retention based on time (e.g., 7 days) or size.
Scope Control:
In-Scope: Log storage, partitioning, replication, metadata management, and basic producer/consumer APIs.
Out-of-Scope: Stream processing DSL (like Kafka Streams), Connectors (Kafka Connect), or Schema Registry.
Non-Functional Requirements
Scale: Horizontal scalability to handle petabytes of data and millions of messages/sec.
Latency: Sub-10ms end-to-end latency for P99.
Availability: 99.99% availability via leader-follower replication and automated failover.
Consistency: Configurable (e.g.,
acks=all for strong consistency, acks=1 for performance).Fault Tolerance: Resilience against broker crashes and disk failures.
Estimation
Traffic: 1M msgs/sec * 1KB = 1GB/s (Incoming). With a replication factor of 3, internal traffic is 3GB/s.
Storage: 1GB/s 86,400s 7 days ≈ 600 TB (Raw) * 3 (Replication) = 1.8 PB total.
Bandwidth:
Ingress: 1GB/s.
Egress: Variable, but assuming 2x consumption (two different consumer groups) = 2GB/s.
Blueprint
Concise Summary: A distributed, partitioned commit log service where data is written sequentially to disk and replicated across a cluster of brokers managed by a Raft-based metadata quorum.
Major Components:
Producers: Client applications that push data to brokers using a partitioning strategy (Round-robin or Key-based).
Brokers: The storage nodes that manage partitions, handle writes/reads, and replicate data.
Metadata Quorum (Controller): A subset of brokers using Raft to manage partition leadership and cluster membership.
Consumers: Client applications that pull data and track their progress via offsets.
Simplicity Audit: This architecture is the simplest because it treats the log as the source of truth and uses the OS page cache for performance instead of a complex custom caching layer.
Architecture Decision Rationale:
Sequential I/O: Hard drives and SSDs are significantly faster at sequential access than random access.
Partitioning: Allows a single topic to be distributed across many brokers, enabling linear horizontal scaling.
Pull-based Consumers: Consumers control the flow of data, preventing them from being overwhelmed.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling
Stateless/Stateful: Brokers are stateful (hold data). Scaling requires adding brokers and reassigning partitions.
Failure Domains: Partitions are replicated across different racks/Availability Zones.
API Schema Design
ProduceRequest: Topic, Partition, Key, Value, Timestamp.FetchRequest: Topic, Partition, Offset, MaxBytes.OffsetCommitRequest: GroupID, Topic, Partition, Offset.Resilience & Reliability
Retries: Producers retry on transient failures (e.g., leader election in progress).
In-Sync Replicas (ISR): Only replicas that are caught up with the leader are eligible to become leaders.
Storage
Access Pattern
High-volume sequential writes (Append) and high-volume sequential reads (Scan).
Log Design
Log Segments: A partition is split into segments (e.g., 1GB files) to make deletion easy.
Indexes: Sparse index mapping offsets to physical file positions to speed up searches.
Technical Selection
Disk Storage: Standard Linux filesystem (XFS/ext4). Use the OS Page Cache for hot data.
Distribution Logic
Sharding: Topic -> Partitions.
Replication: Leader handles all reads/writes; followers pull from the leader to stay in sync.
Wrap Up
Advanced Topics
Trade-offs (Consistency vs Availability): Kafka follows the PACELC theorem. In the event of a network partition (P), you can choose between Consistency (L:
acks=all, min.insync.replicas) and Availability (E: acks=1).Reliability & Failure Handling:
Zombies: Use epoch/fencing tokens in the metadata quorum to prevent old leaders from accepting writes.
Backpressure: Natural pull-based mechanism; if consumers are slow, data builds up on the broker disk, not in the broker's RAM.
Bottleneck Analysis:
Fan-out problem: If 100 consumers read the same topic, the network egress is 100x the ingress. Optimization: Use compression and potentially cache-friendly consumer offsets.
Security:
TLS for data in transit.
SASL/SCRAM for authentication.
ACLs for topic-level authorization.