Kafka

A distributed, partitioned, and replicated commit log service that provides the functionality of a messaging system, with a design centered on high throughput, durability, and stream processing.

Cheat Sheet

Prime Use Case

Use Kafka when you need to decouple high-volume data producers from consumers, require message persistence for replayability, or need to build real-time streaming pipelines.

Critical Tradeoffs

  • High throughput vs. Low end-to-end latency
  • Durability and persistence vs. Operational complexity
  • Strict ordering (within partition) vs. Global scalability

Killer Senior Insight

Kafka is not a 'message queue' in the traditional sense; it is a distributed append-only log. This fundamental shift from 'push-and-delete' (RabbitMQ) to 'append-and-persist' lets multiple consumers read the same data at their own pace, which is what makes event-driven architectures possible.
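
A minimal sketch of that difference, assuming a local broker at localhost:9092 and a topic named events (both illustrative): because the broker does not delete records on delivery, two consumers with different group.id values each receive the full stream.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class IndependentReaders {
    // Build a consumer for the given group; each group tracks its own offsets,
    // so "analytics" and "billing" below both see every record in the topic.
    static KafkaConsumer<String, String> consumerFor(String groupId) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", groupId);
        props.put("auto.offset.reset", "earliest");          // start from the beginning of the log
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());
        return new KafkaConsumer<>(props);
    }

    public static void main(String[] args) {
        try (KafkaConsumer<String, String> analytics = consumerFor("analytics");
             KafkaConsumer<String, String> billing = consumerFor("billing")) {
            analytics.subscribe(List.of("events"));           // hypothetical topic name
            billing.subscribe(List.of("events"));
            for (ConsumerRecord<String, String> r : analytics.poll(Duration.ofSeconds(1))) {
                System.out.printf("analytics saw offset %d: %s%n", r.offset(), r.value());
            }
            for (ConsumerRecord<String, String> r : billing.poll(Duration.ofSeconds(1))) {
                System.out.printf("billing saw offset %d: %s%n", r.offset(), r.value());
            }
        }
    }
}
```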

Recognition

Common Interview Phrases

We need to handle millions of events per second.
Multiple downstream systems need the same data stream.
We need to be able to 'rewind' and re-process data if a bug is found.
The system must handle massive spikes in traffic without crashing producers.

Common Scenarios

  • Real-time clickstream analysis
  • Log aggregation and monitoring pipelines
  • Change Data Capture (CDC) for database synchronization
  • Event Sourcing and CQRS architectures

Anti-patterns to Avoid

  • Using Kafka as a primary long-term database for complex queries.
  • Implementing a simple task queue where messages must be deleted immediately after processing.
  • Scenarios requiring sub-millisecond end-to-end latency (where a memory-only bus or direct RPC is better).

The Problem

The Fundamental Issue

The 'N-to-M' integration mess where every producer must know about every consumer, leading to fragile, tightly coupled systems that cannot scale independently.

What breaks without it

Producers crash because consumers are too slow (lack of backpressure/buffering).

Data is lost during consumer downtime because there is no persistent buffer.

Adding a new consumer requires modifying the producer logic.

Why alternatives fail

Traditional message brokers (e.g., RabbitMQ) track per-message delivery state on the broker and delete messages once acknowledged, which bottlenecks at high volumes and rules out replay.

Direct API calls (REST/gRPC) lack durability and fail if the receiver is offline.

Database polling introduces massive latency and puts unnecessary load on the storage layer.

Mental Model

The Intuition

Imagine a giant, infinite scroll of paper (the Log). Producers write at the very bottom. Consumers are readers who each have their own physical bookmark (the Offset). Multiple readers can read the same scroll at different speeds, and if a reader gets tired, they can come back later and start exactly where their bookmark was left.
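
The 'bookmark' maps directly to the consumer offset API. A minimal sketch, assuming a topic named clicks with a partition 0 (illustrative names): the consumer can move its bookmark back and re-read old records, then persist the new position.

```java
import java.time.Duration;
import java.util.List;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;

public class BookmarkDemo {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");     // assumed broker address
        props.put("group.id", "reader-1");
        props.put("enable.auto.commit", "false");              // we manage the bookmark ourselves
        props.put("key.deserializer", StringDeserializer.class.getName());
        props.put("value.deserializer", StringDeserializer.class.getName());

        TopicPartition tp = new TopicPartition("clicks", 0);   // hypothetical topic/partition
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.assign(List.of(tp));
            consumer.seek(tp, 0L);                             // move the bookmark back to the start
            for (ConsumerRecord<String, String> r : consumer.poll(Duration.ofSeconds(1))) {
                System.out.printf("re-reading offset %d: %s%n", r.offset(), r.value());
            }
            consumer.commitSync();                             // persist the new bookmark position
        }
    }
}
```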

Key Mechanics

  1. Partitioning: Dividing a topic into multiple logs so records can be processed in parallel across different brokers (see the producer sketch after this list).
  2. Consumer Groups: A mechanism to divide the work of reading a topic among multiple instances for horizontal scaling.
  3. In-Sync Replicas (ISR): The set of replicas that are fully caught up with the leader, ensuring no data loss on failover.
  4. Zero-Copy Optimization: Using the sendfile system call to move data from disk directly to the network card buffer, bypassing user-space overhead.
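
A sketch of mechanics 1 and 2, assuming a topic named user-events (illustrative): the default partitioner hashes the record key, so all events for one key land in the same partition and keep their order; reads then scale by adding consumers that share a group.id.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class KeyedProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");      // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The default partitioner hashes the key, so every event for "user-42"
            // goes to the same partition and is therefore read back in order.
            for (int i = 0; i < 3; i++) {
                producer.send(new ProducerRecord<>("user-events", "user-42", "event-" + i));
            }
            producer.flush();
        }
        // Scaling consumption is then just running N copies of a consumer with the
        // same group.id; the group coordinator assigns each copy a share of partitions.
    }
}
```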

Framework

When it's the best choice

  • When you have multiple disparate consumers for the same data.
  • When you need to retain data for days or weeks for potential re-processing.
  • When the system must scale to terabytes of data per day.

When to avoid

  • Small-scale applications where the overhead of managing Zookeeper/KRaft and brokers outweighs the benefits.
  • When you need complex routing logic (e.g., header-based routing) which is better handled by RabbitMQ.
  • When you need 'exactly-once' semantics across external non-transactional systems (Kafka's EOS only applies within Kafka).
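
To make the 'exactly-once only within Kafka' point concrete, here is a rough sketch of the read-process-write pattern with the transactional producer API, assuming input and output topics named orders and orders-enriched (hypothetical). The consumed offsets and the produced records commit atomically, but only because both sides are Kafka topics; a write to an external system inside this loop would not be covered by the transaction.

```java
import java.time.Duration;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.consumer.OffsetAndMetadata;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class ExactlyOnceRelay {
    public static void main(String[] args) {
        Properties cProps = new Properties();
        cProps.put("bootstrap.servers", "localhost:9092");     // assumed broker address
        cProps.put("group.id", "relay");
        cProps.put("isolation.level", "read_committed");        // only read committed transactional data
        cProps.put("enable.auto.commit", "false");
        cProps.put("key.deserializer", StringDeserializer.class.getName());
        cProps.put("value.deserializer", StringDeserializer.class.getName());

        Properties pProps = new Properties();
        pProps.put("bootstrap.servers", "localhost:9092");
        pProps.put("transactional.id", "relay-tx-1");            // enables the transactional producer
        pProps.put("key.serializer", StringSerializer.class.getName());
        pProps.put("value.serializer", StringSerializer.class.getName());

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(cProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(pProps)) {
            consumer.subscribe(List.of("orders"));               // hypothetical input topic
            producer.initTransactions();

            ConsumerRecords<String, String> batch = consumer.poll(Duration.ofSeconds(1));
            Map<TopicPartition, OffsetAndMetadata> offsets = new HashMap<>();

            producer.beginTransaction();
            for (ConsumerRecord<String, String> r : batch) {
                producer.send(new ProducerRecord<>("orders-enriched", r.key(), r.value()));
                offsets.put(new TopicPartition(r.topic(), r.partition()),
                            new OffsetAndMetadata(r.offset() + 1));
            }
            // The consumed offsets commit atomically with the produced records:
            // either both are visible after the transaction, or neither is.
            producer.sendOffsetsToTransaction(offsets, consumer.groupMetadata());
            producer.commitTransaction();
        }
    }
}
```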

Fast Heuristics

If 'High Throughput + Replayability' then Kafka.
If 'Complex Routing + Low Latency + Small Volume' then RabbitMQ.
If 'Cloud Native + Managed + Low Ops' then Kinesis or Pub/Sub.

Tradeoffs

Strengths

  • Massive scalability via partitioning.
  • High durability through disk persistence and replication.
  • Decoupling of producers and consumers (time and space).
  • Excellent ecosystem (Kafka Connect, Kafka Streams).

Weaknesses

  • Significant operational complexity (tuning, monitoring, rebalancing).
  • No built-in global ordering (only per-partition).
  • Latency is higher than in-memory queues due to disk I/O and replication.

Alternatives

RabbitMQ

When it wins

When you need complex message routing (Exchange types) and don't need to replay messages.

Key Difference

RabbitMQ is a 'smart broker, dumb consumer' model; Kafka is a 'dumb broker, smart consumer' model.

Amazon Kinesis

When it wins

When you want a fully managed service on AWS and don't want to manage infrastructure.

Key Difference

Kinesis has stricter limits on shard scaling and retention compared to a self-managed Kafka cluster.

Apache Pulsar

When it wins

When you need multi-tenancy and a clear separation between storage and serving layers.

Key Difference

Pulsar uses a tiered architecture (BookKeeper for storage) which makes scaling storage and compute independent.

Execution

Must-hit talking points

  • Mention 'Page Cache' and 'Sequential I/O' as the reasons for Kafka's speed.
  • Discuss the 'Consumer Group Rebalance' protocol and its impact on availability.
  • Explain the 'Controller' node's role in partition leadership and metadata management.
  • Differentiate between 'At-least-once', 'At-most-once', and 'Exactly-once' delivery semantics.
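
One way to anchor the delivery-semantics point in concrete settings is via the producer configs below; this is a sketch of the commonly cited knobs only (bootstrap.servers, serializers, and the consumer side are omitted), not a complete recipe.

```java
import java.util.Properties;

public class DeliverySemanticsConfigs {
    // At-most-once leaning: fire and forget, no retries; a lost ack means a lost message.
    static Properties atMostOnceProducer() {
        Properties p = new Properties();
        p.put("acks", "0");
        p.put("retries", "0");
        return p;
    }

    // At-least-once: wait for the in-sync replicas and retry on failure;
    // duplicates are possible, so consumers must be idempotent.
    static Properties atLeastOnceProducer() {
        Properties p = new Properties();
        p.put("acks", "all");
        p.put("retries", String.valueOf(Integer.MAX_VALUE));
        return p;
    }

    // Exactly-once within Kafka: idempotent producer plus a transactional.id,
    // paired with read_committed consumers (see the transactional sketch earlier).
    static Properties exactlyOnceProducer() {
        Properties p = new Properties();
        p.put("acks", "all");
        p.put("enable.idempotence", "true");
        p.put("transactional.id", "my-app-tx-1");   // hypothetical id
        return p;
    }
}
```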

Anticipate follow-ups

  • Q: How do you handle a 'hot partition' problem? (see the key-salting sketch after this list)
  • Q: What happens if the replication factor is 3 and 2 brokers go down?
  • Q: How does Kafka's KRaft mode replace the need for Zookeeper?
  • Q: Explain the 'Log Compaction' feature and its use cases.
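
For the hot-partition follow-up, one commonly cited mitigation is key salting: spread a single hot key across several partitions by appending a bounded suffix, at the cost of strict per-key ordering. A rough sketch; the topic name, key, and salt count are illustrative.

```java
import java.util.Properties;
import java.util.concurrent.ThreadLocalRandom;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class SaltedKeyProducer {
    private static final int SALT_BUCKETS = 10;   // spreads one hot key over up to 10 partitions

    public static void send(KafkaProducer<String, String> producer, String hotKey, String payload) {
        // "celebrity-user" becomes "celebrity-user#0".."celebrity-user#9"; consumers must
        // strip the suffix, and ordering now only holds per salted sub-key.
        String saltedKey = hotKey + "#" + ThreadLocalRandom.current().nextInt(SALT_BUCKETS);
        producer.send(new ProducerRecord<>("clicks", saltedKey, payload));   // hypothetical topic
    }

    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer", StringSerializer.class.getName());
        props.put("value.serializer", StringSerializer.class.getName());
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            send(producer, "celebrity-user", "profile-view");
        }
    }
}
```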

Red Flags

Setting too few partitions for a topic.

Why it fails: Partitions are the unit of parallelism. If you have 10 consumers but only 2 partitions, 8 consumers will sit idle, bottlenecking your system.

Ignoring the 'unclean.leader.election.enable' setting.

Why it fails: If set to true, the system might elect a leader that isn't in the ISR, leading to data loss for the sake of availability.

Using a large message size (e.g., > 1MB) without tuning.

Why it fails: Kafka is optimized for small to medium messages; very large messages can cause memory pressure and slow down the replication protocol.
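
All three red flags come down to choices made at topic-creation time. Below is a sketch of creating a topic with an explicit partition count, replication factor, and the safety-related configs mentioned above; the topic name and the numbers are illustrative, not recommendations.

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.concurrent.ExecutionException;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.NewTopic;

public class CreateTopicSafely {
    public static void main(String[] args) throws ExecutionException, InterruptedException {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address

        try (Admin admin = Admin.create(props)) {
            // 12 partitions leaves headroom for up to 12 consumers in one group;
            // replication factor 3 keeps copies on three brokers.
            NewTopic topic = new NewTopic("orders", 12, (short) 3)    // hypothetical topic
                .configs(Map.of(
                    "min.insync.replicas", "2",                  // with acks=all, reject writes rather than risk loss
                    "unclean.leader.election.enable", "false",   // never elect an out-of-sync leader
                    "max.message.bytes", "1048576"));            // keep the default ~1 MB message cap
            admin.createTopics(List.of(topic)).all().get();
        }
    }
}
```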