The Question
Design: Real-Time Ad Click Aggregator
Design a high-throughput system capable of aggregating ad clicks and views in real-time. The system must join two distinct data streams—ad events and user action events—based on a shared identifier. It should support multi-dimensional aggregation (e.g., by geography and user profile) and provide low-latency results for analytical dashboards. Key considerations include handling out-of-order events, ensuring exactly-once processing for financial accuracy, and managing the state required for joining events that may occur far apart in time.
Apache Flink
Kafka
ClickHouse
RocksDB
S3
Protobuf
OLAP
CDC
Questions & Insights
Clarifying Questions
What is the expected scale of events? (Assumption: 100k events/sec for adEvents and 10k-20k events/sec for userEvents).
What is the acceptable latency for the aggregated results? (Assumption: Near real-time, < 10 seconds for the dashboard to reflect changes).
What is the typical time window between an ad being served and a user click? (Assumption: 99% of clicks happen within 30 minutes of the serve event).
Is "exactly-once" processing required for billing, or is "at-least-once" acceptable for analytics? (Assumption: High accuracy is required for billing, necessitating exactly-once semantics).
How many dimensions are we aggregating by? (Assumption: adId, region, userAction, and 3-5 profile metadata categories).
Thinking Process
The Join Bottleneck: Since events arrive on two separate streams, the system must perform a Stateful Stream-Stream Join using a common identifier (servingId).
State Management: How do we store adEvents that haven't been clicked yet without exhausting memory? We use a windowed state with a TTL.
Partitioning Strategy: To perform a join efficiently, both streams must be partitioned by the same key (servingId) across the messaging and processing layers.
Aggregation Sink: Where do we store the high-velocity results? We need a write-heavy OLAP store or a Key-Value store with atomic increments.
Bonus Points
Watermarking for Late Events: Implementing custom watermarking in the stream processor to handle out-of-order events coming from different mobile devices with clock drift.
Side Input Enrichment: If profile metadata is too large for the event stream, use a Broadcast State or Side Inputs to join against a slowly-changing dimension (SCD) user profile table.
Exactly-Once Semantics: Utilizing Kafka's idempotent producers and Flink’s two-phase commit (2PC) sink to ensure no double-counting of ad clicks.
Backpressure Handling: Configuring Flink's credit-based flow control to prevent the ingestion layer from overwhelming the join state during traffic spikes.
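The watermarking idea above can be illustrated with a minimal sketch of a bounded-out-of-orderness watermark: the watermark trails the largest event timestamp seen so far by a fixed allowance, and an event counts as late when its timestamp is at or behind the current watermark. This is plain Python rather than Flink's API; the class name, the 5-second allowance, and the sample timestamps are assumptions chosen for illustration.

```python
class BoundedOutOfOrdernessWatermark:
    """Tracks a watermark that lags the max observed event time by a fixed bound."""

    def __init__(self, max_out_of_orderness_ms: int):
        self.max_out_of_orderness_ms = max_out_of_orderness_ms
        self.max_timestamp_ms = None  # no events seen yet

    def current_watermark(self):
        """Return the watermark, or None before the first event."""
        if self.max_timestamp_ms is None:
            return None
        return self.max_timestamp_ms - self.max_out_of_orderness_ms

    def is_late(self, event_ts_ms: int) -> bool:
        """An event is late if it is at or behind the current watermark."""
        watermark = self.current_watermark()
        return watermark is not None and event_ts_ms <= watermark

    def on_event(self, event_ts_ms: int) -> bool:
        """Record an event; return True if it arrived late."""
        late = self.is_late(event_ts_ms)
        if self.max_timestamp_ms is None or event_ts_ms > self.max_timestamp_ms:
            self.max_timestamp_ms = event_ts_ms
        return late


# Hypothetical event timestamps (ms) arriving out of order from mobile devices.
tracker = BoundedOutOfOrdernessWatermark(max_out_of_orderness_ms=5_000)
lateness = [tracker.on_event(ts) for ts in [10_000, 12_000, 8_000, 20_000, 14_000]]
print(lateness)  # only the last event falls behind the watermark
```

A larger allowance tolerates more clock drift but delays window results by the same amount, so the bound is a latency versus completeness trade-off.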
Design Breakdown
Functional Requirements
Core Use Cases:
Ingest adEvent and userEvent streams.
Join events by servingId.
Aggregate counts by adId, region, userAction, and profileMetadata.
Provide a real-time view of these metrics.
Scope Control:
In-Scope: Stream ingestion, stateful join, real-time aggregation, and persistent storage of results.
Out-of-Scope: Ad serving logic, user profile management (CRUD), long-term cold storage (Data Lake) archiving.
Non-Functional Requirements
Scale: Must handle ~120k total QPS (writes) and scale horizontally.
Latency: End-to-end processing latency under 10 seconds.
Availability & Reliability: 99.99% availability; data must not be lost if a node fails.
Consistency: Strong consistency for click counts (exactly-once semantics).
Fault Tolerance: Automatic recovery from checkpointing in case of processor failure.
Security & Privacy: Masking PII in profile metadata before aggregation.
Estimation
Traffic Estimation:
Total Events: ~10 Billion events/day.
Average QPS: 115k/sec.
Peak QPS: 250k/sec (during major events/holidays).
Storage Estimation:
Raw event size: ~500 bytes.
Aggregated daily growth: ~10GB (assuming high cardinality of dimensions).
Stream state (30-min window): 120k eps * 1800s * 500 bytes = ~108 GB of active state in RAM/SSD.
Bandwidth Estimation:
Inbound: 120k * 500 bytes = 60 MB/s.
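The arithmetic behind these estimates can be checked with a short script. It uses only the figures stated above and assumes decimal units (1 GB = 10^9 bytes); the variable names are illustrative.

```python
SECONDS_PER_DAY = 24 * 60 * 60

events_per_day = 10_000_000_000            # ~10 billion events/day
avg_eps = events_per_day / SECONDS_PER_DAY

design_eps = 120_000                       # combined adEvents + userEvents rate
event_size_bytes = 500                     # raw event size
join_window_s = 30 * 60                    # 30-minute join window

# Upper bound: assumes every event in the window is held in state.
state_bytes = design_eps * join_window_s * event_size_bytes
inbound_bytes_per_s = design_eps * event_size_bytes

print(f"average rate: {avg_eps:,.0f} events/sec")            # 115,741 events/sec
print(f"join state:   {state_bytes / 1e9:.0f} GB")           # 108 GB
print(f"inbound:      {inbound_bytes_per_s / 1e6:.0f} MB/s")  # 60 MB/s
```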
Blueprint
Concise Summary: A streaming architecture where two Kafka topics are joined via a stateful Flink job using a 30-minute window, with the results aggregated and persisted into a high-performance OLAP database (ClickHouse) for real-time querying.
Major Components:
Ingestion Service: Stateless microservice to receive events via HTTP/gRPC and publish to Kafka.
Kafka: Messaging backbone to decouple ingestion from processing and ensure data persistence.
Flink (Stream Processor): Performs the windowed join, geo-profiling transformation, and windowed aggregation.
ClickHouse: Columnar OLAP database used to store and query aggregated metrics with sub-second latency.
Simplicity Audit: This design avoids complex lambda architectures by using a unified streaming engine (Flink) that handles both the join and the aggregation in one pipeline.
Architecture Decision Rationale:
Why this architecture?: Flink is widely used for stateful stream-stream joins. ClickHouse is chosen over a traditional RDBMS because it sustains the high insert volume of pre-aggregated rows much better.
Functional Requirement Satisfaction: The join on servingId allows us to combine adId with user metadata.
Non-functional Requirement Satisfaction: Kafka and Flink both provide horizontal scaling and fault tolerance via partitioning and checkpoints.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling:
The Ingestion API is a set of stateless Go/Java containers scaled based on CPU and Network I/O.
Distributed across 3 Availability Zones (AZs) for high availability.
API Schema Design:
POST /v1/events/ad (Protobuf): { serving_id: UUID, ad_id: String, timestamp: Long }
POST /v1/events/user (Protobuf): { serving_id: UUID, action: Enum, region: String, metadata: Map }
Idempotency:
serving_id acts as the unique key to prevent duplicate processing of the same ad impression.
Resilience & Reliability:
Local buffering in the Ingestion API to handle brief Kafka hiccups.
Timeouts set to 200ms to ensure low latency for ad-serving callers.
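Local buffering in the Ingestion API could look roughly like the sketch below: failed sends are parked in a bounded in-memory queue and retried in order before any newer event goes out. The `BufferedPublisher` class, the `send` callable (standing in for a Kafka producer call), and the use of `ConnectionError` to signal a broker hiccup are all assumptions for illustration, not part of any Kafka client API.

```python
from collections import deque
from typing import Callable, Deque


class BufferedPublisher:
    """Publishes events, holding failed sends in a bounded in-memory buffer."""

    def __init__(self, send: Callable[[bytes], None], capacity: int = 10_000):
        self._send = send                      # hypothetical "write one event to Kafka"
        self._buffer: Deque[bytes] = deque()
        self._capacity = capacity
        self.dropped = 0                       # events rejected because the buffer was full

    def flush(self) -> bool:
        """Resend buffered events in order. Returns True once the buffer is empty."""
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                return False
            self._buffer.popleft()
        return True

    def publish(self, event: bytes) -> bool:
        """Send or buffer an event. Returns False only if the event was dropped."""
        # Drain older events first so ordering is preserved.
        if self.flush():
            try:
                self._send(event)
                return True
            except ConnectionError:
                pass
        if len(self._buffer) >= self._capacity:
            self.dropped += 1
            return False
        self._buffer.append(event)
        return True

    def pending(self) -> int:
        return len(self._buffer)
```

Because the buffer lives in process memory, events held there are lost if the container crashes, so the capacity should cover only brief outages.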
Storage
Access Pattern:
Heavy inserts from Flink (upserts or appends).
Low-latency analytical queries from the Dashboard.
Database Table Design (ClickHouse):
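Table: ad_aggregates
Fields: timestamp (DateTime), ad_id (String), region (String), user_action (String), profile_segment (String), count (AggregateFunction).
Engine: SummingMergeTree (automatically merges counts for rows with the same sorting key).
Primary Key: (ad_id, timestamp, region). The sorting key (ORDER BY) must also include user_action and profile_segment; otherwise rows that differ only in those dimensions are collapsed during merges.
Technical Selection: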
ClickHouse: Specifically designed for real-time OLAP. It can ingest millions of rows per second when inserts are batched and allows very fast aggregation queries across dimensions like geo and profile.
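To make the SummingMergeTree behaviour concrete, the sketch below simulates its merge semantics in plain Python: rows that share the same sorting-key values are collapsed into one row and their numeric columns are summed. This is not ClickHouse code; the `summing_merge` helper and the sample rows are assumptions, while the column names follow the table design above.

```python
from collections import defaultdict


def summing_merge(rows, sorting_key, sum_columns):
    """Collapse rows sharing the same sorting-key values, summing the numeric columns."""
    merged = defaultdict(lambda: {c: 0 for c in sum_columns})
    for row in rows:
        key = tuple(row[k] for k in sorting_key)
        for c in sum_columns:
            merged[key][c] += row[c]
    return [dict(zip(sorting_key, key), **sums) for key, sums in sorted(merged.items())]


SORTING_KEY = ("ad_id", "timestamp", "region", "user_action", "profile_segment")

# Two Flink flushes for the same minute and dimensions, plus one row for another region.
rows = [
    {"timestamp": "2024-01-01 00:00:00", "ad_id": "ad-1", "region": "US",
     "user_action": "click", "profile_segment": "18-24", "count": 40},
    {"timestamp": "2024-01-01 00:00:00", "ad_id": "ad-1", "region": "US",
     "user_action": "click", "profile_segment": "18-24", "count": 2},
    {"timestamp": "2024-01-01 00:00:00", "ad_id": "ad-1", "region": "DE",
     "user_action": "click", "profile_segment": "18-24", "count": 7},
]

merged = summing_merge(rows, SORTING_KEY, ("count",))
for row in merged:
    print(row["region"], row["count"])
```

In ClickHouse itself the merge runs in the background at an unspecified time, so dashboard queries should still aggregate with sum() and GROUP BY rather than assume rows are already collapsed.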
Distribution Logic:
Sharded by ad_id to ensure queries for a specific ad are localized, though ClickHouse also handles distributed queries well.
Messaging
Purpose & Decoupling: Kafka provides a 7-day retention buffer, allowing for reprocessing if the Flink job fails or needs a logic update.
Throughput & Partitioning:
Topics: ad_events, user_events.
Partition Key: serving_id. This is critical: both events for the same serving_id must reach the same Flink subtask to be joined. Keying both topics by serving_id preserves per-key ordering within each topic, and Flink's keyBy(serving_id) then routes both events to the same parallel instance.
Technical Selection: Kafka. It supports the high throughput and persistent storage needed for replayability.
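Key-based routing can be sketched as a hash of serving_id modulo the partition count. When both topics use the same partitioner and the same number of partitions, the ad event and the user event for one impression map to the same partition index; with different partition counts they often do not. The `partition_for` helper is hypothetical, and CRC32 is used here only as a stand-in for the murmur2 hash in Kafka's default partitioner.

```python
import uuid
import zlib


def partition_for(serving_id: str, num_partitions: int) -> int:
    """Map a key to a partition index (CRC32 stands in for Kafka's murmur2 hash)."""
    return zlib.crc32(serving_id.encode("utf-8")) % num_partitions


AD_EVENTS_PARTITIONS = 64     # assumed partition count for ad_events
USER_EVENTS_PARTITIONS = 64   # must match for the topics to be co-partitioned

serving_id = str(uuid.uuid4())
ad_partition = partition_for(serving_id, AD_EVENTS_PARTITIONS)
user_partition = partition_for(serving_id, USER_EVENTS_PARTITIONS)
print(ad_partition == user_partition)  # True: same key, same hash, same partition count

# With mismatched partition counts the indices diverge for many keys.
keys = [str(uuid.uuid4()) for _ in range(1000)]
mismatches = sum(partition_for(k, 64) != partition_for(k, 48) for k in keys)
print(f"{mismatches} of {len(keys)} keys land on different partition indices")
```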
Data Processing
Processing Model: Stream-Stream Join.
Processing DAG:
Source: Read from ad_events and user_events.
KeyBy: Group by serving_id.
Interval Join: adEvent JOIN userEvent WHERE userEvent.ts BETWEEN adEvent.ts AND adEvent.ts + 30 minutes.
Map: Transform joined record into an aggregation-ready format (e.g., flattening profile metadata).
Windowed Aggregate: 1-minute tumbling windows to reduce the number of writes to ClickHouse.
Sink: Write to ClickHouse using the JDBC or ClickHouse-specific sink.
Correctness Guarantees: Exactly-once via Flink Checkpointing and idempotent writes to ClickHouse.
Technical Selection: Apache Flink. Chosen for its superior handling of late-arriving data and built-in state management for large joins.
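The join and aggregation stages of the DAG above can be sketched in batch form as follows. This is a simplified Python illustration of the logic, not Flink code: it holds all events in memory, ignores watermarks and checkpointing, and assumes each user event carries its own event-time `ts` field. The dictionary field names and sample events are assumptions.

```python
from collections import Counter, defaultdict

JOIN_WINDOW_MS = 30 * 60 * 1000   # 30-minute join interval
TUMBLE_MS = 60 * 1000             # 1-minute tumbling aggregation window


def interval_join(ad_events, user_events):
    """Pair each user event with the ad event sharing its serving_id,
    provided ad.ts <= user.ts <= ad.ts + JOIN_WINDOW_MS."""
    ads_by_serving_id = defaultdict(list)
    for ad in ad_events:
        ads_by_serving_id[ad["serving_id"]].append(ad)

    joined = []
    for user in user_events:
        for ad in ads_by_serving_id.get(user["serving_id"], []):
            if ad["ts"] <= user["ts"] <= ad["ts"] + JOIN_WINDOW_MS:
                joined.append({
                    "ad_id": ad["ad_id"],
                    "region": user["region"],
                    "action": user["action"],
                    "ts": user["ts"],   # window assignment uses the user event time
                })
    return joined


def tumbling_counts(joined):
    """Count joined records per (window_start, ad_id, region, action)."""
    counts = Counter()
    for rec in joined:
        window_start = rec["ts"] - rec["ts"] % TUMBLE_MS
        counts[(window_start, rec["ad_id"], rec["region"], rec["action"])] += 1
    return counts


ads = [
    {"serving_id": "s1", "ad_id": "ad-1", "ts": 0},
    {"serving_id": "s2", "ad_id": "ad-1", "ts": 10_000},
    {"serving_id": "s3", "ad_id": "ad-2", "ts": 20_000},
]
users = [
    {"serving_id": "s1", "action": "click", "region": "US", "ts": 30_000},
    {"serving_id": "s2", "action": "click", "region": "US", "ts": 45_000},
    # Arrives 1 ms after the 30-minute interval closes, so it does not join.
    {"serving_id": "s3", "action": "click", "region": "DE", "ts": 20_000 + JOIN_WINDOW_MS + 1},
    # No matching ad event.
    {"serving_id": "s9", "action": "click", "region": "US", "ts": 50_000},
]

counts = tumbling_counts(interval_join(ads, users))
print(dict(counts))  # {(0, 'ad-1', 'US', 'click'): 2}
```

Pre-aggregating into one row per window and dimension combination is what keeps the write rate to ClickHouse far below the raw event rate.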
Infrastructure (Optional)
Observability:
Prometheus/Grafana: Monitoring Kafka lag and Flink checkpoint durations.
Distributed Tracing: Jaeger to trace a serving_id from the Ingestion API through Kafka to Flink.
Wrap Up
Advanced Topics
Trade-offs: We trade memory for accuracy. Keeping a 30-minute window of adEvents in Flink state requires significant RAM/SSD, but it ensures we don't miss clicks that happen minutes after the view.
Reliability: If Flink crashes, it resumes from the last successful checkpoint in S3, re-consuming from the Kafka offsets saved in that checkpoint to ensure no data loss.
Bottleneck Analysis: The main bottleneck will be the State Size. If we have millions of "zombie" adEvents (ads served but never clicked), the state grows.
Optimization: Use a TTL (Time-to-Live) on the Flink state to automatically evict adEvents after 30 minutes.
Security: Data is encrypted at rest in Kafka and S3. ClickHouse access is restricted via RBAC.
Alternative: Spark Streaming. However, its micro-batching model typically has higher join latency than Flink's continuous processing model.
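The TTL optimization described under Bottleneck Analysis can be sketched as a keyed store that drops unmatched ad events once the watermark has passed their join deadline. This is a plain-Python illustration rather than Flink's state API; the `AdEventState` class and its method names are hypothetical.

```python
class AdEventState:
    """Keyed state for unmatched ad events with time-based eviction."""

    def __init__(self, ttl_ms: int):
        self.ttl_ms = ttl_ms
        self._events = {}  # serving_id -> (ad_id, event_ts_ms)

    def put(self, serving_id: str, ad_id: str, event_ts_ms: int) -> None:
        self._events[serving_id] = (ad_id, event_ts_ms)

    def match(self, serving_id: str, user_ts_ms: int):
        """Return the ad_id if the user event falls inside the TTL window, else None."""
        entry = self._events.get(serving_id)
        if entry is None:
            return None
        ad_id, ad_ts = entry
        if ad_ts <= user_ts_ms <= ad_ts + self.ttl_ms:
            return ad_id
        return None

    def evict_expired(self, watermark_ms: int) -> int:
        """Drop ad events that can no longer be joined; return how many were removed."""
        expired = [sid for sid, (_, ts) in self._events.items()
                   if ts + self.ttl_ms < watermark_ms]
        for sid in expired:
            del self._events[sid]
        return len(expired)

    def __len__(self) -> int:
        return len(self._events)


THIRTY_MINUTES_MS = 30 * 60 * 1000
state = AdEventState(ttl_ms=THIRTY_MINUTES_MS)
state.put("s1", "ad-1", 0)
state.put("s2", "ad-2", 1_000_000)

# Once the watermark passes s1's deadline, the "zombie" entry is removed.
removed = state.evict_expired(watermark_ms=THIRTY_MINUTES_MS + 1)
print(removed, len(state))  # 1 1
```

Eviction bounds the state at roughly one join window of ad events, which is the figure used in the storage estimate.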