The Question

Real-Time Ad Click Aggregator

Design a high-throughput system capable of aggregating ad clicks and views in real-time. The system must join two distinct data streams—ad events and user action events—based on a shared identifier. It should support multi-dimensional aggregation (e.g., by geography and user profile) and provide low-latency results for analytical dashboards. Key considerations include handling out-of-order events, ensuring exactly-once processing for financial accuracy, and managing the state required for joining events that may occur far apart in time.

Apache Flink

Kafka

ClickHouse

RocksDB

Protobuf

OLAP

CDC

Questions & Insights

Clarifying Questions

What is the expected scale of events? (Assumption: 100k events/sec for adEvents and 10k-20k events/sec for userEvents).

What is the acceptable latency for the aggregated results? (Assumption: Near real-time, < 10 seconds for the dashboard to reflect changes).

What is the typical time window between an ad being served and a user click? (Assumption: 99% of clicks happen within 30 minutes of the serve event).

Is "exactly-once" processing required for billing, or is "at-least-once" acceptable for analytics? (Assumption: High accuracy is required for billing, necessitating exactly-once semantics).

How many dimensions are we aggregating by? (Assumption: adId, region, userAction, and 3-5 profile metadata categories).

Thinking Process

The Join Bottleneck: Since events arrive on two separate streams, the system must perform a Stateful Stream-Stream Join using a common identifier (servingId).

State Management: How do we store adEvents that haven't been clicked yet without exhausting memory? We use a windowed state with a TTL.

Partitioning Strategy: To perform a join efficiently, both streams must be partitioned by the same key (servingId) across the messaging and processing layers.

Aggregation Sink: Where do we store the high-velocity results? We need a write-heavy OLAP store or a Key-Value store with atomic increments.

Bonus Points

Watermarking for Late Events: Implementing custom watermarking in the stream processor to handle out-of-order events coming from different mobile devices with clock drift.

Side Input Enrichment: If profile metadata is too large for the event stream, use a Broadcast State or Side Inputs to join against a slowly-changing dimension (SCD) user profile table.

Exactly-Once Semantics: Utilizing Kafka's idempotent producers and Flink’s two-phase commit (2PC) sink to ensure no double-counting of ad clicks.

Backpressure Handling: Configuring Flink's credit-based flow control to prevent the ingestion layer from overwhelming the join state during traffic spikes.
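To make the watermarking point above concrete, here is a minimal plain-Python sketch of a bounded-out-of-orderness watermark. It is not Flink's API: the class name WatermarkTracker, the 5-second tolerance, and the sample timestamps are assumptions chosen for illustration. The watermark trails the largest event time seen so far by a fixed delay, and an event whose timestamp is at or below the watermark is treated as late.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class WatermarkTracker:
    """Tracks event-time progress while tolerating a fixed amount of disorder."""
    max_delay_ms: int                  # assumed tolerance for out-of-order arrival
    max_seen_ts: Optional[int] = None  # largest event timestamp observed so far

    def watermark(self) -> Optional[int]:
        """Return the current watermark, or None before the first event."""
        if self.max_seen_ts is None:
            return None
        return self.max_seen_ts - self.max_delay_ms

    def observe(self, event_ts_ms: int) -> bool:
        """Record an event; return True if on time, False if late."""
        wm = self.watermark()
        is_late = wm is not None and event_ts_ms <= wm
        if self.max_seen_ts is None or event_ts_ms > self.max_seen_ts:
            self.max_seen_ts = event_ts_ms
        return not is_late


tracker = WatermarkTracker(max_delay_ms=5_000)
arrivals = [10_000, 12_000, 9_000, 20_000, 14_000]  # event times in arrival order
on_time_flags = [tracker.observe(ts) for ts in arrivals]
print(on_time_flags, tracker.watermark())
```

The event at 9,000 ms arrives out of order but still inside the tolerance, so it is accepted. The event at 14,000 ms arrives after the watermark has advanced to 15,000 ms and is flagged as late. A larger delay accepts more stragglers at the cost of holding results back longer.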

Design Breakdown

Functional Requirements

Core Use Cases:

Ingest adEvent and userEvent streams.

Join events by servingId.

Aggregate counts by adId, region, userAction, and profileMetadata.

Provide a real-time view of these metrics.

Scope Control:

In-Scope: Stream ingestion, stateful join, real-time aggregation, and persistent storage of results.

Out-of-Scope: Ad serving logic, user profile management (CRUD), long-term cold storage (Data Lake) archiving.

Non-Functional Requirements

Scale: Must handle ~120k total QPS (writes) and scale horizontally.

Latency: End-to-end processing latency under 10 seconds.

Availability & Reliability: 99.99% availability; data must not be lost if a node fails.

Consistency: Strong consistency for click counts (exactly-once semantics).

Fault Tolerance: Automatic recovery from checkpointing in case of processor failure.

Security & Privacy: Masking PII in profile metadata before aggregation.

Estimation

Traffic Estimation:

Total Events: ~10 Billion events/day.

Average QPS: 115k/sec.

Peak QPS: 250k/sec (during major events/holidays).

Storage Estimation:

Raw event size: ~500 bytes.

Aggregated daily growth: ~10GB (assuming high cardinality of dimensions).

Stream state (30-min window): 120k eps * 1800s * 500 bytes = ~108 GB of active state in RAM/SSD.

Bandwidth Estimation:

Inbound: 120k * 500 bytes = 60 MB/s.
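The estimates above can be re-derived with a few lines of arithmetic. The sketch below is only a sanity check. The constant names are illustrative, decimal units are assumed (1 GB = 10**9 bytes), and 120k events/sec is the rounded planning figure used in this section.

```python
# Back-of-envelope check of the traffic, state, and bandwidth estimates.
EVENTS_PER_DAY = 10_000_000_000  # ~10 billion events/day
SECONDS_PER_DAY = 24 * 60 * 60
PLANNING_EPS = 120_000           # rounded events/sec used for sizing
EVENT_BYTES = 500                # raw event size
JOIN_WINDOW_S = 30 * 60          # 30-minute join window

average_qps = EVENTS_PER_DAY / SECONDS_PER_DAY
state_bytes = PLANNING_EPS * JOIN_WINDOW_S * EVENT_BYTES
inbound_bytes_per_s = PLANNING_EPS * EVENT_BYTES

print(f"average QPS: {average_qps:,.0f}")
print(f"join state : {state_bytes / 10**9:.0f} GB")
print(f"inbound    : {inbound_bytes_per_s / 10**6:.0f} MB/s")
```

The state figure treats every event as retained at its full raw size for the whole window, so it is best read as an upper bound.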

Blueprint

Concise Summary: A streaming architecture where two Kafka topics are joined via a stateful Flink job using a 30-minute window, with the results aggregated and persisted into a high-performance OLAP database (ClickHouse) for real-time querying.

Major Components:

Ingestion Service: Stateless microservice to receive events via HTTP/gRPC and publish to Kafka.

Kafka: Messaging backbone to decouple ingestion from processing and ensure data persistence.

Flink (Stream Processor): Performs the windowed join, geo-profiling transformation, and windowed aggregation.

ClickHouse: Columnar OLAP database used to store and query aggregated metrics with sub-second latency.

Simplicity Audit: This design avoids complex lambda architectures by using a unified streaming engine (Flink) that handles both the join and the aggregation in one pipeline.

Architecture Decision Rationale:

Why this architecture?: Flink is a widely used engine for stateful stream-stream joins. ClickHouse is chosen over a traditional RDBMS because it handles the high insert volume of aggregated rows much better.

Functional Requirement Satisfaction: The join on servingId allows us to combine adId with user metadata.

Non-functional Requirement Satisfaction: Kafka and Flink both provide horizontal scaling and fault tolerance via partitioning and checkpoints.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:

The Ingestion API is a set of stateless Go/Java containers scaled based on CPU and Network I/O.

Distributed across 3 Availability Zones (AZs) for high availability.

API Schema Design:

POST /v1/events/ad (Protobuf): { serving_id: UUID, ad_id: String, timestamp: Long }

POST /v1/events/user (Protobuf): { serving_id: UUID, action: Enum, region: String, metadata: Map, timestamp: Long }

Idempotency: serving_id acts as the unique key to prevent duplicate processing of the same ad impression.

Resilience & Reliability:

Local buffering in the Ingestion API to handle brief Kafka hiccups.

Timeouts set to 200ms to ensure low latency for ad-serving callers.

Storage

Access Pattern:

Heavy inserts from Flink (upserts or appends).

Low-latency analytical queries from the Dashboard.

Database Table Design (ClickHouse):

Table: ad_aggregates

Fields: timestamp (DateTime), ad_id (String), region (String), user_action (String), profile_segment (String), count (UInt64).

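Engine: SummingMergeTree (sums counts for rows with the same sorting key during background merges; because unmerged parts can coexist, queries should still use sum(count) with GROUP BY).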

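Sorting Key (ORDER BY): (ad_id, timestamp, region, user_action, profile_segment). SummingMergeTree sums rows that share the full sorting key, so every dimension column must appear in it. The primary key can remain the prefix (ad_id, timestamp, region).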

Technical Selection:

ClickHouse: Specifically designed for real-time OLAP. It can ingest millions of rows per second when inserts are batched (it is not built for many small single-row inserts), and allows for very fast aggregation queries across dimensions like geo and profile.

Distribution Logic:

Sharded by ad_id to ensure queries for a specific ad are localized, though ClickHouse also handles distributed queries well.
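To illustrate what summing rows by key means for this table, here is a pure-Python sketch. It does not talk to ClickHouse. The function name merge_by_sorting_key and the sample rows are hypothetical. It also shows why any dimension that must be reported separately has to be part of the key: rows that differ only in a column outside the key are collapsed into one total.

```python
from collections import defaultdict


def merge_by_sorting_key(rows, key_columns):
    """Sum the 'count' column of rows that share the same values in key_columns."""
    totals = defaultdict(int)
    for row in rows:
        key = tuple(row[col] for col in key_columns)
        totals[key] += row["count"]
    return dict(totals)


TS = "2024-01-01 00:00:00"  # sample window start, chosen arbitrarily
rows = [
    {"ad_id": "ad-1", "timestamp": TS, "region": "EU", "user_action": "click", "count": 3},
    {"ad_id": "ad-1", "timestamp": TS, "region": "EU", "user_action": "click", "count": 2},
    {"ad_id": "ad-1", "timestamp": TS, "region": "EU", "user_action": "view", "count": 7},
]

# Key without user_action: clicks and views collapse into a single total.
narrow = merge_by_sorting_key(rows, ("ad_id", "timestamp", "region"))
# Key with user_action: clicks and views stay separate.
full = merge_by_sorting_key(rows, ("ad_id", "timestamp", "region", "user_action"))
print(narrow)
print(full)
```

With the narrower key the table can no longer answer "how many clicks versus views", which is why the choice of key columns matters as much as the engine.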

Messaging

Purpose & Decoupling: Kafka provides a 7-day retention buffer, allowing for reprocessing if the Flink job fails or needs a logic update.

Throughput & Partitioning:

Topics: ad_events, user_events.

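Partition Key: serving_id. Both topics should be keyed by serving_id and use the same partition count, so events for a given serving_id stay ordered and co-located. Flink's KeyBy on serving_id then routes both sides of the join to the same parallel subtask, which is what the join actually requires.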

Technical Selection: Kafka. It supports the high throughput and persistent storage needed for replayability.
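Key-based partitioning can be sketched as a stable hash of the key modulo the partition count. The example below is an illustration under assumptions, not Kafka's partitioner: Kafka's Java client hashes the key bytes with murmur2, while CRC32 is used here as a stand-in. The partition count of 12 and the sample serving_id are hypothetical.

```python
import zlib

NUM_PARTITIONS = 12  # hypothetical; ad_events and user_events would share this count


def partition_for(serving_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a serving_id to a partition using a stable hash (CRC32 as a stand-in)."""
    return zlib.crc32(serving_id.encode("utf-8")) % num_partitions


serving_id = "3f2b9c1e-0000-4000-8000-000000000001"
ad_partition = partition_for(serving_id)    # partition chosen for the ad event
user_partition = partition_for(serving_id)  # partition chosen for the user event
print(ad_partition == user_partition)
```

Because the mapping depends only on the key and the partition count, the two topics agree on a partition number only while their counts match. Changing the count on one topic breaks that alignment for existing keys.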

Data Processing

Processing Model: Stream-Stream Join.

Processing DAG:

Source: Read from ad_events and user_events.

KeyBy: Group by serving_id.

Interval Join: adEvent JOIN userEvent WHERE userEvent.ts BETWEEN adEvent.ts AND adEvent.ts + 30 minutes.

Map: Transform joined record into an aggregation-ready format (e.g., flattening profile metadata).

Windowed Aggregate: 1-minute tumbling windows to reduce the number of writes to ClickHouse.

Sink: Write to ClickHouse using the JDBC or ClickHouse-specific sink.

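Correctness Guarantees: Exactly-once via Flink Checkpointing and idempotent writes to ClickHouse. Because SummingMergeTree adds replayed rows to the existing totals, a batch re-sent after recovery must be recognizable as a duplicate (for example, by making each window's insert deterministic so that it can be deduplicated).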

Technical Selection: Apache Flink. Chosen for its superior handling of late-arriving data and built-in state management for large joins.
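The join and aggregation steps of the DAG can be sketched in plain Python over small in-memory lists. This is a batch simulation under assumptions, not Flink code: the tuple types AdEvent and UserEvent, the function names, and the sample events are hypothetical, and timestamps are in seconds.

```python
from collections import Counter, namedtuple

AdEvent = namedtuple("AdEvent", "serving_id ad_id ts")
UserEvent = namedtuple("UserEvent", "serving_id action region ts")

JOIN_WINDOW_S = 30 * 60  # user event must fall within 30 minutes after the ad
TUMBLE_S = 60            # 1-minute tumbling aggregation windows


def interval_join(ad_events, user_events):
    """Yield (ad, user) pairs that share a serving_id within the join interval."""
    ads_by_id = {}
    for ad in ad_events:
        ads_by_id.setdefault(ad.serving_id, []).append(ad)
    for user in user_events:
        for ad in ads_by_id.get(user.serving_id, []):
            if ad.ts <= user.ts <= ad.ts + JOIN_WINDOW_S:
                yield ad, user


def aggregate(joined):
    """Count joined records per (window_start, ad_id, region, action)."""
    counts = Counter()
    for ad, user in joined:
        window_start = user.ts - (user.ts % TUMBLE_S)
        counts[(window_start, ad.ad_id, user.region, user.action)] += 1
    return counts


ads = [AdEvent("s1", "ad-1", 0), AdEvent("s2", "ad-1", 10), AdEvent("s3", "ad-2", 20)]
users = [
    UserEvent("s1", "click", "EU", 65),    # joins, lands in window [60, 120)
    UserEvent("s2", "click", "EU", 100),   # joins, lands in window [60, 120)
    UserEvent("s3", "click", "US", 5000),  # 4980 s after the ad: outside the interval
    UserEvent("s9", "click", "US", 30),    # no ad event with this serving_id
]
result = aggregate(interval_join(ads, users))
print(dict(result))
```

Two raw joined records become a single aggregated row for the window starting at 60 s, which is the write reduction the tumbling window is meant to provide. A streaming engine does the same work incrementally, keeping only the events still inside the join interval.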

Infrastructure (Optional)

Observability:

Prometheus/Grafana: Monitoring Kafka lag and Flink checkpoint durations.

Distributed Tracing: Jaeger to trace a serving_id from the Ingestion API through Kafka to Flink.

Wrap Up

Advanced Topics

Trade-offs: We trade memory for accuracy. Keeping a 30-minute window of adEvents in Flink state requires significant RAM/SSD, but it ensures we don't miss clicks that happen minutes after the view.

Reliability: If Flink crashes, it resumes from the last successful checkpoint in S3, re-consuming from the Kafka offsets saved in that checkpoint to ensure no data loss.
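The recovery behaviour can be illustrated with a toy consumer that saves its read offset and its counts together. This is a simplified model under assumptions, not Flink's checkpointing protocol: the function run, the list standing in for a Kafka partition, and the crash_at parameter are hypothetical.

```python
def run(log, checkpoint_every, crash_at=None, checkpoint=None):
    """Consume log from the checkpointed offset.

    Returns (counts, checkpoint); counts is None if the run crashed.
    """
    offset, saved_counts = checkpoint if checkpoint else (0, {})
    counts = dict(saved_counts)
    last_checkpoint = (offset, dict(counts))
    while offset < len(log):
        if crash_at is not None and offset == crash_at:
            return None, last_checkpoint  # progress since the checkpoint is lost
        ad_id = log[offset]
        counts[ad_id] = counts.get(ad_id, 0) + 1
        offset += 1
        if offset % checkpoint_every == 0:
            # Offset and state are saved together, so they always agree.
            last_checkpoint = (offset, dict(counts))
    return counts, (offset, dict(counts))


log = ["ad-1", "ad-2", "ad-1", "ad-1", "ad-2", "ad-1", "ad-2"]
clean, _ = run(log, checkpoint_every=3)
crashed, saved = run(log, checkpoint_every=3, crash_at=5)
recovered, _ = run(log, checkpoint_every=3, checkpoint=saved)
print(clean == recovered, saved)
```

The two events processed after the checkpoint but before the crash are read again on recovery, yet they are not double-counted, because the restored counts predate them. This covers internal state only. Rows already written to an external store between the checkpoint and the crash need separate handling at the sink.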

Bottleneck Analysis: The main bottleneck will be the State Size. If we have millions of "zombie" adEvents (ads served but never clicked), the state grows.

Optimization: Use a TTL (Time-to-Live) on the Flink state to automatically evict adEvents after 30 minutes.
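The eviction behaviour can be sketched with a small keyed store that timestamps each entry. This is plain Python written for illustration, not Flink's state TTL mechanism. The class TtlState and its methods are hypothetical, and time is passed in explicitly (in seconds) to keep the example deterministic.

```python
class TtlState:
    """Keyed state whose entries expire ttl_s seconds after insertion."""

    def __init__(self, ttl_s):
        self.ttl_s = ttl_s
        self._entries = {}  # serving_id -> (ad_event, inserted_at)

    def put(self, serving_id, ad_event, now):
        self._entries[serving_id] = (ad_event, now)

    def get(self, serving_id, now):
        """Return the stored ad event, or None if it is absent or expired."""
        entry = self._entries.get(serving_id)
        if entry is None:
            return None
        value, inserted_at = entry
        if now - inserted_at >= self.ttl_s:
            del self._entries[serving_id]  # lazy eviction on access
            return None
        return value

    def evict_expired(self, now):
        """Drop all expired entries; return how many were removed."""
        expired = [k for k, (_, t) in self._entries.items() if now - t >= self.ttl_s]
        for key in expired:
            del self._entries[key]
        return len(expired)

    def __len__(self):
        return len(self._entries)


state = TtlState(ttl_s=30 * 60)
state.put("s1", {"ad_id": "ad-1"}, now=0)
state.put("s2", {"ad_id": "ad-2"}, now=1000)
hit = state.get("s1", now=1799)           # still inside the 30-minute TTL
evicted = state.evict_expired(now=1800)   # s1 has reached its TTL, s2 has not
print(hit, evicted, len(state))
```

With this policy the state holds at most 30 minutes of unmatched ad events, so its size is bounded by the serve rate rather than by how many ads are never clicked.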

Security: Data is encrypted at rest in Kafka and S3. ClickHouse access is restricted via RBAC.

Alternative: Spark Streaming could replace Flink, but its micro-batching model has higher latency for joins compared to Flink's continuous processing model.