The Question

Scalable Ads Click Aggregation System

Design a high-throughput system to aggregate ad clicks for real-time advertiser dashboards. The system must handle a peak of 1 million clicks per second, provide sub-minute latency for reports, and ensure exactly-once processing for accurate billing. Address challenges such as late-arriving data, hot-key distributions for viral ads, and high-availability storage for time-series aggregates.

Kafka

Apache Flink

Cassandra

Redis

NoSQL

API Gateway

Time-series Storage

Kappa Architecture

Questions & Insights

Clarifying Questions

What is the expected scale (QPS)? Assumption: The system should handle an average of 100,000 clicks per second, with peaks of up to 1,000,000 clicks per second.

What is the required data freshness? Assumption: Advertisers should see aggregated results within 10-30 seconds (near real-time).

What are the windowing requirements? Assumption: Support for 1-minute, 5-minute, and 1-hour fixed (tumbling) windows.

How critical is accuracy? Assumption: Extremely critical for billing purposes. We must implement exactly-once processing semantics to avoid over-counting or under-counting clicks.

What is the retention period for raw and aggregated data? Assumption: Raw data for 7 days (for auditing); aggregated data for 2 years.

Thinking Process

Core Bottleneck: Ingesting a massive volume of click events while ensuring they are not lost or double-counted during the aggregation phase.

Progressive Logic:

How do we ingest clicks without blocking the user? (Asynchronous ingestion via a Message Queue).

How do we handle massive scale and deduplication? (Distributed Stream Processing with stateful windowing).

How do we store time-series aggregates for efficient querying? (NoSQL or OLAP storage optimized for time-range queries).

How do we ensure exactly-once semantics? (Idempotent keys and transactional offsets).

Bonus Points

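Watermarking for Late Events: Use a bounded out-of-orderness watermark (e.g., a 5-second slack) so that clicks arriving slightly out of order due to network jitter still land in the correct window. Events that arrive later than the slack need an explicit policy, such as allowed lateness or a side output that feeds a correction job.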

Two-Stage Aggregation: Reducing "hot-key" issues (e.g., a viral Super Bowl ad) by performing local in-memory pre-aggregation at the ingestion layer or within the stream processor before sending to the global aggregator.

Lambda/Kappa Trade-off: Justifying a Kappa architecture for simplicity in MVP, while acknowledging that a batch layer (Lambda) could be added later for deep historical auditing and correction.

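Exactly-Once with Kafka-Flink-Cassandra: Detail how Kafka offsets are stored in Flink checkpoints so the source replays from a consistent position after a failure, and how the sink stays consistent. A two-phase commit sink fits transactional targets such as Kafka; for Cassandra, end-to-end consistency is achieved through idempotent upserts keyed by ad and window.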
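The two-stage aggregation idea from the Bonus Points can be illustrated with a small, framework-free sketch. It is not Flink code: the shard count, function names, and sample data are assumptions made for illustration. Stage one counts clicks per (ad_id, shard) so that a hot ad is spread over several keys, and stage two sums the shards back into one total per ad.

```python
import random
from collections import Counter

NUM_SHARDS = 8  # hypothetical number of sub-keys per ad


def stage_one(clicks, num_shards=NUM_SHARDS, rng=None):
    """Pre-aggregate clicks per (ad_id, shard) so a hot ad spreads over several keys."""
    rng = rng or random.Random()
    partial = Counter()
    for ad_id in clicks:
        shard = rng.randrange(num_shards)  # random salt spreads the hot key
        partial[(ad_id, shard)] += 1
    return partial


def stage_two(partial):
    """Sum the per-shard partial counts back into one total per ad_id."""
    totals = Counter()
    for (ad_id, _shard), count in partial.items():
        totals[ad_id] += count
    return totals


# Illustrative data: one viral ad and one quiet ad.
clicks = ["viral_ad"] * 10_000 + ["quiet_ad"] * 10
partial = stage_one(clicks, rng=random.Random(42))
totals = stage_two(partial)
print(totals["viral_ad"], totals["quiet_ad"])
```

The second stage only sees at most NUM_SHARDS partial records per ad and window, so its load no longer depends on how viral an ad is.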

Design Breakdown

Functional Requirements

Core Use Cases:

Capture ad click events from mobile/web clients.

Aggregate clicks per Ad ID and Campaign ID over specific time windows.

Provide an API for internal dashboards to query aggregated click counts.

Scope Control:

In-Scope: Click ingestion, real-time aggregation, and storage of aggregates.

Out-of-Scope: Ad serving logic, fraud detection (though we will provide hooks for it), and billing payment processing.

Non-Functional Requirements

Scale: Support 1M peak QPS.

Latency: End-to-end latency < 30 seconds for aggregations.

Availability & Reliability: 99.99% availability; data must not be lost once acknowledged by the ingestion API.

Consistency: Exactly-once processing for billing accuracy.

Fault Tolerance: Automatic recovery from node failures in the stream processing cluster without data duplication.

Security & Privacy: TLS for all data in transit; PII scrubbing for click metadata.

Estimation

Traffic Estimation:

Average QPS: 100k clicks/sec.

Peak QPS: 1M clicks/sec.

Storage Estimation:

Aggregated Record: ad_id (16 bytes) + timestamp (8 bytes) + count (8 bytes) + metadata (~68 bytes) = ~100 bytes.

If we have 10k active ads and 1-minute windows: 10k * 1,440 mins/day = 14.4M records/day.

14.4M * 100 bytes = 1.44 GB/day for aggregated data.

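Raw data: 1M peak QPS * 100 bytes = 100 MB/sec at peak. Sustained for a full day, that is 100 MB/sec * 86,400 sec = ~8.6 TB/day, which is an upper bound; at the 100k average QPS the figure is ~0.86 TB/day, or roughly 6 TB over the 7-day retention (requires retention management).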

Bandwidth Estimation:

Inbound: 1M clicks/sec * 500 bytes (raw request) = 500 MB/sec.

Outbound (API Query): Negligible compared to ingestion.
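The estimates above can be checked with a few lines of arithmetic. The sketch below uses only the figures stated in this section; the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400
MINUTES_PER_DAY = 1_440

avg_qps = 100_000
peak_qps = 1_000_000
record_bytes = 100     # size assumed for both aggregated and raw click records
request_bytes = 500    # size of a raw inbound request
active_ads = 10_000

# Aggregated data: one record per ad per 1-minute window.
agg_records_per_day = active_ads * MINUTES_PER_DAY
agg_bytes_per_day = agg_records_per_day * record_bytes

# Raw data: peak rate, plus daily volume at sustained peak and at average load.
raw_bytes_per_sec_peak = peak_qps * record_bytes
raw_bytes_per_day_at_peak = raw_bytes_per_sec_peak * SECONDS_PER_DAY
raw_bytes_per_day_at_avg = avg_qps * record_bytes * SECONDS_PER_DAY

# Inbound bandwidth at peak.
inbound_bytes_per_sec_peak = peak_qps * request_bytes

print(f"aggregated records/day: {agg_records_per_day:,}")
print(f"aggregated GB/day:      {agg_bytes_per_day / 1e9:.2f}")
print(f"raw MB/sec at peak:     {raw_bytes_per_sec_peak / 1e6:.0f}")
print(f"raw TB/day at peak:     {raw_bytes_per_day_at_peak / 1e12:.2f}")
print(f"raw TB/day at average:  {raw_bytes_per_day_at_avg / 1e12:.2f}")
print(f"inbound MB/sec at peak: {inbound_bytes_per_sec_peak / 1e6:.0f}")
```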

Blueprint

Concise Summary: A Kappa-architecture pipeline where clicks are ingested via an API Gateway into Kafka, processed in real-time using Apache Flink for windowed aggregation, and stored in Cassandra for time-series querying.

Major Components:

Ingestion Service: Stateless microservice to validate and produce click events to Kafka.

Kafka: Distributed message bus acting as a buffer and source of truth for the stream processor.

Apache Flink: Stateful stream processing engine for windowing and deduplication.

Redis: Distributed cache for short-lived click deduplication (an idempotency check on request_id) before events are produced to Kafka.

Cassandra: Wide-column store optimized for high-write volume and time-series aggregation queries.

Simplicity Audit: This design avoids complex batch-processing pipelines (MapReduce) by using a single stream processing path (Flink) which is sufficient for 10-30 second latency requirements.

Architecture Decision Rationale:

Why this architecture?: Kafka, Flink, and Cassandra form a widely used, well-understood stack for high-throughput time-series aggregation with exactly-once state handling.

Functional Requirement Satisfaction: Meets near real-time aggregation and query requirements via windowed processing and optimized storage.

Non-functional Requirement Satisfaction: Flink provides horizontal scaling and exactly-once state management; Kafka ensures no data loss during spikes.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Global DNS routing (e.g., Amazon Route 53) directs each client to the nearest regional load balancer to minimize click latency.

Security & Perimeter:

API Gateway: Handles TLS termination and basic request validation.

Rate Limiting: Applied at the API Gateway level to mitigate abusive traffic such as botnet floods.

Service

Topology & Scaling:

Ingestion Service: Stateless, deployed across multiple Availability Zones (AZs). Autoscales on CPU utilization and incoming request rate.

Query Service: Read-only service scaling based on QPS for dashboards.

API Schema Design:

Endpoint: POST /v1/clicks

Payload: { "ad_id": "string", "timestamp": "long", "user_id": "string", "request_id": "uuid" }

Idempotency: The client-generated request_id lets the server detect and drop duplicate submissions of the same click (e.g., retries) within a short window.

Resilience & Reliability:

Retries: Client-side retries with exponential backoff if the Ingestion Service is unavailable.

Circuit Breakers: Implemented between the Ingestion Service and Kafka to prevent cascading failures.
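A minimal sketch of the client-side retry behavior described above, assuming a capped exponential backoff with full jitter. The function names, the base and cap values, and the use of ConnectionError to signal unavailability are illustrative assumptions, not part of the design. The payload is resent unchanged, so the same request_id reaches the server on every attempt.

```python
import random
import time


def backoff_ceiling(attempt, base=0.1, cap=5.0):
    """Upper bound in seconds for the wait before retry number `attempt` (0-based)."""
    return min(cap, base * (2 ** attempt))


def send_with_retries(send, payload, max_retries=5, sleep=time.sleep, rng=None):
    """Call send(payload); on ConnectionError wait with jittered backoff and retry.

    The payload, including its request_id, is resent unchanged so that
    server-side deduplication can recognize the retry.
    """
    rng = rng or random.Random()
    for attempt in range(max_retries + 1):
        try:
            return send(payload)
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the error to the caller
            sleep(rng.uniform(0, backoff_ceiling(attempt)))


# Demo with a hypothetical sender that fails twice, then succeeds.
attempts = []
waits = []


def flaky_send(payload):
    attempts.append(payload["request_id"])
    if len(attempts) < 3:
        raise ConnectionError("ingestion service unavailable")
    return "accepted"


# waits.append stands in for time.sleep so the demo does not actually pause.
result = send_with_retries(
    flaky_send, {"ad_id": "ad_1", "request_id": "r-123"}, sleep=waits.append
)
print(result, len(attempts), len(waits))
```

Jitter is included because many clients retrying on the same schedule would otherwise hit the recovering service at the same instants.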

Storage

Access Pattern: High-frequency writes from Flink (aggregated results); low-frequency, high-volume reads (range scans) from the Query Service.

Database Table Design:

Table: aggregated_clicks

Partition Key: ad_id, window_type (e.g., 1m, 1h)

Clustering Key: window_start_time (Descending)

Columns: click_count, unique_user_count (optional).

Technical Selection: Cassandra.

Rationale: Excellent for time-series data where writes are dominant. Horizontal scaling allows for seamless growth as ad volume increases.

Distribution Logic: Partitioning by ad_id colocates all data for a specific ad, which keeps dashboard queries fast. If a single ad_id becomes too "hot", a synthetic shard key is added to the partition key to spread its rows across nodes.
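The table design above can be modeled with an in-memory stand-in to show why writes keyed by (ad_id, window_type, window_start_time) are safe to replay. This is a sketch, not a Cassandra client: the class and method names are hypothetical, and a Python dict plays the role of the table.

```python
class AggregatedClicksTable:
    """In-memory stand-in for the aggregated_clicks table (not a Cassandra client)."""

    def __init__(self):
        self._rows = {}  # (ad_id, window_type, window_start) -> click_count

    def upsert(self, ad_id, window_type, window_start, click_count):
        # Same primary key: the row is overwritten, never duplicated or summed.
        self._rows[(ad_id, window_type, window_start)] = click_count

    def range_query(self, ad_id, window_type, start, end):
        """Return (window_start, click_count) rows in [start, end), newest first."""
        rows = [
            (ws, count)
            for (a, w, ws), count in self._rows.items()
            if a == ad_id and w == window_type and start <= ws < end
        ]
        return sorted(rows, reverse=True)  # mirrors the descending clustering order


table = AggregatedClicksTable()
table.upsert("ad_1", "1m", 0, 5)
table.upsert("ad_1", "1m", 60, 7)
table.upsert("ad_1", "1m", 60, 7)  # replay of the same window after a restart
rows = table.range_query("ad_1", "1m", 0, 120)
print(rows)
```

Replaying a window leaves the stored count unchanged, which is the property the stream processor relies on when it re-emits results after recovery.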

Cache

Purpose & Justification: Deduplication. The cache ensures that the same request_id is not counted twice within a 1-minute window (e.g., double clicks or client retries).

Key-Value Schema:

Key: click_dedup:{request_id}

Value: 1

TTL: 60 seconds.

Technical Selection: Redis.

Rationale: In-memory speed is required for the high-frequency deduplication check before writing to Kafka.
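The deduplication check can be sketched with an in-memory stand-in for the cache. The class name and the injectable clock are assumptions made so the example runs without a Redis server; against Redis itself, the same check-and-set is commonly done atomically with a single command of the form SET click_dedup:{request_id} 1 NX EX 60.

```python
import time


class DedupCache:
    """In-memory stand-in for the Redis dedup keys (click_dedup:{request_id}, TTL 60 s)."""

    def __init__(self, ttl_seconds=60, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._expiry = {}  # key -> expiry time; expired keys are not evicted in this sketch

    def first_seen(self, request_id):
        """Return True if request_id was not seen within the TTL, and record it."""
        now = self._clock()
        key = f"click_dedup:{request_id}"
        expires_at = self._expiry.get(key)
        if expires_at is not None and expires_at > now:
            return False  # duplicate within the TTL window
        self._expiry[key] = now + self._ttl
        return True


# Demo with a fake clock so the TTL can be exercised without waiting.
fake_now = [0.0]
cache = DedupCache(ttl_seconds=60, clock=lambda: fake_now[0])
results = [cache.first_seen("r-1")]        # first delivery
results.append(cache.first_seen("r-1"))    # immediate retry
fake_now[0] = 61.0
results.append(cache.first_seen("r-1"))    # same id after the TTL has expired
print(results)
```

The last call shows the limit of a TTL-based check: a duplicate arriving after the key expires is accepted again, so the TTL has to cover the longest expected retry delay.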

Messaging

Purpose & Decoupling: Kafka decouples the high-velocity ingestion from the complex stream processing. It provides a persistent buffer during Flink maintenance or outages.

Event / Topic Schema:

Topic: raw_clicks

Partition Key: ad_id (Ensures ordering per ad for the aggregator).

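Throughput & Partitioning: 100 partitions per topic, which allows up to 100 Flink source subtasks to read in parallel (consumers beyond the partition count would sit idle). At the 1M clicks/sec peak this is about 10,000 events/sec per partition, assuming keys are evenly distributed.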

Technical Selection: Kafka.

Rationale: High throughput, low latency, and strong durability.
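To illustrate what keying the topic by ad_id implies, the sketch below maps keys to partitions with a stable hash. The CRC32 hash is only a stand-in chosen for this example (Kafka's own default partitioner uses a different hash function), and the event mix is made up.

```python
import zlib
from collections import Counter

NUM_PARTITIONS = 100


def partition_for(ad_id, num_partitions=NUM_PARTITIONS):
    """Map a key to a partition with a stable hash (stand-in for Kafka's partitioner)."""
    return zlib.crc32(ad_id.encode("utf-8")) % num_partitions


# Illustrative traffic: one hot ad plus 1,000 ads with a single click each.
events = ["ad_hot"] * 9_000 + [f"ad_{i}" for i in range(1_000)]
load = Counter(partition_for(ad_id) for ad_id in events)
hot_partition = partition_for("ad_hot")

print(f"partition of hot ad: {hot_partition}")
print(f"events on that partition: {load[hot_partition]} of {len(events)}")
```

Every event for a given ad lands on the same partition, which preserves per-ad ordering but also means a single viral ad can dominate one partition.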

Data Processing

Processing Model: Streaming (Apache Flink).

Processing DAG:

Source: Read from Kafka raw_clicks.

Map: Parse JSON and extract timestamp.

Watermark: Assign watermarks based on timestamp with a 5-second delay.

Window: keyBy(ad_id).window(TumblingEventTimeWindows.of(Time.minutes(1))).

Aggregate: Sum clicks per ad.

Sink: Upsert results to Cassandra.

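Correctness Guarantees: Flink's periodic checkpoints persist operator state (partial aggregates) together with the Kafka source offsets to durable storage (e.g., S3). After a failure, the job restores the latest checkpoint and replays from the stored offsets, which gives exactly-once state semantics. Savepoints are manually triggered snapshots of the same kind, used for upgrades and rescaling rather than automatic recovery.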

Technical Selection: Apache Flink.

Rationale: Native support for event-time processing and complex windowing.
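The interaction between the 1-minute tumbling window and the 5-second watermark delay can be traced with a simplified, single-process model. This is not Flink code and does not reproduce Flink's exact internals: the function name and event list are assumptions, the watermark is taken as the largest timestamp seen minus the delay, and a window fires once the watermark reaches its end.

```python
from collections import defaultdict

WINDOW_MS = 60_000            # 1-minute tumbling window
MAX_OUT_OF_ORDER_MS = 5_000   # watermark delay


def aggregate(events):
    """Process (ad_id, event_time_ms) pairs in arrival order.

    Returns (emitted, dropped, still_open):
      emitted    maps (ad_id, window_start) -> count for windows closed by the watermark
      dropped    lists events that arrived after their window had closed
      still_open maps (ad_id, window_start) -> count for windows not yet fired
    """
    open_windows = defaultdict(int)
    emitted, dropped = {}, []
    max_ts = float("-inf")
    watermark = float("-inf")
    for ad_id, ts in events:
        window_start = ts - ts % WINDOW_MS
        if window_start + WINDOW_MS <= watermark:
            dropped.append((ad_id, ts))  # too late: its window has already fired
        else:
            open_windows[(ad_id, window_start)] += 1
        max_ts = max(max_ts, ts)
        watermark = max_ts - MAX_OUT_OF_ORDER_MS
        # Fire every open window whose end is at or before the watermark.
        for key in [k for k in open_windows if k[1] + WINDOW_MS <= watermark]:
            emitted[key] = open_windows.pop(key)
    return emitted, dropped, dict(open_windows)


events = [
    ("ad_1", 10_000),  # window [0, 60000)
    ("ad_1", 59_000),  # window [0, 60000)
    ("ad_1", 61_000),  # next window; watermark is now 56000
    ("ad_1", 58_000),  # out of order but within the slack: still counted
    ("ad_1", 66_000),  # watermark reaches 61000, so the first window fires
    ("ad_1", 59_500),  # arrives after its window fired: dropped
]
emitted, dropped, still_open = aggregate(events)
print(emitted, dropped, still_open)
```

In this model the dropped event is simply discarded. In Flink, late events can instead be kept through allowed lateness on the window or routed to a side output for later correction.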

Infrastructure (Optional)

Observability:

Metrics: Prometheus tracking "Consumer Lag" in Kafka (crucial indicator of system health).

Logging: Structured logs for click events to allow for post-hoc auditing.

Wrap Up

Advanced Topics

Trade-offs (CAP/PACELC): At the ingestion layer we favor availability over consistency (clicks must always be accepted, even during a network partition), while the processing layer (Flink state) enforces strong consistency to ensure billing accuracy.

Reliability: Exactly-once results at the Cassandra sink rely on idempotent writes: the primary key (ad_id, window_type, window_start_time) uniquely identifies each aggregate. If Flink restarts and re-processes a window, it overwrites the same row in Cassandra with the same value, provided the aggregation is deterministic.

Hot Spot Optimization: For viral ads, we can implement sub-keying. Instead of grouping by ad_id, we group by ad_id + random_shard_id, aggregate, and then run a second-stage aggregation to sum the shards.

Security: Request signatures (HMAC) from the ad-client to the Ingestion Service to prevent spoofed clicks.
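A minimal sketch of the request-signature check, assuming HMAC-SHA256 over a canonical JSON encoding of the click payload. The canonicalization, the shared secret, and the function names are illustrative assumptions; the design above only states that HMAC signatures are used.

```python
import hashlib
import hmac
import json


def sign(payload, secret):
    """Return a hex HMAC-SHA256 over a canonical JSON encoding of the payload."""
    body = json.dumps(payload, sort_keys=True, separators=(",", ":")).encode("utf-8")
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify(payload, signature, secret):
    """Compare the expected and supplied signatures in constant time."""
    return hmac.compare_digest(sign(payload, secret), signature)


secret = b"demo-secret"  # hypothetical key shared between the ad client and the service
payload = {
    "ad_id": "ad_1",
    "timestamp": 1700000000000,
    "user_id": "u_1",
    "request_id": "r-123",
}
signature = sign(payload, secret)

ok = verify(payload, signature, secret)
tampered = dict(payload, ad_id="ad_2")     # attacker changes the ad being credited
bad = verify(tampered, signature, secret)
print(ok, bad)
```

Because the signature covers request_id and timestamp, a spoofed click cannot be forged without the key, although a captured request could still be replayed unless the deduplication and timestamp checks reject it.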