The Question
DesignScalable Ads Click Aggregation System
Design a high-throughput system to aggregate ad clicks for real-time advertiser dashboards. The system must handle a peak of 1 million clicks per second, provide sub-minute latency for reports, and ensure exactly-once processing for accurate billing. Address challenges such as late-arriving data, hot-key distributions for viral ads, and high-availability storage for time-series aggregates.
Kafka
Apache Flink
Cassandra
Redis
NoSQL
API Gateway
Time-series Storage
Kappa Architecture
Questions & Insights
Clarifying Questions
What is the expected scale (QPS)? Assumption: The system should handle an average of 100,000 clicks per second, with peaks of up to 1,000,000 clicks per second.
What is the required data freshness? Assumption: Advertisers should see aggregated results within 10-30 seconds (near real-time).
What are the windowing requirements? Assumption: Support for 1-minute, 5-minute, and 1-hour fixed (tumbling) windows.
How critical is accuracy? Assumption: Extremely critical for billing purposes. We must implement exactly-once processing semantics to avoid over-counting or under-counting clicks.
What is the retention period for raw and aggregated data? Assumption: Raw data for 7 days (for auditing); aggregated data for 2 years.
Thinking Process
Core Bottleneck: Ingesting a massive volume of click events while ensuring they are not lost or double-counted during the aggregation phase.
Progressive Logic:
How do we ingest clicks without blocking the user? (Asynchronous ingestion via a Message Queue).
How do we handle massive scale and deduplication? (Distributed Stream Processing with stateful windowing).
How do we store time-series aggregates for efficient querying? (NoSQL or OLAP storage optimized for time-range queries).
How do we ensure exactly-once semantics? (Idempotent keys and transactional offsets).
Bonus Points
Watermarking for Late Events: Implementing a slack period (e.g., 5 seconds) to handle clicks that arrive slightly out of order due to network jitter, ensuring window completeness.
Two-Stage Aggregation: Reducing "hot-key" issues (e.g., a viral Super Bowl ad) by performing local in-memory pre-aggregation at the ingestion layer or within the stream processor before sending to the global aggregator.
Lambda/Kappa Trade-off: Justifying a Kappa architecture for simplicity in MVP, while acknowledging that a batch layer (Lambda) could be added later for deep historical auditing and correction.
Exactly-Once with Kafka-Flink-Cassandra: Detail the use of Kafka transaction offsets and Flink's two-phase commit sink to achieve end-to-end consistency.
Design Breakdown
Functional Requirements
Core Use Cases:
Capture ad click events from mobile/web clients.
Aggregate clicks per Ad ID and Campaign ID over specific time windows.
Provide an API for internal dashboards to query aggregated click counts.
Scope Control:
In-Scope: Click ingestion, real-time aggregation, and storage of aggregates.
Out-of-Scope: Ad serving logic, fraud detection (though we will provide hooks for it), and billing payment processing.
Non-Functional Requirements
Scale: Support 1M peak QPS.
Latency: End-to-end latency < 30 seconds for aggregations.
Availability & Reliability: 99.99% availability; data must not be lost once acknowledged by the ingestion API.
Consistency: Exactly-once processing for billing accuracy.
Fault Tolerance: Automatic recovery from node failures in the stream processing cluster without data duplication.
Security & Privacy: TLS for all data in transit; PII scrubbing for click metadata.
Estimation
Traffic Estimation:
Average QPS: 100k clicks/sec.
Peak QPS: 1M clicks/sec.
Storage Estimation:
Aggregated Record:
ad_id (16 bytes) + timestamp (8 bytes) + count (8 bytes) + metadata (~68 bytes) = ~100 bytes.If we have 10k active ads and 1-minute windows: 10k * 1,440 mins/day = 14.4M records/day.
14.4M * 100 bytes = 1.44 GB/day for aggregated data.
Raw data: 1M peak QPS * 100 bytes = 100 MB/sec peak = 8.6 TB/day (requires retention management).
Bandwidth Estimation:
Inbound: 1M clicks/sec * 500 bytes (raw request) = 500 MB/sec.
Outbound (API Query): Negligible compared to ingestion.
Blueprint
Concise Summary: A Kappa-architecture pipeline where clicks are ingested via an API Gateway into Kafka, processed in real-time using Apache Flink for windowed aggregation, and stored in Cassandra for time-series querying.
Major Components:
Ingestion Service: Stateless microservice to validate and produce click events to Kafka.
Kafka: Distributed message bus acting as a buffer and source of truth for the stream processor.
Apache Flink: Stateful stream processing engine for windowing and deduplication.
Redis: Distributed cache for temporary click deduplication (idempotency check) before ingestion.
Cassandra: Wide-column store optimized for high-write volume and time-series aggregation queries.
Simplicity Audit: This design avoids complex batch-processing pipelines (MapReduce) by using a single stream processing path (Flink) which is sufficient for 10-30 second latency requirements.
Architecture Decision Rationale:
Why this architecture?: Kafka/Flink/Cassandra is the industry standard for high-throughput, exactly-once time-series aggregation.
Functional Requirement Satisfaction: Meets near real-time aggregation and query requirements via windowed processing and optimized storage.
Non-functional Requirement Satisfaction: Flink provides horizontal scaling and exactly-once state management; Kafka ensures no data loss during spikes.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: Global DNS (Route53) redirects traffic to the nearest regional Load Balancer to minimize click latency.
Security & Perimeter:
API Gateway: Handles TLS termination and basic request validation.
Rate Limiting: Applied at the API Gateway level to prevent DDoS attacks from botnets.
Service
Topology & Scaling:
Ingestion Service: Stateless, deployed in multiple Availability Zones (AZs). Scales based on CPU and Incoming Request count.
Query Service: Read-only service scaling based on QPS for dashboards.
API Schema Design:
Endpoint:
POST /v1/clicksPayload:
{ "ad_id": "string", "timestamp": "long", "user_id": "string", "request_id": "uuid" }Idempotency:
request_id is used to prevent duplicate clicks within a short window.Resilience & Reliability:
Retries: Client-side retries with exponential backoff if the Ingestion Service is unavailable.
Circuit Breakers: Implemented between the Ingestion Service and Kafka to prevent cascading failures.
Storage
Access Pattern: High-frequency writes from Flink (aggregated results); low-frequency, high-volume reads (range scans) from the Query Service.
Database Table Design:
Table:
aggregated_clicksPartition Key:
ad_id, window_type (e.g., 1m, 1h)Clustering Key:
window_start_time (Descending)Columns:
click_count, unique_user_count (optional).Technical Selection: Cassandra.
Rationale: Excellent for time-series data where writes are dominant. Horizontal scaling allows for seamless growth as ad volume increases.
Distribution Logic: Partitioning by
ad_id ensures all data for a specific ad is colocated for fast dashboard queries, but we use a synthetic shard key if one ad_id becomes too "hot".Cache
Purpose & Justification: Used for Deduplication. To ensure we don't count the same
request_id twice within a 1-minute window (e.g., double clicks).Key-Value Schema:
Key:
click_dedup:{request_id}Value:
1TTL: 60 seconds.
Technical Selection: Redis.
Rationale: In-memory speed is required for the high-frequency deduplication check before writing to Kafka.
Messaging
Purpose & Decoupling: Kafka decouples the high-velocity ingestion from the complex stream processing. It provides a persistent buffer during Flink maintenance or outages.
Event / Topic Schema:
Topic:
raw_clicksPartition Key:
ad_id (Ensures ordering per ad for the aggregator).Throughput & Partitioning: 100 partitions per topic to allow for 100+ concurrent Flink consumers.
Technical Selection: Kafka.
Rationale: High throughput, low latency, and strong durability.
Data Processing
Processing Model: Streaming (Apache Flink).
Processing DAG:
Source: Read from Kafka
raw_clicks.Map: Parse JSON and extract timestamp.
Watermark: Assign watermarks based on
timestamp with a 5-second delay.Window:
keyBy(ad_id).window(TumblingEventTimeWindows.of(Time.minutes(1))).Aggregate: Sum clicks per ad.
Sink: Upsert results to Cassandra.
Correctness Guarantees: Flink's Checkpoints and Savepoints enable exactly-once processing by persisting state (partial aggregates) to durable storage (S3).
Technical Selection: Apache Flink.
Rationale: Native support for event-time processing and complex windowing.
Infrastructure (Optional)
Observability:
Metrics: Prometheus tracking "Consumer Lag" in Kafka (crucial indicator of system health).
Logging: Structured logs for click events to allow for post-hoc auditing.
Wrap Up
Advanced Topics
Trade-offs (PACELC): We choose Availability over Consistency (AP) for the ingestion layer (clicks must always be accepted), but we enforce Strong Consistency within the processing layer (Flink state) to ensure billing accuracy.
Reliability: Flink's exactly-once sink to Cassandra is implemented via "Idempotent Writes" (using the window start time and Ad ID as a unique primary key). If Flink restarts and re-processes a window, it simply overwrites the same record in Cassandra with the same value.
Hot Spot Optimization: For viral ads, we can implement sub-keying. Instead of grouping by
ad_id, we group by ad_id + random_shard_id, aggregate, and then run a second-stage aggregation to sum the shards.Security: Request signatures (HMAC) from the ad-client to the Ingestion Service to prevent spoofed clicks.