The Question

Ad Click-Through Rate (CTR) Prediction System

Design a high-scale Ad Click-Through Rate (CTR) prediction system capable of processing 100k+ QPS with sub-100ms latency. The system must handle a corpus of 10 million ads and 100 million users. Detail the multi-stage ranking funnel (retrieval and ranking), feature engineering for high-cardinality ID features, strategies for addressing training-serving skew, and how to maintain model calibration for downstream bidding. Explicitly discuss data ingestion, distributed training for massive datasets, and real-time monitoring of model performance and data drift.

DeepFM

Factorization Machines

Kafka

Spark

Flink

Redis

FAISS

TensorFlow

Horovod

Isotonic Regression

ONNX

Questions & Insights

Clarifying Questions

Business Goal: Is the North Star metric pure Click-Through Rate (CTR) or are we optimizing for Expected Value (e.g.,

eCPM = CTR \times CPC

)? Assumption: We optimize for CTR to maximize user engagement and platform revenue in a CPC model.

Constraints & Scale: What is the traffic volume (QPS) and the size of the ad corpus? Assumption: 100k QPS, 10M active ads, and a P99 latency budget of <100ms for the entire ranking stack.

Edge Cases: How do we handle new ads (Cold Start) and how fast must the model react to changing user trends? Assumption: We need near real-time updates for ad features and a mechanism to explore new ads (Epsilon-greedy).

Data Freshness: Is there a requirement for online learning? Assumption: We will use hourly/daily batch training with real-time feature updates for the MVP.

Thinking Process

Identify the Funnel: With 10M ads, we cannot rank all of them with a complex model. I need a multi-stage architecture: Retrieval (filtering 10M to 1k) and Ranking (scoring the top 1k).

Address Data Characteristics: Ad click data is extremely sparse (CTR is often <1%) and features are high-cardinality (User IDs, Ad IDs). My model choice must handle feature interactions effectively without manual engineering.

Latency vs. Complexity: A Deep Neural Network (DNN) is great, but I must ensure the feature retrieval and model inference fit within the 100ms budget.

The AdTech "Gotcha": Calibration is critical. A model that predicts 0.1 CTR when the real CTR is 0.01 will break the bidding logic.

Elite Bonus Points

Position Bias Modeling: In AdTech, ads at the top get more clicks regardless of quality. I will implement a "Position Feature" during training but set it to a default "Position 1" during inference to decorrelate relevance from placement.

Calibration for Downstream Bidding: Using Isotonic Regression or Platt Scaling to ensure the predicted probability

P(\text{click})

matches the empirical click rate, which is vital for second-price auctions.

Delayed Feedback Handling: Clicks can happen minutes or hours after an impression. I'll use a "negative sampling with later correction" strategy or a wait-time window to avoid false negatives in training.

Feature Hashing & Embedding Versioning: To handle the massive ID space (billions of users), I'll use the "Hashing Trick" to bound the vocabulary size and ensure embedding versioning to prevent serving/training mismatches.

Design Breakdown

Requirements

Product Goal: Maximize the probability of a user clicking an ad given their context.

Success Metrics:

Online Metrics: CTR, Revenue Per Mille (RPM), Lift over baseline.

Offline Metrics: AUC-ROC (ranking quality), Log-Loss (calibration quality), PR-AUC (due to class imbalance).

Guardrail Metrics: P99 Latency (<100ms), Training Time, Model Size.

System Constraints: 100M+ DAU, 10M+ ads, distributed training on TBs of data.

Data Availability: Impression logs (features), Click logs (labels), Ad Metadata, User Profiles.

ML Problem Framing

ML Task Type: Binary Classification (Pointwise Ranking).

Prediction Target:

P(\text{click} = 1 | \text{user, ad, context})

Inputs:

User: Historical clicks (last 50), demographics, device.

Ad: Category, Advertiser ID, Creative type, historical CTR (1h, 1d, 7d).

Context: Time of day, App ID, Geo-location.

Outputs: A calibrated probability score between 0 and 1.

ML Challenges: Extreme class imbalance (99% negative), high cardinality categorical features, and non-stationary data (trends change).

Design Summary & MVP

Concise Summary: A two-stage system using a fast filtering heuristic for retrieval followed by a DeepFM (Factorization-Machine supported Neural Network) for high-precision ranking.

Model Architecture & Selection:

Baseline Model: Logistic Regression with manual feature crosses (e.g., User_Category X Ad_Category).

Target Model: DeepFM.

Choice Rationale: DeepFM captures both low-order (FM component) and high-order (DNN component) feature interactions automatically, which is superior to manual crossing in high-cardinality spaces.

ML Life Cycle Summary: Data is ingested via Kafka, joined on RequestID, features are stored in a Feature Store, and the model is trained offline in batch. Serving happens via a low-latency C++ or Go microservice.

Simplicity Audit: I am avoiding Online Learning (RL) for the MVP as batch training with frequent updates (e.g., hourly) provides 90% of the value with 10% of the infrastructure complexity.

Architecture Decision Rationale:

Why this?: DeepFM provides the best balance of accuracy and latency for CTR.

Requirement Satisfaction: The two-stage funnel ensures we hit the <100ms latency target while still considering 10M+ items.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Ad impression logs (every time an ad is shown) and Click logs (every time a user clicks).

Data Ingestion: Kafka acts as the buffer. Click events are joined with impression events using a RequestID.

Data Storage: Raw logs in S3 (Data Lake). Processed training samples in a Data Warehouse (BigQuery/Snowflake) partitioned by hour.

Data Processing: Spark handles the heavy lifting of joining logs with a 24-hour window to account for delayed clicks.

Data Quality: We implement "Negative Downsampling" to reduce storage costs while keeping all positive (click) samples.

Feature Pipeline

Feature Definition:

Dense: User age, historical CTR (float).

Sparse: UserID, AdID, Category (categorical).

Feature Engineering: Standardize dense features. Use Embedding Layers for sparse features to map them to low-dimensional vectors.

Online Feature Pipeline: Flink calculates real-time counters (e.g., "how many times has this ad been clicked in the last 15 minutes?") and pushes to Redis.

Feature Store: We use a Feature Store (e.g., Feast) to ensure that the same transformation logic used during training is applied during inference, preventing Training-Serving Skew.

Model Architecture

Problem Formulation: Binary classification using Cross-Entropy Loss.

Candidate Model Families:

Logistic Regression (too simple).

XGBoost (hard to scale with 1B+ rows and high-cardinality IDs).

DeepFM (Winner): Integrates Factorization Machines (FM) and Deep Neural Networks (DNN).

Architecture Design:

Wide Component: Captures memorization (e.g., specific AdID).

Deep Component: Generalization via embeddings of categorical features.

Inference Optimization: Use TensorRT or ONNX Runtime for C++ optimized execution to keep ranking under 20ms for 500 candidates.

Training Pipeline

Dataset Construction: 1:10 positive-to-negative sampling ratio. We use a time-based split (Train on days 1-20, Test on day 21) to prevent looking into the future.

Training Infrastructure: Distributed TensorFlow or PyTorch using Horovod for multi-GPU training.

Retraining Strategy: Daily full-model retraining + hourly incremental "fine-tuning" on the embedding layers to capture the latest trends.

Serving Pipeline

Serving Pattern: Online inference for the Ranking stage.

Latency Optimization:

Retrieval: Use Approximate Nearest Neighbor (ANN) search (e.g., FAISS) on user/item embeddings or a simple heuristic (Top 1000 ads by category).

Parallelism: Fetch features from Redis in parallel with the Retrieval step.

Reliability: If the Ranking service times out, fall back to a "Popularity-based" ranking.

Evaluation Pipeline

Offline Evaluation:

AUC-ROC: Measures ranking order quality.

Log-Loss: Measures the accuracy of the probability estimate.

Online Evaluation: 2-way A/B test. Group A: Baseline (Heuristic); Group B: DeepFM. Monitor CTR and Revenue per request.

Monitoring Pipeline

System Monitoring: Latency histograms, throughput, and error rates (5xx codes).

Data Monitoring: Track KL-Divergence between training and serving feature distributions.

Model Monitoring: Monitor the Calibration Plot (Predicted vs. Actual CTR) to ensure the model isn't over/under-predicting.

Wrap Up

Final Evaluation

Observability: We use a dashboard to track "Model Staleness"—if the model hasn't been retrained in 24 hours, trigger an alert.

Feedback Loop: Clicks are fed back into the training pipeline hourly.

Trade-offs: We trade off model complexity (e.g., Transformers) for latency (DeepFM), ensuring we can scale to thousands of candidates per request.

Distinguishing Insights: In a real-world AdTech system, the "Feedback Loop" is poisoned by Exploration. We must use techniques like Bandit algorithms (UCB/Thompson Sampling) in the retrieval stage to ensure we aren't just showing the same high-CTR ads and ignoring potentially better new ones.