The Question
ML DesignAd Click-Through Rate (CTR) Prediction System
Design a high-scale Ad Click-Through Rate (CTR) prediction system capable of processing 100k+ QPS with sub-100ms latency. The system must handle a corpus of 10 million ads and 100 million users. Detail the multi-stage ranking funnel (retrieval and ranking), feature engineering for high-cardinality ID features, strategies for addressing training-serving skew, and how to maintain model calibration for downstream bidding. Explicitly discuss data ingestion, distributed training for massive datasets, and real-time monitoring of model performance and data drift.
DeepFM
Factorization Machines
Kafka
Spark
Flink
Redis
FAISS
TensorFlow
Horovod
Isotonic Regression
ONNX
Questions & Insights
Clarifying Questions
Business Goal: Is the North Star metric pure Click-Through Rate (CTR) or are we optimizing for Expected Value (e.g., eCPM = CTR \times CPC)? Assumption: We optimize for CTR to maximize user engagement and platform revenue in a CPC model.
Constraints & Scale: What is the traffic volume (QPS) and the size of the ad corpus? Assumption: 100k QPS, 10M active ads, and a P99 latency budget of <100ms for the entire ranking stack.
Edge Cases: How do we handle new ads (Cold Start) and how fast must the model react to changing user trends? Assumption: We need near real-time updates for ad features and a mechanism to explore new ads (Epsilon-greedy).
Data Freshness: Is there a requirement for online learning? Assumption: We will use hourly/daily batch training with real-time feature updates for the MVP.
Thinking Process
Identify the Funnel: With 10M ads, we cannot rank all of them with a complex model. I need a multi-stage architecture: Retrieval (filtering 10M to 1k) and Ranking (scoring the top 1k).
Address Data Characteristics: Ad click data is extremely sparse (CTR is often <1%) and features are high-cardinality (User IDs, Ad IDs). My model choice must handle feature interactions effectively without manual engineering.
Latency vs. Complexity: A Deep Neural Network (DNN) is great, but I must ensure the feature retrieval and model inference fit within the 100ms budget.
The AdTech "Gotcha": Calibration is critical. A model that predicts 0.1 CTR when the real CTR is 0.01 will break the bidding logic.
Elite Bonus Points
Position Bias Modeling: In AdTech, ads at the top get more clicks regardless of quality. I will implement a "Position Feature" during training but set it to a default "Position 1" during inference to decorrelate relevance from placement.
Calibration for Downstream Bidding: Using Isotonic Regression or Platt Scaling to ensure the predicted probability P(\text{click}) matches the empirical click rate, which is vital for second-price auctions.
Delayed Feedback Handling: Clicks can happen minutes or hours after an impression. I'll use a "negative sampling with later correction" strategy or a wait-time window to avoid false negatives in training.
Feature Hashing & Embedding Versioning: To handle the massive ID space (billions of users), I'll use the "Hashing Trick" to bound the vocabulary size and ensure embedding versioning to prevent serving/training mismatches.
Design Breakdown
Requirements
Product Goal: Maximize the probability of a user clicking an ad given their context.
Success Metrics:
Online Metrics: CTR, Revenue Per Mille (RPM), Lift over baseline.
Offline Metrics: AUC-ROC (ranking quality), Log-Loss (calibration quality), PR-AUC (due to class imbalance).
Guardrail Metrics: P99 Latency (<100ms), Training Time, Model Size.
System Constraints: 100M+ DAU, 10M+ ads, distributed training on TBs of data.
Data Availability: Impression logs (features), Click logs (labels), Ad Metadata, User Profiles.
ML Problem Framing
ML Task Type: Binary Classification (Pointwise Ranking).
Prediction Target: P(\text{click} = 1 | \text{user, ad, context}).
Inputs:
User: Historical clicks (last 50), demographics, device.
Ad: Category, Advertiser ID, Creative type, historical CTR (1h, 1d, 7d).
Context: Time of day, App ID, Geo-location.
Outputs: A calibrated probability score between 0 and 1.
ML Challenges: Extreme class imbalance (99% negative), high cardinality categorical features, and non-stationary data (trends change).
Design Summary & MVP
Concise Summary: A two-stage system using a fast filtering heuristic for retrieval followed by a DeepFM (Factorization-Machine supported Neural Network) for high-precision ranking.
Model Architecture & Selection:
Baseline Model: Logistic Regression with manual feature crosses (e.g.,
User_Category X Ad_Category).Target Model: DeepFM.
Choice Rationale: DeepFM captures both low-order (FM component) and high-order (DNN component) feature interactions automatically, which is superior to manual crossing in high-cardinality spaces.
ML Life Cycle Summary: Data is ingested via Kafka, joined on
RequestID, features are stored in a Feature Store, and the model is trained offline in batch. Serving happens via a low-latency C++ or Go microservice.Simplicity Audit: I am avoiding Online Learning (RL) for the MVP as batch training with frequent updates (e.g., hourly) provides 90% of the value with 10% of the infrastructure complexity.
Architecture Decision Rationale:
Why this?: DeepFM provides the best balance of accuracy and latency for CTR.
Requirement Satisfaction: The two-stage funnel ensures we hit the <100ms latency target while still considering 10M+ items.
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: Ad impression logs (every time an ad is shown) and Click logs (every time a user clicks).
Data Ingestion: Kafka acts as the buffer. Click events are joined with impression events using a
RequestID. Data Storage: Raw logs in S3 (Data Lake). Processed training samples in a Data Warehouse (BigQuery/Snowflake) partitioned by hour.
Data Processing: Spark handles the heavy lifting of joining logs with a 24-hour window to account for delayed clicks.
Data Quality: We implement "Negative Downsampling" to reduce storage costs while keeping all positive (click) samples.
Feature Pipeline
Feature Definition:
Dense: User age, historical CTR (float).
Sparse: UserID, AdID, Category (categorical).
Feature Engineering: Standardize dense features. Use Embedding Layers for sparse features to map them to low-dimensional vectors.
Online Feature Pipeline: Flink calculates real-time counters (e.g., "how many times has this ad been clicked in the last 15 minutes?") and pushes to Redis.
Feature Store: We use a Feature Store (e.g., Feast) to ensure that the same transformation logic used during training is applied during inference, preventing Training-Serving Skew.
Model Architecture
Problem Formulation: Binary classification using Cross-Entropy Loss.
Candidate Model Families:
Logistic Regression (too simple).
XGBoost (hard to scale with 1B+ rows and high-cardinality IDs).
DeepFM (Winner): Integrates Factorization Machines (FM) and Deep Neural Networks (DNN).
Architecture Design:
Wide Component: Captures memorization (e.g., specific AdID).
Deep Component: Generalization via embeddings of categorical features.
Inference Optimization: Use TensorRT or ONNX Runtime for C++ optimized execution to keep ranking under 20ms for 500 candidates.
Training Pipeline
Dataset Construction: 1:10 positive-to-negative sampling ratio. We use a time-based split (Train on days 1-20, Test on day 21) to prevent looking into the future.
Training Infrastructure: Distributed TensorFlow or PyTorch using
Horovod for multi-GPU training.Retraining Strategy: Daily full-model retraining + hourly incremental "fine-tuning" on the embedding layers to capture the latest trends.
Serving Pipeline
Serving Pattern: Online inference for the Ranking stage.
Latency Optimization:
Retrieval: Use Approximate Nearest Neighbor (ANN) search (e.g., FAISS) on user/item embeddings or a simple heuristic (Top 1000 ads by category).
Parallelism: Fetch features from Redis in parallel with the Retrieval step.
Reliability: If the Ranking service times out, fall back to a "Popularity-based" ranking.
Evaluation Pipeline
Offline Evaluation:
AUC-ROC: Measures ranking order quality.
Log-Loss: Measures the accuracy of the probability estimate.
Online Evaluation: 2-way A/B test. Group A: Baseline (Heuristic); Group B: DeepFM. Monitor CTR and Revenue per request.
Monitoring Pipeline
System Monitoring: Latency histograms, throughput, and error rates (5xx codes).
Data Monitoring: Track KL-Divergence between training and serving feature distributions.
Model Monitoring: Monitor the Calibration Plot (Predicted vs. Actual CTR) to ensure the model isn't over/under-predicting.
Wrap Up
Final Evaluation
Observability: We use a dashboard to track "Model Staleness"—if the model hasn't been retrained in 24 hours, trigger an alert.
Feedback Loop: Clicks are fed back into the training pipeline hourly.
Trade-offs: We trade off model complexity (e.g., Transformers) for latency (DeepFM), ensuring we can scale to thousands of candidates per request.
Distinguishing Insights: In a real-world AdTech system, the "Feedback Loop" is poisoned by Exploration. We must use techniques like Bandit algorithms (UCB/Thompson Sampling) in the retrieval stage to ensure we aren't just showing the same high-CTR ads and ignoring potentially better new ones.