DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
ML Design

Ad Click-Through Rate (CTR) Prediction System

Design a high-scale Ad Click-Through Rate (CTR) prediction system capable of processing 100k+ QPS with sub-100ms latency. The system must handle a corpus of 10 million ads and 100 million users. Detail the multi-stage ranking funnel (retrieval and ranking), feature engineering for high-cardinality ID features, strategies for addressing training-serving skew, and how to maintain model calibration for downstream bidding. Explicitly discuss data ingestion, distributed training for massive datasets, and real-time monitoring of model performance and data drift.
DeepFM
Factorization Machines
Kafka
Spark
Flink
Redis
FAISS
TensorFlow
Horovod
Isotonic Regression
ONNX
Questions & Insights

Clarifying Questions

Business Goal: Is the North Star metric pure Click-Through Rate (CTR) or are we optimizing for Expected Value (e.g., eCPM = CTR \times CPC)? Assumption: We optimize for CTR to maximize user engagement and platform revenue in a CPC model.
Constraints & Scale: What is the traffic volume (QPS) and the size of the ad corpus? Assumption: 100k QPS, 10M active ads, and a P99 latency budget of <100ms for the entire ranking stack.
Edge Cases: How do we handle new ads (Cold Start) and how fast must the model react to changing user trends? Assumption: We need near real-time updates for ad features and a mechanism to explore new ads (Epsilon-greedy).
Data Freshness: Is there a requirement for online learning? Assumption: We will use hourly/daily batch training with real-time feature updates for the MVP.

Thinking Process

Identify the Funnel: With 10M ads, we cannot rank all of them with a complex model. I need a multi-stage architecture: Retrieval (filtering 10M to 1k) and Ranking (scoring the top 1k).
Address Data Characteristics: Ad click data is extremely sparse (CTR is often <1%) and features are high-cardinality (User IDs, Ad IDs). My model choice must handle feature interactions effectively without manual engineering.
Latency vs. Complexity: A Deep Neural Network (DNN) is great, but I must ensure the feature retrieval and model inference fit within the 100ms budget.
The AdTech "Gotcha": Calibration is critical. A model that predicts 0.1 CTR when the real CTR is 0.01 will break the bidding logic.

Elite Bonus Points

Position Bias Modeling: In AdTech, ads at the top get more clicks regardless of quality. I will implement a "Position Feature" during training but set it to a default "Position 1" during inference to decorrelate relevance from placement.
Calibration for Downstream Bidding: Using Isotonic Regression or Platt Scaling to ensure the predicted probability P(\text{click}) matches the empirical click rate, which is vital for second-price auctions.
Delayed Feedback Handling: Clicks can happen minutes or hours after an impression. I'll use a "negative sampling with later correction" strategy or a wait-time window to avoid false negatives in training.
Feature Hashing & Embedding Versioning: To handle the massive ID space (billions of users), I'll use the "Hashing Trick" to bound the vocabulary size and ensure embedding versioning to prevent serving/training mismatches.
Design Breakdown

Requirements

Product Goal: Maximize the probability of a user clicking an ad given their context.
Success Metrics:
Online Metrics: CTR, Revenue Per Mille (RPM), Lift over baseline.
Offline Metrics: AUC-ROC (ranking quality), Log-Loss (calibration quality), PR-AUC (due to class imbalance).
Guardrail Metrics: P99 Latency (<100ms), Training Time, Model Size.
System Constraints: 100M+ DAU, 10M+ ads, distributed training on TBs of data.
Data Availability: Impression logs (features), Click logs (labels), Ad Metadata, User Profiles.

ML Problem Framing

ML Task Type: Binary Classification (Pointwise Ranking).
Prediction Target: P(\text{click} = 1 | \text{user, ad, context}).
Inputs:
User: Historical clicks (last 50), demographics, device.
Ad: Category, Advertiser ID, Creative type, historical CTR (1h, 1d, 7d).
Context: Time of day, App ID, Geo-location.
Outputs: A calibrated probability score between 0 and 1.
ML Challenges: Extreme class imbalance (99% negative), high cardinality categorical features, and non-stationary data (trends change).

Design Summary & MVP

Concise Summary: A two-stage system using a fast filtering heuristic for retrieval followed by a DeepFM (Factorization-Machine supported Neural Network) for high-precision ranking.
Model Architecture & Selection:
Baseline Model: Logistic Regression with manual feature crosses (e.g., User_Category X Ad_Category).
Target Model: DeepFM.
Choice Rationale: DeepFM captures both low-order (FM component) and high-order (DNN component) feature interactions automatically, which is superior to manual crossing in high-cardinality spaces.
ML Life Cycle Summary: Data is ingested via Kafka, joined on RequestID, features are stored in a Feature Store, and the model is trained offline in batch. Serving happens via a low-latency C++ or Go microservice.
Simplicity Audit: I am avoiding Online Learning (RL) for the MVP as batch training with frequent updates (e.g., hourly) provides 90% of the value with 10% of the infrastructure complexity.
Architecture Decision Rationale:
Why this?: DeepFM provides the best balance of accuracy and latency for CTR.
Requirement Satisfaction: The two-stage funnel ensures we hit the <100ms latency target while still considering 10M+ items.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Ad impression logs (every time an ad is shown) and Click logs (every time a user clicks).
Data Ingestion: Kafka acts as the buffer. Click events are joined with impression events using a RequestID.
Data Storage: Raw logs in S3 (Data Lake). Processed training samples in a Data Warehouse (BigQuery/Snowflake) partitioned by hour.
Data Processing: Spark handles the heavy lifting of joining logs with a 24-hour window to account for delayed clicks.
Data Quality: We implement "Negative Downsampling" to reduce storage costs while keeping all positive (click) samples.

Feature Pipeline

Feature Definition:
Dense: User age, historical CTR (float).
Sparse: UserID, AdID, Category (categorical).
Feature Engineering: Standardize dense features. Use Embedding Layers for sparse features to map them to low-dimensional vectors.
Online Feature Pipeline: Flink calculates real-time counters (e.g., "how many times has this ad been clicked in the last 15 minutes?") and pushes to Redis.
Feature Store: We use a Feature Store (e.g., Feast) to ensure that the same transformation logic used during training is applied during inference, preventing Training-Serving Skew.

Model Architecture

Problem Formulation: Binary classification using Cross-Entropy Loss.
Candidate Model Families:
Logistic Regression (too simple).
XGBoost (hard to scale with 1B+ rows and high-cardinality IDs).
DeepFM (Winner): Integrates Factorization Machines (FM) and Deep Neural Networks (DNN).
Architecture Design:
Wide Component: Captures memorization (e.g., specific AdID).
Deep Component: Generalization via embeddings of categorical features.
Inference Optimization: Use TensorRT or ONNX Runtime for C++ optimized execution to keep ranking under 20ms for 500 candidates.

Training Pipeline

Dataset Construction: 1:10 positive-to-negative sampling ratio. We use a time-based split (Train on days 1-20, Test on day 21) to prevent looking into the future.
Training Infrastructure: Distributed TensorFlow or PyTorch using Horovod for multi-GPU training.
Retraining Strategy: Daily full-model retraining + hourly incremental "fine-tuning" on the embedding layers to capture the latest trends.

Serving Pipeline

Serving Pattern: Online inference for the Ranking stage.
Latency Optimization:
Retrieval: Use Approximate Nearest Neighbor (ANN) search (e.g., FAISS) on user/item embeddings or a simple heuristic (Top 1000 ads by category).
Parallelism: Fetch features from Redis in parallel with the Retrieval step.
Reliability: If the Ranking service times out, fall back to a "Popularity-based" ranking.

Evaluation Pipeline

Offline Evaluation:
AUC-ROC: Measures ranking order quality.
Log-Loss: Measures the accuracy of the probability estimate.
Online Evaluation: 2-way A/B test. Group A: Baseline (Heuristic); Group B: DeepFM. Monitor CTR and Revenue per request.

Monitoring Pipeline

System Monitoring: Latency histograms, throughput, and error rates (5xx codes).
Data Monitoring: Track KL-Divergence between training and serving feature distributions.
Model Monitoring: Monitor the Calibration Plot (Predicted vs. Actual CTR) to ensure the model isn't over/under-predicting.
Wrap Up

Final Evaluation

Observability: We use a dashboard to track "Model Staleness"—if the model hasn't been retrained in 24 hours, trigger an alert.
Feedback Loop: Clicks are fed back into the training pipeline hourly.
Trade-offs: We trade off model complexity (e.g., Transformers) for latency (DeepFM), ensuring we can scale to thousands of candidates per request.
Distinguishing Insights: In a real-world AdTech system, the "Feedback Loop" is poisoned by Exploration. We must use techniques like Bandit algorithms (UCB/Thompson Sampling) in the retrieval stage to ensure we aren't just showing the same high-CTR ads and ignoring potentially better new ones.