The Question
ML DesignScalable Multi-Stage Ads Ranking System
Design a high-scale ads ranking and auction system capable of handling a corpus of 100 million ads and 500k QPS. The system must optimize for platform revenue (eCPM) while adhering to a strict 100ms latency budget. Detail the multi-stage funnel (retrieval, ranking, auction), feature engineering for sparse data, and strategies for model calibration and ad cold-start.
Two-Tower Model
DCN-v2
FAISS
MMoE
HNSW
Kafka
Flink
Spark
Horovod
TensorRT
Feast
Redis
Questions & Insights
Clarifying Questions
Business Goal: What is the primary North Star metric? I assume we aim to maximize eCPM (effective Cost Per Mille), which is a function of P(\text{click}) \times \text{Bid}.
Constraints & Scale: What is the scale of the system? I assume 100M+ active ads, 500M+ DAU, and a peak throughput of 500k QPS.
Latency Budget: What is the end-to-end P99 latency requirement? I assume 100ms for the entire ranking stack (retrieval + scoring + auction).
Edge Cases: How should we handle "Cold Start" for new ads and "Delayed Feedback" for conversion events? I assume we need an exploration strategy (e.g., epsilon-greedy) and a windowed attribution model.
Assumptions: I assume a multi-stage architecture (Retrieval -> Ranking -> Auction) is required to handle the 100M ad corpus within the 100ms latency budget.
Thinking Process
Identify the Bottleneck: With 100M ads, we cannot score every ad with a complex model. The bottleneck is the trade-off between model complexity and the number of candidates. I must use a multi-stage funnel.
Retrieval vs. Ranking: Retrieval needs to be high-recall/low-precision (filtering 100M to 1000). Ranking needs to be high-precision (scoring 1000 to 1).
MVP Strategy: Start with a Two-Tower model for retrieval (embeddings) and a DeepFM or DCN (Deep & Cross Network) for ranking to capture feature interactions efficiently. Avoid Reinforcement Learning for bidding in the MVP.
Scaling: Use a Feature Store to bridge the gap between offline training and online serving to prevent training-serving skew.
Elite Bonus Points
Calibration: Ad scores (pCTR) must be well-calibrated (not just directional) because they are multiplied by real money (bids) in the auction. I'll implement Isotonic Regression or Platt Scaling.
Position Bias Modeling: Users are more likely to click the top ad regardless of relevance. I'll use a Position Encoding feature during training and set it to a "default" value during inference to decouple relevance from placement.
Negative Downsampling with Calibration Correction: In ad-tech, clicks are rare (CTR < 1%). I will downsample negatives to speed up training but apply a logit transformation at inference to correct the predicted probability.
Embeddings Warm-start: To solve the ad cold-start, I will use Content-based Embeddings (from ad images/text) to initialize the ID embeddings of new ads.
Design Breakdown
Requirements
Product Goal: Maximize platform revenue while maintaining user relevance and advertiser ROI.
Success Metrics:
Online: CTR (Click-Through Rate), VVR (View-to-Value Rate), Revenue per Mille (RPM).
Offline: AUC-ROC (for ranking quality), LogLoss (for calibration), NDCG.
Guardrail: P99 Latency < 100ms, Model Training Time < 4 hours.
System Constraints: 500k QPS, 100M item corpus, distributed training on TBs of daily logs.
Data Availability: Real-time click/impression streams, advertiser metadata, user profile store.
ML Problem Framing
ML Task Type: Binary Classification (Predicting click vs. no-click).
Prediction Target: P(\text{click} | \text{user, ad, context}).
Inputs:
UserID, age, gender, historical click-map (last 50 ads), search intent.
AdID, CampaignID, Category, Creative Embeddings (ResNet/CLIP), Historical CTR.
Context: Device, Time of Day, AppID, Ad Position.
Outputs: A probability score \in [0, 1].
ML Challenges: Extreme class imbalance, extreme sparsity of ID features, and high-frequency feature updates.
Design Summary & MVP
Concise Summary: A two-stage system consisting of a Two-Tower Neural Network for vector-based retrieval and a Deep & Cross Network (DCN-v2) for point-wise ranking, followed by a second-price auction logic.
Model Architecture & Selection:
Baseline: Logistic Regression with heavy feature engineering.
Target Model: Two-Tower (Retrieval) + DCN-v2 (Ranking).
Choice Rationale: Two-Tower allows for sub-millisecond approximate nearest neighbor (ANN) search. DCN-v2 captures high-order feature interactions (e.g., User-Category cross) without manual feature engineering.
ML Life Cycle Summary: Continuous ingestion via Kafka -> Spark for feature extraction -> Horovod/TensorFlow for distributed training -> TF-Serving for low-latency inference.
Simplicity Audit: This design avoids complex Multi-Task Learning (MTL) or RL-based bidding initially, focusing on the most critical component: accurate CTR prediction.
Architecture Decision Rationale: This architecture is the industry standard for AdTech (Google, Meta) because it balances the "retrieval recall" and "ranking precision" optimally within strict latency constraints.
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: Ingestion of click/impression logs from mobile/web clients, advertiser bid updates, and user profile changes.
Data Ingestion: Kafka as the message bus for "At-least-once" delivery. Flink handles real-time aggregations (e.g., ad click counts in the last 5 minutes).
Data Storage: S3 acts as the Data Lake for raw logs (Parquet format). Snowflake or BigQuery for structured analytical queries.
Data Processing: Spark DAGs perform "Join-on-Log" to attribute clicks to impressions. This handles the 1-2 second delay between an impression and a click.
Data Quality: Great Expectations used for schema enforcement and identifying null spikes in feature columns.
Feature Pipeline
Feature Definition:
User: Dense embeddings from search history + Sparse IDs.
Ad: Static (Category) + Dynamic (1h/24h CTR).
Feature Engineering: Log-transform of counts, Z-score normalization for continuous variables, and Hashing trick for high-cardinality IDs.
Offline/Online Consistency: Use a Feature Store (Feast). Training data is generated via "Point-in-time" joins in the offline store to prevent label leakage. Online serving pulls the same feature definitions from Redis.
Training/Serving Skew: Monitored by logging the features used at inference time and comparing their distributions with training data (PSI metric).
Model Architecture
Problem Formulation: Point-wise ranking as binary classification using Cross-Entropy loss.
Candidate Model Families:
Retrieval: Two-Tower (User Tower and Ad Tower) with Cosine Similarity.
Ranking: DCN-v2 (Deep & Cross Network).
Architecture Design:
Embeddings Layer: Shared embeddings for categorical features.
Cross Network: Explicitly models feature interactions x_i \times x_j across multiple layers.
Deep Network: Standard MLP to capture non-linearities.
Model Complexity: 100M parameters (mostly embeddings). We use FP16 quantization to reduce memory footprint.
Architecture Optimization: Knowledge Distillation—a large DCN-v2 "Teacher" trains a smaller, faster "Student" model for the initial ranking stage.
Training Pipeline
Dataset Construction: Negative sampling (ratio 10:1). We use importance sampling to weight samples to account for the bias.
Data Splitting: Time-based split. Train on days 1-28, validate on day 29, test on day 30.
Training Infrastructure: Horovod on a Kubernetes GPU cluster.
Hyperparameter Tuning: Ray Tune using Bayesian Optimization for embedding dimensions and learning rates.
Retraining Strategy: Daily batch retraining with hourly incremental updates for the top layers to capture shifting trends.
Serving Pipeline
Serving Pattern: Online Inference.
Latency Optimization:
Retrieval: Use Faiss (HNSW index) for ANN search.
Ranking: Model is exported to TensorRT for GPU acceleration.
Scalability: Horizontal scaling of the Scoring Service based on CPU/GPU utilization.
Reliability: If the Scoring Service fails, fall back to a "Heuristic Ranker" (ranking by historical CTR).
Evaluation Pipeline
Offline Evaluation: AUC is the primary metric (measures ranking order). Calibration Error (ECE) ensures pCTR matches real CTR.
Online Evaluation: A/B Testing against a control group. We measure "Lift in Revenue" and "User Retention".
Monitoring Pipeline
System Monitoring: Prometheus/Grafana for QPS and P99 latency.
Data Monitoring: Monitor for Feature Drift (e.g., if a categorical feature's distribution shifts).
Model Monitoring: Track the ratio of Predicted CTR / Observed CTR in real-time. If the ratio deviates >10\%, trigger an alert.
Wrap Up
Final Evaluation
Observability: Real-time dashboards showing "Bid vs. pCTR" distributions.
Edge Cases:
Cold Start: For new ads, we use a "MAB" (Multi-Armed Bandit) approach, giving them a small "exploration bonus" to their score to collect data.
Trade-offs:
Accuracy vs. Latency: We cap the number of items passed to the Ranker at 500 to stay under 100ms.
Revenue vs. CX: We apply a "relevance threshold"—ads with a score below X are filtered even if they have high bids.
Distinguishing Insights: To reach Principal levels, I would propose Multi-Task Learning (MTL) using MMoE (Multi-gate Mixture of Experts) to simultaneously optimize for Click, Conversion, and Dismissal (Negative Signal), allowing for a more holistic optimization objective beyond just clicks.