The Question
ML DesignLarge-Scale Online Ads Ranking and Auction System
Design a high-throughput ads ranking system for a social media platform with 500M DAU. The system must select the most relevant ads from a corpus of 10M candidates within a 100ms P99 latency budget. Focus on the end-to-end ML lifecycle: from real-time feature engineering (handling user signals) and multi-stage funnel architecture (retrieval and ranking), to specialized ad-tech challenges like position bias, click/conversion calibration for auctions, and handling delayed feedback loops in conversion data. Discuss the trade-offs between model complexity and serving constraints, and how to ensure the system maximizes long-term revenue (eCPM) while maintaining advertiser trust through accurate ROI predictions.
Wide & Deep
DeepFM
Two-Tower Model
FAISS
HNSW
Kafka
Flink
Spark
Redis
Platt Scaling
Isotonic Regression
Multi-Task Learning
Questions & Insights
Clarifying Questions
Business Goal: Is the primary objective maximizing Revenue (eCPM), Click-Through Rate (CTR), or Advertiser ROI?
Assumption*: We aim to maximize eCPM** (expected Cost Per Mille), which is calculated as Bid \times pCTR \times pCVR.
Constraints & Scale: What is the scale of the system (DAU, QPS) and the latency budget?
Assumption*: 500M DAU, 100k peak QPS, and a strict P99 latency of 100ms** for the entire ranking pipeline.
Corpus Size: How many active ads are in the pool at any given time?
Assumption: 10M active ads.
Cold Start: How do we handle new advertisers or new users?
Assumption: We need a strategy for ads with no historical data to ensure exploration.
Freshness: How quickly must user actions (clicks/converts) influence the ranking?
Assumption: Near real-time (minutes) for feature updates to capture session intent.
Thinking Process
Identify the Bottleneck: With 10M ads and a 100ms budget, a single-stage complex model is impossible. I must use a multi-stage funnel: Retrieval (Candidate Generation) followed by Ranking.
Model Selection: For the MVP Ranking stage, I need to handle sparse categorical features (User ID, Ad ID) and dense features (historical CTR). A Deep Neural Network (DNN) with an embedding layer is the standard "Workhorse."
Focus on the Objective: Ads ranking isn't just about CTR; it's about the auction. The ML model provides the probabilities (pCTR, pCVR), and the system ranks by Bid \times pCTR \times pCVR.
Data Engineering: In AdTech, data freshness and label attribution (handling delayed conversions) are the biggest engineering hurdles.
Elite Bonus Points
Calibration: Raw pCTR from models is often biased. I would implement Platt Scaling or Isotonic Regression to ensure E[predictions] \approx E[labels], which is critical for fair auctions and billing.
Position Bias Modeling: Users are more likely to click the top ad regardless of relevance. I'll use a shallow "Bias Tower" during training (using position as a feature) but remove it during inference to get the true relevance score.
Delayed Feedback Handling: Conversions often happen days after a click. I’ll implement a Negative Sampling with Correction or Fake Negative Calibration strategy to prevent the model from becoming overly pessimistic about recent ads.
Counterfactual Evaluation: Using Inverse Probability Weighting (IPW) to evaluate how the model would have performed on a different policy, mitigating the "selection bias" of offline logs.
Design Breakdown
Requirements
Product Goal: Connect users with relevant ads to maximize platform revenue and advertiser value.
Success Metrics:
Online: Revenue per Mille (RPM), CTR, Conversion Rate (CVR).
Offline: AUC-ROC (for CTR), LogLoss, Calibration Error.
Guardrail: P99 Latency < 100ms, Model Training Time, Data Freshness (Latency of feature pipeline).
System Constraints: 100k QPS, 10M items, 100ms SLA.
Data Availability: User profile, Ad metadata, Real-time click/conversion streams, Search/Contextual logs.
ML Problem Framing
ML Task Type: Binary Classification (Predicting click: 0 or 1).
Prediction Target: P(\text{click} | \text{user, ad, context}).
Inputs:
User ID (embedding), age, gender, historical categories clicked, last 5 ads viewed.
Ad: Creative ID, Advertiser ID, Campaign ID, Category, Historical CTR (1h, 1d, 7d).
Context: Device, Time of day, App/Page ID, Geo-location.
Outputs: A probability score \in [0, 1].
ML Challenges: Extreme class imbalance (CTR is often < 1%), extreme feature sparsity, and feedback loops (the model influences its own future training data).
Design Summary & MVP
Concise Summary: A two-stage architecture: Retrieval (Two-Tower Model for Top-1000) and Ranking (DeepFM or DNN for Top-50), finalized by an auction mechanism (Bid \times Score).
Model Architecture & Selection:
Baseline: Logistic Regression with hashed features.
Target Model: Wide & Deep or DeepFM.
Choice Rationale: Wide component handles memorization (specific user-ad pairs), while the Deep component handles generalization via embeddings.
Simplicity Audit: We avoid Reinforcement Learning or complex Transformers for the MVP to prioritize low-latency serving and stable training.
Architecture Decision Rationale:
The funnel approach satisfies the 100ms latency requirement.
DeepFM captures high-order feature interactions without manual feature engineering.
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: Mobile SDK logs (clicks/views), Backend Transaction DB (conversions), Ad Catalog (creatives).
Data Ingestion: Kafka for real-time events. Airflow for daily batch synchronization of advertiser metadata.
Data Storage: S3/Iceberg for the data lake. Partitioned by
event_date and event_hour for efficient retrieval.Data Processing: Spark for complex historical joins (labeling clicks with conversions). Flink for windowed aggregations (e.g., "how many times did this user click an ad in the last 10 minutes?").
Data Quality: De-duplication of click logs (preventing double-counting) and schema validation using Protobuf/Avro.
Feature Pipeline
Feature Engineering:
Categorical: Hashing trick for high-cardinality IDs.
Continuous: Log-transform of Bid and UserAge.
Temporal: Time-since-last-action features.
Offline Feature Pipeline: Pre-calculates daily historical CTRs for every ad/user pair and stores them in Hive/BigQuery.
Online Feature Pipeline: Uses Flink to update the user's "recent interest" vector in Redis (P99 < 5ms).
Feature Store: Tecton or Feast ensures that the features used during training (point-in-time) are identical to features used during serving.
Model Architecture
Problem Formulation: Multi-task learning (MTL) is preferred. Predict pCTR and pCVR simultaneously.
Core Model: Wide & Deep Neural Network.
Wide: Linear model for cross-product features (e.g.,
User_Language AND Ad_Language).Deep: 3-4 layer MLP with embeddings for sparse IDs.
Embeddings: 64-dimensional embeddings for AdIDs and UserIDs, initialized randomly and learned during training.
Complexity: ~1M parameters. Model size ~500MB (mostly embedding tables). This fits easily in RAM for low-latency scoring.
Training Pipeline
Dataset Construction:
Labeling: "Click" = 1, "Impression without click" = 0.
Negative Sampling: Since CTR is low, we down-sample negatives (non-clicks) to balance the dataset and speed up training, then apply a logit adjustment during serving: f(x) = \sigma(logit(x) + \log(\text{sampling\_rate})).
Training Infrastructure: Horovod/PyTorch Distributed using Spot Instances for cost-efficiency.
Hyperparameter Tuning: Focus on embedding dropout and learning rate decay.
Retraining: Daily batch retraining on the last 30 days of data + Incremental online learning every 1 hour for the "Deep" weights to capture fresh trends.
Serving Pipeline
Two-Stage Architecture:
Retrieval: FAISS or HNSW (Approximate Nearest Neighbor) searching against user embeddings to find the top 1000 candidate ads.
Ranking: The DNN scores those 1000 ads.
Latency Optimization:
Model Quantization (FP16).
Remote feature fetching in parallel.
Caching top 10% of popular ads' embeddings.
Reliability: If the Ranker fails, fall back to a "Popularity-based" heuristic ranker.
Evaluation Pipeline
Offline:
AUC-ROC: Measures ranking quality.
LogLoss: Measures prediction accuracy.
PR-Curve: Better for highly imbalanced ad data.
Online: A/B Testing framework. We split traffic by
User_ID and measure Lift in Revenue and Ad Relevance (CTR).Monitoring Pipeline
System Metrics: Monitoring throughput (QPS) and latency (P99).
ML Metrics:
Feature Drift: Monitor if the mean of "User_Age" changes significantly.
Prediction Drift: Monitor if the average pCTR drops (model decay).
Label Freshness: Alert if the conversion data pipeline is delayed by > 4 hours.
Wrap Up
Final Evaluation
Observability: Use Population Stability Index (PSI) to detect distribution shifts between training and serving.
Edge Cases:
Cold Start: Use an \epsilon$-greedy approach or UCB (Upper Confidence Bound) to give new ads a minimum number of impressions.
Ad Exhaustion: Penalize ads shown to the same user too many times in a session (Frequency Capping).
Trade-offs:
Accuracy vs. Latency: A 10-layer Transformer might be 2% more accurate but exceed the 100ms budget. We chose Wide & Deep for its balance.
Distinguishing Insights: In a real-world system, Auction Dynamics (GSP - Generalized Second Price or VCG) matter as much as the ML model. The ML engineer must ensure the pCTR is calibrated, or the auction will be inefficient.