DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
ML Design

Large-Scale Online Ads Ranking and Auction System

Design a high-throughput ads ranking system for a social media platform with 500M DAU. The system must select the most relevant ads from a corpus of 10M candidates within a 100ms P99 latency budget. Focus on the end-to-end ML lifecycle: from real-time feature engineering (handling user signals) and multi-stage funnel architecture (retrieval and ranking), to specialized ad-tech challenges like position bias, click/conversion calibration for auctions, and handling delayed feedback loops in conversion data. Discuss the trade-offs between model complexity and serving constraints, and how to ensure the system maximizes long-term revenue (eCPM) while maintaining advertiser trust through accurate ROI predictions.
Wide & Deep
DeepFM
Two-Tower Model
FAISS
HNSW
Kafka
Flink
Spark
Redis
Platt Scaling
Isotonic Regression
Multi-Task Learning
Questions & Insights

Clarifying Questions

Business Goal: Is the primary objective maximizing Revenue (eCPM), Click-Through Rate (CTR), or Advertiser ROI?
Assumption*: We aim to maximize eCPM** (expected Cost Per Mille), which is calculated as Bid \times pCTR \times pCVR.
Constraints & Scale: What is the scale of the system (DAU, QPS) and the latency budget?
Assumption*: 500M DAU, 100k peak QPS, and a strict P99 latency of 100ms** for the entire ranking pipeline.
Corpus Size: How many active ads are in the pool at any given time?
Assumption: 10M active ads.
Cold Start: How do we handle new advertisers or new users?
Assumption: We need a strategy for ads with no historical data to ensure exploration.
Freshness: How quickly must user actions (clicks/converts) influence the ranking?
Assumption: Near real-time (minutes) for feature updates to capture session intent.

Thinking Process

Identify the Bottleneck: With 10M ads and a 100ms budget, a single-stage complex model is impossible. I must use a multi-stage funnel: Retrieval (Candidate Generation) followed by Ranking.
Model Selection: For the MVP Ranking stage, I need to handle sparse categorical features (User ID, Ad ID) and dense features (historical CTR). A Deep Neural Network (DNN) with an embedding layer is the standard "Workhorse."
Focus on the Objective: Ads ranking isn't just about CTR; it's about the auction. The ML model provides the probabilities (pCTR, pCVR), and the system ranks by Bid \times pCTR \times pCVR.
Data Engineering: In AdTech, data freshness and label attribution (handling delayed conversions) are the biggest engineering hurdles.

Elite Bonus Points

Calibration: Raw pCTR from models is often biased. I would implement Platt Scaling or Isotonic Regression to ensure E[predictions] \approx E[labels], which is critical for fair auctions and billing.
Position Bias Modeling: Users are more likely to click the top ad regardless of relevance. I'll use a shallow "Bias Tower" during training (using position as a feature) but remove it during inference to get the true relevance score.
Delayed Feedback Handling: Conversions often happen days after a click. I’ll implement a Negative Sampling with Correction or Fake Negative Calibration strategy to prevent the model from becoming overly pessimistic about recent ads.
Counterfactual Evaluation: Using Inverse Probability Weighting (IPW) to evaluate how the model would have performed on a different policy, mitigating the "selection bias" of offline logs.
Design Breakdown

Requirements

Product Goal: Connect users with relevant ads to maximize platform revenue and advertiser value.
Success Metrics:
Online: Revenue per Mille (RPM), CTR, Conversion Rate (CVR).
Offline: AUC-ROC (for CTR), LogLoss, Calibration Error.
Guardrail: P99 Latency < 100ms, Model Training Time, Data Freshness (Latency of feature pipeline).
System Constraints: 100k QPS, 10M items, 100ms SLA.
Data Availability: User profile, Ad metadata, Real-time click/conversion streams, Search/Contextual logs.

ML Problem Framing

ML Task Type: Binary Classification (Predicting click: 0 or 1).
Prediction Target: P(\text{click} | \text{user, ad, context}).
Inputs:
User ID (embedding), age, gender, historical categories clicked, last 5 ads viewed.
Ad: Creative ID, Advertiser ID, Campaign ID, Category, Historical CTR (1h, 1d, 7d).
Context: Device, Time of day, App/Page ID, Geo-location.
Outputs: A probability score \in [0, 1].
ML Challenges: Extreme class imbalance (CTR is often < 1%), extreme feature sparsity, and feedback loops (the model influences its own future training data).

Design Summary & MVP

Concise Summary: A two-stage architecture: Retrieval (Two-Tower Model for Top-1000) and Ranking (DeepFM or DNN for Top-50), finalized by an auction mechanism (Bid \times Score).
Model Architecture & Selection:
Baseline: Logistic Regression with hashed features.
Target Model: Wide & Deep or DeepFM.
Choice Rationale: Wide component handles memorization (specific user-ad pairs), while the Deep component handles generalization via embeddings.
Simplicity Audit: We avoid Reinforcement Learning or complex Transformers for the MVP to prioritize low-latency serving and stable training.
Architecture Decision Rationale:
The funnel approach satisfies the 100ms latency requirement.
DeepFM captures high-order feature interactions without manual feature engineering.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Mobile SDK logs (clicks/views), Backend Transaction DB (conversions), Ad Catalog (creatives).
Data Ingestion: Kafka for real-time events. Airflow for daily batch synchronization of advertiser metadata.
Data Storage: S3/Iceberg for the data lake. Partitioned by event_date and event_hour for efficient retrieval.
Data Processing: Spark for complex historical joins (labeling clicks with conversions). Flink for windowed aggregations (e.g., "how many times did this user click an ad in the last 10 minutes?").
Data Quality: De-duplication of click logs (preventing double-counting) and schema validation using Protobuf/Avro.

Feature Pipeline

Feature Engineering:
Categorical: Hashing trick for high-cardinality IDs.
Continuous: Log-transform of Bid and UserAge.
Temporal: Time-since-last-action features.
Offline Feature Pipeline: Pre-calculates daily historical CTRs for every ad/user pair and stores them in Hive/BigQuery.
Online Feature Pipeline: Uses Flink to update the user's "recent interest" vector in Redis (P99 < 5ms).
Feature Store: Tecton or Feast ensures that the features used during training (point-in-time) are identical to features used during serving.

Model Architecture

Problem Formulation: Multi-task learning (MTL) is preferred. Predict pCTR and pCVR simultaneously.
Core Model: Wide & Deep Neural Network.
Wide: Linear model for cross-product features (e.g., User_Language AND Ad_Language).
Deep: 3-4 layer MLP with embeddings for sparse IDs.
Embeddings: 64-dimensional embeddings for AdIDs and UserIDs, initialized randomly and learned during training.
Complexity: ~1M parameters. Model size ~500MB (mostly embedding tables). This fits easily in RAM for low-latency scoring.

Training Pipeline

Dataset Construction:
Labeling: "Click" = 1, "Impression without click" = 0.
Negative Sampling: Since CTR is low, we down-sample negatives (non-clicks) to balance the dataset and speed up training, then apply a logit adjustment during serving: f(x) = \sigma(logit(x) + \log(\text{sampling\_rate})).
Training Infrastructure: Horovod/PyTorch Distributed using Spot Instances for cost-efficiency.
Hyperparameter Tuning: Focus on embedding dropout and learning rate decay.
Retraining: Daily batch retraining on the last 30 days of data + Incremental online learning every 1 hour for the "Deep" weights to capture fresh trends.

Serving Pipeline

Two-Stage Architecture:
Retrieval: FAISS or HNSW (Approximate Nearest Neighbor) searching against user embeddings to find the top 1000 candidate ads.
Ranking: The DNN scores those 1000 ads.
Latency Optimization:
Model Quantization (FP16).
Remote feature fetching in parallel.
Caching top 10% of popular ads' embeddings.
Reliability: If the Ranker fails, fall back to a "Popularity-based" heuristic ranker.

Evaluation Pipeline

Offline:
AUC-ROC: Measures ranking quality.
LogLoss: Measures prediction accuracy.
PR-Curve: Better for highly imbalanced ad data.
Online: A/B Testing framework. We split traffic by User_ID and measure Lift in Revenue and Ad Relevance (CTR).

Monitoring Pipeline

System Metrics: Monitoring throughput (QPS) and latency (P99).
ML Metrics:
Feature Drift: Monitor if the mean of "User_Age" changes significantly.
Prediction Drift: Monitor if the average pCTR drops (model decay).
Label Freshness: Alert if the conversion data pipeline is delayed by > 4 hours.
Wrap Up

Final Evaluation

Observability: Use Population Stability Index (PSI) to detect distribution shifts between training and serving.
Edge Cases:
Cold Start: Use an \epsilon$-greedy approach or UCB (Upper Confidence Bound) to give new ads a minimum number of impressions.
Ad Exhaustion: Penalize ads shown to the same user too many times in a session (Frequency Capping).
Trade-offs:
Accuracy vs. Latency: A 10-layer Transformer might be 2% more accurate but exceed the 100ms budget. We chose Wide & Deep for its balance.
Distinguishing Insights: In a real-world system, Auction Dynamics (GSP - Generalized Second Price or VCG) matter as much as the ML model. The ML engineer must ensure the pCTR is calibrated, or the auction will be inefficient.