The Question
ML DesignAd Ranking and Revenue Optimization System
Design a high-scale, low-latency ads ranking system capable of selecting the most relevant ads from a corpus of 100M+ candidates. The design must address the dual goals of maximizing platform revenue (eCPM) and user experience. Specifically, detail the multi-stage retrieval and ranking pipeline, strategies for handling extreme feature sparsity, techniques for model calibration to support fair auctions, and mechanisms to mitigate position bias and handle delayed feedback from conversion events. Discuss the system architecture required to support 50k+ QPS with sub-100ms P99 latency while ensuring offline-online feature consistency.
DeepFM
Two-Tower Model
Isotonic Regression
Negative Downsampling
FAISS
Kafka
Flink
Spark
Feast
Triton Inference Server
Questions & Insights
Clarifying Questions
Business Goal: What is the primary North Star metric?
Assumption*: We aim to maximize Expected Revenue (eCPM)**, which is a function of Bid \times pCTR \times pCVR, while maintaining user relevance.
Constraints & Scale: What is the scale of the system?
Assumption: 100M+ candidate ads, 500M+ users, 50k QPS, and a strict P99 latency budget of <100ms for the ranking stage.
Data Freshness: How quickly must the system respond to new ads or user actions?
Assumption: New ads must be rankable within minutes (cold start), and user feedback (clicks) should update features in near real-time.
Edge Cases: How do we handle budget exhaustion and cold start?
Assumption: We assume an upstream service handles advertiser budget pacing; our focus is purely on the ranking and auction integrity.
Thinking Process
Identify the Funnel: With 100M ads, a single-stage ranker is impossible due to latency. I need a multi-stage approach: Retrieval (narrow 100M to 1k) and Ranking (narrow 1k to top-N).
The Core ML Problem: This is a multi-task learning problem. We need to predict P(click) and P(conversion). Raw probabilities must be calibrated because they are used in a downstream auction (GSP or VCG).
Feature Sparsity: AdTech data is notoriously sparse (UserIDs, AdIDs). Embeddings and cross-features (User \times Ad Category) are critical.
Scaling Strategy: Use a Feature Store for low-latency retrieval and a distributed training setup (e.g., Parameter Server or Horovod) for the massive embedding tables.
Elite Bonus Points
Calibration for Bidding Accuracy: Raw model outputs are often biased. I would implement Isotonic Regression or Platt Scaling as a post-processing step to ensure that a pCTR of 0.1 actually means a 10% click probability, which is vital for fair auctions.
Position Bias Mitigation: In historical logs, ads at the top get more clicks regardless of quality. I'll use a Position Feature during training (but set it to a default value during inference) or a shallow tower to learn and subtract the position bias.
Delayed Feedback Handling: Conversions often happen days after a click. I would implement a Negative Sampling with Correction or a FNC (False Negative Calibration) approach to avoid penalizing recent clicks that haven't converted yet.
Online Learning / Incremental Updates: Use a "streaming" training loop (e.g., Flink + Parameter Server) to update embedding layers every few minutes to capture trending ad creatives.
Design Breakdown
Requirements
Product Goal: Maximize revenue for the platform while delivering high ROI for advertisers.
Success Metrics:
Online Metrics: eCPM (Revenue per 1k impressions), CTR, Conversion Rate, Ad Relevance Score.
Offline Metrics: AUC (for CTR), LogLoss (for calibration accuracy), PR-AUC (for imbalanced conversion labels).
Guardrail Metrics: P99 Latency, Training-Serving Skew, Ad Fatigue (frequency capping).
System Constraints: 100ms inference, high availability (99.99%), support for billion-scale embedding tables.
Data Availability: Real-time clickstream logs, historical conversion data, advertiser metadata, user profile store.
ML Problem Framing
ML Task Type: Binary Classification (Pointwise Ranking).
Prediction Target: Two heads: P(\text{click} | u, a, c) and P(\text{conv} | \text{click}, u, a, c).
Inputs:
User: ID, demographics, last 10 ads clicked, search history.
Ad (Item): AdID, CampaignID, Creative Category, Image/Text embeddings.
Context: Device, Time of day, Page/Slot ID, Geo.
Outputs: Calibrated probabilities used to calculate eCPM = Bid \times pCTR \times pCVR.
ML Challenges: High class imbalance (CTR < 1%), extreme feature sparsity, and feedback loops (only showing what we think is good).
Design Summary & MVP
Concise Summary: A two-stage system: 1) A Two-Tower Retrieval model for generating candidates, and 2) A DeepFM or Wide & Deep ranker that predicts calibrated CTR and CVR.
Model Architecture & Selection:
Baseline Model: Logistic Regression with heavy feature engineering (cross-products).
Target Model: DeepFM. It shares the same embedding input for its "wide" (FM) and "deep" (DNN) components, allowing it to learn both low-order and high-order feature interactions without manual feature crossing.
Simplicity Audit: We use DeepFM for the MVP because it eliminates the need for manual cross-feature engineering which is brittle and time-consuming. We avoid Transformers/Graph Nets initially to stay within the 100ms latency budget.
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: Mobile/Web SDKs for events; Advertiser Portal for ad metadata.
Data Ingestion: Kafka for real-time events (clicks, impressions). Airflow for batch metadata.
Data Storage: S3 for raw logs. Iceberg for the structured data lake to allow for schema evolution and time-travel (crucial for debugging).
Data Processing: Spark for joining clicks and impressions (attribution windowing). Flink for real-time aggregation of user-ad frequency counts.
Data Quality: De-duplication of events (at-least-once Kafka delivery) and validation of "Bid" values to prevent training on corrupted auction data.
Feature Pipeline
Feature Engineering:
Categorical: Hash-encoding for AdID/UserID.
Continuous: Log-transforming historical CTRs.
Real-time: Number of times the user saw this ad in the last hour (to prevent over-exposure).
Feature Store: Use Feast or Tecton.
Offline: Provides point-in-time joins to prevent data leakage (using the timestamp of the impression).
Online: Provides <10ms lookup for user-state (e.g., last 5 categories visited).
Skew Mitigation: Use a shared code library for feature transformation in both Spark (offline) and the Serving layer (online).
Model Architecture
Problem Formulation: Pointwise Multi-task Learning. Task_1: P(click), Task_2: P(conv).
Candidate Models:
Wide & Deep: Good, but requires manual wide-part engineering.
DeepFM: Preferred. The FM layer learns 2nd-order interactions automatically via inner products of embeddings.
Architecture Design:
Embedding Layer: 128-dim vectors for sparse IDs.
FM Layer: Models pairwise feature interactions.
Deep Layer: 3-layer MLP (1024, 512, 256) to capture non-linear relationships.
Complexity vs. Latency: To meet <100ms, we apply Weight Quantization (INT8) and use NVidia Triton with ONNX runtime for inference.
Training Pipeline
Dataset Construction:
Negative Sampling: Impressions without clicks are negatives. Since CTR is low (~0.1%), we downsample negatives to 10% to speed up training and apply a weight correction w = 1/downsample\_rate in the loss function.
Data Splitting: Temporal Split. Train on days 1-28, validate on day 29, test on day 30. Random splits lead to look-ahead bias.
Retraining: Daily batch retraining on the latest 30 days of data. Weekly "full-shuffle" training to refresh all embeddings.
Serving Pipeline
Pattern: Two-stage.
Retrieval (Recall): ANN search (using FAISS) on embeddings or simple filtering (Geo + Keywords) to get 1,000 candidates.
Ranking (Precision): DeepFM scores the 1,000 candidates.
Reliability: If the DeepFM service times out, fall back to a "Heuristic Ranker" (Rank by historical Ad-level CTR).
Scalability: Horizontal scaling of inference nodes via K8s. Use gRPC for low-overhead communication between the Ad-Server and the Ranking-Service.
Evaluation Pipeline
Offline: AUC is the primary metric for ranking quality. LogLoss measures how close predicted probabilities are to 0/1. Expected Calibration Error (ECE) to monitor bidding integrity.
Online: A/B Testing. Group users by UUID. Metric: Revenue per User (ARPU) and Ad Relevance (Negative feedback rate).
Monitoring Pipeline
Data Monitoring: Track Population Stability Index (PSI) of incoming features. If the distribution of "Country" features shifts significantly, trigger an alert.
Model Monitoring: Monitor the Mean Predicted CTR vs. Actual CTR. If the model starts over-predicting, the auction will over-charge advertisers, leading to churn.
Wrap Up
Final Evaluation
Observability: Real-time dashboards showing pCTR vs. Observed CTR per campaign.
Feedback Loop: Clicks are fed back into the Feature Store within seconds to update "User-Last-Clicked-Category" features.
Trade-offs:
Accuracy vs. Latency: We chose DeepFM over a heavy Transformer to ensure the 100ms budget is met.
Freshness vs. Stability: Daily retraining strikes a balance; real-time embedding updates (Online Learning) are deferred to V2 to avoid system complexity.
Distinguishing Insight: In Ads, Position Bias is the silent killer of AUC. By including position as a feature during training but setting it to "Position 1" for all candidates during inference, we effectively rank ads based on their "intrinsic quality" rather than where the previous model placed them.