The Question

Evaluation System for Large-Scale Recommendation Models

Design a high-scale evaluation and experimentation platform for a recommendation system (e.g., Amazon or YouTube). Your system must handle the end-to-end lifecycle: from offline backtesting using historical logs and counterfactual techniques to online A/B testing and shadow deployment. Address specific challenges such as selection bias in offline data, delayed feedback for conversion labels, and ensuring consistency between training and serving features. Explain how you would measure success using both ML-specific metrics (NDCG, AUC, Calibration) and business KPIs, while maintaining a strict P99 latency SLA for production traffic.

Tags: MMoE, XGBoost, Spark, Flink, Kafka, FAISS, HNSW, IPS, Feature Store, Thompson Sampling, AUC, NDCG

Questions & Insights

Clarifying Questions

Clarifying Questions & Constraints:

Business Goal: Is the primary focus on improving relevance (NDCG), engagement (CTR), or long-term value (LTV/Retention)? Assumption: We focus on a recommendation system aiming to maximize CTR and Conversion Rate (CVR).

Constraints & Scale: What is the scale of evaluation? How many models are being compared simultaneously? Assumption: 100M+ users, 10M+ items, evaluating 5-10 concurrent experiments.

Label Delay: How long does it take to get ground truth (e.g., a "purchase" event)? Assumption: Clicks are near real-time (seconds), conversions can take up to 24 hours.

Edge Cases: How do we handle "selection bias" (we only observe feedback on items that were actually shown to users)? Assumption: We log propensity scores at serving time or collect randomized exploration traffic.

Assumptions:

A P99 latency requirement of 100ms for online serving.

Offline evaluation must be performed on a 30-day rolling window of historical logs.

We are designing for a large-scale e-commerce or content platform.

Thinking Process

Identify the Evaluation Loop: Evaluation isn't a single step; it's a feedback loop. I need to design a system that bridges the gap between offline proxy metrics (AUC/Loss) and online business metrics (Revenue/CTR).

Addressing Data Leakage: The biggest pitfall in evaluation is temporal leakage. I must enforce time-based splitting in the training and evaluation pipelines.

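Handling Bias: Recommendation systems suffer from "rich get richer" loops, because logs only contain feedback on items the current model chose to show. I need to incorporate counterfactual evaluation techniques to estimate how a new model would perform on items the logging model rarely or never surfaced.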

Scaling Evaluation: Running full-scale A/B tests is expensive and slow. I should implement a "Multi-Stage Evaluation" funnel: Offline (Fast/Cheap) -> Replay/Counterfactual (Medium) -> Interleaving/Shadow (Safe) -> A/B Test (Final).

Elite Bonus Points

Counterfactual Evaluation (IPS): Using Inverse Propensity Scoring (IPS) to de-bias offline evaluation, allowing us to estimate the performance of a new policy using logs from an old policy.

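Interleaving: Using Team Draft Interleaving or Balanced Interleaving to compare two ranking models within a single user session. Because each user acts as their own control, interleaving typically needs far fewer sessions than a traditional A/B test to detect a ranking preference; reductions of 10-100x are commonly cited, though the actual gain depends on the product and the metric.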
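As a rough illustration of Team Draft Interleaving, the sketch below merges two rankings by letting the two "teams" alternate picks, with a coin flip whenever they are tied, and then credits clicks to the team that contributed the clicked item. The function names and the plain-list data model are assumptions made for this example, not part of any specific library.

```python
import random

def team_draft_interleave(ranking_a, ranking_b, k, rng=None):
    """Merge two rankings; return (interleaved list, item -> team map)."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    picks = {"A": 0, "B": 0}
    interleaved, team_of = [], {}
    while len(interleaved) < k:
        # The team with fewer picks goes first; ties are broken by a coin flip.
        if picks["A"] != picks["B"]:
            order = ["A", "B"] if picks["A"] < picks["B"] else ["B", "A"]
        else:
            order = rng.sample(["A", "B"], 2)
        for team in order:
            # Each team contributes its highest-ranked item not yet placed.
            candidate = next((i for i in rankings[team] if i not in team_of), None)
            if candidate is not None:
                interleaved.append(candidate)
                team_of[candidate] = team
                picks[team] += 1
                break
        else:
            break  # both rankings are exhausted
    return interleaved, team_of

def credit_clicks(clicked_items, team_of):
    """Count clicks per team; the team with more clicks wins the session."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team_of:
            wins[team_of[item]] += 1
    return wins
```

Aggregating the per-session winners over many sessions gives a preference signal between the two rankers; a sign test or binomial test on the win counts is one common way to check significance.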

Delayed Feedback Modeling: Implementing an "expectation-maximization" or "importance sampling" approach to handle labels (like conversions) that arrive long after the prediction was made.

Calibration Metrics: Beyond AUC/NDCG, tracking the "Expected Calibration Error" (ECE) to ensure predicted probabilities match real-world frequencies, which is critical for downstream bidding systems.
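A minimal sketch of Expected Calibration Error with equal-width bins is shown here. The bin count of 10 and the function name are assumptions for the example; production systems may prefer quantile bins when predicted click probabilities are concentrated near zero.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Weighted average of |observed rate - mean predicted probability| per bin."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Digitizing against the interior edges yields bin ids in 0..n_bins-1.
    bin_ids = np.digitize(y_prob, edges[1:-1])
    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(y_true[mask].mean() - y_prob[mask].mean())
            ece += mask.mean() * gap  # weight by the share of samples in the bin
    return float(ece)
```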

Design Breakdown

Requirements

Product Goal: Increase user engagement and revenue by identifying and deploying higher-performing ranking models faster.

Success Metrics:

Online Metrics: CTR, Conversion Rate, Average Revenue per User (ARPU).

Offline Metrics: AUC-PR, Log-Loss, NDCG@K, MRR.

Guardrail Metrics: P99 Latency, Model Drift (PSI), CPU/Memory utilization.

System Constraints: Support for 100k QPS, daily retraining, and sub-100ms inference.

Data Availability: User interaction logs (clicks, skips, buys), item metadata, user profiles.

ML Problem Framing

ML Task Type: Ranking (Pointwise/Pairwise) and Binary Classification.

Prediction Target: P(\text{click} \mid \text{user}, \text{item}, \text{context}) and P(\text{conversion} \mid \text{click}, \text{user}, \text{item}, \text{context})

Inputs:

User: ID, history (last 10 items), demographics.

Item: ID, category, price, visual embeddings.

Context: Device, time, location.

Outputs: A ranked list of items or a calibrated probability score.

ML Challenges: Selection bias (feedback loops), extreme class imbalance (conversions are rare), and feature/label temporal alignment.

Design Summary & MVP

Concise Summary: A two-stage evaluation system featuring an offline "Counterfactual Engine" for rapid iteration and an online "A/B Platform" with automated significance testing.

Model Architecture & Selection:

Baseline Model: Logistic Regression or XGBoost on tabular features.

Target Model: Multi-Gate Mixture-of-Experts (MMoE) to optimize for both CTR and CVR simultaneously.

Choice Rationale: MMoE handles multi-objective optimization better than separate models and shares feature representations, reducing serving costs.

ML Life Cycle Summary: Logs are processed into features; models are trained on historical data; offline metrics are calculated; models are deployed to shadow mode, then A/B tested, and continuously monitored for drift.

Simplicity Audit: MVP focuses on offline AUC/NDCG and simple A/B testing. Advanced counterfactuals are added only after the basic pipeline is stable.

Architecture Decision Rationale:

Why this architecture?: Decoupling evaluation from serving ensures that evaluation logic (which is compute-heavy) doesn't impact user-facing latency.

Requirement Satisfaction: Meets scale via distributed Spark/Flink processing and ensures reliability through shadow serving.
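The "automated significance testing" mentioned in the summary could, in its simplest form, be a two-proportion z-test on CTR between control and treatment. The sketch below is one such test under the usual large-sample assumptions; the function name is hypothetical, and a real platform would also need to handle multiple comparisons and repeated looks at the data.

```python
import math

def two_proportion_z_test(clicks_a, n_a, clicks_b, n_b):
    """Return (z, two-sided p-value) for the CTR difference B minus A."""
    p_a, p_b = clicks_a / n_a, clicks_b / n_b
    # Pooled rate under the null hypothesis that both arms share one CTR.
    p_pool = (clicks_a + clicks_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1.0 - p_pool) * (1.0 / n_a + 1.0 / n_b))
    z = (p_b - p_a) / se
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value
```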

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Mobile/Web application logs capturing impression, click, and purchase events.

Data Ingestion: Kafka for real-time ingestion; Airflow for daily batch ETL.

Data Storage: S3/Data Lake for raw logs (Parquet format, partitioned by date/hour); Snowflake/BigQuery for structured analytical queries.

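Data Processing: Spark for joining clicks with purchases. Crucial step: Attribution Windowing (e.g., attributing a purchase to a click only if it happens within 7 days of that click). The window must be at least as long as the assumed conversion delay (up to 24 hours here).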
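The attribution join can be sketched on a small scale with pandas; the same logic would carry over to a Spark join. The column names (user_id, item_id, click_ts, purchase_ts) and the rule that every click within the window counts as converted are assumptions for this example; last-click attribution would need an extra deduplication step.

```python
import pandas as pd

def label_clicks_with_conversions(clicks, purchases, window="7D"):
    """Mark a click as converted if the same user bought the same item within the window."""
    window = pd.Timedelta(window)
    keys = ["user_id", "item_id"]
    pairs = clicks.merge(purchases, on=keys, how="inner")
    delay = pairs["purchase_ts"] - pairs["click_ts"]
    # Keep purchases that happen after the click and inside the attribution window.
    attributed = pairs[(delay >= pd.Timedelta(0)) & (delay <= window)]
    converted = attributed[keys + ["click_ts"]].drop_duplicates().assign(converted=1)
    labeled = clicks.merge(converted, on=keys + ["click_ts"], how="left")
    labeled["converted"] = labeled["converted"].fillna(0).astype(int)
    return labeled
```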

Data Quality: Great Expectations for schema validation. Check for "null" rates in critical features like user_id.

Feature Pipeline

Feature Definition:

User: Long-term interests (30 days), short-term session behavior (last 5 items).

Item: Popularity (CTR over last 1 hour), category embeddings.

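Feature Engineering: Target encoding for categorical variables, with smoothing to stabilize rare categories and with the encoding statistics computed only from data that precedes each training example, so that labels do not leak into features.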

Offline/Online Consistency: Use a Feature Store (e.g., Tecton or Feast) to ensure that the logic used to compute "User Last 5 Items" is identical in training (point-in-time join) and serving (real-time KV store).
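A point-in-time join can be illustrated with pandas' merge_asof, which attaches to each training event the most recent feature value recorded strictly before the event. This is only a sketch of the idea that a feature store implements at scale; the column names (event_ts, feature_ts, user_id) are assumptions.

```python
import pandas as pd

def point_in_time_join(events, feature_log):
    """Attach the latest feature row written strictly before each event."""
    events = events.sort_values("event_ts")
    feature_log = feature_log.sort_values("feature_ts")
    return pd.merge_asof(
        events,
        feature_log,
        left_on="event_ts",
        right_on="feature_ts",
        by="user_id",
        direction="backward",       # only look at the past
        allow_exact_matches=False,  # exclude values written at the event instant
    )
```

Excluding exact timestamp matches is a conservative choice: a feature written at the same instant as the event may already include the event itself.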

Model Architecture

Problem Formulation: Multi-task learning (MTL).

Target Model: MMoE (Multi-gate Mixture-of-Experts).

Shared Bottom: Embedding layers for users and items.

Experts: Multiple DNN experts to learn common patterns.

Towers: Two separate towers for Click prediction (Binary Cross-Entropy) and Conversion prediction (Binary Cross-Entropy).

Complexity: DNN with ~100M parameters. Store embeddings in FP16 (half precision) to roughly halve their memory footprint relative to FP32.
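To make the MMoE structure concrete, here is a forward-pass sketch in NumPy with single-layer experts and linear towers. The layer sizes, the initialization, and the helper names are illustrative assumptions; a real model would use a deep-learning framework, deeper experts and towers, and embedding lookups in front of the input.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def init_mmoe(rng, d_in, d_expert, n_experts, n_tasks):
    """Random weights for shared experts, one gate per task, one tower per task."""
    return {
        "experts": rng.normal(0.0, 0.1, (n_experts, d_in, d_expert)),
        "gates": rng.normal(0.0, 0.1, (n_tasks, d_in, n_experts)),
        "towers": rng.normal(0.0, 0.1, (n_tasks, d_expert)),
    }

def mmoe_forward(params, x):
    """x: (batch, d_in) -> probabilities of shape (batch, n_tasks)."""
    # Every expert sees the same input: (batch, n_experts, d_expert), ReLU-activated.
    expert_out = np.maximum(0.0, np.einsum("bi,eih->beh", x, params["experts"]))
    preds = []
    for gate_w, tower_w in zip(params["gates"], params["towers"]):
        gate = softmax(x @ gate_w)                         # task-specific expert weights
        mixed = np.einsum("be,beh->bh", gate, expert_out)  # weighted sum of experts
        preds.append(sigmoid(mixed @ tower_w))             # task tower -> probability
    return np.stack(preds, axis=1)
```

With n_tasks=2, column 0 can be read as the click tower and column 1 as the conversion tower; each would be trained with its own binary cross-entropy loss.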

Training Pipeline

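Dataset Construction: Treat impressions that were not clicked as negatives and downsample them to control class imbalance. To handle Label Delay for conversions, we either use a "waiting window" (train only on examples old enough for their label to be final) or apply "importance weighting" to recent data whose labels may still change.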
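The "waiting window" approach can be sketched as a filter that keeps only clicks old enough for their conversion label to be final. The column names and the 24-hour default (taken from the label-delay assumption earlier) are assumptions for this example.

```python
import pandas as pd

def build_cvr_labels(clicks, now, window="24h"):
    """Keep clicks whose conversion window has closed and label them."""
    window = pd.Timedelta(window)
    # A click is "mature" once the full window has elapsed since it happened.
    mature = clicks[clicks["click_ts"] <= now - window].copy()
    delay = mature["conversion_ts"] - mature["click_ts"]
    # Missing conversions (NaT) compare as False and therefore become label 0.
    mature["label"] = ((delay >= pd.Timedelta(0)) & (delay <= window)).astype(int)
    return mature
```

The cost of this approach is freshness: the most recent day of clicks never reaches training, which is what the importance-weighting alternative tries to avoid.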

Data Splitting: Time-based split. Train on days 1-28, validate on day 29, test on day 30. Never use random split for time-series/recommendation data.

Retraining: Weekly full retraining; daily incremental "warm-start" training to capture new item trends.
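The time-based split described above (28 days train, 1 day validation, 1 day test) could look like the following pandas sketch. The timestamp column name is an assumption, and days are counted from the earliest day present in the logs.

```python
import pandas as pd

def time_based_split(logs, ts_col="event_ts", train_days=28, val_days=1, test_days=1):
    """Split logs chronologically so no future data reaches training."""
    day = logs[ts_col].dt.floor("D")
    train_end = day.min() + pd.Timedelta(days=train_days)
    val_end = train_end + pd.Timedelta(days=val_days)
    test_end = val_end + pd.Timedelta(days=test_days)
    train = logs[day < train_end]
    val = logs[(day >= train_end) & (day < val_end)]
    test = logs[(day >= val_end) & (day < test_end)]
    return train, val, test
```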

Serving Pipeline

Serving Pattern: Online inference for personalized ranking.

Latency Optimization:

Two-Stage Ranking: Retrieval (Top 1000 items via FAISS/HNSW) -> Ranking (Top 50 via MMoE).

Model Pruning: Remove experts that the gates rarely weight, reducing compute per request for lower latency.

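Reliability: Fall back to a "Popularity-based" heuristic model if the MMoE service exceeds its request timeout or returns 5xx. The timeout has to sit below the 100ms P99 budget so that the fallback response still arrives within the SLA.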
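One way to express the fallback rule is an asyncio wrapper with a hard timeout around the ranking call. Everything named here is hypothetical: rank_fn stands in for the MMoE service client, RankerUnavailable stands in for a 5xx response, and the 80ms default is an assumed value chosen only to leave headroom inside a 100ms budget.

```python
import asyncio

class RankerUnavailable(Exception):
    """Hypothetical error raised by the client when the ranker returns 5xx."""

async def rank_with_fallback(rank_fn, fallback_fn, request, timeout_s=0.08):
    """Return model rankings, or the fallback list on timeout or server error."""
    try:
        # wait_for cancels the ranking call once the timeout elapses.
        return await asyncio.wait_for(rank_fn(request), timeout=timeout_s)
    except (asyncio.TimeoutError, RankerUnavailable):
        # The fallback must be cheap and local, e.g. a cached popularity list.
        return fallback_fn(request)
```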

Evaluation Pipeline

Offline Evaluation:

Metric: NDCG@10 (Relevance), AUC (Discrimination).
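For reference, NDCG@K for a single ranked list can be computed as below. This sketch uses the exponential-gain form of DCG, which is one common convention (a linear-gain variant also exists); the function names are assumptions.

```python
import numpy as np

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the first k positions."""
    rel = np.asarray(relevances, dtype=float)[:k]
    discounts = np.log2(np.arange(2, rel.size + 2))  # log2(position + 1)
    return float(np.sum((2.0 ** rel - 1.0) / discounts))

def ndcg_at_k(relevances_in_ranked_order, k=10):
    """DCG of the model's ordering divided by DCG of the ideal ordering."""
    ideal = sorted(relevances_in_ranked_order, reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    if ideal_dcg == 0.0:
        return 0.0  # no relevant items in this list
    return dcg_at_k(relevances_in_ranked_order, k) / ideal_dcg
```

The reported offline number would then be the mean of ndcg_at_k over all evaluation sessions.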

Calibration: Reliability diagrams to ensure that a predicted P(\text{click}) = 0.1 actually results in about 10% clicks.

Counterfactual Replay: Re-rank historical sessions with the new model and use IPS to calculate the expected reward without showing the new model to users.
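The IPS estimate behind counterfactual replay can be sketched as follows. It assumes each logged impression carries the reward, the probability the logging policy assigned to the shown item, and the probability the new policy would assign to that same item. The weight clipping threshold of 10 is an assumed variance-control setting, and the self-normalized variant (SNIPS) is included as a commonly used lower-variance alternative.

```python
import numpy as np

def ips_estimates(rewards, logging_propensities, target_propensities, clip=10.0):
    """Return (clipped IPS, self-normalized IPS) estimates of the new policy's reward."""
    r = np.asarray(rewards, dtype=float)
    w = np.asarray(target_propensities, dtype=float) / np.asarray(logging_propensities, dtype=float)
    w = np.minimum(w, clip)  # clipping trades a little bias for lower variance
    ips = float(np.mean(w * r))
    snips = float(np.sum(w * r) / np.sum(w))
    return ips, snips
```

The estimate is only trustworthy where the logging policy gave non-zero probability to the items the new policy wants to show, which is why logged propensities and some exploration traffic are prerequisites.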

Monitoring Pipeline

System Monitoring: Prometheus/Grafana for QPS, Latency, and Error rates.

Model Monitoring:

Population Stability Index (PSI): Detect when the distribution of input features (e.g., age) shifts.

Prediction Drift: Monitor if the average predicted CTR shifts significantly from historical norms.

Ground Truth Feedback: Join serving logs with Kafka click-streams to calculate real-time "Online AUC" for the production model.
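The PSI check listed under model monitoring can be sketched for a single numeric feature as below. Bin edges come from quantiles of the baseline sample; the bin count and the small epsilon that guards against empty bins are assumptions for the example.

```python
import numpy as np

def population_stability_index(expected, actual, n_bins=10, eps=1e-6):
    """PSI between a baseline sample (expected) and a recent sample (actual)."""
    expected = np.asarray(expected, dtype=float)
    actual = np.asarray(actual, dtype=float)
    # Quantile edges from the baseline give roughly equal-mass bins.
    edges = np.quantile(expected, np.linspace(0.0, 1.0, n_bins + 1))
    interior = edges[1:-1]
    e_counts = np.bincount(np.digitize(expected, interior), minlength=n_bins)
    a_counts = np.bincount(np.digitize(actual, interior), minlength=n_bins)
    e_frac = np.clip(e_counts / e_counts.sum(), eps, None)
    a_frac = np.clip(a_counts / a_counts.sum(), eps, None)
    return float(np.sum((a_frac - e_frac) * np.log(a_frac / e_frac)))
```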

Wrap Up

Final Evaluation

Observability: We track the "Delta" between offline AUC and online CTR. If offline AUC goes up but online CTR goes down, we have an Offline/Online Mismatch (usually caused by leakage or feedback loops).

Trade-offs:

Accuracy vs. Latency: MMoE adds serving complexity over a simple DNN; the expected 2-3% revenue lift is an assumption that the A/B test has to confirm.

Exploration vs. Exploitation: Use a small % of traffic for Thompson Sampling or Epsilon-greedy to gather unbiased data for future evaluation.
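An epsilon-greedy exploration step that also records the propensity of the chosen item might look like this sketch. Logging the propensity is what makes the resulting data usable for IPS-style evaluation later. The function name, the single-slot simplification, and the 5% default exploration rate are assumptions.

```python
import random

def epsilon_greedy_slot(ranked_items, epsilon=0.05, rng=None):
    """Choose the item for one slot; return (item, propensity of that choice)."""
    rng = rng or random.Random()
    n = len(ranked_items)
    if rng.random() < epsilon:
        chosen = rng.choice(ranked_items)  # explore: uniform over candidates
    else:
        chosen = ranked_items[0]           # exploit: the model's top item
    # Every item can be drawn by exploration (epsilon / n); the top item
    # additionally receives the exploitation mass (1 - epsilon).
    propensity = epsilon / n
    if chosen == ranked_items[0]:
        propensity += 1.0 - epsilon
    return chosen, propensity
```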

Failure Modes: Cold-start items. Solution: Use content-based embeddings (from item titles/images) to rank items with zero history.