The Question
ML Design

Scalable ML Evaluation Framework

Design a comprehensive evaluation system for a high-scale recommendation engine. The system should address the gap between offline proxy metrics and online business objectives, handle selection bias in historical data, and provide a robust framework for A/B testing and long-term model performance monitoring.
MMoE
IPS
NDCG
AUC
PSI
Two-Tower Model
ScaNN
Kafka
Spark
Tecton
ONNX
MurmurHash
Questions & Insights

Clarifying Questions

Business Goal: Is the primary objective to maximize immediate engagement (CTR), long-term retention (LTV), or specific conversion events (Revenue)? Assumption: We aim for a multi-objective balance of CTR and Revenue.
Constraints & Scale: What is the scale of the system? (DAU, item corpus). Assumption: 100M DAU, 10M item corpus, and a P99 latency budget of 200ms.
Feedback Loop: How long is the label delay? (e.g., clicks are instant, purchases take minutes/hours). Assumption: Near real-time for clicks, 24-hour attribution window for conversions.
Evaluation Scope: Are we evaluating a single model or a complex multi-stage pipeline (Retrieval + Ranking)? Assumption: We are evaluating the end-to-end ranking pipeline.

Thinking Process

Identify the Core Conflict: The "Offline-Online Gap." Models often perform well on historical data (offline AUC/NDCG) but fail to move the needle on live business metrics (online CTR/Revenue) due to selection bias and distribution shift.
The Evaluation Hierarchy: I need to build a tiered evaluation strategy: Offline (Static) -> Replay/Counterfactual (Simulator) -> Shadow (Passive Online) -> A/B Test (Active Online).
Addressing Bias: Since we only have labels for items the previous model decided to show, I must implement Inverse Propensity Scoring (IPS) or exploration strategies to evaluate the "unseen" possibilities.
Scalability: With 100M users, A/B tests must be adequately powered to detect small effects with statistical significance. The experimentation engine must also assign users to buckets with deterministic, well-distributed hashing.
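A minimal sketch of deterministic user bucketing, assuming string user IDs and a per-experiment salt. SHA-256 from the Python standard library is swapped in here for a fast non-cryptographic hash such as MurmurHash3, purely so the example has no third-party dependency. The function names `assign_bucket` and `assign_variant` are hypothetical.

```python
import hashlib


def assign_bucket(user_id: str, salt: str, num_buckets: int = 100) -> int:
    """Map a user to a stable bucket in [0, num_buckets)."""
    # Assumption: SHA-256 stands in for MurmurHash3 to keep the sketch stdlib-only.
    digest = hashlib.sha256(f"{salt}:{user_id}".encode("utf-8")).digest()
    # Use the first 8 bytes as an unsigned integer, then reduce to a bucket index.
    return int.from_bytes(digest[:8], "big") % num_buckets


def assign_variant(user_id: str, salt: str, treatment_pct: int = 50) -> str:
    """Users whose bucket falls below treatment_pct receive the treatment."""
    return "treatment" if assign_bucket(user_id, salt) < treatment_pct else "control"
```

Changing the salt per experiment reshuffles users, so the same people do not end up in the treatment group of every test.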

Elite Bonus Points

Counterfactual Evaluation (IPS): Using Inverse Propensity Scoring to correct for selection bias in offline logs, allowing us to estimate how a new model would have performed on items it would have ranked highly.
Interleaving: Instead of a standard A/B test, use Team Draft Interleaving to compare two ranking algorithms in a single list. Interleaving has been reported to be 10-100x more sensitive than traditional A/B tests, so it needs far less traffic to detect a preference.
Delayed Feedback Modeling: Implementing an "observed-to-expected" (O/E) ratio calibration to handle labels that arrive 24+ hours late, preventing the model from under-predicting conversions for fresh data.
Sliced Evaluation: Moving beyond global metrics to evaluate "Head vs. Tail" items and "New vs. Power" users to ensure the model isn't just optimizing for the majority at the expense of niche growth.
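The IPS idea above can be sketched as follows. This is an illustrative estimator, not a production implementation: it assumes each logged record carries the probability (propensity) with which the logging policy showed the item, and it clips the importance weights to limit variance. The names `ips_estimate`, `target_policy_prob`, and `clip` are hypothetical.

```python
def ips_estimate(logs, target_policy_prob, clip=10.0):
    """Estimate the average reward a new policy would have earned on logged traffic.

    logs: iterable of (context, action, reward, logging_propensity) tuples.
    target_policy_prob: function (context, action) -> probability that the
        new policy would have shown `action` in `context`.
    clip: upper bound on the importance weight (trades a little bias for lower variance).
    """
    total = 0.0
    n = 0
    for context, action, reward, propensity in logs:
        # Re-weight each logged reward by how much more (or less) likely
        # the new policy is to take the logged action.
        weight = target_policy_prob(context, action) / propensity
        total += min(weight, clip) * reward
        n += 1
    return total / n if n else 0.0
```

For example, if the logging policy showed items "a" and "b" uniformly at random and only "a" earned clicks, a new policy that always shows "a" is estimated to earn a reward of 1.0 per impression.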
Design Breakdown

Requirements

Product Goal: Build a robust evaluation framework to reliably promote model candidates that improve user satisfaction and revenue.
Success Metrics:
Online Metrics: CTR, Conversion Rate (CVR), Revenue per Mille (RPM).
Offline Metrics: AUC-ROC (for classification), NDCG@K (for ranking), Log-Loss (for calibration).
Guardrail Metrics: P99 Latency, CPU/Memory utilization, and model calibration error (e.g., Expected Calibration Error, ECE).
System Constraints: 100M users, 50k QPS, strict 200ms P99 latency.
Data Availability: Real-time clickstream, historical purchase logs, and user/item metadata.
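As a reference point for the NDCG@K offline metric listed above, here is a minimal sketch. It assumes graded relevance labels (for example, 0 = no interaction, 1 = click) given in the order the model ranked the items, and it uses the common exponential-gain form of DCG. The function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k positions."""
    # Position i (0-based) is discounted by log2(i + 2), so rank 1 has discount 1.
    return sum((2 ** rel - 1) / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k):
    """DCG normalized by the DCG of the ideal (best possible) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0
```

A list with its only relevant item in first place scores 1.0, while the same item in third place scores 0.5.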

ML Problem Framing

ML Task Type: Multi-task Ranking (Binary Classification for Clicks and Purchases).
Prediction Target: $\text{Score} = w_1 \cdot P(\text{click}) + w_2 \cdot P(\text{purchase})$.
Inputs:
User: Historical IDs, category preferences, embedding of last 10 actions.
Item: Price, category, popularity (CTR over last hour), content embeddings.
Context: Device, Time of day, Page position.
ML Challenges: Position Bias (users click top items more regardless of relevance) and Selection Bias (we only have data for items we showed).

Design Summary & MVP

Concise Summary: A two-stage ranking system (Two-Tower Retrieval + Multi-gate Mixture-of-Experts Ranking) evaluated via a "Flighting" pipeline that moves from Offline Backtesting to Interleaving and finally A/B testing.
Model Architecture & Selection:
Baseline Model: Logistic Regression with basic cross-features.
Target Model: Multi-gate Mixture-of-Experts (MMoE) to handle the trade-off between CTR and Revenue.
Choice Rationale: MMoE shares a pool of experts across tasks while learning a task-specific "gate" for each objective, which often outperforms single-task models in multi-objective scenarios.
Major Pipelines:
Data Pipeline: Ingests raw clickstream/transaction events into a unified data lake.
Experimentation Pipeline: Manages user bucketing and metric aggregation for A/B and Shadow tests.
Simplicity Audit: We start with Offline AUC and a simple A/B test. We avoid Reinforcement Learning (RL) for evaluation initially to maintain interpretability.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Mobile/Web logs (Protobuf format) and Transaction DB (PostgreSQL).
Data Ingestion: Kafka for real-time events. Flink for sessionization to group clicks with the corresponding impressions.
Data Storage: S3 Parquet for raw logs (partitioned by date/hour). Iceberg/Delta Lake for ACID transactions on labels.
Data Processing: Spark jobs handle the "label joining" problem—associating a conversion that happens at 4 PM with an impression that happened at 10 AM.
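The label-joining logic can be illustrated in plain Python. This is a sketch under stated assumptions: a 24-hour attribution window, last-touch attribution (the conversion is credited to the most recent matching impression), and hypothetical record fields `user_id`, `item_id`, and `ts`. A real Spark job would express the same join at scale.

```python
from datetime import datetime, timedelta

ATTRIBUTION_WINDOW = timedelta(hours=24)  # assumption: matches the 24-hour window above


def join_labels(impressions, conversions, window=ATTRIBUTION_WINDOW):
    """Return a copy of each impression with a 0/1 'converted' label.

    impressions, conversions: lists of dicts with 'user_id', 'item_id', 'ts' (datetime).
    """
    # Index impression positions by (user, item) so each conversion is a cheap lookup.
    by_key = {}
    for idx, imp in enumerate(impressions):
        by_key.setdefault((imp["user_id"], imp["item_id"]), []).append(idx)

    labels = [0] * len(impressions)
    for conv in conversions:
        candidates = [
            i for i in by_key.get((conv["user_id"], conv["item_id"]), [])
            if timedelta(0) <= conv["ts"] - impressions[i]["ts"] <= window
        ]
        if candidates:
            # Last-touch attribution: credit the most recent eligible impression.
            latest = max(candidates, key=lambda i: impressions[i]["ts"])
            labels[latest] = 1
    return [dict(imp, converted=label) for imp, label in zip(impressions, labels)]
```

With this rule, a 4 PM purchase is credited to a 10 AM impression of the same item, while a purchase 30 hours after the impression is not.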

Feature Pipeline

Feature Definition:
Dense: User/Item Embeddings.
Sparse: Category IDs, User IDs.
Dynamic: "CTR in the last 5 minutes" (Streaming).
Feature Store: Use Tecton or Feast. Provides a unified SDK so get_features(user_id) returns the same values in training and serving, which greatly reduces Training/Serving Skew.

Model Architecture

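Architecture: MMoE (Multi-gate Mixture-of-Experts).
Shared expert networks capture general user interest and are reused by every task.
Gate 1 learns a softmax weighting over all experts for the click task and feeds a tower that predicts Click Probability.
Gate 2 learns a separate weighting over the same experts and feeds a tower that predicts Conversion/Revenue.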
Complexity: ~10M parameters. Use reduced-precision inference (FP16) to help keep the ranking stage under 50ms.
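To make the gating mechanism concrete, here is a toy forward pass in pure Python. It is a sketch only: the weights are random, each expert is a single ReLU layer, each task tower is a single linear unit with a sigmoid, and the class name `TinyMMoE` is hypothetical. A real model would be built and trained in a deep-learning framework.

```python
import math
import random


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def softmax(vec):
    m = max(vec)
    exps = [math.exp(v - m) for v in vec]
    total = sum(exps)
    return [e / total for e in exps]


def rand_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


class TinyMMoE:
    def __init__(self, in_dim, expert_dim, num_experts, num_tasks, seed=0):
        rng = random.Random(seed)
        # Experts are shared by all tasks; gates and towers are per task.
        self.experts = [rand_matrix(expert_dim, in_dim, rng) for _ in range(num_experts)]
        self.gates = [rand_matrix(num_experts, in_dim, rng) for _ in range(num_tasks)]
        self.towers = [[rng.uniform(-0.5, 0.5) for _ in range(expert_dim)]
                       for _ in range(num_tasks)]

    def gate_weights(self, x, task):
        """Softmax weights over all experts for one task."""
        return softmax(matvec(self.gates[task], x))

    def forward(self, x):
        """Return one probability per task, e.g. [P(click), P(purchase)]."""
        expert_out = [[max(0.0, v) for v in matvec(w, x)] for w in self.experts]
        preds = []
        for task, tower in enumerate(self.towers):
            g = self.gate_weights(x, task)
            # Each task mixes the same expert outputs with its own gate weights.
            mixed = [sum(g[e] * expert_out[e][d] for e in range(len(expert_out)))
                     for d in range(len(tower))]
            logit = sum(t * m for t, m in zip(tower, mixed))
            preds.append(1.0 / (1.0 + math.exp(-logit)))
        return preds
```

The key point is that both tasks read from the same experts, and only the gate weights and towers differ per task.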

Training Pipeline

Label Construction: Use a "Wait Window." Labels are only finalized after 24 hours to ensure conversions are captured.
Splitting: Temporal Split. Train on days 1-28, validate on day 29, test on day 30. This simulates real-world deployment.
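Imbalance: Conversions are rare (0.1%). Use Negative Downsampling on non-clicked impressions to speed up training, then correct the resulting inflated predictions with q = p / (p + (1-p)/w), where p is the prediction of the model trained on downsampled data and w is the fraction of negatives that were kept.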
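A short worked example of the calibration correction, assuming w denotes the fraction of negatives kept during downsampling (for example, w = 0.1 keeps one negative in ten). The function name `recalibrate` is hypothetical.

```python
def recalibrate(p: float, w: float) -> float:
    """Map a prediction from the downsampled space back to the original space.

    p: probability predicted by the model trained on downsampled data.
    w: fraction of negatives kept during downsampling, in (0, 1].
    """
    if not 0.0 < w <= 1.0:
        raise ValueError("w must be in (0, 1]")
    return p / (p + (1.0 - p) / w)


# Worked example: a true conversion rate of 0.1% with 10% of negatives kept.
true_rate = 0.001
w = 0.1
# Positive rate the model sees after downsampling (roughly 0.99%).
sampled_rate = true_rate / (true_rate + (1.0 - true_rate) * w)
# Applying the correction recovers the original 0.1% rate.
corrected = recalibrate(sampled_rate, w)
```

Without this step, every predicted probability would be roughly ten times too high in this example, which would distort any downstream logic that consumes the raw probabilities.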

Serving Pipeline

Pattern: Two-stage.
Retrieval: ScaNN or FAISS to find top 500 items from 10M via embeddings.
Ranking: MMoE model scores the 500 items.
Optimization: Batch inference for the 500 items using ONNX Runtime or TensorRT.
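The two-stage pattern can be sketched in a few lines. This is illustrative only: exact brute-force inner-product search stands in for an approximate nearest-neighbor index such as ScaNN or FAISS, `score_fn` is a hypothetical stand-in for the MMoE ranker, and the function names are made up for this example.

```python
import heapq


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def retrieve(user_vec, item_vecs, k):
    """Stage 1: top-k item IDs by inner product (exact search as an ANN stand-in)."""
    return heapq.nlargest(k, item_vecs, key=lambda item_id: dot(user_vec, item_vecs[item_id]))


def rank(candidates, score_fn):
    """Stage 2: score every candidate once, then sort by descending score."""
    scores = {item_id: score_fn(item_id) for item_id in candidates}
    return sorted(candidates, key=lambda item_id: scores[item_id], reverse=True)


def recommend(user_vec, item_vecs, score_fn, retrieve_k=500, final_k=10):
    """Cheap retrieval narrows the corpus; the expensive ranker only sees retrieve_k items."""
    return rank(retrieve(user_vec, item_vecs, retrieve_k), score_fn)[:final_k]
```

The latency benefit comes from the ranker scoring only the retrieved candidates instead of the full 10M-item corpus.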

Evaluation Pipeline

Offline Evaluation: Use NDCG to measure ranking quality. Use Calibration Plots (Expected vs. Observed) to check that predicted probabilities match observed rates, which matters when the scores feed bidding/auction logic.
Online Evaluation:
Bucketization: Use MurmurHash3(user_id + salt) to assign users to A/B groups.
Interleaving: For 10% of traffic, mix results from Model A and Model B into a single list and credit each click to the model that contributed the clicked item. If Model A's items win significantly more clicks, Model A is the winner.
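A minimal sketch of Team Draft Interleaving and click crediting, assuming each model returns an ordered list of item IDs. In each round a coin flip decides which model drafts first, and each model then picks its highest-ranked item not already in the list. The function names are hypothetical.

```python
import random


def team_draft_interleave(ranking_a, ranking_b, length, rng=None):
    """Return (interleaved_list, team_of) where team_of maps item -> 'A' or 'B'."""
    rng = rng or random.Random()
    rankings = {"A": ranking_a, "B": ranking_b}
    interleaved, team_of = [], {}
    while len(interleaved) < length:
        order = ["A", "B"]
        rng.shuffle(order)  # coin flip: who drafts first this round
        progressed = False
        for team in order:
            if len(interleaved) >= length:
                break
            # Each team picks its best item that is not already in the list.
            pick = next((item for item in rankings[team] if item not in team_of), None)
            if pick is not None:
                interleaved.append(pick)
                team_of[pick] = team
                progressed = True
        if not progressed:
            break  # both rankings are exhausted
    return interleaved, team_of


def credit_clicks(clicked_items, team_of):
    """Count clicks per team; the team with more credited clicks wins the impression."""
    wins = {"A": 0, "B": 0}
    for item in clicked_items:
        if item in team_of:
            wins[team_of[item]] += 1
    return wins
```

Aggregating the per-impression winners over many sessions and applying a significance test (for example, a sign test) then decides which model is preferred.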

Monitoring Pipeline

Drift Monitoring: Calculate the Population Stability Index (PSI) for input features. If the "Category_ID" distribution shifts significantly (a common rule of thumb is PSI above 0.2), trigger an alert.
Prediction Drift: Monitor the average predicted CTR vs. actual CTR. A sudden divergence often indicates an upstream data pipeline failure or a shift in the input distribution.
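A minimal sketch of PSI for a categorical feature such as "Category_ID", assuming two samples of raw values: a reference window (for example, the training data) and a recent serving window. The small `eps` floor, which avoids taking the log of zero for unseen categories, and the function name `psi` are assumptions of this sketch.

```python
import math
from collections import Counter


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two samples of a categorical feature."""
    exp_counts, act_counts = Counter(expected), Counter(actual)
    total = 0.0
    for category in set(exp_counts) | set(act_counts):
        # Share of each category in each window, floored at eps.
        e = max(exp_counts[category] / len(expected), eps)
        a = max(act_counts[category] / len(actual), eps)
        total += (a - e) * math.log(a / e)
    return total
```

Identical distributions give a PSI of 0. A shift from a 50/50 split to a 90/10 split gives roughly 0.88, well above the commonly used 0.2 alert level.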
Wrap Up

Final Evaluation

Observability: Real-time dashboards showing the delta between Control and Treatment groups across all 15+ KPIs.
Edge Cases:
Cold Start: For new items, use a "bandit" exploration strategy (e.g., Thompson Sampling) to gather initial data.
Feedback Loops: "The rich get richer" effect. Countered by adding a small "exploration" weight to new/under-represented items.
Trade-offs:
Accuracy vs. Latency: We chose a 2-stage approach to keep latency low while maintaining high accuracy in the final rank.
Exploration vs. Exploitation: Using 5% of traffic for epsilon-greedy exploration keeps the evaluation dataset fresh and reduces selection bias, because items the current model would not have shown still receive some impressions.
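The epsilon-greedy trade-off can be sketched for a single recommendation slot. This example assumes a dictionary of model scores per candidate and logs the propensity of the chosen item, which is the quantity a later IPS correction needs. The function names are hypothetical.

```python
import random


def epsilon_greedy_propensities(scores, epsilon=0.05):
    """Probability that each item is shown under epsilon-greedy selection."""
    best = max(scores, key=scores.get)
    uniform = epsilon / len(scores)  # exploration mass is spread evenly over all items
    return {item: uniform + (1.0 - epsilon if item == best else 0.0) for item in scores}


def choose_and_log(scores, epsilon=0.05, rng=None):
    """Pick an item and return (item, propensity) so the log supports IPS later."""
    rng = rng or random.Random()
    probs = epsilon_greedy_propensities(scores, epsilon)
    items = list(probs)
    chosen = rng.choices(items, weights=[probs[i] for i in items], k=1)[0]
    return chosen, probs[chosen]
```

Because every item has a non-zero and known probability of being shown, the resulting logs can be re-weighted to evaluate other policies offline.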