Evaluation System for Large-Scale Recommendation Models
Design a high-scale evaluation and experimentation platform for a recommendation system (e.g., Amazon or YouTube). Your system must handle the end-to-end lifecycle: from offline backtesting using historical logs and counterfactual techniques to online A/B testing and shadow deployment. Address specific challenges such as selection bias in offline data, delayed feedback for conversion labels, and ensuring consistency between training and serving features. Explain how you would measure success using both ML-specific metrics (NDCG, AUC, Calibration) and business KPIs, while maintaining a strict P99 latency SLA for production traffic.
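One common way to approach the counterfactual part of offline backtesting is inverse propensity scoring (IPS), which reweights logged rewards by how much more (or less) likely the candidate policy is to show the logged item than the logging policy was. The sketch below is illustrative only: the function name `ips_estimate`, its parameters, and the default clip value are assumptions, not part of the problem statement, and it presumes the serving logs recorded the logging policy's propensity for each impression.

```python
from typing import Sequence, Tuple


def ips_estimate(
    rewards: Sequence[float],
    logging_propensities: Sequence[float],
    target_propensities: Sequence[float],
    clip: float = 10.0,  # assumed cap on importance weights to limit variance
) -> Tuple[float, float]:
    """Return (clipped IPS, self-normalized IPS) estimates of the
    candidate policy's mean reward from logged interactions."""
    n = len(rewards)
    if not (n == len(logging_propensities) == len(target_propensities)):
        raise ValueError("all inputs must have the same length")
    if n == 0:
        raise ValueError("need at least one logged interaction")

    weights = []
    for p_log, p_new in zip(logging_propensities, target_propensities):
        if p_log <= 0.0:
            raise ValueError("logging propensity must be positive")
        # Importance weight: how much the candidate policy favors this
        # logged action relative to the policy that produced the log.
        weights.append(min(p_new / p_log, clip))

    weighted_reward = sum(w * r for w, r in zip(weights, rewards))

    # Plain (clipped) IPS: average of weighted rewards.
    ips = weighted_reward / n

    # Self-normalized IPS: divide by the sum of weights instead of n,
    # trading a small bias for lower variance.
    weight_sum = sum(weights)
    snips = weighted_reward / weight_sum if weight_sum > 0 else 0.0
    return ips, snips


if __name__ == "__main__":
    # Toy log: click labels with logging and candidate-policy propensities.
    ips, snips = ips_estimate(
        rewards=[1, 0, 1, 0],
        logging_propensities=[0.5, 0.5, 0.25, 0.25],
        target_propensities=[0.5, 0.25, 0.5, 0.5],
    )
    print(f"IPS={ips:.4f} SNIPS={snips:.4f}")
```

Clipping bounds the variance caused by rare logged actions at the cost of some bias, so an answer would typically report both estimates and treat large gaps between them as a sign that the logs do not cover the candidate policy well.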
MMoE, XGBoost, Spark, Flink, Kafka, FAISS, HNSW, IPS, Feature Store, Thompson Sampling, AUC, NDCG