The Question
ML Design

Scalable Personalized Content Feed Design

Design a large-scale personalized article recommendation system for a platform with 100M+ users. The system must provide real-time updates for news freshness, handle high-throughput candidate retrieval from a corpus of 10M+ items, and optimize for multi-objective engagement (clicks vs. read time). Detail the data lifecycle from ingestion to model serving, and discuss how you mitigate common production issues like position bias, cold-start items, and training-serving skew.
Two-Tower DNN
MMoE
HNSW
FAISS
LightGBM
Flink
Kafka
DistilBERT
Thompson Sampling
Questions & Insights

Clarifying Questions

Business Goal: What is the primary North Star metric? Is it Click-Through Rate (CTR), total read time (dwell time), or user retention?
Assumption: We aim to maximize "High-Quality Engagement," defined as a weighted combination of CTR and Read Time.
Constraints & Scale: What is the scale of the user base and article corpus?
Assumption: 100M Daily Active Users (DAU), 10M active articles, and a P99 latency budget of 200ms.
Data Freshness: How quickly must new articles appear in the feed?
Assumption: Articles are time-sensitive (news/blogs). New items should be discoverable within minutes.
Cold Start: How do we handle users with no history or new articles with no clicks?
Assumption: We use metadata-based heuristics and content embeddings for initial recommendations.

Thinking Process

Identify the Funnel: With 10M articles, scoring every item with a heavy ranking model on each request is infeasible within the latency budget. I need a multi-stage architecture: Retrieval (Candidate Generation) to filter millions down to hundreds, and Ranking to order the top results precisely.
Freshness vs. Accuracy: Articles decay quickly. I must prioritize features like "time since publication" and "real-time trending signals" over long-term historical averages.
Multi-Objective Optimization: A click doesn't mean the user liked the article (it may be clickbait). I need to model both P(click) and P(read_time > threshold).
Scaling the MVP: Start with a Two-Tower model for retrieval because it allows pre-computing item embeddings and serving them through low-latency ANN search. For ranking, use a pointwise approach for simplicity.

Elite Bonus Points

Position Bias Correction: In training data, users click items at the top more often. I will implement a "Position Bias" feature during training (e.g., using a shallow tower or as a feature) but set it to a constant "Position 0" during inference to decorrelate the model from layout effects.
Calibration for Multi-Objective: Since I am predicting different targets (click vs. dwell time), I will use Expected Value Ranking or MMoE (Multi-gate Mixture-of-Experts) to avoid the "Clickbait Trap" where the model over-optimizes for sensationalist headlines.
Delayed Feedback Loop: High-quality read time labels arrive minutes after the click. I'll implement a "Wait-and-Label" window or use "Importance Sampling" to update models without waiting for the full session to expire.
Online Feature Store & Point-in-time Joins: To prevent data leakage, I ensure that features used for training reflect the exact state of the world at the time the user saw the article.
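The position-bias correction above can be sketched as follows. This is a minimal illustration, assuming the bias enters as an additive logit on top of the relevance logit; the per-position offsets in `POSITION_BIAS_LOGIT` are hypothetical stand-ins for what a shallow position tower would learn.

```python
import math


def sigmoid(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))


# Hypothetical learned offsets per feed position (index 0 = top slot).
# In a real system these would come from the shallow position tower.
POSITION_BIAS_LOGIT = [0.0, -0.4, -0.7, -0.9, -1.1]


def train_time_click_prob(relevance_logit: float, position: int) -> float:
    """Training: the logged position explains part of the observed click."""
    return sigmoid(relevance_logit + POSITION_BIAS_LOGIT[position])


def serve_time_click_prob(relevance_logit: float) -> float:
    """Serving: every candidate is scored as if it were shown at position 0."""
    return sigmoid(relevance_logit + POSITION_BIAS_LOGIT[0])


# The same article looks less clickable in the logs when shown lower down,
# but at serving time all candidates are compared on equal footing.
logged = train_time_click_prob(relevance_logit=1.0, position=3)
served = serve_time_click_prob(relevance_logit=1.0)
```

Because the position term is constant at inference, the final ordering depends only on the relevance logit.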
Design Breakdown

Requirements

Product Goal: Deliver a highly relevant, fresh, and personalized feed that keeps users engaged.
Success Metrics:
Online: CTR, Mean Dwell Time, DAU Retention.
Offline: AUC (for CTR), NDCG (for ranking order), LogLoss.
Guardrail Metrics: P99 Latency < 200ms, "News Diversity" score (to avoid filter bubbles).
System Constraints: 100k QPS at peak, horizontal scalability, 99.9% availability.
Data Availability: User profile (demographics), Interaction logs (clicks, skips, scrolls), Article Metadata (tags, text, author).

ML Problem Framing

ML Task Type: Two-stage ranking problem (Retrieval + Ranking).
Prediction Target: $\text{Score} = w_1 \cdot P(\text{click}) + w_2 \cdot E(\text{dwell\_time})$.
Inputs:
User: ID embedding, 7-day click history, preferred categories, location.
Item: Article ID, Text embedding (SBERT/BERT), age, popularity (V-point CTR).
Context: Device, time of day, day of week.
ML Challenges: Extreme data sparsity (most users haven't read most articles) and "Rich get Richer" feedback loops.
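To make the prediction target above concrete, here is a small sketch of the blended score. The weights (`w_click`, `w_dwell`) and the candidate values are illustrative assumptions, not tuned numbers; in practice the weights would be chosen through online experiments.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    article_id: str
    p_click: float             # model estimate of P(click)
    expected_dwell_sec: float  # model estimate of E(dwell_time), in seconds


def blended_score(c: Candidate, w_click: float = 1.0, w_dwell: float = 0.01) -> float:
    """Score = w_1 * P(click) + w_2 * E(dwell_time); weights are assumptions."""
    return w_click * c.p_click + w_dwell * c.expected_dwell_sec


candidates = [
    Candidate("clickbait", p_click=0.30, expected_dwell_sec=5.0),
    Candidate("long_read", p_click=0.12, expected_dwell_sec=90.0),
    Candidate("average", p_click=0.15, expected_dwell_sec=30.0),
]

# Sort descending by blended score; the high-CTR, low-dwell item drops to the bottom.
ranked = sorted(candidates, key=blended_score, reverse=True)
```

With these example weights, the article with the highest click probability ranks last because its expected dwell time is low.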

Design Summary & MVP

Concise Summary: A two-stage system using a Two-Tower Deep Neural Network for retrieval and a LightGBM or Wide & Deep model for ranking, leveraging an online feature store for real-time context.
Model Architecture & Selection:
Baseline Model: Logistic Regression with crossed features (User_Topic x Article_Topic).
Target Model: Retrieval: Two-Tower (User/Item) with HNSW (Approximate Nearest Neighbor). Ranking: Multi-Task DNN (MTL) to predict click and dwell time simultaneously.
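Choice Rationale: The Two-Tower design lets item embeddings be precomputed and indexed, so retrieval becomes an approximate nearest-neighbor search that scales roughly as O(log N) with HNSW. MTL discourages clickbait by also optimizing for engagement depth.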
Simplicity Audit: We avoid Reinforcement Learning (RL) for the MVP. While RL handles long-term rewards well, it is difficult to debug and requires a high-fidelity simulator. Supervised learning on engagement is the industry standard for MVP.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Mobile/Web clickstream (protobuf events), CMS for article metadata.
Data Ingestion: Kafka for high-throughput messaging. Flink for sessionization (connecting a click event with the subsequent dwell time event).
Data Storage: S3 for raw logs (Parquet). Snowflake/BigQuery for structured analytical queries.
Data Quality: De-duplication of events (turning at-least-once delivery into effectively exactly-once processing) and schema validation using Confluent Schema Registry.
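The de-duplication and click-to-dwell join described above can be sketched in plain Python. This is a batch illustration of the logic only, not Flink API code; the `Event` fields and the 30-minute `max_gap_sec` window are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional, Set, Tuple


@dataclass(frozen=True)
class Event:
    event_id: str      # unique ID assigned by the client, used for de-duplication
    kind: str          # "click" or "dwell"
    user_id: str
    article_id: str
    ts: float          # event time in seconds
    dwell_sec: float = 0.0


def join_clicks_with_dwell(
    events: Iterable[Event], max_gap_sec: float = 1800.0
) -> List[Tuple[str, str, Optional[float]]]:
    """Return (user_id, article_id, dwell_sec or None) per click."""
    seen: Set[str] = set()
    pending: Dict[Tuple[str, str], Event] = {}
    labeled: List[Tuple[str, str, Optional[float]]] = []
    for ev in sorted(events, key=lambda e: e.ts):
        if ev.event_id in seen:  # drop redelivered duplicates
            continue
        seen.add(ev.event_id)
        key = (ev.user_id, ev.article_id)
        if ev.kind == "click":
            pending[key] = ev
        elif ev.kind == "dwell":
            click = pending.get(key)
            if click is not None and ev.ts - click.ts <= max_gap_sec:
                del pending[key]
                labeled.append((ev.user_id, ev.article_id, ev.dwell_sec))
    # Clicks that never received a dwell event keep a missing label.
    labeled.extend((u, a, None) for (u, a) in pending)
    return labeled


events = [
    Event("e1", "click", "u1", "a1", ts=0.0),
    Event("e1", "click", "u1", "a1", ts=0.0),  # duplicate delivery
    Event("e2", "dwell", "u1", "a1", ts=45.0, dwell_sec=42.0),
    Event("e3", "click", "u1", "a2", ts=50.0),
]
rows = join_clicks_with_dwell(events)
```

In a streaming job the `seen` set and `pending` map would be keyed state with a time-to-live, so they do not grow without bound.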

Feature Pipeline

Feature Engineering:
User: Weighted average of embeddings of the last 20 articles read.
Item: TF-IDF/BM25 for text, but primarily Dense Embeddings from a fine-tuned DistilBERT.
Freshness: $\log(1 + \text{AgeInMinutes})$ to capture temporal decay.
Feature Store: Use Tecton or Feast.
Offline: Joins historical features with labels for training (point-in-time).
Online: Serves low-latency (<10ms) feature lookups for the ranking model.
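The point-in-time join used on the offline side can be illustrated with a small sketch. It assumes each feature has a timestamped history sorted by time, and that a value written exactly at the impression timestamp is allowed to be used; both are assumptions, and feature stores such as Feast or Tecton implement this join internally with their own APIs.

```python
from bisect import bisect_right
from typing import List, Optional, Tuple


def point_in_time_lookup(
    history: List[Tuple[float, float]], ts: float
) -> Optional[float]:
    """Return the latest feature value whose timestamp is <= ts.

    `history` is a list of (feature_ts, value) sorted by feature_ts ascending.
    Returns None when no value existed yet at time `ts`.
    """
    timestamps = [t for t, _ in history]
    idx = bisect_right(timestamps, ts)
    return history[idx - 1][1] if idx > 0 else None


# Hypothetical history of an article's running CTR feature.
article_ctr_history = [(100.0, 0.02), (200.0, 0.05), (300.0, 0.09)]

# Impressions (id, impression_ts) that need a training feature value.
impressions = [("imp1", 150.0), ("imp2", 200.0), ("imp3", 50.0)]

training_rows = [
    (imp_id, point_in_time_lookup(article_ctr_history, ts))
    for imp_id, ts in impressions
]
```

A naive join on the latest value would give every impression a CTR of 0.09, leaking information from the future into training.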

Model Architecture

Retrieval (Two-Tower):
User Tower: Dense layers processing user features and history.
Item Tower: Processes article metadata and text.
Output: Dot product of User and Item vectors.
Loss: In-batch sampled softmax cross-entropy, where the other items in the batch serve as negatives for efficiency.
Ranking (MMoE):
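Architecture: Several shared expert networks learn general representations. Each task has its own softmax gate that weights the expert outputs, and the gated mixture feeds a task-specific tower: one for Click (binary classification) and one for Read Time (regression).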
Why?: Different tasks share low-level features (e.g., topic interest) but diverge at the top (clickability vs. depth).
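The retrieval loss above can be sketched in a few lines of plain Python. This is an illustrative version that assumes row i of the user batch and row i of the item batch form a positive pair and treats every other item in the batch as a negative; a production version would run on tensors and typically add a correction for item popularity.

```python
import math
from typing import List, Sequence


def dot(u: Sequence[float], v: Sequence[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def in_batch_softmax_loss(
    user_vecs: List[Sequence[float]], item_vecs: List[Sequence[float]]
) -> float:
    """Mean cross-entropy where item i is the positive for user i."""
    total = 0.0
    for i, u in enumerate(user_vecs):
        logits = [dot(u, v) for v in item_vecs]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(user_vecs)


users = [[1.0, 0.0], [0.0, 1.0]]
items_aligned = [[2.0, 0.0], [0.0, 2.0]]   # each user matches its own item
items_swapped = [[0.0, 2.0], [2.0, 0.0]]   # each user matches the other item

loss_aligned = in_batch_softmax_loss(users, items_aligned)
loss_swapped = in_batch_softmax_loss(users, items_swapped)
```

The loss is low when each user embedding has its largest dot product with its own positive item, which is what makes the embeddings usable for nearest-neighbor retrieval.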

Training Pipeline

Dataset Construction: Use a sliding window for training. For article recommendation, train on 7 consecutive days and test on the following (8th) day to mimic production.
Handling Class Imbalance: Negative sampling ratio of 4:1 (unclicked vs clicked articles in the same session).
Training: Distributed training using Horovod or PyTorch DistributedDataParallel on NVIDIA A100s.
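One consequence of downsampling negatives is that the ranker's predicted click probabilities are inflated relative to the true CTR. A common correction is sketched below; it assumes negatives were kept uniformly at random with a known rate `negative_keep_rate` (a hypothetical parameter name) and positives were all kept.

```python
def sampled_probability(p_true: float, negative_keep_rate: float) -> float:
    """Probability the model sees after negatives are downsampled."""
    odds = p_true / (1.0 - p_true)
    sampled_odds = odds / negative_keep_rate  # fewer negatives inflate the odds
    return sampled_odds / (1.0 + sampled_odds)


def recalibrate(p_sampled: float, negative_keep_rate: float) -> float:
    """Map a prediction from the downsampled space back to the true scale."""
    return p_sampled / (p_sampled + (1.0 - p_sampled) / negative_keep_rate)


# Example: a true CTR of 5% with 25% of negatives kept.
inflated = sampled_probability(0.05, negative_keep_rate=0.25)
restored = recalibrate(inflated, negative_keep_rate=0.25)
```

This matters when the click prediction is combined with other objectives or monitored against an absolute CTR level, since both rely on calibrated probabilities.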

Serving Pipeline

Retrieval Pattern: Use an HNSW (Hierarchical Navigable Small World) index to retrieve the top 500 candidates in roughly 10 ms.
Ranking Pattern: Synchronous request-response. Ranker fetches missing features from the online feature store.
Optimization: Quantize the Ranking model to INT8 using TensorRT to reduce latency.
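Because HNSW is approximate, it is worth checking how many of the true nearest neighbors it returns. The sketch below computes an exact top-k baseline by brute force and a recall@k figure against it; the `approx_ids` list stands in for the output of a real ANN index (for example one built with FAISS), which is not shown here.

```python
import heapq
from typing import Dict, List, Sequence


def exact_top_k(
    user_vec: Sequence[float], item_vecs: Dict[str, Sequence[float]], k: int
) -> List[str]:
    """Brute-force top-k by dot product; only feasible offline on a sample."""
    scored = (
        (sum(a * b for a, b in zip(user_vec, vec)), item_id)
        for item_id, vec in item_vecs.items()
    )
    return [item_id for _, item_id in heapq.nlargest(k, scored)]


def recall_at_k(approx_ids: List[str], exact_ids: List[str]) -> float:
    """Fraction of the exact top-k that the approximate index also returned."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)


item_vecs = {
    "a": [1.0, 0.0],
    "b": [0.9, 0.1],
    "c": [0.0, 1.0],
    "d": [-1.0, 0.0],
}
exact_ids = exact_top_k([1.0, 0.0], item_vecs, k=2)

# Placeholder for what an ANN index might return for the same query.
approx_ids = ["a", "c"]
recall = recall_at_k(approx_ids, exact_ids)
```

Index parameters that trade latency for recall can then be tuned against this number on a sample of queries.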

Evaluation Pipeline

Offline: Use the replay method or Inverse Propensity Scoring (IPS) to estimate how the new model would have performed on historical logs while correcting for the logging bias of the old model.
Online: Standard A/B testing framework. Use interleaving as a faster, more sensitive way to detect preference between two ranking algorithms.
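The IPS estimator mentioned for offline evaluation can be sketched as below. It assumes the logging policy's probability of showing each item was recorded at serving time; the log tuple layout, the `new_policy_prob` callable, and the clipping value are assumptions for illustration.

```python
from typing import Callable, List, Tuple

# (context, shown_item, reward, probability the logging policy showed it)
LogRow = Tuple[str, str, float, float]


def ips_estimate(
    logs: List[LogRow],
    new_policy_prob: Callable[[str, str], float],
    clip: float = 10.0,
) -> float:
    """Estimate the new policy's average reward from logged data."""
    total = 0.0
    for context, item, reward, logging_prob in logs:
        # Reweight each logged outcome by how much more (or less) likely
        # the new policy is to show the same item; clip to limit variance.
        weight = min(new_policy_prob(context, item) / logging_prob, clip)
        total += weight * reward
    return total / len(logs)


# Logging policy showed "a" or "b" uniformly at random (probability 0.5 each).
logs = [
    ("u1", "a", 1.0, 0.5),
    ("u2", "a", 1.0, 0.5),
    ("u3", "b", 0.0, 0.5),
    ("u4", "b", 1.0, 0.5),
]


def new_policy(context: str, item: str) -> float:
    """Hypothetical new policy: shows "a" 80% of the time, "b" 20%."""
    return 0.8 if item == "a" else 0.2


estimated_reward = ips_estimate(logs, new_policy)
```

In these logs item "a" has a mean reward of 1.0 and "b" of 0.5, so the estimate matches the expected value 0.8 * 1.0 + 0.2 * 0.5 = 0.9.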

Monitoring Pipeline

Data Drift: Monitor the distribution of "Article Category" in the recommended list vs. the consumed list.
Prediction Drift: If the average predicted CTR drops from 5% to 2%, trigger an alert for model decay or upstream data breakage.
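One simple way to quantify the category drift described above is the Population Stability Index (PSI) between two categorical distributions. The sketch below is illustrative; the example distributions and the 0.2 alert threshold are assumptions that should be tuned per feature.

```python
import math
from typing import Dict


def population_stability_index(
    expected: Dict[str, float], actual: Dict[str, float], eps: float = 1e-6
) -> float:
    """PSI between two distributions given as category -> share."""
    psi = 0.0
    for category in set(expected) | set(actual):
        # Floor at eps so categories missing on one side do not divide by zero.
        e = max(expected.get(category, 0.0), eps)
        a = max(actual.get(category, 0.0), eps)
        psi += (a - e) * math.log(a / e)
    return psi


recommended = {"politics": 0.5, "sports": 0.3, "tech": 0.2}
consumed = {"politics": 0.2, "sports": 0.3, "tech": 0.5}

psi = population_stability_index(recommended, consumed)
ALERT_THRESHOLD = 0.2  # assumed rule-of-thumb threshold
should_alert = psi > ALERT_THRESHOLD
```

The same function can be applied to binned predicted-CTR histograms to cover the prediction-drift check.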
Wrap Up

Final Evaluation

Cold Start: Use a Multi-Armed Bandit (Thompson Sampling) for new articles to give them enough impressions to gather baseline CTR data.
Exploration vs. Exploitation: Reserve 5% of traffic for a "Random/Exploration" bucket to discover new user interests.
Trade-offs: We trade off perfect accuracy (which would require a massive Transformer ranker) for latency by using a two-stage approach and optimized vector search.
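The Thompson Sampling approach to cold-start articles can be sketched with a Beta-Bernoulli bandit. This is a minimal simulation under assumed true CTRs; the uniform Beta(1, 1) prior, the article names, and the round count are illustrative choices.

```python
import random
from typing import Dict, Iterable


class BetaBandit:
    """Thompson Sampling over articles with a Beta posterior on each CTR."""

    def __init__(self, article_ids: Iterable[str], seed: int = 0) -> None:
        self.rng = random.Random(seed)
        # [alpha, beta] per article, starting from a uniform Beta(1, 1) prior.
        self.stats = {a: [1.0, 1.0] for a in article_ids}

    def select(self) -> str:
        # Sample a plausible CTR for each article and show the best sample.
        samples = {
            a: self.rng.betavariate(alpha, beta)
            for a, (alpha, beta) in self.stats.items()
        }
        return max(samples, key=samples.get)

    def update(self, article_id: str, clicked: bool) -> None:
        self.stats[article_id][0 if clicked else 1] += 1.0


def simulate(true_ctr: Dict[str, float], rounds: int = 2000, seed: int = 0) -> Dict[str, int]:
    """Run the bandit against simulated users and count impressions."""
    bandit = BetaBandit(true_ctr, seed=seed)
    env = random.Random(seed + 1)
    impressions = {a: 0 for a in true_ctr}
    for _ in range(rounds):
        article = bandit.select()
        impressions[article] += 1
        bandit.update(article, env.random() < true_ctr[article])
    return impressions


impressions = simulate({"stale": 0.02, "fresh_hit": 0.20})
```

Early on both articles receive impressions because their posteriors are wide; as evidence accumulates, traffic shifts toward the article with the higher click rate.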