The Question
ML Design

Large-Scale Recommendation Ranking for Long User Sequences

Design the final-stage ranking system for a content discovery platform (like Pinterest) where users have long-term interaction histories (1,000+ events). Your system must efficiently handle these sequences to capture both long-term and short-term interests within a 100ms P99 latency budget. Detail the data and feature pipelines, the specific attention mechanisms used to overcome the computational cost of long sequences, and how the system maintains online/offline consistency for embedding-based features.
SIM
DIN
DCN-v2
Transformer
Kafka
Flink
Spark
Tecton
PyTorch
Horovod
Questions & Insights

Clarifying Questions

Business Goal: Is the primary North Star metric "Save Rate" (high intent) or "Click-Through Rate" (engagement)?
Assumption: We aim to maximize a weighted multi-objective score of CTR and Save Rate.
Constraints & Scale: What is the scale of the user history and the candidate pool for ranking?
Assumption: Ranking ~500–1000 candidate pins. User interaction history can span up to 1,000+ events (long-term sequence).
Latency Budget: What is the P99 latency requirement for the ranking stage?
Assumption: Total ranking latency < 100ms.
Data Freshness: How quickly must a user's latest interaction influence their recommendations?
Assumption: Near real-time (seconds) for short-term interests, daily for long-term profiling.

Thinking Process

Identify the Bottleneck: Standard Transformer self-attention is O(N^2) in the sequence length N. Running it over 1,000+ historical pins for each of the ~500 candidates is computationally prohibitive within a 100ms budget.
Strategy - Filter then Attend: Instead of attending over the whole sequence, I should use a two-stage approach within the ranking model: a fast search (GSU - General Search Unit) to find relevant items from the history, followed by a complex attention mechanism (ESU - Exact Search Unit) on a small subset.
Decoupling: I need to separate the long-term historical features (static/slow-moving) from the short-term session features (dynamic).
Architecture Selection: A Deep Interest Network (DIN) is a good baseline, but for long sequences I will propose a Search-based Interest Model (SIM) or a UBR4CTR-style retrieval architecture.
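A rough operation count makes the "Filter then Attend" argument concrete. The sketch below only counts query-key dot products per request and follows the per-candidate framing used above; the function name and the K = 50 subset size are illustrative assumptions, and the cost of the GSU search itself is deliberately left out.

```python
def attention_score_ops(seq_len: int, num_candidates: int, top_k: int) -> dict:
    """Count query-key dot products per request (ignores heads and embedding width)."""
    return {
        # Self-attention over the full history, repeated per candidate: N^2 pairs each.
        "self_attention_full": num_candidates * seq_len * seq_len,
        # Target attention over the full history: one query (the candidate) vs. N keys.
        "target_attention_full": num_candidates * seq_len,
        # Filter then attend: the ESU only scores the K items kept by the GSU.
        "target_attention_top_k": num_candidates * top_k,
    }


costs = attention_score_ops(seq_len=1000, num_candidates=500, top_k=50)
for name, ops in costs.items():
    print(f"{name}: {ops:,}")
```

Under these assumptions the filtered ESU scores 25,000 pairs per request, against 500,000 for target attention over the full history and 500,000,000 for per-candidate self-attention.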

Elite Bonus Points

Negative Augmentation: Incorporating "skipped" pins in the sequence to explicitly model negative preferences, not just positive interactions.
Calibration for Multi-objective: Using a calibration layer (e.g., Platt scaling or Isotonic Regression) to ensure the predicted probabilities of Saves vs. Clicks are on the same scale before weighted summation.
Position Bias Correction: Implementing a shallow position-bias tower during training (fed only the display position and dropped at serving time) to prevent the model from learning that "Top items are better simply because they are at the top."
Embeddings Versioning & Warm-start: When updating Pin embeddings, the sequence features will break. I would implement a "warm-start" mapping or a lightweight residual adapter to keep the long-term sequence meaningful during model transitions.
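The calibration point can be sketched as follows. This is a minimal illustration of Platt scaling applied per objective before the weighted sum; the function names, the (a, b) parameters, and the 0.3/0.7 weights are hypothetical placeholders, and in practice a and b would be fit on held-out data for each head.

```python
import math


def platt_calibrate(raw_logit: float, a: float, b: float) -> float:
    """Platt scaling: sigmoid(a * logit + b), with a and b fit on held-out data."""
    return 1.0 / (1.0 + math.exp(-(a * raw_logit + b)))


def blended_score(click_logit, save_logit, click_ab, save_ab,
                  w_click=0.3, w_save=0.7):
    """Calibrate each head separately, then combine on a common probability scale."""
    p_click = platt_calibrate(click_logit, *click_ab)
    p_save = platt_calibrate(save_logit, *save_ab)
    return w_click * p_click + w_save * p_save


# Hypothetical parameters: the save head is shifted down because saves are rarer.
score = blended_score(1.2, -0.4, click_ab=(1.0, 0.0), save_ab=(0.8, -1.5))
```

Isotonic Regression could replace `platt_calibrate` without changing the blending step.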
Design Breakdown

Requirements

Product Goal: Deliver highly relevant Pin recommendations that lead to engagement (Saves/Clicks).
Success Metrics:
Online: CTR, Save Rate, Time-spent.
Offline: AUC (for classification), NDCG (for ranking), Recall@K.
Guardrail: P99 Latency < 100ms, Model Training Time, Inference QPS.
System Constraints: 500M+ users, billions of pins. Need to handle "Heavy Users" with years of history.
Data Availability: Real-time clickstream, historical interaction DB (Saves, Close-ups), Pin metadata (Tags, Image Embeddings).

ML Problem Framing

ML Task Type: Point-wise Ranking (Binary Classification).
Prediction Target: P(Engage | User, Pin, Context).
Inputs:
User: Profile (age, location) + Long-term Interest (1 year history) + Short-term Interest (last 20 interactions).
Item (Candidate Pin): Pin embedding, category, popularity, creator authority.
Context: Device, time, surface (Homefeed vs. Related).
ML Challenges: Long-sequence interaction modeling (the core constraint), data imbalance (saves are rarer than clicks), and feedback loops.

Design Summary & MVP

Concise Summary: We will implement a Search-based Interest Model (SIM). It uses a General Search Unit (GSU) to retrieve the top-K relevant pins from a user's 1,000+ historical actions based on the candidate pin's category/embedding, and an Exact Search Unit (ESU) to perform Multi-Head Attention on that filtered subset.
Model Architecture:
Baseline: Logistic Regression with aggregated history (mean-pooling of last 50 pins).
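Target Model: Deep & Cross Network (DCN-v2) combined with a SIM sequence encoder.
Simplicity Audit: This is the simplest way to handle long sequences because it avoids the O(N^2) cost of self-attention over the whole sequence: a cheap search first cuts the history to K items (an index lookup for Hard Search, a linear scan of inner products for Soft Search), and attention then runs only on those K.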
Architecture Decision Rationale: This satisfies the latency budget while capturing fine-grained user interests that a simple "mean-pooling" would wash out.
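For reference, the cross layer that gives DCN-v2 its explicit feature interactions computes x_{l+1} = x_0 * (W_l x_l + b_l) + x_l, where * is elementwise. The sketch below is a plain-Python illustration of one such layer with hand-picked weights; a real model would learn W and b and would run this in a tensor library.

```python
def cross_layer(x0, xl, weight, bias):
    """One DCN-v2 cross layer: x_{l+1} = x0 * (W @ xl + b) + xl (elementwise *)."""
    projected = [
        sum(w * x for w, x in zip(row, xl)) + b
        for row, b in zip(weight, bias)
    ]
    # The residual term "+ xl" lets the layer fall back to the identity mapping.
    return [x0_i * p_i + xl_i for x0_i, p_i, xl_i in zip(x0, projected, xl)]


x0 = [1.0, 2.0]
identity = [[1.0, 0.0], [0.0, 1.0]]
x1 = cross_layer(x0, x0, identity, [0.0, 0.0])
```

Stacking the layer raises the interaction order by one each time, since every layer multiplies by x0 again.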

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Pinterest mobile/web event logs (Pin-click, Pin-save).
Data Ingestion: Kafka for real-time events. Airflow orchestrates daily Spark jobs to sync the Data Lake (S3) with the Data Warehouse (BigQuery/Snowflake).
Data Storage: S3 for raw Parquet files (partitioned by day/hour).
Data Quality: De-duplication of events (e.g., accidental double-clicks) and schema enforcement using Protobuf.
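The de-duplication step could look roughly like this. The event field names (`user_id`, `pin_id`, `action`, `ts_ms`) and the one-second window are assumptions made for illustration, not the actual log schema.

```python
def dedupe_events(events, window_ms=1000):
    """Drop repeats of the same (user, pin, action) within window_ms of the last kept one."""
    last_kept = {}
    kept = []
    for event in sorted(events, key=lambda e: e["ts_ms"]):
        key = (event["user_id"], event["pin_id"], event["action"])
        previous = last_kept.get(key)
        if previous is not None and event["ts_ms"] - previous < window_ms:
            continue  # treated as an accidental double-click
        last_kept[key] = event["ts_ms"]
        kept.append(event)
    return kept


raw = [
    {"user_id": "u1", "pin_id": "p1", "action": "click", "ts_ms": 0},
    {"user_id": "u1", "pin_id": "p1", "action": "click", "ts_ms": 300},
    {"user_id": "u1", "pin_id": "p1", "action": "save", "ts_ms": 100},
    {"user_id": "u1", "pin_id": "p1", "action": "click", "ts_ms": 2000},
]
clean = dedupe_events(raw)
```

Keying on the action type keeps a save that follows a click on the same pin, which is a distinct signal rather than a duplicate.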

Feature Pipeline

Feature Definition:
User Long-term Sequence: List of Pin IDs + Timestamps + Interaction Type (last 1,000 items).
User Short-term Sequence: Last 20 items (high signal).
Candidate Pin: ID, Visual Embedding (from a pre-trained Vision Transformer), Category.
Online Feature Pipeline: Flink consumes Kafka to maintain a sliding window of the user's last 20 actions for immediate personalization.
Feature Store: Tecton or Feast. We store the "Long-term" sequence as a compressed list of IDs to save space.
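The short-term window maintained by the online pipeline behaves like a bounded queue per user. The class below is a plain-Python stand-in for that keyed state, not Flink code; the class and method names are hypothetical.

```python
from collections import defaultdict, deque


class ShortTermWindow:
    """Keeps only the most recent max_len actions per user."""

    def __init__(self, max_len=20):
        self._windows = defaultdict(lambda: deque(maxlen=max_len))

    def update(self, user_id, pin_id, action, ts_ms):
        # deque(maxlen=...) evicts the oldest action automatically when full.
        self._windows[user_id].append((pin_id, action, ts_ms))

    def latest(self, user_id):
        window = self._windows.get(user_id)
        return list(window) if window else []


window = ShortTermWindow(max_len=20)
for i in range(25):
    window.update("u1", f"pin{i}", "click", ts_ms=i)
recent = window.latest("u1")
```

Unknown users return an empty list, which is the case the cold-start fallback has to cover.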

Model Architecture

The Long-Sequence Problem:
General Search Unit (GSU): For a candidate pin P_c, we search the user history H = [h_1, h_2, ..., h_{1000}].
Hard Search (MVP): Match items in H that share the same category_id as P_c.
Soft Search (Advanced): Use the embedding of P_c to find the top 50 items in H via Inner Product.
Exact Search Unit (ESU):
Take the 50 items from GSU.
Apply Target Attention: The candidate pin P_c acts as the "Query," and the 50 items act as "Keys" and "Values."
This captures the specific relevance of the history to the current pin being ranked.
Core Model: The output of ESU is concatenated with other features (User, Context) and fed into a DCN-v2 (Deep & Cross Network) to capture high-order feature interactions.
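The Soft Search and Target Attention steps can be sketched together. This is a simplified, single-head version using scaled dot-product scores with no learned projections, written in plain Python for readability; the function names are illustrative, and a production ESU would use learned multi-head attention in a tensor library.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_search(candidate, history, k):
    """GSU soft search: indices of the k history items with the largest inner product."""
    ranked = sorted(range(len(history)),
                    key=lambda i: dot(candidate, history[i]), reverse=True)
    return ranked[:k]


def target_attention(candidate, selected):
    """ESU: the candidate is the query; the selected items are both keys and values."""
    if not selected:
        return [0.0] * len(candidate), []
    scale = math.sqrt(len(candidate))
    scores = [dot(candidate, item) / scale for item in selected]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    pooled = [sum(w * item[d] for w, item in zip(weights, selected))
              for d in range(len(candidate))]
    return pooled, weights


candidate = [1.0, 0.0]
history = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5], [-1.0, 0.0]]
top = soft_search(candidate, history, k=2)
pooled, weights = target_attention(candidate, [history[i] for i in top])
```

The pooled vector is what gets concatenated with the user and context features before the DCN-v2 layers.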

Training Pipeline

Dataset Construction: Use a 7-day window for training. To keep long-sequence records compact, we store only Pin IDs in the training records and join them with a snapshot of Pin embeddings from the same day, which avoids leaking embeddings trained on future data.
Negative Sampling: Use "Logged Negatives" (items shown but not clicked).
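Distributed Training: Use Horovod or PyTorch DistributedDataParallel for the dense layers. Both are data-parallel and replicate the model on every worker, so the embedding tables for billions of Pins, which will be too large to replicate, need to be sharded across workers or shrunk (e.g., by hashing IDs).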
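The point-in-time join from Dataset Construction might be sketched as below. The record fields, the date keys, and the in-memory dictionaries are assumptions for illustration; at scale this would be a Spark join against the daily snapshot tables.

```python
def attach_snapshot_embeddings(records, snapshots):
    """Join each training record's pin IDs with the embedding snapshot of its own day."""
    joined = []
    for record in records:
        day_table = snapshots[record["date"]]
        # Pins missing from that day's snapshot are dropped rather than
        # backfilled from a later snapshot, which would leak future information.
        sequence = [day_table[pin] for pin in record["history_pin_ids"]
                    if pin in day_table]
        joined.append({**record, "history_embeddings": sequence})
    return joined


snapshots = {
    "2024-01-01": {"p1": [0.1], "p2": [0.2]},
    "2024-01-02": {"p1": [0.9], "p2": [0.8], "p3": [0.7]},
}
records = [{"date": "2024-01-01", "history_pin_ids": ["p1", "p3"], "label": 1}]
training_rows = attach_snapshot_embeddings(records, snapshots)
```

Here `p3` only exists in the later snapshot, so it is excluded from the earlier record.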

Serving Pipeline

Pattern: Online Request-Response.
Optimization:
GSU is performed using a fast bitmask or hash-map lookup for category matching (Hard Search).
Embedding lookups for the sequence are batched.
Model weights are served in FP16 (half precision).
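The hash-map variant of Hard Search could be sketched as follows. It assumes the history is stored oldest-first as (pin_id, category_id) pairs and that the index is built once per user outside the request path; both are assumptions, as are the function names.

```python
from collections import defaultdict


def build_category_index(history):
    """Group history positions by category once per user, outside the request path."""
    index = defaultdict(list)
    for position, (_pin_id, category_id) in enumerate(history):
        index[category_id].append(position)
    return dict(index)


def hard_search(index, candidate_category, k=50):
    """GSU hard search: the k most recent history positions in the candidate's category."""
    return index.get(candidate_category, [])[-k:]


history = [("a", 1), ("b", 2), ("c", 1), ("d", 1)]
index = build_category_index(history)
matches = hard_search(index, candidate_category=1, k=2)
```

At request time each candidate costs one dictionary lookup and a slice, independent of the total history length.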

Evaluation Pipeline

Offline Evaluation: AUC for binary labels. We also track GAUC (Group AUC) per user to ensure the model ranks better for individuals, not just globally.
Online Evaluation: Standard A/B testing framework measuring "Pins Saved per User" and "Long-term Retention."
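GAUC can be computed as an impression-weighted mean of per-user AUC. The sketch below uses a direct pairwise AUC for clarity, which is quadratic in the number of impressions per user; weighting by impression count and skipping users with only one label class are common conventions assumed here, not requirements stated above.

```python
from collections import defaultdict


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count half)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        return None  # AUC is undefined with a single class
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


def gauc(user_ids, labels, scores):
    """Impression-weighted mean of per-user AUC, skipping single-class users."""
    by_user = defaultdict(lambda: ([], []))
    for user, label, score in zip(user_ids, labels, scores):
        by_user[user][0].append(label)
        by_user[user][1].append(score)
    weighted_sum = total_weight = 0.0
    for user_labels, user_scores in by_user.values():
        user_auc = auc(user_labels, user_scores)
        if user_auc is None:
            continue
        weighted_sum += len(user_labels) * user_auc
        total_weight += len(user_labels)
    return weighted_sum / total_weight if total_weight else None


users = ["u1", "u1", "u2", "u2", "u3"]
labels = [1, 0, 1, 0, 1]
scores = [0.9, 0.1, 0.2, 0.8, 0.5]
```

In this toy data the model orders u1 correctly and u2 incorrectly, so GAUC is 0.5 while the global AUC is about 0.67, which is the gap GAUC is meant to expose.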

Monitoring Pipeline

Data Monitoring: Check if the "Sequence Length" distribution shifts (e.g., if a bug causes sequences to be truncated).
Model Monitoring: Monitor the "Attention Weights" in the ESU. If the model starts ignoring the sequence, it may indicate embedding drift.
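One way to detect a shift in the sequence-length distribution is the Population Stability Index over length buckets. The bucket edges and the sample data below are illustrative assumptions, and any alerting threshold would need to be tuned on real traffic.

```python
import math


def bucket_shares(lengths, edges):
    """Share of sequences falling in each bucket defined by ascending edges."""
    counts = [0] * (len(edges) + 1)
    for length in lengths:
        bucket = sum(1 for edge in edges if length >= edge)
        counts[bucket] += 1
    return [count / len(lengths) for count in counts]


def psi(expected, actual, floor=1e-6):
    """Population Stability Index: sum((a - e) * ln(a / e)) over buckets."""
    value = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, floor), max(a, floor)  # avoid log(0) on empty buckets
        value += (a - e) * math.log(a / e)
    return value


edges = [50, 250, 500, 1000]
baseline = bucket_shares([1000] * 80 + [100] * 20, edges)
truncated = bucket_shares([200] * 80 + [100] * 20, edges)
drift = psi(baseline, truncated)
```

A truncation bug empties the longest bucket, which produces a large PSI even when the mean sequence length moves only moderately.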
Wrap Up

Final Evaluation

Edge Cases: Cold-start users (no sequence).
Fallback: Use demographic-based popularity or "Global Trending" pins.
Trade-offs:
Hard Search vs. Soft Search: Hard search is faster/cheaper; Soft search is more accurate but requires vector search in the ranking loop.
MVP Recommendation: Start with Hard Search (Category matching) for the GSU. It is robust and incredibly fast.
Distinguishing Insight: In long sequences, time decay is vital. A pin saved 2 years ago is less relevant than one saved yesterday. I would add a Time-Aware Positional Encoding to the ESU to help the model learn the temporal relevance.
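A simple form of time-aware encoding is to bucket each interaction's age on a log scale and learn one embedding per bucket, which is then added to the item's key in the ESU. The sketch below shows only the bucketing step; the log2-of-hours scheme and the 16-bucket cap are assumptions chosen for illustration.

```python
import math


def age_bucket(event_ts_s, now_ts_s, num_buckets=16):
    """Map an interaction's age to a log2-spaced bucket ID (bucket 0 = under one hour)."""
    age_hours = max(now_ts_s - event_ts_s, 0) / 3600.0
    bucket = int(math.log2(age_hours + 1.0))
    return min(bucket, num_buckets - 1)


now = 1_700_000_000
yesterday = age_bucket(now - 24 * 3600, now)
two_years_ago = age_bucket(now - 2 * 365 * 24 * 3600, now)
```

Log spacing gives fine resolution to recent actions and coarse resolution to old ones, so a pin saved yesterday and one saved two years ago land in clearly separated buckets (4 and 14 here).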