The Question
ML Design

Large-Scale Recommendation Ranking for Long User Sequences

Design the final-stage ranking system for a content discovery platform (like Pinterest) where users have long-term interaction histories (1,000+ events). Your system must efficiently handle these sequences to capture both long-term and short-term interests within a 100ms P99 latency budget. Detail the data and feature pipelines, the specific attention mechanisms used to overcome the computational cost of long sequences, and how the system maintains online/offline consistency for embedding-based features.
SIM
DIN
DCN-v2
Transformer
Kafka
Flink
Spark
Tecton
PyTorch
Horovod
Questions & Insights

Clarifying Questions

Business Goal: Is the primary North Star metric "Save Rate" (high intent) or "Click-Through Rate" (engagement)?
Assumption: We aim to maximize a weighted multi-objective score of CTR and Save Rate.
Constraints & Scale: What is the scale of the user history and the candidate pool for ranking?
Assumption: Ranking ~500–1,000 candidate pins per request. A user's interaction history can span 1,000+ events (long-term sequence).
Latency Budget: What is the P99 latency requirement for the ranking stage?
Assumption: Total ranking latency < 100ms.
Data Freshness: How quickly must a user's latest interaction influence their recommendations?
Assumption: Near real-time (seconds) for short-term interests, daily for long-term profiling.

Thinking Process

Identify the Bottleneck: Standard Transformer self-attention is O(N^2) in sequence length. Running it over 1,000+ history pins for each of the 500–1,000 candidates is computationally prohibitive within a 100ms budget.
Strategy - Filter then Attend: Instead of attending over the whole sequence, I should use a two-stage approach within the ranking model: a fast search (GSU - General Search Unit) to find relevant items from the history, followed by a complex attention mechanism (ESU - Exact Search Unit) on a small subset.
Decoupling: I need to separate the long-term historical features (static/slow-moving) from the short-term session features (dynamic).
Architecture Selection: A Deep Interest Network (DIN) is a good baseline, but for "long" sequences I will propose a Search-based Interest Model (SIM) or a UBR4CTR-style architecture.
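To make the bottleneck concrete, here is a rough back-of-the-envelope sketch of the score-computation cost. It assumes the candidate is appended to the sequence so that self-attention is recomputed per candidate, an embedding width of 64, and a top-50 filter; all three numbers and both function names are illustrative assumptions, and projections, softmax, and memory traffic are ignored.

```python
def self_attention_cost(num_candidates: int, seq_len: int, dim: int) -> int:
    """Approximate multiply-adds for the N x N attention score matrix,
    recomputed once per candidate."""
    return num_candidates * seq_len * seq_len * dim


def filter_then_attend_cost(num_candidates: int, seq_len: int, dim: int, top_k: int) -> int:
    """Approximate multiply-adds for a soft-search GSU followed by target attention."""
    gsu = num_candidates * seq_len * dim  # one inner product per history item
    esu = num_candidates * top_k * dim    # one query scored against top_k keys
    return gsu + esu


full = self_attention_cost(num_candidates=500, seq_len=1000, dim=64)
filtered = filter_then_attend_cost(num_candidates=500, seq_len=1000, dim=64, top_k=50)
print(f"full self-attention: {full:,} multiply-adds")
print(f"filter then attend:  {filtered:,} multiply-adds")
print(f"reduction factor:    ~{full // filtered}x")
```

Under these assumptions the filtered path is roughly three orders of magnitude cheaper, and a Hard Search GSU (category match) would shrink the first term further.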

Elite Bonus Points

Negative Augmentation: Incorporating "skipped" pins in the sequence to explicitly model negative preferences, not just positive interactions.
Calibration for Multi-objective: Using a calibration layer (e.g., Platt scaling or Isotonic Regression) to ensure the predicted probabilities of Saves vs. Clicks are on the same scale before weighted summation.
Position Bias Correction: Implementing a shallow "Position Bias" tower during training (dropped or fed a fixed position at serving time) to prevent the model from learning that "Top items are better simply because they are at the top."
Embeddings Versioning & Warm-start: When Pin embeddings are retrained, sequence features built on the old embedding space no longer line up with the new one. I would implement a "warm-start" mapping or a lightweight residual adapter to keep the long-term sequence meaningful during model transitions.
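As an illustration of the calibration point, the sketch below applies Platt scaling to each head's raw logit before the weighted sum. The parameters `(a, b)` would be fitted on a held-out set; the values and objective weights used here are placeholders, and the function names are hypothetical.

```python
import math


def platt(logit: float, a: float, b: float) -> float:
    """Platt scaling: a logistic function applied to an affine transform of the raw logit."""
    return 1.0 / (1.0 + math.exp(-(a * logit + b)))


def blended_score(
    click_logit: float,
    save_logit: float,
    click_cal: tuple,
    save_cal: tuple,
    w_click: float = 0.3,  # assumed weight, tuned by the business objective
    w_save: float = 0.7,   # assumed weight, saves treated as higher intent
) -> float:
    """Calibrate each head separately, then combine on a common probability scale."""
    p_click = platt(click_logit, *click_cal)
    p_save = platt(save_logit, *save_cal)
    return w_click * p_click + w_save * p_save


# Placeholder calibration parameters (a, b) for each head.
score = blended_score(1.2, -0.4, click_cal=(0.9, -0.1), save_cal=(1.3, -2.0))
```

Calibrating per head matters because saves are much rarer than clicks, so uncalibrated outputs from the two heads are not directly comparable.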
Design Breakdown

Requirements

Product Goal: Deliver highly relevant Pin recommendations that lead to engagement (Saves/Clicks).
Success Metrics:
Online: CTR, Save Rate, Time-spent.
Offline: AUC (for classification), NDCG (for ranking), Recall@K.
Guardrail: P99 Latency < 100ms, Model Training Time, Inference QPS.
System Constraints: 500M+ users, billions of pins. Need to handle "Heavy Users" with years of history.
Data Availability: Real-time clickstream, historical interaction DB (Saves, Close-ups), Pin metadata (Tags, Image Embeddings).

ML Problem Framing

ML Task Type: Point-wise Ranking (Binary Classification).
Prediction Target: P(\text{Engage} | \text{User}, \text{Pin}, \text{Context}).
Inputs:
User: Profile (age, location) + Long-term Interest (1 year history) + Short-term Interest (last 20 interactions).
Item (Candidate Pin): Pin embedding, category, popularity, creator authority.
Context: Device, time, surface (Homefeed vs. Related).
ML Challenges: Long-sequence interaction modeling (the core constraint), data imbalance (saves are rarer than clicks), and feedback loops.

Design Summary & MVP

Concise Summary: We will implement a Search-based Interest Model (SIM). A General Search Unit (GSU) retrieves the top-K relevant pins from a user's 1,000+ historical actions based on the candidate pin's category or embedding, and an Exact Search Unit (ESU) applies multi-head target attention over that filtered subset, with the candidate pin as the query.
Model Architecture:
Baseline: Logistic Regression with aggregated history (mean-pooling of last 50 pins).
Target Model: Deep & Cross Network (DCN-v2) combined with a SIM sequence encoder.
Simplicity Audit: This is among the simplest ways to handle long sequences: a cheap search runs first (a hash lookup for Hard Search, a linear scan or ANN lookup for Soft Search), so the model never pays the O(N^2) cost of self-attention over the whole sequence.
Architecture Decision Rationale: This satisfies the latency budget while capturing fine-grained user interests that a simple "mean-pooling" would wash out.
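For reference, a single DCN-v2 cross layer computes x_{l+1} = x_0 ⊙ (W_l x_l + b_l) + x_l, where ⊙ is element-wise multiplication. The pure-Python sketch below shows just that update with a full-rank weight matrix; a production model would use PyTorch tensors, stack several layers, and may use the low-rank variant. The function name and toy values are illustrative.

```python
from typing import List


def cross_layer(x0: List[float], xl: List[float], W: List[List[float]], b: List[float]) -> List[float]:
    """One DCN-v2 cross layer: x0 * (W @ xl + b) + xl, element-wise product with x0."""
    projected = [sum(w * x for w, x in zip(row, xl)) + bias for row, bias in zip(W, b)]
    return [x0_i * p_i + xl_i for x0_i, p_i, xl_i in zip(x0, projected, xl)]


x0 = [1.0, 2.0]                     # concatenated input features (toy values)
identity = [[1.0, 0.0], [0.0, 1.0]]
zeros = [0.0, 0.0]

x1 = cross_layer(x0, x0, identity, zeros)  # first layer takes xl = x0
x2 = cross_layer(x0, x1, identity, zeros)  # each layer raises the interaction order by one
```

The residual term `+ xl` keeps lower-order interactions available to later layers.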

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Pinterest mobile/web event logs (Pin-click, Pin-save).
Data Ingestion: Kafka for real-time events. Airflow orchestrates daily Spark jobs to sync the Data Lake (S3) with the Data Warehouse (BigQuery/Snowflake).
Data Storage: S3 for raw Parquet files (partitioned by day/hour).
Data Quality: De-duplication of events (e.g., accidental double-clicks) and schema enforcement using Protobuf.
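A minimal sketch of the de-duplication step, assuming an event is a duplicate when the same user repeats the same action on the same pin within a short window. The `Event` fields and the 1-second window are assumptions for illustration, not the actual log schema.

```python
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    user_id: str
    pin_id: str
    action: str   # e.g. "click" or "save"
    ts_ms: int    # event time in milliseconds


def dedupe(events: Iterable[Event], window_ms: int = 1000) -> List[Event]:
    """Drop repeats of the same (user, pin, action) that arrive within window_ms
    of the previous occurrence. A burst of rapid repeats collapses to one event."""
    last_seen = {}
    kept = []
    for ev in sorted(events, key=lambda e: e.ts_ms):
        key = (ev.user_id, ev.pin_id, ev.action)
        prev_ts = last_seen.get(key)
        if prev_ts is None or ev.ts_ms - prev_ts > window_ms:
            kept.append(ev)
        last_seen[key] = ev.ts_ms  # updated even for dropped events
    return kept


raw = [
    Event("u1", "p1", "click", 0),
    Event("u1", "p1", "click", 300),   # accidental double-click, dropped
    Event("u1", "p1", "save", 350),    # different action, kept
    Event("u1", "p1", "click", 5000),  # deliberate later click, kept
]
clean = dedupe(raw)
```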

Feature Pipeline

Feature Definition:
User Long-term Sequence: List of Pin IDs + Timestamps + Interaction Type (last 1,000 items).
User Short-term Sequence: Last 20 items (high signal).
Candidate Pin: ID, Visual Embedding (from a pre-trained Vision Transformer), Category.
Online Feature Pipeline: Flink consumes Kafka to maintain a sliding window of the user's last 20 actions for immediate personalization.
Feature Store: Tecton or Feast. We store the "Long-term" sequence as a compressed list of IDs to save space.
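The two storage ideas above can be sketched in plain Python. In production the sliding window would live in Flink keyed state and the packed sequence in the feature store; the in-memory class and the `pack_ids` / `unpack_ids` helpers below are hypothetical stand-ins, and fixed-width 64-bit IDs with zlib compression are an assumed encoding.

```python
import struct
import zlib
from collections import defaultdict, deque
from typing import List


class ShortTermWindow:
    """Keeps only the most recent `size` pin IDs per user."""

    def __init__(self, size: int = 20):
        self._windows = defaultdict(lambda: deque(maxlen=size))

    def record(self, user_id: str, pin_id: int) -> None:
        self._windows[user_id].append(pin_id)  # oldest item is evicted automatically

    def latest(self, user_id: str) -> List[int]:
        return list(self._windows.get(user_id, ()))


def pack_ids(pin_ids: List[int]) -> bytes:
    """Serialize IDs as little-endian unsigned 64-bit integers, then compress."""
    return zlib.compress(struct.pack(f"<{len(pin_ids)}Q", *pin_ids))


def unpack_ids(blob: bytes) -> List[int]:
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 8}Q", raw))


window = ShortTermWindow(size=3)
for pin in [1, 2, 3, 4]:
    window.record("u1", pin)
```

Storing only IDs keeps the long-term record small; embeddings are looked up at read time, which also ties in with the embedding-versioning concern.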

Model Architecture

The Long-Sequence Problem:
General Search Unit (GSU): For a candidate pin P_c, we search the user history H = [h_1, h_2, ..., h_{1000}].
Hard Search (MVP): Match items in H that share the same category_id as P_c.
Soft Search (Advanced): Use the embedding of P_c to find the top 50 items in H via Inner Product.
Exact Search Unit (ESU):
Take the 50 items from GSU.
Apply Target Attention: The candidate pin P_c acts as the "Query," and the 50 items act as "Keys" and "Values."
This captures the specific relevance of the history to the current pin being ranked.
Core Model: The output of ESU is concatenated with other features (User, Context) and fed into a DCN-v2 (Deep & Cross Network) to capture high-order feature interactions.
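The GSU and ESU steps can be sketched as follows. This is a simplified illustration, not the SIM reference implementation: it uses a single attention head with scaled dot-product scores and no learned projections, represents each history item as a dict with assumed `category` and `emb` keys, and scans the history linearly.

```python
import math
from typing import Dict, List


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def hard_search(history: List[Dict], candidate_category: int, k: int = 50) -> List[Dict]:
    """GSU, hard variant: keep the k most recent items sharing the candidate's category.
    Assumes history is ordered oldest to newest."""
    matches = [h for h in history if h["category"] == candidate_category]
    return matches[-k:]


def soft_search(history: List[Dict], candidate_emb: List[float], k: int = 50) -> List[Dict]:
    """GSU, soft variant: keep the k items with the largest inner product."""
    ranked = sorted(history, key=lambda h: dot(h["emb"], candidate_emb), reverse=True)
    return ranked[:k]


def target_attention(candidate_emb: List[float], selected: List[Dict]) -> List[float]:
    """ESU: candidate is the query; selected items are keys and values."""
    dim = len(candidate_emb)
    if not selected:
        return [0.0] * dim  # no relevant history: fall back to a zero interest vector
    scale = math.sqrt(dim)
    scores = [dot(candidate_emb, h["emb"]) / scale for h in selected]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]  # subtract max for numerical stability
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * h["emb"][i] for w, h in zip(weights, selected)) for i in range(dim)]


history = [
    {"category": 1, "emb": [1.0, 0.0]},
    {"category": 2, "emb": [0.0, 1.0]},
    {"category": 1, "emb": [1.0, 0.0]},
]
subset = hard_search(history, candidate_category=1)
interest = target_attention([1.0, 0.0], subset)
```

The resulting interest vector is what gets concatenated with the user and context features before the DCN-v2 layers.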

Training Pipeline

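Dataset Construction: Use a 7-day window for training. To keep long-sequence training records compact, store only Pin IDs in each record and join them at training time with the "Snapshot" of Pin Embeddings from the day of the event. The point-in-time join avoids leaking embeddings that were trained on later data.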
Negative Sampling: Use "Logged Negatives" (items shown but not clicked).
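Distributed Training: Use Horovod or PyTorch DistributedDataParallel for data-parallel training of the dense layers. Data parallelism replicates the full model on every worker, so the embedding tables for billions of Pins must additionally be sharded across workers (model parallelism) or compressed, for example with ID hashing.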
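A minimal sketch of the point-in-time join, assuming snapshots are available as a mapping from date to `{pin_id: embedding}`. The record layout, the choice to skip history pins missing from the snapshot, and the toy values are all assumptions; at scale this would be a Spark join rather than a Python loop.

```python
from datetime import date
from typing import Dict, List

Embedding = List[float]


def build_example(record: dict, snapshots: Dict[date, Dict[int, Embedding]]) -> dict:
    """Resolve Pin IDs against the embedding snapshot from the event's own day,
    never a later one, so training sees what serving would have seen."""
    snapshot = snapshots[record["event_date"]]
    history_embs = [snapshot[pid] for pid in record["history_ids"] if pid in snapshot]
    return {
        "history_embs": history_embs,
        "candidate_emb": snapshot[record["candidate_id"]],
        "label": record["label"],  # 0 for a logged negative (shown, not engaged)
    }


# Toy snapshots: the same pins have different embeddings on different days.
snapshots = {
    date(2024, 1, 1): {1: [0.1], 2: [0.2]},
    date(2024, 1, 2): {1: [0.9], 2: [0.8]},
}
record = {"event_date": date(2024, 1, 1), "history_ids": [1, 99], "candidate_id": 2, "label": 0}
example = build_example(record, snapshots)
```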

Serving Pipeline

Pattern: Online Request-Response.
Optimization:
GSU is performed using a fast bitmask or hash-map lookup for category matching (Hard Search).
Embedding lookups for the sequence are batched.
Model is served in FP16 (half precision).
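One way to realize the hash-map lookup is to build a per-user category index once per request and reuse it for every candidate. The sketch below assumes the history is a list of `(pin_id, category_id)` pairs ordered oldest to newest; the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def build_category_index(history: List[Tuple[int, int]]) -> Dict[int, List[int]]:
    """Map category_id to the positions in the history that carry it.
    Built once per request in O(N), or precomputed offline with the sequence."""
    index = defaultdict(list)
    for position, (_pin_id, category_id) in enumerate(history):
        index[category_id].append(position)
    return index


def hard_search_positions(index: Dict[int, List[int]], category_id: int, k: int = 50) -> List[int]:
    """Per-candidate lookup: one hash probe, then the k most recent matches."""
    return index.get(category_id, [])[-k:]


history = [(101, 7), (102, 3), (103, 7), (104, 7), (105, 3)]
index = build_category_index(history)
recent_cat7 = hard_search_positions(index, category_id=7, k=2)
```

With the index in place, the GSU cost per candidate no longer depends on the full history length, only on K.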

Evaluation Pipeline

Offline Evaluation: AUC for binary labels. We also track GAUC (Group AUC), the impression-weighted average of per-user AUC, to verify that the model orders candidates well within each user's feed, not just across users.
Online Evaluation: Standard A/B testing framework measuring "Pins Saved per User" and "Long-term Retention."
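GAUC can be computed as sketched below. This version weights each user's AUC by their impression count and skips users whose impressions are all positive or all negative; both are common conventions but are assumptions here, as is the `(user_id, label, score)` row format.

```python
from collections import defaultdict
from typing import Iterable, List, Optional, Tuple


def auc(labels: List[int], scores: List[float]) -> Optional[float]:
    """Pairwise AUC: fraction of (positive, negative) pairs ranked correctly, ties count half."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        return None  # undefined when only one class is present
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gauc(rows: Iterable[Tuple[str, int, float]]) -> Optional[float]:
    """Impression-weighted mean of per-user AUC."""
    by_user = defaultdict(lambda: ([], []))
    for user_id, label, score in rows:
        by_user[user_id][0].append(label)
        by_user[user_id][1].append(score)
    weighted_sum = total_weight = 0.0
    for labels, scores in by_user.values():
        user_auc = auc(labels, scores)
        if user_auc is None:
            continue
        weighted_sum += user_auc * len(labels)
        total_weight += len(labels)
    return weighted_sum / total_weight if total_weight else None


rows = [
    ("a", 1, 0.9), ("a", 0, 0.1),                 # user a: AUC 1.0 over 2 impressions
    ("b", 1, 0.2), ("b", 0, 0.8), ("b", 0, 0.1),  # user b: AUC 0.5 over 3 impressions
    ("c", 1, 0.5),                                # user c: single class, skipped
]
result = gauc(rows)
```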

Monitoring Pipeline

Data Monitoring: Check if the "Sequence Length" distribution shifts (e.g., if a bug causes sequences to be truncated).
Model Monitoring: Monitor the "Attention Weights" in the ESU. If the model starts ignoring the sequence, it may indicate embedding drift.
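One possible check for the sequence-length shift is the Population Stability Index (PSI) between a baseline histogram and the current one. The bucket edges, the toy length samples, and the 0.25 alert threshold (a common rule of thumb) are all assumptions in this sketch.

```python
import math
from typing import List


def histogram(values: List[int], edges: List[int]) -> List[float]:
    """Fraction of values per bucket; edges are ascending lower bounds of buckets 1..n."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[sum(1 for e in edges if v >= e)] += 1
    return [c / len(values) for c in counts]


def psi(expected: List[float], actual: List[float], eps: float = 1e-6) -> float:
    """Population Stability Index; eps guards against empty buckets."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)
        total += (a - e) * math.log(a / e)
    return total


edges = [100, 500, 1000]
baseline = histogram([50, 200, 600, 1000, 1000, 900, 1000, 300], edges)
# Simulated bug: every sequence is truncated to at most 500 items.
truncated = histogram([50, 200, 500, 500, 500, 500, 500, 300], edges)

drift = psi(baseline, truncated)
alert = drift > 0.25  # assumed threshold for a significant shift
```

The same pattern could be applied to the attention-weight distribution in the ESU, for example by tracking its entropy over time.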
Wrap Up

Final Evaluation

Edge Cases: Cold-start users (no sequence).
Fallback: Use demographic-based popularity or "Global Trending" pins.
Trade-offs:
Hard Search vs. Soft Search: Hard search is faster/cheaper; Soft search is more accurate but requires vector search in the ranking loop.
MVP Recommendation: Start with Hard Search (Category matching) for the GSU. It is robust and costs only a hash lookup per candidate.
Distinguishing Insight: In long sequences, time decay is vital. A pin saved 2 years ago is less relevant than one saved yesterday. I would add a Time-Aware Positional Encoding to the ESU to help the model learn the temporal relevance.
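One simple way to expose recency to the ESU is to bucketize each history item's age on a log scale and look up a learned embedding per bucket, which is then added to the item's key. The sketch below shows only the bucketing; the log2-of-hours scheme, the 16 buckets, and the function name are assumptions rather than a prescribed encoding.

```python
import math


def time_gap_bucket(event_ts: int, now_ts: int, num_buckets: int = 16) -> int:
    """Map an item's age (seconds between event and request) to a log-scale bucket.
    Recent items get fine-grained buckets; old items share coarse ones."""
    age_hours = max(0.0, (now_ts - event_ts) / 3600.0)
    return min(num_buckets - 1, int(math.log2(1.0 + age_hours)))


DAY = 86_400
just_now = time_gap_bucket(event_ts=0, now_ts=0)
yesterday = time_gap_bucket(event_ts=0, now_ts=DAY)
two_years_ago = time_gap_bucket(event_ts=0, now_ts=2 * 365 * DAY)
```

Because the bucket index feeds an embedding table, the model learns how much to discount each age range instead of relying on a hand-tuned decay rate.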