The Question

Large-Scale Recommendation Ranking for Long User Sequences

Design the final-stage ranking system for a content discovery platform (like Pinterest) where users have long-term interaction histories (1,000+ events). Your system must efficiently handle these sequences to capture both long-term and short-term interests within a 100ms P99 latency budget. Detail the data and feature pipelines, the specific attention mechanisms used to overcome the computational cost of long sequences, and how the system maintains online/offline consistency for embedding-based features.

SIM, DIN, DCN-v2, Transformer, Kafka, Flink, Spark, Tecton, PyTorch, Horovod

Questions & Insights

Clarifying Questions

Business Goal: Is the primary North Star metric "Save Rate" (high intent) or "Click-Through Rate" (engagement)?

Assumption: We aim to maximize a weighted multi-objective score of CTR and Save Rate.

Constraints & Scale: What is the scale of the user history and the candidate pool for ranking?

Assumption: Ranking ~500–1000 candidate pins. User interaction history can span up to 1,000+ events (long-term sequence).

Latency Budget: What is the P99 latency requirement for the ranking stage?

Assumption: P99 latency for the whole ranking stage < 100ms.

Data Freshness: How quickly must a user's latest interaction influence their recommendations?

Assumption: Near real-time (seconds) for short-term interests, daily for long-term profiling.

Thinking Process

Identify the Bottleneck: Standard Transformer-based self-attention is O(N^2) in the sequence length N. Running it over 1,000+ pins for each of the 500 candidates is computationally prohibitive within a 100ms budget.

Strategy - Filter then Attend: Instead of attending over the whole sequence, I should use a two-stage approach within the ranking model: a fast search (GSU - General Search Unit) to find relevant items from the history, followed by a complex attention mechanism (ESU - Exact Search Unit) on a small subset.

Decoupling: I need to separate the long-term historical features (static/slow-moving) from the short-term session features (dynamic).

Architecture Selection: A Deep Interest Network (DIN) approach is a good baseline, but for "long" sequences, I will propose a Search-based Interest Model (SIM) or a UBR4CTR-style architecture.
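To make the bottleneck concrete, here is a back-of-the-envelope count of query-key dot products per request. It is only a rough sketch: it uses the sizes assumed in this write-up (1,000 history items, 500 candidates, a top-50 subset), ignores constants such as head count and embedding width, and treats the search step as a cheap index lookup that is not counted. The function name is hypothetical.

```python
def attention_dot_products(seq_len: int, num_candidates: int, top_k: int) -> dict:
    """Count query-key dot products per request, ignoring constants and head count."""
    return {
        # Full self-attention over the history, repeated for every candidate.
        "self_attention_per_candidate": num_candidates * seq_len * seq_len,
        # Target attention: one query (the candidate) against the whole history.
        "target_attention_full_history": num_candidates * seq_len,
        # Filter then attend: one query against only the retrieved subset.
        "gsu_then_esu": num_candidates * top_k,
    }


costs = attention_dot_products(seq_len=1000, num_candidates=500, top_k=50)
for name, n in costs.items():
    print(f"{name}: {n:,}")
```

Under these assumptions the filter-then-attend path does 25,000 dot products per request, against 500,000 for target attention over the full history and 500,000,000 for per-candidate self-attention.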

Elite Bonus Points

Negative Augmentation: Incorporating "skipped" pins in the sequence to explicitly model negative preferences, not just positive interactions.

Calibration for Multi-objective: Using a calibration layer (e.g., Platt scaling or Isotonic Regression) to ensure the predicted probabilities of Saves vs. Clicks are on the same scale before weighted summation.

Position Bias Correction: Implementing a shallow "Position Bias" tower during training to prevent the model from learning that "Top items are better simply because they are at the top." The tower is dropped (or fed a fixed position) at serving time.

Embeddings Versioning & Warm-start: When Pin embeddings are retrained, sequence features built on the old embedding space no longer line up with what the ranking model expects. I would implement a "warm-start" mapping or a lightweight residual adapter to keep the long-term sequence meaningful during model transitions.
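The calibration point above can be illustrated with a small Platt-scaling sketch: fit p = sigmoid(a * score + b) per objective on held-out data, then blend the calibrated probabilities. This is an illustration under assumptions, not a production recipe. The toy scores and labels, the learning rate, and the 0.3/0.7 weights are all made up, and in practice I would reach for a library implementation rather than hand-rolled gradient descent.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def fit_platt(scores, labels, lr=0.1, epochs=2000):
    """Fit p = sigmoid(a * score + b) by gradient descent on log loss."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            err = sigmoid(a * s + b) - y  # derivative of log loss w.r.t. the logit
            grad_a += err * s
            grad_b += err
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def blended_score(click_score, save_score, click_cal, save_cal,
                  w_click=0.3, w_save=0.7):
    """Weighted sum of calibrated probabilities (weights are hypothetical)."""
    p_click = sigmoid(click_cal[0] * click_score + click_cal[1])
    p_save = sigmoid(save_cal[0] * save_score + save_cal[1])
    return w_click * p_click + w_save * p_save


# Toy held-out data: raw model scores and observed labels (not linearly separable).
raw_scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
click_cal = fit_platt(raw_scores, [0, 0, 1, 0, 1, 1])
save_cal = fit_platt(raw_scores, [0, 0, 0, 1, 0, 1])

low = blended_score(-1.0, -1.0, click_cal, save_cal)
high = blended_score(1.5, 1.5, click_cal, save_cal)
print(low, high)
```

Because both heads are mapped to probabilities before the weighted sum, the weights express business preference rather than compensating for differently scaled outputs.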

Design Breakdown

Requirements

Product Goal: Deliver highly relevant Pin recommendations that lead to engagement (Saves/Clicks).

Success Metrics:

Online: CTR, Save Rate, Time-spent.

Offline: AUC (for classification), NDCG (for ranking), Recall@K.

Guardrail: P99 Latency < 100ms, Model Training Time, Inference QPS.

System Constraints: 500M+ users, billions of pins. Need to handle "Heavy Users" with years of history.

Data Availability: Real-time clickstream, historical interaction DB (Saves, Close-ups), Pin metadata (Tags, Image Embeddings).

ML Problem Framing

ML Task Type: Point-wise Ranking (Binary Classification).

Prediction Target:

P(\text{Engage} | \text{User}, \text{Pin}, \text{Context})

Inputs:

User: Profile (age, location) + Long-term Interest (1 year history, up to 1,000 events) + Short-term Interest (last 20 interactions).

Item (Candidate Pin): Pin embedding, category, popularity, creator authority.

Context: Device, time, surface (Homefeed vs. Related).

ML Challenges: Long-sequence interaction modeling (the core constraint), data imbalance (saves are rarer than clicks), and feedback loops.

Design Summary & MVP

Concise Summary: We will implement a Search-based Interest Model (SIM). It uses a General Search Unit (GSU) to retrieve the top-K relevant pins from a user's 1,000+ historical actions based on the candidate pin's category/embedding, and an Exact Search Unit (ESU) to perform Multi-Head Attention on that filtered subset.

Model Architecture:

Baseline: Logistic Regression with aggregated history (mean-pooling of last 50 pins).

Target Model: Deep & Cross Network (DCN-v2) combined with a SIM sequence encoder.

Simplicity Audit: This is the simplest way to handle long sequences because it avoids the O(N^2) cost of self-attention over the whole sequence by performing a sub-linear search first.

Architecture Decision Rationale: This satisfies the latency budget while capturing fine-grained user interests that a simple "mean-pooling" would wash out.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Pinterest mobile/web event logs (Pin-click, Pin-save).

Data Ingestion: Kafka for real-time events. Airflow orchestrates daily Spark jobs to sync the Data Lake (S3) with the Data Warehouse (BigQuery/Snowflake).

Data Storage: S3 for raw Parquet files (partitioned by day/hour).

Data Quality: De-duplication of events (e.g., accidental double-clicks) and schema enforcement using Protobuf.
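The de-duplication step could look roughly like the sketch below: drop a repeat of the same (user, pin, action) that arrives within a short window of the last kept event. The Event fields and the 1-second window are assumptions made for illustration; the real event schema would come from the Protobuf definitions.

```python
from typing import Iterable, List, NamedTuple


class Event(NamedTuple):
    # Hypothetical minimal event schema.
    user_id: str
    pin_id: str
    action: str
    ts_ms: int


def dedupe_events(events: Iterable[Event], window_ms: int = 1000) -> List[Event]:
    """Drop repeats of the same (user, pin, action) within window_ms of the last kept one."""
    last_kept = {}
    out = []
    for ev in sorted(events, key=lambda e: e.ts_ms):
        key = (ev.user_id, ev.pin_id, ev.action)
        prev_ts = last_kept.get(key)
        if prev_ts is not None and ev.ts_ms - prev_ts < window_ms:
            continue  # treated as an accidental double-click
        last_kept[key] = ev.ts_ms
        out.append(ev)
    return out


events = [
    Event("u1", "p1", "click", 0),
    Event("u1", "p1", "click", 300),   # duplicate within the window
    Event("u1", "p1", "save", 350),    # different action, kept
    Event("u1", "p1", "click", 2000),  # outside the window, kept
]
clean = dedupe_events(events)
print([(e.action, e.ts_ms) for e in clean])
```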

Feature Pipeline

Feature Definition:

User Long-term Sequence: List of Pin IDs + Timestamps + Interaction Type (last 1,000 items).

User Short-term Sequence: Last 20 items (high signal).

Candidate Pin: ID, Visual Embedding (from a pre-trained Vision Transformer), Category.

Online Feature Pipeline: Flink consumes Kafka to maintain a sliding window of the user's last 20 actions for immediate personalization.

Feature Store: Tecton or Feast. We store the "Long-term" sequence as a compressed list of IDs to save space.
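The two storage patterns above can be sketched in plain Python: a bounded per-user window for the short-term sequence and a packed, compressed ID list for the long-term one. This is only an in-memory stand-in. In the real system the window would live in Flink state and the blob in the feature store, and the class and function names here are hypothetical. Pin IDs are assumed to fit in unsigned 64-bit integers.

```python
import struct
import zlib
from collections import defaultdict, deque

SHORT_TERM_LEN = 20  # matches the "last 20 items" short-term sequence


class ShortTermWindows:
    """Keeps only the most recent pin IDs per user."""

    def __init__(self, maxlen: int = SHORT_TERM_LEN):
        self._windows = defaultdict(lambda: deque(maxlen=maxlen))

    def add(self, user_id: str, pin_id: int) -> None:
        self._windows[user_id].append(pin_id)  # oldest item falls off automatically

    def get(self, user_id: str) -> list:
        return list(self._windows.get(user_id, ()))


def pack_ids(pin_ids) -> bytes:
    """Pack IDs as little-endian uint64 and compress the result."""
    raw = struct.pack(f"<{len(pin_ids)}Q", *pin_ids)
    return zlib.compress(raw)


def unpack_ids(blob: bytes) -> list:
    raw = zlib.decompress(blob)
    return list(struct.unpack(f"<{len(raw) // 8}Q", raw))


windows = ShortTermWindows()
for pin_id in range(25):
    windows.add("u1", pin_id)
recent = windows.get("u1")

long_term = list(range(10_000, 11_000))  # 1,000 hypothetical pin IDs
blob = pack_ids(long_term)
print(len(recent), len(blob))
```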

Model Architecture

The Long-Sequence Problem:

General Search Unit (GSU): For a candidate pin P_c, we search the user history H = [h_1, h_2, ..., h_{1000}].

Hard Search (MVP): Match items in H that share the same category_id as P_c.

Soft Search (Advanced): Use the embedding of P_c to find the top 50 items in H via Inner Product.
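Both GSU variants can be sketched in a few lines. The history layout (a list of dicts with pin_id, category_id and emb, ordered oldest first) is an assumption made for this example, and the soft search here is a brute-force scan rather than an approximate nearest-neighbour index.

```python
import heapq


def hard_search(history, candidate_category, max_items=50):
    """Keep the most recent history items in the candidate's category."""
    matches = [h for h in history if h["category_id"] == candidate_category]
    return matches[-max_items:]  # history is assumed to be ordered oldest first


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def soft_search(history, candidate_emb, k=50):
    """Keep the k history items with the largest inner product to the candidate."""
    return heapq.nlargest(k, history, key=lambda h: dot(h["emb"], candidate_emb))


history = [
    {"pin_id": 1, "category_id": "food", "emb": [1.0, 0.0]},
    {"pin_id": 2, "category_id": "travel", "emb": [0.0, 1.0]},
    {"pin_id": 3, "category_id": "food", "emb": [0.8, 0.2]},
    {"pin_id": 4, "category_id": "diy", "emb": [-1.0, 0.0]},
]
hard = hard_search(history, "food")
soft = soft_search(history, candidate_emb=[1.0, 0.1], k=2)
print([h["pin_id"] for h in hard], [h["pin_id"] for h in soft])
```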

Exact Search Unit (ESU):

Take the 50 items from GSU.

Apply Target Attention: The candidate pin P_c acts as the "Query," and the 50 items act as "Keys" and "Values."

This captures the specific relevance of the history to the current pin being ranked.
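A minimal sketch of the target-attention step, assuming a single head, scaled dot-product scoring and no learned projections (a real ESU would add projection matrices, multiple heads and possibly an MLP-based scoring unit as in DIN):

```python
import math


def target_attention(query, keys, values):
    """Single-query scaled dot-product attention; returns (pooled_vector, weights)."""
    d = len(query)
    logits = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(values[0])
    pooled = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return pooled, weights


candidate = [1.0, 0.0]                           # embedding of P_c (the query)
subset = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]   # items returned by the GSU
pooled, weights = target_attention(candidate, subset, subset)
print(weights)
```

The cost is linear in the size of the retrieved subset, which is why the GSU cap (50 items here) directly controls ESU latency.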

Core Model: The output of ESU is concatenated with other features (User, Context) and fed into a DCN-v2 (Deep & Cross Network) to capture high-order feature interactions.
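For reference, one DCN-v2 cross layer computes x_{l+1} = x_0 * (W x_l + b) + x_l, where * is the element-wise product and x_0 is the concatenated input. The sketch below uses plain Python lists and hand-picked weights purely to show the arithmetic; in practice this would be a PyTorch module with learned W and b.

```python
def cross_layer(x0, xl, W, b):
    """One DCN-v2 cross layer: x0 * (W @ xl + b) + xl, with * element-wise."""
    wx = [sum(w * x for w, x in zip(row, xl)) + bias for row, bias in zip(W, b)]
    return [x0_i * wx_i + xl_i for x0_i, wx_i, xl_i in zip(x0, wx, xl)]


x0 = [1.0, 2.0]                # concatenated ESU output + user/context features
W = [[1.0, 0.0], [0.0, 1.0]]   # identity weights, chosen only for readability
b = [0.0, 0.0]

x1 = cross_layer(x0, x0, W, b)  # first layer takes x0 as its input
x2 = cross_layer(x0, x1, W, b)  # each further layer raises the interaction order
print(x1, x2)
```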

Training Pipeline

Dataset Construction: Use a 7-day window for training. To handle "Long Sequence," we store only Pin IDs in the training records and join them with a "Snapshot" of Pin Embeddings from that day to avoid leakage.

Negative Sampling: Use "Logged Negatives" (items shown but not clicked).

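Distributed Training: Use Horovod or PyTorch DistributedDataParallel for data-parallel training of the dense layers. The embedding tables for billions of Pins will be too large to replicate on every worker, so they need to be sharded across workers (model parallelism) or served from a parameter server.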
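The Dataset Construction step above amounts to a point-in-time join: each training record carries only IDs, and embeddings are looked up in the snapshot published for that record's day. A minimal sketch, assuming a dict-of-dicts snapshot store and hypothetical field names (event_date, candidate_pin_id, history_pin_ids):

```python
def build_training_row(record, snapshots):
    """Join a record's pin IDs with the embedding snapshot of the record's own day."""
    table = snapshots[record["event_date"]]  # never read a later snapshot
    history_embs = [
        table[pid] for pid in record["history_pin_ids"] if pid in table
    ]  # pins missing from that day's snapshot are skipped
    return {
        "label": record["label"],
        "candidate_emb": table[record["candidate_pin_id"]],
        "history_embs": history_embs,
    }


# Toy snapshots keyed by day; values differ across days on purpose.
snapshots = {
    "2024-01-01": {1: [0.1], 2: [0.2], 9: [0.9]},
    "2024-01-02": {1: [1.1], 2: [1.2], 9: [1.9]},
}
record = {
    "event_date": "2024-01-01",
    "candidate_pin_id": 9,
    "history_pin_ids": [1, 2, 3],
    "label": 1,
}
row = build_training_row(record, snapshots)
print(row)
```

Serving should read the same snapshot version that the deployed model was trained against, which is the online/offline consistency requirement from the question.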

Serving Pipeline

Pattern: Online Request-Response.

Optimization:

GSU is performed using a fast bitmask or hash-map lookup for category matching (Hard Search).

Embedding lookups for the sequence are batched.

Model weights are cast to FP16 (half precision) for inference.
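The hash-map variant of Hard Search can be sketched as follows: build a category-to-positions index over the user's history once per request (or precompute it offline), then each of the ~500 candidates costs a single dictionary lookup instead of a scan. The (pin_id, category_id) tuple layout and function names are assumptions for illustration.

```python
from collections import defaultdict


def build_category_index(history):
    """Map category_id -> positions in the history (oldest first)."""
    index = defaultdict(list)
    for pos, (_pin_id, category_id) in enumerate(history):
        index[category_id].append(pos)
    return index


def lookup(index, category_id, max_items=50):
    """Positions of the most recent items in the candidate's category."""
    return index.get(category_id, [])[-max_items:]


history = [(101, "food"), (102, "travel"), (103, "food"), (104, "diy"), (105, "food")]
index = build_category_index(history)       # one pass over the history
candidates = [("c1", "food"), ("c2", "diy"), ("c3", "fashion")]
hits = {cid: lookup(index, cat, max_items=2) for cid, cat in candidates}
print(hits)
```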

Evaluation Pipeline

Offline Evaluation: AUC for binary labels. We also track GAUC (Group AUC) per user to ensure the model ranks better for individuals, not just globally.

Online Evaluation: Standard A/B testing framework measuring "Pins Saved per User" and "Long-term Retention."
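The GAUC metric mentioned under Offline Evaluation can be computed roughly as below. This sketch weights each user's AUC by their impression count and skips users whose impressions are all positive or all negative, which is one common convention; the pairwise AUC is quadratic and only meant for small illustrative inputs.

```python
from collections import defaultdict


def auc(labels, scores):
    """Pairwise AUC; returns None when only one class is present."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def gauc(rows):
    """rows: (user_id, label, score). Impression-weighted mean of per-user AUC."""
    by_user = defaultdict(lambda: ([], []))
    for user_id, label, score in rows:
        by_user[user_id][0].append(label)
        by_user[user_id][1].append(score)
    num = den = 0.0
    for labels, scores in by_user.values():
        user_auc = auc(labels, scores)
        if user_auc is None:
            continue  # single-class users carry no ranking signal
        num += len(labels) * user_auc
        den += len(labels)
    return num / den if den else None


rows = [
    ("u1", 1, 0.9), ("u1", 0, 0.1),                  # per-user AUC = 1.0
    ("u2", 1, 0.2), ("u2", 0, 0.8), ("u2", 0, 0.1),  # per-user AUC = 0.5
    ("u3", 1, 0.7),                                  # skipped: positives only
]
result = gauc(rows)
print(result)
```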

Monitoring Pipeline

Data Monitoring: Check if the "Sequence Length" distribution shifts (e.g., if a bug causes sequences to be truncated).

Model Monitoring: Monitor the "Attention Weights" in the ESU. If the model starts ignoring the sequence, it may indicate embedding drift.
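One way to check the sequence-length distribution is a Population Stability Index (PSI) over length buckets, comparing a baseline window with the current one. The bucket edges and the toy data below are assumptions; alert thresholds would need to be tuned on real traffic.

```python
import math
from bisect import bisect_right


def bucket_shares(lengths, edges):
    """Share of sequences falling into each length bucket defined by edges."""
    counts = [0] * (len(edges) + 1)
    for n in lengths:
        counts[bisect_right(edges, n)] += 1
    return [c / len(lengths) for c in counts]


def psi(expected, actual, eps=1e-6):
    """Population Stability Index between two bucketed distributions."""
    total = 0.0
    for e, a in zip(expected, actual):
        e, a = max(e, eps), max(a, eps)  # avoid log(0) for empty buckets
        total += (a - e) * math.log(a / e)
    return total


edges = [50, 500, 900]
baseline_lengths = [10] * 25 + [200] * 25 + [600] * 25 + [1000] * 25
truncated_lengths = [min(n, 500) for n in baseline_lengths]  # simulated truncation bug

baseline = bucket_shares(baseline_lengths, edges)
current = bucket_shares(truncated_lengths, edges)
print(psi(baseline, baseline), psi(baseline, current))
```

A truncation bug empties the top bucket, so the PSI jumps well above zero even though the mean length may move only modestly.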

Wrap Up

Final Evaluation

Edge Cases: Cold-start users (no sequence).

Fallback: Use demographic-based popularity or "Global Trending" pins.

Trade-offs:

Hard Search vs. Soft Search: Hard search is faster/cheaper; Soft search is more accurate but requires vector search in the ranking loop.

MVP Recommendation: Start with Hard Search (Category matching) for the GSU. It is robust and incredibly fast.

Distinguishing Insight: In long sequences, time decay is vital. A pin saved 2 years ago is less relevant than one saved yesterday. I would add a Time-Aware Positional Encoding to the ESU to help the model learn the temporal relevance.
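A simple way to feed recency into the ESU is to bucket the elapsed time of each history item on a log scale and use the bucket as an index into a learned embedding table that is added to the item embedding. The sketch below only computes the bucket index. The log2-of-hours scheme and the 16-bucket cap are assumptions, and the embedding table itself would be trained with the model.

```python
import math

HOUR_S = 3600
DAY_S = 24 * HOUR_S


def time_bucket(now_s: int, event_s: int, num_buckets: int = 16) -> int:
    """Log2 bucket of elapsed hours, clipped to the last bucket."""
    hours = max(now_s - event_s, 0) / HOUR_S
    return min(int(math.log2(hours + 1.0)), num_buckets - 1)


now = 1_700_000_000
just_now = time_bucket(now, now)
yesterday = time_bucket(now, now - DAY_S)
two_years_ago = time_bucket(now, now - 730 * DAY_S)
print(just_now, yesterday, two_years_ago)
```

Log-scale buckets give fine resolution to recent events and coarse resolution to old ones, so the model can learn its own decay curve instead of having a fixed one imposed.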