The Question
ML DesignLarge-Scale Social Link Prediction System (People You May Know)
Design a high-scale 'People You May Know' (PYMK) recommendation system for a social network with over 1 billion users. Your design should handle massive graph data, addressing the challenges of candidate retrieval (Stage 1) and precision ranking (Stage 2). Detail how you would implement a low-latency serving architecture (P99 < 150ms) and a robust data pipeline that handles real-time friendship updates. Focus on feature engineering for graph-based signals, handling class imbalance in training, and defining the success metrics for business growth and system reliability.
LightGBM
XGBoost
Spark GraphX
Apache Flink
Kafka
Redis
Feast
LambdaMART
Questions & Insights
Clarifying Questions
Clarifying Questions & Constraints:
Business Goal: Is the primary goal to increase network density (total friend connections) or user retention (DAU)? (Assume: Retention through meaningful connections).
Scale: What is the size of the user graph and the expected QPS? (Assume: 500M DAU, 1B total users, 50k QPS).
Latency: What is the P99 latency budget for showing suggestions on the home feed? (Assume: 150ms).
Negative Feedback: Do we have a "dismiss" signal (X button) to treat as a strong negative? (Assume: Yes).
Definition of Success: Is a "send request" enough, or do we optimize for "accepted request"? (Assume: Accepted request).
Assumptions:
The user corpus is 1B nodes.
We have access to a graph of existing friendships.
We assume a two-stage approach: Retrieval (filtering billions to hundreds) and Ranking (scoring hundreds for the top 10).
Thinking Process
Identify the Core Bottleneck: The primary challenge is the "needle in the haystack" problem in a massive graph. I cannot rank 1B users for every user. I must use graph heuristics (triadic closure) for efficient retrieval.
Funnel Strategy: Design a multi-stage pipeline. Stage 1 (Retrieval) uses high-recall heuristics like "friends of friends." Stage 2 (Ranking) uses a high-precision ML model to sort these candidates.
Feature Importance: Recognize that graph-structural features (mutual friend count) are significantly more predictive than static profile features.
Cold Start: Address how to suggest friends for a brand-new user with zero connections (e.g., using contact uploads or location/interests).
Elite Bonus Points
Triadic Closure Theory: Leveraging the sociological principle that if A knows B and B knows C, there is a high probability A will know C. This forms the basis of the "Mutual Friend" heuristic.
Graph Embedding Versioning: Discussing how to handle "embedding drift" where user embeddings change as they add friends, requiring incremental updates rather than full batch re-trains.
Position Bias Correction: Implementing a bias-correction term in the ranking model to account for the fact that the first suggested person is naturally more likely to be clicked/added.
Delayed Feedback Loop: Handling the fact that a friend request might be accepted 3 days after it was sent, requiring a robust join-logic for training labels.
Design Breakdown
Requirements
Product Goal: Increase meaningful user connections to drive platform stickiness and retention.
Success Metrics:
Online Metrics: Friend Request Acceptance Rate (FRAR), Net New Connections per User.
Offline Metrics: PR-AUC (Precision-Recall Area Under Curve), NDCG@10 (Normalized Discounted Cumulative Gain).
Guardrail Metrics: Request Ignore/Dismiss Rate, Latency (P99 < 150ms).
System Constraints: 1B users, 50k QPS, real-time graph updates (new friendships should influence suggestions quickly).
Data Availability: User profiles, friendship graph, contact lists (if opted-in), interaction logs (likes, comments).
ML Problem Framing
ML Task Type: Binary Classification (Link Prediction).
Prediction Target: P(\text{Accept} | \text{Source User, Target User, Context}).
Inputs:
User/Item Features: Age, Location, Work/Education, Privacy settings.
Graph Features: Number of mutual friends, Jaccard similarity of friend lists, PageRank scores.
Context: Device, Time of Day, Home Feed vs. Profile View.
Outputs: Probability score [0, 1].
ML Challenges: Highly imbalanced labels (most pairs are NOT friends), extreme sparsity, and feedback loops (the model only learns from what it shows).
Design Summary & MVP
Concise Summary: A two-stage system using "Friends-of-Friends" (FoF) for candidate retrieval followed by a Gradient Boosted Decision Tree (GBDT) for precision ranking.
Model Architecture & Selection:
Baseline Model: Heuristic-based (Sort by number of mutual friends).
Target Model: LightGBM or XGBoost for ranking candidates.
Choice Rationale: GBDTs are exceptional for tabular data with heterogeneous features (graph metrics + profile metadata) and are computationally efficient for scoring ~500 candidates within the latency budget.
ML Life Cycle Summary: Spark-based graph processing generates candidates; Flink processes real-time friendship events to update features; LightGBM ranks in real-time.
Simplicity Audit: Avoids complex GNNs (Graph Neural Networks) initially. FoF covers 80% of relevant connections with much lower infra complexity.
Architecture Decision Rationale:
Why this architecture?: FoF is highly parallelizable and captures the strongest signal for social links.
Requirement Satisfaction: Scalable to 1B users via distributed graph processing and meets 150ms latency by pre-calculating candidates.
System Architecture
Pipeline Deep Dive
Data Pipeline
Data Source: User profiles (PostgreSQL), Friendship Graph (Neo4j or sharded MySQL), and Clickstream (Kafka).
Data Ingestion: Use Kafka for real-time events (friend requests) and periodic ETL (Airflow) for graph snapshots.
Data Storage: S3 for the Data Lake. Graph data stored in an adjacency list format (Parquet) optimized for bulk processing.
Data Processing: Spark GraphX or Apache Flink. Spark is used to compute global graph metrics (e.g., PageRank) weekly, while Flink updates "Mutual Friend" counts incrementally.
Data Quality: De-duplication of friend-request events and schema validation (Protobuf) to ensure feature consistency.
Feature Pipeline
Feature Definition:
Structural: Mutual friend count (strongest), HITS scores, Triangle counts.
Affinity: Common schools, shared workplaces, geographic proximity (City/Region).
Activity: Profile views in the last 7 days, "likes" on common posts.
Offline Feature Pipeline: Spark job calculates N-hop neighbors and static profile similarities.
Online Feature Pipeline: Flink consumes the "New Friendship" stream to increment/decrement mutual friend counts in a low-latency Key-Value store (Redis/Cassandra).
Feature Store: Use a Feature Store (e.g., Feast) to ensure that the "Mutual Friend Count" used during training exactly matches the one used during serving (avoiding training-serving skew).
Model Architecture
Problem Formulation: Supervised binary classification. Target:
is_accepted.Candidate Model Families:
Linear Models: Too simple; fails to capture interactions between features (e.g., same school AND same age).
GBDT (LightGBM): Chosen MVP. Handles missing values, non-linear relationships, and is very fast for inference.
Deep Learning (Two-Tower): Better for unstructured data, but harder to tune and deploy for an MVP.
Architecture Design: Features are concatenated into a flat vector. We use a LambdaMART objective for ranking if we want to optimize the list order directly, or LogLoss for pointwise probability.
Complexity vs. Latency: The model will score ~200-500 candidates. A LightGBM with 500 trees and depth 6 can easily achieve < 10ms inference.
Training Pipeline
Dataset Construction:
Positives: Sent requests that were accepted.
Negatives: Suggestions shown but ignored or dismissed.
Downsampling: Negative samples are usually 100x more than positives; we downsample negatives to a 10:1 ratio.
Data Splitting: Time-based split. Train on weeks 1-4, validate on week 5. Random splitting leads to data leakage due to the temporal nature of friendships.
Retraining Strategy: Weekly batch retraining to capture new users and shifting social trends.
Serving Pipeline
Serving Pattern:
Retrieval: Use a fast C++ service to query a pre-computed "Friends of Friends" list from a cache (Redis).
Ranking: Fetch features from the Online Feature Store and call the LightGBM model hosted on a high-throughput inference server (Triton or SageMaker).
Latency Optimization: Pre-compute top 500 candidates for popular users. Only perform real-time ranking for active sessions.
Reliability: If the ranking service fails, fallback to a simple heuristic: "Sort by Mutual Friend Count."
Evaluation Pipeline
Offline Evaluation:
Recall@K: Percentage of actual future friendships captured in the retrieval stage.
PR-AUC: Since classes are imbalanced, PR-AUC is more informative than ROC-AUC.
Online Evaluation:
A/B Testing: Randomly assign users to the new ML model vs. the heuristic baseline.
Metric: "Friend Request Acceptance Rate" (Accepts / Suggestions).
Monitoring Pipeline
System Monitoring: Track the P99 of the retrieval vs. ranking stages.
Data Monitoring: Monitor the distribution of "Mutual Friend Count." If it drops significantly, the graph ingestion pipeline might be broken.
Model Monitoring: Prediction Drift—if the average P(\text{accept}) drops from 0.05 to 0.01, the model may be stale or feature distributions have shifted.
Wrap Up
Final Evaluation
Observability: Use a dashboard to track "Connection Quality"—do people message each other after becoming friends?
Feedback Loop: New "Accepts" are immediately fed back into the graph, which updates the FoF retrieval for the next session.
Edge Cases:
Cold Start: For new users, use "Popular in your city" or "Common Interests" as the retrieval mechanism.
Trade-offs:
Accuracy vs. Latency: We only rank 500 candidates to save latency, potentially missing a "long-tail" friend that a more exhaustive search would find.
Distinguishing Insights:
Privacy: Implement a "Block-list" check at the very end of the serving pipeline to ensure suggested people haven't blocked the user.
Exploration vs. Exploitation: Occasionally inject a "random" or "distant" connection (3rd degree) to help the user expand their network beyond their immediate bubble.