The Question

Large-Scale Social Link Prediction System (People You May Know)

Design a high-scale 'People You May Know' (PYMK) recommendation system for a social network with over 1 billion users. Your design should handle massive graph data, addressing the challenges of candidate retrieval (Stage 1) and precision ranking (Stage 2). Detail how you would implement a low-latency serving architecture (P99 < 150ms) and a robust data pipeline that handles real-time friendship updates. Focus on feature engineering for graph-based signals, handling class imbalance in training, and defining the success metrics for business growth and system reliability.

LightGBM

XGBoost

Spark GraphX

Apache Flink

Kafka

Redis

Feast

LambdaMART

Questions & Insights

Clarifying Questions

Clarifying Questions & Constraints:

Business Goal: Is the primary goal to increase network density (total friend connections) or user retention (DAU)? (Assume: Retention through meaningful connections).

Scale: What is the size of the user graph and the expected QPS? (Assume: 500M DAU, 1B total users, 50k QPS).

Latency: What is the P99 latency budget for showing suggestions on the home feed? (Assume: 150ms).

Negative Feedback: Do we have a "dismiss" signal (X button) to treat as a strong negative? (Assume: Yes).

Definition of Success: Is a "send request" enough, or do we optimize for "accepted request"? (Assume: Accepted request).

Assumptions:

The user corpus is 1B nodes.

We have access to a graph of existing friendships.

We assume a two-stage approach: Retrieval (filtering billions to hundreds) and Ranking (scoring hundreds for the top 10).

Thinking Process

Identify the Core Bottleneck: The primary challenge is the "needle in the haystack" problem in a massive graph. I cannot rank 1B users for every user. I must use graph heuristics (triadic closure) for efficient retrieval.

Funnel Strategy: Design a multi-stage pipeline. Stage 1 (Retrieval) uses high-recall heuristics like "friends of friends." Stage 2 (Ranking) uses a high-precision ML model to sort these candidates.

Feature Importance: Recognize that graph-structural features (mutual friend count) are significantly more predictive than static profile features.

Cold Start: Address how to suggest friends for a brand-new user with zero connections (e.g., using contact uploads or location/interests).

Elite Bonus Points

Triadic Closure Theory: Leveraging the sociological principle that if A knows B and B knows C, there is a high probability A will know C. This forms the basis of the "Mutual Friend" heuristic.

Graph Embedding Versioning: Discussing how to handle "embedding drift" where user embeddings change as they add friends, requiring incremental updates rather than full batch re-trains.

Position Bias Correction: Implementing a bias-correction term in the ranking model to account for the fact that the first suggested person is naturally more likely to be clicked/added.

Delayed Feedback Loop: Handling the fact that a friend request might be accepted 3 days after it was sent, requiring a robust join-logic for training labels.

Design Breakdown

Requirements

Product Goal: Increase meaningful user connections to drive platform stickiness and retention.

Success Metrics:

Online Metrics: Friend Request Acceptance Rate (FRAR), Net New Connections per User.

Offline Metrics: PR-AUC (Precision-Recall Area Under Curve), NDCG@10 (Normalized Discounted Cumulative Gain).

Guardrail Metrics: Request Ignore/Dismiss Rate, Latency (P99 < 150ms).

System Constraints: 1B users, 50k QPS, real-time graph updates (new friendships should influence suggestions quickly).

Data Availability: User profiles, friendship graph, contact lists (if opted-in), interaction logs (likes, comments).

ML Problem Framing

ML Task Type: Binary Classification (Link Prediction).

Prediction Target:

P(\text{Accept} | \text{Source User, Target User, Context})

Inputs:

User/Item Features: Age, Location, Work/Education, Privacy settings.

Graph Features: Number of mutual friends, Jaccard similarity of friend lists, PageRank scores.

Context: Device, Time of Day, Home Feed vs. Profile View.

Outputs: Probability score

[0, 1]

ML Challenges: Highly imbalanced labels (most pairs are NOT friends), extreme sparsity, and feedback loops (the model only learns from what it shows).

Design Summary & MVP

Concise Summary: A two-stage system using "Friends-of-Friends" (FoF) for candidate retrieval followed by a Gradient Boosted Decision Tree (GBDT) for precision ranking.

Model Architecture & Selection:

Baseline Model: Heuristic-based (Sort by number of mutual friends).

Target Model: LightGBM or XGBoost for ranking candidates.

Choice Rationale: GBDTs are exceptional for tabular data with heterogeneous features (graph metrics + profile metadata) and are computationally efficient for scoring ~500 candidates within the latency budget.

ML Life Cycle Summary: Spark-based graph processing generates candidates; Flink processes real-time friendship events to update features; LightGBM ranks in real-time.

Simplicity Audit: Avoids complex GNNs (Graph Neural Networks) initially. FoF covers 80% of relevant connections with much lower infra complexity.

Architecture Decision Rationale:

Why this architecture?: FoF is highly parallelizable and captures the strongest signal for social links.

Requirement Satisfaction: Scalable to 1B users via distributed graph processing and meets 150ms latency by pre-calculating candidates.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: User profiles (PostgreSQL), Friendship Graph (Neo4j or sharded MySQL), and Clickstream (Kafka).

Data Ingestion: Use Kafka for real-time events (friend requests) and periodic ETL (Airflow) for graph snapshots.

Data Storage: S3 for the Data Lake. Graph data stored in an adjacency list format (Parquet) optimized for bulk processing.

Data Processing: Spark GraphX or Apache Flink. Spark is used to compute global graph metrics (e.g., PageRank) weekly, while Flink updates "Mutual Friend" counts incrementally.

Data Quality: De-duplication of friend-request events and schema validation (Protobuf) to ensure feature consistency.

Feature Pipeline

Feature Definition:

Structural: Mutual friend count (strongest), HITS scores, Triangle counts.

Affinity: Common schools, shared workplaces, geographic proximity (City/Region).

Activity: Profile views in the last 7 days, "likes" on common posts.

Offline Feature Pipeline: Spark job calculates

N

-hop neighbors and static profile similarities.

Online Feature Pipeline: Flink consumes the "New Friendship" stream to increment/decrement mutual friend counts in a low-latency Key-Value store (Redis/Cassandra).

Feature Store: Use a Feature Store (e.g., Feast) to ensure that the "Mutual Friend Count" used during training exactly matches the one used during serving (avoiding training-serving skew).

Model Architecture

Problem Formulation: Supervised binary classification. Target: is_accepted.

Candidate Model Families:

Linear Models: Too simple; fails to capture interactions between features (e.g., same school AND same age).

GBDT (LightGBM): Chosen MVP. Handles missing values, non-linear relationships, and is very fast for inference.

Deep Learning (Two-Tower): Better for unstructured data, but harder to tune and deploy for an MVP.

Architecture Design: Features are concatenated into a flat vector. We use a LambdaMART objective for ranking if we want to optimize the list order directly, or LogLoss for pointwise probability.

Complexity vs. Latency: The model will score ~200-500 candidates. A LightGBM with 500 trees and depth 6 can easily achieve < 10ms inference.

Training Pipeline

Dataset Construction:

Positives: Sent requests that were accepted.

Negatives: Suggestions shown but ignored or dismissed.

Downsampling: Negative samples are usually 100x more than positives; we downsample negatives to a 10:1 ratio.

Data Splitting: Time-based split. Train on weeks 1-4, validate on week 5. Random splitting leads to data leakage due to the temporal nature of friendships.

Retraining Strategy: Weekly batch retraining to capture new users and shifting social trends.

Serving Pipeline

Serving Pattern:

Retrieval: Use a fast C++ service to query a pre-computed "Friends of Friends" list from a cache (Redis).

Ranking: Fetch features from the Online Feature Store and call the LightGBM model hosted on a high-throughput inference server (Triton or SageMaker).

Latency Optimization: Pre-compute top 500 candidates for popular users. Only perform real-time ranking for active sessions.

Reliability: If the ranking service fails, fallback to a simple heuristic: "Sort by Mutual Friend Count."

Evaluation Pipeline

Offline Evaluation:

Recall@K: Percentage of actual future friendships captured in the retrieval stage.

PR-AUC: Since classes are imbalanced, PR-AUC is more informative than ROC-AUC.

Online Evaluation:

A/B Testing: Randomly assign users to the new ML model vs. the heuristic baseline.

Metric: "Friend Request Acceptance Rate" (Accepts / Suggestions).

Monitoring Pipeline

System Monitoring: Track the P99 of the retrieval vs. ranking stages.

Data Monitoring: Monitor the distribution of "Mutual Friend Count." If it drops significantly, the graph ingestion pipeline might be broken.

Model Monitoring: Prediction Drift—if the average

P(\text{accept})

drops from 0.05 to 0.01, the model may be stale or feature distributions have shifted.

Wrap Up

Final Evaluation

Observability: Use a dashboard to track "Connection Quality"—do people message each other after becoming friends?

Feedback Loop: New "Accepts" are immediately fed back into the graph, which updates the FoF retrieval for the next session.

Edge Cases:

Cold Start: For new users, use "Popular in your city" or "Common Interests" as the retrieval mechanism.

Trade-offs:

Accuracy vs. Latency: We only rank 500 candidates to save latency, potentially missing a "long-tail" friend that a more exhaustive search would find.

Distinguishing Insights:

Privacy: Implement a "Block-list" check at the very end of the serving pipeline to ensure suggested people haven't blocked the user.

Exploration vs. Exploitation: Occasionally inject a "random" or "distant" connection (3rd degree) to help the user expand their network beyond their immediate bubble.