DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
ML Design

Large-Scale Social Link Prediction System (People You May Know)

Design a high-scale 'People You May Know' (PYMK) recommendation system for a social network with over 1 billion users. Your design should handle massive graph data, addressing the challenges of candidate retrieval (Stage 1) and precision ranking (Stage 2). Detail how you would implement a low-latency serving architecture (P99 < 150ms) and a robust data pipeline that handles real-time friendship updates. Focus on feature engineering for graph-based signals, handling class imbalance in training, and defining the success metrics for business growth and system reliability.
LightGBM
XGBoost
Spark GraphX
Apache Flink
Kafka
Redis
Feast
LambdaMART
Questions & Insights

Clarifying Questions

Clarifying Questions & Constraints:
Business Goal: Is the primary goal to increase network density (total friend connections) or user retention (DAU)? (Assume: Retention through meaningful connections).
Scale: What is the size of the user graph and the expected QPS? (Assume: 500M DAU, 1B total users, 50k QPS).
Latency: What is the P99 latency budget for showing suggestions on the home feed? (Assume: 150ms).
Negative Feedback: Do we have a "dismiss" signal (X button) to treat as a strong negative? (Assume: Yes).
Definition of Success: Is a "send request" enough, or do we optimize for "accepted request"? (Assume: Accepted request).
Assumptions:
The user corpus is 1B nodes.
We have access to a graph of existing friendships.
We assume a two-stage approach: Retrieval (filtering billions to hundreds) and Ranking (scoring hundreds for the top 10).

Thinking Process

Identify the Core Bottleneck: The primary challenge is the "needle in the haystack" problem in a massive graph. I cannot rank 1B users for every user. I must use graph heuristics (triadic closure) for efficient retrieval.
Funnel Strategy: Design a multi-stage pipeline. Stage 1 (Retrieval) uses high-recall heuristics like "friends of friends." Stage 2 (Ranking) uses a high-precision ML model to sort these candidates.
Feature Importance: Recognize that graph-structural features (mutual friend count) are significantly more predictive than static profile features.
Cold Start: Address how to suggest friends for a brand-new user with zero connections (e.g., using contact uploads or location/interests).

Elite Bonus Points

Triadic Closure Theory: Leveraging the sociological principle that if A knows B and B knows C, there is a high probability A will know C. This forms the basis of the "Mutual Friend" heuristic.
Graph Embedding Versioning: Discussing how to handle "embedding drift" where user embeddings change as they add friends, requiring incremental updates rather than full batch re-trains.
Position Bias Correction: Implementing a bias-correction term in the ranking model to account for the fact that the first suggested person is naturally more likely to be clicked/added.
Delayed Feedback Loop: Handling the fact that a friend request might be accepted 3 days after it was sent, requiring a robust join-logic for training labels.
Design Breakdown

Requirements

Product Goal: Increase meaningful user connections to drive platform stickiness and retention.
Success Metrics:
Online Metrics: Friend Request Acceptance Rate (FRAR), Net New Connections per User.
Offline Metrics: PR-AUC (Precision-Recall Area Under Curve), NDCG@10 (Normalized Discounted Cumulative Gain).
Guardrail Metrics: Request Ignore/Dismiss Rate, Latency (P99 < 150ms).
System Constraints: 1B users, 50k QPS, real-time graph updates (new friendships should influence suggestions quickly).
Data Availability: User profiles, friendship graph, contact lists (if opted-in), interaction logs (likes, comments).

ML Problem Framing

ML Task Type: Binary Classification (Link Prediction).
Prediction Target: P(\text{Accept} | \text{Source User, Target User, Context}).
Inputs:
User/Item Features: Age, Location, Work/Education, Privacy settings.
Graph Features: Number of mutual friends, Jaccard similarity of friend lists, PageRank scores.
Context: Device, Time of Day, Home Feed vs. Profile View.
Outputs: Probability score [0, 1].
ML Challenges: Highly imbalanced labels (most pairs are NOT friends), extreme sparsity, and feedback loops (the model only learns from what it shows).

Design Summary & MVP

Concise Summary: A two-stage system using "Friends-of-Friends" (FoF) for candidate retrieval followed by a Gradient Boosted Decision Tree (GBDT) for precision ranking.
Model Architecture & Selection:
Baseline Model: Heuristic-based (Sort by number of mutual friends).
Target Model: LightGBM or XGBoost for ranking candidates.
Choice Rationale: GBDTs are exceptional for tabular data with heterogeneous features (graph metrics + profile metadata) and are computationally efficient for scoring ~500 candidates within the latency budget.
ML Life Cycle Summary: Spark-based graph processing generates candidates; Flink processes real-time friendship events to update features; LightGBM ranks in real-time.
Simplicity Audit: Avoids complex GNNs (Graph Neural Networks) initially. FoF covers 80% of relevant connections with much lower infra complexity.
Architecture Decision Rationale:
Why this architecture?: FoF is highly parallelizable and captures the strongest signal for social links.
Requirement Satisfaction: Scalable to 1B users via distributed graph processing and meets 150ms latency by pre-calculating candidates.

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: User profiles (PostgreSQL), Friendship Graph (Neo4j or sharded MySQL), and Clickstream (Kafka).
Data Ingestion: Use Kafka for real-time events (friend requests) and periodic ETL (Airflow) for graph snapshots.
Data Storage: S3 for the Data Lake. Graph data stored in an adjacency list format (Parquet) optimized for bulk processing.
Data Processing: Spark GraphX or Apache Flink. Spark is used to compute global graph metrics (e.g., PageRank) weekly, while Flink updates "Mutual Friend" counts incrementally.
Data Quality: De-duplication of friend-request events and schema validation (Protobuf) to ensure feature consistency.

Feature Pipeline

Feature Definition:
Structural: Mutual friend count (strongest), HITS scores, Triangle counts.
Affinity: Common schools, shared workplaces, geographic proximity (City/Region).
Activity: Profile views in the last 7 days, "likes" on common posts.
Offline Feature Pipeline: Spark job calculates N-hop neighbors and static profile similarities.
Online Feature Pipeline: Flink consumes the "New Friendship" stream to increment/decrement mutual friend counts in a low-latency Key-Value store (Redis/Cassandra).
Feature Store: Use a Feature Store (e.g., Feast) to ensure that the "Mutual Friend Count" used during training exactly matches the one used during serving (avoiding training-serving skew).

Model Architecture

Problem Formulation: Supervised binary classification. Target: is_accepted.
Candidate Model Families:
Linear Models: Too simple; fails to capture interactions between features (e.g., same school AND same age).
GBDT (LightGBM): Chosen MVP. Handles missing values, non-linear relationships, and is very fast for inference.
Deep Learning (Two-Tower): Better for unstructured data, but harder to tune and deploy for an MVP.
Architecture Design: Features are concatenated into a flat vector. We use a LambdaMART objective for ranking if we want to optimize the list order directly, or LogLoss for pointwise probability.
Complexity vs. Latency: The model will score ~200-500 candidates. A LightGBM with 500 trees and depth 6 can easily achieve < 10ms inference.

Training Pipeline

Dataset Construction:
Positives: Sent requests that were accepted.
Negatives: Suggestions shown but ignored or dismissed.
Downsampling: Negative samples are usually 100x more than positives; we downsample negatives to a 10:1 ratio.
Data Splitting: Time-based split. Train on weeks 1-4, validate on week 5. Random splitting leads to data leakage due to the temporal nature of friendships.
Retraining Strategy: Weekly batch retraining to capture new users and shifting social trends.

Serving Pipeline

Serving Pattern:
Retrieval: Use a fast C++ service to query a pre-computed "Friends of Friends" list from a cache (Redis).
Ranking: Fetch features from the Online Feature Store and call the LightGBM model hosted on a high-throughput inference server (Triton or SageMaker).
Latency Optimization: Pre-compute top 500 candidates for popular users. Only perform real-time ranking for active sessions.
Reliability: If the ranking service fails, fallback to a simple heuristic: "Sort by Mutual Friend Count."

Evaluation Pipeline

Offline Evaluation:
Recall@K: Percentage of actual future friendships captured in the retrieval stage.
PR-AUC: Since classes are imbalanced, PR-AUC is more informative than ROC-AUC.
Online Evaluation:
A/B Testing: Randomly assign users to the new ML model vs. the heuristic baseline.
Metric: "Friend Request Acceptance Rate" (Accepts / Suggestions).

Monitoring Pipeline

System Monitoring: Track the P99 of the retrieval vs. ranking stages.
Data Monitoring: Monitor the distribution of "Mutual Friend Count." If it drops significantly, the graph ingestion pipeline might be broken.
Model Monitoring: Prediction Drift—if the average P(\text{accept}) drops from 0.05 to 0.01, the model may be stale or feature distributions have shifted.
Wrap Up

Final Evaluation

Observability: Use a dashboard to track "Connection Quality"—do people message each other after becoming friends?
Feedback Loop: New "Accepts" are immediately fed back into the graph, which updates the FoF retrieval for the next session.
Edge Cases:
Cold Start: For new users, use "Popular in your city" or "Common Interests" as the retrieval mechanism.
Trade-offs:
Accuracy vs. Latency: We only rank 500 candidates to save latency, potentially missing a "long-tail" friend that a more exhaustive search would find.
Distinguishing Insights:
Privacy: Implement a "Block-list" check at the very end of the serving pipeline to ensure suggested people haven't blocked the user.
Exploration vs. Exploitation: Occasionally inject a "random" or "distant" connection (3rd degree) to help the user expand their network beyond their immediate bubble.