The Question
DesignNearby Friends System
Design a real-time 'Nearby Friends' feature for a major social media platform with 100M DAU. The system should allow users to see their mutual friends within a 5km radius and receive updates as friends move. Focus on handling the high-frequency location updates (300k+ QPS), minimizing mobile battery consumption, and ensuring privacy. Detail the trade-offs between polling and push mechanisms, and explain how you would handle geospatial data at scale.
Redis
WebSockets
Geohashing
PostgreSQL
Pub/Sub
Questions & Insights
Clarifying Questions
Scale and Usage: What is the scale of the system? (e.g., Total users vs. Daily Active Users (DAU) and concurrent users).
Update Frequency: How often do we expect location updates from the client (e.g., every 30 seconds, 1 minute, or only on significant movement)?
Radius and Privacy: Is the search radius fixed (e.g., 5km) or user-configurable? How do we handle privacy (e.g., invisible mode or selective sharing)?
Data Retention: Do we need to store location history for analytics, or only the "last known position" for real-time features?
Assumptions for this Design:
DAU: 100 Million.
Concurrent Users: 10 Million.
Update Frequency: Every 30 seconds when the app is in the foreground.
Radius: Fixed at 5km for the MVP.
Privacy: Only "mutual friends" who have both opted-in can see each other.
Thinking Process
Core Bottleneck: The primary challenge is the "Fan-out" problem. Every location update from one user potentially needs to be pushed to dozens of friends. At 10M concurrent users updating every 30s (~333k QPS), a naive database query is impossible.
Strategy Steps:
Efficient In-Memory Storage: Use Redis with Geospatial indexes (Geohash) to store and query locations with sub-millisecond latency.
Stateful Connections: Implement WebSockets to push updates to users in real-time, avoiding the overhead of polling.
Pub/Sub Decoupling: Use a Pub/Sub mechanism (Redis or Kafka) to broadcast location updates to the specific WebSocket servers where a user's friends are connected.
Bonus Points
Geo-sharding & Hotspots: Discussing how to shard Redis based on Geohash prefixes to prevent "Manhattan" or "London" hotspots from overwhelming a single node.
Battery Optimization: Implementing "Variable Heartbeats"—increasing update frequency when moving fast (driving) and decreasing it when stationary or when battery is low.
Privacy-Preserving Proximity: Using fuzzy location/grid-based matching so the server doesn't know the exact GPS coordinate, only the "cell" the user is in.
Delivery Guarantees: Implementing "at-most-once" delivery for location updates, as getting the latest update is more important than ensuring every historical update is received.
Design Breakdown
Functional Requirements
Core Use Cases:
Users can opt-in/out of the "Nearby Friends" feature.
Users can update their current GPS coordinates.
Users receive a real-time list of friends within a 5km radius.
Scope Control:
In-Scope: Real-time location updates and proximity alerts.
Out-of-Scope: Location history/timeline, navigation/directions, and "People You May Know" recommendations.
Non-Functional Requirements
Scale: Support 10M concurrent users and 333k location updates per second.
Latency: Location updates should reflect on friends' devices within < 2 seconds (p99).
Availability & Reliability: High availability is critical for the "social" feel; however, transient location loss is acceptable.
Consistency: Eventual consistency is sufficient; showing a friend 5 seconds late is acceptable.
Security & Privacy: Strict mutual-friend authorization; location data must be encrypted in transit.
Estimation
Traffic:
10M concurrent users / 30s update interval = 333,333 Write QPS.
If each user has 100 friends, and 10 are online, the "fan-out" could result in millions of internal messages, but filtered by proximity.
Storage:
100M total users 100 bytes (UserId, Lat, Long, Timestamp) = 10GB RAM** for current locations.
Bandwidth:
Write: 333k QPS * 100 bytes = ~33 MB/s.
Read (Push): Highly variable based on friend density.
Blueprint
Concise Summary: A WebSocket-based architecture using Redis Geospatial for efficient proximity lookups and Redis Pub/Sub for routing updates to active sessions.
Major Components:
WebSocket Server: Maintains long-lived connections to push real-time updates to clients.
Location Service: Handles incoming GPS coordinates and updates the geospatial index.
Redis Geo Index: Stores the last known location of all active users using
GEOADD.Pub/Sub Component: Manages the distribution of location events to the relevant WebSocket servers.
Simplicity Audit: This design avoids complex batch processing (Spark) or persistent stream storage (Kafka) by using Redis as both the index and the message bus, which is sufficient for real-time proximity.
Architecture Decision Rationale:
Why this architecture?: WebSockets reduce header overhead compared to HTTP polling. Redis
GEORADIUS provides O(N + log M) complexity, which is the fastest way to find nearby points.Functional Satisfaction: Meets the real-time update and proximity search requirements.
Non-functional Satisfaction: Scalable via sharding Redis and horizontal scaling of WebSocket workers.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling:
WebSocket Servers: Stateless regarding application data but stateful regarding connections. Scaled horizontally based on connection count (~50k per node).
Location Service: Stateless REST/gRPC service for processing incoming coordinates.
API Schema Design:
Update Location (WS/JSON):
{"type": "update", "lat": 40.7, "long": -74.0}.Friend Nearby Alert (WS/JSON):
{"type": "nearby", "friend_id": "u123", "dist": "0.5km"}.Resilience & Reliability:
Heartbeats: Clients send ping/pong to keep WS connections alive and detect silent disconnects.
Service Discovery: Required for the Pub/Sub layer to know which WebSocket server hosts which user (usually handled by a "Location Routing Table" in Redis).
Storage
Access Pattern: Extremely high write (updates) and high read (proximity queries).
Database Table Design:
User DB (PostgreSQL): Stores user profiles and friend relationships.
Fields:
user_id (PK), friend_id (FK), status (active/blocked).Technical Selection:
Redis (Primary): Using
GEOADD for writes and GEORADIUS (or GEOSEARCH) for reads.Rationale: In-memory speed is mandatory for 300k+ updates/sec.
Distribution Logic:
Sharding: Shard Redis by
user_id for location storage, but for proximity, we shard by Geohash prefixes. This ensures users in the same geographic area are on the same shard, allowing a single query to find neighbors.Cache
Purpose: Reducing DB hits for friendship checks.
Key-Value Schema:
friends:{user_id} -> Set[friend_ids].Technical Selection: Redis (shared with Geo Index or separate cluster).
Consistency: Updated on friend request acceptance.
Messaging
Purpose & Decoupling: When User A updates their location, their ID is published to a channel.
Topic Schema:
user_updates:{user_id}.Scaling Consumers: Each WebSocket server subscribes to the channels of the friends of the users currently connected to that server.
Failure Handling: If a message is lost, the next update (30s later) will correct the state.
Infrastructure (Optional)
Observability:
Metrics: Monitor WebSocket connection churn and Redis command latency.
Distributed Tracing: Trace a location update from Client -> WS -> Redis -> Friend WS.
Wrap Up
Advanced Topics
Trade-offs (Push vs. Pull):
Pull: Simpler server-side (stateless), but massive battery drain on mobile and high overhead for 10M users.
Push (Chosen): Efficient data transfer but requires managing stateful WebSocket clusters and complex Pub/Sub logic.
Reliability & Failure Handling:
If a Redis shard fails, we lose real-time proximity for that "cell." We mitigate this with Redis Replication (Primary-Replica).
Bottleneck Analysis:
The "Celebrity" Problem: A user with 5,000 friends moving around causes a massive spike in Pub/Sub messages. Optimization: Limit the number of friends tracked in real-time or throttle updates for high-degree nodes.
Security & Privacy:
Mutual Opt-in: The system must check the Friendship Service before broadcasting any location.
Geofencing: Users can define "Privacy Zones" (e.g., Home) where their location is never shared.