The Question
Design

Design a Nearby Friends Feature

Design a real-time system for a social media application that allows users to see their friends on a map when they are within a certain proximity (e.g., 5 miles). The system must handle 10 million daily active users, support frequent location updates, and maintain strict privacy boundaries ensuring only mutual friends can see each other. Focus on how you would manage high-write throughput of GPS data and the spatial querying logic required to match friends efficiently at scale.
Redis
PostgreSQL
WebSockets
Geohash
PubSub
API Gateway
Kubernetes
Questions & Insights

Clarifying Questions

Scale & Usage: What is the Daily Active User (DAU) count, and what percentage of users have the "Nearby" feature enabled?
Update Frequency: How often should a user's location be updated (e.g., every 30 seconds, 1 minute, or only on significant movement)?
Friendship Density: What is the average and maximum number of friends per user? (Critical for fan-out calculations).
Precision vs. Latency: Is 100% real-time accuracy required, or is a 30-60 second lag acceptable for battery optimization?
Privacy Constraints: Are there specific requirements for data retention? (e.g., do we store location history or only the last known position?)
Assumptions:
DAU: 10 Million.
Update Frequency: Every 30 seconds when the app is in the foreground.
Friendship: Max 5,000 friends (standard social limit), average 200.
Storage: Only the "Last Known Location" is required for the core MVP feature.
Search Radius: 5 km by default, configurable up to 10 km (the 5-mile example in the prompt is roughly 8 km).

Thinking Process

The Core Bottleneck: High-frequency write-heavy traffic (location updates) coupled with spatial "Join" queries (matching user location against a friend's location).
Strategy Steps:
How do we handle 30k+ location updates per second without bottlenecking a relational DB? (Answer: Use an in-memory Key-Value store like Redis with Geospatial indexes).
How do we efficiently find "Friends who are nearby" without a massive O(N^2) cross-join? (Answer: Filter by the user's friend list first, then perform spatial filtering on that subset).
How do we notify friends of updates efficiently? (Answer: Use a Pub/Sub model to push updates to active connections).
How do we ensure privacy? (Answer: Only process updates for users who have mutually opted-in).
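The "friend list first, then spatial filter" step can be sketched as below. This is a minimal in-memory illustration, not the production path: the `friends` and `locations` dictionaries are hypothetical stand-ins for the Friendship Service and the location store, and the function names are invented for this example.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius

def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))

def nearby_friends(user_id, friends, locations, radius_km=5.0):
    """Return [(friend_id, distance_km)] sorted by distance.

    friends:   dict user_id -> set of mutual, opted-in friend ids (assumed input)
    locations: dict user_id -> (lat, lon) of the last known position (assumed input)
    """
    me = locations.get(user_id)
    if me is None:
        return []
    result = []
    # Work is bounded by the friend count (at most 5,000), not by total users.
    for friend_id in friends.get(user_id, ()):
        pos = locations.get(friend_id)
        if pos is None:
            continue  # friend is offline or has no recent location
        distance = haversine_km(me[0], me[1], pos[0], pos[1])
        if distance <= radius_km:
            result.append((friend_id, distance))
    return sorted(result, key=lambda item: item[1])
```

Non-friends never enter the loop, so the privacy rule and the cost bound come from the same step.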

Bonus Points

Geo-Sharding with S2/H3: Propose using Google's S2 Geometry or Uber's H3 for sharding the Redis cluster by geographic cells to ensure spatial locality and avoid global hot spots.
Write Coalescing: Implement client-side logic to only send updates if the user has moved > 50 meters to save battery and reduce server ingress.
Adaptive Precision: Dynamically change update frequency based on battery level, movement speed (walking vs. driving), and location density.
Privacy Jitter: Add random noise (fuzzy location) to the reported coordinates to prevent precise stalking while maintaining "nearby" utility.
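As an illustration of the write-coalescing idea above, here is a small client-side sketch. The class name, the 50 m default, and the flat-earth distance approximation are assumptions made for this example; a real client would likely also send a periodic heartbeat so its server-side entry does not expire while the user is stationary.

```python
import math

METERS_PER_DEGREE = 111_320.0  # approximate length of one degree of latitude

def approx_distance_m(lat1, lon1, lat2, lon2):
    """Equirectangular approximation; adequate for distances of tens of metres."""
    mean_lat = math.radians((lat1 + lat2) / 2)
    dx = (lon2 - lon1) * METERS_PER_DEGREE * math.cos(mean_lat)
    dy = (lat2 - lat1) * METERS_PER_DEGREE
    return math.hypot(dx, dy)

class UpdateCoalescer:
    """Decides whether a new GPS fix is worth sending to the server."""

    def __init__(self, min_move_m=50.0):
        self.min_move_m = min_move_m
        self.last_sent = None  # (lat, lon) of the last fix actually sent

    def should_send(self, lat, lon):
        if self.last_sent is None:
            moved_enough = True  # always send the first fix
        else:
            moved = approx_distance_m(self.last_sent[0], self.last_sent[1], lat, lon)
            moved_enough = moved > self.min_move_m
        if moved_enough:
            self.last_sent = (lat, lon)
        return moved_enough
```

Distance is measured from the last fix that was sent, not the last fix seen, so slow drift still triggers an update once it accumulates past the threshold.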
Design Breakdown

Functional Requirements

Core Use Cases:
Users can opt-in/out of the "Nearby Friends" feature.
Users can update their current GPS coordinates.
Users can view a list of friends currently within a 5km radius.
Users receive real-time updates when a friend enters their vicinity.
Scope Control:
In-scope: Real-time location tracking, friend proximity discovery, and opt-in management.
Out-of-scope: Location history/timeline, "People You May Know" based on location, and background tracking (MVP focuses on active app usage).

Non-Functional Requirements

Scale: Support 10M DAU with 33k+ write QPS.
Latency: Discovery of nearby friends should happen in < 200ms.
Availability: High availability (99.9%) as social features are "always-on."
Consistency: Eventual consistency is acceptable; it is okay if a friend appears "nearby" 30 seconds after they actually arrive.
Fault Tolerance: If the location cache fails, users simply see "No friends nearby" until the cache repopulates.
Security & Privacy: Strict mutual friendship checks; users must not see locations of non-friends.

Estimation

Traffic Estimation:
10M DAU / 10 (Peak Ratio) = 1M Concurrent Users.
1M users / 30s update interval = ~33,333 Write QPS.
Read QPS (Manual Refresh/Map Open): ~3,300 QPS (10% of writes).
Storage Estimation:
User ID (8 bytes) + Lat/Long (16 bytes) + Timestamp (8 bytes) = 32 bytes per user.
10M users * 32 bytes = 320 MB (fits easily in a single small Redis instance, but sharded for throughput).
Bandwidth Estimation:
Incoming: 33k QPS * 100 bytes (JSON/Protobuf payload) = ~3.3 MB/s.
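The arithmetic above can be reproduced with a few lines of Python. The constants are taken from the stated assumptions; the variable names are only for this example.

```python
DAU = 10_000_000
PEAK_CONCURRENCY_RATIO = 10      # 1 in 10 daily users online at peak
UPDATE_INTERVAL_S = 30           # one location update every 30 seconds
BYTES_PER_RECORD = 8 + 16 + 8    # user ID + lat/long + timestamp
PAYLOAD_BYTES = 100              # per-update request payload

concurrent_users = DAU // PEAK_CONCURRENCY_RATIO
write_qps = concurrent_users / UPDATE_INTERVAL_S
read_qps = 0.10 * write_qps
storage_mb = DAU * BYTES_PER_RECORD / 1_000_000
ingress_mb_s = write_qps * PAYLOAD_BYTES / 1_000_000

print(f"concurrent users: {concurrent_users:,}")
print(f"write QPS:        {write_qps:,.0f}")
print(f"read QPS:         {read_qps:,.0f}")
print(f"storage:          {storage_mb:,.0f} MB")
print(f"ingress:          {ingress_mb_s:.2f} MB/s")
```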

Blueprint

Concise Summary: A WebSocket-based real-time system that stores transient user coordinates in a Redis Geo-spatial index, using a Pub/Sub mechanism to broadcast movements to a user's active friend list.
Major Components:
Load Balancer & API Gateway: Handles SSL termination and routes WebSocket/REST traffic.
Location Service: A stateless service that validates and ingests GPS updates.
Redis (Geo Index): Stores the current location of all active users; positions are written with GEOADD and read back with GEOPOS or GEOSEARCH for radius queries.
Presence & PubSub Service: Maintains active WebSocket connections and manages the fan-out of location updates to friends.
Friendship Service (Postgres): Stores the source of truth for bi-directional social graphs.
Simplicity Audit: This design avoids complex batch processing (Spark/Flink) and heavy persistent databases for location, focusing on in-memory speed for the "current state" which is the only requirement for the MVP.
Architecture Decision Rationale:
Why this architecture?: Redis is chosen because it provides native geospatial commands (GEOADD, GEOPOS, GEOSEARCH; the older GEORADIUS is deprecated since Redis 6.2) that run entirely in memory, which suits high-frequency transient data better than a disk-backed store such as PostGIS.
Functional Satisfaction: Meets the core requirement of seeing "who is nearby right now."
Non-functional Satisfaction: Scalable via Redis sharding and horizontally scaling stateless services.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Not heavily used for coordinates, but DNS utilizes latency-based routing to the nearest region.
Security & Perimeter:
API Gateway: Handles JWT authentication to ensure user_id in the location update matches the token.
Rate Limiting: Limits location updates to 1 per 10 seconds per user to prevent API abuse and battery drain.
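A per-user minimum-interval limiter like the one described could look roughly like this. It is a single-process sketch with an invented class name; with several gateway instances the last-accepted timestamps would have to live in a shared store.

```python
class MinIntervalLimiter:
    """Accepts at most one request per user every `min_interval_s` seconds."""

    def __init__(self, min_interval_s=10.0):
        self.min_interval_s = min_interval_s
        self._last_accepted = {}  # user_id -> timestamp of last accepted request

    def allow(self, user_id, now):
        """`now` is a timestamp in seconds (e.g. time.monotonic())."""
        last = self._last_accepted.get(user_id)
        if last is not None and now - last < self.min_interval_s:
            return False  # too soon; the gateway would answer HTTP 429
        self._last_accepted[user_id] = now
        return True
```

Passing `now` explicitly keeps the limiter deterministic and easy to test.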

Service

Topology & Scaling:
Stateless Services: The Location and Search (nearby lookup) services are stateless, deployed as Docker containers in K8s, and scale on CPU/request count.
WebSocket Management: The Presence Service holds long-lived WebSocket connections, so it is not stateless; the load balancer uses sticky sessions to keep each client on the same instance. If a user moves, we only notify friends who have an active WebSocket connection.
API Schema Design:
POST /v1/location: Update lat/long. (REST or WebSocket)
GET /v1/nearby: Get list of friends within radius R.
WS /v1/updates: Bidirectional stream for real-time proximity alerts.
Resilience & Reliability:
Exponential Backoff: If the app fails to send a location update, it retries with exponentially increasing delays plus random jitter.
Graceful Degradation: If the real-time stream fails, the app falls back to a 60-second polling interval on the REST endpoint.
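The retry policy above can be sketched as a delay generator. This uses the "full jitter" variant (a uniform delay between zero and the exponential ceiling); the base, cap, and attempt count are illustrative values, not figures from the design.

```python
import random

def backoff_delays(base_s=1.0, cap_s=60.0, max_attempts=6, rng=None):
    """Yield one sleep duration (seconds) per retry attempt."""
    if rng is None:
        rng = random.Random()
    for attempt in range(max_attempts):
        # Ceiling doubles on every attempt until it reaches the cap.
        ceiling = min(cap_s, base_s * 2 ** attempt)
        # Full jitter spreads clients out so they do not retry in lockstep.
        yield rng.uniform(0, ceiling)
```

A client would sleep for each yielded delay before the next attempt and switch to the 60-second polling fallback once the generator is exhausted.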

Storage

Access Pattern: Roughly 90% writes (updates), 10% reads (lookups).
Database Table Design:
Friendship (Postgres):
user_id_1 (UUID, Indexed)
user_id_2 (UUID, Indexed)
status (Enum: pending, active)
created_at (Timestamp)
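A runnable sketch of this table and the mutual-friend lookup follows. Several things here are assumptions for illustration: SQLite stands in for Postgres, UUIDs are stored as TEXT, and each pair is stored once with user_id_1 < user_id_2 (a canonical ordering the original design does not specify).

```python
import sqlite3

SCHEMA = """
CREATE TABLE friendship (
    user_id_1  TEXT NOT NULL,
    user_id_2  TEXT NOT NULL,
    status     TEXT NOT NULL CHECK (status IN ('pending', 'active')),
    created_at TEXT NOT NULL DEFAULT CURRENT_TIMESTAMP,
    PRIMARY KEY (user_id_1, user_id_2),
    CHECK (user_id_1 < user_id_2)
);
CREATE INDEX idx_friendship_user2 ON friendship (user_id_2);
"""

def connect():
    """In-memory database, for demonstration only."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    return conn

def add_friendship(conn, a, b, status="active"):
    lo, hi = sorted((a, b))  # store each pair exactly once
    conn.execute(
        "INSERT INTO friendship (user_id_1, user_id_2, status) VALUES (?, ?, ?)",
        (lo, hi, status),
    )

def active_friends(conn, user_id):
    """Friends with status 'active'; pending requests are excluded."""
    rows = conn.execute(
        "SELECT user_id_2 FROM friendship WHERE user_id_1 = ? AND status = 'active' "
        "UNION ALL "
        "SELECT user_id_1 FROM friendship WHERE user_id_2 = ? AND status = 'active'",
        (user_id, user_id),
    )
    return {row[0] for row in rows}
```

Because a user can appear in either column, the lookup reads both sides, which is why both columns are indexed.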
Technical Selection:
PostgreSQL: Used for friendships due to strong ACID requirements for social graph consistency.
Redis: Used for location storage.
Distribution Logic:
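Redis Sharding: Hash the user_id to pick the Geo-index shard. This spreads load evenly and fits the friend-first lookup, where each friend's position is fetched by ID. For global scale, or if radius queries must run across all users, shard by Geo-cell instead (e.g., all users in NYC in one shard).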
Reliability & Recovery:
Redis is treated as transient. If it wipes, we lose "Nearby" status for a few minutes until apps check in again. RPO is low priority for this specific feature.
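To make the Geo-cell option under Distribution Logic concrete, here is a sketch that derives a shard key from a geohash prefix. The encoder follows the standard geohash bit-interleaving scheme; the `locations:` key prefix matches the cache key pattern used in this design, while the 4-character cell size (roughly 39 km by 20 km) is an assumption chosen for illustration.

```python
_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"

def geohash_encode(lat, lon, precision=5):
    """Encode a coordinate as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, nbits = 0, 0
    use_lon = True  # bits alternate longitude, latitude, longitude, ...
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        nbits += 1
        if nbits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[bits])
            bits, nbits = 0, 0
    return "".join(chars)

def shard_key(lat, lon, cell_precision=4):
    """Users in the same geohash cell share a Redis key."""
    return f"locations:{geohash_encode(lat, lon, cell_precision)}"
```

A user near a cell boundary can have friends in an adjacent cell, so a radius query under this scheme would also need to check the neighbouring cells.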

Cache

Purpose & Justification: Redis is the primary store for transient location, acting as a high-speed cache.
Key-Value Schema:
Key: locations:{shard_id}
Value: Redis Geo set, which is a sorted set under the hood (Member: user_id, Score: 52-bit integer geohash of the coordinates).
Failure Handling: In case of Redis node failure, the system fails over to a replica. Since data is transient, no heavy disk-based PITR is required.
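The read and write paths against this schema might look like the sketch below. It assumes a redis-py 4.x style client, where `geoadd(name, values)` takes a `(longitude, latitude, member)` sequence and `geopos(name, *members)` returns `(longitude, latitude)` tuples or `None`. The helper names are hypothetical, and the client is passed in so the functions work with any object exposing those two methods.

```python
def record_location(client, shard_key, user_id, lat, lon):
    """Upsert a user's last known position into the geo set."""
    # Redis geo commands take longitude first, then latitude.
    client.geoadd(shard_key, (lon, lat, user_id))

def friend_positions(client, shard_key, friend_ids):
    """Fetch positions for a list of friends in one round trip.

    Returns {friend_id: (lat, lon)}; friends with no stored position are omitted.
    """
    ids = list(friend_ids)
    if not ids:
        return {}
    positions = client.geopos(shard_key, *ids)
    return {
        friend_id: (pos[1], pos[0])  # convert (lon, lat) back to (lat, lon)
        for friend_id, pos in zip(ids, positions)
        if pos is not None
    }
```

With a real server, `client` would be something like `redis.Redis(host=..., port=...)`; the distance check against the caller's own position then happens in the service.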

Messaging

Purpose & Decoupling: Redis Pub/Sub is used for the fan-out.
Event Schema: {"user_id": "123", "lat": 40.7, "lng": -74.0, "timestamp": 1625...}.
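Throughput: ~33k published messages per second at peak, which Redis Pub/Sub can relay with very low latency. Delivered messages are a multiple of that, because each publish fans out to every online friend: as a rough estimate, 200 friends on average with about 10% online gives ~20 deliveries per publish, so the Pub/Sub tier should be sized (and sharded if needed) for delivery volume rather than publish volume.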
Failure Handling: Pub/Sub is "fire and forget." If a message is missed, the next update (30s later) will correct the state.
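The fan-out pattern can be illustrated with an in-memory stand-in for Redis Pub/Sub. The one-channel-per-user layout and the names `channel_for` and `on_connect` are assumptions for this sketch; the design above does not fix a channel scheme.

```python
from collections import defaultdict

class InMemoryPubSub:
    """Fire-and-forget bus: messages reach only current subscribers."""

    def __init__(self):
        self._subscribers = defaultdict(list)  # channel -> list of callbacks

    def subscribe(self, channel, callback):
        self._subscribers[channel].append(callback)

    def publish(self, channel, message):
        """Deliver to current subscribers and return how many received it."""
        callbacks = self._subscribers.get(channel, [])
        for callback in callbacks:
            callback(message)
        return len(callbacks)

def channel_for(user_id):
    return f"location:{user_id}"

def on_connect(bus, friend_ids, deliver):
    """When a user opens a WebSocket, subscribe them to each friend's channel.

    `deliver` is the callback that pushes a message down that user's socket.
    """
    for friend_id in friend_ids:
        bus.subscribe(channel_for(friend_id), deliver)
```

A publish on a channel with no subscribers is simply dropped, which mirrors the fire-and-forget behaviour described above; Redis PUBLISH likewise returns the number of clients that received the message.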

Infrastructure (Optional)

Observability:
Prometheus/Grafana: Monitoring Redis memory usage and WebSocket connection counts.
Tracing: Jaeger used to trace a location update from the gateway to the Pub/Sub fan-out.
Wrap Up

Advanced Topics

Trade-offs: We choose Availability over Consistency (AP). If a user moves, their friend might see their old location for a few seconds. This is acceptable for a social app.
Reliability: If the Presence Service restarts, we lose WebSocket connections. Clients reconnect automatically, using exponential backoff with jitter so that a restart does not trigger a reconnect storm.
Bottleneck Analysis: The primary bottleneck is the "Friend Fan-out." If a celebrity with 5,000 friends moves, we must check the presence of all 5,000 friends.
Optimization: Only fan-out to friends who are also currently active (presence bit set in Redis).
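Security: Data at rest in Postgres is encrypted. Location data in Redis is volatile and expires after a short period (e.g., 10 minutes) so that inactive users automatically disappear from "Nearby" lists. Redis applies TTLs to whole keys, not to individual members of a geo set, so this needs either a per-user key with EXPIRE or a periodic sweep that removes members whose last update is too old.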