The Question
Design

Design Google Maps

Design a comprehensive geospatial system similar to Google Maps that supports global map rendering, high-concurrency point-of-interest (POI) search, and real-time navigation. The system must handle 100M+ daily active users, provide sub-second routing responses, and incorporate real-time traffic data to adjust ETAs. Address specific challenges regarding geospatial indexing, massive data ingestion for location pings, and efficient distribution of map assets at scale.
S2 Cells
PostGIS
Kafka
Apache Spark
Redis
S3
CDN
Vector Tiles
Contraction Hierarchies
gRPC
Questions & Insights

Clarifying Questions

What is the primary scale of the system? (Assumption: 100M Daily Active Users (DAU), 1B total users, global distribution).
What are the core functional requirements for the MVP? (Assumption: Map rendering (tiles), Point of Interest (POI) search, and Turn-by-Turn Navigation/Routing).
Does the system need to handle real-time traffic updates for the MVP? (Assumption: Yes, pings from active users must influence routing ETAs).
Is offline map support required? (Assumption: No, YAGNI for MVP).
What is the expected latency for route calculation? (Assumption: P99 < 500ms for city-wide routes).

Thinking Process

How do we efficiently serve map data at a global scale? Use Vector Tiles served via a geographically distributed CDN to minimize latency and bandwidth.
How do we search for POIs near a user? Implement Geospatial Indexing (S2 Cells or Quadtrees) to narrow down search results to specific lat/lng bounds.
How do we handle navigation between two points? Use a Graph-based representation of the road network (Nodes as intersections, Edges as roads) and perform pathfinding (A* or Contraction Hierarchies).
How do we integrate real-time traffic? Stream user location pings into a processing pipeline to update edge weights in the road graph dynamically.

Bonus Points

S2 Geometry Library: Use Google's S2 cells for hierarchical spatial indexing, which maps 2D lat/lng to a 1D 64-bit integer (Hilbert Curve), enabling efficient range queries.
Contraction Hierarchies: Pre-calculate "shortcuts" in the road graph to speed up routing queries by orders of magnitude compared to standard Dijkstra/A*.
Write-Heavy Location Ingestion: Use a Geo-partitioned Kafka setup to handle millions of concurrent GPS pings without bottlenecking the primary database.
Adaptive Tile Zooming: Serve a different level of detail (LOD) per zoom level so that zoomed-out (low zoom) views do not over-fetch street-level data.
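
To make the S2 point concrete, here is a minimal sketch of how a Hilbert curve turns a 2D grid position into a 1D key. It is a simplified planar stand-in, not the S2 library: real S2 first projects the sphere onto six cube faces. The function names (`hilbert_index`, `latlng_to_cell`) and the plain lat/lng grid are assumptions made for illustration.

```python
def hilbert_index(n: int, x: int, y: int) -> int:
    """Map cell (x, y) on an n x n grid (n a power of two) to its
    position along a Hilbert curve."""
    d = 0
    s = n // 2
    while s > 0:
        rx = 1 if (x & s) else 0
        ry = 1 if (y & s) else 0
        d += s * s * ((3 * rx) ^ ry)
        # Rotate/flip the quadrant so the sub-curve is oriented correctly.
        if ry == 0:
            if rx == 1:
                x, y = n - 1 - x, n - 1 - y
            x, y = y, x
        s //= 2
    return d


def latlng_to_cell(lat: float, lng: float, order: int = 16) -> int:
    """Hypothetical helper: bucket a lat/lng into a 2^order x 2^order grid
    and return the Hilbert index of that bucket."""
    n = 1 << order
    x = min(n - 1, int((lng + 180.0) / 360.0 * n))
    y = min(n - 1, int((lat + 90.0) / 180.0 * n))
    return hilbert_index(n, x, y)
```

Consecutive keys always belong to neighbouring grid cells, so a "near me" lookup can usually be served by a small number of range scans over the 1D key instead of a full 2D scan.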
Design Breakdown

Functional Requirements

Core Use Cases:
Render map tiles at various zoom levels.
Search for POIs (e.g., "coffee shops near me").
Calculate the fastest route between Point A and Point B.
Provide real-time ETA based on current traffic.
Scope Control:
In-scope: Static/Vector tiles, POI Search, Basic Routing, Traffic updates.
Out-of-scope: Street View (3D), Satellite imagery, Offline maps, Public transit schedules, User reviews/photos.

Non-Functional Requirements

Scale: Support 100M DAU; 10k+ QPS for search; 50k+ QPS for tile requests.
Latency: Map tile loading < 200ms; Route calculation < 500ms.
Availability & Reliability: 99.99% (High availability is critical for navigation).
Consistency: Eventual consistency for POI updates; High availability over strict consistency for traffic data.
Security & Privacy: Anonymize user location data; TLS for all transit.

Estimation

Traffic Estimation:
100M DAU * 10 tile requests/day = 1B Tile Requests/Day (~12k QPS average, 25k peak).
100M DAU * 2 searches/day = 200M Searches/Day (~2.3k QPS average).
Storage Estimation:
Map tiles (Vector): ~50-100 TB for global road data at all zoom levels.
POI Database: 100M POIs * 1KB/POI = 100 GB.
Road Graph: 1B Edges/Nodes * 100 bytes = 100 GB.
Bandwidth Estimation:
Tiles: 12k QPS * 50KB/tile = 600 MB/s outgoing.
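
The back-of-envelope numbers above can be reproduced with a few lines of arithmetic. This is only a sketch: the inputs are the assumptions stated in this section, and the function name `estimate` is made up for illustration.

```python
SECONDS_PER_DAY = 86_400


def estimate(dau: int = 100_000_000) -> dict:
    """Recompute the traffic, storage, and bandwidth estimates."""
    tile_requests_per_day = dau * 10      # 10 tile requests per user per day
    searches_per_day = dau * 2            # 2 searches per user per day
    return {
        "tile_qps_avg": tile_requests_per_day / SECONDS_PER_DAY,
        "search_qps_avg": searches_per_day / SECONDS_PER_DAY,
        # 100M POIs at 1 KB each, expressed in GB (decimal units).
        "poi_storage_gb": 100_000_000 * 1_000 / 1e9,
        # 1B nodes/edges at 100 bytes each, expressed in GB.
        "graph_storage_gb": 1_000_000_000 * 100 / 1e9,
        # 12k QPS at 50 KB per tile, expressed in MB/s.
        "tile_bandwidth_mb_s": 12_000 * 50 / 1_000,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.0f}")
```

The averages come out at roughly 11.6k tile requests per second and 2.3k searches per second; peak figures depend on the assumed peak-to-average ratio.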

Blueprint

Concise Summary: A geo-distributed system utilizing a CDN for map tiles, a geospatial-indexed database for POIs, and a graph-processing engine for navigation.
Major Components:
Tile Service: Serves pre-rendered or dynamic vector tiles from Object Storage via CDN.
Search Service: Uses S2-cell indexing to query POI metadata in a spatial database.
Routing Service: Executes pathfinding algorithms on a memory-optimized road graph.
Traffic Processor: Ingests GPS pings to adjust graph edge weights (traffic density).
Simplicity Audit: This architecture focuses on the three pillars (View, Find, Navigate) using industry-standard geospatial patterns without over-complicating the data pipeline.
Architecture Decision Rationale:
Why?: Separation of concerns allows scaling the Tile Service (read-heavy) independently from the Routing Service (compute-heavy).
Functional: Meets all core MVP needs: rendering, search, and navigation.
Non-functional: CDN ensures low latency; Sharded databases ensure scalability.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing
CDN: Primary delivery mechanism for Vector Tiles. Tiles are cached at the edge based on zoom/x/y coordinates.
DNS: Latency-based routing to the nearest regional API Gateway.
Security & Perimeter
API Gateway: Handles JWT authentication, TLS termination, and rate limiting to prevent scraping of map data.

Service

Topology & Scaling
Tile Service: Stateless microservice, scales horizontally based on CPU/Egress bandwidth.
Routing Service: State-heavy (loads graph in memory). Uses large-memory instances. Sharded by geographic region (e.g., North America, Europe) to keep graphs manageable.
API Schema Design
GET /v1/tiles/{z}/{x}/{y}: Returns Protobuf-encoded vector data.
GET /v1/search?q=query&lat=...&lng=...&radius=...: Returns list of POIs.
GET /v1/route?origin=lat,lng&dest=lat,lng&mode=driving: Returns Polyline and ETA.
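
As an illustration of how a client might derive the {z}/{x}/{y} path for the tile endpoint, the sketch below uses the common Web Mercator "XYZ" tiling scheme. That this API follows the XYZ scheme is an assumption, and `latlng_to_tile` and `tile_path` are hypothetical helper names.

```python
import math

# Web Mercator cannot represent the poles; latitudes are clamped to this bound.
MAX_MERCATOR_LAT = 85.05112878


def latlng_to_tile(lat: float, lng: float, zoom: int) -> tuple[int, int]:
    """Return the (x, y) tile containing the point at the given zoom level."""
    lat = max(-MAX_MERCATOR_LAT, min(MAX_MERCATOR_LAT, lat))
    n = 1 << zoom  # tiles per axis at this zoom
    x = int((lng + 180.0) / 360.0 * n)
    y = int((1.0 - math.asinh(math.tan(math.radians(lat))) / math.pi) / 2.0 * n)
    # Clamp so that lng == 180 or the latitude bound cannot overflow the grid.
    return min(n - 1, max(0, x)), min(n - 1, max(0, y))


def tile_path(lat: float, lng: float, zoom: int) -> str:
    """Build the request path for the tile covering the given point."""
    x, y = latlng_to_tile(lat, lng, zoom)
    return f"/v1/tiles/{zoom}/{x}/{y}"
```

Because the path depends only on zoom, x, and y, every user looking at the same area requests the same URL, which is what makes the tiles cacheable at the CDN.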
Resilience & Reliability
Circuit Breakers: If the Routing Service is slow, fallback to a simpler distance-based heuristic.
Retries: Exponential backoff for Search Service calls.
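
A minimal sketch of the circuit-breaker idea with a distance-based fallback follows. The thresholds, the 40 km/h average speed, and all class and function names are assumptions for illustration; a production breaker would also need to be thread-safe and to treat slow calls (timeouts) as failures.

```python
import math
import time


def haversine_km(lat1, lng1, lat2, lng2):
    """Great-circle distance between two points in kilometres."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def fallback_eta_minutes(origin, dest, avg_speed_kmh=40.0):
    """Crude ETA: straight-line distance divided by an assumed average speed."""
    return haversine_km(*origin, *dest) / avg_speed_kmh * 60.0


class CircuitBreaker:
    def __init__(self, max_failures=3, reset_after_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, primary, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # open: skip the primary entirely
            # Half-open: allow one trial call; a single failure re-opens.
            self.opened_at = None
            self.failures = self.max_failures - 1
        try:
            result = primary()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()
        self.failures = 0
        return result
```

A caller would wrap the routing request as `breaker.call(lambda: route(origin, dest), lambda: fallback_eta_minutes(origin, dest))`, where `route` stands for whatever client function reaches the Routing Service.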

Storage

Access Pattern
Tile Store: 99% Read, 1% Write (during map updates).
POI DB: High read (search), low write.
Road Graph: Read-heavy (routing), constant stream of weight updates (traffic).
Database Table Design
POI Table (PostGIS):
poi_id (PK), name, category, location (GEOGRAPHY point), s2_cell_id (Indexed).
Road Graph (Graph Store):
node_id, lat, lng.
edge_id, start_node, end_node, weight (time), metadata (speed limit, road type).
Technical Selection
Tile Store: AWS S3 or Google Cloud Storage.
POI Search: PostgreSQL with PostGIS extension for spatial queries and GiST indexing.
Road Network: Custom Graph Service (In-memory) or Neo4j for persistent relationships.
Distribution Logic
Sharding: POI and Road data are sharded by S2 Cell ID (Level 12 or 13) to ensure spatial locality.
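
To show how the Road Graph rows described above could feed pathfinding, here is a baseline sketch using plain Dijkstra over (start_node, end_node, weight) tuples. It is not Contraction Hierarchies, which add a preprocessing step on top of this kind of search. Treating each edge as one-directional is an assumption, as are the function names.

```python
import heapq
from collections import defaultdict


def build_adjacency(edges):
    """edges: iterable of (start_node, end_node, weight_seconds)."""
    adj = defaultdict(list)
    for start, end, weight in edges:
        adj[start].append((end, weight))
    return adj


def shortest_path(adj, source, target):
    """Return (total_weight, [nodes...]) or None if target is unreachable."""
    inf = float("inf")
    dist = {source: 0.0}
    prev = {}
    heap = [(0.0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            break
        if d > dist.get(node, inf):
            continue  # stale heap entry
        for nxt, weight in adj.get(node, ()):
            nd = d + weight
            if nd < dist.get(nxt, inf):
                dist[nxt] = nd
                prev[nxt] = node
                heapq.heappush(heap, (nd, nxt))
    if target not in dist:
        return None
    # Walk the predecessor links back from the target to rebuild the path.
    path = [target]
    while path[-1] != source:
        path.append(prev[path[-1]])
    path.reverse()
    return dist[target], path
```

Because weights are travel times, a traffic update only changes the `weight` values; the search itself stays the same.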

Cache

Purpose & Justification: Reduce load on the POI database and speed up common route requests.
Key-Value Schema:
poi:search:{query}:{cell_id} -> List of POI IDs (TTL 1 hour).
route:{origin_cell}:{dest_cell} -> Cached Polyline (TTL 5 mins, highly volatile due to traffic).
Technical Selection: Redis.
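
The key schema and TTLs above can be sketched with a small in-memory stand-in for Redis. The `TTLCache` class, the helper names, and the query normalization step are assumptions for illustration; with Redis the same effect would come from setting an expiry on each key.

```python
import time


def poi_search_key(query: str, cell_id: int) -> str:
    # Normalizing case and whitespace (an assumption) raises the hit rate.
    normalized = " ".join(query.lower().split())
    return f"poi:search:{normalized}:{cell_id}"


def route_key(origin_cell: int, dest_cell: int) -> str:
    return f"route:{origin_cell}:{dest_cell}"


class TTLCache:
    """Tiny in-memory cache with per-key expiry, standing in for Redis."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._items = {}

    def set(self, key, value, ttl_s):
        self._items[key] = (value, self._clock() + ttl_s)

    def get(self, key):
        entry = self._items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            del self._items[key]  # lazily evict expired entries
            return None
        return value


POI_TTL_S = 60 * 60   # 1 hour, as in the schema above
ROUTE_TTL_S = 5 * 60  # 5 minutes, since traffic changes quickly
```

Keying routes by origin and destination cell rather than exact coordinates lets nearby requests share an entry, at the cost of a slightly approximate first and last segment.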

Messaging

Purpose & Decoupling: Ingest high-velocity GPS pings from users to calculate traffic without slowing down the main request flow.
Event Schema: {user_id, lat, lng, speed, timestamp, heading}.
Throughput & Partitioning: Kafka partitioned by geohash or s2_cell to ensure pings from the same area go to the same consumer for velocity calculation.
Technical Selection: Kafka.
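
The partitioning rule can be illustrated as follows: derive a coarse cell key from each ping and hash it to a partition, so pings from the same area land on the same consumer. This is a sketch only. The 0.1-degree grid stands in for a geohash or S2 cell, and CRC32 stands in for whatever hash the Kafka client's partitioner actually applies to the message key; both are assumptions.

```python
import math
import zlib


def coarse_cell(lat: float, lng: float, cell_deg: float = 0.1) -> str:
    """Hypothetical stand-in for a geohash / S2 cell: a fixed lat/lng grid."""
    row = math.floor(lat / cell_deg)
    col = math.floor(lng / cell_deg)
    return f"{row}:{col}"


def partition_for(cell_key: str, num_partitions: int) -> int:
    """Stable hash of the cell key, reduced to a partition number."""
    return zlib.crc32(cell_key.encode("utf-8")) % num_partitions


def route_ping(ping: dict, num_partitions: int) -> int:
    """Pick the partition for one ping of the form
    {user_id, lat, lng, speed, timestamp, heading}."""
    return partition_for(coarse_cell(ping["lat"], ping["lng"]), num_partitions)
```

In practice the producer would pass the cell key as the message key and let the client library choose the partition; the point of the sketch is only that the key, not the user, determines placement.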

Data Processing

Processing Model: Stream processing.
Processing DAG: Kafka Source -> Map to Road Segment -> Calculate Average Speed (Windowed) -> Update Routing Service Weights.
Technical Selection: Apache Spark Streaming or Flink. Chosen for windowing capabilities to average traffic over 1-5 minute intervals.
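
The windowed-average step of the DAG can be sketched in plain Python. The example assumes pings have already been matched to a road segment (the "Map to Road Segment" stage) and uses tumbling windows; the function names and the 1 km/h speed floor are illustrative assumptions, and a real deployment would express the same logic in the stream processor's windowing API.

```python
from collections import defaultdict


def average_speeds(pings, window_s=60):
    """pings: iterable of (segment_id, speed_kmh, timestamp_s).
    Returns {(segment_id, window_start_s): average speed in km/h}."""
    sums = defaultdict(lambda: [0.0, 0])  # key -> [speed total, ping count]
    for segment_id, speed_kmh, ts in pings:
        window_start = int(ts // window_s) * window_s
        acc = sums[(segment_id, window_start)]
        acc[0] += speed_kmh
        acc[1] += 1
    return {key: total / count for key, (total, count) in sums.items()}


def edge_weight_seconds(length_m, avg_speed_kmh, min_speed_kmh=1.0):
    """Convert an observed average speed into a travel-time edge weight.
    The floor avoids division by zero when traffic is at a standstill."""
    speed_ms = max(avg_speed_kmh, min_speed_kmh) / 3.6
    return length_m / speed_ms
```

For example, a 1 km segment observed at 36 km/h gets a weight of about 100 seconds, which is the value the Routing Service would use until the next window closes.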
Wrap Up

Advanced Topics

Trade-offs:
Vector vs Raster Tiles: Vector chosen. Pro: Smaller size, client-side styling. Con: Requires more client CPU to render.
Graph Consistency: Routing weights are eventually consistent with real traffic. It's better to provide a slightly stale route than no route at all.
Reliability: If the Traffic Processor fails, the Routing Service defaults to static "speed limit" weights.
Bottleneck Analysis:
Hot Spot Shards: Popular cities (NYC, London) will have much higher search density. Use sub-sharding within these S2 cells or dedicated cache layers.
Security: PII protection is paramount. User location pings are dissociated from User IDs in the Traffic Processor to maintain anonymity.
Optimization:
Tile Pre-rendering: Pre-render common zoom levels (1-10) for the whole world; render higher zoom levels (city level) on-demand and cache.
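
A quick calculation shows why pre-rendering is limited to the lower zoom levels. Assuming the usual quadtree tiling, where each zoom level has 4^z tiles (an assumption about the tiling scheme, with hypothetical helper names), the counts are:

```python
def tiles_at_zoom(z: int) -> int:
    """Number of tiles covering the world at zoom z in a quadtree scheme."""
    return 4 ** z


def tiles_in_range(z_min: int, z_max: int) -> int:
    """Total tiles across an inclusive range of zoom levels."""
    return sum(tiles_at_zoom(z) for z in range(z_min, z_max + 1))


if __name__ == "__main__":
    print(f"zoom 1-10 combined: {tiles_in_range(1, 10):,}")
    print(f"zoom 15 alone:      {tiles_at_zoom(15):,}")
```

Zoom levels 1 through 10 together come to about 1.4 million tiles, while zoom 15 alone has over a billion, most of them open ocean or empty land that nobody requests. Rendering those on demand and caching the result avoids paying for tiles that are never viewed.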