The Question
DesignDesign a Global Map and Navigation System
Design a comprehensive mapping system similar to Google Maps that supports real-time rendering, location search, and navigation. The system must handle over 1 billion daily active users, provide low-latency map tile delivery (under 50ms), and offer accurate turn-by-turn driving directions adjusted for real-time traffic conditions. Address how you would efficiently store and query hundreds of millions of points of interest (POIs) and how you would scale the routing engine to handle global road networks while maintaining high availability and consistency for road data updates.
CDN
Vector Tiles
S2 Geometry
PostGIS
Redis
Kafka
Apache Flink
Protobuf
Contraction Hierarchies
S3
Questions & Insights
Clarifying Questions
Scale: What is the expected scale in terms of Daily Active Users (DAU) and queries? (Assume 1 Billion DAU, 10 billion map tile requests/day, and 1 billion search/routing requests/day).
Core Features: Should we focus on the entire Google Maps suite or a subset? (Focus on MVP: Map Rendering, Location Search (POI), and Navigation/Routing).
Data Latency: How real-time does the traffic data need to be for routing? (Assume 1-2 minute latency for traffic updates is acceptable for MVP).
Map Accuracy: Are we handling global map data ingestion or just serving pre-processed data? (Assume serving pre-processed data with an internal pipeline for graph construction).
Assumptions:
Vector Tiles: We will use vector tiles instead of raster for better scaling and client-side styling.
Geography: Global coverage with tiered zoom levels.
Routing: Support for driving directions based on distance and live traffic.
Thinking Process
Core Bottleneck 1: Massive Read Volume for Tiles. Solve by using a Geo-distributed CDN and Vector Tile format to minimize payload size.
Core Bottleneck 2: Spatial Search Complexity. Solve by using Geospatial indexing (S2 Geometry or Quadtrees) for Points of Interest (POI).
Core Bottleneck 3: Real-time Routing on Global Graph. Solve by partitioning the global road graph into sub-graphs and using Contraction Hierarchies (CH) for sub-millisecond pathfinding.
Core Bottleneck 4: Data Freshness. Solve by using a stream processing layer to ingest GPS telemetry and update edge weights in the routing graph.
Bonus Points
S2 Geometry Library: Utilize Google’s S2 library for mapping the sphere to a 1D Hilbert Curve, enabling efficient range scans and spatial joins in standard B-Trees.
Hierarchical Pathfinding: Implement Multi-Level Dijkstra or Contraction Hierarchies to pre-calculate "shortcuts" on highways, reducing the search space for long-distance routing from millions of nodes to hundreds.
Vector Tile Differential Updates: Instead of re-downloading tiles, send Protobuf-encoded deltas for map feature updates.
Adaptive Precision: Use different data structures for dense urban areas (high-depth Quadtrees) vs. rural areas (low-depth) to optimize storage.
Design Breakdown
Functional Requirements
Core Use Cases:
Map Rendering: Users can view map tiles and zoom/pan seamlessly.
POI Search: Users can search for businesses or addresses by name/category.
Navigation: Users can get the fastest route between two points considering current traffic.
Scope Control:
In-Scope: Map tile delivery, Spatial search, Routing, Traffic ingestion.
Out-of-Scope: Street View, 3D Building rendering, Offline maps, Location sharing, User reviews/photos.
Non-Functional Requirements
Scale: Support 1B+ users and 100k+ peak QPS for various services.
Latency: Map tiles < 50ms (from CDN), Search/Routing < 200ms.
Availability & Reliability: 99.99% availability; navigation is a critical service for drivers.
Consistency: Eventual consistency for POI updates; high consistency for road closures.
Fault Tolerance: Region-level failover for the routing engine.
Estimation
Traffic:
1B DAU / 86400s ≈ 12k average users/sec.
Peak Tile QPS: Assume 10 tiles per session, 100k+ QPS.
Search/Routing: 1B queries/day ≈ 12k QPS.
Storage:
Map Tiles (Vector): ~50PB for global coverage across 20 zoom levels (compressed).
POI Database: 200M POIs * 1KB/POI ≈ 200GB.
Road Graph: 100M nodes/edges ≈ 50GB in-memory representation.
Bandwidth:
Vector Tile: ~20KB-50KB per tile. 100k QPS * 50KB ≈ 5GB/s egress.
Blueprint
Concise Summary: A geo-distributed architecture utilizing CDN-cached vector tiles for rendering, S2-indexed spatial databases for search, and a partitioned graph processing engine for routing.
Major Components:
Tile Service: Serves pre-generated vector tiles from object storage via CDN.
Search Service: Uses a spatial index (Elasticsearch/PostGIS) to find POIs within a bounding box.
Routing Service: Executes pathfinding algorithms on an in-memory road graph.
Traffic Processor: Ingests real-time GPS telemetry to update road weights.
Simplicity Audit: This design avoids complex 3D rendering and focuses on the "Big Three" (View, Search, Route) using proven spatial indexing and graph theory.
Architecture Decision Rationale:
Why this architecture?: Separating Tile, Search, and Routing allows independent scaling. Tiles are static-heavy (CDN), Search is IO-heavy (DB), and Routing is CPU/RAM-heavy.
Functional Satisfaction: Covers end-to-end user flow from finding a place to getting there.
Non-functional Satisfaction: CDN ensures low latency globally, while in-memory graphs ensure fast routing.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing:
CDN: Critical for Vector Tiles (PBF format). Uses Anycast DNS to route users to the nearest Edge POP.
L7 Load Balancer: Handles SSL termination and routes
/search to Search Service and /route to Routing Service.Security:
Rate Limiting: Per-API key quotas to prevent scraping of map data.
Service
Topology & Scaling:
Stateless Services: All API services are stateless and run in Kubernetes clusters across multiple Availability Zones.
Routing Service Isolation: Routing nodes are sharded by geographic region (e.g., North America, Europe) to keep the graph size manageable in RAM.
API Schema Design:
GET /v1/tile/{z}/{x}/{y}.pbf: Returns vector data for a specific tile coordinates.GET /v1/search?q=coffee&lat=...&long=...: Returns list of POIs. Uses Protobuf for response.GET /v1/route?origin=...&dest=...&mode=driving: Returns a polyline and step-by-step instructions.Resilience:
Circuit Breakers: If the Traffic Processor lags, the Routing Service falls back to historical average speeds.
Storage
Access Pattern:
Tiles: Read-heavy, write-once (updated weekly).
POI: Read-heavy (search), low-frequency writes.
Database Table Design:
POI Table:
poi_id (UUID), name (String), location (GEOGRAPHY), category (Int), metadata (JSONB). Index: GIST index on
location for PostGIS.Technical Selection:
Object Store (S3): For Vector Tiles.
PostgreSQL/PostGIS: For POI storage due to robust spatial query support (ST_DWithin, ST_Distance).
Graph Representation: Adjacency list stored in a specialized Graph DB (e.g., Neo4j) or serialized in-memory for the Routing Service.
Cache
Purpose: Reduce redundant computation for popular routes (e.g., SFO to Downtown SF).
Key-Value Schema:
route:{origin_s2}:{dest_s2} -> EncodedPolyline.Technical Selection: Redis. Uses an LRU eviction policy. TTL set to 5 minutes to account for traffic changes.
Messaging
Purpose: Buffer high-velocity GPS pings from millions of active app users.
Event Schema:
vehicle_id, timestamp, lat, long, speed.Technical Selection: Kafka. High throughput allows ingestion of millions of events per second.
Data Processing
Processing Model: Stream Processing.
Processing DAG: Kafka Source -> Map Matcher (snap GPS to road segments) -> Speed Calculator (moving average) -> Graph Weight Updater.
Technical Selection: Apache Flink. Handles out-of-order data and windowing for calculating average speeds on road segments.
Infrastructure (Optional)
Distributed Coordination:
Zookeeper: Used by Kafka and for leader election in the Routing Service shards to ensure only one node per region handles graph updates.
Wrap Up
Advanced Topics
Trade-offs: We choose Availability over Consistency (AP) for the map tiles and POI search. It's better to show an old map than no map at all. However, routing requires higher accuracy.
Reliability: If the routing engine fails, we use a "Simple Router" fallback that uses straight-line distance or cached major highway paths.
Bottleneck Analysis: The Road Graph is the primary bottleneck. If it exceeds 128GB RAM, we must move to a distributed graph traversal or more aggressive pruning.
Security: PII protection for GPS telemetry; data is anonymized and aggregated before being used to calculate traffic speeds.
Optimization - S2 Cells: Instead of lat/long, use S2 Cell IDs (uint64). This allows the system to treat 2D spatial queries as 1D range queries, significantly speeding up database lookups.