The Question
DesignGoogle Maps System Design
Design a global-scale mapping and navigation system similar to Google Maps. The system must handle rendering interactive maps, searching for millions of points of interest (POIs), and providing real-time turn-by-turn navigation with live traffic updates. Focus on geospatial indexing strategies, efficient pathfinding algorithms for millions of concurrent users, and the data pipeline required to ingest and process crowd-sourced traffic data at scale while maintaining low latency and high availability.
S2 Geometry
CDN
Kafka
Apache Flink
Redis
Elasticsearch
Protobuf
Vector Tiles
Graph Database
Hierarchical Hub Labeling
Questions & Insights
Clarifying Questions
What is the primary scale of the system? (Assumption: Global scale with 1B+ Monthly Active Users and 100M+ Daily Active Users).
What are the core features for the MVP? (Assumption: Map rendering via tiles, Point of Interest (POI) search, and turn-by-turn navigation/routing).
Do we need real-time traffic updates for the MVP? (Assumption: Yes, crowd-sourced location data must influence routing ETAs).
What is the target device profile? (Assumption: Primarily mobile-first (iOS/Android) with heavy emphasis on bandwidth efficiency and offline-first capabilities).
Is the map data static or dynamic? (Assumption: Static base maps with dynamic overlays for traffic and POI updates).
Thinking Process
How do we efficiently represent and serve massive geospatial data? Use Map Tiles (vector tiles preferred) and a Hierarchical Spatial Index (Google S2 or Quadtree) to shard the world.
How do we handle Point of Interest (POI) search at scale? Implement a Geospatial Search engine (Elasticsearch with Geo-queries) indexed by S2 cells.
How do we solve the shortest path problem for millions of concurrent users? Model the world as a directed weighted graph. Use A* for short distances and Hierarchical Hub Labeling or Contraction Hierarchies for lightning-fast long-distance routing.
How is real-time traffic integrated? Ingest user location streams through a messaging layer, aggregate them into road segment velocities, and update the routing graph weights.
Bonus Points
S2 Geometry Library: Explain why S2 (Hilbert Curve based) is superior to Geohash for cell stability and neighbor lookups.
Hierarchical Hub Labeling (HHL): Discuss pre-computing distances between "hubs" to reduce routing query time from O(E \log V) to nearly O(1).
Vector vs. Raster Tiles: Detail how vector tiles allow client-side styling and 90% bandwidth reduction compared to raster PNGs.
Predictive ETA: Mention using Graph Neural Networks (GNNs) or LSTMs to predict traffic 30 minutes into the future based on historical patterns.
Design Breakdown
Functional Requirements
Core Use Cases:
Map Rendering: Users can view maps at various zoom levels.
POI Search: Users can search for businesses/locations by name or category.
Navigation: Users get the fastest route from point A to B with ETA.
Location Tracking: Real-time GPS updates for the user's position.
Scope Control:
In-Scope: Map tiles, Search, Routing, Traffic ingestion.
Out-of-Scope: Street view (360 imagery), Satellite imagery processing, Social features (sharing location with friends).
Non-Functional Requirements
Scale: Support 100M+ DAU and 1M+ QPS for tile/search requests.
Latency: Map tiles must load in < 200ms; Routing must calculate in < 500ms.
Availability & Reliability: 99.99% availability; Map must work even if traffic updates are delayed (graceful degradation).
Consistency: Eventual consistency for POI updates; High accuracy for current GPS position.
Security: TLS for all transit; Privacy-preserving techniques for user location history (anonymization).
Estimation
Traffic Estimation:
Map Tile QPS: 100M DAU * 10 tiles/session / 86400s \approx 12k QPS (Average), Peak 50k+ QPS.
Search QPS: 100M DAU * 2 searches \approx 2.3k QPS.
Storage Estimation:
World Map Data: Raw OSM data is ~100GB. Multi-zoom level tiles ~PB range (stored in Object Store).
POI Database: 200M POIs * 1KB \approx 200GB.
Bandwidth Estimation:
Outgoing: 12k QPS * 50KB (Vector Tile) \approx 600 MB/s.
Blueprint
Concise Summary: A geo-distributed architecture leveraging a global CDN for tile delivery, a specialized graph-based routing engine, and a geospatial search service indexed by S2 cells.
Major Components:
Tile Service: Serves pre-rendered vector map data stored in S3/GCS.
Search Service: Geospatial indexing (Elasticsearch) for finding POIs within a radius.
Routing Service: Graph-based engine for pathfinding (Dijkstra/A*/HHL).
Location Ingestor: High-throughput pipeline for collecting user GPS pings to derive traffic.
Simplicity Audit: This design separates static asset delivery (Tiles) from compute-heavy tasks (Routing), allowing independent scaling and minimizing cost via CDN caching.
Architecture Decision Rationale:
Why this architecture?: Geo-sharding map data is the only way to scale to a global dataset while maintaining low latency.
Functional Satisfaction: Tiles provide the visual, Search provides the intent, and Routing provides the utility.
Non-functional Satisfaction: CDN ensures low latency; Graph pre-computation ensures fast routing.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing:
CDN: Essential for Map Tiles. Tiles are immutable for a specific version. Cache TTL set to 24h+.
Anycast DNS: Routes users to the nearest edge location.
Security & Perimeter:
API Gateway: Handles AuthN/AuthZ, rate limiting to prevent scrapers from stealing map data.
Service
Topology & Scaling:
Stateless microservices deployed in K8s across multiple regions (US-East, EU-West, Asia-South).
Search Service: Scales based on QPS.
Routing Service: Memory-intensive; scales based on graph size and request complexity.
API Schema Design:
GET /v1/tiles/{z}/{x}/{y}: Returns Protobuf vector data.GET /v1/search?q=coffee&lat=...&long=...: Returns List of POIs.GET /v1/directions?origin=...&dest=...&mode=driving: Returns Polyline and ETA.Resilience:
Circuit Breakers: If the Traffic Cache (Redis) is down, routing falls back to "static" weights (speed limits) without real-time data.
Storage
Access Pattern:
Tiles: High read, zero write.
POI: High read, low write.
Traffic: Extremely high write, high read.
Technical Selection:
Tiles: S3/GCS with CDN.
POI: Elasticsearch or Cloud Spanner with S2 indexing.
Routing Graph: Custom In-memory graph for performance; persistent store in Neo4j or Graph-optimized RDBMS.
Distribution Logic:
Shard POI and Graph data by S2 Cell ID (Level 12-14) to ensure data for the same city resides on the same shard.
Cache
Purpose: Store real-time traffic weights for road segments.
Key-Value Schema:
segment_id -> {current_speed, timestamp}.Technical Selection: Redis (Cluster mode) for sub-millisecond lookups during pathfinding.
Failure Handling: Fall back to historical averages if real-time data is missing.
Messaging
Purpose: Buffering millions of GPS pings from mobile devices.
Event Schema:
{user_id, lat, long, timestamp, speed, heading}.Technical Selection: Kafka. Partitioned by
road_segment_id (via reverse geocoding) to ensure all pings for one road go to the same processor.Data Processing
Processing Model: Streaming (Apache Flink/Spark Streaming).
Processing DAG:
Map Match: Align noisy GPS pings to road segments.
Aggregate: Calculate moving average speed per segment.
Update: Push new weights to Redis and the Routing Engine.
Technical Selection: Flink for its low-latency windowing capabilities.
Infrastructure (Optional)
Observability:
Standard RED metrics.
Critical Metric: Tile miss rate on CDN, Routing compute time P99.
Distributed Coordination:
Zookeeper/Etcd for managing the cluster state of the distributed routing graph shards.
Wrap Up
Advanced Topics
Trade-offs:
Consistency vs. Latency: Routing uses eventual consistency for traffic data. It’s better to provide a 1-minute old "fast" route than to block for 5 seconds to get the "perfect" route.
Reliability:
Offline Maps: Mobile clients pre-fetch tiles for a bounding box and store them in SQLite. Search and Routing logic is partially replicated on-device for basic functionality.
Bottleneck Analysis:
The "Hot Spot" Problem: Popular cities (NYC, London) have high QPS. Use multi-layered caching and dedicated shard replicas for these S2 cells.
Security & Privacy:
Differential Privacy: Add noise to location data before ingestion to ensure individual users cannot be tracked by analyzing traffic patterns.
Optimization (Hierarchical Routing):
Use a "Multi-level" graph. Long-distance routing uses only highways; local routing uses all streets. This significantly reduces the search space for Dijkstra.