The Question
Design

Google Maps System Design

Design a large-scale geospatial mapping and navigation system that allows users to view maps, search for local points of interest, and calculate optimal driving routes considering real-time traffic conditions for a global user base.
S2 Geometry
Vector Tiles
Elasticsearch
PostGIS
Kafka
Questions & Insights

Clarifying Questions

What are the core features for the MVP? (Assumption: Map rendering via tiles, Point of Interest (POI) search, and point-to-point navigation/routing.)
What is the scale of the system? (Assumption: 100M Daily Active Users (DAU), 1 Billion total POIs, and 1 million navigation requests per minute.)
Do we need real-time traffic updates for the MVP? (Assumption: Yes, as routing without traffic is significantly less useful for a modern MVP.)
Are we supporting offline maps or 3D rendering? (Assumption: No, YAGNI. We focus on 2D vector tiles and online-only connectivity for the MVP.)
What is the data source for the map? (Assumption: Road network and POI data are pre-ingested from providers like OpenStreetMap (OSM) or internal surveys.)

Thinking Process

Spatial Indexing: How do we efficiently find "restaurants near me"? We must use Quadtrees, Geohashing, or Google's S2 library to partition the Earth's surface.
Map Tile Strategy: How do we serve maps without massive latency? Use Vector Tiles (not raster) stored in a CDN, allowing the client to render the map dynamically.
Graph Representation: How is the road network stored for routing? The world is a directed weighted graph. We need to partition this graph geographically to fit into memory for pathfinding.
Real-time Throughput: How do we handle millions of GPS pings? A write-heavy ingestion pipeline (Kafka) is required to process user speeds and update road segment weights.

Bonus Points

S2 Geometry Library: Use S2 cells (Hilbert Curve) instead of Geohashes to minimize "edge cases" at the equator/poles and allow for flexible coverage queries.
Vector Over Raster: Use Protocol Buffers (Protobuf) for tile delivery to reduce payload size by 90% compared to PNGs, enabling smoother zooming and client-side styling.
Hierarchical Pathfinding: Implement Contraction Hierarchies to pre-calculate "highways" between distant nodes, reducing A* search space from millions of nodes to hundreds.
Differential Privacy: Apply noise to individual user GPS traces before aggregating traffic data to ensure user anonymity while maintaining traffic accuracy.
Design Breakdown

Functional Requirements

Map Rendering: Users can view maps at various zoom levels.
POI Search: Users can search for locations by name or category (e.g., "Pizza").
Navigation: Provide the fastest route from point A to point B.
Traffic Awareness: Navigation should account for current traffic conditions.

Non-Functional Requirements

Availability: 99.99% availability; navigation is a critical utility.
Latency: Search results < 200ms; Tile loading < 100ms; Route calculation < 1s.
Scalability: Handle peak traffic (e.g., rush hour) globally.
Accuracy: Routing and ETA must be highly accurate.

Estimation

DAU: 100 Million.
Map Tiles: Average user views 20 tiles/session. 100M \times 20 = 2B tile requests/day. At 50KB/tile, that's 100TB/day (heavily cached by CDN).
POI Search: 5 searches/user = 500M daily queries (\approx 6,000 QPS).
GPS Pings: If 10M users are navigating simultaneously, pinging every 10s = 1M QPS ingestion.
Storage: 1B POIs \times 1KB/POI = 1TB. Road network graph = \approx 500GB.

Blueprint

Concise Summary: A geo-distributed system utilizing spatial indexing (S2) for search and a partitioned graph engine for routing, with a heavy emphasis on CDN-based tile delivery.
Major Components:
Load Balancer/API Gateway: Handles request routing, authentication, and rate limiting.
Search Service: Uses an inverted index with spatial filtering to find POIs.
Navigation/Routing Service: Performs A* or Dijkstra on a graph of road segments.
Location Ingestor: Collects real-time GPS pings to update traffic.
Tile Service: Serves static map data via a Global CDN.
Simplicity Audit: This architecture uses standard spatial indexing and vector tiles, avoiding the complexity of raster rendering servers or offline synchronization logic.
Architecture Decision Rationale:
Why this architecture?: Separating Search, Routing, and Tiles allows independent scaling. Routing is CPU-intensive, while Tile serving is I/O-intensive.
Functional Satisfaction: Covers the "Search -> See -> Go" user flow.
Non-functional Satisfaction: CDN ensures low-latency map loading; Kafka ensures high-throughput GPS ingestion without blocking the user.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling: Microservices deployed in Kubernetes. The Routing Service is horizontally scaled based on CPU (pathfinding), while the Search Service scales based on memory (indexing).
API Spec:
GET /v1/search?q=pizza&lat=...&long=...: Returns JSON list of POIs.
GET /v1/route?origin=...&dest=...&mode=driving: Returns a Polyline and instructions.
POST /v1/location: Sends GPS batch (lat, long, timestamp, speed).
Tile requests bypass the API Gateway via tiles.google.com/{z}/{x}/{y}.mvt.

Storage

Data Model:
POI (ElasticSearch): Document with {id, name, location: [lat, lon], category, s2_cell_id}.
Road Network (PostGIS): Edges (Road segments) and Nodes (Intersections). Edges have weights (time, distance).
Database Logic: S2 cells are used to shard POI data. Routing graph is partitioned into tiles (e.g., Level 12 S2 cells) to allow the Routing Service to load only relevant geography into memory.

Cache

Traffic Cache (Redis): Stores real-time "speed factors" for road segment IDs.
TTL: 30-60 seconds. Navigation service fetches these weights to adjust the graph edge costs dynamically.
Eviction: Least Recently Used (LRU) for infrequently traveled roads.

Messaging

Kafka: Decouples the Location Ingestor from the Traffic Processor.
Topic Structure: location_updates partitioned by geohash_prefix to ensure all pings for a specific area go to the same consumer for easier aggregation.
Guarantees: At-least-once delivery.

Data Processing

Traffic Processor (Apache Flink):
Windowing: 1-minute tumbling windows.
Logic: Aggregates all speeds for a specific road_segment_id. If the average speed is < 50% of the speed limit, mark as "Heavy Traffic" and update the Redis cache.
Wrap Up

Advanced Topics

Monitoring: Prometheus for P99 latency on routing; Grafana for visualizing "Hot Spots" in tile requests.
Trade-offs: Availability over Consistency. If the traffic cache is stale by 30 seconds, it's acceptable for the user to see a slightly inaccurate ETA rather than a system crash.
Bottlenecks: The Routing Service memory is the bottleneck. The graph for the entire world cannot fit in RAM.
Failure Handling:
Static Routing Fallback: If the Traffic Cache/Flink fails, the Routing Service falls back to "Static" weights (speed limits) to ensure navigation still works.
Multi-Region Tiles: S3 is replicated across regions so tiles are served from the nearest bucket if the CDN misses.
Alternatives & Optimization: Instead of full A, use Landmarks (ALT algorithm) or Contraction Hierarchies to speed up long-distance routing. Use Protocol Buffers** for all internal service-to-service communication to reduce serialization overhead.