The Question

Real-time Dynamic Pricing System

Design a real-time surge pricing engine for a global ride-sharing platform. The system must process millions of concurrent location updates from drivers and riders to calculate localized price multipliers based on supply and demand. Key challenges include handling high-throughput ingestion, ensuring sub-second pricing retrieval, managing geographic data at scale, and maintaining system reliability during massive demand spikes.

Kafka

Apache Flink

Redis

PostgreSQL

API Gateway

NoSQL

Questions & Insights

Clarifying Questions

Scale and Throughput: What is the expected scale in terms of Daily Active Users (DAU) and concurrent drivers? Are we designing for a global scale or a specific high-density metropolitan area?

Update Frequency: How "real-time" does the surge multiplier need to be? Is a 30-second lag acceptable, or do we need sub-second updates to pricing?

Geographic Granularity: What is the size of the "surge zone"? Are we using fixed neighborhoods or a hexagonal grid system like Uber's H3?

Data Inputs: Aside from current supply (available drivers) and demand (riders looking at the app), should we consider external factors like weather, traffic, or scheduled events (e.g., concerts)?

Assumptions for MVP:

Scale: 100M DAU, 5M active drivers globally. Peak traffic of 500k GPS pings per second.

Latency: Surge multipliers should update every 10-20 seconds. Query latency for the price must be < 50ms.

Granularity: Use Uber’s H3 grid (Level 7 or 8) for spatial indexing.

Inputs: Focus strictly on the supply/demand ratio (drivers vs. ride requests/app opens) for the MVP.

Thinking Process

Core Bottleneck: How to efficiently aggregate millions of moving GPS points into discrete geographic buckets in real-time without overwhelming the database.

Logical Progression:

How do we map a (Lat, Long) to a Surge Zone? (Spatial Indexing / H3).

How do we count "Supply" and "Demand" per zone? (Stream Processing / Windowing).

How do we calculate the multiplier and serve it? (Pricing Engine + Distributed Cache).

How do we ensure the system doesn't create "Price Walls" where moving one block changes the price 3x? (Spatial Smoothing).

Bonus Points

H3 Hexagonal Hierarchies: Using hexagons instead of squares avoids "edge effects" where diagonal neighbors have different distances than orthogonal ones.

Negative Feedback Loop Management: Implementing "Dampening" algorithms to prevent the "yo-yo effect" where a surge attracts too many drivers, instantly killing the surge and causing drivers to leave.

Backpressure Handling: Using Flink's watermarking and backpressure mechanisms to handle spikes during massive events (e.g., New Year's Eve) without dropping data.

Spatial Smoothing: Using Gaussian kernels across neighboring H3 cells to ensure price transitions are smooth for users walking across cell boundaries.

Design Breakdown

Functional Requirements

Core Use Cases:

Drivers send periodic GPS updates (Supply).

Riders open the app or request a ride (Demand).

System calculates a surge multiplier (e.g., 1.5x) for specific geo-locations.

System provides the current multiplier to the Ride Booking service.

Scope Control:

In-scope: Real-time aggregation of supply/demand, surge calculation, and low-latency multiplier serving.

Out-of-scope: Payments, driver earnings reports, or historical ML-based price forecasting.

Non-Functional Requirements

Scale: Support 500k+ RPS for GPS ingestion and 100k+ RPS for pricing queries.

Latency: End-to-end pricing update propagation < 20 seconds; Read latency < 50ms.

Availability & Reliability: 99.99% availability (P0 system); surge failure should default to 1.0x (base price) rather than crashing the app.

Consistency: Eventual consistency is acceptable for pricing updates.

Fault Tolerance: Loss of a single stream processor shouldn't lose state (checkpointing).

Estimation

Ingestion (Supply): 5M drivers * 1 ping/5s = 1M pings/sec.

Ingestion (Demand): 50M active riders checking app * 1 request/30s ≈ 1.6M requests/sec.

Storage:

H3 Cells: Total cells at Level 7 ≈ 5M.

Multiplier State: 5M cells * 128 bytes (Cell ID + Multiplier + Timestamp) ≈ 640MB. (Fits easily in memory/Redis).

Bandwidth: 2.6M total requests/sec * 500 bytes per packet ≈ 1.3 GB/s total ingestion bandwidth.

Blueprint

Concise Summary: A streaming-first architecture that ingests driver and rider telemetry via Kafka, aggregates counts per H3 cell using Apache Flink, and stores the resulting multipliers in a global Redis cache for sub-millisecond retrieval.

Major Components:

API Gateway: Entry point for driver GPS pings and rider requests, performing authentication and rate limiting.

Kafka: Acts as the high-throughput buffer for supply/demand events.

Apache Flink: Performs windowed aggregations (e.g., 30s sliding windows) to calculate supply/demand ratios per H3 ID.

Redis: Stores the latest surge multiplier for every active H3 cell globally.

Pricing Service: A microservice that consumes the H3 multiplier and applies business logic (max caps, smoothing) before returning it to the user.

Simplicity Audit: This design avoids complex "on-the-fly" spatial joins by pre-calculating the H3 cell on the client or gateway, turning a spatial problem into a simple key-value aggregation problem.

Architecture Decision Rationale:

Why this architecture?: Stream processing is the industry standard for real-time analytics. Flink provides stateful exactly-once processing which ensures accurate counts.

Functional Satisfaction: Meets the need for real-time supply/demand balancing.

Non-functional Satisfaction: Scalable through Kafka partitioning and Flink parallelism; highly available via Redis replication.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Not needed for dynamic GPS pings, but we use Global Accelerator for low-latency entry into the backbone.

Security & Perimeter:

API Gateway: Validates JWTs.

Rate Limiting: Critical for preventing GPS ping flooding from compromised apps.

H3 Conversion: The Edge/Gateway converts (Lat, Long) to H3 Index immediately to simplify downstream processing.

Service

Topology & Scaling: Stateless microservices deployed across multiple Availability Zones (AZs). Scaling based on CPU and Request Count.

API Schema Design:

POST /v1/telemetry: GPS updates (Driver) and Session Heartbeats (Rider).

GET /v1/pricing: Returns current multiplier for a given H3 Cell.

Resilience:

Circuit Breakers: If the Flink/Redis pipeline fails, the Pricing Service defaults to a 1.0x multiplier.

Timeouts: Aggressive 50ms timeouts for cache reads.

Storage

Access Pattern: Heavy write (updates every 10s per cell) and heavy read (every ride request).

Technical Selection: PostgreSQL for cell metadata (neighborhood names, base rates, city boundaries).

Database Table Design:

h3_cells: cell_id (PK), city_id, base_multiplier, is_active.

Distribution: Sharded by city_id.

Cache

Purpose: Stores the "Real-time Surge Map".

Key-Value Schema:

Key: surge:cell:{h3_id}

Value: {"multiplier": 1.8, "updated_at": 1672531200}

TTL: 60 seconds (Surge is ephemeral).

Technical Selection: Redis (Cluster mode).

Failure Handling: If a key is missing, default to 1.0.

Messaging

Purpose: Decouples high-volume ingestion from heavy-duty processing.

Event Schema:

Topic: location_updates

Payload: { "user_id": "u123", "type": "DRIVER|RIDER", "h3_index": "8828308281fffff" }

Partitioning: Partition by h3_index to ensure all updates for a single cell land on the same Flink task manager (avoiding cross-network shuffles).

Data Processing

Processing Model: Sliding window (30s window, sliding every 10s).

Logic:

Count unique user_id where type=DRIVER (Supply).

Count unique user_id where type=RIDER (Demand).

Multiplier = f(Demand / Supply).

Technical Selection: Apache Flink. It handles late-arriving data and state management for windowed counts natively.

Wrap Up

Advanced Topics

Trade-offs: We trade precision for availability. A rider might be on the very edge of a surge cell and see a different price than someone 5 feet away. We mitigate this using H3's hierarchical parent cells to average prices.

Reliability: Use a "Dead Letter Queue" in Kafka for malformed GPS pings to prevent Flink job crashes.

Bottleneck Analysis: The "Hot Cell" problem (e.g., a stadium after a game). We handle this by partitioning Kafka/Flink by H3 ID, ensuring horizontal scalability.

Security: GPS data is PII. We scrub user_id in the aggregation layer so the "Surge Map" only contains anonymous counts.

Optimization (Spatial Smoothing): Instead of just using one cell, Flink can emit the counts for a cell and its 6 neighbors. The Pricing Service calculates a weighted average to "blur" the edges of the surge, preventing price shocks.