The Question
Design a Scalable IoT-Enabled Bike Sharing System
Design a dockless shared bike system supporting 10 million bikes and 5 million daily users. The system must handle high-frequency IoT location heartbeats, provide low-latency proximity searches for users to find bikes, and ensure strictly consistent ride transactions and billing. Discuss the integration of an IoT gateway, geospatial indexing strategies, and how to maintain system reliability during network partitions or hardware failures.
MQTT
Redis
PostgreSQL
PostGIS
Kafka
Kubernetes
S2 Geometry
mTLS
JWT
CDNs
Questions & Insights
Clarifying Questions
Scale and Growth: How many bikes and active users are we designing for? (Assumption: 10 million bikes, 5 million Daily Active Users (DAU), focused on a high-density urban environment).
Communication Protocol: Does the bike hardware support persistent connections like MQTT, or is it HTTP-based? (Assumption: IoT bikes use MQTT for low-power, persistent bidirectional communication).
Location Precision: How frequent are the location updates? (Assumption: Every 30 seconds during an active ride, every 5 minutes when idle).
Unlocking Mechanism: How does the user unlock the bike? (Assumption: Scanning a QR code via the app, which communicates with the server, which then signals the bike lock via the IoT gateway).
Thinking Process
Core Strategy: The system must prioritize availability for bike search (eventual consistency is acceptable) but strict consistency for the ride state machine (unlocking/billing) to prevent double-booking or lost revenue. We will use a geospatially indexed cache for bike discovery and a relational database for transactional integrity.
Key Questions for Architecture:
How do we efficiently query thousands of bikes within a 500m radius of a user?
How do we manage millions of concurrent IoT connections without overwhelming the backend?
How do we ensure the bike is unlocked only if the user has a valid account and the bike is actually available?
How do we handle network failures during the "lock" event to ensure accurate billing?
Bonus Points
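S2 Geometry vs. Geohash: Use Google's S2 cells rather than Geohash for geospatial indexing. S2 orders cells along a Hilbert curve, which preserves spatial locality better than Geohash's Z-order curve, so there are fewer "grid jumps" at cell boundaries and neighbor-cell lookups are cheaper.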
Backpressure & Load Leveling: Use Kafka as a buffer between the IoT Gateway and the Ride Service to handle spikes in location heartbeats without crashing the database.
Idempotency Keys: Use client-generated idempotency keys on the "Unlock" command to prevent duplicate ride creation when requests are retried during flaky 4G/5G transitions.
Shadow Ride Resolution: Apply a heartbeat timeout: if a bike stops reporting during an active ride, the system triggers a "check-and-lock" protocol so the user is not billed indefinitely.
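A minimal sketch of the heartbeat-timeout check behind Shadow Ride Resolution. The threshold of three missed 30-second heartbeats and the names ActiveRide and find_shadow_rides are assumptions made for illustration; the "check-and-lock" step itself is not shown.

```python
from dataclasses import dataclass

# Assumption: three missed 30-second ride heartbeats mark a ride as "shadowed".
HEARTBEAT_INTERVAL_S = 30
MISSED_BEATS_ALLOWED = 3


@dataclass
class ActiveRide:
    ride_id: str
    bike_id: str
    last_heartbeat_ts: float  # Unix seconds of the last heartbeat seen


def find_shadow_rides(rides, now_ts):
    """Return active rides whose bike has been silent longer than the timeout."""
    timeout_s = HEARTBEAT_INTERVAL_S * MISSED_BEATS_ALLOWED
    return [r for r in rides if now_ts - r.last_heartbeat_ts > timeout_s]


rides = [
    ActiveRide("r1", "b1", last_heartbeat_ts=1000.0),  # silent for 100 s
    ActiveRide("r2", "b2", last_heartbeat_ts=1080.0),  # silent for 20 s
]
stale = find_shadow_rides(rides, now_ts=1100.0)
print([r.ride_id for r in stale])  # ['r1']
```

A periodic job could run this scan over active rides and hand each stale ride to the check-and-lock flow.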
Design Breakdown
Functional Requirements
Core Use Cases:
Find Bikes: Users can see available bikes nearby on a map.
Unlock/Start Ride: Users scan a QR code to unlock a bike.
End Ride/Payment: User locks the bike; the system calculates the duration and charges the user.
Bike Status Reporting: Bikes periodically send GPS location and battery status.
Scope Control:
In-scope: Real-time bike discovery, ride lifecycle management, IoT connectivity, and basic billing.
Out-of-scope: Maintenance/Repair workforce app, advanced demand forecasting (ML), and bike-sharing station management (this is a "dockless" design).
Non-Functional Requirements
Scale: Support 10M bikes and 1M concurrent rides at peak.
Latency: "Find nearby bikes" query should return in < 200ms. "Unlock" command should reach the bike in < 2 seconds.
Availability & Reliability: 99.99% uptime. Users must be able to end a ride even if the billing system is down.
Consistency: Strong consistency for bike status (Available vs. In-Use) so that two users can never both unlock the same bike.
Security: Secure IoT communication (TLS/mTLS) and encrypted payment processing.
Estimation
Traffic Estimation:
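Write QPS: 10M bikes reporting every 300s (idle) ≈ 33k QPS. During peak, 1M bikes in active rides reporting every 30s ≈ 33k QPS. Total peak write ≈ 66k QPS (a slight overestimate, since the 1M active bikes are also counted in the idle figure).
Read QPS: 5M DAU, each searching 5 times/day = 25M requests / 86,400s ≈ 300 QPS on average. Assuming a 10x peak factor, peak search ≈ 3,000 QPS.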
Storage Estimation:
Ride History: 10M rides/day (about 2 rides per daily user) * 500 bytes/record = 5 GB/day, or roughly 1.8 TB/year.
Bike Metadata: 10M bikes * 1KB = 10GB (easily fits in RAM/SSD).
Bandwidth Estimation:
Incoming: 66k QPS * 200 bytes/heartbeat ≈ 13.2 MB/s (about 105 Mbps).
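The estimates above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope check using the figures already stated in this section; the variable names are illustrative.

```python
# Inputs taken from the assumptions and estimates in this section.
TOTAL_BIKES = 10_000_000
PEAK_ACTIVE_RIDES = 1_000_000
IDLE_INTERVAL_S = 300
RIDE_INTERVAL_S = 30
HEARTBEAT_BYTES = 200
DAU = 5_000_000
SEARCHES_PER_USER = 5
RIDES_PER_DAY = 10_000_000
RIDE_RECORD_BYTES = 500
SECONDS_PER_DAY = 86_400

idle_qps = TOTAL_BIKES / IDLE_INTERVAL_S            # about 33.3k
ride_qps = PEAK_ACTIVE_RIDES / RIDE_INTERVAL_S      # about 33.3k
peak_write_qps = idle_qps + ride_qps                # about 66.7k

avg_read_qps = DAU * SEARCHES_PER_USER / SECONDS_PER_DAY   # about 289

ride_gb_per_day = RIDES_PER_DAY * RIDE_RECORD_BYTES / 1e9  # 5.0
ride_tb_per_year = ride_gb_per_day * 365 / 1000            # 1.825

ingress_mb_per_s = 66_000 * HEARTBEAT_BYTES / 1e6   # 13.2
ingress_mbps = ingress_mb_per_s * 8                 # 105.6

print(round(peak_write_qps), round(avg_read_qps), ride_tb_per_year, ingress_mbps)
```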
Blueprint
Concise Summary: A microservices architecture leveraging an IoT Gateway (MQTT) for bike communication, Redis for low-latency geospatial discovery, and a Relational Database for ride transactions and billing.
Major Components:
IoT Gateway: Manages persistent MQTT connections with bikes for heartbeats and lock commands.
Bike Service: Maintains bike metadata (model, battery) and state (available, faulty).
Ride Service: Manages the ride state machine (Reserved -> Active -> Completed).
Geo Service: Handles proximity queries using Redis Geospatial indexes.
Billing Worker: Asynchronously processes payments after a ride ends.
Simplicity Audit: This design avoids complex stream processing engines (like Flink) for the MVP, using Redis for real-time state and Kafka for reliable async processing.
Architecture Decision Rationale:
Why this architecture?: Decoupling IoT communication from business logic allows independent scaling of the connection layer (stateful) and the service layer (stateless).
Functional Satisfaction: Covers the full lifecycle from discovery to payment.
Non-functional Satisfaction: Redis provides sub-millisecond geo-lookups; Kafka ensures billing reliability and system resilience.
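As a small illustration of the ride state machine owned by the Ride Service (Reserved -> Active -> Completed), the transitions can be encoded as an explicit table. This is a sketch only; the names RideState, ALLOWED, and transition are hypothetical, and a real service would likely add cancellation and failure states.

```python
from enum import Enum


class RideState(Enum):
    RESERVED = "reserved"
    ACTIVE = "active"
    COMPLETED = "completed"


# Allowed forward transitions; COMPLETED is terminal.
ALLOWED = {
    RideState.RESERVED: {RideState.ACTIVE},
    RideState.ACTIVE: {RideState.COMPLETED},
    RideState.COMPLETED: set(),
}


def transition(current, target):
    """Return the new state, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


state = transition(RideState.RESERVED, RideState.ACTIVE)
state = transition(state, RideState.COMPLETED)
print(state.name)  # COMPLETED
```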
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: Global DNS with latency-based routing to the nearest regional API Gateway.
Security & Perimeter:
API Gateway: Handles JWT authentication for users, rate limiting (10 requests/sec per user), and TLS termination.
IoT Gateway: Uses MQTT over TLS. Each bike has a unique X.509 certificate for mTLS authentication to prevent spoofing.
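The per-user limit of 10 requests/sec at the API Gateway could be enforced with a token bucket. The sketch below is one possible approach, not a description of any particular gateway product; the burst capacity of 10 and the in-memory dictionary are assumptions, and a real deployment would keep the counters in a shared store.

```python
class TokenBucket:
    """Refills at `rate_per_s` tokens per second, up to `capacity`."""

    def __init__(self, rate_per_s, capacity, now):
        self.rate = rate_per_s
        self.capacity = capacity
        self.tokens = float(capacity)
        self.updated = now

    def allow(self, now):
        # Add the tokens earned since the last call, capped at capacity.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


# Assumption: one in-process bucket per user, 10 req/s with a burst of 10.
buckets = {}


def allow_request(user_id, now):
    bucket = buckets.get(user_id)
    if bucket is None:
        bucket = buckets[user_id] = TokenBucket(rate_per_s=10, capacity=10, now=now)
    return bucket.allow(now)


print(allow_request("demo-user", now=0.0))  # True
```

Passing the clock in as `now` keeps the sketch deterministic; production code would read a monotonic clock instead.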
Service
Topology & Scaling: Stateless microservices deployed on Kubernetes (EKS/GKE). Auto-scaling based on CPU (70% threshold).
API Schema Design:
POST /v1/rides/unlock: Protocol: REST. Request: {bike_id, lat, lng}. Response: {ride_id, lock_code}. Idempotency via request_id.
GET /v1/bikes/nearby: Protocol: REST. Request: {lat, lng, radius}. Response: List<Bike>.
Resilience:
Circuit Breaker: Used between Ride Service and Payment Gateway.
Retry Policy: IoT Gateway retries "Unlock" command 3 times with exponential backoff if ACK not received from bike.
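The retry policy can be sketched as follows. The base delay of 0.2 s is an assumption, chosen so that the three retries (0.2 s + 0.4 s + 0.8 s) stay within the 2-second unlock target; send_unlock is a hypothetical callable that returns True when the bike ACKs.

```python
import time


def send_unlock_with_retry(send_unlock, bike_id, retries=3,
                           base_delay_s=0.2, sleep=time.sleep):
    """Send once, then retry up to `retries` times, doubling the wait each time."""
    for attempt in range(retries + 1):
        if send_unlock(bike_id):  # True means the bike acknowledged the command
            return True
        if attempt < retries:
            sleep(base_delay_s * (2 ** attempt))  # 0.2 s, 0.4 s, 0.8 s
    return False


# Usage with a fake transport that ACKs on the third attempt.
attempts = []


def fake_send(bike_id):
    attempts.append(bike_id)
    return len(attempts) >= 3


ok = send_unlock_with_retry(fake_send, "bike-1", sleep=lambda s: None)
print(ok, len(attempts))  # True 3
```

Because the same command may reach the bike more than once, the lock firmware should treat a repeated unlock for the same ride as a no-op.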
Storage
Access Pattern:
Rides: Write-heavy (start/end), high consistency required.
Bike Metadata: Read-heavy.
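The high-consistency requirement on ride start can be met with a conditional update inside a single transaction, combined with a unique request_id for idempotency. The sketch below uses SQLite from the Python standard library purely as a stand-in for PostgreSQL so that it runs anywhere; the simplified tables, the request_id column, and the start_ride function are assumptions for illustration.

```python
import sqlite3
import uuid


def init_db(conn):
    conn.execute("CREATE TABLE bikes (id TEXT PRIMARY KEY, status TEXT NOT NULL)")
    conn.execute(
        "CREATE TABLE rides (id TEXT PRIMARY KEY, request_id TEXT UNIQUE, "
        "user_id TEXT, bike_id TEXT, status TEXT)"
    )


def start_ride(conn, request_id, user_id, bike_id):
    """Return the ride id, or None if the bike is not available."""
    with conn:  # one transaction: commit on success, roll back on error
        row = conn.execute(
            "SELECT id FROM rides WHERE request_id = ?", (request_id,)
        ).fetchone()
        if row:  # a retry of a request that was already handled
            return row[0]
        # Compare-and-set: only one caller can flip available -> in_use.
        cur = conn.execute(
            "UPDATE bikes SET status = 'in_use' "
            "WHERE id = ? AND status = 'available'",
            (bike_id,),
        )
        if cur.rowcount == 0:
            return None
        ride_id = str(uuid.uuid4())
        conn.execute(
            "INSERT INTO rides VALUES (?, ?, ?, ?, 'active')",
            (ride_id, request_id, user_id, bike_id),
        )
        return ride_id


conn = sqlite3.connect(":memory:")
init_db(conn)
conn.execute("INSERT INTO bikes VALUES ('bike-1', 'available')")
conn.commit()

first = start_ride(conn, "req-1", "alice", "bike-1")
retry = start_ride(conn, "req-1", "alice", "bike-1")   # same ride id as first
other = start_ride(conn, "req-2", "bob", "bike-1")     # None: bike is in use
print(first == retry, other)  # True None
```

The WHERE clause on status is what prevents double-booking: the second user's update matches zero rows, so no ride is created for them.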
Database Table Design (PostgreSQL):
Bikes: id (UUID), status (enum), battery_level, last_lat, last_lng. Index on status.
Rides: id (UUID), user_id, bike_id, start_time, end_time, cost, status (enum).
Technical Selection: PostgreSQL with PostGIS for persistent spatial storage (backup to Redis) and transactional ACID compliance for ride state transitions.
Distribution Logic: Partition the Rides table by created_at (monthly partitions) to keep the working set small.
Cache
Purpose: Real-time proximity search and bike availability status.
Key-Value Schema:
Geo Index: GEOADD bikes_available <lng> <lat> <bike_id>.
Bike Status: bike_status:{id} -> String (JSON) with a TTL of 10 minutes (twice the 5-minute idle heartbeat interval, so one missed heartbeat does not expire the key).
Technical Selection: Redis. Use GEORADIUS (or GEOSEARCH, which supersedes it from Redis 6.2) for O(N+log(M)) proximity searches.
Failure Handling: If Redis fails, the system falls back to PostGIS (higher latency, lower throughput) until the cache is rebuilt from heartbeats.
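The fallback path can be sketched as a thin wrapper around two query functions. Both callables here are hypothetical stand-ins (one for the Redis geo lookup, one for a PostGIS query), and the sketch assumes the cache client signals an outage by raising the built-in ConnectionError; a real Redis client library may raise its own exception type instead.

```python
def find_nearby(lat, lng, radius_m, geo_cache_query, postgis_query):
    """Try the geo cache first; fall back to the database if the cache is down.

    Returns (bike_ids, source) so callers can track how often the fallback fires.
    """
    try:
        return geo_cache_query(lat, lng, radius_m), "cache"
    except ConnectionError:
        return postgis_query(lat, lng, radius_m), "postgis"


# Usage with fakes: the cache is "down", so the database answers.
def broken_cache(lat, lng, radius_m):
    raise ConnectionError("cache unreachable")


def fake_postgis(lat, lng, radius_m):
    return ["bike-7", "bike-9"]


print(find_nearby(40.0, -74.0, 500, broken_cache, fake_postgis))
# (['bike-7', 'bike-9'], 'postgis')
```

Since PostGIS has lower throughput, the fallback would normally sit behind a tighter rate limit or a circuit breaker so that a cache outage does not overload the database.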
Messaging
Purpose: Decoupling location updates from business logic and ensuring reliable billing.
Event Schema:
LocationUpdate: {bike_id, lat, lng, timestamp}.
RideEnded: {ride_id, user_id, duration, distance}.
Throughput & Partitioning: Kafka topic bike_locations partitioned by bike_id to ensure ordered processing of heartbeats per bike.
Technical Selection: Kafka. High throughput for 66k QPS.
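Keying messages by bike_id gives per-bike ordering because every message with the same key lands on the same partition. The sketch below illustrates the idea with a CRC32 hash as a stand-in; Kafka's default partitioner uses a murmur2 hash, and the partition count of 64 is an assumption.

```python
import zlib

NUM_PARTITIONS = 64  # assumption for illustration


def partition_for(bike_id, num_partitions=NUM_PARTITIONS):
    """Map a bike_id to a partition; the same id always maps to the same one."""
    # Stand-in hash: Kafka's default partitioner uses murmur2, not CRC32.
    return zlib.crc32(bike_id.encode("utf-8")) % num_partitions


heartbeats = [("bike-42", 1), ("bike-7", 1), ("bike-42", 2), ("bike-42", 3)]
by_partition = {}
for bike_id, seq in heartbeats:
    by_partition.setdefault(partition_for(bike_id), []).append((bike_id, seq))

# Every heartbeat from bike-42 sits in one partition, in send order.
p = partition_for("bike-42")
print([seq for b, seq in by_partition[p] if b == "bike-42"])  # [1, 2, 3]
```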
Data Processing
Processing Model: Event-driven asynchronous processing.
Processing DAG:
Consumer (Kafka) -> Geo Updater: Updates the Redis geo index (GEOADD) and PostgreSQL last_lat/lng.
Consumer (Kafka) -> Billing Worker: Calculates fare -> Calls Payment Gateway -> Updates Rides table status to Paid.
Failure Handling: Use a Dead Letter Queue (DLQ) for failed payment attempts to allow manual reconciliation or retry logic.
Technical Selection: Go or Node.js workers for low-memory footprint and high concurrency.
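The Billing Worker's failure handling might look like the sketch below, written in Python for consistency with the other examples even though the workers themselves would be Go or Node.js. The pricing rule (100 cents base plus 15 cents per started minute), the attempt limit of 3, and the names PaymentError, fare_cents, and process_ride_ended are all hypothetical; a plain list stands in for the DLQ topic.

```python
import math

MAX_ATTEMPTS = 3  # assumption


class PaymentError(Exception):
    """Raised by the (hypothetical) payment gateway client on failure."""


def fare_cents(duration_s):
    # Hypothetical pricing: 100 cents base + 15 cents per started minute.
    return 100 + 15 * math.ceil(duration_s / 60)


def process_ride_ended(event, charge, dead_letters, max_attempts=MAX_ATTEMPTS):
    """Charge the rider; after repeated failures, park the event in the DLQ."""
    amount = fare_cents(event["duration"])
    last_error = ""
    for _ in range(max_attempts):
        try:
            charge(event["user_id"], amount)
            return "paid"
        except PaymentError as exc:
            last_error = str(exc)
    dead_letters.append({"event": event, "amount": amount, "error": last_error})
    return "dead_lettered"


def always_down(user_id, amount):
    raise PaymentError("gateway timeout")


dlq = []
event = {"ride_id": "r1", "user_id": "u1", "duration": 600, "distance": 2100}
print(process_ride_ended(event, always_down, dlq), len(dlq))  # dead_lettered 1
```

Because the ride has already ended by the time this runs, a payment outage delays the charge but never blocks the user from locking the bike.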
Infrastructure (Optional)
Observability:
Prometheus: Track ride start failure rates and MQTT connection counts.
Grafana: Dashboard for bike distribution/heatmaps.
Platform Security: Secrets managed in HashiCorp Vault (e.g., Payment Gateway API keys).
Wrap Up
Advanced Topics
Trade-offs (Consistency vs. Availability): We choose AP for bike discovery (showing a bike that just became unavailable is okay; the app will handle the error on unlock) and CP for the ride state machine (guaranteeing correct billing).
Bottleneck Analysis: The IoT Gateway is the most critical stateful component. We use "Sticky Sessions" at the Load Balancer level based on bike_id to maintain MQTT sessions.
Optimization (Battery Life): To save bike battery, the heartbeat frequency is reduced to every 15 minutes if the bike is stationary and battery is < 20%.
Security: "Ghost Bikes" (spoofed GPS) are mitigated by validating each reported GPS coordinate against the last known location and rejecting updates that imply an implausible velocity.
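A minimal sketch of the velocity check, using the haversine formula for the distance between consecutive fixes. The 15 m/s speed ceiling (about 54 km/h) and the function names are assumptions; a production check would also need to tolerate GPS jitter and bikes moved by rebalancing trucks.

```python
import math

EARTH_RADIUS_M = 6_371_000
MAX_SPEED_MPS = 15.0  # assumption: roughly 54 km/h upper bound for a bicycle


def haversine_m(lat1, lng1, lat2, lng2):
    """Great-circle distance in meters between two lat/lng points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lng2 - lng1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


def is_plausible(prev, curr, max_speed_mps=MAX_SPEED_MPS):
    """prev and curr are (lat, lng, unix_ts) tuples from consecutive heartbeats."""
    dt = curr[2] - prev[2]
    if dt <= 0:
        return False  # out-of-order or duplicate timestamp
    dist = haversine_m(prev[0], prev[1], curr[0], curr[1])
    return dist / dt <= max_speed_mps


last_fix = (40.0, -74.0, 0)
print(is_plausible(last_fix, (40.001, -74.0, 30)))  # True: about 111 m in 30 s
print(is_plausible(last_fix, (41.0, -74.0, 30)))    # False: about 111 km in 30 s
```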