The Question
Design a Scalable Hotel Reservation System
Design a high-concurrency hotel booking platform that supports searching for room availability across millions of properties and ensures that no room can be double-booked, even during peak traffic periods. The system must handle real-time inventory updates, secure payment processing, and provide a seamless search experience with low latency.
PostgreSQL
Redis
SQS
JWT
mTLS
Questions & Insights
Clarifying Questions
Scale and Scope: What is the target scale in terms of the number of hotels, rooms, and daily active users (DAU)? (Assumed: 100k hotels, 10M rooms, 1M DAU).
Booking Patterns: Are we supporting "flash sales" or high-concurrency events for specific hotels? (Assumed: Yes, system must handle spikes).
Consistency Requirements: Is "overbooking" acceptable as a business strategy, or is strict "no double-booking" a hard constraint? (Assumed: Strict consistency; double-booking is prohibited).
Search Complexity: Do users search by city/date only, or by complex filters (amenities, price range, proximity)? (Assumed: Standard filters: Location, Date, Guest count).
Thinking Process
Core Bottleneck: The primary challenge is managing inventory concurrency (the "double-booking" problem) across millions of room-nights while maintaining high availability for the search path.
Key Progressive Questions:
How do we model inventory to allow fast queries for date ranges?
How do we guarantee atomic inventory decrements when booking attempts arrive simultaneously?
How do we decouple the intensive search traffic from the transactional booking traffic?
How do we handle distributed transactions involving external payment gateways?
Bonus Points
Inventory Sharding: Instead of sharding by HotelID alone, shard by HotelID + Month to prevent hot partitions during peak seasonal booking for popular resorts.
Optimistic Concurrency Control (OCC): Using version headers or "database constraints as locks" to minimize the duration of row locks in the RDBMS.
Availability Guard: Implementing a "Virtual Waiting Room" (Token Bucket at Gateway) for high-demand event bookings to prevent cascading failures.
TCC Pattern: Using Try-Confirm-Cancel for the booking-payment flow to ensure eventual consistency without long-lived database locks.
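The Try-Confirm-Cancel flow mentioned above can be sketched in a few lines. This is a minimal in-memory illustration, not a definitive implementation: the Inventory class, the book function, and the charge callback are hypothetical names, and a real system would persist holds in the database and expire them after a timeout.

```python
from dataclasses import dataclass


@dataclass
class Inventory:
    """Room counts for one room type on one night (illustrative only)."""
    total: int
    held: int = 0        # rooms reserved by the Try phase, awaiting payment
    confirmed: int = 0   # rooms whose payment succeeded

    def try_hold(self) -> bool:
        # Try: reserve a room tentatively; refuse if nothing is left.
        if self.held + self.confirmed >= self.total:
            return False
        self.held += 1
        return True

    def confirm(self) -> None:
        # Confirm: turn the tentative hold into a firm reservation.
        self.held -= 1
        self.confirmed += 1

    def cancel(self) -> None:
        # Cancel: release the hold so another guest can book the room.
        self.held -= 1


def book(inventory: Inventory, charge) -> str:
    """Run Try, then Confirm or Cancel depending on the payment outcome.

    `charge` is an assumed callback that returns True when payment succeeds.
    """
    if not inventory.try_hold():
        return "Rejected"
    try:
        paid = charge()
    except Exception:
        paid = False  # treat a gateway error as a failed payment in this sketch
    if paid:
        inventory.confirm()
        return "Confirmed"
    inventory.cancel()
    return "Cancelled"
```

No database lock is held while the payment call is in flight; only the short Try and Confirm/Cancel steps touch inventory.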
Design Breakdown
Functional Requirements
Users can search for available hotels based on location, date range, and number of guests.
Users can view hotel details, room types, and pricing.
Users can book a room and receive a confirmation.
Users can cancel or modify a reservation.
Hotel managers can update room inventory and pricing.
Non-Functional Requirements
Consistency: Strong consistency for the booking process (No double-booking).
High Availability: 99.99% for the search and browse experience.
Low Latency: Search results should return in < 500ms; booking confirmation in < 2s.
Scalability: Support 10,000+ Transactions Per Second (TPS) during peak periods.
Estimation
Inventory Rows: 100k hotels × 10 room types × 730 days (2-year window) ≈ 730M rows.
Search Queries: 1M DAU × 10 searches/day ≈ 116 Queries Per Second (QPS) average; 2k peak QPS.
Booking Queries: 1M DAU × 0.1 bookings/day ≈ 1 TPS average; 100 peak TPS.
Storage: 730M inventory rows × 100 bytes ≈ 73 GB (Fits easily in modern RDBMS/SSD).
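The arithmetic behind these estimates can be reproduced with a short script. The inputs are the assumed figures stated above; the variable names are illustrative.

```python
# Back-of-envelope check of the estimates, using the assumed inputs.
SECONDS_PER_DAY = 24 * 60 * 60

hotels = 100_000
room_types_per_hotel = 10
days_in_window = 730              # 2-year booking window
dau = 1_000_000
searches_per_user_per_day = 10
bookings_per_user_per_day = 0.1
bytes_per_inventory_row = 100

# One inventory row per (hotel, room type, date).
inventory_rows = hotels * room_types_per_hotel * days_in_window

# Average rates: daily volume spread evenly over the day.
avg_search_qps = dau * searches_per_user_per_day / SECONDS_PER_DAY
avg_booking_tps = dau * bookings_per_user_per_day / SECONDS_PER_DAY

# Decimal gigabytes (1 GB = 10^9 bytes).
storage_gb = inventory_rows * bytes_per_inventory_row / 1e9

print(f"inventory rows:    {inventory_rows:,}")
print(f"avg search QPS:    {avg_search_qps:.1f}")
print(f"avg booking TPS:   {avg_booking_tps:.2f}")
print(f"inventory storage: {storage_gb:.0f} GB")
```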
Blueprint
Concise Summary: A microservices architecture separating the high-volume Search path from the ACID-compliant Booking path, utilizing a Relational Database for inventory integrity and Redis for rapid availability lookups.
Major Components:
API Gateway: Handles rate limiting, authentication, and request routing to downstream services.
Search Service: High-performance read-only service that queries cached inventory and hotel metadata.
Booking Service: Orchestrates the reservation lifecycle and ensures transactional integrity.
Inventory Service: Manages room counts per day/room-type using strict SQL transactions.
Payment Service: Integration point for external providers (Stripe/PayPal) with idempotency logic.
Simplicity Audit: This design avoids complex distributed lock managers or NoSQL solutions for inventory, relying on the proven ACID capabilities of RDBMS to solve the hardest part of the problem (concurrency).
Architecture Decision Rationale:
Why this architecture?: Separating Search from Booking prevents heavy read traffic from impacting the database performance required for writes.
Functional Satisfaction: Covers end-to-end user flow from discovery to payment.
Non-functional Satisfaction: Relational DBs provide the required isolation levels for double-booking prevention, while Redis ensures the search experience remains snappy.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling: Stateless microservices deployed in Multi-AZ clusters. Scaling is driven by QPS for Search and CPU/Connection count for Booking.
API Schema Design:
GET /v1/search: Query params (lat, lon, checkin, checkout). Returns list of HotelIDs and Prices.
POST /v1/bookings: Request (RoomTypeID, Dates, PaymentToken). Returns booking_id. Idempotency-Key header required.
Resilience & Reliability: Circuit breakers on the Payment Service integration. If Payment is down, the Booking Service places the booking in a "Pending" state rather than failing the request.
Observability: RED metrics (Rate, Error, Duration) per service. Distributed tracing (OpenTelemetry) to track a booking from gateway to DB.
Security: JWT-based AuthN at Gateway. mTLS between internal services.
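To illustrate why POST /v1/bookings requires an Idempotency-Key, here is a minimal sketch of how a handler might deduplicate retries. It is an assumption-laden illustration: the BookingApi class and its in-memory dictionary are hypothetical, and a real deployment would keep the keys in a store shared by all service instances.

```python
import hashlib
import json
import uuid


class BookingApi:
    """In-memory stand-in for the POST /v1/bookings handler (hypothetical)."""

    def __init__(self):
        # Idempotency-Key -> (request fingerprint, stored response)
        self._seen = {}

    def create_booking(self, idempotency_key, room_type_id, dates, payment_token):
        body = {
            "room_type_id": room_type_id,
            "dates": dates,
            "payment_token": payment_token,
        }
        # Fingerprint the request so a reused key with a different body is caught.
        fingerprint = hashlib.sha256(
            json.dumps(body, sort_keys=True).encode("utf-8")
        ).hexdigest()

        if idempotency_key in self._seen:
            seen_fingerprint, response = self._seen[idempotency_key]
            if seen_fingerprint != fingerprint:
                raise ValueError("Idempotency-Key reused with a different request body")
            # A retry of the same request: replay the original response.
            return response

        # First time this key is seen: create the booking and remember the response.
        response = {"booking_id": str(uuid.uuid4()), "status": "Pending"}
        self._seen[idempotency_key] = (fingerprint, response)
        return response
```

A client that times out and retries with the same key gets the original booking_id back instead of creating a second reservation.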
Storage
Access Pattern: Read-heavy for metadata and search; Write-heavy for inventory updates during peak booking times.
Database Table Design:
Inventory Table: hotel_id (PK), room_type_id (PK), date (PK), total_count, reserved_count.
Booking Table: booking_id (PK), user_id, room_type_id, status (Pending, Confirmed, Cancelled), total_price.
Technical Selection: PostgreSQL. Rationale: Superior support for row-level locking and serializable transactions, which are critical for inventory management.
Distribution Logic: Sharded by HotelID. Since most queries are hotel-centric, this keeps related data on the same shard and avoids cross-shard joins.
Reliability & Recovery: Daily snapshots to S3; Write-Ahead Logs (WAL) streamed to a standby replica for < 1 min RPO.
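One way to prevent double-booking with the Inventory table is a conditional UPDATE that only succeeds while rooms remain, wrapped in a transaction covering every night of the stay. The sketch below uses Python's built-in sqlite3 module purely so it runs anywhere; it is a stand-in for PostgreSQL, and the make_db and reserve helpers are hypothetical names.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory inventory table mirroring the columns listed above."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE inventory (
            hotel_id       INTEGER NOT NULL,
            room_type_id   INTEGER NOT NULL,
            date           TEXT    NOT NULL,
            total_count    INTEGER NOT NULL,
            reserved_count INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (hotel_id, room_type_id, date),
            CHECK (reserved_count <= total_count)  -- constraint as a last line of defence
        )
        """
    )
    return conn


class SoldOut(Exception):
    """Raised internally when a night in the requested range has no rooms left."""


def reserve(conn: sqlite3.Connection, hotel_id: int, room_type_id: int, dates) -> bool:
    """Reserve one room for every date, or nothing at all."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for day in dates:
                cur = conn.execute(
                    "UPDATE inventory SET reserved_count = reserved_count + 1 "
                    "WHERE hotel_id = ? AND room_type_id = ? AND date = ? "
                    "AND reserved_count < total_count",
                    (hotel_id, room_type_id, day),
                )
                if cur.rowcount != 1:
                    # No row matched: the night is sold out (or not on sale).
                    raise SoldOut(day)
        return True
    except SoldOut:
        return False
```

Because the availability check and the increment happen in the same statement, two concurrent requests cannot both take the last room; the loser sees zero rows updated and the whole stay is rolled back.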
Cache
Purpose & Justification: Search acceleration. We cannot query the main Inventory DB for every "Search" request due to read amplification.
Key-Value Schema:
Key: avail:{hotel_id}:{month}. Value: Bitmask of available dates, or JSON mapping of dates to room counts.
TTL: 5 minutes.
Technical Selection: Redis.
Failure Handling: If Redis is down, Search Service falls back to a Read Replica of the Inventory DB (Graceful degradation with higher latency).
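The bitmask variant of the cache value can be sketched as follows. This is a hedged illustration: a plain dictionary stands in for Redis, the helper names are hypothetical, and the YYYY-MM rendering of {month} is an assumption since the key format does not specify it.

```python
from datetime import date, timedelta


def cache_key(hotel_id: int, day: date) -> str:
    # Assumed rendering of avail:{hotel_id}:{month} with month as YYYY-MM.
    return f"avail:{hotel_id}:{day:%Y-%m}"


def encode_month(rooms_left_by_day: dict[int, int]) -> int:
    """Pack one month into an int: bit (d - 1) is set if day d has a room left."""
    mask = 0
    for day_of_month, rooms_left in rooms_left_by_day.items():
        if rooms_left > 0:
            mask |= 1 << (day_of_month - 1)
    return mask


def is_available(cache: dict[str, int], hotel_id: int, checkin: date, checkout: date) -> bool:
    """Check every night in [checkin, checkout) against the cached bitmasks."""
    day = checkin
    while day < checkout:  # the checkout day itself is not a stayed night
        mask = cache.get(cache_key(hotel_id, day))
        if mask is None:
            # Cache miss: this sketch reports unavailable; a real service
            # would fall back to the read replica as described above.
            return False
        if not (mask >> (day.day - 1)) & 1:
            return False
        day += timedelta(days=1)
    return True
```

A stay that crosses a month boundary simply reads two keys, one per month.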
Messaging
Purpose & Decoupling: Asynchronous processing of post-booking tasks (Email confirmation, Loyalty point updates).
Event / Topic Schema: booking.confirmed, booking.cancelled. Payload: booking_id, user_id, timestamp.
Throughput & Partitioning: Partitioned by user_id to ensure notification ordering.
Technical Selection: AWS SQS (for MVP simplicity) or Kafka (if high-volume event sourcing is needed later).
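As a rough sketch of the event schema and the user_id partitioning described above, the snippet below builds a payload and maps a user to a partition with a stable hash. The function names are hypothetical and the hash is illustrative only; it is not the partitioner Kafka uses by default. With an SQS FIFO queue, the analogous choice would be to use user_id as the message group ID.

```python
import hashlib
import json
from datetime import datetime, timezone

EVENT_TYPES = {"booking.confirmed", "booking.cancelled"}


def build_event(event_type: str, booking_id: str, user_id: str, now=None) -> dict:
    """Build the payload listed above: booking_id, user_id, timestamp."""
    if event_type not in EVENT_TYPES:
        raise ValueError(f"unknown event type: {event_type}")
    timestamp = (now or datetime.now(timezone.utc)).isoformat()
    return {
        "type": event_type,
        "booking_id": booking_id,
        "user_id": user_id,
        "timestamp": timestamp,
    }


def partition_for(user_id: str, num_partitions: int) -> int:
    """Map a user to a partition with a hash that is stable across processes."""
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


event = build_event("booking.confirmed", "b-123", "u-456")
print(json.dumps(event), "-> partition", partition_for(event["user_id"], 8))
```

Since every event for a given user lands on the same partition, a cancellation cannot overtake the confirmation that preceded it.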
Wrap Up
Advanced Topics
Monitoring: Focus on "Inventory Health" (discrepancies between DB and Cache) and "Payment Latency".
Trade-offs: We chose Strong Consistency for booking at the cost of higher latency during DB writes. Search is Eventually Consistent (Cache might be slightly out of date).
Bottlenecks: The Inventory DB row for a popular hotel on a popular date is a contention point.
Optimization: Use "Over-provisioning" (an internal buffer) so that the Cache can serve a 100% accurate "Sold Out" status while tolerating a small margin of error for "Available" status, which reduces DB pressure.