The Question
Hotel Reservation System Design
Design a globally scalable hotel booking platform that allows users to search for hotels by availability and book rooms securely. The system must guarantee that no two users can book the same room for the same date simultaneously, while maintaining high performance for millions of concurrent search queries.
PostgreSQL
Elasticsearch
Redis
Kafka
CDC
Stripe
JWT
mTLS
Questions & Insights
Clarifying Questions
What is the scale of the system? (Assumption: 10M DAU, 1M hotels worldwide, peak booking traffic of 5,000 TPS).
How do we handle overbooking? (Assumption: Strict consistency is required; we must never sell the same room twice. No overbooking allowed for the MVP).
What are the primary search criteria? (Assumption: Search by location, date range, and occupancy).
Are payments handled internally or externally? (Assumption: Integration with a 3rd party provider like Stripe/Braintree).
Thinking Process
Core Bottleneck: The primary challenge is the "Double Booking" problem under high concurrency.
Inventory Management: How do we track room availability across dates efficiently without locking the entire database?
Search Performance: Searching across millions of rooms with dynamic availability and pricing.
Progressive Architecture Flow:
How do we ensure a user can find an available room quickly? (Search Indexing).
How do we prevent two people from booking the last room simultaneously? (ACID Transactions/Row-level locking).
How do we handle high-volume read traffic for availability? (Distributed Caching).
How do we ensure the system stays consistent if a payment fails or the user abandons the checkout? (Saga Pattern/Reservation Timeouts).
Bonus Points
CDC (Change Data Capture): Using Debezium to stream inventory updates from the relational DB to ElasticSearch to ensure search results are near real-time without polling.
Optimistic Locking with Versioning: Using version numbers for inventory rows to minimize database lock contention during high-traffic windows.
Hot-Shard Mitigation: Implementing "Virtual Shards" or sub-partitioning for popular hotels in major cities (e.g., Las Vegas during CES) to prevent database hotspots.
Idempotency Keys: Implementation of client-generated request IDs to ensure that network retries don't result in duplicate charges or bookings.
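The idempotency-key idea can be sketched as follows. This is a minimal in-memory illustration, not the design's actual implementation: the class name BookingService, the booking ID format, and the dictionary store are all hypothetical, and a real service would presumably persist keys alongside the booking row.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class BookingService:
    """Toy in-memory service (hypothetical); keys would normally be persisted."""

    # Maps idempotency key -> booking id returned for the first request.
    _seen: dict = field(default_factory=dict)
    bookings_created: int = 0

    def create_booking(self, idempotency_key: str, hotel_id: int, room_type_id: int) -> str:
        # A retry with the same key returns the stored result instead of
        # creating (and charging for) a second booking.
        if idempotency_key in self._seen:
            return self._seen[idempotency_key]
        self.bookings_created += 1
        booking_id = f"bk-{self.bookings_created}"
        self._seen[idempotency_key] = booking_id
        return booking_id


service = BookingService()
key = str(uuid.uuid4())  # generated once by the client, reused on retries
first = service.create_booking(key, hotel_id=42, room_type_id=1)
retry = service.create_booking(key, hotel_id=42, room_type_id=1)  # network retry
```

Both calls return the same booking ID and only one booking is created; a request with a fresh key would create a new one.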
Design Breakdown
Functional Requirements
Users can search for hotels by location and date range.
Users can view hotel details and specific room availability/pricing.
Users can book a room (Reserve -> Pay -> Confirm).
Users can view/cancel their bookings.
Hotel managers can update room inventory and pricing.
Non-Functional Requirements
Strong Consistency: For the booking process (No double booking).
High Availability: For the search and discovery flow.
Scalability: Ability to handle seasonal spikes (e.g., New Year's Eve).
Low Latency: Search results should return in < 500ms.
Estimation
DAU: 10 million.
Search-to-Book Ratio: 20:1.
Search QPS: (10M * 20 searches) / 86400 seconds ≈ 2,300 QPS (Average). Peak: ~10,000 QPS.
Booking QPS: (10M * 1) / 86400 ≈ 115 QPS. Peak: ~1,000 QPS.
Storage: 1M hotels × 10 room types × 365 days ≈ 3.65B availability rows/year. Assuming roughly 100 bytes per row, that is about 365 GB of inventory storage per year.
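The average QPS figures above can be reproduced with a few lines of arithmetic. The variable names are illustrative; the inputs are the assumptions stated in this section (10M DAU, 20 searches and 1 booking per user per day).

```python
SECONDS_PER_DAY = 86_400
dau = 10_000_000
searches_per_user_per_day = 20  # from the 20:1 search-to-book ratio
bookings_per_user_per_day = 1

avg_search_qps = dau * searches_per_user_per_day / SECONDS_PER_DAY
avg_booking_qps = dau * bookings_per_user_per_day / SECONDS_PER_DAY

print(f"average search QPS:  {avg_search_qps:,.0f}")   # 2,315
print(f"average booking QPS: {avg_booking_qps:,.0f}")  # 116
```

The peak figures are then a multiplier on these averages, chosen to cover seasonal spikes.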
Blueprint
Concise Summary: A microservices-based architecture utilizing a relational database for transactional integrity and a search engine for high-performance discovery.
Major Components:
Search Service: Handles location-based queries and filters availability using ElasticSearch.
Booking Service: Manages the reservation lifecycle and ensures transactional consistency.
Inventory Service: Maintains the source of truth for room counts per day/type.
Payment Worker: An asynchronous worker to handle 3rd party payment processing.
Simplicity Audit: This design avoids complex distributed locking (like ZooKeeper/Redlock) by leveraging RDBMS row-level locking, which is sufficient for the MVP's scale and prevents double booking as long as every inventory write goes through the database transaction.
Architecture Decision Rationale:
Why this architecture?: Separating Search from Booking allows the system to scale the high-traffic search flow independently of the mission-critical booking flow.
Functional Satisfaction: Covers end-to-end user journey from discovery to payment.
Non-functional Satisfaction: ElasticSearch provides sub-second search; RDBMS provides the ACID guarantees needed for inventory.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling:
Stateless services deployed in Multi-AZ clusters (Kubernetes).
Scaling signals: CPU and Request Latency for Search; Queue Depth for Payment Workers.
Load Balancing: L7 (Envoy/NGINX) for path-based routing (/search, /book).
API Schema Design:
GET /v1/search: Query params (lat, lon, checkin, checkout). Returns list of HotelIDs.
POST /v1/reservations: Request (RoomTypeID, HotelID, Dates). Returns ReservationID + IdempotencyKey.
POST /v1/payments: Request (ReservationID, PaymentToken).
Idempotency: All POST requests require an X-Idempotency-Key header.
Resilience & Reliability:
Circuit Breakers on the Payment Worker to prevent hammering the 3rd party API during outages.
10-minute TTL on "Reserved" status; if payment isn't confirmed, the Inventory Service releases the room.
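The 10-minute hold could be enforced by a periodic sweep along these lines. This is a sketch under assumptions: the Reservation fields, the "Expired" status name, and the release_expired helper are hypothetical, and a real sweep would update the Inventory DB in a transaction rather than a local counter.

```python
from dataclasses import dataclass

RESERVATION_TTL_SECONDS = 10 * 60  # the 10-minute hold described above


@dataclass
class Reservation:
    reservation_id: str
    status: str        # "Reserved", "Confirmed", or "Expired" (hypothetical names)
    created_at: float  # epoch seconds


def release_expired(reservations, available_rooms: int, now: float) -> int:
    """Expire unpaid holds older than the TTL and return the new room count."""
    for r in reservations:
        if r.status == "Reserved" and now - r.created_at >= RESERVATION_TTL_SECONDS:
            r.status = "Expired"
            available_rooms += 1  # give the room back to inventory
    return available_rooms


holds = [
    Reservation("r1", "Reserved", created_at=0.0),    # unpaid and past the TTL
    Reservation("r2", "Reserved", created_at=500.0),  # unpaid, still in the window
    Reservation("r3", "Confirmed", created_at=0.0),   # paid, never released
]
rooms_left = release_expired(holds, available_rooms=0, now=601.0)
```

Only the first hold is released here, so one room returns to inventory while the in-window and confirmed reservations are untouched.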
Security:
JWT-based AuthN via API Gateway.
mTLS between internal microservices.
Storage
Access Pattern:
Search: High read, geo-spatial queries.
Booking: High write, transaction-heavy, consistency-critical.
Database Table Design:
Inventory Table:
hotel_id (PK), room_type_id (PK), date (PK), total_rooms, available_rooms, version.
Booking Table:
booking_id (PK), user_id, hotel_id, room_type_id, status (Pending/Confirmed/Cancelled), total_price.
Technical Selection:
PostgreSQL: For Inventory/Booking due to robust ACID support and Row-Level Locking.
ElasticSearch: For Hotel metadata and availability search.
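To illustrate how the version column on the Inventory table guards against double booking, here is a minimal sketch of a conditional update. It uses Python's built-in sqlite3 as a stand-in so it runs anywhere; the design itself targets PostgreSQL, and the reserve helper and sample row are hypothetical. A multi-night stay would presumably update one row per date inside a single transaction.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE inventory (
        hotel_id INTEGER, room_type_id INTEGER, date TEXT,
        total_rooms INTEGER, available_rooms INTEGER, version INTEGER,
        PRIMARY KEY (hotel_id, room_type_id, date)
    )""")
conn.execute("INSERT INTO inventory VALUES (1, 10, '2025-01-01', 5, 1, 0)")
conn.commit()


def reserve(conn, hotel_id, room_type_id, date, seen_version) -> bool:
    """Decrement availability only if the row is unchanged since it was read."""
    cur = conn.execute(
        """UPDATE inventory
              SET available_rooms = available_rooms - 1, version = version + 1
            WHERE hotel_id = ? AND room_type_id = ? AND date = ?
              AND version = ? AND available_rooms > 0""",
        (hotel_id, room_type_id, date, seen_version),
    )
    conn.commit()
    return cur.rowcount == 1  # 0 rows updated means a concurrent writer won


# Two users both read version 0 and try to take the last room.
user_a = reserve(conn, 1, 10, "2025-01-01", seen_version=0)
user_b = reserve(conn, 1, 10, "2025-01-01", seen_version=0)
remaining = conn.execute("SELECT available_rooms FROM inventory").fetchone()[0]
```

The first update succeeds and bumps the version; the second matches no row and fails, which the service could surface as a 409 Conflict.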
Distribution Logic:
Sharding by hotel_id for the Inventory DB to ensure all room types for a single hotel live on the same shard.
Cache
Purpose & Justification: Reduce load on Inventory DB for high-frequency "is this room available" checks during the search/view hotel phase.
Key-Value Schema:
avail:{hotel_id}:{date} -> JSON map of room_type to count.
Technical Selection: Redis (Cluster mode).
Failure Handling: Cache-aside pattern. If Redis is down, query the DB directly (with strict rate limiting to prevent DB collapse).
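The cache-aside read path with a database fallback might look like the sketch below. FakeCache and FakeInventoryDB are hypothetical in-memory stand-ins (no Redis client is used), and the rate limiting on the degraded path is omitted for brevity.

```python
import json


class CacheDown(Exception):
    """Raised by the stand-in cache when it is unreachable."""


class FakeCache:
    """In-memory stand-in for Redis so the sketch runs without a server."""

    def __init__(self):
        self.data = {}
        self.up = True

    def get(self, key):
        if not self.up:
            raise CacheDown()
        return self.data.get(key)

    def set(self, key, value):
        if not self.up:
            raise CacheDown()
        self.data[key] = value


class FakeInventoryDB:
    """Stand-in for the Inventory DB; counts reads to show cache hits."""

    def __init__(self):
        self.rows = {("h1", "2025-01-01"): {"king": 3, "twin": 0}}
        self.reads = 0

    def availability(self, hotel_id, date):
        self.reads += 1
        return self.rows[(hotel_id, date)]


def get_availability(cache, db, hotel_id, date):
    key = f"avail:{hotel_id}:{date}"
    try:
        cached = cache.get(key)
    except CacheDown:
        # Degraded path: go straight to the DB (rate limiting omitted here).
        return db.availability(hotel_id, date)
    if cached is not None:
        return json.loads(cached)
    value = db.availability(hotel_id, date)  # cache miss
    try:
        cache.set(key, json.dumps(value))
    except CacheDown:
        pass  # cache went down between get and set; still serve the value
    return value


cache, db = FakeCache(), FakeInventoryDB()
first = get_availability(cache, db, "h1", "2025-01-01")   # miss, reads DB
second = get_availability(cache, db, "h1", "2025-01-01")  # hit, no DB read
cache.up = False
third = get_availability(cache, db, "h1", "2025-01-01")   # cache down, reads DB
```

Three lookups cost two database reads here: one for the initial miss and one while the cache is down.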
Messaging
Purpose & Decoupling: Decouple the booking transaction from the slow/unreliable payment process and post-booking tasks (emails).
Event / Topic Schema:
payment-requests, booking-confirmed.
Throughput & Partitioning: Partitioned by booking_id to ensure sequential processing of updates for a single reservation.
Technical Selection: Kafka. Used for high-throughput event streaming and the ability to replay events if the Payment Consumer fails.
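The reason keying by booking_id preserves per-reservation ordering can be shown with a small sketch. The partition count and the CRC32 hash are assumptions chosen for illustration; they stand in for whatever hash the producer client actually applies to the message key.

```python
import zlib

NUM_PARTITIONS = 12  # hypothetical partition count


def partition_for(booking_id: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a key to a partition; the same key always yields the same partition."""
    return zlib.crc32(booking_id.encode("utf-8")) % num_partitions


events = [
    ("bk-1", "payment-requested"),
    ("bk-2", "payment-requested"),
    ("bk-1", "booking-confirmed"),
]

# Append each event to its partition's log, in arrival order.
partitions = {}
for booking_id, event in events:
    partitions.setdefault(partition_for(booking_id), []).append((booking_id, event))
```

Because both bk-1 events land in the same partition log, a single consumer sees them in the order they were produced.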
Data Processing
Processing Model: CDC (Change Data Capture) using Debezium.
Processing DAG:
Inventory DB -> Kafka Connect -> Transformation Lambda -> ElasticSearch.
Correctness Guarantees: Exactly-once delivery to ElasticSearch isn't strictly necessary (search results can be slightly stale), but versioning ensures we don't overwrite newer data with older updates.
Technical Selection: Kafka Connect. High reliability and low maintenance for moving data between DB and ES.
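The versioning guard on the indexing side can be sketched as a compare-before-write step. Here a plain dictionary stands in for the search index, and the event shape (doc_id, available_rooms, version) is a hypothetical simplification of a CDC change record.

```python
def apply_change(index: dict, event: dict) -> bool:
    """Write the event only if it is newer than what the index already holds."""
    current = index.get(event["doc_id"])
    if current is not None and current["version"] >= event["version"]:
        return False  # stale or duplicate delivery, safe to drop
    index[event["doc_id"]] = {
        "available_rooms": event["available_rooms"],
        "version": event["version"],
    }
    return True


index = {}
events = [
    {"doc_id": "1:10:2025-01-01", "available_rooms": 4, "version": 2},
    {"doc_id": "1:10:2025-01-01", "available_rooms": 5, "version": 1},  # late arrival
    {"doc_id": "1:10:2025-01-01", "available_rooms": 4, "version": 2},  # redelivery
]
results = [apply_change(index, e) for e in events]
```

Only the first event is applied; the out-of-order and redelivered events are dropped, which is why at-least-once delivery is enough for this pipeline.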
Wrap Up
Advanced Topics
Monitoring: Prometheus for metrics (409 Conflict rates for overbooking attempts); ELK for logs.
Trade-offs: We chose Strong Consistency over Availability for the booking flow (CP in CAP). If the Inventory DB is down, bookings stop. This is a business requirement.
Bottlenecks: The Inventory DB is the write bottleneck. Optimization involves using a "Buffer" approach for high-volume hotels (pre-allocating blocks of rooms to cache).
Failure Handling: If the Payment Worker fails after taking money but before updating the DB, a reconciliation job runs every 5 minutes to sync payment provider logs with the Booking DB.
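A reconciliation pass of the kind described might be sketched as follows. The data shapes are hypothetical: provider_charges is a simplified booking-to-status map rather than any real payment provider's API response, and bookings is an in-memory stand-in for the Booking DB.

```python
def reconcile(provider_charges: dict, bookings: dict) -> list:
    """Confirm bookings that were paid at the provider but are still Pending locally."""
    repaired = []
    for booking_id, charge_status in provider_charges.items():
        booking = bookings.get(booking_id)
        if booking is None:
            continue  # charge with no local booking; would need manual review
        if charge_status == "succeeded" and booking["status"] == "Pending":
            booking["status"] = "Confirmed"
            repaired.append(booking_id)
    return repaired


provider_charges = {"bk-1": "succeeded", "bk-2": "failed", "bk-3": "succeeded"}
bookings = {
    "bk-1": {"status": "Pending"},    # worker crashed after the charge
    "bk-2": {"status": "Pending"},    # payment failed, stays Pending
    "bk-3": {"status": "Confirmed"},  # already consistent
}
repaired = reconcile(provider_charges, bookings)
```

Only bk-1 is repaired in this run; running the job again changes nothing, so the 5-minute schedule is safe to repeat.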