The Question
Design: Hotel Reservation System
Design a high-scale hotel reservation system similar to Booking.com or Expedia. The system must support global hotel search with real-time availability filters, handle high-concurrency booking transactions without overbooking, and manage a massive inventory of room-nights across different time zones. Focus on the data consistency model for reservations, the search indexing strategy, and how to handle the high read-to-write ratio typical of travel platforms. Provide estimations for a system supporting 20 million daily active users and discuss the trade-offs between availability and consistency in the booking flow.
PostgreSQL
Elasticsearch
Redis
Kafka
Debezium
Kubernetes
CDN
JWT
Stripe
Questions & Insights
Clarifying Questions
Scale and Traffic: What is the expected scale in terms of Daily Active Users (DAU) and total hotel listings?
Assumption: 20M DAU, 2M hotels, and 100M rooms globally.
Search Requirements: Does the search need to be real-time based on availability, or can availability be checked at the final booking step?
Assumption: Search results must reflect "near real-time" availability (updated within seconds) to prevent poor UX.
Booking Flow: Should the system support partial payments or "Pay at Hotel" options?
Assumption: MVP will support immediate full payment via a 3rd party provider (Stripe/PayPal) and "Pay at Hotel" as a secondary option.
Consistency vs. Availability: In the "last room" scenario, do we prefer showing a room as available when it might be gone (Availability) or failing the search to ensure accuracy (Consistency)?
Assumption: High availability for search/browsing, but strict consistency (ACID) for the actual booking transaction.
Thinking Process
Core Bottleneck: The primary challenge is the "Double Booking" problem under high concurrency for the same room-night, combined with high read-volume for search.
Strategy:
Use a relational database with write-ahead logging and row-level locking for the booking transaction to guarantee ACID properties.
Use a Geospatial Search Engine (Elasticsearch) to handle location-based queries and attribute filtering.
Implement an Inventory Management Service that tracks room counts per day, using a specialized schema to handle date-range queries efficiently.
Decouple the booking flow from notifications and analytics using a Message Queue.
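The per-day room counts in the strategy above imply that a stay is available only if every night in the range has a free room. A minimal sketch of that check, assuming an in-memory dict as a stand-in for the inventory store; the function names and the dict layout are hypothetical, not part of the original design:

```python
from datetime import date, timedelta


def stay_nights(checkin, checkout):
    """Nights consumed by a stay; the checkout date itself is not occupied."""
    return [checkin + timedelta(days=i) for i in range((checkout - checkin).days)]


def rooms_available(inventory, hotel_id, room_type_id, checkin, checkout):
    """Bookable room count for the whole stay.

    `inventory` maps (hotel_id, room_type_id, date) -> available_rooms.
    The stay is limited by its scarcest night, so the answer is the minimum.
    A night with no inventory row counts as zero rooms.
    """
    nights = stay_nights(checkin, checkout)
    if not nights:
        return 0
    return min(inventory.get((hotel_id, room_type_id, n), 0) for n in nights)
```

A three-night stay with 3, 1, and 2 rooms free on successive nights can therefore sell only one room.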
Bonus Points
Inventory Pre-allocation: Discussing the use of a "Reservation Status" (Pending/Confirmed/Expired) to hold a room for 10 minutes during the checkout process without committing a permanent write immediately.
Optimistic Concurrency Control: Using version numbers in the inventory table to handle high-frequency updates without heavy pessimistic locking where possible.
CDC (Change Data Capture): Using Debezium/Kafka to sync the relational database (source of truth) with Elasticsearch (search index) to ensure eventual consistency without dual-write anomalies.
Data Sharding Strategy: Sharding the Booking and Inventory databases by hotel_id to ensure that transactions for a specific hotel are localized to a single database shard.
Design Breakdown
Functional Requirements
Core Use Cases:
Users can search for hotels by location, date range, and number of guests.
Users can view hotel details, amenities, and real-time room availability.
Users can reserve a room and receive a confirmation.
Users can view or cancel their existing reservations.
Hotel managers can update room prices and inventory.
Scope Control:
In-scope: Search, Booking, Inventory management, Payment integration.
Out-of-scope: Hotel reviews/ratings (MVP), Loyalty programs, Flight/Car rental integrations.
Non-Functional Requirements
Scale: Support 20M DAU; peak QPS of ~5k for search and ~100 for bookings (derived in the Estimation section).
Latency: Search results returned < 500ms; Booking confirmation < 2s.
Availability & Reliability: 99.99% availability for the search path; 99.999% for the booking path.
Consistency: Strong consistency for inventory and booking (no double bookings). Eventual consistency (~seconds) for search index updates.
Security & Privacy: PCI-DSS compliance for payment handling (offloaded to 3rd party); GDPR for user data.
Estimation
Traffic Estimation:
Search: 20M DAU * 5 searches/user = 100M searches/day. Avg QPS = 1,150. Peak QPS = ~5,000.
Booking: 2% of searches convert = 2M bookings/day. Avg QPS = 23. Peak QPS = ~100.
Storage Estimation:
Bookings: 2M bookings/day × 500 bytes/record × 365 days = ~365 GB/year.
Inventory: 2M hotels × 10 room types × 730 days (2-year window) = ~14.6B rows. This requires heavy sharding or compression.
Bandwidth Estimation:
Outgoing (Search results): 5k QPS * 10KB = 50MB/s (400Mbps).
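The estimates above can be reproduced with a few lines of arithmetic. This is only a worked check of the numbers already given; it reads "2% conversion" as 2% of searches, and it assumes decimal units (1 KB = 1,000 bytes):

```python
SECONDS_PER_DAY = 86_400

# Traffic
dau = 20_000_000
searches_per_day = dau * 5                              # 100M searches/day
avg_search_qps = searches_per_day / SECONDS_PER_DAY     # about 1,157
bookings_per_day = searches_per_day * 2 // 100          # 2% of searches = 2M
avg_booking_qps = bookings_per_day / SECONDS_PER_DAY    # about 23

# Storage
booking_bytes_per_year = bookings_per_day * 500 * 365   # 365 GB
inventory_rows = 2_000_000 * 10 * 730                   # 14.6B rows

# Bandwidth at the stated peak of 5k search QPS, 10 KB per response
peak_search_qps = 5_000
bandwidth_bytes_per_s = peak_search_qps * 10_000        # 50 MB/s
bandwidth_mbps = bandwidth_bytes_per_s * 8 / 1_000_000  # 400 Mbps
```

The stated peaks of ~5,000 and ~100 QPS correspond to a peak-to-average factor of a little over 4.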
Blueprint
Concise Summary: A microservices-based architecture leveraging a sharded RDBMS for transactional integrity in bookings and Elasticsearch for high-performance geospatial search.
Major Components:
API Gateway: Central entry point for authentication, rate limiting, and request routing.
Search Service: Interfaces with Elasticsearch to provide filtered hotel listings.
Booking Service: Manages the lifecycle of a reservation (Pending, Paid, Confirmed).
Inventory Service: Tracks room availability per room type per day using a daily-record schema.
Payment Service: Wraps 3rd party payment gateways and manages transaction states.
Notification Service: Asynchronously handles emails/SMS via Kafka events.
Simplicity Audit: This design avoids complex distributed locks (like Redlock) by relying on RDBMS ACID properties within sharded instances, which is more robust for an MVP.
Architecture Decision Rationale:
Relational DB: Essential for the inventory and booking to prevent overbooking.
Elasticsearch: Necessary because relational databases struggle with complex geospatial + attribute (e.g., "WiFi AND Pool") filters at scale.
Kafka: Essential to ensure that if the Notification service is down, the user still gets their confirmation email eventually.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: Cloudflare/Akamai for caching hotel images and static UI assets.
Security & Perimeter:
API Gateway: Implements JWT-based AuthN/AuthZ.
Rate Limiting: 100 requests/min per IP for Search; 5 requests/min for Booking to prevent bot-scraping and inventory exhaustion attacks.
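The per-IP limits above could be enforced with a fixed-window counter. The sketch below is an in-process illustration only; the class name is hypothetical, and a real gateway would more likely keep the counters in a shared store so that all gateway instances see the same counts:

```python
import time


class FixedWindowLimiter:
    """Allow at most `limit` requests per key in each fixed time window."""

    def __init__(self, limit, window_seconds=60, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock          # injectable so the limiter can be tested
        self._state = {}            # key -> (window index, count in that window)

    def allow(self, key):
        window_index = int(self.clock() // self.window)
        index, count = self._state.get(key, (window_index, 0))
        if index != window_index:   # a new window has started: reset the count
            index, count = window_index, 0
        if count >= self.limit:
            self._state[key] = (index, count)
            return False
        self._state[key] = (index, count + 1)
        return True


# Limits taken from the text: 100/min for search, 5/min for booking.
search_limiter = FixedWindowLimiter(limit=100)
booking_limiter = FixedWindowLimiter(limit=5)
```

A fixed window permits short bursts at window boundaries; a sliding window or token bucket smooths that at the cost of more state.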
Service
Topology & Scaling: Stateless microservices deployed on K8s across multiple Availability Zones (AZs). Scaling is based on CPU and Request Count.
API Schema Design:
POST /v1/bookings: Creates a reservation. Header: Idempotency-Key.
GET /v1/hotels/search: Params: lat, lon, radius, checkin, checkout.
Resilience & Reliability:
Idempotency: Booking service uses a client-provided UUID to prevent duplicate charges.
Circuit Breaker: Used for the Payment Service to fail fast if Stripe is down.
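The idempotency behavior can be illustrated with a small sketch. It assumes an in-memory dict in place of a database table with a unique constraint on the key, and a counter in place of the real payment call; the class and field names are hypothetical:

```python
import uuid


class BookingService:
    """Toy booking handler that deduplicates on the Idempotency-Key header."""

    def __init__(self):
        self._by_key = {}   # idempotency key -> booking record already created
        self.charges = 0    # stand-in for calls to the payment provider

    def create_booking(self, idempotency_key, user_id, hotel_id, room_type_id):
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            # A retry of the same request: return the original result
            # instead of creating a second booking or charging again.
            return existing
        self.charges += 1
        booking = {
            "booking_id": str(uuid.uuid4()),
            "user_id": user_id,
            "hotel_id": hotel_id,
            "room_type_id": room_type_id,
            "status": "Pending",
        }
        self._by_key[idempotency_key] = booking
        return booking
```

In a real service the key lookup and the booking insert would need to happen in one database transaction, otherwise two concurrent retries could both pass the check.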
Storage
Access Pattern: Write-heavy for inventory updates (by hotel staff), Read-heavy for search, Transaction-critical for bookings.
Database Table Design:
Inventory Table:
hotel_id, room_type_id, date, total_rooms, available_rooms, version. Primary key: (hotel_id, room_type_id, date), since there is one row per room type per night.
Booking Table:
booking_id (PK), user_id, hotel_id, room_type_id, status (Pending/Confirmed/Cancelled), total_price.
Technical Selection: PostgreSQL with Citus or manual sharding by hotel_id.
Distribution Logic: Sharding by hotel_id ensures all inventory for a specific hotel resides on one node, allowing for local ACID transactions.
Cache
Purpose & Justification: Reduce load on the Inventory DB during the "View Room" flow.
Key-Value Schema:
inv:{hotel_id}:{date} -> available_count.
Failure Handling: Redis is used as a "look-aside" cache, so if Redis is down, reads fall back to the Inventory DB.
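The look-aside pattern with a database fallback might look like the following. To stay self-contained the sketch uses a tiny fake cache rather than a Redis client; `FakeCache`, `CacheUnavailable`, `get_available`, and the 30-second TTL are all assumptions made for illustration:

```python
class CacheUnavailable(Exception):
    """Stand-in for a connection error from the cache client."""


class FakeCache:
    """Minimal in-memory cache; `down` simulates an outage. TTL is ignored."""

    def __init__(self):
        self.data = {}
        self.down = False

    def get(self, key):
        if self.down:
            raise CacheUnavailable()
        return self.data.get(key)

    def set(self, key, value, ttl_seconds):
        if self.down:
            raise CacheUnavailable()
        self.data[key] = value


def get_available(cache, load_from_db, hotel_id, day, ttl_seconds=30):
    """Read availability through the cache, falling back to the database."""
    key = f"inv:{hotel_id}:{day}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return cached                      # cache hit
    except CacheUnavailable:
        return load_from_db(hotel_id, day)     # cache down: go straight to the DB
    value = load_from_db(hotel_id, day)        # cache miss: load, then populate
    try:
        cache.set(key, value, ttl_seconds)
    except CacheUnavailable:
        pass                                   # a failed populate is not fatal
    return value
```

A short TTL keeps the cached count close to the database value; the booking transaction itself should still read from the Inventory DB, never from the cache.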
Messaging
Purpose & Decoupling: Kafka decouples the critical booking path from side effects.
Event / Topic Schema:
BookingCreated, BookingConfirmed, BookingCancelled.
Failure Handling: Dead-letter queues (DLQ) for failed email deliveries.
Data Processing
Processing Model: A CDC pipeline (Debezium) reads the PostgreSQL write-ahead log (WAL) via logical decoding and streams updates through Kafka to Elasticsearch.
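Correctness Guarantees: The search index converges on the latest availability in the Inventory DB within seconds (eventual consistency). Debezium delivers events at least once, so index writes must be idempotent.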
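One way to keep the index correct when change events are redelivered or arrive late is to compare the row's version column before writing. The sketch below uses a plain dict as a stand-in for the search index and a simplified event shape (an `op` code and an `after` row image, loosely modeled on Debezium's envelope); the function name and document layout are hypothetical:

```python
UPSERT_OPS = ("c", "u", "r")  # create, update, snapshot read


def apply_change(index, event):
    """Apply one inventory change event; return True if the index changed."""
    if event["op"] not in UPSERT_OPS:
        return False  # deletes are out of scope for this sketch
    row = event["after"]
    doc_id = f'{row["hotel_id"]}:{row["room_type_id"]}:{row["date"]}'
    current = index.get(doc_id)
    if current is not None and current["version"] >= row["version"]:
        return False  # duplicate or stale event: keep the newer document
    index[doc_id] = {
        "available_rooms": row["available_rooms"],
        "version": row["version"],
    }
    return True
```

With this guard, replaying an old event after a newer one leaves the index unchanged. Elasticsearch's external versioning is one way to get a similar effect on the real index.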
Wrap Up
Advanced Topics
Trade-offs: We choose Eventual Consistency for search results. A user might see a room available in search, but find it gone at checkout. This is standard in the industry (e.g., "Only 1 room left!") to maintain performance.
Reliability: The "Pending" state in bookings is crucial. When a user clicks "Book", we decrement the inventory and mark the booking as "Pending". If payment fails or times out (TTL 10 mins), a background worker reverts the inventory.
Bottleneck Analysis: The Inventory table grows very large. Optimization: Partition the table by month and archive data older than 2 years to cold storage.
Security: All PII (User names, emails) in the database is encrypted at rest using AES-256.
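The Pending hold described under Reliability can be sketched end to end. This example uses SQLite from the Python standard library as a stand-in for a PostgreSQL shard, and a conditional UPDATE (guarded by available_rooms > 0) rather than an explicit version comparison. The checkin, checkout, and expires_at columns and all function names are assumptions added for illustration:

```python
import sqlite3
from datetime import date

HOLD_SECONDS = 600  # the 10-minute TTL from the text


class SoldOut(Exception):
    """Raised inside the transaction to force a rollback."""


def create_schema(conn):
    conn.executescript("""
        CREATE TABLE inventory (
            hotel_id INTEGER, room_type_id INTEGER, date TEXT,
            total_rooms INTEGER, available_rooms INTEGER, version INTEGER,
            PRIMARY KEY (hotel_id, room_type_id, date));
        CREATE TABLE booking (
            booking_id TEXT PRIMARY KEY, hotel_id INTEGER, room_type_id INTEGER,
            checkin TEXT, checkout TEXT,      -- hypothetical columns
            status TEXT, expires_at INTEGER); -- expires_at is hypothetical
    """)


def reserve(conn, booking_id, hotel_id, room_type_id, checkin, checkout, now):
    """Hold one room for every night in [checkin, checkout) or change nothing."""
    nights = (date.fromisoformat(checkout) - date.fromisoformat(checkin)).days
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            cur = conn.execute(
                "UPDATE inventory SET available_rooms = available_rooms - 1, "
                "version = version + 1 "
                "WHERE hotel_id = ? AND room_type_id = ? "
                "AND date >= ? AND date < ? AND available_rooms > 0",
                (hotel_id, room_type_id, checkin, checkout))
            if cur.rowcount != nights:  # at least one night had no room left
                raise SoldOut()
            conn.execute(
                "INSERT INTO booking VALUES (?, ?, ?, ?, ?, 'Pending', ?)",
                (booking_id, hotel_id, room_type_id, checkin, checkout,
                 now + HOLD_SECONDS))
    except SoldOut:
        return False
    return True


def expire_pending(conn, now):
    """Background-worker step: release holds whose TTL has passed."""
    with conn:
        rows = conn.execute(
            "SELECT booking_id, hotel_id, room_type_id, checkin, checkout "
            "FROM booking WHERE status = 'Pending' AND expires_at <= ?",
            (now,)).fetchall()
        for booking_id, hotel_id, room_type_id, checkin, checkout in rows:
            conn.execute(
                "UPDATE inventory SET available_rooms = available_rooms + 1, "
                "version = version + 1 "
                "WHERE hotel_id = ? AND room_type_id = ? "
                "AND date >= ? AND date < ?",
                (hotel_id, room_type_id, checkin, checkout))
            conn.execute("UPDATE booking SET status = 'Expired' "
                         "WHERE booking_id = ?", (booking_id,))
    return len(rows)
```

Because the decrement and the booking insert share one transaction, a stay that is sold out on any single night leaves every night's count untouched. ISO date strings sort chronologically, which is what makes the range comparison on the text column valid here.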