The Question
Hotel Reservation System Design
Design a global hotel reservation system like Booking.com or Expedia. The system should support hotel discovery (search by location/date), real-time availability management, and a secure booking process. Key challenges include handling high-concurrency bookings for limited inventory without double-booking, ensuring low-latency search across millions of records, and maintaining data consistency across the reservation and payment lifecycles. Discuss how you would handle 10 million daily users and the trade-offs between consistency and availability in a distributed environment.
PostgreSQL
PostGIS
Redis
Kafka
CDN
JWT
Stripe
CDC
Microservices
Questions & Insights
Clarifying Questions
What is the scale of the system? (e.g., Number of hotels, rooms, and daily active users?)
Assumption: 500,000 hotels, 20 million rooms, 10 million DAU, and 500,000 bookings per day.
What is the search-to-booking ratio?
Assumption: 100:1. Users search much more than they book, making the search path read-heavy and the booking path write-heavy with high consistency requirements.
How should we handle overbooking?
Assumption: The system must strictly prevent technical double-bookings. Business-level overbooking (e.g., 110% capacity) is handled via a configurable buffer in the inventory service.
Do we need to handle payments within the system?
Assumption: We integrate with external providers (Stripe/PayPal). We must manage the payment state (Pending, Paid, Refunded).
Thinking Process
Core Bottleneck: Preventing double-bookings under high concurrency.
Strategy:
How do we ensure search is fast? (Denormalized inventory and Elasticsearch/spatial indexing).
How do we handle the "thundering herd" on popular dates? (Redis-based inventory counters).
How do we guarantee consistency during booking? (RDBMS transactions with Pessimistic/Optimistic locking).
How do we handle distributed failures between Booking and Payment? (Transactional Outbox pattern or Saga).
Bonus Points
Inventory Partitioning: Sharding the inventory database by hotel_id to ensure that bookings for different hotels don't contend for the same database locks.
Optimistic Locking with Versioning: Using a version column in the inventory table to handle high-concurrency room captures without long-held pessimistic locks.
Availability-Consistency Trade-off: Using a "Reserve-then-Pay" flow with a TTL (Time-to-Live) on the reservation to hold the room for 10-15 minutes, balancing ACID requirements with user experience.
Geo-sharding: Deploying search services and read-replicas in multiple regions to reduce latency for global users.
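The optimistic-locking idea above can be sketched as follows. This is a minimal illustration, assuming an inventory table with the columns from the design plus a version column; it uses an in-memory SQLite database as a stand-in for PostgreSQL, and the function name reserve_optimistic and the retry count are hypothetical.

```python
import sqlite3

def reserve_optimistic(conn, room_type_id, date, max_retries=3):
    """Claim one room with a version check instead of a long-held row lock."""
    for _ in range(max_retries):
        row = conn.execute(
            "SELECT total_inventory, reserved_count, version FROM inventory "
            "WHERE room_type_id = ? AND date = ?",
            (room_type_id, date),
        ).fetchone()
        if row is None:
            return False  # no inventory row for this room type and date
        total, reserved, version = row
        if reserved >= total:
            return False  # sold out
        # The UPDATE only matches if nobody changed the row since we read it.
        cur = conn.execute(
            "UPDATE inventory SET reserved_count = reserved_count + 1, "
            "version = version + 1 "
            "WHERE room_type_id = ? AND date = ? AND version = ?",
            (room_type_id, date, version),
        )
        conn.commit()
        if cur.rowcount == 1:
            return True
        # Another writer bumped the version first; re-read and retry.
    return False

# Demo: two rooms available, three attempts.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (room_type_id INTEGER, date TEXT, "
    "total_inventory INTEGER, reserved_count INTEGER, version INTEGER, "
    "PRIMARY KEY (room_type_id, date))"
)
conn.execute("INSERT INTO inventory VALUES (1, '2025-01-01', 2, 0, 0)")
results = [reserve_optimistic(conn, 1, "2025-01-01") for _ in range(3)]
```

The third attempt is rejected because reserved_count has reached total_inventory. A multi-night stay would need the same check for every date in the range, inside one transaction.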
Design Breakdown
Functional Requirements
Core Use Cases:
Users can search for hotels by location, date range, and room type.
Users can view hotel details and real-time room availability.
Users can reserve a room and make a payment.
Users can cancel or modify a reservation.
Hotel managers can update room prices and inventory.
Scope Control:
In-scope: Search, Booking, Inventory Management, Payment Integration.
Out-of-scope: User reviews/ratings, Loyalty programs, Flight/Car rental integrations.
Non-Functional Requirements
Scale: Support 10M DAU and 500k bookings/day.
Latency: Search results under 200ms; booking confirmation under 1s.
Availability & Reliability: 99.99% availability; zero data loss for confirmed bookings.
Consistency: Strong consistency for inventory (no double-booking).
Security: PCI-DSS compliance for payment handling; TLS for all traffic.
Estimation
Traffic:
Search QPS: 10M users × 20 searches/day / 86,400 s ≈ 2,300 QPS.
Peak Search QPS (5x): ≈ 11,500 QPS.
Booking QPS: 500,000 bookings/day / 86,400 s ≈ 6 QPS.
Storage:
500k hotels × 5 KB = 2.5 GB.
20M rooms × 1 KB = 20 GB.
Reservations (2 years): 500k/day × 365 days × 2 years × 2 KB ≈ 730 GB.
Bandwidth:
Search: 11,500 QPS × 10 KB per result ≈ 115 MB/s.
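The estimates above can be reproduced with a few lines of arithmetic. The sketch below uses only the figures stated in this section; decimal units (1 KB = 1,000 bytes) are an assumption made to match the rounded numbers.

```python
SECONDS_PER_DAY = 86_400
KB, MB, GB = 1_000, 1_000**2, 1_000**3  # decimal units (assumption)

dau = 10_000_000
searches_per_user_per_day = 20
bookings_per_day = 500_000
peak_factor = 5

search_qps = dau * searches_per_user_per_day / SECONDS_PER_DAY
peak_search_qps = search_qps * peak_factor
booking_qps = bookings_per_day / SECONDS_PER_DAY

hotel_storage_gb = 500_000 * 5 * KB / GB
room_storage_gb = 20_000_000 * 1 * KB / GB
reservation_storage_gb = bookings_per_day * 365 * 2 * 2 * KB / GB

# Bandwidth uses the rounded peak figure from the text (11,500 QPS).
search_bandwidth_mb_s = 11_500 * 10 * KB / MB

print(f"search ~{search_qps:.0f} QPS, peak ~{peak_search_qps:.0f} QPS")
print(f"booking ~{booking_qps:.1f} QPS")
print(f"reservations ~{reservation_storage_gb:.0f} GB over 2 years")
print(f"search bandwidth ~{search_bandwidth_mb_s:.0f} MB/s")
```

The unrounded peak is slightly above 11,500 QPS; the text rounds the average to 2,300 QPS before multiplying by 5.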
Blueprint
Concise Summary: A microservices architecture centered around a strictly consistent Inventory and Reservation service using a Relational Database, supplemented by a high-performance Search service backed by a search-optimized index.
Major Components:
API Gateway: Handles authentication, rate limiting, and request routing.
Search Service: Provides low-latency hotel discovery using spatial indexing.
Reservation Service: Manages the lifecycle of a booking and ensures ACID compliance.
Inventory Service: Tracks room availability using row-level locking or versioning.
Payment Service: Orchestrates interactions with 3rd-party payment gateways.
Simplicity Audit: This design avoids complex distributed transactions (2PC) by using a state-machine-based Reservation service and a temporary "Hold" on inventory.
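As a rough illustration of the state-machine approach, the sketch below encodes the three reservation statuses used in this design and rejects any transition that is not explicitly allowed. The transition table itself is an assumption; the source does not spell out which moves are legal.

```python
from enum import Enum

class Status(Enum):
    PENDING = "PENDING"      # room is held, payment not yet received
    CONFIRMED = "CONFIRMED"  # payment succeeded
    CANCELLED = "CANCELLED"  # hold expired, payment failed, or user cancelled

# Assumed transition table: CANCELLED is terminal.
ALLOWED = {
    Status.PENDING: {Status.CONFIRMED, Status.CANCELLED},
    Status.CONFIRMED: {Status.CANCELLED},
    Status.CANCELLED: set(),
}

def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target

status = transition(Status.PENDING, Status.CONFIRMED)
```

Because every state change goes through one function, a late payment callback for an already cancelled reservation is rejected instead of silently reviving it.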
Architecture Decision Rationale:
RDBMS: Chosen for Inventory/Reservations because ACID properties are non-negotiable for financial and booking data.
Redis: Used as a cache to offload the Read-heavy search traffic from the primary DB.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: Use a CDN (e.g., Cloudflare) for static assets (hotel images).
Security & Perimeter: API Gateway handles JWT validation and Rate Limiting (e.g., 100 requests/min per IP) to prevent scraping.
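The per-IP limit mentioned above could be implemented in several ways. The sketch below shows a simple fixed-window counter, which is one option among many (token bucket and sliding window are common alternatives). The class name and the injectable clock are hypothetical, and a real gateway would keep the counters in a shared store rather than in process memory.

```python
import time
from collections import defaultdict

class FixedWindowLimiter:
    """Allow at most `limit` requests per IP in each fixed time window."""

    def __init__(self, limit=100, window_seconds=60, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock
        # (ip, window index) -> request count; old windows are never purged
        # in this sketch, which a production version would need to handle.
        self.counts = defaultdict(int)

    def allow(self, ip: str) -> bool:
        window_index = int(self.clock() // self.window)
        key = (ip, window_index)
        if self.counts[key] >= self.limit:
            return False
        self.counts[key] += 1
        return True

# Demo with a fake clock and a small limit so the behaviour is visible.
now = [0.0]
limiter = FixedWindowLimiter(limit=3, window_seconds=60, clock=lambda: now[0])
first_window = [limiter.allow("203.0.113.7") for _ in range(4)]
now[0] = 61.0  # move into the next window
next_window = limiter.allow("203.0.113.7")
```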
Service
Reservation Service:
API: POST /v1/reservations (Create), GET /v1/reservations/{id} (Status).
Flow: When a user selects a room, the service calls Inventory Service to "Lock" the room for 15 minutes.
Idempotency: Use a client_key (UUID) to prevent duplicate bookings if the user clicks "Submit" twice.
Search Service:
Uses Geo-sharding. Query filters: Location (lat/long), Check-in, Check-out, RoomType.
Joins Hotel metadata with a "Pre-calculated Availability" table.
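The idempotency rule for POST /v1/reservations can be illustrated with a small in-memory sketch. The ReservationStore class and its create method are hypothetical; in a real deployment the same guarantee would more likely come from a unique constraint on client_key in the Reservations table.

```python
import uuid

class ReservationStore:
    """In-memory stand-in for the Reservations table, keyed by client_key."""

    def __init__(self):
        self._by_client_key = {}

    def create(self, client_key: str, payload: dict):
        """Return (reservation, created). A repeated client_key returns the
        original reservation instead of creating a second one."""
        existing = self._by_client_key.get(client_key)
        if existing is not None:
            return existing, False
        reservation = {"res_id": str(uuid.uuid4()), "status": "PENDING", **payload}
        self._by_client_key[client_key] = reservation
        return reservation, True

store = ReservationStore()
client_key = str(uuid.uuid4())  # generated once by the client per booking attempt
request = {"hotel_id": 42, "room_type_id": 7,
           "start_date": "2025-07-01", "end_date": "2025-07-03"}

first, first_created = store.create(client_key, request)
second, second_created = store.create(client_key, request)  # double click
```

Both calls return the same res_id, so a retried or double-submitted request cannot hold two rooms.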
Storage
Access Pattern: Reservation is write-heavy during peak; Inventory is mixed read-write; Search is read-heavy.
Database Table Design:
Hotels: hotel_id (PK), name, location_geohash, details.
RoomTypes: room_type_id, hotel_id, base_price, capacity.
Inventory: hotel_id, room_type_id, date, total_inventory, reserved_count. (PK: room_type_id + date).
Reservations: res_id, user_id, hotel_id, room_type_id, start_date, end_date, status (PENDING, CONFIRMED, CANCELLED).
Technical Selection: PostgreSQL with PostGIS for search.
Distribution Logic: Shard Inventory and Reservations by hotel_id to ensure operations for a specific hotel occur on a single shard, allowing for local transactions.
Cache
Purpose: Reduce DB load for availability checks during the search phase.
Key-Value Schema:
Key: inv:{hotel_id}:{date}
Value: {room_type_id: count}
TTL: 5 minutes (or invalidated via CDC when a booking is confirmed).
Failure Handling: If Redis is down, the system falls back to the Inventory DB (Performance degradation but no data loss).
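The cache behaviour described above (read through the cache, fall back to the database when the cache is unavailable) might look like the sketch below. DictCache and DownCache are hypothetical stand-ins for a Redis client, and load_from_inventory_db is a placeholder for a real query against the Inventory table.

```python
class DictCache:
    """In-memory stand-in for Redis; the TTL is recorded but not enforced."""
    def __init__(self):
        self.data = {}
    def get(self, key):
        entry = self.data.get(key)
        return entry[0] if entry else None
    def set(self, key, value, ttl):
        self.data[key] = (value, ttl)

class DownCache:
    """Simulates a cache outage."""
    def get(self, key):
        raise ConnectionError("cache unavailable")
    def set(self, key, value, ttl):
        raise ConnectionError("cache unavailable")

def load_from_inventory_db(hotel_id, date):
    # Placeholder: would query the Inventory table for this hotel and date.
    return {"deluxe": 3, "standard": 12}

def get_availability(cache, hotel_id, date, ttl_seconds=300):
    key = f"inv:{hotel_id}:{date}"  # key format from the schema above
    try:
        cached = cache.get(key)
    except ConnectionError:
        # Cache is down: slower path, but still correct.
        return load_from_inventory_db(hotel_id, date)
    if cached is not None:
        return cached
    counts = load_from_inventory_db(hotel_id, date)
    try:
        cache.set(key, counts, ttl_seconds)  # 5-minute TTL
    except ConnectionError:
        pass  # failing to populate the cache must not fail the request
    return counts

cache = DictCache()
from_db = get_availability(cache, 42, "2025-07-01")      # miss, fills cache
from_cache = get_availability(cache, 42, "2025-07-01")   # hit
degraded = get_availability(DownCache(), 42, "2025-07-01")
```

Cached counts are only a hint for search results; the booking path still has to check the Inventory database before holding a room.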
Messaging
Purpose: Asynchronous updates. When a reservation is confirmed, an event is published to update the Search Index/Cache and send a confirmation email.
Technical Selection: Kafka for high throughput and re-playability.
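One way to publish the confirmation event reliably is the Transactional Outbox pattern mentioned in the Thinking Process section. The sketch below is illustrative only: it uses in-memory SQLite in place of PostgreSQL and a plain callback in place of a Kafka producer, and the outbox table layout and topic name are assumptions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE reservations (res_id TEXT PRIMARY KEY, status TEXT);
CREATE TABLE outbox (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT, payload TEXT, published INTEGER DEFAULT 0
);
INSERT INTO reservations VALUES ('r1', 'PENDING');
""")

def confirm_reservation(conn, res_id):
    """Update the reservation and record the event in ONE transaction."""
    with conn:  # commits both statements, or rolls both back on error
        conn.execute(
            "UPDATE reservations SET status = 'CONFIRMED' WHERE res_id = ?",
            (res_id,),
        )
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("reservation.confirmed", json.dumps({"res_id": res_id})),
        )

def relay_once(conn, publish):
    """Publish unpublished outbox rows in order, marking each as sent."""
    rows = conn.execute(
        "SELECT id, topic, payload FROM outbox WHERE published = 0 ORDER BY id"
    ).fetchall()
    for row_id, topic, payload in rows:
        publish(topic, json.loads(payload))  # stand-in for a Kafka send
        with conn:
            conn.execute("UPDATE outbox SET published = 1 WHERE id = ?", (row_id,))
    return len(rows)

published = []
confirm_reservation(conn, "r1")
sent_first = relay_once(conn, lambda topic, event: published.append((topic, event)))
sent_second = relay_once(conn, lambda topic, event: published.append((topic, event)))
```

If the relay crashes between publishing and marking a row, the event is sent again on the next run, so consumers (search index updater, email sender) should tolerate duplicates.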
Wrap Up
Advanced Topics
Trade-offs: We choose Consistency (CP) over Availability for the Booking flow. If the Inventory DB is down, users cannot book, which is preferable to overbooking.
Reliability:
Dead Letter Queues (DLQ): Used for failed payment notifications.
Expiration Worker: A background job (e.g., a cron-driven sweeper, or a worker triggered by Redis key expiry) that releases "Pending" inventory if payment isn't received within 15 minutes.
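A sweeper-style expiration worker could be sketched as below. It assumes a created_at column (epoch seconds) on the reservations table, which is not part of the schema listed earlier, and it simplifies each reservation to a single night. SQLite stands in for the real database.

```python
import sqlite3

HOLD_SECONDS = 15 * 60  # hold duration from the design

def release_expired_holds(conn, now):
    """Cancel PENDING reservations older than the hold window and give
    their rooms back to inventory. Returns the number of holds released."""
    cutoff = now - HOLD_SECONDS
    expired = conn.execute(
        "SELECT res_id, room_type_id, date FROM reservations "
        "WHERE status = 'PENDING' AND created_at <= ?",
        (cutoff,),
    ).fetchall()
    released = 0
    with conn:  # one transaction for the whole sweep
        for res_id, room_type_id, date in expired:
            # Re-check the status so a reservation confirmed in the
            # meantime is not cancelled by mistake.
            cur = conn.execute(
                "UPDATE reservations SET status = 'CANCELLED' "
                "WHERE res_id = ? AND status = 'PENDING'",
                (res_id,),
            )
            if cur.rowcount == 1:
                conn.execute(
                    "UPDATE inventory SET reserved_count = reserved_count - 1 "
                    "WHERE room_type_id = ? AND date = ?",
                    (room_type_id, date),
                )
                released += 1
    return released

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE inventory (room_type_id INTEGER, date TEXT,
    total_inventory INTEGER, reserved_count INTEGER,
    PRIMARY KEY (room_type_id, date));
CREATE TABLE reservations (res_id TEXT PRIMARY KEY, room_type_id INTEGER,
    date TEXT, status TEXT, created_at INTEGER);
INSERT INTO inventory VALUES (1, '2025-07-01', 5, 2);
INSERT INTO reservations VALUES ('old', 1, '2025-07-01', 'PENDING', 0);
INSERT INTO reservations VALUES ('new', 1, '2025-07-01', 'PENDING', 1000);
""")
released = release_expired_holds(conn, now=1200)
```

With now=1200 the cutoff is 300 seconds, so only the reservation created at time 0 is released.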
Bottleneck Analysis:
Hot Hotel: A specific hotel goes viral. Fix: Distribute inventory across multiple rows or use Redis Lua scripts for atomic increments.
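The "multiple rows" fix for a hot hotel can be pictured as splitting one inventory counter into several independent slots, so concurrent bookings rarely touch the same row. The sketch below models the slots as a plain list; in a database each slot would be its own row with its own lock. The class name and slot count are hypothetical.

```python
import random

class SlottedInventory:
    """Spread `total` rooms across several independent counters."""

    def __init__(self, total, slots=4):
        base, extra = divmod(total, slots)
        # The first `extra` slots get one additional room each.
        self.remaining = [base + (1 if i < extra else 0) for i in range(slots)]

    def reserve(self, rng=random):
        """Try a random slot first, then the others, so the total is never
        exceeded even when some slots are already empty."""
        n = len(self.remaining)
        start = rng.randrange(n)
        for offset in range(n):
            i = (start + offset) % n
            if self.remaining[i] > 0:
                self.remaining[i] -= 1
                return True
        return False  # every slot is empty: sold out

inventory = SlottedInventory(total=10, slots=4)
attempts = [inventory.reserve() for _ in range(12)]
```

The trade-off is that a near-sold-out request may have to probe several slots, and reading the total availability means summing all of them.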
Security: Use Vault for managing API keys for Payment Gateways. Use RBAC for hotel managers.