The Question

Hotel Reservation System Design

Design a global hotel reservation system similar to Booking.com. The system should allow users to search for hotels based on location and date ranges, view real-time room availability, and complete bookings with high concurrency. Address the challenges of inventory consistency (preventing overbooking), handling high-traffic search patterns, and managing the reservation lifecycle (from pending to confirmed). Assume 10M DAU and 1M hotels.

PostgreSQL

Elasticsearch

Redis

Kafka

RabbitMQ

Stripe

Kubernetes

CDN

CDC

Questions & Insights

Clarifying Questions

Scale & Traffic: What is the expected scale in terms of Daily Active Users (DAU) and total hotel listings?

Assumption: 10M DAU, 1M hotels globally, with an average of 100 rooms per hotel.

Concurrency & Consistency: How should we handle "last room" scenarios? Is overbooking allowed (standard in industry) or strictly prohibited?

Assumption: Strict consistency for the final booking to prevent overbooking for the MVP.

Search Patterns: Do users search by city, coordinates, or specific hotel names?

Assumption: Search by city and date range is the primary use case.

Booking Flow: Does the system handle payments directly or via third-party providers?

Assumption: Third-party payment integration (e.g., Stripe); we handle the state machine.

Thinking Process

Core Bottleneck: The primary challenge is managing Room Inventory under high concurrency (thousands of users trying to book the last room during a flash sale or peak season).

Strategy Steps:

How do we ensure search is fast across millions of hotels? (Read-heavy optimization).

How do we manage inventory updates without double-booking? (Transaction management).

How do we handle the "temporary hold" state during the payment process? (Distributed locking/Reservations).

How do we ensure the system is resilient if the payment provider or mail service fails? (Asynchronous processing).

Bonus Points

Database Sharding by Location: Implementing a geo-sharding strategy (e.g., partitioning by CityID or CountryID) to ensure data proximity and reduce cross-region latency.

CDC (Change Data Capture): Using Debezium/Kafka to sync the primary SQL database (Source of Truth) with Elasticsearch (Search) to avoid dual-write inconsistencies.

Optimistic Locking with Versioning: Using database-level version checks to handle high-frequency inventory updates without the overhead of heavy pessimistic locks.

Idempotency Keys: Implementing strict idempotency for both the Reservation and Payment layers to prevent duplicate charges and bookings on retry.
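To make the CDC point above more concrete, here is a minimal sketch that maps a Debezium-style change event (with its `op`, `before`, and `after` fields) to an index or delete action for the search index. The function name `to_search_action`, the shape of the returned action, and the `hotels` index name are assumptions for illustration. A real pipeline would consume these events from Kafka and send the actions to Elasticsearch in bulk.

```python
from typing import Any, Dict, Optional


def to_search_action(event: Dict[str, Any], index: str = "hotels") -> Optional[Dict[str, Any]]:
    """Translate one Debezium-style row change into a search-index action.

    The returned dict shape is hypothetical; it is not an Elasticsearch API type.
    """
    op = event.get("op")
    if op in ("c", "u", "r"):  # create, update, or snapshot read
        row = event["after"]
        return {"action": "index", "index": index, "id": row["id"], "doc": row}
    if op == "d":  # delete: only the "before" image carries the key
        return {"action": "delete", "index": index, "id": event["before"]["id"]}
    return None  # ignore ops this sketch does not handle


update = {"op": "u", "before": {"id": 7, "name": "Old"}, "after": {"id": 7, "name": "New"}}
print(to_search_action(update))
```

Because the search index is derived only from the database's change stream, the application never writes to both stores itself, which is what removes the dual-write inconsistency.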

Design Breakdown

Functional Requirements

Core Use Cases:

Users can search for hotels by location and date.

Users can view hotel details and real-time room availability.

Users can reserve a room and complete payment.

Users can view/cancel their bookings.

Scope Control:

In-scope: Search, Inventory management, Booking flow, Payment state handling.

Out-of-scope: Hotel owner portal (extranet), loyalty point systems, dynamic pricing algorithms.

Non-Functional Requirements

Scale: Support 10k Search QPS and 500 Booking QPS.

Latency: Search results returned in under 200ms; booking confirmation in under 2s.

Availability & Reliability: 99.99% availability; bookings must never be lost.

Consistency: Strong consistency for inventory; eventual consistency for search results and hotel descriptions.

Fault Tolerance: Handle payment gateway timeouts and downstream service failures gracefully.

Estimation

Traffic Estimation:

Search: 10M DAU * 10 searches/user = 100M searches/day ≈ 1,200 QPS average (10k Peak).

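Booking: 10M DAU * 0.05 conversion = 500k bookings/day ≈ 6 QPS average (about 100 QPS at peak; the 500 Booking QPS target in the non-functional requirements leaves headroom for flash sales).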

Storage Estimation:

1M Hotels * 1KB/hotel = 1GB.

100M Rooms * 100 bytes = 10GB.

500k Bookings/day * 365 days * 2 years * 500 bytes ≈ 180GB.

Bandwidth Estimation:

Outgoing: 10k Search QPS * 50KB response ≈ 500 MB/s.
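The arithmetic behind these estimates can be checked with a short script. It uses the figures stated above and assumes decimal units (1KB = 1,000 bytes); the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

dau = 10_000_000
searches_per_day = dau * 10                      # 10 searches per user
bookings_per_day = dau * 5 // 100                # 5% conversion

avg_search_qps = searches_per_day / SECONDS_PER_DAY
avg_booking_qps = bookings_per_day / SECONDS_PER_DAY

hotels_bytes = 1_000_000 * 1_000                 # 1M hotels * 1KB
rooms_bytes = 100_000_000 * 100                  # 100M rooms * 100 bytes
bookings_bytes = bookings_per_day * 365 * 2 * 500

peak_egress_bytes_per_s = 10_000 * 50_000        # 10k QPS * 50KB

print(f"search QPS (avg):  {avg_search_qps:,.0f}")
print(f"booking QPS (avg): {avg_booking_qps:.1f}")
print(f"hotels:   {hotels_bytes / 1e9:.1f} GB")
print(f"rooms:    {rooms_bytes / 1e9:.1f} GB")
print(f"bookings: {bookings_bytes / 1e9:.1f} GB over 2 years")
print(f"egress:   {peak_egress_bytes_per_s / 1e6:.0f} MB/s at peak")
```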

Blueprint

Concise Summary: A microservices-based architecture using a Relational Database for ACID-compliant transactions and a Search Engine for high-performance discovery.

Major Components:

API Gateway: Entry point for rate limiting, auth, and routing.

Search Service: Uses Elasticsearch to provide low-latency hotel filtering.

Reservation Service: Manages the booking lifecycle and coordinates inventory.

Inventory Service: Maintains room counts per day using Redis for speed and SQL for persistence.

Payment Service: Orchestrates interactions with 3rd-party payment providers.

Simplicity Audit: This design avoids complex distributed transactions (2PC) by holding inventory with a "Pending" (reserved) reservation state in the database, which is the simplest way to maintain consistency without blocking the entire system.
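The reservation lifecycle implied here can be sketched as a small state machine over the Pending/Confirmed/Cancelled statuses. The allowed transitions and the 10-minute hold window are assumptions, not values stated in the design.

```python
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    CANCELLED = "cancelled"


# Assumed transition table: a hold can be confirmed or cancelled,
# a confirmed booking can only be cancelled, and cancellation is final.
ALLOWED = {
    Status.PENDING: {Status.CONFIRMED, Status.CANCELLED},
    Status.CONFIRMED: {Status.CANCELLED},
    Status.CANCELLED: set(),
}


def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"illegal transition {current.name} -> {target.name}")
    return target


def expire_if_stale(status: Status, created_at: float, now: float,
                    hold_seconds: float = 600.0) -> Status:
    """Cancel a pending hold whose payment window (assumed 10 min) has lapsed."""
    if status is Status.PENDING and now - created_at > hold_seconds:
        return transition(status, Status.CANCELLED)
    return status
```

A periodic job that applies `expire_if_stale` and releases the held inventory would keep abandoned checkouts from blocking rooms indefinitely.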

Architecture Decision Rationale:

SQL (PostgreSQL) is chosen for Reservation/Inventory due to transaction support.

Elasticsearch is used for Search because SQL LIKE or complex joins don't scale for global hotel filters.

Redis handles hot inventory lookups to protect the database from search-related load.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Use a CDN (Cloudflare/Akamai) for static assets (hotel images). Latency-based DNS routing to direct users to the nearest regional data center.

Security & Perimeter: API Gateway handles JWT validation and per-user rate limiting (e.g., 20 searches/min) to prevent scraping.
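As one possible shape for the per-user limit, the sketch below implements a sliding-window limiter with the 20 searches/min figure from above. The class name and the in-process storage are assumptions; a gateway running on several instances would keep these counters in a shared store such as Redis.

```python
import time
from collections import defaultdict, deque


class SlidingWindowLimiter:
    """Allow at most `limit` requests per user in any `window_seconds` span."""

    def __init__(self, limit=20, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock                  # injectable for testing
        self._hits = defaultdict(deque)     # user_id -> request timestamps

    def allow(self, user_id: str) -> bool:
        now = self.clock()
        hits = self._hits[user_id]
        # Drop timestamps that have aged out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```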

Service

Topology & Scaling: Stateless microservices deployed on Kubernetes (EKS/GKE). Scaling triggered by CPU/Memory.

API Schema Design:

GET /v1/hotels/search: Params (location, checkin, checkout, guests). Returns list of Hotel objects.

POST /v1/reservations: Request (hotel_id, room_type_id, dates, user_id). Returns reservation_id and payment_token. Idempotency key required in headers.
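The idempotency requirement on POST /v1/reservations could be handled roughly as follows. `ReservationStore` is a hypothetical in-memory stand-in; in the real service the key would more likely be a uniquely constrained column written in the same database transaction as the reservation.

```python
import uuid


class ReservationStore:
    """In-memory stand-in for the Reservations table plus an idempotency index."""

    def __init__(self):
        self._by_key = {}

    def create(self, idempotency_key: str, request: dict) -> dict:
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            # A retry must carry the same payload; otherwise the key was misused.
            if existing["request"] != request:
                raise ValueError("idempotency key reused with a different payload")
            return existing["response"]     # replay the original result
        response = {"reservation_id": str(uuid.uuid4()), "status": "pending"}
        self._by_key[idempotency_key] = {"request": dict(request), "response": response}
        return response


store = ReservationStore()
req = {"hotel_id": 1, "room_type_id": 10, "checkin": "2025-07-01",
       "checkout": "2025-07-03", "user_id": 99}
first = store.create("key-abc", req)
retry = store.create("key-abc", req)        # e.g. a client retry after a timeout
print(first["reservation_id"] == retry["reservation_id"])
```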

Resilience & Reliability: Circuit breakers on the Payment Service to prevent the Reservation Service from hanging if Stripe is down. Retries with exponential backoff for the Notification Worker.
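A minimal sketch of both patterns mentioned above: a circuit breaker that fails fast once the payment provider keeps erroring, and a helper that computes exponential backoff delays for the notification worker. The thresholds, class names, and the single half-open trial policy are assumptions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised instead of calling the provider while the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_seconds=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_seconds = reset_seconds
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_seconds:
                raise CircuitOpenError("payment provider circuit is open")
            # Reset period elapsed: fall through and allow one trial call.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()   # open (or re-open) the circuit
            raise
        self.failures = 0                       # success closes the circuit
        self.opened_at = None
        return result


def backoff_delays(base=0.5, factor=2.0, attempts=4, cap=30.0):
    """Delays in seconds before each retry, capped at `cap`."""
    return [min(cap, base * factor ** i) for i in range(attempts)]
```

Production retry loops usually add random jitter to these delays so that many workers do not retry in lockstep.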

Storage

Access Pattern: Inventory sees a heavy mix of reads and writes on the same rows during peak booking periods. Search is read-heavy.

Database Table Design:

Hotels: id (PK), name, city_id, description, lat/long.

Room_Types: id (PK), hotel_id, name, base_price.

Inventory: room_type_id (FK), date, total_inventory, reserved_count. Composite PK on (room_type_id, date), one row per room type per night.

Reservations: id (PK), user_id, room_type_id, checkin, checkout, status (Pending/Confirmed/Cancelled).
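The table list above might translate into DDL along these lines. The sketch runs against in-memory SQLite so it is self-contained; column types, the split of lat/long into two columns, and the CHECK constraints are assumptions, and a PostgreSQL version would use native DATE and NUMERIC types. Inventory is given a composite primary key on (room_type_id, date), assuming one row per room type per night.

```python
import sqlite3

SCHEMA = """
CREATE TABLE hotels (
    id INTEGER PRIMARY KEY,
    name TEXT NOT NULL,
    city_id INTEGER NOT NULL,
    description TEXT,
    latitude REAL,
    longitude REAL
);
CREATE TABLE room_types (
    id INTEGER PRIMARY KEY,
    hotel_id INTEGER NOT NULL REFERENCES hotels(id),
    name TEXT NOT NULL,
    base_price NUMERIC NOT NULL
);
CREATE TABLE inventory (
    room_type_id INTEGER NOT NULL REFERENCES room_types(id),
    date TEXT NOT NULL,
    total_inventory INTEGER NOT NULL,
    reserved_count INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (room_type_id, date),
    -- Last line of defence against overbooking, enforced by the database.
    CHECK (reserved_count >= 0 AND reserved_count <= total_inventory)
);
CREATE TABLE reservations (
    id TEXT PRIMARY KEY,
    user_id INTEGER NOT NULL,
    room_type_id INTEGER NOT NULL REFERENCES room_types(id),
    checkin TEXT NOT NULL,
    checkout TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'pending'
        CHECK (status IN ('pending', 'confirmed', 'cancelled'))
);
"""


def create_schema(conn: sqlite3.Connection) -> None:
    """Create all four tables on the given connection."""
    conn.executescript(SCHEMA)
```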

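Technical Selection: PostgreSQL with REPEATABLE READ isolation so that an inventory check and its increment see a consistent snapshot within one transaction. Under this level, concurrent writers to the same inventory row fail with a serialization error and must be retried by the application.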

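Distribution Logic: Partition Inventory and Reservations by hotel_id so that all data for a single hotel resides on one shard, which keeps each booking transaction on a single shard. Because both tables are keyed by room_type_id in the schema above, hotel_id would need to be denormalized onto them to serve as the partition key.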
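A minimal sketch of the routing step, assuming simple hash-mod placement over a fixed shard count of 16 (both are assumptions; a production system might prefer consistent hashing or a lookup table to ease resharding). A stable digest is used because Python's built-in `hash` is not guaranteed to be the same across processes.

```python
import hashlib


def shard_for_hotel(hotel_id: int, num_shards: int = 16) -> int:
    """Map a hotel to a shard index that is stable across processes."""
    digest = hashlib.sha256(str(hotel_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Inventory and Reservations rows for the same hotel route to the same shard.
print(shard_for_hotel(12345))
```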

Cache

Purpose & Justification: Reduce load on SQL for availability checks.

Key-Value Schema: inv:{room_type_id}:{date} -> available_count.

Technical Selection: Redis.

Failure Handling: If Redis is down, fall back to SQL. Use a cache-aside strategy with invalidation: SQL is updated first, then the Redis key is evicted so that the next read repopulates it.
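The read path with SQL fallback and the evict-after-write step can be sketched as below. `DictCache` is a hypothetical stand-in for a Redis client, and `load_from_sql` is an assumed callback that queries the Inventory table; the key follows the `inv:{room_type_id}:{date}` schema above.

```python
class DictCache:
    """Toy cache with a switch to simulate an outage."""

    def __init__(self):
        self.data = {}
        self.down = False

    def _check(self):
        if self.down:
            raise ConnectionError("cache unavailable")

    def get(self, key):
        self._check()
        return self.data.get(key)

    def set(self, key, value):
        self._check()
        self.data[key] = value

    def delete(self, key):
        self._check()
        self.data.pop(key, None)


def cache_key(room_type_id, date):
    return f"inv:{room_type_id}:{date}"


def get_available(cache, load_from_sql, room_type_id, date):
    """Read through the cache; fall back to SQL on a miss or a cache outage."""
    key = cache_key(room_type_id, date)
    try:
        cached = cache.get(key)
        if cached is not None:
            return int(cached)
    except ConnectionError:
        return load_from_sql(room_type_id, date)
    value = load_from_sql(room_type_id, date)
    try:
        cache.set(key, value)
    except ConnectionError:
        pass  # serving from SQL is still correct
    return value


def on_inventory_committed(cache, room_type_id, date):
    """Call after the SQL transaction commits: evict so the next read reloads."""
    try:
        cache.delete(cache_key(room_type_id, date))
    except ConnectionError:
        pass  # a TTL on cache entries would bound staleness in this case
```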

Messaging

Purpose & Decoupling: Decouple the booking success from the notification/email flow.

Event / Topic Schema: booking.confirmed: {reservation_id, user_id, email, hotel_name, dates}.

Technical Selection: RabbitMQ or SQS for simplicity in MVP.

Failure Handling: Dead-letter queues (DLQ) for failed email attempts (e.g., invalid email addresses).
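The retry-then-dead-letter behaviour can be illustrated with in-memory queues. The `drain` function, the message envelope with an `attempts` counter, and the limit of three attempts are assumptions; RabbitMQ and SQS each provide their own dead-letter configuration, which this sketch does not model.

```python
from collections import deque


def drain(queue, handler, dead_letters, max_attempts=3):
    """Process messages; requeue failures until max_attempts, then dead-letter."""
    while queue:
        msg = queue.popleft()
        try:
            handler(msg["body"])
        except Exception as exc:
            msg["attempts"] = msg.get("attempts", 0) + 1
            if msg["attempts"] >= max_attempts:
                # Park the message with its error for later inspection.
                dead_letters.append({**msg, "error": str(exc)})
            else:
                queue.append(msg)


sent = []


def send_email(body):
    """Hypothetical handler: rejects obviously invalid addresses."""
    if "@" not in body["email"]:
        raise ValueError("invalid email address")
    sent.append(body["reservation_id"])


queue = deque([
    {"body": {"reservation_id": "r1", "email": "guest@example.com"}},
    {"body": {"reservation_id": "r2", "email": "not-an-address"}},
])
dlq = []
drain(queue, send_email, dlq)
print(sent, [m["body"]["reservation_id"] for m in dlq])
```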

Wrap Up

Advanced Topics

Trade-offs (CAP/PACELC): For Inventory, we choose Consistency over Availability (CP in CAP terms; PC/EC under PACELC, since we also accept higher latency for consistency in normal operation). It is better to fail a request than to allow a double-booking. For Search, we choose Availability (AP; PA/EL under PACELC): users can see slightly stale availability in results as long as the final booking step is consistent.

Optimization - Inventory Locking:

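Option A (Pessimistic): SELECT ... FOR UPDATE on the inventory row. Safe but limits throughput.

Option B (Optimistic): UPDATE inventory SET reserved_count = reserved_count + 1 WHERE room_type_id = ? AND date = ? AND reserved_count < total_inventory. The booking proceeds only if exactly one row was updated; zero rows means the night is sold out. Best for MVP.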
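Option B can be sketched end to end with a guarded UPDATE whose affected-row count decides the outcome. The example uses in-memory SQLite so it runs as written; the function name `reserve_nights`, the `SoldOut` exception, and the all-or-nothing handling of multi-night stays are assumptions. Column names follow the Inventory table design (room_type_id, reserved_count, total_inventory).

```python
import sqlite3


class SoldOut(Exception):
    """Raised internally when a night has no remaining rooms."""


def reserve_nights(conn, room_type_id, nights):
    """Reserve one room for every night in `nights`, or none of them."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for night in nights:
                cur = conn.execute(
                    "UPDATE inventory SET reserved_count = reserved_count + 1 "
                    "WHERE room_type_id = ? AND date = ? "
                    "AND reserved_count < total_inventory",
                    (room_type_id, night),
                )
                if cur.rowcount != 1:  # guard failed: sold out or no such row
                    raise SoldOut(night)
    except SoldOut:
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (room_type_id INTEGER, date TEXT, "
    "total_inventory INTEGER, reserved_count INTEGER, "
    "PRIMARY KEY (room_type_id, date))"
)
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?, ?)",
    [(1, "2025-07-01", 2, 0), (1, "2025-07-02", 1, 1)],  # second night is full
)
conn.commit()

print(reserve_nights(conn, 1, ["2025-07-01", "2025-07-02"]))  # second night sold out
print(reserve_nights(conn, 1, ["2025-07-01"]))
```

Because the availability check lives in the WHERE clause, the check and the increment happen in one statement, so there is no window between reading the count and writing it.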

Bottleneck Analysis: The Inventory table will be a hotspot. This is mitigated by sharding by hotel_id and using Redis for pre-filtering search requests.

Security: All PII (user emails, names) is encrypted at rest. PCI compliance is offloaded to the Payment Provider (we only store payment tokens).