The Question
Design a Scalable Hotel Reservation System
Design a high-scale hotel reservation and search system similar to Booking.com. The system must support millions of hotels and room types while providing sub-second search latency and guaranteeing zero double-bookings. Address how you would handle massive read traffic for hotel searches, maintain strict inventory consistency during peak booking periods, and ensure the system remains resilient if external payment providers fail.
PostgreSQL
Redis
Elasticsearch
Kafka
CDC
CQRS
REST
JWT
Questions & Insights
Clarifying Questions
Scale & Traffic: What is the expected scale in terms of hotels, rooms, and daily active users (DAU)?
Assumption: 1M hotels, 10M DAU, 5k peak search QPS, and 200 peak booking QPS (consistent with the figures derived in the Estimation section).
Booking Window: How far in advance can a user book?
Assumption: Up to 500 days in advance.
Inventory Model: Are we a single chain or an aggregator?
Assumption: An aggregator model (like Booking.com or Expedia) where we manage inventory from multiple sources.
Consistency Requirements: Is "overbooking" acceptable?
Assumption: For the MVP, we require strict consistency (no double-bookings).
Thinking Process
Core Problem: Managing high-concurrency inventory updates (Atomic decrement) while maintaining a low-latency search experience across millions of records.
Key Questions:
How do we guarantee a room isn't booked twice at the exact same millisecond? (Atomic DB transactions or Distributed Locking).
How do we support complex search filters (price, location, rating) at scale? (Command Query Responsibility Segregation - CQRS with Elasticsearch).
How do we handle the "thundering herd" on popular hotels during peak seasons? (Layered caching and inventory reservation patterns).
How do we keep the search index in sync with the primary booking database? (Asynchronous Change Data Capture - CDC).
Bonus Points
TCC (Try-Confirm-Cancel) Pattern: Implementing a 2-phase reservation flow to handle distributed transactions across the Booking Service and Payment Service without long-lived DB locks.
Optimistic Locking with Versioning: Using version numbers in the inventory table to handle high-concurrency writes without the overhead of pessimistic row-level locking.
Geo-Sharding: Partitioning hotel data by geographic region to ensure localized low latency and compliance with data residency laws.
Dynamic Pricing Integration: Using a sidecar pattern to inject real-time pricing adjustments based on demand spikes without bloating the core booking logic.
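As a rough illustration of the Try-Confirm-Cancel idea listed above, the sketch below models the inventory participant in memory. The class and function names (InventoryTcc, book, charge) are hypothetical and not part of the original design; a real implementation would persist holds in the Booking DB and expire stale ones.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional
import uuid


class HoldState(Enum):
    TRIED = "tried"
    CONFIRMED = "confirmed"
    CANCELLED = "cancelled"


@dataclass
class InventoryTcc:
    """In-memory stand-in for the inventory participant of a TCC flow."""
    total_rooms: int
    reserved_rooms: int = 0
    holds: dict = field(default_factory=dict)

    def try_reserve(self) -> Optional[str]:
        """Try: tentatively take one room. Returns a hold id, or None if sold out."""
        if self.reserved_rooms >= self.total_rooms:
            return None
        self.reserved_rooms += 1
        hold_id = str(uuid.uuid4())
        self.holds[hold_id] = HoldState.TRIED
        return hold_id

    def confirm(self, hold_id: str) -> None:
        """Confirm: make the hold permanent. Repeated calls are harmless."""
        if self.holds[hold_id] is HoldState.TRIED:
            self.holds[hold_id] = HoldState.CONFIRMED

    def cancel(self, hold_id: str) -> None:
        """Cancel: release the room. Repeated calls are harmless."""
        if self.holds[hold_id] is HoldState.TRIED:
            self.holds[hold_id] = HoldState.CANCELLED
            self.reserved_rooms -= 1


def book(inventory: InventoryTcc, charge: Callable[[], bool]) -> str:
    """Run Try, call the payment step, then Confirm or Cancel."""
    hold_id = inventory.try_reserve()
    if hold_id is None:
        return "sold_out"
    try:
        paid = charge()  # hypothetical call into the Payment Service
    except Exception:
        paid = False     # treat a payment error as a failed charge
    if paid:
        inventory.confirm(hold_id)
        return "confirmed"
    inventory.cancel(hold_id)
    return "cancelled"
```

No database lock is held while the payment call is in flight; the hold itself reserves the room, and a failed payment simply releases it.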
Design Breakdown
Functional Requirements
Core Use Cases:
Users can search for hotels by location and date range.
Users can view hotel details and room availability.
Users can reserve a room and complete payment.
Users can view or cancel their bookings.
Scope Control:
In-scope: Search, Reservation flow, Inventory management, Payment integration.
Out-of-scope: User reviews/ratings system, Loyalty programs, Hotel-side admin dashboard (CMS), and complex dynamic pricing algorithms.
Non-Functional Requirements
Scale: Support 1M hotels and 10M DAU.
Latency: Search results in < 200ms; Booking confirmation in < 2s.
Availability & Reliability: 99.99% availability for search; 99.9% for booking (ACID compliance prioritized).
Consistency: Strong consistency for inventory; Eventual consistency for search results.
Fault Tolerance: Handle regional failures and database primary elections gracefully.
Security & Privacy: PCI-DSS compliance for payment data; TLS encryption for all transit data.
Estimation
Traffic Estimation:
Search: 10M DAU * 10 searches/day = 100M daily searches (~1,200 avg QPS; 5,000 peak QPS).
Booking: 1M daily bookings (~12 avg QPS; 200 peak QPS).
Storage Estimation:
Hotels: 1M * 5KB = 5GB.
Inventory: 1M hotels * 10 room types * 500 days = 5B rows.
5B rows * 100 bytes = 500GB (Manageable with sharded RDBMS).
Bandwidth Estimation:
Search: 5k QPS * 10KB (results) = 50MB/s (Outgoing).
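The figures above can be reproduced with a few lines of arithmetic. This is only a back-of-the-envelope check; decimal units are assumed (1 KB = 1,000 bytes), and the peak QPS values are taken from the text rather than derived.

```python
SECONDS_PER_DAY = 86_400

# Traffic
dau = 10_000_000
searches_per_user_per_day = 10
daily_searches = dau * searches_per_user_per_day      # 100M
avg_search_qps = daily_searches / SECONDS_PER_DAY      # roughly 1,200

daily_bookings = 1_000_000
avg_booking_qps = daily_bookings / SECONDS_PER_DAY     # roughly 12

# Storage
hotels = 1_000_000
hotel_bytes = hotels * 5_000                           # 5 KB per hotel
inventory_rows = hotels * 10 * 500                     # room types * days
inventory_bytes = inventory_rows * 100                 # 100 bytes per row

# Bandwidth at the stated peak of 5k search QPS, 10 KB per response
peak_search_qps = 5_000
search_bandwidth_bytes_per_s = peak_search_qps * 10_000

print(f"avg search QPS   : {avg_search_qps:,.0f}")
print(f"avg booking QPS  : {avg_booking_qps:,.1f}")
print(f"hotel storage    : {hotel_bytes / 1e9:.0f} GB")
print(f"inventory rows   : {inventory_rows / 1e9:.0f} B")
print(f"inventory storage: {inventory_bytes / 1e9:.0f} GB")
print(f"search bandwidth : {search_bandwidth_bytes_per_s / 1e6:.0f} MB/s")
```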
Blueprint
Concise Summary: A CQRS-based architecture separating high-traffic Search from high-integrity Booking.
Major Components:
API Gateway: Handles rate limiting, authentication, and request routing to internal microservices.
Search Service: A read-optimized service querying an Elasticsearch cluster for hotel metadata and availability.
Booking Service: A write-optimized service managing room reservations and inventory updates using a relational database.
Inventory Cache: A Redis cluster storing real-time room counts to shield the database from search-heavy availability checks.
Sync Worker: A background processor that updates the Search Index whenever the Booking DB changes.
Simplicity Audit: This design avoids complex distributed transactions (like 2PC) by using local DB transactions for inventory and async updates for search, providing the best balance of reliability and performance for an MVP.
Architecture Decision Rationale:
Why this architecture?: Separation of concerns allows the Search Service to scale horizontally independently of the transactional Booking Service.
Functional Requirement Satisfaction: Covers the full flow from discovery (Search) to transaction (Booking).
Non-functional Requirement Satisfaction: Elasticsearch provides the required sub-200ms search latency, while PostgreSQL ensures no double-bookings through ACID transactions.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: Use a global CDN for static assets (hotel images) and latency-based DNS routing to direct users to the nearest regional data center.
Security & Perimeter:
API Gateway: Implements JWT-based authentication and request validation.
Rate Limiting: Enforces quotas (e.g., 100 searches/min per IP) to prevent scraping.
SSL Termination: Handles encryption at the edge to reduce load on internal services.
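One possible way to enforce the per-IP search quota mentioned above is a sliding-window counter. The sketch below is a single-process illustration with a hypothetical SlidingWindowLimiter class; a gateway running on several nodes would need shared state (for example in Redis), which is not shown here.

```python
from collections import defaultdict, deque
import time


class SlidingWindowLimiter:
    """Allow at most `limit` requests per client within `window_seconds`."""

    def __init__(self, limit=100, window_seconds=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock                  # injectable so tests can control time
        self._hits = defaultdict(deque)     # client_ip -> timestamps of recent requests

    def allow(self, client_ip: str) -> bool:
        now = self.clock()
        hits = self._hits[client_ip]
        # Drop timestamps that have fallen out of the window.
        while hits and now - hits[0] >= self.window:
            hits.popleft()
        if len(hits) >= self.limit:
            return False
        hits.append(now)
        return True
```

With the defaults, the 101st search from one IP inside a minute is rejected, while other IPs are unaffected.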
Service
Topology & Scaling:
Search Service: Stateless, scales based on CPU/Request count. Deployed in multiple Availability Zones (AZs).
Booking Service: State-heavy, uses a small pool of high-spec instances to minimize connection overhead to the DB.
API Schema Design:
POST /v1/reservations: Protocol: REST/JSON. Request: {hotel_id, room_id, start_date, end_date}. Response: {reservation_id, status}. Idempotency: Required via idempotency_key header.
GET /v1/search: Protocol: REST/JSON. Request: {location, guests, checkin, checkout}.
Resilience & Reliability:
Circuit Breaker: Used for the Payment Gateway to prevent the Booking Service from hanging during third-party outages.
Retries: Exponential backoff for the Sync Worker to ensure the Search Index eventually reconciles with the DB.
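The circuit breaker around the Payment Gateway could look roughly like the following. This is a minimal sketch, not a production library; the class name, thresholds, and timeout values are assumptions chosen for illustration.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so tests can control time
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                # Fail fast instead of waiting on a provider that is known to be down.
                raise CircuitOpenError("payment provider circuit is open")
            # Timeout elapsed: half-open, let this one trial call through.
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # Open on reaching the threshold, or re-open if the half-open trial failed.
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, the Booking Service can return a "payment pending" or "try again" response immediately rather than tying up threads on a provider timeout.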
Storage
Access Pattern:
Search: High read, complex filters, geo-spatial queries.
Booking: Low read, high write, requires atomic updates.
Database Table Design:
RoomInventory:
hotel_id (PK), date (PK), room_type_id (PK), total_rooms, reserved_rooms, version.
Bookings:
booking_id (UUID), user_id, hotel_id, status (Pending/Confirmed/Cancelled).
Technical Selection:
PostgreSQL: Primary DB for ACID compliance. Sharded by hotel_id.
Elasticsearch: Search Index for text-based and geo-spatial queries.
Distribution Logic: Sharding key is hotel_id to ensure all inventory for a specific hotel resides on one shard, allowing for atomic local transactions.
Cache
Purpose & Justification: Reduce DB load during the search phase ("Is this hotel available?") before the user clicks "Book".
Key-Value Schema: Key: inv:{hotel_id}:{date}, Value: available_count. TTL: 24 hours. Invalidation: Updated via Sync Worker or direct write-through.
Technical Selection: Redis (Cluster mode).
Failure Handling: If Redis is down, services fall back to the PostgreSQL Shards (Graceful degradation with increased latency).
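The read path with graceful degradation might be sketched as below. The cache interface (get and set with a ttl_seconds argument), the CacheUnavailable exception, and the load_from_db callback are all hypothetical stand-ins, not the API of any particular Redis client.

```python
class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when Redis cannot be reached."""


class DictCache:
    """Tiny in-memory stand-in for Redis. TTL is accepted but not enforced."""

    def __init__(self):
        self.store = {}

    def get(self, key):
        return self.store.get(key)

    def set(self, key, value, ttl_seconds):
        self.store[key] = value


class DownCache:
    """Stand-in for a cache cluster that is currently unreachable."""

    def get(self, key):
        raise CacheUnavailable(key)

    def set(self, key, value, ttl_seconds):
        raise CacheUnavailable(key)


def cache_key(hotel_id, date):
    return f"inv:{hotel_id}:{date}"


def get_available_count(hotel_id, date, cache, load_from_db):
    """Cache-aside read; falls back to the database if the cache is down."""
    key = cache_key(hotel_id, date)
    try:
        cached = cache.get(key)
        if cached is not None:
            return int(cached)
    except CacheUnavailable:
        # Degrade gracefully: slower, but still correct.
        return load_from_db(hotel_id, date)
    count = load_from_db(hotel_id, date)
    try:
        cache.set(key, count, ttl_seconds=24 * 3600)
    except CacheUnavailable:
        pass  # serving the request matters more than populating the cache
    return count
```

The cached count is only a hint for search; the authoritative check still happens in PostgreSQL at booking time.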
Messaging
Purpose & Decoupling: Decouples the Booking Service from the Search Index update logic.
Event Schema: Topic: inventory_updates. Payload: {hotel_id, date, new_availability_count}.
Throughput & Partitioning: Partitioned by hotel_id to ensure updates for the same hotel are processed in order.
Technical Selection: Kafka (High throughput, durability for CDC events).
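To see why keying by hotel_id preserves per-hotel ordering, consider the simplified model below. It uses CRC32 as a stand-in hash purely for illustration (Kafka's default partitioner uses a different hash function), and partition_for and route are hypothetical helper names.

```python
import zlib


def partition_for(hotel_id: str, num_partitions: int) -> int:
    """Map a key to a partition deterministically (CRC32 is an illustrative choice)."""
    return zlib.crc32(hotel_id.encode("utf-8")) % num_partitions


def route(events, num_partitions):
    """Append each event to its partition, preserving arrival order within it."""
    partitions = [[] for _ in range(num_partitions)]
    for event in events:
        partitions[partition_for(event["hotel_id"], num_partitions)].append(event)
    return partitions


events = [
    {"hotel_id": "h1", "date": "2025-07-01", "new_availability_count": 5},
    {"hotel_id": "h2", "date": "2025-07-01", "new_availability_count": 9},
    {"hotel_id": "h1", "date": "2025-07-01", "new_availability_count": 4},
]
partitions = route(events, num_partitions=8)
```

Because every event for h1 hashes to the same partition, a single consumer sees the count go from 5 to 4 in that order, never the reverse.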
Data Processing
Processing Model: Streaming (Event-driven).
Processing DAG: Source (Kafka) -> Transformation (JSON parsing) -> Output Sink (Elasticsearch API).
Correctness Guarantees: At-least-once delivery with idempotent updates in Elasticsearch (using doc_as_upsert).
Technical Selection: Custom Go/Java Sync Worker (lightweight and low cost for MVP).
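The reason at-least-once delivery is safe here is that each event carries an absolute count, so applying it twice leaves the index unchanged. The sketch below uses a plain dict in place of the Elasticsearch index; the document id format and the apply_event helper are assumptions for illustration.

```python
def apply_event(index: dict, event: dict) -> None:
    """Upsert one availability document; replaying the same event is a no-op."""
    doc_id = f'{event["hotel_id"]}:{event["date"]}'
    # Create the document if missing, otherwise merge fields into it
    # (the same effect an update with doc_as_upsert has on a real index).
    doc = index.setdefault(doc_id, {})
    doc.update(
        hotel_id=event["hotel_id"],
        date=event["date"],
        available=event["new_availability_count"],  # absolute value, not a delta
    )


def replay(events) -> dict:
    index = {}
    for event in events:
        apply_event(index, event)
    return index


e1 = {"hotel_id": "h1", "date": "2025-07-01", "new_availability_count": 5}
e2 = {"hotel_id": "h1", "date": "2025-07-01", "new_availability_count": 4}

once = replay([e1, e2])
with_duplicates = replay([e1, e1, e2, e2])
```

Had the payload been a delta such as "minus one room", a redelivered event would have decremented the count twice.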
Wrap Up
Advanced Topics
Trade-offs: We chose Eventual Consistency for Search. A user might see a room as "available" in search but find it "taken" at the booking step. This is a standard industry trade-off (in PACELC terms, search favors availability and low latency over consistency, while booking favors consistency).
Reliability: We use a Write-Ahead Log (WAL) and CDC to ensure no inventory change is lost between the DB and Search Index.
Bottleneck Analysis: The primary DB shard for a world-famous hotel (e.g., Bellagio) could become a hot spot. Solution: Use Redis-based distributed locks or "virtual sharding" for inventory if a single hotel exceeds 1k TPS.
Security: PII (User names/emails) is encrypted at rest using AES-256. Payment data is never stored locally; we only store a "token" from the Payment Provider.
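Distinguishing Insight: To handle high concurrency, instead of a SELECT ... FOR UPDATE followed by a separate write, we use a single conditional statement: UPDATE RoomInventory SET reserved_rooms = reserved_rooms + 1 WHERE hotel_id = ? AND room_type_id = ? AND date = ? AND reserved_rooms < total_rooms. The availability check and the increment happen atomically inside the engine, so the row lock is held only for the duration of the statement rather than across a client round trip, and overbooking is prevented at the engine level. If the statement reports zero affected rows, the room is sold out.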
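The conditional update can be tried out end to end with SQLite from the Python standard library. This is only a sketch: the design targets PostgreSQL, the reserve_one helper is hypothetical, and the room_type_id filter is included on the assumption that it is part of the RoomInventory key as listed in the table design.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    """Create an in-memory RoomInventory table matching the schema in the text."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """
        CREATE TABLE RoomInventory (
            hotel_id       INTEGER NOT NULL,
            date           TEXT    NOT NULL,
            room_type_id   INTEGER NOT NULL,
            total_rooms    INTEGER NOT NULL,
            reserved_rooms INTEGER NOT NULL DEFAULT 0,
            PRIMARY KEY (hotel_id, date, room_type_id)
        )
        """
    )
    return conn


def reserve_one(conn, hotel_id, date, room_type_id) -> bool:
    """Atomically reserve one room; returns False if none are left."""
    cur = conn.execute(
        "UPDATE RoomInventory "
        "SET reserved_rooms = reserved_rooms + 1 "
        "WHERE hotel_id = ? AND date = ? AND room_type_id = ? "
        "AND reserved_rooms < total_rooms",
        (hotel_id, date, room_type_id),
    )
    conn.commit()
    # Zero affected rows means the guard failed: the room type is sold out.
    return cur.rowcount == 1


conn = make_db()
conn.execute(
    "INSERT INTO RoomInventory (hotel_id, date, room_type_id, total_rooms) "
    "VALUES (1, '2025-07-01', 10, 2)"
)
results = [reserve_one(conn, 1, "2025-07-01", 10) for _ in range(3)]
reserved = conn.execute("SELECT reserved_rooms FROM RoomInventory").fetchone()[0]
```

With two rooms in stock, the first two calls succeed and the third is rejected; the caller only needs to inspect the affected-row count.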