The Question
Design

Hotel Reservation System

Design a scalable hotel reservation system that allows users to search for hotels based on location and availability, and ensures strong consistency for bookings to prevent double-booking. The system should handle high search traffic and manage a transient reservation state during the payment process.
PostgreSQL
Redis
Elasticsearch
Kafka
CDC
Questions & Insights

Clarifying Questions

Scale and Scope: What is the expected scale in terms of the number of hotels and daily active users (DAU)?
Booking Concurrency: How should the system handle simultaneous booking attempts for the last available room in a popular hotel?
Search Latency: Is real-time availability required in search results, or is a slight delay (eventual consistency) acceptable for the search index?
Payment Integration: Should the system handle payments synchronously during the booking flow or asynchronously?
Assumptions:
Scale: 100,000 hotels, 10 million DAU, 500 QPS for bookings, 10,000 QPS for searches.
Consistency: Strong consistency is mandatory for room inventory to prevent double-booking.
Inventory: We manage inventory at the RoomType level per day (e.g., 10 Deluxe rooms available on June 1st).
Payment: MVP will use a 3rd party provider (Stripe/PayPal) with a redirect or component-based integration.

Thinking Process

The Inventory Bottleneck: How do we ensure two people don't book the last room? Use RDBMS transactions with row-level locking, or an atomic conditional decrement (decrement available_rooms only while it is greater than zero), to maintain strict consistency.
The Search Performance: How to handle 10k QPS of searches across millions of dates/room combinations? Use a dedicated Search Service backed by a search-optimized engine (Elasticsearch) and a caching layer.
The Booking Lifecycle: How to handle the state transition from "Selecting" to "Paid"? Implement a "Reservation Hold" (TTL-based) to temporarily lock inventory while the user completes payment.
Data Modeling: How to structure inventory for fast queries? Use an Inventory table that stores room_count per hotel_id, room_type_id, and date.
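The inventory decrement and the TTL-based hold described above can be sketched together. This is a minimal illustration, not a definitive implementation: SQLite stands in for PostgreSQL so the example is self-contained, and the holds table, the function names place_hold and release_expired, and the 10-minute window are all assumptions that do not appear in the original design.

```python
import sqlite3
import uuid

HOLD_TTL_SECONDS = 600  # assumption: 10-minute payment window


def make_db() -> sqlite3.Connection:
    """In-memory schema; SQLite stands in for PostgreSQL in this sketch."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE inventory (
            hotel_id INTEGER, room_type_id INTEGER, date TEXT,
            available_rooms INTEGER,
            PRIMARY KEY (hotel_id, room_type_id, date));
        CREATE TABLE holds (
            hold_id TEXT, hotel_id INTEGER, room_type_id INTEGER,
            date TEXT, expires_at REAL,
            PRIMARY KEY (hold_id, date));
        """
    )
    return conn


def place_hold(conn, hotel_id, room_type_id, nights, now):
    """Hold one room for every night, or change nothing. Returns hold_id or None."""
    hold_id = str(uuid.uuid4())
    try:
        with conn:  # one transaction: commit on success, roll back on error
            for night in nights:
                cur = conn.execute(
                    """UPDATE inventory
                          SET available_rooms = available_rooms - 1
                        WHERE hotel_id = ? AND room_type_id = ? AND date = ?
                          AND available_rooms > 0""",
                    (hotel_id, room_type_id, night),
                )
                if cur.rowcount != 1:  # sold out (or no inventory row) that night
                    raise LookupError(night)
                conn.execute(
                    "INSERT INTO holds VALUES (?, ?, ?, ?, ?)",
                    (hold_id, hotel_id, room_type_id, night, now + HOLD_TTL_SECONDS),
                )
        return hold_id
    except LookupError:
        return None


def release_expired(conn, now):
    """Return rooms from expired holds to inventory; returns released night count."""
    with conn:
        expired = conn.execute(
            "SELECT hotel_id, room_type_id, date FROM holds WHERE expires_at <= ?",
            (now,),
        ).fetchall()
        for hotel_id, room_type_id, night in expired:
            conn.execute(
                """UPDATE inventory
                      SET available_rooms = available_rooms + 1
                    WHERE hotel_id = ? AND room_type_id = ? AND date = ?""",
                (hotel_id, room_type_id, night),
            )
        conn.execute("DELETE FROM holds WHERE expires_at <= ?", (now,))
    return len(expired)
```

The availability check lives in the WHERE clause of the UPDATE, so the check and the decrement happen as one statement. A multi-night stay either holds every night or rolls back entirely. In a real deployment release_expired would presumably run as a periodic job, and a confirmed payment would delete the hold rows without restoring inventory.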

Bonus Points

Database Sharding strategy: Shard the Inventory and Booking databases by hotel_id to ensure that all operations for a single hotel happen within a single shard, avoiding cross-shard transactions.
Optimistic Locking: Use a version field in the inventory table to handle high-concurrency updates without long-held pessimistic locks.
CDC for Search Sync: Use Change Data Capture (CDC) from the Primary DB to Elasticsearch to ensure the search index stays updated with minimal lag and no dual-write anomalies.
Idempotency Keys: Use client-generated idempotency keys for all booking requests to handle network retries safely.
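The optimistic-locking idea can be illustrated with a compare-and-set update on the version column. The sketch below is hedged in the same way as any toy example: SQLite replaces PostgreSQL, a single hard-coded inventory row is used, and the helper names (read_row, compare_and_decrement, book_with_retry) and the retry limit are assumptions made for illustration.

```python
import sqlite3

KEY = (1, 1, "2025-06-01")  # hypothetical (hotel_id, room_type_id, date)


def make_inventory(available: int) -> sqlite3.Connection:
    """One inventory row with a version counter; SQLite stands in for PostgreSQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        """CREATE TABLE inventory (
               hotel_id INTEGER, room_type_id INTEGER, date TEXT,
               available_rooms INTEGER, version INTEGER,
               PRIMARY KEY (hotel_id, room_type_id, date))"""
    )
    conn.execute("INSERT INTO inventory VALUES (?, ?, ?, ?, 0)", (*KEY, available))
    conn.commit()
    return conn


def read_row(conn):
    """Read the current availability and version without taking a lock."""
    return conn.execute(
        """SELECT available_rooms, version FROM inventory
            WHERE hotel_id = ? AND room_type_id = ? AND date = ?""",
        KEY,
    ).fetchone()


def compare_and_decrement(conn, expected_version: int) -> bool:
    """Decrement only if nobody has changed the row since it was read."""
    with conn:
        cur = conn.execute(
            """UPDATE inventory
                  SET available_rooms = available_rooms - 1,
                      version = version + 1
                WHERE hotel_id = ? AND room_type_id = ? AND date = ?
                  AND version = ? AND available_rooms > 0""",
            (*KEY, expected_version),
        )
    return cur.rowcount == 1


def book_with_retry(conn, max_attempts: int = 3) -> bool:
    """Re-read and retry when another writer won the race."""
    for _ in range(max_attempts):
        available, version = read_row(conn)
        if available <= 0:
            return False
        if compare_and_decrement(conn, version):
            return True
    return False
```

If two clients both read version 0, only the first update matches the WHERE clause; the second affects zero rows and must re-read. No lock is held between the read and the write, which is the property the bullet above is after.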
Design Breakdown

Functional Requirements

Users can search for hotels by location and date range.
Users can view hotel details and room availability.
Users can reserve a room and receive a confirmation.
Users can cancel a reservation.
Hotel managers can update room prices and availability.

Non-Functional Requirements

High Consistency: Zero double-bookings (Critical for business).
High Availability: Search and browsing must be available even if the booking service is degraded.
Low Latency: Search results should return in < 500ms.
Scalability: Handle seasonal spikes (e.g., holidays).

Estimation

Storage: 100k hotels × 500 days of inventory ahead × 5 room types = 250M rows. At ~100 bytes/row, this is ~25GB. Manageable in a single RDS instance, but sharding is better for QPS.
Bandwidth: 10k search QPS * 10KB response = 100MB/s.
Writes: 500 booking QPS is well within the limits of a tuned PostgreSQL/MySQL instance.
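The storage and bandwidth figures can be reproduced with a few lines of arithmetic. The constants come from the assumptions stated earlier; decimal units (1 KB = 1,000 bytes) are assumed here to match the round numbers in the estimate.

```python
HOTELS = 100_000
DAYS_AHEAD = 500
ROOM_TYPES = 5
BYTES_PER_ROW = 100          # rough per-row size from the estimate
SEARCH_QPS = 10_000
RESPONSE_BYTES = 10_000      # 10 KB per search response (decimal units assumed)

inventory_rows = HOTELS * DAYS_AHEAD * ROOM_TYPES
storage_gb = inventory_rows * BYTES_PER_ROW / 1e9
bandwidth_mb_s = SEARCH_QPS * RESPONSE_BYTES / 1e6

print(f"inventory rows: {inventory_rows:,}")      # 250,000,000
print(f"storage: {storage_gb} GB")                # 25.0 GB
print(f"search bandwidth: {bandwidth_mb_s} MB/s") # 100.0 MB/s
```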

Blueprint

Concise Summary: A microservices-based architecture using an RDBMS for transactional integrity of bookings and an inverted index for performant hotel discovery.
Major Components:
Search Service: Handles location-based and availability-based filtering using Elasticsearch.
Booking Service: Manages the transactional lifecycle of a reservation and inventory counts in a Relational DB.
Inventory DB: A sharded RDBMS storing daily room availability counts per hotel.
Payment Service: Orchestrates the 3rd party payment flow.
Simplicity Audit: This design avoids complex distributed transactions by sharding data such that inventory and bookings for a hotel live together.
Architecture Decision Rationale:
Why this architecture?: An RDBMS is the standard choice for financial and inventory consistency. Elasticsearch handles complex queries such as "find a hotel in NYC with free rooms between X and Y" better than SQL.
Functional Satisfaction: Covers the full flow from discovery to payment.
Non-functional Satisfaction: Separation of Search and Booking services ensures that high search traffic doesn't starve the booking transactions of resources.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling: Stateless microservices deployed in Multi-AZ Kubernetes clusters. Scale Search Service based on QPS; scale Booking Service based on CPU/Database connection limits.
API Schema Design:
POST /v1/reservations: Request (hotel_id, room_type, dates, idempotency_key). Returns (reservation_id, status: PENDING).
GET /v1/hotels/search: Request (location, check_in, check_out, guests). Returns List of Hotel objects.
Resilience:
Circuit Breaker: Used for Payment Service calls.
Retry: Exponential backoff for Inventory DB updates.
Security: JWT-based AuthN. RBAC for Hotel Managers vs. Customers.
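To make the idempotency_key in POST /v1/reservations concrete, here is a minimal in-memory sketch of how a handler might deduplicate retries. The ReservationApi class and its dictionary store are hypothetical; a real service would more likely persist the key under a unique constraint in the same database transaction that creates the reservation.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class Reservation:
    reservation_id: str
    status: str


class ReservationApi:
    """Hypothetical handler for POST /v1/reservations (in-memory sketch)."""

    def __init__(self) -> None:
        self._by_key: dict[str, Reservation] = {}
        self.created = 0  # counts reservations actually created

    def create(self, idempotency_key: str, hotel_id: int,
               room_type: str, dates: tuple[str, str]) -> Reservation:
        # A retry with the same key returns the original result
        # instead of creating a second reservation.
        existing = self._by_key.get(idempotency_key)
        if existing is not None:
            return existing
        reservation = Reservation(str(uuid.uuid4()), "PENDING")
        self._by_key[idempotency_key] = reservation
        self.created += 1
        return reservation
```

A client that times out and resends the same request with the same key gets back the same reservation_id, so a network retry cannot produce a duplicate booking.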

Storage

Access Pattern:
Search: High read, complex filters.
Booking: Medium write, high consistency requirement.
Database Table Design (Booking DB):
Inventory: [hotel_id (PK), date (PK), room_type_id (PK), total_rooms, available_rooms, version].
Reservations: [id (PK), user_id, hotel_id, room_type_id, start_date, end_date, status (PENDING, CONFIRMED, CANCELLED)].
Technical Selection: PostgreSQL, chosen for its ACID transactions, row-level locking, and strong performance on complex joins.
Distribution Logic: Shard by hotel_id. This ensures that Inventory and Reservations for the same hotel are co-located, allowing for local ACID transactions when checking/deducting inventory.
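Routing by hotel_id could look like the sketch below. The shard count of 16 and the hash-then-modulo scheme are assumptions for illustration; the design only specifies that hotel_id is the shard key.

```python
import hashlib

NUM_SHARDS = 16  # assumed shard count, not specified in the design


def shard_for(hotel_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a hotel to a shard with a stable hash of its id.

    A cryptographic hash is used only because it is stable across
    processes, unlike Python's built-in hash() for strings.
    """
    digest = hashlib.sha256(str(hotel_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Inventory and Reservations rows for hotel 42 resolve to the same shard,
# so checking and deducting inventory stays a local transaction.
inventory_shard = shard_for(42)
reservation_shard = shard_for(42)
```

Plain modulo hashing moves most keys when the shard count changes, so a lookup directory or consistent hashing may be preferable if resharding is expected.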

Cache

Purpose & Justification: Reduces load on Search Index for popular search queries (e.g., "London next weekend").
Key-Value Schema: search:{location}:{dates}:{guests} -> List<HotelIDs>. TTL: 5-10 minutes.
Technical Selection: Redis. Used for its speed and TTL support.
Failure Handling: If Redis is down, fallback to Elasticsearch.
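The cache-aside read path with a fallback can be sketched as follows. TtlCache is an in-memory stand-in for Redis, and search_index is a placeholder callable for the Elasticsearch query; both, along with the exact key format and the 300-second default TTL, are assumptions made so the example runs on its own.

```python
import time


def search_cache_key(location: str, check_in: str, check_out: str, guests: int) -> str:
    """Build the search:{location}:{dates}:{guests} key, normalising the location."""
    return f"search:{location.strip().lower()}:{check_in}_{check_out}:{guests}"


class TtlCache:
    """In-memory stand-in for Redis with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:  # expired: behave like a miss
            del self._data[key]
            return None
        return value

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)


def cached_search(cache, search_index, location, check_in, check_out, guests,
                  ttl_seconds=300):
    """Cache-aside lookup; a cache outage degrades to a direct index query."""
    key = search_cache_key(location, check_in, check_out, guests)
    try:
        hit = cache.get(key)
    except ConnectionError:
        return search_index(location, check_in, check_out, guests)
    if hit is not None:
        return hit
    result = search_index(location, check_in, check_out, guests)
    try:
        cache.set(key, result, ttl_seconds)
    except ConnectionError:
        pass  # caching is best-effort; the result is still returned
    return result
```

Normalising the location before building the key keeps "London" and " london " from occupying separate cache entries.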

Messaging

Purpose & Decoupling: Decouples the Booking Service from long-running or unreliable tasks like sending emails or updating the Search Index.
Event / Topic Schema: booking.confirmed, booking.cancelled. Payload contains reservation_id and metadata.
Technical Selection: Kafka. Provides durability and replayability.
Failure Handling: Dead-letter queues (DLQ) for failed notification attempts.
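The event payload and dead-letter handling might be modelled as below. No Kafka client is used; the list named dead_letters stands in for a DLQ topic, and the BookingEvent fields beyond reservation_id (hotel_id, occurred_at), as well as the limit of three attempts, are assumed for illustration.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BookingEvent:
    topic: str            # "booking.confirmed" or "booking.cancelled"
    reservation_id: str
    hotel_id: int         # assumed metadata field
    occurred_at: str      # assumed metadata field, ISO-8601 timestamp


def encode(event: BookingEvent) -> bytes:
    """Serialise the event as JSON bytes, as it might be written to a topic."""
    return json.dumps(asdict(event), sort_keys=True).encode("utf-8")


def consume(event: BookingEvent, handler, dead_letters: list,
            max_attempts: int = 3) -> bool:
    """Try the handler a bounded number of times, then park the event in the DLQ."""
    for _ in range(max_attempts):
        try:
            handler(event)
            return True
        except Exception:
            continue  # a real consumer would log and back off here
    dead_letters.append(event)
    return False
```

Parking a failed event instead of retrying forever keeps one bad notification from blocking the events behind it.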
Wrap Up

Advanced Topics

Monitoring: Monitor "P99 Latency" for Search and "Transaction Failure Rate" for Bookings.
Trade-offs: We trade off absolute real-time accuracy in search results for high performance. A hotel might show as "available" in search but fail at the booking stage (handled gracefully by the UI).
Bottlenecks: The primary database for a specific shard could become a bottleneck if a single hotel (e.g., during a major event) receives massive traffic.
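Optimization: For hot inventory rows, where optimistic version checks would cause many retries, use pessimistic locking (SELECT ... FOR UPDATE) on the specific inventory row for the duration of the short (~2-second) booking transaction, so that no overbooking occurs under heavy load.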
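The effect of pessimistic locking can be simulated in-process. In the sketch below a per-row threading.Lock plays the role that a row lock taken by SELECT ... FOR UPDATE would play in PostgreSQL; the LockedInventory class is hypothetical and shows only the lock-then-check-then-write pattern, not real database behaviour.

```python
import threading
from collections import defaultdict


class LockedInventory:
    """In-process analogy for row-level pessimistic locking."""

    def __init__(self, available: dict):
        self._available = dict(available)
        self._row_locks = defaultdict(threading.Lock)
        self._locks_guard = threading.Lock()  # protects the lock table itself

    def _lock_for(self, key):
        with self._locks_guard:
            return self._row_locks[key]

    def book(self, key) -> bool:
        # Acquire the "row lock" first, then read and write while holding it,
        # so no other booking can interleave between the check and the update.
        with self._lock_for(key):
            if self._available.get(key, 0) <= 0:
                return False
            self._available[key] -= 1
            return True

    def remaining(self, key) -> int:
        with self._lock_for(key):
            return self._available.get(key, 0)
```

With 50 concurrent attempts on a row holding 10 rooms, exactly 10 calls to book succeed. The cost is that contending requests queue behind the lock, which is why the transaction holding it should stay short.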