The Question
Design

Hotel Reservation System Design

Design a global hotel reservation system like Booking.com or Expedia. The system should support hotel discovery (search by location/date), real-time availability management, and a secure booking process. Key challenges include handling high-concurrency bookings for limited inventory without double-booking, ensuring low-latency search across millions of records, and maintaining data consistency across the reservation and payment lifecycles. Discuss how you would handle 10 million daily users and the trade-offs between consistency and availability in a distributed environment.
PostgreSQL
PostGIS
Redis
Kafka
CDN
JWT
Stripe
CDC
Microservices
Questions & Insights

Clarifying Questions

What is the scale of the system? (e.g., Number of hotels, rooms, and daily active users?)
Assumption: 500,000 hotels, 20 million rooms, 10 million DAU, and 500,000 bookings per day.
What is the search-to-booking ratio?
Assumption: 100:1. Users search much more than they book, making the search path read-heavy and the booking path write-heavy with high consistency requirements.
How should we handle overbooking?
Assumption: The system must strictly prevent technical double-bookings. Business-level overbooking (e.g., 110% capacity) is handled via a configurable buffer in the inventory service.
Do we need to handle payments within the system?
Assumption: We integrate with external providers (Stripe/PayPal). We must manage the payment state (Pending, Paid, Refunded).

Thinking Process

Core Bottleneck: Preventing double-bookings under high concurrency.
Strategy:
How do we ensure search is fast? (Denormalized inventory plus Elasticsearch or spatial indexing).
How do we handle the "thundering herd" on popular dates? (Redis-based inventory counters).
How do we guarantee consistency during booking? (RDBMS transactions with Pessimistic/Optimistic locking).
How do we handle distributed failures between Booking and Payment? (Transactional Outbox pattern or Saga).

Bonus Points

Inventory Partitioning: Sharding the inventory database by hotel_id to ensure that bookings for different hotels don't contend for the same database locks.
Optimistic Locking with Versioning: Using a version column in the inventory table to handle high-concurrency room captures without long-held pessimistic locks.
Availability-Consistency Trade-off: Using a "Reserve-then-Pay" flow with a TTL (Time-to-Live) on the reservation to hold the room for 10-15 minutes, balancing ACID requirements with user experience.
Geo-sharding: Deploying search services and read-replicas in multiple regions to reduce latency for global users.
Design Breakdown

Functional Requirements

Core Use Cases:
Users can search for hotels by location, date range, and room type.
Users can view hotel details and real-time room availability.
Users can reserve a room and make a payment.
Users can cancel or modify a reservation.
Hotel managers can update room prices and inventory.
Scope Control:
In-scope: Search, Booking, Inventory Management, Payment Integration.
Out-of-scope: User reviews/ratings, Loyalty programs, Flight/Car rental integrations.

Non-Functional Requirements

Scale: Support 10M DAU and 500k bookings/day.
Latency: Search results under 200ms; booking confirmation under 1s.
Availability & Reliability: 99.99% availability; zero data loss for confirmed bookings.
Consistency: Strong consistency for inventory (no double-booking).
Security: PCI-DSS compliance for payment handling; TLS for all traffic.

Estimation

Traffic:
Search QPS: $10\text{M users} \times 20 \text{ searches/day} / 86{,}400 \text{ s} \approx 2{,}300 \text{ QPS}$.
Peak Search QPS (5x): $5 \times 2{,}300 = 11{,}500 \text{ QPS}$.
Booking QPS: $500{,}000 / 86{,}400 \approx 6 \text{ QPS}$.
Storage:
Hotels: $500\text{k} \times 5\text{ KB} = 2.5\text{ GB}$.
Rooms: $20\text{M} \times 1\text{ KB} = 20\text{ GB}$.
Reservations (2 years): $500\text{k} \times 365 \times 2 \times 2\text{ KB} \approx 730\text{ GB}$.
Bandwidth:
Search: $11{,}500 \text{ QPS} \times 10\text{ KB per result} \approx 115\text{ MB/s}$.
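
The arithmetic above is easy to re-run when an assumption changes. The sketch below recomputes the same figures from the stated inputs; the constant names and the decimal unit conversions (1 GB = 10^6 KB) are choices made for this example. The unrounded peak comes out slightly above the 11,500 QPS quoted, because the estimate rounds the average to 2,300 before multiplying.

```python
SECONDS_PER_DAY = 86_400
DAU = 10_000_000
SEARCHES_PER_USER_PER_DAY = 20
BOOKINGS_PER_DAY = 500_000
PEAK_FACTOR = 5
KB_PER_GB = 1_000_000  # decimal units, matching the back-of-envelope style

def estimate():
    """Return the capacity estimates as a dict of unrounded numbers."""
    search_qps = DAU * SEARCHES_PER_USER_PER_DAY / SECONDS_PER_DAY
    peak_search_qps = search_qps * PEAK_FACTOR
    return {
        "search_qps": search_qps,
        "peak_search_qps": peak_search_qps,
        "booking_qps": BOOKINGS_PER_DAY / SECONDS_PER_DAY,
        "hotels_gb": 500_000 * 5 / KB_PER_GB,            # 5 KB per hotel
        "rooms_gb": 20_000_000 * 1 / KB_PER_GB,          # 1 KB per room
        # 2 years of reservations at 2 KB each
        "reservations_gb": BOOKINGS_PER_DAY * 365 * 2 * 2 / KB_PER_GB,
        "search_mb_per_s": peak_search_qps * 10 / 1_000,  # 10 KB per result
    }

if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name}: {value:,.1f}")
```
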

Blueprint

Concise Summary: A microservices architecture centered around a strictly consistent Inventory and Reservation service using a Relational Database, supplemented by a high-performance Search service backed by a search-optimized index.
Major Components:
API Gateway: Handles authentication, rate limiting, and request routing.
Search Service: Provides low-latency hotel discovery using spatial indexing.
Reservation Service: Manages the lifecycle of a booking and ensures ACID compliance.
Inventory Service: Tracks room availability using row-level locking or versioning.
Payment Service: Orchestrates interactions with 3rd-party payment gateways.
Simplicity Audit: This design avoids complex distributed transactions (2PC) by using a state-machine based Reservation service and a temporary "Hold" on inventory.
Architecture Decision Rationale:
RDBMS: Chosen for Inventory/Reservations because ACID properties are non-negotiable for financial and booking data.
Redis: Used as a cache to offload the Read-heavy search traffic from the primary DB.
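
The "state-machine based Reservation service" mentioned in the Simplicity Audit could look roughly like this. The three statuses come from the Reservations table in this answer; the allowed-transition map, the exception name, and the choice to treat an expired hold as CANCELLED are assumptions for illustration.

```python
from enum import Enum

class Status(Enum):
    PENDING = "PENDING"      # inventory is held, payment not yet received
    CONFIRMED = "CONFIRMED"  # payment succeeded
    CANCELLED = "CANCELLED"  # user cancelled, payment failed, or hold expired

# Hypothetical transition table: every move not listed here is rejected.
ALLOWED = {
    Status.PENDING: {Status.CONFIRMED, Status.CANCELLED},
    Status.CONFIRMED: {Status.CANCELLED},
    Status.CANCELLED: set(),
}

class InvalidTransition(Exception):
    """Raised when a reservation is asked to make an illegal state change."""

def transition(current: Status, target: Status) -> Status:
    """Return the new status, or raise if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise InvalidTransition(f"{current.value} -> {target.value}")
    return target
```

Keeping the legal moves in one table means a late payment callback for an already cancelled reservation is rejected explicitly instead of silently resurrecting the booking.
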

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Use a CDN (e.g., Cloudflare) for static assets (hotel images).
Security & Perimeter: API Gateway handles JWT validation and Rate Limiting (e.g., 100 requests/min per IP) to prevent scraping.
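
As an illustration of the per-IP limit, here is a minimal fixed-window counter. The answer does not say which algorithm the gateway uses, so the fixed window is a stand-in chosen for brevity (token bucket or sliding window are common alternatives); the class and parameter names are hypothetical, and a real gateway would keep the counters in a shared store rather than process memory.

```python
import time

class FixedWindowLimiter:
    """Allow at most `limit` requests per IP in each fixed time window."""

    def __init__(self, limit=100, window_seconds=60, clock=time.monotonic):
        self.limit = limit
        self.window = window_seconds
        self.clock = clock      # injectable so tests can control time
        self.state = {}         # ip -> (window_index, request_count)

    def allow(self, ip):
        index = int(self.clock() // self.window)
        last_index, count = self.state.get(ip, (index, 0))
        if last_index != index:
            count = 0           # a new window started: reset the counter
        if count >= self.limit:
            self.state[ip] = (index, count)
            return False
        self.state[ip] = (index, count + 1)
        return True
```

A fixed window lets a client burst up to twice the limit across a window boundary, which is usually acceptable for anti-scraping purposes.
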

Service

Reservation Service:
API: POST /v1/reservations (Create), GET /v1/reservations/{id} (Status).
Flow: When a user selects a room, the service calls Inventory Service to "Lock" the room for 15 minutes.
Idempotency: Use a client_key (UUID) to prevent duplicate bookings if the user clicks "Submit" twice.
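
A minimal sketch of the idempotency check, assuming the client generates one client_key per booking attempt and resends it on retries. The in-memory dictionary stands in for a unique index on client_key in the Reservations table; the class and method names are hypothetical.

```python
import uuid

class ReservationStore:
    """In-memory stand-in for a table with a unique index on client_key."""

    def __init__(self):
        self._by_client_key = {}

    def create(self, client_key, payload):
        """Return (reservation, created). Replays return the original row."""
        existing = self._by_client_key.get(client_key)
        if existing is not None:
            return existing, False  # duplicate submit: no new booking
        reservation = {"res_id": str(uuid.uuid4()), "status": "PENDING", **payload}
        self._by_client_key[client_key] = reservation
        return reservation, True
```

Usage: a double click sends the same key twice, and the second call returns the first reservation instead of creating another.

```python
store = ReservationStore()
key = str(uuid.uuid4())
first, created = store.create(key, {"user_id": 7, "room_type_id": 10})
again, created_again = store.create(key, {"user_id": 7, "room_type_id": 10})
```
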
Search Service:
Uses geo-sharding. Query filters: location (lat/long), check-in date, check-out date, room type.
Joins Hotel metadata with a "Pre-calculated Availability" table.
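
The join described above can be sketched as a single query: filter hotels by area, then keep only those with a free room on every night of the stay. This uses SQLite and a geohash prefix match so it runs standalone; the availability table, its column names, the room_type category column, and the sample rows are assumptions for the example. The check-out date is treated as exclusive.

```python
import sqlite3
from datetime import date

def make_demo_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE hotels (hotel_id INTEGER PRIMARY KEY, name TEXT,
                             location_geohash TEXT);
        CREATE TABLE availability (hotel_id INTEGER, room_type TEXT, date TEXT,
                                   available INTEGER,
                                   PRIMARY KEY (hotel_id, room_type, date));
    """)
    conn.executemany("INSERT INTO hotels VALUES (?, ?, ?)", [
        (1, "Hotel A", "u09tvw"), (2, "Hotel B", "u09tun"), (3, "Hotel C", "dr5reg")])
    conn.executemany("INSERT INTO availability VALUES (?, ?, ?, ?)", [
        (1, "double", "2025-07-01", 3), (1, "double", "2025-07-02", 1),
        (2, "double", "2025-07-01", 2), (2, "double", "2025-07-02", 0),
        (3, "double", "2025-07-01", 5), (3, "double", "2025-07-02", 5)])
    conn.commit()
    return conn

def search(conn, geohash_prefix, check_in, check_out, room_type):
    """Hotels in the geohash cell with a free room on every night of the stay."""
    nights = (check_out - check_in).days
    return conn.execute(
        "SELECT h.hotel_id, h.name FROM hotels h"
        " JOIN availability a ON a.hotel_id = h.hotel_id"
        " WHERE h.location_geohash LIKE ? AND a.room_type = ?"
        "   AND a.date >= ? AND a.date < ?"
        " GROUP BY h.hotel_id, h.name"
        # Every night must have a row, and no night may be sold out.
        " HAVING COUNT(*) = ? AND MIN(a.available) > 0"
        " ORDER BY h.hotel_id",
        (geohash_prefix + "%", room_type,
         check_in.isoformat(), check_out.isoformat(), nights),
    ).fetchall()
```

A plain prefix match misses hotels just across a geohash cell boundary, so a fuller version would also query the neighbouring cells or use a proper spatial index.
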

Storage

Access Pattern: Reservation is Write-heavy during peak; Inventory is Read-Write; Search is Read-Heavy.
Database Table Design:
Hotels: hotel_id (PK), name, location_geohash, details.
RoomTypes: room_type_id, hotel_id, base_price, capacity.
Inventory: hotel_id, room_type_id, date, total_inventory, reserved_count. (PK: room_type_id + date).
Reservations: res_id, user_id, hotel_id, room_type_id, start_date, end_date, status (PENDING, CONFIRMED, CANCELLED).
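
To show how this table design prevents double-booking, the sketch below reserves every night of a stay inside one transaction, using a conditional UPDATE as the guard. It uses SQLite so it is runnable as-is; the column types, the CHECK constraint, and the function names are assumptions, and the configurable overbooking buffer is left out.

```python
import sqlite3
from datetime import date, timedelta

SCHEMA = """
CREATE TABLE inventory (
    hotel_id INTEGER NOT NULL,
    room_type_id INTEGER NOT NULL,
    date TEXT NOT NULL,
    total_inventory INTEGER NOT NULL,
    reserved_count INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (room_type_id, date),
    CHECK (reserved_count <= total_inventory)
);
CREATE TABLE reservations (
    res_id INTEGER PRIMARY KEY AUTOINCREMENT,
    user_id INTEGER, hotel_id INTEGER, room_type_id INTEGER,
    start_date TEXT, end_date TEXT, status TEXT
);
"""

def nights(start, end):
    """ISO dates for each night of the stay (end date is exclusive)."""
    return [(start + timedelta(days=i)).isoformat()
            for i in range((end - start).days)]

def book(conn, user_id, hotel_id, room_type_id, start, end):
    """Hold one room for every night atomically; return res_id or None."""
    stay = nights(start, end)
    if not stay:
        return None
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for night in stay:
                cur = conn.execute(
                    "UPDATE inventory SET reserved_count = reserved_count + 1"
                    " WHERE room_type_id = ? AND date = ?"
                    "   AND reserved_count < total_inventory",
                    (room_type_id, night),
                )
                if cur.rowcount != 1:
                    raise LookupError(night)  # night is full or missing
            cur = conn.execute(
                "INSERT INTO reservations (user_id, hotel_id, room_type_id,"
                " start_date, end_date, status) VALUES (?, ?, ?, ?, ?, 'PENDING')",
                (user_id, hotel_id, room_type_id,
                 start.isoformat(), end.isoformat()),
            )
            return cur.lastrowid
    except LookupError:
        return None  # earlier nights were rolled back with the transaction
```

Because the availability check and the increment are the same statement, two concurrent requests for the last room cannot both succeed, and a stay that fails on its second night releases the first night automatically.
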
Technical Selection: PostgreSQL for the transactional tables, with the PostGIS extension for geospatial search queries.
Distribution Logic: Shard Inventory and Reservations by hotel_id to ensure operations for a specific hotel occur on a single shard, allowing for local transactions.
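
A simple way to picture the routing: hash the hotel_id and take it modulo the shard count, so every row for one hotel lands on the same shard. The function name and the use of SHA-256 are assumptions for this sketch, not something the design prescribes.

```python
import hashlib

def shard_for(hotel_id, num_shards):
    """Map a hotel_id to a shard index that is stable across processes."""
    # A cryptographic hash is used because Python's built-in hash() of
    # strings is randomized per process and would break routing.
    digest = hashlib.sha256(str(hotel_id).encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Plain modulo routing moves most keys when the shard count changes, so a directory table or consistent hashing is the usual next step once resharding becomes a concern.
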

Cache

Purpose: Reduce DB load for availability checks during the search phase.
Key-Value Schema:
Key: inv:{hotel_id}:{date}
Value: {room_type_id: count}
TTL: 5 minutes (or invalidated via CDC when a booking is confirmed).
Failure Handling: If Redis is down, the system falls back to the Inventory DB (Performance degradation but no data loss).
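
The read path with its fallback can be sketched as a cache-aside lookup. To stay self-contained, the example uses a small in-memory class in place of a Redis client; that class, the CacheDown exception, and the load_from_db callback are hypothetical names. The key format and the 5-minute TTL come from the schema above.

```python
import json
import time

class CacheDown(Exception):
    """Raised by the cache stand-in when it is unreachable."""

class InMemoryCache:
    """Minimal stand-in for Redis GET/SETEX with per-key expiry."""

    def __init__(self, clock=time.monotonic):
        self.data = {}
        self.clock = clock
        self.available = True

    def get(self, key):
        if not self.available:
            raise CacheDown()
        entry = self.data.get(key)
        if entry is None or entry[1] <= self.clock():
            return None  # missing or expired
        return entry[0]

    def setex(self, key, ttl_seconds, value):
        if not self.available:
            raise CacheDown()
        self.data[key] = (value, self.clock() + ttl_seconds)

def get_availability(cache, load_from_db, hotel_id, day, ttl_seconds=300):
    """Return {room_type_id: count}, preferring the cache over the DB."""
    key = f"inv:{hotel_id}:{day}"
    try:
        cached = cache.get(key)
        if cached is not None:
            return json.loads(cached)
    except CacheDown:
        return load_from_db(hotel_id, day)  # degraded but still correct
    counts = load_from_db(hotel_id, day)
    try:
        # JSON turns dict keys into strings, so room_type_id keys are strings.
        cache.setex(key, ttl_seconds, json.dumps(counts))
    except CacheDown:
        pass  # a failed cache write must not fail the request
    return counts
```

Cached counts are only a hint for search results. The booking path still checks the Inventory DB, so a stale entry can show a room that is already gone but cannot cause a double-booking.
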

Messaging

Purpose: Asynchronous updates. When a reservation is confirmed, an event is published to update the Search Index/Cache and send a confirmation email.
Technical Selection: Kafka for high throughput and re-playability.
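
One way to publish the confirmation event reliably is the Transactional Outbox pattern named in the Thinking Process section: write the event row in the same transaction as the status change, then let a relay forward it. The sketch uses SQLite and a plain callback in place of a Kafka producer; the outbox table, the topic name, and the function names are assumptions.

```python
import json
import sqlite3

SCHEMA = """
CREATE TABLE reservations (res_id INTEGER PRIMARY KEY, status TEXT NOT NULL);
CREATE TABLE outbox (
    event_id INTEGER PRIMARY KEY AUTOINCREMENT,
    topic TEXT NOT NULL,
    payload TEXT NOT NULL,
    published INTEGER NOT NULL DEFAULT 0
);
"""

def confirm_reservation(conn, res_id):
    """Confirm a PENDING reservation and queue its event atomically."""
    with conn:  # status change and outbox row commit together or not at all
        cur = conn.execute(
            "UPDATE reservations SET status = 'CONFIRMED'"
            " WHERE res_id = ? AND status = 'PENDING'", (res_id,))
        if cur.rowcount != 1:
            return False  # unknown reservation or not in PENDING
        conn.execute(
            "INSERT INTO outbox (topic, payload) VALUES (?, ?)",
            ("reservation.confirmed", json.dumps({"res_id": res_id})))
    return True

def relay_outbox(conn, publish):
    """Forward unpublished events in order; return how many were sent."""
    rows = conn.execute(
        "SELECT event_id, topic, payload FROM outbox"
        " WHERE published = 0 ORDER BY event_id").fetchall()
    for event_id, topic, payload in rows:
        # If publish raises, the row stays unpublished and is retried later.
        publish(topic, payload)
        with conn:
            conn.execute(
                "UPDATE outbox SET published = 1 WHERE event_id = ?",
                (event_id,))
    return len(rows)
```

Delivery is at-least-once: a crash between publish and the UPDATE resends the event, so consumers such as the email sender should deduplicate on res_id.
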
Wrap Up

Advanced Topics

Trade-offs: We choose Consistency (CP) over Availability for the Booking flow. If the Inventory DB is down, users cannot book, which is preferable to overbooking.
Reliability:
Dead Letter Queues (DLQ): Used for failed payment notifications.
Expiration Worker: A background job (e.g., triggered by Redis key expiry or a cron schedule) that releases inventory held by "Pending" reservations if payment isn't received within 15 minutes.
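
A minimal sketch of such an expiration sweep, written as a function a cron job could call. For brevity it assumes each hold covers a single night and lives in a hypothetical holds table with a created_at timestamp in seconds; marking the expired hold as CANCELLED is also an assumption.

```python
import sqlite3

HOLD_SECONDS = 15 * 60  # hold window from the Reserve-then-Pay flow

SCHEMA = """
CREATE TABLE inventory (room_type_id INTEGER, date TEXT,
                        total_inventory INTEGER, reserved_count INTEGER,
                        PRIMARY KEY (room_type_id, date));
CREATE TABLE holds (res_id INTEGER PRIMARY KEY, room_type_id INTEGER,
                    date TEXT, status TEXT, created_at INTEGER);
"""

def release_expired_holds(conn, now):
    """Cancel PENDING holds older than the window and free their rooms."""
    cutoff = now - HOLD_SECONDS
    with conn:  # all releases in this sweep commit together
        expired = conn.execute(
            "SELECT res_id, room_type_id, date FROM holds"
            " WHERE status = 'PENDING' AND created_at <= ?"
            " ORDER BY res_id", (cutoff,)).fetchall()
        for res_id, room_type_id, night in expired:
            conn.execute(
                "UPDATE inventory SET reserved_count = reserved_count - 1"
                " WHERE room_type_id = ? AND date = ? AND reserved_count > 0",
                (room_type_id, night))
            conn.execute(
                "UPDATE holds SET status = 'CANCELLED'"
                " WHERE res_id = ? AND status = 'PENDING'", (res_id,))
    return [row[0] for row in expired]
```

The sweep is safe to run repeatedly because only PENDING holds are selected, so a hold that was already released or confirmed is never touched again.
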
Bottleneck Analysis:
Hot Hotel: A specific hotel goes viral. Fix: Distribute inventory across multiple rows or use Redis Lua scripts for atomic increments.
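
The "multiple rows" fix can be illustrated with an in-memory model: split one room type's count across several buckets so concurrent bookings rarely touch the same row. The class name, the bucket count, and the random probing order are assumptions for this sketch; in a database each bucket would be its own row updated with a conditional UPDATE.

```python
import random

class BucketedInventory:
    """Split one inventory count across several independently updated buckets."""

    def __init__(self, total, num_buckets, rng=None):
        base, extra = divmod(total, num_buckets)
        # Spread the remainder so bucket sizes differ by at most one.
        self.buckets = [base + (1 if i < extra else 0)
                        for i in range(num_buckets)]
        self.rng = rng or random.Random()

    def reserve(self):
        """Take one room from a random non-empty bucket; False if sold out."""
        order = list(range(len(self.buckets)))
        self.rng.shuffle(order)  # random start spreads contention
        for i in order:
            if self.buckets[i] > 0:
                self.buckets[i] -= 1
                return True
        return False

    def remaining(self):
        return sum(self.buckets)
```

The cost is on the read side: total availability becomes a sum over buckets, and the last few rooms may need several probes before a non-empty bucket is found.
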
Security: Use Vault for managing API keys for Payment Gateways. Use RBAC for hotel managers.