The Question
Design: Hotel Reservation System
Design a global hotel reservation system similar to Booking.com or Expedia. The system must support hotel discovery via search (location, dates, room types) and a high-integrity booking flow. A critical requirement is to ensure no two customers can book the same room for overlapping dates (zero overbooking). Consider how the system handles 10 million daily active users, global distribution, and high-concurrency scenarios during peak holiday seasons. Discuss your choice of consistency models, database selection, and how you maintain search performance while ensuring booking accuracy.
PostgreSQL
OpenSearch
Redis
Kafka
CDC
Saga Pattern
Kubernetes
CDN
Stripe
Questions & Insights
Clarifying Questions
Scale & Traffic: What is the expected scale in terms of the number of hotels and rooms? (Assumption: 100,000 hotels, 20 million rooms globally).
Concurrency: How should we handle overbooking or simultaneous booking attempts for the last room? (Assumption: Strict consistency is required for the final booking; no overbooking allowed for the MVP).
Search Patterns: Are users searching by city, date range, and price, or more complex filters like amenities? (Assumption: City, stay dates, and occupancy are the primary filters).
Payment: Should we build a payment gateway or integrate with a provider? (Assumption: Integrate with a 3rd party like Stripe/Adyen via a lightweight adapter).
Inventory Horizon: How far in advance can users book? (Assumption: Up to 2 years).
Thinking Process
The Core Bottleneck: The "Double Booking" problem. We must ensure that two users cannot book the same room for the same night simultaneously.
Progressive Logic:
How do we store room availability efficiently for 730 days (2 years)?
How do we handle high-read traffic for search without locking the main transactional database?
How do we implement a robust reservation flow that handles partial failures (e.g., payment success but booking failure)?
How do we scale globally while maintaining data integrity?
Bonus Points
Inventory Sharding: Using a composite sharding key (e.g., hotel_id + year_month) to prevent hot partitions on popular hotels. The trade-off is that a stay crossing a month boundary then touches two shards.
Idempotency Framework: Implementing a standardized idempotency layer using idempotency_key across the API Gateway and Service layer to handle retry storms safely.
Dual-Write Mitigation: Utilizing Change Data Capture (CDC) from the Reservation DB to update the Search Index (OpenSearch/Elasticsearch) to ensure eventual consistency without distributed transactions.
Optimistic Concurrency Control (OCC): Using versioning or "Update with Where" clauses (e.g., update inventory set available_rooms = available_rooms - 1 where room_id = ? and date = ? and available_rooms > 0) to minimize row-level locking duration.
Design Breakdown
Functional Requirements
Core Use Cases:
Users can search for available hotels by city and date range.
Users can view hotel details and room types.
Users can reserve a room and receive a confirmation.
Users can cancel a reservation.
Scope Control:
In-Scope: Search, Booking, Inventory Management, Payment Integration.
Out-of-Scope: Hotelier-side management portal, loyalty programs, dynamic pricing engines, and flight/car bundles.
Non-Functional Requirements
Scale: Support 10M DAU and 1,000+ bookings per second at peak.
Latency: Search results returned in < 500ms; booking confirmation in < 2s.
Availability & Reliability: 99.99% availability for search; 99.999% for booking (critical path).
Consistency: Strong consistency for inventory/reservation; eventual consistency for search results.
Security: PCI-DSS compliance (via payment provider), TLS 1.3 for all data in transit.
Estimation
Traffic Estimation:
10M DAU. 10% search (1M searches/day). 1% of searches lead to booking (10k bookings/day).
Search QPS (Peak 5x): (1,000,000 / 86,400) * 5 ≈ 60 QPS (Surprisingly low, but read-heavy per request).
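Booking QPS: 10,000 / 86,400 ≈ 0.12 bookings/sec on average, ≈ 0.6 at a 5x peak. This is far below the 1,000+ bookings/sec peak target in the non-functional requirements. That target can therefore be read as headroom for flash sales and holiday spikes. The harder problem is contention on a few hot inventory rows, not aggregate throughput. Bookings still require low latency.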
Storage Estimation:
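100k hotels × 50 rooms/hotel × 730 days = 3.65 billion inventory rows. (At the 20 million rooms assumed in the clarifying questions, about 200 per hotel, this becomes 14.6 billion rows.)
At 100 bytes per row ≈ 365 GB for inventory (≈ 1.46 TB at 14.6 billion rows), before indexes and replicas.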
Bandwidth: Minimal for text; high for images (handled by CDN).
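The back-of-envelope numbers above can be checked with a few lines of arithmetic. The function below is a hypothetical helper. Its defaults are the assumptions stated in this section: 10M DAU, 10% searching, 1% conversion, a 5x peak factor, 100k hotels × 50 rows per hotel × 730 days, and 100 bytes per row.

```python
SECONDS_PER_DAY = 86_400


def estimate(dau=10_000_000, search_pct=10, conversion_pct=1, peak_factor=5,
             hotels=100_000, rows_per_hotel=50, horizon_days=730, bytes_per_row=100):
    """Back-of-envelope capacity numbers; every input is a planning assumption."""
    searches_per_day = dau * search_pct // 100
    bookings_per_day = searches_per_day * conversion_pct // 100
    inventory_rows = hotels * rows_per_hotel * horizon_days
    return {
        "searches_per_day": searches_per_day,
        "bookings_per_day": bookings_per_day,
        "peak_search_qps": searches_per_day / SECONDS_PER_DAY * peak_factor,
        "peak_booking_qps": bookings_per_day / SECONDS_PER_DAY * peak_factor,
        "inventory_rows": inventory_rows,
        "inventory_gb": inventory_rows * bytes_per_row / 1e9,
    }


if __name__ == "__main__":
    for name, value in estimate().items():
        print(f"{name:>18}: {value:,.2f}")
```

For example, estimate(rows_per_hotel=200) models the 20 million rooms from the clarifying questions.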
Blueprint
Concise Summary: A microservices-based architecture separating high-traffic hotel discovery (Search Service) from high-integrity room booking (Reservation Service).
Major Components:
API Gateway: Central entry point for auth, rate limiting, and request routing.
Search Service: High-performance read-only service utilizing OpenSearch/Elasticsearch for hotel discovery.
Reservation Service: Manages the state machine of a booking (Pending, Confirmed, Cancelled) using an RDBMS.
Inventory Service: Tracks room counts per date with strict transactional integrity.
Payment Adapter: Manages communication with 3rd party payment gateways.
Simplicity Audit: This design avoids complex distributed locks or 2PC by using local database transactions for inventory and a Saga pattern (orchestration-based) for the end-to-end booking flow.
Architecture Decision Rationale:
Why this?: An RDBMS is non-negotiable for inventory because ACID transactions prevent double-booking. OpenSearch is used for search because RDBMS LIKE queries and complex joins do not scale for geographic search.
Functional Satisfaction: Covers search, room selection, and payment.
Non-functional Satisfaction: Scalable through read-replicas for search and sharding for inventory.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: CloudFront/Akamai CDN caches hotel images and static UI assets. Latency-based DNS routing directs users to the nearest regional data center.
Security & Perimeter: API Gateway handles JWT validation and implements a leaky-bucket rate limiter to prevent scraping of hotel prices.
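Below is a minimal in-process sketch of a leaky-bucket limiter keyed by client (for example, an API key or IP). It assumes the gateway can call a hook like this for each request. Managed gateways usually configure rate limits declaratively, so LeakyBucketLimiter and its parameters are hypothetical. This is the "meter" form of the algorithm:

- each request adds one unit to the client's bucket;
- the bucket drains at a fixed rate;
- a request that would overflow the bucket is rejected.

```python
import time
from typing import Callable, Dict, Tuple


class LeakyBucketLimiter:
    """Per-client leaky bucket (meter variant): each request adds one unit,
    the bucket drains at leak_rate units/second, and overflow is rejected."""

    def __init__(self, capacity: float, leak_rate: float,
                 clock: Callable[[], float] = time.monotonic):
        self.capacity = capacity      # largest burst a single client may send
        self.leak_rate = leak_rate    # sustained requests/second allowed
        self.clock = clock            # injectable so tests can control time
        self.buckets: Dict[str, Tuple[float, float]] = {}  # client -> (level, last_seen)

    def allow(self, client_id: str) -> bool:
        now = self.clock()
        level, last = self.buckets.get(client_id, (0.0, now))
        # Drain whatever has leaked out since this client's previous request.
        level = max(0.0, level - (now - last) * self.leak_rate)
        if level + 1 > self.capacity:
            self.buckets[client_id] = (level, now)
            return False
        self.buckets[client_id] = (level + 1, now)
        return True


# Illustrative values: roughly 5 requests/second per client, bursts of up to 20.
limiter = LeakyBucketLimiter(capacity=20, leak_rate=5)
```

With several gateway instances, bucket state would have to be shared (for example in Redis) or the limit split per instance. This sketch keeps state in memory only.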
Service
Topology & Scaling: Stateless microservices deployed on Kubernetes (EKS/GKE). Horizontal Pod Autoscaling (HPA) triggers on CPU/Memory usage.
API Schema Design:
GET /v1/hotels/search: REST, returns list of hotels with availability.
POST /v1/reservations: REST, initiates a booking. Requires idempotency_key.
GET /v1/reservations/{id}: REST, status check.
Resilience & Reliability:
Circuit Breaker: Implemented on the Payment Adapter to avoid hanging if Stripe is slow.
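Saga Pattern: Reservation Service acts as orchestrator. 1. Lock Room -> 2. Process Payment -> 3. Finalize Room. If step 2 fails, trigger compensation (Unlock Room). If step 3 fails after payment has succeeded, compensate in reverse order (refund the payment, then unlock the room).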
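Below is a minimal in-memory sketch of the orchestration-based saga, assuming hypothetical Inventory Service and Payment Adapter interfaces. The method names hold/release/confirm and charge/refund are illustrative, not an existing API. The sketch also shows the idempotency_key check from the API schema: a retried POST with the same key returns the original reservation instead of running the saga twice. FakeInventory and FakePayments are test doubles only.

```python
import uuid
from dataclasses import dataclass


class PaymentDeclined(Exception):
    """Raised by the (hypothetical) payment adapter when a charge fails."""


@dataclass
class Reservation:
    res_id: str
    room_id: str
    nights: tuple
    status: str = "PENDING"  # PENDING -> CONFIRMED | CANCELLED


class BookingSaga:
    """Orchestrator for: 1. lock room -> 2. process payment -> 3. finalize room."""

    def __init__(self, inventory, payments):
        self.inventory = inventory  # needs hold(), release(), confirm()
        self.payments = payments    # needs charge(), refund()
        self.by_key = {}            # idempotency_key -> Reservation (in memory here)

    def book(self, idempotency_key, room_id, nights, amount_cents):
        # Retries carrying the same idempotency_key get the original reservation back.
        if idempotency_key in self.by_key:
            return self.by_key[idempotency_key]
        res = Reservation(str(uuid.uuid4()), room_id, tuple(nights))
        self.by_key[idempotency_key] = res

        # Step 1: lock the room for every night (a local DB transaction in the design).
        if not self.inventory.hold(room_id, res.nights):
            res.status = "CANCELLED"
            return res
        # Step 2: charge; if it fails, compensate step 1 by unlocking the room.
        try:
            charge_id = self.payments.charge(res.res_id, amount_cents)
        except PaymentDeclined:
            self.inventory.release(room_id, res.nights)
            res.status = "CANCELLED"
            return res
        # Step 3: finalize; on any failure after payment, refund and then unlock.
        try:
            self.inventory.confirm(room_id, res.nights)
        except Exception:
            self.payments.refund(charge_id)
            self.inventory.release(room_id, res.nights)
            res.status = "CANCELLED"
            return res
        res.status = "CONFIRMED"
        return res


class FakeInventory:
    """Test double: (room_id, night) -> available count."""

    def __init__(self, counts):
        self.counts = dict(counts)

    def hold(self, room_id, nights):
        if any(self.counts.get((room_id, n), 0) <= 0 for n in nights):
            return False
        for n in nights:
            self.counts[(room_id, n)] -= 1
        return True

    def release(self, room_id, nights):
        for n in nights:
            self.counts[(room_id, n)] += 1

    def confirm(self, room_id, nights):
        pass  # the hold already decremented availability


class FakePayments:
    """Test double for the Payment Adapter."""

    def __init__(self, decline=False):
        self.decline = decline
        self.charges = []

    def charge(self, res_id, amount_cents):
        if self.decline:
            raise PaymentDeclined(res_id)
        self.charges.append(res_id)
        return f"ch_{len(self.charges)}"

    def refund(self, charge_id):
        pass
```

In the real design, the idempotency_key-to-reservation mapping would need to live in the Reservation DB (for example, behind a unique constraint) rather than in process memory. That way, a retry that reaches a different pod sees the same result. That persistence is an assumption and is not shown here.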
Storage
Access Pattern:
Search: High-volume, complex filtering, geo-spatial queries.
Reservation: Low-volume, high-integrity, row-level updates.
Database Table Design (Postgres):
Inventory Table:
room_id + date (composite PK), hotel_id (shard key), total_rooms, available_rooms, version.
Reservation Table:
res_id (PK), user_id, hotel_id, room_id, check_in, check_out, status (PENDING/CONFIRMED/CANCELLED).
Technical Selection:
Postgres: Chosen for Reservation/Inventory for ACID and robust row-level locking (SELECT ... FOR UPDATE).
OpenSearch: Chosen for Search to handle geo-queries (search by lat/long) and fuzzy matching.
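Below is a runnable sketch of the "Update with Where" approach for a multi-night stay. It uses Python's built-in sqlite3 as a stand-in for Postgres: SQLite has no SELECT ... FOR UPDATE, but the conditional UPDATE is plain SQL.

- Every night is decremented inside one transaction.
- If any night is sold out, the whole transaction rolls back, so a guest never ends up with a partial stay.
- Nights are updated in sorted order. This is one way to keep two overlapping multi-night bookings from deadlocking.

The hotel_id column, the helper names (make_db, reserve_nights, available), and the sorted update order are illustrative assumptions.

In Postgres, concurrent UPDATEs of the same row queue on its row lock. Under the default READ COMMITTED isolation, the WHERE available_rooms > 0 check is then re-evaluated against the newest committed version of the row, which keeps the count from going negative.

```python
import sqlite3

SCHEMA = """
CREATE TABLE inventory (
    hotel_id        INTEGER NOT NULL,              -- assumed shard key
    room_id         INTEGER NOT NULL,
    date            TEXT    NOT NULL,              -- ISO date, one row per night
    total_rooms     INTEGER NOT NULL,
    available_rooms INTEGER NOT NULL CHECK (available_rooms >= 0),
    version         INTEGER NOT NULL DEFAULT 0,
    PRIMARY KEY (room_id, date)
)
"""


def make_db(rows):
    """rows: (hotel_id, room_id, date, total_rooms, available_rooms) tuples."""
    conn = sqlite3.connect(":memory:")
    conn.execute(SCHEMA)
    with conn:
        conn.executemany(
            "INSERT INTO inventory (hotel_id, room_id, date, total_rooms, available_rooms) "
            "VALUES (?, ?, ?, ?, ?)", rows)
    return conn


def reserve_nights(conn, room_id, nights):
    """Take one room for every night of the stay, or for none of them."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            # A fixed (sorted) order keeps two overlapping stays from deadlocking.
            for night in sorted(nights):
                cur = conn.execute(
                    "UPDATE inventory "
                    "SET available_rooms = available_rooms - 1, version = version + 1 "
                    "WHERE room_id = ? AND date = ? AND available_rooms > 0",
                    (room_id, night))
                if cur.rowcount != 1:  # sold out (or no row): abort the whole stay
                    raise LookupError(night)
        return True
    except LookupError:
        return False


def available(conn, room_id, night):
    return conn.execute(
        "SELECT available_rooms FROM inventory WHERE room_id = ? AND date = ?",
        (room_id, night)).fetchone()[0]
```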
Distribution Logic: Shard Inventory by hotel_id to ensure all dates for a single hotel live on the same node, allowing atomic multi-day updates.
Cache
Purpose & Justification: Reduces load on OpenSearch for popular city searches (e.g., "Paris in July").
Key-Value Schema:
search:{city}:{dates}:{occupancy} -> JSON Blob of Hotel IDs. TTL: 5-10 minutes (higher during low volatility).
Failure Handling: If Redis fails, fall back directly to OpenSearch.
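Below is a cache-aside sketch of the search cache with the Redis-failure fallback. It assumes a client exposing get(key) and setex(key, ttl_seconds, value), which matches the redis-py client methods of those names. The other pieces are stand-ins:

- TTLCache is an in-memory stand-in, so the example runs without a Redis server.
- search_backend stands in for the OpenSearch query.
- Splitting {dates} into check-in and check-out is an assumption about the key format.

```python
import json
import time
from typing import Callable, List


def search_key(city: str, check_in: str, check_out: str, occupancy: int) -> str:
    # search:{city}:{dates}:{occupancy}, with {dates} encoded as check-in:check-out (assumed)
    return f"search:{city.lower()}:{check_in}:{check_out}:{occupancy}"


def cached_search(cache, search_backend: Callable[..., List[int]], city: str,
                  check_in: str, check_out: str, occupancy: int,
                  ttl_seconds: int = 300) -> List[int]:
    key = search_key(city, check_in, check_out, occupancy)
    try:
        hit = cache.get(key)
    except Exception:
        # Cache is down: fall back directly to the search backend (OpenSearch).
        return search_backend(city, check_in, check_out, occupancy)
    if hit is not None:
        return json.loads(hit)
    hotel_ids = search_backend(city, check_in, check_out, occupancy)
    try:
        cache.setex(key, ttl_seconds, json.dumps(hotel_ids))
    except Exception:
        pass  # a failed cache write must not fail the search itself
    return hotel_ids


class TTLCache:
    """In-memory stand-in offering get/setex with expiry, for local testing."""

    def __init__(self, clock: Callable[[], float] = time.monotonic):
        self.clock = clock
        self.data = {}

    def get(self, key):
        item = self.data.get(key)
        if item is None or item[1] <= self.clock():
            self.data.pop(key, None)
            return None
        return item[0]

    def setex(self, key, ttl_seconds, value):
        self.data[key] = (value, self.clock() + ttl_seconds)
```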
Messaging
Purpose & Decoupling: Kafka decouples the critical booking path from side effects like sending confirmation emails or updating analytics dashboards.
Event / Topic Schema:
reservation_events: { "event_type": "BOOKED", "reservation_id": "123", "user_email": "..." }.
Failure Handling: Dead-letter queues (DLQ) for failed email notifications.
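Below is a sketch of the retry-then-dead-letter pattern for side-effect consumers such as the email sender. It is written against plain Python callables rather than a Kafka client, so RetryingConsumer, its parameters, and the in-memory dead_letters list are hypothetical. In a real deployment, the dead letters would be produced to a separate topic for inspection and replay. Kafka consumers typically deliver at least once, so the handler itself should tolerate duplicates (for example, by de-duplicating on reservation_id).

```python
import time
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RetryingConsumer:
    handler: Callable[[dict], None]          # e.g. sends the confirmation email
    max_attempts: int = 3
    base_delay: float = 0.5                  # seconds; doubled after each failure
    sleep: Callable[[float], None] = time.sleep
    dead_letters: List[dict] = field(default_factory=list)

    def consume(self, event: dict) -> bool:
        """Return True if handled, False if the event was dead-lettered."""
        error = None
        for attempt in range(self.max_attempts):
            try:
                self.handler(event)
                return True
            except Exception as exc:
                error = exc
                if attempt + 1 < self.max_attempts:
                    self.sleep(self.base_delay * 2 ** attempt)  # exponential backoff
        # Retries exhausted: park the event instead of blocking the partition.
        self.dead_letters.append(
            {"event": event, "error": repr(error), "attempts": self.max_attempts})
        return False
```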
Infrastructure (Optional)
Observability: Prometheus for metrics (booking success rate), ELK stack for logs, Jaeger for tracing cross-service reservation flows.
Wrap Up
Advanced Topics
Trade-offs: We chose Eventual Consistency for the Search Index. A room might show as "available" in search but fail at the checkout stage because it was just booked. This is a standard industry trade-off to keep search fast.
Reliability: To prevent "Zombie Reservations" (rooms locked but never paid), a background cleaner service cancels PENDING reservations older than 15 minutes and restores inventory.
Bottleneck Analysis: The Inventory DB is the primary write bottleneck. Sharding by hotel_id is the mitigation strategy for a 10x scale.
Security: All PII (guest names, emails) is encrypted at rest using AES-256. No credit card data is stored locally (PCI-DSS compliance via Stripe).
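Below is a sketch of the zombie-reservation cleaner. It again uses Python's built-in sqlite3 as a stand-in for Postgres so it runs anywhere. The created_at column and the simplified schema are assumptions: the Reservation table in the Storage section does not list a timestamp.

- Each reservation is cancelled and its nights restored in one transaction.
- The cancelling UPDATE only succeeds while the row is still PENDING, so a reservation confirmed between the scan and the update is left alone.

```python
import sqlite3
from datetime import date, datetime, timedelta

HOLD_TIMEOUT = timedelta(minutes=15)

# Simplified schema; created_at is an assumed column holding ISO timestamps.
SCHEMA = """
CREATE TABLE inventory (room_id INTEGER, date TEXT, available_rooms INTEGER,
                        PRIMARY KEY (room_id, date));
CREATE TABLE reservation (res_id TEXT PRIMARY KEY, room_id INTEGER, check_in TEXT,
                          check_out TEXT, status TEXT, created_at TEXT);
"""


def stay_nights(check_in: str, check_out: str):
    """Nights from check_in (inclusive) to check_out (exclusive), as ISO dates."""
    day, end = date.fromisoformat(check_in), date.fromisoformat(check_out)
    while day < end:
        yield day.isoformat()
        day += timedelta(days=1)


def clean_zombies(conn: sqlite3.Connection, now: datetime) -> int:
    """Cancel PENDING reservations older than HOLD_TIMEOUT; return how many."""
    cutoff = (now - HOLD_TIMEOUT).isoformat(timespec="seconds")
    stale = conn.execute(
        "SELECT res_id, room_id, check_in, check_out FROM reservation "
        "WHERE status = 'PENDING' AND created_at < ?", (cutoff,)).fetchall()
    cancelled = 0
    for res_id, room_id, check_in, check_out in stale:
        with conn:  # cancel and restore inventory in one transaction
            cur = conn.execute(
                "UPDATE reservation SET status = 'CANCELLED' "
                "WHERE res_id = ? AND status = 'PENDING'", (res_id,))
            if cur.rowcount != 1:
                continue  # confirmed or cancelled after the scan: leave it alone
            for night in stay_nights(check_in, check_out):
                conn.execute(
                    "UPDATE inventory SET available_rooms = available_rooms + 1 "
                    "WHERE room_id = ? AND date = ?", (room_id, night))
            cancelled += 1
    return cancelled
```

If several cleaner replicas run against Postgres, selecting the stale rows with FOR UPDATE SKIP LOCKED is one common way to stop them from processing the same reservations.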