The Question
Design Hotel Reservation and Global Search System
Design a global hotel booking platform similar to Booking.com. The system must support searching for properties based on location and real-time availability, and a transactional booking flow that guarantees no double-bookings. Consider the scale of 20 million daily active users, the high-read/low-write nature of the traffic, and the consistency requirements for inventory management during peak seasonal demand.
PostgreSQL
Elasticsearch
Redis
Kafka
Debezium
CDN
API Gateway
Kubernetes
gRPC
Questions & Insights
Clarifying Questions
Scale & Traffic: What is the expected scale (DAU, bookings per day)?
Assumption: 20M DAU, 500k bookings/day, 100:1 read-to-write ratio (heavy search).
Inventory Consistency: Is overbooking allowed, or do we need strict consistency?
Assumption: Strict consistency is required to avoid double-booking; overbooking is a business policy, not a technical failure.
Search Latency: What are the requirements for search freshness (how fast must inventory updates reflect in search)?
Assumption: Search results can be eventually consistent (seconds), but booking MUST be strongly consistent.
Geography: Is this a global system?
Assumption: Global users, multi-region deployment for latency, but centralized/sharded inventory for consistency.
Payment: Do we handle payments or just reservations?
Assumption: Out-of-scope for the core engine; we integrate with a 3rd party PSP (Stripe/Adyen).
Thinking Process
Core Bottleneck: Managing high-concurrency room inventory without double-booking while maintaining high-speed search across millions of listings.
Strategy Path:
How do we ensure no two people book the same room? (Database Transactions + Row-level locking).
How do we handle massive search traffic without hitting the main DB? (Elasticsearch/OpenSearch with CDC).
How do we handle high-frequency inventory updates? (Inventory Service with Redis-backed counters).
How do we handle the "Booking-Payment" state machine? (Saga Pattern or Outbox Pattern for distributed transactions).
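The booking/payment state machine from the last step can be sketched as a small transition table. This is a minimal sketch, assuming the three booking statuses used in this design (PENDING, CONFIRMED, CANCELLED); the event names are hypothetical and chosen only for illustration.

```python
from enum import Enum


class Status(Enum):
    PENDING = "PENDING"
    CONFIRMED = "CONFIRMED"
    CANCELLED = "CANCELLED"


# Allowed transitions. Event names are hypothetical, not part of the design.
TRANSITIONS = {
    (Status.PENDING, "payment_succeeded"): Status.CONFIRMED,
    (Status.PENDING, "payment_failed"): Status.CANCELLED,
    (Status.PENDING, "hold_expired"): Status.CANCELLED,
    (Status.CONFIRMED, "user_cancelled"): Status.CANCELLED,
}


def next_status(current: Status, event: str) -> Status:
    """Return the status after `event`, or raise if the move is not allowed."""
    try:
        return TRANSITIONS[(current, event)]
    except KeyError:
        raise ValueError(
            f"illegal transition: {current.value} on {event}"
        ) from None
```

Rejecting unknown transitions explicitly means a late or duplicated payment callback cannot, for example, move a CANCELLED booking back to CONFIRMED.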
Bonus Points
Inventory "Holding" Pattern: Implementing a 15-minute temporary hold on inventory using Redis TTL to improve UX and prevent race conditions during payment.
Geospatial Sharding: Partitioning the Search Index and Database by geohash or city_id to ensure localized traffic stays within regional clusters.
CDC (Change Data Capture): Using Debezium/Kafka to stream updates from the source-of-truth (RDBMS) to the search index (Elasticsearch) to eliminate dual-write inconsistencies.
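Availability vs. Consistency (CAP/PACELC): Choosing CP (Consistency/Partition Tolerance) for the Booking flow and AP (Availability/Partition Tolerance) for the Search/Browsing flow. In PACELC terms, booking also favors consistency over latency in normal operation, while search favors latency.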
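The geospatial sharding idea above can be sketched as a geohash encoder plus a shard lookup. This is a sketch under assumptions: the function names, the 3-character prefix, and the hash-then-modulo mapping are illustrative choices, not something this design prescribes.

```python
import hashlib

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash(lat: float, lon: float, precision: int = 5) -> str:
    """Encode a point as a geohash by interleaving longitude/latitude bits."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits, value, use_lon = 0, 0, True  # geohash starts with a longitude bit
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                value = (value << 1) | 1
                lon_lo = mid
            else:
                value <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                value = (value << 1) | 1
                lat_lo = mid
            else:
                value <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bits += 1
        if bits == 5:  # every 5 bits become one base-32 character
            chars.append(_BASE32[value])
            bits, value = 0, 0
    return "".join(chars)


def shard_for(lat: float, lon: float, num_shards: int, prefix_len: int = 3) -> int:
    """Map a location to a shard via a stable hash of its geohash prefix."""
    prefix = geohash(lat, lon, prefix_len)
    digest = hashlib.sha256(prefix.encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards
```

Points in the same geohash cell always land on the same shard. Hashing the prefix spreads cells evenly but does not keep neighboring cells together; a range-based mapping from prefix to regional cluster would be the alternative if locality across cells matters.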
Design Breakdown
Functional Requirements
Core Use Cases:
Users can search for hotels by location, dates, and number of guests.
Users can view hotel details, room availability, and pricing.
Users can reserve a room (Booking flow).
Partners (Hotels) can update their room inventory and pricing.
Scope Control:
In-scope: Search, Inventory management, Booking flow, Notifications.
Out-of-scope: Flight bookings, Car rentals, User reviews, Loyalty programs, Internal Partner Dashboard UI.
Non-Functional Requirements
Scale: Support 100k+ concurrent users and millions of room listings.
Latency: Search results < 200ms; Booking processing < 500ms.
Availability & Reliability: 99.99% availability for search; 99.999% for booking integrity.
Consistency: Strong consistency for room availability during the booking phase.
Fault Tolerance: Handle 3rd party payment provider failures gracefully (idempotency).
Security: PCI-DSS compliance for payment data (via tokenization).
Estimation
Traffic:
20M DAU.
Search: 20M * 10 searches/day = 200M searches/day ≈ 2,300 QPS.
Peak Search: 5,000+ QPS.
Bookings: 500k/day ≈ 6 QPS (Low, but high-value/complex).
Storage:
1M Hotels * 10 Room types = 10M inventory records.
500k bookings/day * 365 days * 5 years ≈ 900M booking records.
~1 TB for bookings (assuming roughly 1 KB per booking record), ~500 GB for hotel metadata/images.
Bandwidth:
Search response (10KB) * 2300 QPS = 23MB/s outgoing.
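The arithmetic behind these estimates can be reproduced in a few lines. The inputs are the figures stated above; the 1 KB per booking record is an assumption used only to sanity-check the ~1 TB storage figure.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

dau = 20_000_000
searches_per_user = 10
bookings_per_day = 500_000
response_kb = 10
bytes_per_booking = 1_000  # assumption: ~1 KB per booking record

search_qps = dau * searches_per_user / SECONDS_PER_DAY
booking_qps = bookings_per_day / SECONDS_PER_DAY
booking_rows_5y = bookings_per_day * 365 * 5
booking_storage_tb = booking_rows_5y * bytes_per_booking / 1e12
egress_mb_per_s = search_qps * response_kb / 1000  # 1 MB = 1000 KB

print(f"search QPS        ~ {search_qps:,.0f}")        # ~ 2,315
print(f"booking QPS       ~ {booking_qps:.1f}")        # ~ 5.8
print(f"booking rows (5y) = {booking_rows_5y:,}")      # 912,500,000
print(f"booking storage   ~ {booking_storage_tb:.2f} TB")
print(f"search egress     ~ {egress_mb_per_s:.1f} MB/s")
```

These are daily averages; the 5,000+ QPS peak figure above corresponds to a little over twice the average search rate.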
Blueprint
Concise Summary: A microservices architecture leveraging a Relational DB for transactional integrity (Booking) and a Search Engine for high-performance discovery (Search).
Major Components:
Search Service: Uses Elasticsearch to filter hotels by geography and availability.
Booking Service: Manages the lifecycle of a reservation using PostgreSQL with ACID transactions.
Inventory Service: Tracks real-time room counts and manages "locks" during the checkout process.
Payment Integration: Async wrapper around 3rd party payment gateways.
Notification Service: Kafka-driven service for confirmation emails/SMS.
Simplicity Audit: This architecture separates the high-volume/eventually-consistent "Search" path from the low-volume/strongly-consistent "Booking" path to ensure reliability without over-engineering.
Architecture Decision Rationale:
Why this architecture?: RDBMS is non-negotiable for financial/inventory integrity. Elasticsearch is standard for geo-spatial/fuzzy search.
Functional Satisfaction: Covers end-to-end user journey from discovery to confirmation.
Non-functional Satisfaction: Scalable via sharding; high availability through service decoupling.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: CDN (Cloudflare/Akamai) caches hotel images and static UI assets. Latency-based DNS routing directs users to the nearest regional cluster.
Security & Perimeter: API Gateway handles JWT validation, rate limiting (prevents scrapers from stealing price data), and TLS termination.
Service
Topology & Scaling: Stateless microservices deployed in K8s clusters across multiple Availability Zones. Scaling is triggered by CPU (>60%) or Request Count.
API Schema Design:
GET /v1/search?location={lat,long}&dates={start,end} (REST)
POST /v1/bookings (REST): Initiates a booking. Request: {room_id, user_id, dates}. Response: booking_id, payment_url.
Idempotency: All booking requests require a client_idempotency_key to prevent double charges on retry.
Resilience & Reliability: Circuit breakers on the Payment Integration and Notification services to prevent cascading failures.
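The idempotency requirement can be illustrated with an in-memory stand-in for the booking endpoint. This is a sketch, not the service's implementation: the class name, the dictionary store, and the placeholder payment URL are assumptions. A real deployment would persist keys in a shared store so that retries hitting another instance are also deduplicated.

```python
import threading
import uuid
from typing import Dict


class BookingApi:
    """In-memory stand-in for POST /v1/bookings with idempotency keys."""

    def __init__(self) -> None:
        self._lock = threading.Lock()
        self._responses: Dict[str, dict] = {}  # idempotency key -> first response
        self.bookings_created = 0

    def create_booking(self, client_idempotency_key: str, request: dict) -> dict:
        with self._lock:
            cached = self._responses.get(client_idempotency_key)
            if cached is not None:
                # Retry of a request we already handled: replay the response
                # instead of creating (and charging for) a second booking.
                return cached
            booking_id = str(uuid.uuid4())
            self.bookings_created += 1
            response = {
                "booking_id": booking_id,
                # Placeholder URL for illustration only.
                "payment_url": f"https://pay.example.com/{booking_id}",
            }
            self._responses[client_idempotency_key] = response
            return response
```

A client that times out and retries with the same key gets the original booking_id back. A production version would likely also scope the key to the user and reject a reused key whose request body differs.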
Storage
Access Pattern:
Search: Heavy read, geo-query intensive.
Booking: Low average volume (≈6 QPS) but contended writes during peak; requires ACID.
Database Table Design (PostgreSQL):
Hotels: hotel_id (PK), name, location, rating.
RoomTypes: room_type_id (PK), hotel_id, capacity, base_price.
Inventory: room_type_id, date, total_count, available_count. (Composite PK: room_type_id, date).
Bookings: booking_id (PK), user_id, room_type_id, check_in, check_out, status (PENDING, CONFIRMED, CANCELLED).
Technical Selection: PostgreSQL. Why? Support for SELECT ... FOR UPDATE row-level locking, which is critical for inventory decrementing.
Distribution Logic: Shard by hotel_id or geohash. Most queries are localized.
Cache
Purpose & Justification: Redis stores transient inventory locks. When a user starts checkout, available_count is decremented in Redis for 15 mins. If payment fails/expires, the count is incremented back.
Key-Value Schema: lock:room_type_id:date:session_id -> TTL 15m.
Failure Handling: If Redis fails, the system falls back to the RDBMS (slower but safe).
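The 15-minute hold can be sketched with an in-memory stand-in that mimics key expiry. This is not a Redis client: the HoldStore class and its methods are hypothetical, and the clock is injectable so that expiry can be exercised without waiting. In Redis itself, a per-session key could be written with SET ... NX EX 900, and the count-then-set step would need to run atomically, for example inside a Lua script.

```python
import time
from typing import Callable, Dict, Tuple

HOLD_TTL_SECONDS = 15 * 60  # 15-minute hold, as described above


class HoldStore:
    """In-memory stand-in for Redis-backed holds; not a Redis client."""

    def __init__(self, clock: Callable[[], float] = time.monotonic) -> None:
        self._clock = clock
        self._capacity: Dict[Tuple[str, str], int] = {}
        self._holds: Dict[str, float] = {}  # hold key -> expiry time

    def set_capacity(self, room_type_id: str, date: str, count: int) -> None:
        self._capacity[(room_type_id, date)] = count

    @staticmethod
    def _key(room_type_id: str, date: str, session_id: str) -> str:
        return f"lock:{room_type_id}:{date}:{session_id}"

    def _purge_expired(self) -> None:
        # Mimics Redis dropping keys whose TTL has elapsed.
        now = self._clock()
        for key in [k for k, expiry in self._holds.items() if expiry <= now]:
            del self._holds[key]

    def try_hold(self, room_type_id: str, date: str, session_id: str) -> bool:
        """Place a hold if a room is still free; idempotent per session."""
        self._purge_expired()
        key = self._key(room_type_id, date, session_id)
        if key in self._holds:
            return True
        prefix = self._key(room_type_id, date, "")
        active = sum(1 for k in self._holds if k.startswith(prefix))
        if active >= self._capacity.get((room_type_id, date), 0):
            return False
        self._holds[key] = self._clock() + HOLD_TTL_SECONDS
        return True

    def release(self, room_type_id: str, date: str, session_id: str) -> None:
        """Drop a hold early, e.g. when payment fails."""
        self._holds.pop(self._key(room_type_id, date, session_id), None)
```

Because an abandoned checkout simply expires, no cleanup job is needed to return the room to the pool. The hold only reserves the right to pay; the authoritative decrement still happens in the database transaction.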
Messaging
Purpose & Decoupling: Kafka decouples the Booking service from downstream tasks like Notifications (Email), Analytics (User behavior), and Search Indexing (CDC).
Event Schema:
BookingCreatedEvent: {booking_id, user_email, total_price}.
Technical Selection: Kafka for high throughput and replayability.
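The event schema above could be modeled as a small immutable record with JSON serialization. This is a sketch: the field types, the choice of JSON, and representing total_price as a decimal string are assumptions, since only the field names are specified.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class BookingCreatedEvent:
    booking_id: str
    user_email: str
    total_price: str  # assumption: decimal string, avoids float rounding

    def to_bytes(self) -> bytes:
        """Serialize to UTF-8 JSON, suitable as a message value."""
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")

    @classmethod
    def from_bytes(cls, payload: bytes) -> "BookingCreatedEvent":
        """Rebuild the event from a consumed message value."""
        return cls(**json.loads(payload.decode("utf-8")))
```

Using booking_id as the Kafka message key would keep all events for one booking on the same partition, and therefore in order for consumers such as the Notification service.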
Data Processing
Processing Model: Use Debezium (CDC) to stream Inventory DB changes to Elasticsearch.
Processing DAG:
Postgres WAL (logical decoding) -> Kafka Connect (Debezium source) -> Transformation -> Elasticsearch Sink.
Correctness Guarantees: Ensures that if a hotel adds rooms, the search index reflects this within seconds.
Technical Selection: Kafka Connect for seamless DB-to-ES integration.
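The Transformation step in the DAG can be sketched as a pure function from a Debezium change event to an index action. Debezium change events carry before, after, and op fields; everything else here is an assumption: the event is taken as the already-unwrapped value, the date column is assumed to arrive as an ISO string, and the returned action dictionary is a hypothetical shape rather than the Elasticsearch bulk API format.

```python
from typing import Optional


def to_index_action(change: dict) -> Optional[dict]:
    """Turn one Inventory change event into an upsert or delete action."""
    op = change.get("op")
    if op in ("c", "u", "r"):  # create, update, snapshot read
        row = change["after"]
        return {
            "action": "index",
            # Same composite key as the Inventory table: room_type_id + date.
            "id": f'{row["room_type_id"]}:{row["date"]}',
            "doc": {
                "room_type_id": row["room_type_id"],
                "date": row["date"],
                "available_count": row["available_count"],
            },
        }
    if op == "d":  # delete: only the "before" image is populated
        row = change["before"]
        return {"action": "delete", "id": f'{row["room_type_id"]}:{row["date"]}'}
    return None  # any other operation type is ignored in this sketch
```

Deriving the document id from the primary key makes the sink idempotent: replaying the same change event overwrites the same document instead of creating a duplicate.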
Infrastructure (Optional)
Observability: Prometheus for metrics (booking success rate), Grafana for dashboards, Jaeger for tracing requests across the search/booking services.
Wrap Up
Advanced Topics
Trade-offs: We chose Pessimistic Locking (SELECT FOR UPDATE) in the Inventory DB. Alternative: Optimistic Locking (versioning). Decision: Both prevent overbooking, but Pessimistic is better for high-contention scenarios (e.g., last room available during a holiday), where optimistic locking would cause many failed attempts and retries.
Reliability: If the Search index is down, the system can fall back to a basic DB query (with limited filters) to maintain availability.
Bottleneck Analysis: The Inventory DB is the ultimate bottleneck. We mitigate this by sharding by hotel_id. A single DB instance can handle on the order of ~10k TPS (hardware- and workload-dependent), which is plenty for 500k bookings/day (≈6 QPS on average); sharding mainly protects against peak-season hot spots.
Security: Sensitive data (PII) is encrypted at rest using AES-256. All internal traffic uses mTLS.
Distinguishing Insight: Inventory Granularity. Don't store "Total Rooms: 100". Store room availability per day (e.g., RoomA: Jan 1 -> 5 free, Jan 2 -> 3 free). This allows for flexible multi-day booking checks using a single SQL query: SELECT MIN(available_count) FROM inventory WHERE room_type_id = :room_type_id AND date >= :check_in AND date < :check_out. The range is half-open because the check-out day itself does not consume a room night.
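The per-day inventory check and the pessimistic-locking decision can be sketched together as one transaction. To stay self-contained, this sketch uses SQLite, which has no SELECT ... FOR UPDATE; BEGIN IMMEDIATE takes the write lock up front as a stand-in, whereas in PostgreSQL the SELECT would carry FOR UPDATE on the inventory rows. The function names are hypothetical, and treating the stay as a half-open range (check-in inclusive, check-out exclusive) is an assumption about how nights are counted.

```python
import sqlite3
from datetime import date


def make_db() -> sqlite3.Connection:
    """Create an in-memory Inventory table matching the schema above."""
    # isolation_level=None: we issue BEGIN/COMMIT ourselves.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE inventory ("
        " room_type_id TEXT, date TEXT,"
        " total_count INTEGER, available_count INTEGER,"
        " PRIMARY KEY (room_type_id, date))"
    )
    return conn


def reserve(conn: sqlite3.Connection, room_type_id: str,
            check_in: date, check_out: date) -> bool:
    """Decrement one room for every night of the stay, or change nothing."""
    num_nights = (check_out - check_in).days
    params = (room_type_id, check_in.isoformat(), check_out.isoformat())
    cur = conn.cursor()
    cur.execute("BEGIN IMMEDIATE")  # SQLite stand-in for FOR UPDATE
    try:
        count, min_free = cur.execute(
            "SELECT COUNT(*), MIN(available_count) FROM inventory"
            " WHERE room_type_id = ? AND date >= ? AND date < ?",
            params,
        ).fetchone()
        # Every night must have a row, and the tightest night needs a room.
        if count != num_nights or min_free is None or min_free < 1:
            cur.execute("ROLLBACK")
            return False
        cur.execute(
            "UPDATE inventory SET available_count = available_count - 1"
            " WHERE room_type_id = ? AND date >= ? AND date < ?",
            params,
        )
        cur.execute("COMMIT")
        return True
    except Exception:
        cur.execute("ROLLBACK")
        raise
```

The COUNT(*) guard matters: MIN alone would report availability even when a night in the range has no inventory row at all.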