The Question
Design

Hotel Reservation and Global Search System

Design a global hotel booking platform similar to Booking.com. The system must support searching for properties based on location and real-time availability, and a transactional booking flow that guarantees no double-bookings. Consider the scale of 20 million daily active users, the high-read/low-write nature of the traffic, and the consistency requirements for inventory management during peak seasonal demand.
PostgreSQL
Elasticsearch
Redis
Kafka
Debezium
CDN
API Gateway
Kubernetes
gRPC
Questions & Insights

Clarifying Questions

Scale & Traffic: What is the expected scale (DAU, bookings per day)?
Assumption: 20M DAU, 500k bookings/day, 100:1 read-to-write ratio (heavy search).
Inventory Consistency: Is overbooking allowed, or do we need strict consistency?
Assumption: Strict consistency is required to avoid double-booking; any overbooking must be a deliberate business policy, never the result of a technical failure.
Search Latency: What are the requirements for search freshness (how fast must inventory updates reflect in search)?
Assumption: Search results can be eventually consistent (seconds), but booking MUST be strongly consistent.
Geography: Is this a global system?
Assumption: Global users, multi-region deployment for latency, but centralized/sharded inventory for consistency.
Payment: Do we handle payments or just reservations?
Assumption: Out-of-scope for the core engine; we integrate with a 3rd party PSP (Stripe/Adyen).

Thinking Process

Core Bottleneck: Managing high-concurrency room inventory without double-booking while maintaining high-speed search across millions of listings.
Strategy Path:
How do we ensure no two people book the same room? (Database Transactions + Row-level locking).
How do we handle massive search traffic without hitting the main DB? (Elasticsearch/OpenSearch with CDC).
How do we handle high-frequency inventory updates? (Inventory Service with Redis-backed counters).
How do we handle the "Booking-Payment" state machine? (Saga Pattern or Outbox Pattern for distributed transactions).

Bonus Points

Inventory "Holding" Pattern: Implementing a 15-minute temporary hold on inventory using Redis TTL to improve UX and prevent race conditions during payment.
Geospatial Sharding: Partitioning the Search Index and Database by geohash or city_id to ensure localized traffic stays within regional clusters.
CDC (Change Data Capture): Using Debezium/Kafka to stream updates from the source-of-truth (RDBMS) to the search index (Elasticsearch) to eliminate dual-write inconsistencies.
Availability vs. Consistency (CAP/PACELC): Choosing CP (Consistency/Partition Tolerance) for the Booking flow and AP (Availability/Partition Tolerance) for the Search/Browsing flow. In PACELC terms, Booking is PC/EC and Search is PA/EL.
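The geospatial sharding idea above can be sketched roughly as follows. This is a minimal illustration, not a prescribed implementation: it encodes a coordinate with the standard geohash algorithm and maps the cell prefix to a shard. The function names, the 3-character prefix, and the CRC32-modulo routing are all assumptions made for the example.

```python
import zlib

_BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # standard geohash alphabet


def geohash(lat: float, lon: float, precision: int = 5) -> str:
    """Encode a coordinate as a geohash string of `precision` characters."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits are interleaved, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base32 character
            chars.append(_BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


def shard_for(lat: float, lon: float, num_shards: int, prefix_len: int = 3) -> int:
    """Hypothetical router: nearby hotels share a geohash prefix, hence a shard."""
    cell = geohash(lat, lon, prefix_len)
    return zlib.crc32(cell.encode("ascii")) % num_shards
```

Two hotels inside the same geohash cell always land on the same shard, which keeps a city-level query on one cluster. A real deployment would also need a plan for hot cells (dense city centres) and for queries that straddle a cell boundary.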
Design Breakdown

Functional Requirements

Core Use Cases:
Users can search for hotels by location, dates, and number of guests.
Users can view hotel details, room availability, and pricing.
Users can reserve a room (Booking flow).
Partners (Hotels) can update their room inventory and pricing.
Scope Control:
In-scope: Search, Inventory management, Booking flow, Notifications.
Out-of-scope: Flight bookings, Car rentals, User reviews, Loyalty programs, Internal Partner Dashboard UI.

Non-Functional Requirements

Scale: Support 100k+ concurrent users and millions of room listings.
Latency: Search results < 200ms; Booking processing < 500ms.
Availability & Reliability: 99.99% availability for search; 99.999% for booking integrity.
Consistency: Strong consistency for room availability during the booking phase.
Fault Tolerance: Handle 3rd party payment provider failures gracefully (idempotency).
Security: PCI-DSS compliance for payment data (via tokenization).

Estimation

Traffic:
20M DAU.
Search: 20M * 10 searches/day = 200M searches/day ≈ 2,300 QPS.
Peak Search: 5,000+ QPS.
Bookings: 500k/day ≈ 6 QPS (Low, but high-value/complex).
Storage:
1M Hotels * 10 Room types = 10M inventory records.
500k bookings/day * 365 days * 5 years ≈ 900M booking records.
~1 TB for bookings (assuming ~1 KB per record), ~500 GB for hotel metadata/images.
Bandwidth:
Search response (10KB) * 2300 QPS = 23MB/s outgoing.
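The back-of-the-envelope numbers above can be reproduced with a few lines of arithmetic. The ~1 KB booking record size is an assumption made here to arrive at the ~1 TB figure; the other inputs come from the estimate itself.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

dau = 20_000_000
searches_per_user = 10           # from the traffic estimate
bookings_per_day = 500_000
booking_record_bytes = 1_000     # assumption: ~1 KB per booking row
search_response_bytes = 10_000   # 10 KB per search response

search_qps = dau * searches_per_user / SECONDS_PER_DAY
booking_qps = bookings_per_day / SECONDS_PER_DAY
booking_rows_5y = bookings_per_day * 365 * 5
booking_storage_tb = booking_rows_5y * booking_record_bytes / 1e12
egress_mb_s = search_qps * search_response_bytes / 1e6

print(f"search QPS      ~ {search_qps:,.0f}")
print(f"booking QPS     ~ {booking_qps:.1f}")
print(f"booking rows    ~ {booking_rows_5y:,}")
print(f"booking storage ~ {booking_storage_tb:.2f} TB")
print(f"search egress   ~ {egress_mb_s:.1f} MB/s")
```

These are daily averages; the peak figure of 5,000+ QPS corresponds to roughly a 2x multiplier on the average search rate.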

Blueprint

Concise Summary: A microservices architecture leveraging a Relational DB for transactional integrity (Booking) and a Search Engine for high-performance discovery (Search).
Major Components:
Search Service: Uses Elasticsearch to filter hotels by geography and availability.
Booking Service: Manages the lifecycle of a reservation using PostgreSQL with ACID transactions.
Inventory Service: Tracks real-time room counts and manages "locks" during the checkout process.
Payment Integration: Async wrapper around 3rd party payment gateways.
Notification Service: Kafka-driven service for confirmation emails/SMS.
Simplicity Audit: This architecture separates the high-volume/eventually-consistent "Search" path from the low-volume/strongly-consistent "Booking" path to ensure reliability without over-engineering.
Architecture Decision Rationale:
Why this architecture?: RDBMS is non-negotiable for financial/inventory integrity. Elasticsearch is standard for geo-spatial/fuzzy search.
Functional Satisfaction: Covers end-to-end user journey from discovery to confirmation.
Non-functional Satisfaction: Scalable via sharding; high availability through service decoupling.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: CDN (Cloudflare/Akamai) caches hotel images and static UI assets. Latency-based DNS routing directs users to the nearest regional cluster.
Security & Perimeter: API Gateway handles JWT validation, rate limiting (prevents scrapers from stealing price data), and TLS termination.

Service

Topology & Scaling: Stateless microservices deployed in K8s clusters across multiple Availability Zones. Scaling is triggered by CPU (>60%) or Request Count.
API Schema Design:
GET /v1/search?location={lat,long}&dates={start,end} (REST)
POST /v1/bookings (REST): Initiates a booking. Request: {room_id, user_id, dates}. Response: booking_id, payment_url.
Idempotency: All booking requests require a client_idempotency_key to prevent double charges on retry.
Resilience & Reliability: Circuit breakers on the Payment Integration and Notification services to prevent cascading failures.
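The idempotency requirement on POST /v1/bookings can be illustrated with a small in-memory sketch. The class, its field names, and the placeholder payment URL are hypothetical; a real service would most likely persist the key under a unique constraint in the same transaction that creates the booking.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class BookingApi:
    """In-memory stand-in for the Booking Service (hypothetical)."""

    _responses: dict = field(default_factory=dict)  # idempotency key -> response
    created: int = 0                                # bookings actually created

    def create_booking(self, client_idempotency_key: str, room_id: str,
                       user_id: str, dates: tuple) -> dict:
        # A retry carrying the same key gets the stored response back,
        # so no second booking (and no second charge) is created.
        if client_idempotency_key in self._responses:
            return self._responses[client_idempotency_key]

        booking_id = str(uuid.uuid4())
        response = {
            "booking_id": booking_id,
            # Placeholder URL; the real one would come from the PSP.
            "payment_url": f"https://payments.example.invalid/checkout/{booking_id}",
        }
        self.created += 1
        self._responses[client_idempotency_key] = response
        return response


api = BookingApi()
first = api.create_booking("key-1", "room-42", "user-7", ("2025-01-01", "2025-01-03"))
retry = api.create_booking("key-1", "room-42", "user-7", ("2025-01-01", "2025-01-03"))
other = api.create_booking("key-2", "room-42", "user-7", ("2025-02-01", "2025-02-02"))
```

The retry returns exactly the first response, while a new key produces a new booking.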

Storage

Access Pattern:
Search: Heavy read, geo-query intensive.
Booking: Write-heavy during peak, requires ACID.
Database Table Design (PostgreSQL):
Hotels: hotel_id (PK), name, location, rating.
RoomTypes: room_type_id (PK), hotel_id, capacity, base_price.
Inventory: room_type_id, date, total_count, available_count. (Composite PK: room_type_id, date).
Bookings: booking_id (PK), user_id, room_type_id, check_in, check_out, status (PENDING, CONFIRMED, CANCELLED).
Technical Selection: PostgreSQL. Why? Support for SELECT ... FOR UPDATE row-level locking which is critical for inventory decrementing.
Distribution Logic: Shard by hotel_id or geohash. Most queries are localized.
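A rough sketch of the all-or-nothing inventory decrement against the Inventory table follows. It uses SQLite from the Python standard library purely as a runnable stand-in for PostgreSQL. SQLite has no SELECT ... FOR UPDATE, so the sketch relies on a conditional UPDATE inside one transaction instead; in PostgreSQL the same flow would typically lock the rows with SELECT ... FOR UPDATE first. The helper names are hypothetical.

```python
import sqlite3
from datetime import date, timedelta


def stay_nights(check_in: date, check_out: date) -> list[str]:
    """ISO dates of each occupied night; the check-out day is not occupied."""
    return [(check_in + timedelta(days=i)).isoformat()
            for i in range((check_out - check_in).days)]


def reserve(conn: sqlite3.Connection, room_type_id: int,
            check_in: date, check_out: date) -> bool:
    """Decrement every night of the stay, or none of them."""
    try:
        with conn:  # commits on success, rolls back if an exception escapes
            for day in stay_nights(check_in, check_out):
                cur = conn.execute(
                    "UPDATE inventory SET available_count = available_count - 1 "
                    "WHERE room_type_id = ? AND date = ? AND available_count > 0",
                    (room_type_id, day),
                )
                if cur.rowcount != 1:  # sold out or no inventory row for that night
                    raise LookupError(f"no availability on {day}")
        return True
    except LookupError:
        return False


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (room_type_id INTEGER, date TEXT, "
    "total_count INTEGER, available_count INTEGER, "
    "PRIMARY KEY (room_type_id, date))"
)
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?, ?)",
    [(1, "2025-01-01", 5, 1), (1, "2025-01-02", 5, 0)],
)
conn.commit()
```

Because the decrement is conditional on available_count > 0 and the whole stay runs in one transaction, a stay that fails on any night leaves the earlier nights untouched.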

Cache

Purpose & Justification: Redis stores transient inventory locks. When a user starts checkout, available_count is decremented in Redis for 15 mins. If payment fails/expires, the count is incremented back.
Key-Value Schema: lock:room_type_id:date:session_id -> TTL 15m.
Failure Handling: If Redis fails, the system falls back to the RDBMS (slower but safe).
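The hold semantics above can be sketched without a Redis server. The class below is an in-memory stand-in that mimics keys with a TTL using the lock:room_type_id:date:session_id schema; in Redis itself the equivalent primitive would be a SET with the NX and EX options. The class name, the capacity map, and the injectable clock are assumptions made so the example is runnable and testable.

```python
import time

HOLD_TTL_SECONDS = 15 * 60  # 15-minute hold from the text


class HoldStore:
    """In-memory stand-in for Redis keys with a TTL (hypothetical helper)."""

    def __init__(self, capacity, clock=time.monotonic):
        # capacity: {(room_type_id, date): rooms that may be held at once}
        self._capacity = dict(capacity)
        self._expiry = {}  # "lock:room_type_id:date:session_id" -> expiry time
        self._clock = clock

    def _purge_expired(self):
        now = self._clock()
        for key in [k for k, exp in self._expiry.items() if exp <= now]:
            del self._expiry[key]

    def try_hold(self, room_type_id, day, session_id):
        """Place or refresh a hold; False if every room is held by others."""
        self._purge_expired()
        prefix = f"lock:{room_type_id}:{day}:"
        key = prefix + str(session_id)
        if key not in self._expiry:
            held = sum(1 for k in self._expiry if k.startswith(prefix))
            if held >= self._capacity.get((room_type_id, day), 0):
                return False
        self._expiry[key] = self._clock() + HOLD_TTL_SECONDS
        return True

    def release(self, room_type_id, day, session_id):
        """Drop the hold early, e.g. after payment succeeds or fails."""
        self._expiry.pop(f"lock:{room_type_id}:{day}:{session_id}", None)
```

An expired hold frees the room without any explicit cleanup call, which is the behaviour the TTL is meant to provide when a user abandons checkout.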

Messaging

Purpose & Decoupling: Kafka decouples the Booking service from downstream tasks like Notifications (Email), Analytics (User behavior), and Search Indexing (CDC).
Event Schema: BookingCreatedEvent: {booking_id, user_email, total_price}.
Technical Selection: Kafka for high throughput and replayability.

Data Processing

Processing Model: Use Debezium (CDC) to stream Inventory DB changes to Elasticsearch.
Processing DAG: Postgres WAL (logical replication) -> Debezium on Kafka Connect -> Transformation -> Elasticsearch Sink.
Correctness Guarantees: Ensures that if a hotel adds rooms, the search index reflects this within seconds.
Technical Selection: Kafka Connect for seamless DB-to-ES integration.
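The "Transformation" step in the pipeline might look roughly like the function below. It takes the payload of a Debezium-style change event (with its before, after, and op fields) for an Inventory row and turns it into an action for the search index. The output shape and the function name are hypothetical, not the Elasticsearch bulk format, and the date is shown as an ISO string for readability; the actual encoding depends on connector configuration.

```python
def to_search_action(change: dict) -> dict:
    """Map one inventory change event to an index or delete action."""
    op = change["op"]  # "c" = create, "u" = update, "d" = delete, "r" = snapshot read
    row = change["before"] if op == "d" else change["after"]
    doc_id = f'{row["room_type_id"]}:{row["date"]}'

    if op == "d":
        return {"action": "delete", "id": doc_id}

    return {
        "action": "index",
        "id": doc_id,  # a stable id makes replays overwrite instead of duplicating
        "doc": {
            "room_type_id": row["room_type_id"],
            "date": row["date"],
            "available_count": row["available_count"],
            "available": row["available_count"] > 0,
        },
    }


update_event = {
    "op": "u",
    "before": {"room_type_id": 1, "date": "2025-01-01", "available_count": 1},
    "after": {"room_type_id": 1, "date": "2025-01-01", "available_count": 0},
}
delete_event = {
    "op": "d",
    "before": {"room_type_id": 1, "date": "2025-01-02", "available_count": 3},
    "after": None,
}
```

Deriving the document id from the primary key keeps the sink idempotent, so replaying a Kafka partition after a failure converges to the same index state.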

Infrastructure (Optional)

Observability: Prometheus for metrics (booking success rate), Grafana for dashboards, Jaeger for tracing requests across the search/booking services.
Wrap Up

Advanced Topics

Trade-offs: We chose Pessimistic Locking (SELECT FOR UPDATE) in the Inventory DB. Alternative: Optimistic Locking (versioning). Decision: Pessimistic is better for high-contention scenarios (e.g., last room available during a holiday) to prevent overbooking.
Reliability: If the Search index is down, the system can fall back to a basic DB query (with limited filters) to maintain availability.
Bottleneck Analysis: The Inventory DB is the ultimate bottleneck. We mitigate this by sharding by hotel_id. A single DB instance can handle on the order of 10k TPS for simple transactions, far above the ~6 TPS average implied by 500k bookings/day; the real pressure comes from peak bursts and lock contention on hot rows.
Security: Sensitive data (PII) is encrypted at rest using AES-256. All internal traffic is mTLS.
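Distinguishing Insight: Inventory Granularity. Don't store "Total Rooms: 100". Store room availability per day (e.g., RoomA: Jan 1 -> 5 free, Jan 2 -> 3 free). This allows for flexible multi-day booking checks using a single SQL query: SELECT MIN(available_count) FROM inventory WHERE room_type_id = :room_type_id AND date >= :check_in AND date < :check_out. The check-out date is excluded because the room is not occupied that night; the stay is bookable only if the result is at least the number of rooms requested.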
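A small runnable sketch of the per-day availability check, using SQLite from the Python standard library as a stand-in for PostgreSQL. The function name is hypothetical. On top of the MIN check it also counts rows, an extra safeguard assumed here so that a night with no inventory row at all is treated as unavailable.

```python
import sqlite3
from datetime import date


def is_available(conn: sqlite3.Connection, room_type_id: int,
                 check_in: date, check_out: date, rooms: int = 1) -> bool:
    """True if `rooms` rooms are free on every night in [check_in, check_out)."""
    nights = (check_out - check_in).days
    min_free, rows = conn.execute(
        "SELECT MIN(available_count), COUNT(*) FROM inventory "
        "WHERE room_type_id = ? AND date >= ? AND date < ?",
        (room_type_id, check_in.isoformat(), check_out.isoformat()),
    ).fetchone()
    # rows == nights guards against a missing inventory row for some night.
    return nights > 0 and rows == nights and min_free >= rooms


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE inventory (room_type_id INTEGER, date TEXT, "
    "total_count INTEGER, available_count INTEGER, "
    "PRIMARY KEY (room_type_id, date))"
)
conn.executemany(
    "INSERT INTO inventory VALUES (?, ?, ?, ?)",
    [(1, "2025-01-01", 10, 5), (1, "2025-01-02", 10, 3), (1, "2025-01-03", 10, 0)],
)
conn.commit()
```

With the sample rows, a Jan 1 to Jan 3 stay covers the nights of Jan 1 and Jan 2, so the limiting value is the 3 rooms free on Jan 2.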