The Question
Design: Online Travel Booking Platform
Design an online travel booking platform similar to Booking.com. The system should support hotel and property search by location and availability, handle reservations with strong consistency to prevent overbooking, process payments, and manage real-time inventory for millions of properties worldwide.
PostgreSQL
Elasticsearch
Redis
Kafka
Kubernetes
Questions & Insights
Thinking Process
To design a system like Booking.com, we must solve two core challenges: high-concurrency inventory management and low-latency search.
How do we handle the "Last Room" problem? We use a relational database with ACID transactions and row-level locking (Pessimistic Locking) during the checkout phase to prevent double-booking.
How do we scale search across millions of listings? We decouple the "Search" path from the "Booking" path. Search uses a NoSQL/Search Index (Elasticsearch) updated asynchronously to ensure high read throughput.
How do we handle dynamic pricing and availability? We implement a tiered caching strategy where availability is cached at the "Search Service" level but verified against the "Source of Truth" (DB) the moment a user clicks "Book."
Final Architecture: A microservices-based approach using a Read-Heavy Search Cluster and a Write-Heavy Transactional Booking Engine, synchronized via an Event-Driven Messaging Layer.
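The "Last Room" answer above can be sketched as a check-then-reserve transaction. This is a minimal illustration, not the platform's actual code: it uses an in-memory SQLite database so it runs anywhere, and because SQLite has no FOR UPDATE clause it takes the write lock up front with BEGIN IMMEDIATE, which plays the same role here. Against PostgreSQL the SELECT would end in FOR UPDATE. The table and column names follow the Inventory model in the Storage section; the room id and date are made-up sample values.

```python
import sqlite3


def make_db() -> sqlite3.Connection:
    # isolation_level=None means autocommit, so BEGIN/COMMIT/ROLLBACK
    # below are issued explicitly by this code.
    conn = sqlite3.connect(":memory:", isolation_level=None)
    conn.execute(
        "CREATE TABLE inventory ("
        " room_id INTEGER, date TEXT,"
        " total_inventory INTEGER, reserved_count INTEGER,"
        " PRIMARY KEY (room_id, date))"
    )
    # Hypothetical sample row: one room left for one night.
    conn.execute("INSERT INTO inventory VALUES (101, '2025-07-01', 1, 0)")
    return conn


def reserve_room(conn: sqlite3.Connection, room_id: int, date: str) -> bool:
    # Acquire the write lock before reading, so no other writer can change
    # the row between the availability check and the update.
    conn.execute("BEGIN IMMEDIATE")
    try:
        row = conn.execute(
            "SELECT total_inventory, reserved_count FROM inventory"
            " WHERE room_id = ? AND date = ?",  # PostgreSQL: add FOR UPDATE
            (room_id, date),
        ).fetchone()
        if row is None or row[1] >= row[0]:
            conn.execute("ROLLBACK")
            return False  # unknown room/date, or sold out
        conn.execute(
            "UPDATE inventory SET reserved_count = reserved_count + 1"
            " WHERE room_id = ? AND date = ?",
            (room_id, date),
        )
        conn.execute("COMMIT")
        return True
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

With one room in stock, the first call to reserve_room succeeds and the second returns False, which is the behaviour the locking is meant to guarantee under concurrency.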
Bonus Points
Dual-Phase Locking Strategy: Using Redis-based distributed locks for the "Pending" state (user is entering credit card) to improve UX, followed by DB-level transactions for the final "Confirmed" state.
Availability Data Sharding: Sharding the inventory database by hotel_id or region_id to ensure horizontal scalability and localize failures.
CDC (Change Data Capture): Utilizing Debezium or similar tools to stream updates from the SQL database to Elasticsearch to ensure search results are eventually consistent within seconds.
TCC Pattern (Try-Confirm-Cancel): Managing distributed transactions across the Booking Service and Payment Gateway to ensure data integrity without complex 2PC (Two-Phase Commit).
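To make the TCC idea concrete, here is a small in-memory sketch. Every participant exposes Try, Confirm, and Cancel; the coordinator confirms only if every Try succeeded and otherwise cancels whatever was already tried. The class names, the try_ method name, and the payment states are illustrative assumptions, and a real system would also persist the coordinator's progress so it can resume after a crash.

```python
class InventoryParticipant:
    def __init__(self, available: int):
        self.available = available
        self.held = set()  # booking ids with a tentative hold

    def try_(self, booking_id: str) -> bool:
        if self.available <= 0:
            return False
        self.available -= 1
        self.held.add(booking_id)
        return True

    def confirm(self, booking_id: str) -> None:
        # The hold becomes permanent; the room stays deducted.
        self.held.discard(booking_id)

    def cancel(self, booking_id: str) -> None:
        # Give the room back; safe to call more than once.
        if booking_id in self.held:
            self.held.discard(booking_id)
            self.available += 1


class PaymentParticipant:
    """Stand-in for a payment gateway: authorize, then capture or void."""

    def __init__(self, approve: bool):
        self.approve = approve
        self.state = {}

    def try_(self, booking_id: str) -> bool:
        if self.approve:
            self.state[booking_id] = "AUTHORIZED"
        return self.approve

    def confirm(self, booking_id: str) -> None:
        self.state[booking_id] = "CAPTURED"

    def cancel(self, booking_id: str) -> None:
        if booking_id in self.state:
            self.state[booking_id] = "VOIDED"


def book(booking_id: str, participants: list) -> str:
    tried = []
    for p in participants:
        if not p.try_(booking_id):
            # One Try failed: undo the earlier ones in reverse order.
            for q in reversed(tried):
                q.cancel(booking_id)
            return "CANCELLED"
        tried.append(p)
    for p in participants:
        p.confirm(booking_id)
    return "CONFIRMED"
```

For example, book("b1", [InventoryParticipant(1), PaymentParticipant(False)]) returns "CANCELLED" and leaves the room available, because the declined payment triggers the inventory Cancel.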
Design Breakdown
Functional Requirements
Search: Users can search for hotels by location, date range, and number of guests.
View Details: Users can view hotel information, room types, and live availability.
Reservation: Users can book a room and receive a confirmation.
Inventory Management: Hotel owners can update room availability and pricing.
Non-Functional Requirements
High Availability: The search functionality must be available 24/7.
Consistency: Strict consistency for the booking process (No double-bookings).
Low Latency: Search results must return in < 500ms.
Scalability: Support for millions of daily active users (DAU) and massive spikes during holiday seasons.
Estimation
DAU: 10 Million.
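Visitor-to-Booking Ratio: 50:1, i.e. about 200k bookings per day. At 10 searches per user this is roughly 500 searches per booking (Heavy reads).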
Total Hotels: 2 Million.
Avg. QPS (Search): (10M * 10 searches) / 86400s ≈ 1,200 QPS (Peak 5k+).
Avg. QPS (Booking): (200k bookings) / 86400s ≈ 2.3 QPS (Peak 50+).
Storage: 2M hotels * 5KB/hotel ≈ 10GB. Booking records (100M/year, at about 1KB per record) ≈ 100GB/year.
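The figures above can be reproduced with a few lines of arithmetic. The sketch reads the 50:1 ratio as one booking per 50 daily users, an assumption that happens to yield the 200k bookings/day used in the QPS line. Decimal units are used (1 GB = 1,000,000 KB).

```python
SECONDS_PER_DAY = 86_400

dau = 10_000_000
searches_per_user = 10
# Assumption: 1 in 50 daily users completes a booking.
bookings_per_day = dau // 50

searches_per_day = dau * searches_per_user
search_qps = searches_per_day / SECONDS_PER_DAY     # about 1,157
booking_qps = bookings_per_day / SECONDS_PER_DAY    # about 2.3

hotel_storage_gb = 2_000_000 * 5 / 1_000_000        # 5 KB per hotel
bookings_per_year = bookings_per_day * 365          # 73,000,000
searches_per_booking = searches_per_day / bookings_per_day

print(f"search QPS   ~ {search_qps:,.0f}")
print(f"booking QPS  ~ {booking_qps:.1f}")
print(f"hotel data   ~ {hotel_storage_gb:.0f} GB")
print(f"bookings/yr  ~ {bookings_per_year:,}")
print(f"searches per booking ~ {searches_per_booking:.0f}")
```

At 200k bookings per day the yearly total is about 73M, so the 100M/year figure is a rounded-up planning number rather than an exact product.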
Blueprint
Concise Summary: A microservices architecture separating the read-heavy Search path (Elasticsearch) from the write-heavy Booking path (PostgreSQL), coordinated by an asynchronous messaging bus.
Major Components:
Search Service: Aggregates hotel data and availability for low-latency filtering.
Booking Service: Manages the state machine of a reservation and ensures ACID compliance.
Payment Service: An asynchronous bridge to external providers (Stripe/PayPal).
Inventory DB: Relational store for room counts and pricing.
Simplicity Audit: This design avoids complex global distributed transactions by using a single-leader RDBMS for the inventory "Source of Truth" while offloading the heavy lifting of search to a dedicated index.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling: Services are containerized (K8s). Search Service scales horizontally based on CPU/Request count. Booking Service scales but is limited by the DB connection pool.
API Spec:
GET /v1/search?location={}&checkin={}&checkout={}: Returns a list of available hotels.
POST /v1/bookings: Initiates a reservation (returns booking_id).
PUT /v1/bookings/{id}/confirm: Finalizes booking after payment.
Storage
Data Model:
Hotels: {id, name, location, rating}
Rooms: {id, hotel_id, type, base_price}
Inventory: {room_id, date, total_inventory, reserved_count}. Critical Index: (room_id, date).
Database Logic: Uses Pessimistic Locking: SELECT ... FROM Inventory WHERE room_id = ? AND date = ? FOR UPDATE. This ensures that two users cannot claim the same room simultaneously during the transaction block.
Cache
Redis Usage:
Distributed Lock: A short-lived key (TTL 10 mins) is set when a user enters the "Checkout" flow to "hold" the room in the UI.
Search Results: Frequent queries (e.g., "Paris, Next Weekend") are cached to reduce ES load.
Eviction: Least Recently Used (LRU).
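The checkout "hold" described above maps naturally onto Redis SET with the NX and EX options: the key is written only if it does not already exist, and it expires on its own after the TTL. The sketch below assumes one key per (room_id, date) and a hypothetical key format hold:{room_id}:{date}. So that it runs without a server, it uses a tiny in-memory stand-in that mimics only the set(name, value, ex=..., nx=...) call shape of the redis-py client; with a real client the hold_room function would be unchanged.

```python
import time
import uuid

HOLD_TTL_SECONDS = 600  # the 10-minute TTL from the text


class FakeRedis:
    """In-memory stand-in that mimics SET with nx/ex only."""

    def __init__(self):
        self._data = {}  # key -> (value, expires_at)

    def set(self, name, value, ex=None, nx=False):
        now = time.monotonic()
        current = self._data.get(name)
        if current is not None and current[1] <= now:
            current = None  # treat an expired key as absent
        if nx and current is not None:
            return None  # key exists, so NX blocks the write
        expires_at = now + ex if ex is not None else float("inf")
        self._data[name] = (value, expires_at)
        return True


def hold_room(client, room_id: int, date: str, ttl: int = HOLD_TTL_SECONDS):
    """Try to place a checkout hold; return a token on success, else None."""
    token = uuid.uuid4().hex  # identifies the holder for a later release
    key = f"hold:{room_id}:{date}"
    acquired = client.set(key, token, nx=True, ex=ttl)
    return token if acquired else None
```

The hold is only a UX aid: the final booking still goes through the database transaction, so an expired or lost key cannot cause a double booking.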
Messaging
Kafka Structure:
Topic inventory-updates: Published by the Booking Service whenever a transaction commits.
Topic booking-notifications: Triggers email/SMS services.
Delivery Guarantees: At-least-once delivery; the Indexer Worker must be idempotent.
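One common way to make a consumer idempotent under at-least-once delivery is to have each event carry the absolute new state plus a version, and to ignore anything that is not newer than what is already stored. The sketch below assumes a hypothetical event shape (room_id, date, available, version) and uses a plain dict in place of the search index. Elasticsearch's external versioning can enforce a similar check on the index side.

```python
def apply_inventory_event(index: dict, event: dict) -> bool:
    """Apply an inventory-updates event; return False if it was a no-op."""
    key = (event["room_id"], event["date"])
    current = index.get(key)
    if current is not None and current["version"] >= event["version"]:
        # Duplicate or out-of-order redelivery: the index is already
        # at least this fresh, so applying it again changes nothing.
        return False
    index[key] = {
        "available": event["available"],
        "version": event["version"],
    }
    return True


# Usage: the same event delivered twice only takes effect once.
index = {}
event_v1 = {"room_id": 101, "date": "2025-07-01", "available": 3, "version": 1}
event_v2 = {"room_id": 101, "date": "2025-07-01", "available": 2, "version": 2}
apply_inventory_event(index, event_v1)
apply_inventory_event(index, event_v2)
apply_inventory_event(index, event_v2)  # redelivery, ignored
apply_inventory_event(index, event_v1)  # stale, ignored
```

Carrying the absolute availability rather than a delta such as "minus one" is what makes replays harmless; a delta-based event would need separate deduplication.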
Data Processing
Indexer Worker: A consumer that reads inventory-updates from Kafka and updates the Elasticsearch index. This ensures the Search Service sees the reduced availability without hitting the main Postgres DB.
Wrap Up
Advanced Topics
Trade-offs: We trade Consistency for Availability in the search results. A user might see a room as "available" in search, but find it "taken" at the final checkout step. This is acceptable in travel systems to maintain performance.
Bottlenecks: The primary bottleneck is the Inventory table in Postgres. During extreme load, row-level locking causes contention.
Failure Handling:
DB Failover: Multi-AZ RDS with a hot standby.
Idempotency: All booking requests include a request_id (Idempotency Key) to prevent double charges on retry.
Alternatives & Optimization: For high-volume scaling, we could move from Pessimistic Locking to Optimistic Locking (using a version column) to reduce DB lock wait times, though this increases the rate of "Update Failed" errors for users.
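The optimistic alternative can be sketched as a read followed by a conditional update that only succeeds if the version is unchanged. This is an illustration under assumptions, not the platform's code: it adds a version column to the Inventory model, uses in-memory SQLite so it runs without a server, and retries a hypothetical three times before reporting failure. The same UPDATE ... WHERE version = ? statement works in PostgreSQL.

```python
import sqlite3


def make_versioned_db() -> sqlite3.Connection:
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE inventory ("
        " room_id INTEGER, date TEXT,"
        " total_inventory INTEGER, reserved_count INTEGER,"
        " version INTEGER, PRIMARY KEY (room_id, date))"
    )
    # Hypothetical sample row: two rooms available, version 0.
    conn.execute("INSERT INTO inventory VALUES (101, '2025-07-01', 2, 0, 0)")
    conn.commit()
    return conn


def reserve_optimistic(conn, room_id: int, date: str, max_retries: int = 3) -> bool:
    for _ in range(max_retries):
        row = conn.execute(
            "SELECT total_inventory, reserved_count, version FROM inventory"
            " WHERE room_id = ? AND date = ?",
            (room_id, date),
        ).fetchone()
        if row is None or row[1] >= row[0]:
            return False  # unknown room/date, or sold out
        # No lock was held during the read; the version check in the
        # WHERE clause detects a concurrent change instead.
        cur = conn.execute(
            "UPDATE inventory"
            " SET reserved_count = reserved_count + 1, version = version + 1"
            " WHERE room_id = ? AND date = ? AND version = ?",
            (room_id, date, row[2]),
        )
        conn.commit()
        if cur.rowcount == 1:
            return True
        # Zero rows updated: another writer got there first, so re-read.
    return False  # surfaces as the "Update Failed" case
```

Under low contention almost every attempt succeeds on the first pass with no lock wait. Under heavy contention on a single hot row the retries pile up, which is the trade-off described above.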