The Question
Scalable Marketplace & Reservation System Design
Design a global vacation rental platform like Airbnb. The system must support millions of listings and handle high-concurrency search and booking flows. Key challenges include ensuring strict consistency for reservations to prevent double-booking, low-latency geo-spatial search across millions of records, and managing high-volume media assets. Discuss your approach to data consistency, search indexing strategy, and how you would handle scaling to a global user base.
PostgreSQL
OpenSearch
Redis
Kafka
Flink
S3
CloudFront
gRPC
S2 Geometry
CDC
Questions & Insights
Clarifying Questions
Scale: What is the target scale in terms of Daily Active Users (DAU) and total listings?
Assumption: 20M DAU, 10M listings, 100k bookings/day.
Geography: Is this a global system or regional?
Assumption: Global platform requiring low-latency search and geo-distributed availability.
Search Latency: What is the SLA for search results?
Assumption: P99 < 200ms for geo-spatial queries with filters.
Consistency: How critical is real-time availability in search?
Assumption: Search results can be eventually consistent (seconds delay), but booking MUST be strictly consistent (no double-bookings).
Thinking Process
Core Bottleneck: High-read volume for search vs. high-integrity requirements for the booking state machine.
Strategy Steps:
Establish a Read-Optimized Search Path using geo-spatial indexing (Elasticsearch/OpenSearch) to handle complex filters.
Implement a Write-Heavy/Transaction-Safe Booking Path using RDBMS with row-level locking or distributed locks to prevent overbooking.
Use an Event-Driven Bridge (Change Data Capture) to keep search indexes in sync with listing/availability changes in the database.
Design a Geo-sharding Strategy based on location IDs to minimize cross-region latency.
Bonus Points
Availability Guardrails: Implementing fencing tokens with distributed locks (Redis/ZooKeeper) to handle the "split-brain" problem in the booking flow, so that a stale lock holder cut off by a network partition cannot write after a newer holder has taken over.
Quadtree/S2 Geometry: Using Google S2 library for efficient spherical geometry math to shard search data by Hilbert Curve cell IDs.
Tiered Storage for Photos: Cost optimization using CDN for hot images and S3 Intelligent-Tiering for older, less-viewed listing media.
Dynamic Pricing Engine: Utilizing Flink for real-time stream processing of demand signals to suggest price adjustments to hosts.
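The fencing-token idea from the first bullet can be sketched as follows. This is a minimal illustration, assuming a lock service that issues a strictly increasing token with each lease and a storage layer that rejects writes carrying an older token than the last one it accepted. `LockService` and `FencedStore` are hypothetical in-memory stand-ins, not Redis or ZooKeeper APIs.

```python
import itertools


class LockService:
    """Hypothetical lock service: every granted lease gets a larger token."""

    def __init__(self):
        self._counter = itertools.count(1)

    def acquire(self) -> int:
        return next(self._counter)


class FencedStore:
    """Hypothetical storage layer that rejects writes with stale tokens."""

    def __init__(self):
        self.highest_token = 0
        self.value = None

    def write(self, token: int, value: str) -> bool:
        if token < self.highest_token:
            return False  # stale holder, e.g. a paused or partitioned client
        self.highest_token = token
        self.value = value
        return True


locks = LockService()
store = FencedStore()

token_a = locks.acquire()  # client A gets the lease, then stalls
token_b = locks.acquire()  # the lease expires and client B acquires it

accepted_b = store.write(token_b, "booked by B")
accepted_a = store.write(token_a, "booked by A")  # A wakes up late
```

The key design point is that the check lives in the storage layer: the lock alone cannot stop a client that still believes it holds an expired lease.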
Design Breakdown
Functional Requirements
Core Use Cases:
Users can search for listings by location, date range, and price.
Users can view listing details and high-quality photos.
Users can book a listing (Reservation system).
Hosts can create and manage listings.
Scope Control:
In-Scope: Search, Booking, Listing Management.
Out-of-Scope: Payments (Third-party integration), Reviews/Ratings (secondary feature), Messaging/Chat between host and guest.
Non-Functional Requirements
Scale: Support 10M+ listings and 50k+ QPS for search (well above the ~5,000 QPS peak estimated below, leaving headroom for growth).
Latency: Search results under 200ms; booking confirmation under 1s.
Availability: 99.99% for search; 99.9% for booking.
Consistency: Strong consistency for booking transactions; eventual consistency (seconds) for search.
Security: GDPR compliance for user data; TLS for all transit.
Estimation
Traffic:
20M DAU.
Search: 5 queries/user = 100M queries/day ≈ 1,200 QPS average (Peak: 5,000 QPS).
Booking: 100k bookings/day ≈ 1.2 QPS (Peak: 50 QPS).
Storage:
10M listings. 10 images/listing (200KB each) = 20TB for photos.
Metadata: 10M listings * 5KB = 50GB.
Bandwidth:
Search results: 1,200 QPS * 50KB response = 60MB/s (Outgoing).
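The arithmetic behind these estimates can be checked with a short script. It is a sketch that assumes decimal units (1 KB = 1,000 bytes) and an 86,400-second day; the variable names are illustrative only.

```python
SECONDS_PER_DAY = 86_400

dau = 20_000_000
searches_per_user = 5
bookings_per_day = 100_000

# Average request rates derived from the daily totals.
search_qps = dau * searches_per_user / SECONDS_PER_DAY
booking_qps = bookings_per_day / SECONDS_PER_DAY

listings = 10_000_000
photo_bytes = listings * 10 * 200_000   # 10 images of 200 KB each
metadata_bytes = listings * 5_000       # 5 KB of metadata per listing

# Outgoing bandwidth, using the rounded 1,200 QPS figure and 50 KB responses.
egress_bytes_per_s = 1_200 * 50_000

print(f"search QPS (avg):  {search_qps:,.0f}")
print(f"booking QPS (avg): {booking_qps:.2f}")
print(f"photos:            {photo_bytes / 1e12:.0f} TB")
print(f"metadata:          {metadata_bytes / 1e9:.0f} GB")
print(f"search egress:     {egress_bytes_per_s / 1e6:.0f} MB/s")
```

The exact average search rate is a little under the rounded 1,200 QPS used in the text, which is the usual level of precision for back-of-envelope sizing.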
Blueprint
Concise Summary: A microservices-based architecture separating the read-heavy discovery path (Search Service) from the write-critical transactional path (Booking Service).
Major Components:
Search Service: Uses OpenSearch for geo-spatial indexing and filtering.
Booking Service: Manages reservation state machine using PostgreSQL.
Listing Service: Source of truth for listing metadata and availability.
Media Store: S3 and CloudFront for image delivery.
CDC/Kafka: Syncs availability changes to the Search index.
Simplicity Audit: This architecture avoids complex distributed transactions across services by using a central RDBMS for bookings and an async sync mechanism for search.
Architecture Decision Rationale:
Search Efficiency: Relational DBs struggle to combine geo-spatial search with many ad hoc filters at scale; OpenSearch is purpose-built for this.
Integrity: PostgreSQL ACID properties are non-negotiable for financial/reservation records.
Scalability: Decoupling allows search to scale horizontally without impacting the booking database.
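As an illustration of the kind of query the Search Service would send, the sketch below builds an OpenSearch query body that combines a `geo_distance` filter with range filters and sorts by distance. The field names (`location`, `price`, `max_guests`) and the index mapping behind them are assumptions; date-availability filtering is left out.

```python
import json


def build_search_query(lat, lon, radius_km, max_price, guests, size=20):
    """Build a filter-only geo query body; field names are assumptions."""
    origin = {"lat": lat, "lon": lon}
    return {
        "size": size,
        "query": {
            "bool": {
                "filter": [
                    {"geo_distance": {"distance": f"{radius_km}km",
                                      "location": origin}},
                    {"range": {"price": {"lte": max_price}}},
                    {"range": {"max_guests": {"gte": guests}}},
                ]
            }
        },
        # Nearest listings first.
        "sort": [
            {"_geo_distance": {"location": origin,
                               "order": "asc",
                               "unit": "km"}}
        ],
    }


body = build_search_query(48.8566, 2.3522, radius_km=5, max_price=200, guests=2)
print(json.dumps(body, indent=2))
```

Placing every clause under `filter` rather than `must` skips relevance scoring, which suits a query where results are ordered by distance anyway.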
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery: CDN caches listing photos and static UI assets globally. Dynamic search results are not cached at the edge due to high filter variance.
API Gateway: Handles AuthN/AuthZ, rate limiting (100 req/sec per user), and request routing. Terminates TLS 1.3.
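The per-user limit of 100 req/sec could be enforced with a token bucket. The sketch below is a single-process illustration that assumes a burst capacity equal to the rate; a real gateway would keep the counters in a shared store, and the names here are hypothetical.

```python
import time


class TokenBucket:
    """Refills `rate` tokens per second, up to `capacity`."""

    def __init__(self, rate, capacity, now):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = now

    def allow(self, now):
        # Refill in proportion to elapsed time, then spend one token.
        elapsed = now - self.updated
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


buckets = {}  # user_id -> TokenBucket


def allow_request(user_id, now=None):
    """Return True if the user is under 100 req/sec (assumed burst of 100)."""
    now = time.monotonic() if now is None else now
    if user_id not in buckets:
        buckets[user_id] = TokenBucket(rate=100, capacity=100, now=now)
    return buckets[user_id].allow(now)


burst = [allow_request("guest-1", now=0.0) for _ in range(101)]
print(burst.count(True), "allowed,", burst.count(False), "rejected")
```

Passing `now` explicitly keeps the limiter deterministic in tests; in production the default monotonic clock is used.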
Service
Topology: Stateless services deployed across multiple Availability Zones (AZs).
Search Service:
Protocol: gRPC for internal, REST for external.
Schema:
GET /v1/search?lat=...&long=...&checkin=...&checkout=...&guests=...
Booking Service:
Logic: Implements a state machine (Pending -> Confirmed -> Completed/Cancelled).
Idempotency: Uses booking_id (UUID) to prevent double-charging/double-booking on retries.
Resilience: 3x retries with exponential backoff for Booking DB calls; circuit breakers on Search Service to fall back to "cached popular listings" if OpenSearch is down.
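The booking state machine and idempotent retries can be sketched together. This is a minimal in-memory illustration: `InMemoryBookings` is a hypothetical stand-in for the Booking DB, and allowing Pending -> Cancelled is an assumption, since the text only lists Pending -> Confirmed -> Completed/Cancelled.

```python
import uuid

# Allowed transitions. PENDING -> CANCELLED is an assumption.
TRANSITIONS = {
    "PENDING": {"CONFIRMED", "CANCELLED"},
    "CONFIRMED": {"COMPLETED", "CANCELLED"},
    "COMPLETED": set(),
    "CANCELLED": set(),
}


class InMemoryBookings:
    """Hypothetical stand-in for the Booking DB."""

    def __init__(self):
        self._bookings = {}

    def create(self, booking_id, listing_id, guest_id):
        # Idempotent: a retry with the same booking_id returns the same record.
        existing = self._bookings.get(booking_id)
        if existing is not None:
            return existing
        record = {"id": booking_id, "listing_id": listing_id,
                  "guest_id": guest_id, "status": "PENDING"}
        self._bookings[booking_id] = record
        return record

    def transition(self, booking_id, new_status):
        record = self._bookings[booking_id]
        if record["status"] == new_status:
            return record  # replayed request, nothing to do
        if new_status not in TRANSITIONS[record["status"]]:
            raise ValueError(f"{record['status']} -> {new_status} not allowed")
        record["status"] = new_status
        return record


store = InMemoryBookings()
booking_id = str(uuid.uuid4())  # generated once by the client, reused on retry

first = store.create(booking_id, listing_id=42, guest_id=7)
retry = store.create(booking_id, listing_id=42, guest_id=7)
store.transition(booking_id, "CONFIRMED")
store.transition(booking_id, "CONFIRMED")  # replay is a no-op
store.transition(booking_id, "COMPLETED")
```

Because the client supplies the `booking_id`, a retried request after a timeout lands on the existing record instead of creating a second reservation.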
Storage
Access Pattern: 100:1 Read/Write ratio for listings; 1:1 for bookings.
Database Table Design (Booking DB):
reservations: id (PK), listing_id (FK), guest_id (FK), start_date, end_date, status (Enum), version (for optimistic locking).
Technical Selection:
PostgreSQL: Chosen for transactional integrity.
OpenSearch: Chosen for Geo-distance and Bounding-box query support.
Distribution:
Sharding: Postgres sharded by listing_id to ensure all reservations for one listing hit the same node, simplifying locking.
Replication: Master-Slave (1 Master, 2 Read Replicas per region).
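The double-booking check against the reservations table can be sketched as an overlap query plus insert inside one transaction. The example uses Python's built-in sqlite3 as a stand-in for PostgreSQL, treats stays as half-open ranges (checkout day is free for the next guest), and omits the version column; these are all assumptions made for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reservations (
        id TEXT PRIMARY KEY,
        listing_id INTEGER NOT NULL,
        guest_id INTEGER NOT NULL,
        start_date TEXT NOT NULL,  -- ISO dates compare correctly as text
        end_date TEXT NOT NULL,
        status TEXT NOT NULL
    )
""")


def try_reserve(conn, booking_id, listing_id, guest_id, start_date, end_date):
    """Insert a reservation unless it overlaps an active one for the listing."""
    with conn:  # commits on success, rolls back on exception
        clash = conn.execute(
            """SELECT 1 FROM reservations
               WHERE listing_id = ? AND status != 'CANCELLED'
                 AND start_date < ? AND ? < end_date
               LIMIT 1""",
            (listing_id, end_date, start_date),
        ).fetchone()
        if clash:
            return False
        conn.execute(
            "INSERT INTO reservations VALUES (?, ?, ?, ?, ?, 'PENDING')",
            (booking_id, listing_id, guest_id, start_date, end_date),
        )
        return True


first = try_reserve(conn, "b1", 42, 1, "2025-07-10", "2025-07-15")
overlapping = try_reserve(conn, "b2", 42, 2, "2025-07-14", "2025-07-18")
back_to_back = try_reserve(conn, "b3", 42, 3, "2025-07-15", "2025-07-18")
other_listing = try_reserve(conn, "b4", 43, 2, "2025-07-10", "2025-07-15")
```

In PostgreSQL the check-then-insert needs protection against concurrent transactions, for example by locking the listing row with SELECT ... FOR UPDATE first, or by declaring an exclusion constraint over the listing and date range (which requires the btree_gist extension) so the database rejects overlaps itself.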
Cache
Purpose: Reduce Listing DB load and store session data.
Schema:
Key: listing_meta:{id}, Value: JSON string of listing details. TTL: 1 hour.
Key: availability:{id}:{month}, Value: Bitfield/Bitmap of booked dates.
Failure Handling: If Redis fails, fall back to Postgres. Use "Cache Aside" pattern.
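The availability bitmap and the cache-aside read path can be sketched as below. A plain dict stands in for Redis, `fake_db_loader` is a hypothetical substitute for the Postgres query, and the `YYYY-MM` month format in the key is an assumption.

```python
import calendar

cache = {}  # in-memory stand-in for Redis


def month_bitmap(year, month, booked_days):
    """Pack booked days into an int; bit (day - 1) set means booked."""
    days_in_month = calendar.monthrange(year, month)[1]
    bitmap = 0
    for day in booked_days:
        if not 1 <= day <= days_in_month:
            raise ValueError(f"day {day} is outside {year}-{month:02d}")
        bitmap |= 1 << (day - 1)
    return bitmap


def is_free(bitmap, first_day, last_day):
    """True if no day in [first_day, last_day] is booked."""
    width = last_day - first_day + 1
    mask = ((1 << width) - 1) << (first_day - 1)
    return (bitmap & mask) == 0


def get_bitmap(listing_id, year, month, load_from_db):
    """Cache-aside read: try the cache, load and populate on a miss."""
    key = f"availability:{listing_id}:{year:04d}-{month:02d}"
    if key not in cache:
        cache[key] = load_from_db(listing_id, year, month)
    return cache[key]


db_reads = []


def fake_db_loader(listing_id, year, month):
    db_reads.append((listing_id, year, month))
    return month_bitmap(year, month, [10, 11, 12])


bitmap = get_bitmap(42, 2025, 7, fake_db_loader)
bitmap_again = get_bitmap(42, 2025, 7, fake_db_loader)  # served from cache
```

One integer per listing per month keeps the value small, and a date-range check becomes a single mask-and-compare.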
Messaging
Purpose: Decouple the source of truth (DB) from the search index.
Event Schema:
listing_updated or booking_confirmed. Includes listing_id, availability_dates, price.
Technical Selection: Kafka. High throughput and message retention allow for re-indexing if the search cluster needs to be rebuilt.
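One possible shape for these events is sketched below. The field types, the JSON encoding, and the choice of listing_id as the message key are assumptions; the text only names the event types and fields.

```python
import json
from dataclasses import asdict, dataclass

EVENT_TYPES = {"listing_updated", "booking_confirmed"}


@dataclass(frozen=True)
class ListingEvent:
    event_type: str
    listing_id: int
    availability_dates: list  # assumed: ISO date strings that changed
    price: float

    def __post_init__(self):
        if self.event_type not in EVENT_TYPES:
            raise ValueError(f"unknown event type: {self.event_type}")

    def to_bytes(self) -> bytes:
        """Serialize the event as UTF-8 JSON for the message value."""
        return json.dumps(asdict(self), sort_keys=True).encode("utf-8")

    def partition_key(self) -> bytes:
        """Key by listing so one listing's events stay in order."""
        return str(self.listing_id).encode("utf-8")


event = ListingEvent(
    event_type="booking_confirmed",
    listing_id=42,
    availability_dates=["2025-07-10", "2025-07-11"],
    price=180.0,
)
payload = event.to_bytes()
key = event.partition_key()
```

Keying by listing_id means all events for one listing land in the same Kafka partition, so the indexer sees them in the order they were produced.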
Data Processing
Processing Model: Streaming via Flink.
DAG: Consumes from Kafka -> Enriches listing data with host info -> Formats for OpenSearch -> Bulk index update.
Late Data: Flink watermarks handle out-of-order events from different regions.
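The enrich and format stages of the DAG can be illustrated in plain Python. This is not Flink code: it only shows the per-record transformation and the newline-delimited body that the OpenSearch bulk endpoint expects. The `host_id` field, the host attributes, and the `listings` index name are assumptions.

```python
import json


def enrich(event, hosts):
    """Attach host info to a listing event; unknown hosts get defaults."""
    host = hosts.get(event["host_id"], {})
    return {**event,
            "host_name": host.get("name"),
            "superhost": host.get("superhost", False)}


def to_bulk_body(docs, index="listings"):
    """Format documents as NDJSON: one action line, then one source line."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_id": str(doc["listing_id"])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


hosts = {7: {"name": "Ana", "superhost": True}}
events = [
    {"listing_id": 42, "host_id": 7, "price": 180.0},
    {"listing_id": 43, "host_id": 9, "price": 95.0},
]
docs = [enrich(e, hosts) for e in events]
bulk_body = to_bulk_body(docs)
print(bulk_body)
```

Using the listing ID as the document `_id` makes re-indexing idempotent: replaying the same event overwrites the document instead of duplicating it.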
Infrastructure (Optional)
Observability: Prometheus for metrics (QPS, Latency), Grafana for dashboards.
Distributed Coordination: Not used in MVP; standard DB locking suffices.
Wrap Up
Advanced Topics
Trade-offs: We chose Eventual Consistency for search. A listing might show as available in search for a few seconds after being booked. We mitigate this by re-validating availability in the Booking Service before finalizing.
Reliability: Multi-AZ deployment for all databases. Daily snapshots to S3.
Hotspot Management: Popular listings (e.g., "The Barbie Dreamhouse") can cause DB contention. Mitigation: Keep a frequently refreshed availability bitmask in Redis and reject requests for already-booked dates before they hit the DB.
Security: PII (Guest emails/phones) is encrypted at rest using AES-256. Access restricted via IAM roles.
Staff-Level Insight: To handle "Global Search" efficiently, we can use Geo-Partitioning where search queries are routed to the regional OpenSearch cluster closest to the searched coordinates, rather than the user's location, reducing cross-continent data transfer.
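Routing by searched coordinates can be sketched as picking the regional cluster nearest to the query point. The region names and reference coordinates below are hypothetical, and nearest-reference-point routing is a simplification; a production router would more likely consult a map of which cluster holds each geographic partition.

```python
import math

# Hypothetical regional clusters and reference coordinates (lat, lon).
CLUSTERS = {
    "us-east": (39.0, -77.5),
    "eu-west": (53.3, -6.3),
    "ap-southeast": (1.35, 103.8),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    radius_km = 6371.0
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    d_phi = phi2 - phi1
    d_lambda = math.radians(lon2 - lon1)
    a = (math.sin(d_phi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(d_lambda / 2) ** 2)
    return 2 * radius_km * math.asin(math.sqrt(a))


def route_search(lat, lon):
    """Pick the cluster closest to the searched location, not the user."""
    return min(CLUSTERS,
               key=lambda name: haversine_km(lat, lon, *CLUSTERS[name]))


# A user anywhere in the world searching for stays in Paris.
paris_cluster = route_search(48.8566, 2.3522)
new_york_cluster = route_search(40.7128, -74.0060)
jakarta_cluster = route_search(-6.2088, 106.8456)
```

The user's own location never enters `route_search`, which is the point of the insight: the query travels to where the data lives.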