The Question

Scalable Marketplace & Reservation System Design

Design a global vacation rental platform like Airbnb. The system must support millions of listings and handle high-concurrency search and booking flows. Key challenges include ensuring strict consistency for reservations to prevent double-booking, low-latency geo-spatial search across millions of records, and managing high-volume media assets. Discuss your approach to data consistency, search indexing strategy, and how you would handle scaling to a global user base.

PostgreSQL

OpenSearch

Redis

Kafka

Flink

CloudFront

gRPC

S2 Geometry

CDC

Questions & Insights

Clarifying Questions

Scale: What is the target scale in terms of Daily Active Users (DAU) and total listings?

Assumption: 20M DAU, 10M listings, 100k bookings/day.

Geography: Is this a global system or regional?

Assumption: Global platform requiring low-latency search and geo-distributed availability.

Search Latency: What is the SLA for search results?

Assumption: P99 < 200ms for geo-spatial queries with filters.

Consistency: How critical is real-time availability in search?

Assumption: Search results can be eventually consistent (seconds delay), but booking MUST be strictly consistent (no double-bookings).

Thinking Process

Core Bottleneck: High-read volume for search vs. high-integrity requirements for the booking state machine.

Strategy Steps:

Establish a Read-Optimized Search Path using geo-spatial indexing (Elasticsearch/OpenSearch) to handle complex filters.

Implement a Write-Critical, Transaction-Safe Booking Path using an RDBMS with row-level locking or distributed locks to prevent overbooking. Booking volume is low (about 1 QPS on average); the constraint is correctness, not write throughput.

Use an Event-Driven Bridge (Change Data Capture) to keep search indexes in sync with listing/availability changes in the database.

Design a Geo-sharding Strategy based on location IDs to minimize cross-region latency.

Bonus Points

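Availability Guardrails: Pairing distributed locks (Redis/ZooKeeper) with fencing tokens, so that a booking write from a client whose lock has expired (for example, after a network partition or a long pause) is rejected instead of overwriting a newer holder's write.

Quadtree/S2 Geometry: Using the Google S2 library for spherical geometry and sharding search data by S2 cell ID. Cell IDs are ordered along a Hilbert curve, so nearby locations tend to have nearby IDs.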

Tiered Storage for Photos: Cost optimization using CDN for hot images and S3 Intelligent-Tiering for older, less-viewed listing media.

Dynamic Pricing Engine: Utilizing Flink for real-time stream processing of demand signals to suggest price adjustments to hosts.
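The fencing-token idea from the first bonus point can be sketched as follows. This is a minimal in-memory illustration, not a Redis or ZooKeeper client: `LockService` and `FencedStore` are hypothetical names, and the only property assumed of the lock service is that it hands out strictly increasing tokens.

```python
class LockService:
    """Hypothetical lock service that issues monotonically increasing tokens."""

    def __init__(self):
        self._counter = 0

    def acquire(self) -> int:
        self._counter += 1
        return self._counter


class FencedStore:
    """Toy booking store that rejects writes carrying a stale fencing token."""

    def __init__(self):
        self.highest_token = 0
        self.holder_by_listing = {}

    def write(self, token: int, listing_id: str, guest_id: str) -> bool:
        # A token lower than one already seen means the caller's lock
        # expired and a newer lock holder has since written.
        if token < self.highest_token:
            return False
        self.highest_token = token
        self.holder_by_listing[listing_id] = guest_id
        return True


locks = LockService()
store = FencedStore()

stale = locks.acquire()  # client A gets token 1, then stalls (pause or partition)
fresh = locks.acquire()  # the lock expires and client B gets token 2

ok_b = store.write(fresh, "listing-42", "guest-B")  # accepted
ok_a = store.write(stale, "listing-42", "guest-A")  # rejected: token 1 < 2
```

The check lives in the storage layer on purpose: the stalled client cannot know its lock has expired, so only the resource being written to can reliably refuse it.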

Design Breakdown

Functional Requirements

Core Use Cases:

Users can search for listings by location, date range, and price.

Users can view listing details and high-quality photos.

Users can book a listing (Reservation system).

Hosts can create and manage listings.

Scope Control:

In-Scope: Search, Booking, Listing Management.

Out-of-Scope: Payments (Third-party integration), Reviews/Ratings (secondary feature), Messaging/Chat between host and guest.

Non-Functional Requirements

Scale: Support 10M+ listings and 50k+ QPS for search (a design target with roughly 10x headroom over the 5,000 QPS peak in the Estimation section).

Latency: Search results under 200ms; booking confirmation under 1s.

Availability: 99.99% for search; 99.9% for booking.

Consistency: Strong consistency for booking transactions; eventual consistency (seconds) for search.

Security: GDPR compliance for user data; TLS for all transit.

Estimation

Traffic:

20M DAU.

Search: 5 queries/user = 100M queries/day ≈ 1,200 QPS average (Peak: 5,000 QPS).

Booking: 100k bookings/day ≈ 1.2 QPS (Peak: 50 QPS).

Storage:

10M listings. 10 images/listing (200KB each) = 20TB for photos.

Metadata: 10M listings * 5KB = 50GB.

Bandwidth:

Search results: 1,200 QPS * 50KB response = 60MB/s (Outgoing).
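The estimates above can be reproduced with a few lines of arithmetic. The variable names are illustrative, and decimal units are assumed (1 TB = 10^9 KB).

```python
SECONDS_PER_DAY = 86_400

dau = 20_000_000
searches_per_user = 5
bookings_per_day = 100_000
listings = 10_000_000
images_per_listing = 10
image_kb = 200
metadata_kb = 5
response_kb = 50
rounded_search_qps = 1_200  # the rounded average used in the bandwidth line

search_qps = dau * searches_per_user / SECONDS_PER_DAY     # about 1,157
booking_qps = bookings_per_day / SECONDS_PER_DAY           # about 1.16
photo_tb = listings * images_per_listing * image_kb / 1e9  # KB -> TB
metadata_gb = listings * metadata_kb / 1e6                 # KB -> GB
egress_mb_s = rounded_search_qps * response_kb / 1e3       # KB/s -> MB/s

print(f"search {search_qps:.0f} QPS, booking {booking_qps:.2f} QPS")
print(f"photos {photo_tb:.0f} TB, metadata {metadata_gb:.0f} GB, egress {egress_mb_s:.0f} MB/s")
```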

Blueprint

Concise Summary: A microservices-based architecture separating the read-heavy discovery path (Search Service) from the write-critical transactional path (Booking Service).

Major Components:

Search Service: Uses OpenSearch for geo-spatial indexing and filtering.

Booking Service: Manages reservation state machine using PostgreSQL.

Listing Service: Source of truth for listing metadata and availability.

Media Store: S3 and CloudFront for image delivery.

CDC/Kafka: Syncs availability changes to the Search index.

Simplicity Audit: This architecture avoids distributed transactions across services by keeping all booking writes in one RDBMS (PostgreSQL) and syncing the search index asynchronously.

Architecture Decision Rationale:

Search Efficiency: Combining geo-spatial predicates with many ad hoc filters at this read volume is hard to serve from the transactional relational store; OpenSearch is built for this kind of query.

Integrity: PostgreSQL ACID properties are non-negotiable for financial/reservation records.

Scalability: Decoupling allows search to scale horizontally without impacting the booking database.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery: CDN caches listing photos and static UI assets globally. Dynamic search results are not cached at the edge due to high filter variance.

API Gateway: Handles AuthN/AuthZ, rate limiting (100 req/sec per user), and request routing. Terminates TLS 1.3.
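One common way to enforce a per-user limit such as 100 req/sec is a token bucket. The sketch below is a single-process illustration and is not tied to any particular gateway product; the `TokenBucket` class and its `now` parameter (injected to make the behaviour testable) are assumptions for the example.

```python
import time
from typing import Optional


class TokenBucket:
    """Per-user bucket: refills at `rate` tokens/second, bursts up to `capacity`."""

    def __init__(self, rate: float, capacity: float, now: Optional[float] = None):
        self.rate = rate
        self.capacity = capacity
        self.tokens = capacity
        self.updated = time.monotonic() if now is None else now

    def allow(self, now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        # Refill in proportion to elapsed time, capped at capacity.
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


bucket = TokenBucket(rate=100, capacity=100, now=0.0)
burst = [bucket.allow(now=0.0) for _ in range(101)]  # the 101st call is refused
```

In a multi-node gateway the bucket state would have to live in shared storage (for example Redis) rather than in process memory.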

Service

Topology: Stateless services deployed across multiple Availability Zones (AZs).

Search Service:

Protocol: gRPC for internal service-to-service calls, REST for external clients.

Schema: GET /v1/search?lat=...&long=...&checkin=...&checkout=...&guests=...
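As a rough illustration, the search endpoint's parameters could translate into an OpenSearch query body like the one built below. The field names (`location`, `max_guests`, `price`, `booked_dates`), the 10 km default radius, and the function name are assumptions; `booked_dates` is assumed to be a multi-valued date field, so the `range` clause matches when any booked night falls inside the requested stay.

```python
def build_search_query(lat, lon, checkin, checkout, guests,
                       radius_km=10, max_price=None, size=20):
    """Build an OpenSearch query body (a plain dict) for a listing search."""
    filters = [
        # Keep listings within radius_km of the searched point.
        {"geo_distance": {"distance": f"{radius_km}km",
                          "location": {"lat": lat, "lon": lon}}},
        {"range": {"max_guests": {"gte": guests}}},
    ]
    if max_price is not None:
        filters.append({"range": {"price": {"lte": max_price}}})
    # Exclude listings with any booked night in [checkin, checkout).
    must_not = [{"range": {"booked_dates": {"gte": checkin, "lt": checkout}}}]
    return {
        "size": size,
        "query": {"bool": {"filter": filters, "must_not": must_not}},
    }


query = build_search_query(48.85, 2.35, "2025-07-01", "2025-07-05",
                           guests=2, max_price=200)
```

Putting every clause in `filter` context rather than `must` skips relevance scoring and lets the engine cache the clauses, which suits a latency-bound geo search.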

Booking Service:

Logic: Implements a state machine (Pending -> Confirmed -> Completed/Cancelled).

Idempotency: Uses booking_id (UUID) to prevent double-charging/double-booking on retries.

Resilience: Up to 3 retries with exponential backoff for Booking DB calls; a circuit breaker on the Search Service falls back to "cached popular listings" if OpenSearch is down.
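The state machine and the idempotency rule can be sketched together. This is an in-memory illustration with hypothetical class names; allowing a Pending booking to be cancelled directly is an assumption beyond the transitions listed above.

```python
import uuid
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    COMPLETED = "completed"
    CANCELLED = "cancelled"


# Allowed transitions. PENDING -> CANCELLED is an assumption.
TRANSITIONS = {
    Status.PENDING: {Status.CONFIRMED, Status.CANCELLED},
    Status.CONFIRMED: {Status.COMPLETED, Status.CANCELLED},
    Status.COMPLETED: set(),
    Status.CANCELLED: set(),
}


class BookingService:
    def __init__(self):
        self._bookings = {}  # booking_id -> Status

    def create(self, booking_id: str) -> Status:
        # A retried request with the same booking_id returns the existing
        # booking instead of creating a second one.
        return self._bookings.setdefault(booking_id, Status.PENDING)

    def transition(self, booking_id: str, target: Status) -> Status:
        current = self._bookings[booking_id]
        if target == current:
            return current  # replay of an already-applied transition
        if target not in TRANSITIONS[current]:
            raise ValueError(f"illegal transition {current.name} -> {target.name}")
        self._bookings[booking_id] = target
        return target


svc = BookingService()
bid = str(uuid.uuid4())  # generated by the client and reused on every retry
first = svc.create(bid)
svc.transition(bid, Status.CONFIRMED)
retry = svc.create(bid)  # returns the existing CONFIRMED booking
```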

Storage

Access Pattern: 100:1 Read/Write ratio for listings; 1:1 for bookings.

Database Table Design (Booking DB):

reservations: id (PK), listing_id (FK), guest_id (FK), start_date, end_date, status (Enum), version (for optimistic locking).
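A minimal sketch of the no-double-booking check against this table follows. It uses Python's built-in `sqlite3` purely as a stand-in so the example runs anywhere; it is not PostgreSQL code. The overlap rule is the standard one for half-open stays: two reservations conflict when each starts before the other ends. Treating `end_date` as the exclusive checkout date and the `try_reserve` helper name are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE reservations (
        id TEXT PRIMARY KEY,
        listing_id INTEGER NOT NULL,
        guest_id INTEGER NOT NULL,
        start_date TEXT NOT NULL,  -- ISO date, inclusive
        end_date TEXT NOT NULL,    -- ISO date, exclusive (checkout day)
        status TEXT NOT NULL,
        version INTEGER NOT NULL DEFAULT 0
    )
""")


def try_reserve(booking_id, listing_id, guest_id, start, end) -> bool:
    """Insert the reservation only if no active one overlaps [start, end)."""
    with conn:  # one transaction: commit on success, roll back on error
        cur = conn.execute(
            """
            INSERT INTO reservations
                (id, listing_id, guest_id, start_date, end_date, status)
            SELECT ?, ?, ?, ?, ?, 'PENDING'
            WHERE NOT EXISTS (
                SELECT 1 FROM reservations
                WHERE listing_id = ? AND status != 'CANCELLED'
                  AND start_date < ? AND ? < end_date
            )
            """,
            (booking_id, listing_id, guest_id, start, end,
             listing_id, end, start),
        )
        return cur.rowcount == 1


first = try_reserve("a", 1, 10, "2025-07-01", "2025-07-05")     # accepted
overlap = try_reserve("b", 1, 11, "2025-07-04", "2025-07-08")   # rejected
adjacent = try_reserve("c", 1, 12, "2025-07-05", "2025-07-08")  # accepted
```

SQLite serializes writers, so this check is safe here. In PostgreSQL at the default isolation level, two concurrent inserts could both pass the `NOT EXISTS` test, so the real system would additionally need a row lock on the listing, SERIALIZABLE isolation, or an exclusion constraint over a date range.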

Technical Selection:

PostgreSQL: Chosen for transactional integrity.

OpenSearch: Chosen for Geo-distance and Bounding-box query support.

Distribution:

Sharding: Postgres sharded by listing_id to ensure all reservations for one listing hit the same node, simplifying locking.

Replication: Primary-replica (1 primary, 2 read replicas per region).
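Routing by listing_id could look like the sketch below. The shard count and function name are hypothetical; a cryptographic hash is used only because it is stable across processes and machines, unlike Python's built-in `hash()` for strings.

```python
import hashlib

NUM_SHARDS = 16  # hypothetical shard count


def shard_for_listing(listing_id: int, num_shards: int = NUM_SHARDS) -> int:
    """Map a listing_id to a shard index, deterministically across processes."""
    digest = hashlib.sha256(str(listing_id).encode()).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


# Every reservation for listing 42 is routed to the same shard.
shard = shard_for_listing(42)
```

Plain modulo hashing moves most keys when `num_shards` changes; consistent hashing or a directory of listing ranges is a common alternative if resharding is expected.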

Cache

Purpose: Reduce Listing DB load and store session data.

Schema:

Key: listing_meta:{id}, Value: JSON string of listing details. TTL: 1 hour.

Key: availability:{id}:{month}, Value: Bitfield/Bitmap of booked dates.

Failure Handling: Use the cache-aside pattern, with Postgres as the source of truth. If Redis is unavailable, reads go straight to Postgres.
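The two cache entries can be illustrated with a plain dict standing in for Redis. The helper names are hypothetical, and representing a month as an integer with one bit per day is one possible encoding of the bitmap.

```python
import json


def get_listing(listing_id, cache, load_from_db):
    """Cache-aside read: try the cache, fall back to the DB, then populate."""
    key = f"listing_meta:{listing_id}"
    cached = cache.get(key)
    if cached is not None:
        return json.loads(cached)
    listing = load_from_db(listing_id)
    cache[key] = json.dumps(listing)  # a real Redis SET would also set the 1-hour TTL
    return listing


def day_mask(first_day: int, last_day: int) -> int:
    """Bitmask with bits set for days first_day..last_day (1-based, inclusive)."""
    mask = 0
    for day in range(first_day, last_day + 1):
        mask |= 1 << (day - 1)
    return mask


def mark_booked(bitmap: int, first_day: int, last_day: int) -> int:
    return bitmap | day_mask(first_day, last_day)


def is_free(bitmap: int, first_day: int, last_day: int) -> bool:
    return (bitmap & day_mask(first_day, last_day)) == 0


july = mark_booked(0, 10, 12)  # nights of the 10th, 11th and 12th are taken
```

A full month fits in 31 bits, so checking a date range is a single AND against the stored value.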

Messaging

Purpose: Decouple the source of truth (DB) from the search index.

Event Schema: listing_updated or booking_confirmed. Includes listing_id, availability_dates, price.

Technical Selection: Kafka. High throughput and message retention allow for re-indexing if the search cluster needs to be rebuilt.
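One possible shape for the event payload is shown below. The `version` field and the choice to key messages by listing_id are assumptions: keying this way relies on Kafka's default behaviour of sending records with the same key to the same partition, which keeps each listing's events in order.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class ListingEvent:
    event_type: str  # "listing_updated" or "booking_confirmed"
    listing_id: int
    availability_dates: List[str] = field(default_factory=list)
    price: Optional[float] = None
    version: int = 0  # assumption: lets consumers discard stale updates

    def key(self) -> bytes:
        # Same key -> same partition, so one listing's events stay ordered.
        return str(self.listing_id).encode()

    def value(self) -> bytes:
        return json.dumps(asdict(self), sort_keys=True).encode()


event = ListingEvent("booking_confirmed", 42, ["2025-07-01", "2025-07-02"], 120.0, 3)
payload = json.loads(event.value())
```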

Data Processing

Processing Model: Streaming via Flink.

DAG: Consumes from Kafka -> Enriches listing data with host info -> Formats for OpenSearch -> Bulk index update.

Late Data: Flink event-time watermarks bound how long the job waits for out-of-order events arriving from different regions.
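The stages of this DAG can be mimicked in plain Python to show the data shapes involved. This is not Flink code: the function names, the `updated_at` and `host_id` fields, and the `listings` index name are assumptions. The output follows the newline-delimited format of the OpenSearch `_bulk` API, where each document is preceded by an action line.

```python
import json


def latest_per_listing(events):
    """Keep only the newest event per listing, tolerating out-of-order input."""
    latest = {}
    for ev in events:
        prev = latest.get(ev["listing_id"])
        if prev is None or ev["updated_at"] >= prev["updated_at"]:
            latest[ev["listing_id"]] = ev
    return list(latest.values())


def enrich(event, hosts):
    """Attach host info; `hosts` stands in for a lookup against host data."""
    return {**event, "host": hosts.get(event["host_id"], {})}


def to_bulk_body(docs, index="listings"):
    """Render documents as an NDJSON _bulk request body."""
    lines = []
    for doc in docs:
        action = {"index": {"_index": index, "_id": str(doc["listing_id"])}}
        lines.append(json.dumps(action))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the body must end with a newline


events = [
    {"listing_id": 1, "host_id": "h1", "price": 120, "updated_at": 2},
    {"listing_id": 1, "host_id": "h1", "price": 100, "updated_at": 1},  # late
    {"listing_id": 2, "host_id": "h2", "price": 80, "updated_at": 1},
]
hosts = {"h1": {"name": "Ana"}, "h2": {"name": "Ben"}}
docs = [enrich(ev, hosts) for ev in latest_per_listing(events)]
body = to_bulk_body(docs)
```

Using the listing ID as the document `_id` makes replays harmless, since re-indexing the same event overwrites the document rather than duplicating it.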

Infrastructure (Optional)

Observability: Prometheus for metrics (QPS, Latency), Grafana for dashboards.

Distributed Coordination: Not used in MVP; standard DB locking suffices.

Wrap Up

Advanced Topics

Trade-offs: We chose Eventual Consistency for search. A listing might show as available in search for a few seconds after being booked. We mitigate this by re-validating availability in the Booking Service before finalizing.

Reliability: Multi-AZ deployment for all databases. Daily snapshots to S3.

Hotspot Management: Popular listings (e.g., "The Barbie Dreamhouse") can cause DB contention. Mitigation: keep an availability bitmap for hot listings in Redis and reject requests for already-booked dates before they reach the DB. The DB remains the final authority on whether a booking succeeds.

Security: PII (Guest emails/phones) is encrypted at rest using AES-256. Access restricted via IAM roles.

Staff-Level Insight: To handle "Global Search" efficiently, we can use Geo-Partitioning where search queries are routed to the regional OpenSearch cluster closest to the searched coordinates, rather than the user's location, reducing cross-continent data transfer.
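A minimal sketch of routing by searched coordinates, assuming three hypothetical regional clusters, each represented by a single centroid. A production router would more likely map S2 cells or country codes to clusters; nearest-centroid is used here only to keep the example short.

```python
import math

# Hypothetical regional OpenSearch clusters with representative (lat, lon) centroids.
CLUSTERS = {
    "us-east": (39.0, -77.5),
    "eu-west": (53.3, -6.3),
    "ap-southeast": (1.35, 103.8),
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in kilometres."""
    r = 6371.0  # mean Earth radius
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(math.sqrt(a))


def cluster_for_search(lat, lon):
    """Pick the cluster nearest the searched location, not the user's location."""
    return min(CLUSTERS, key=lambda name: haversine_km(lat, lon, *CLUSTERS[name]))


# A user sitting in New York who searches for stays in Paris is routed to eu-west.
target = cluster_for_search(48.85, 2.35)
```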