The Question
Design

Scalable Marketplace & Reservation System Design

Design a global vacation rental platform like Airbnb. The system must support millions of listings and handle high-concurrency search and booking flows. Key challenges include ensuring strict consistency for reservations to prevent double-booking, low-latency geo-spatial search across millions of records, and managing high-volume media assets. Discuss your approach to data consistency, search indexing strategy, and how you would handle scaling to a global user base.
PostgreSQL
OpenSearch
Redis
Kafka
Flink
S3
CloudFront
gRPC
S2 Geometry
CDC
Questions & Insights

Clarifying Questions

Scale: What is the target scale in terms of Daily Active Users (DAU) and total listings?
Assumption: 20M DAU, 10M listings, 100k bookings/day.
Geography: Is this a global system or regional?
Assumption: Global platform requiring low-latency search and geo-distributed availability.
Search Latency: What is the SLA for search results?
Assumption: P99 < 200ms for geo-spatial queries with filters.
Consistency: How critical is real-time availability in search?
Assumption: Search results can be eventually consistent (seconds delay), but booking MUST be strictly consistent (no double-bookings).

Thinking Process

Core Bottleneck: High-read volume for search vs. high-integrity requirements for the booking state machine.
Strategy Steps:
Establish a Read-Optimized Search Path using geo-spatial indexing (Elasticsearch/OpenSearch) to handle complex filters.
Implement a Write-Critical, Transaction-Safe Booking Path using an RDBMS with row-level locking or distributed locks to prevent overbooking.
Use an Event-Driven Bridge (Change Data Capture) to keep search indexes in sync with listing/availability changes in the database.
Design a Geo-sharding Strategy based on location IDs to minimize cross-region latency.

Bonus Points

Availability Guardrails: Pairing distributed locks (Redis/ZooKeeper) with fencing tokens, so that a lock holder whose lease expired during a network partition (the "split-brain" case) cannot commit a stale write in the booking flow.
Quadtree/S2 Geometry: Using Google S2 library for efficient spherical geometry math to shard search data by Hilbert Curve cell IDs.
Tiered Storage for Photos: Cost optimization using CDN for hot images and S3 Intelligent-Tiering for older, less-viewed listing media.
Dynamic Pricing Engine: Utilizing Flink for real-time stream processing of demand signals to suggest price adjustments to hosts.
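The fencing-token idea from the first bullet can be sketched in a few lines. This is a minimal in-memory illustration, not a real Redis or ZooKeeper client: LockService and FencedStore are hypothetical names, and lease expiry is only simulated by granting the lock a second time.

```python
class LockService:
    """Hypothetical lock service: every grant carries a strictly increasing token."""

    def __init__(self) -> None:
        self._counter = 0

    def acquire(self) -> int:
        # A real service would also track lease expiry; here we only issue tokens.
        self._counter += 1
        return self._counter


class FencedStore:
    """Hypothetical storage layer that rejects writes carrying a stale token."""

    def __init__(self) -> None:
        self._highest_token = 0
        self.value = None

    def write(self, token: int, value: str) -> bool:
        if token < self._highest_token:
            return False  # an older lock holder woke up late; refuse its write
        self._highest_token = token
        self.value = value
        return True


locks = LockService()
store = FencedStore()

token_a = locks.acquire()  # client A gets the lock, then stalls (e.g. partition)
token_b = locks.acquire()  # A's lease is assumed expired; client B gets the lock

accepted_b = store.write(token_b, "booked by B")
accepted_a = store.write(token_a, "booked by A")  # stale token, rejected
```

The important property is that the check happens in the storage layer, so a paused client cannot overwrite newer state even if it still believes it holds the lock.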
Design Breakdown

Functional Requirements

Core Use Cases:
Users can search for listings by location, date range, and price.
Users can view listing details and high-quality photos.
Users can book a listing (Reservation system).
Hosts can create and manage listings.
Scope Control:
In-Scope: Search, Booking, Listing Management.
Out-of-Scope: Payments (Third-party integration), Reviews/Ratings (secondary feature), Messaging/Chat between host and guest.

Non-Functional Requirements

Scale: Support 10M+ listings and provision search for 50k+ QPS, roughly 10x the 5,000 QPS peak in the Estimation section, to leave headroom for growth.
Latency: Search results under 200ms; booking confirmation under 1s.
Availability: 99.99% for search; 99.9% for booking.
Consistency: Strong consistency for booking transactions; eventual consistency (seconds) for search.
Security: GDPR compliance for user data; TLS for all transit.

Estimation

Traffic:
20M DAU.
Search: 5 queries/user = 100M queries/day ≈ 1,200 QPS average (Peak: 5,000 QPS).
Booking: 100k bookings/day ≈ 1.2 QPS (Peak: 50 QPS).
Storage:
10M listings. 10 images/listing (200KB each) = 20TB for photos.
Metadata: 10M listings * 5KB = 50GB.
Bandwidth:
Search results: 1,200 QPS * 50KB response = 60MB/s (Outgoing).
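The figures above can be re-derived with a short script. This is only a back-of-the-envelope check using the stated assumptions and decimal units (1 KB = 1,000 bytes); the variable names are illustrative.

```python
SECONDS_PER_DAY = 86_400

dau = 20_000_000
searches_per_user = 5
bookings_per_day = 100_000
listings = 10_000_000

# Traffic
search_qps = dau * searches_per_user / SECONDS_PER_DAY   # about 1,157, rounded to 1,200
booking_qps = bookings_per_day / SECONDS_PER_DAY         # about 1.16, rounded to 1.2

# Storage (decimal units)
photo_bytes = listings * 10 * 200 * 1_000                # 10 images of 200 KB each
metadata_bytes = listings * 5 * 1_000                    # 5 KB per listing

# Bandwidth, using the rounded 1,200 QPS figure and 50 KB per response
egress_bytes_per_sec = 1_200 * 50 * 1_000

print(f"search QPS   ~ {search_qps:,.0f}")
print(f"booking QPS  ~ {booking_qps:.2f}")
print(f"photos       = {photo_bytes / 1e12:.0f} TB")
print(f"metadata     = {metadata_bytes / 1e9:.0f} GB")
print(f"egress       = {egress_bytes_per_sec / 1e6:.0f} MB/s")
```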

Blueprint

Concise Summary: A microservices-based architecture separating the read-heavy discovery path (Search Service) from the write-critical transactional path (Booking Service).
Major Components:
Search Service: Uses OpenSearch for geo-spatial indexing and filtering.
Booking Service: Manages reservation state machine using PostgreSQL.
Listing Service: Source of truth for listing metadata and availability.
Media Store: S3 and CloudFront for image delivery.
CDC/Kafka: Syncs availability changes to the Search index.
Simplicity Audit: This architecture avoids complex distributed transactions across services by using a central RDBMS for bookings and an async sync mechanism for search.
Architecture Decision Rationale:
Search Efficiency: Relational DBs struggle with geo-spatial complex filters at scale; OpenSearch is purpose-built for this.
Integrity: PostgreSQL ACID properties are non-negotiable for financial/reservation records.
Scalability: Decoupling allows search to scale horizontally without impacting the booking database.
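To make the search rationale concrete, here is a sketch of the kind of request body the Search Service might send to OpenSearch. It uses the standard bool/filter, geo_distance, and range query clauses; the field names location, price, and max_guests are assumptions about the index mapping, and date availability filtering is left out.

```python
from typing import Optional


def build_search_query(
    lat: float,
    lon: float,
    radius_km: int,
    max_price: Optional[float] = None,
    guests: Optional[int] = None,
    size: int = 20,
) -> dict:
    """Build an OpenSearch query body: geo-distance filter plus optional ranges."""
    filters = [
        {
            "geo_distance": {
                "distance": f"{radius_km}km",
                "location": {"lat": lat, "lon": lon},  # assumed geo_point field
            }
        }
    ]
    if max_price is not None:
        filters.append({"range": {"price": {"lte": max_price}}})
    if guests is not None:
        filters.append({"range": {"max_guests": {"gte": guests}}})
    # Filter context skips relevance scoring, which keeps these clauses cacheable.
    return {"size": size, "query": {"bool": {"filter": filters}}}


query = build_search_query(48.8566, 2.3522, radius_km=10, max_price=250, guests=2)
```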

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery: CDN caches listing photos and static UI assets globally. Dynamic search results are not cached at the edge due to high filter variance.
API Gateway: Handles AuthN/AuthZ, rate limiting (100 req/sec per user), and request routing. Terminates TLS 1.3.
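A per-user limit such as 100 req/sec is commonly enforced with a token bucket. The sketch below is one possible in-process version, assuming one bucket per user ID; the TokenBucket class is illustrative and not tied to any particular gateway product.

```python
import time
from typing import Callable


class TokenBucket:
    """Token bucket: refills at `rate` tokens/sec up to `capacity`."""

    def __init__(
        self, rate: float, capacity: float, now: Callable[[], float] = time.monotonic
    ) -> None:
        self.rate = rate
        self.capacity = capacity
        self._now = now  # injectable clock, which makes the limiter testable
        self._tokens = float(capacity)
        self._last = now()

    def allow(self) -> bool:
        current = self._now()
        elapsed = current - self._last
        self._last = current
        # Refill for the elapsed time, never exceeding the bucket capacity.
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        if self._tokens >= 1:
            self._tokens -= 1
            return True
        return False


# Usage with a fake clock: a burst of 150 requests, then half a second of refill.
fake_time = [0.0]
bucket = TokenBucket(rate=100, capacity=100, now=lambda: fake_time[0])
burst_allowed = sum(bucket.allow() for _ in range(150))
fake_time[0] += 0.5
refill_allowed = sum(bucket.allow() for _ in range(100))
```

In a multi-node gateway the bucket state would need to live in a shared store; this sketch only shows the accounting.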

Service

Topology: Stateless services deployed across multiple Availability Zones (AZs).
Search Service:
Protocol: gRPC for internal, REST for external.
Schema: GET /v1/search?lat=...&long=...&checkin=...&checkout=...&guests=...
Booking Service:
Logic: Implements a state machine (Pending -> Confirmed -> Completed/Cancelled).
Idempotency: Uses booking_id (UUID) as an idempotency key so that retries cannot create a duplicate booking or charge.
Resilience: Up to 3 retries with exponential backoff for Booking DB calls; a circuit breaker on the Search Service falls back to "cached popular listings" if OpenSearch is down.
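The booking state machine and the idempotency key can be sketched together. This is an in-memory illustration only: BookingStore is a hypothetical stand-in for the Booking DB, and allowing Pending to move directly to Cancelled is an assumption that the text does not state.

```python
import uuid
from enum import Enum


class Status(Enum):
    PENDING = "pending"
    CONFIRMED = "confirmed"
    COMPLETED = "completed"
    CANCELLED = "cancelled"


# Allowed transitions; COMPLETED and CANCELLED are terminal states.
ALLOWED = {
    Status.PENDING: {Status.CONFIRMED, Status.CANCELLED},  # PENDING->CANCELLED assumed
    Status.CONFIRMED: {Status.COMPLETED, Status.CANCELLED},
    Status.COMPLETED: set(),
    Status.CANCELLED: set(),
}


class BookingStore:
    """Hypothetical in-memory store keyed by booking_id."""

    def __init__(self) -> None:
        self._bookings = {}

    def create(self, booking_id: str, listing_id: int) -> dict:
        # A retry with the same booking_id returns the existing record unchanged.
        return self._bookings.setdefault(
            booking_id, {"listing_id": listing_id, "status": Status.PENDING}
        )

    def transition(self, booking_id: str, new_status: Status) -> dict:
        booking = self._bookings[booking_id]
        if new_status not in ALLOWED[booking["status"]]:
            raise ValueError(f"illegal transition {booking['status']} -> {new_status}")
        booking["status"] = new_status
        return booking

    def count(self) -> int:
        return len(self._bookings)


store = BookingStore()
booking_id = str(uuid.uuid4())
first = store.create(booking_id, listing_id=42)
retry = store.create(booking_id, listing_id=42)  # same key, no duplicate row
store.transition(booking_id, Status.CONFIRMED)
```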

Storage

Access Pattern: 100:1 Read/Write ratio for listings; 1:1 for bookings.
Database Table Design (Booking DB):
reservations: id (PK), listing_id (FK), guest_id (FK), start_date, end_date, status (Enum), version (for optimistic locking).
Technical Selection:
PostgreSQL: Chosen for transactional integrity.
OpenSearch: Chosen for Geo-distance and Bounding-box query support.
Distribution:
Sharding: Postgres sharded by listing_id to ensure all reservations for one listing hit the same node, simplifying locking.
Replication: Primary-replica (1 primary, 2 read replicas per region).
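The core of the no-double-booking guarantee is an overlap check and an insert inside one transaction. The sketch below uses Python's built-in sqlite3 purely as a runnable stand-in for PostgreSQL; the book function and the half-open interval convention (checkout day is free for the next guest) are assumptions. In PostgreSQL the same effect would typically come from row locking on the listing or from an exclusion constraint over a date range.

```python
import sqlite3

# isolation_level=None puts sqlite3 in autocommit mode so we control BEGIN/COMMIT.
conn = sqlite3.connect(":memory:", isolation_level=None)
conn.execute(
    """CREATE TABLE reservations (
           id TEXT PRIMARY KEY,
           listing_id INTEGER NOT NULL,
           guest_id INTEGER NOT NULL,
           start_date TEXT NOT NULL,   -- ISO dates compare correctly as text
           end_date TEXT NOT NULL,
           status TEXT NOT NULL,
           version INTEGER NOT NULL DEFAULT 1
       )"""
)

# Two half-open ranges [start, end) overlap when each starts before the other ends.
OVERLAP_SQL = """SELECT 1 FROM reservations
                 WHERE listing_id = ? AND status IN ('pending', 'confirmed')
                   AND start_date < ? AND end_date > ?
                 LIMIT 1"""


def book(conn, booking_id, listing_id, guest_id, start_date, end_date) -> bool:
    """Insert a reservation unless it overlaps an active one for the listing."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        conflict = conn.execute(OVERLAP_SQL, (listing_id, end_date, start_date)).fetchone()
        if conflict is not None:
            conn.execute("ROLLBACK")
            return False
        conn.execute(
            "INSERT INTO reservations (id, listing_id, guest_id, start_date, end_date, status) "
            "VALUES (?, ?, ?, ?, ?, 'confirmed')",
            (booking_id, listing_id, guest_id, start_date, end_date),
        )
        conn.execute("COMMIT")
        return True
    except sqlite3.Error:
        if conn.in_transaction:
            conn.execute("ROLLBACK")
        raise


first = book(conn, "b1", 42, 1, "2025-07-01", "2025-07-05")
overlapping = book(conn, "b2", 42, 2, "2025-07-04", "2025-07-08")
back_to_back = book(conn, "b3", 42, 3, "2025-07-05", "2025-07-08")
```

Because all reservations for a listing live on one shard, this check never has to span nodes.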

Cache

Purpose: Reduce Listing DB load and store session data.
Schema:
Key: listing_meta:{id}, Value: JSON string of listing details. TTL: 1 hour.
Key: availability:{id}:{month}, Value: Bitfield/Bitmap of booked dates.
Failure Handling: If Redis fails, fall back to Postgres. Use "Cache Aside" pattern.
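The two cache entries can be illustrated without a Redis server. In this sketch a plain dict stands in for Redis, a Python int stands in for the bitmap value, and the YYYY-MM form of the {month} key segment is an assumption.

```python
from typing import Callable


def availability_key(listing_id: int, year: int, month: int) -> str:
    return f"availability:{listing_id}:{year:04d}-{month:02d}"


def mark_booked(bitmap: int, first_day: int, last_day: int) -> int:
    """Set the bits for days first_day..last_day inclusive (day 1 is bit 0)."""
    for day in range(first_day, last_day + 1):
        bitmap |= 1 << (day - 1)
    return bitmap


def is_free(bitmap: int, first_day: int, last_day: int) -> bool:
    """True when none of the days first_day..last_day inclusive are booked."""
    width = last_day - first_day + 1
    mask = ((1 << width) - 1) << (first_day - 1)
    return (bitmap & mask) == 0


def get_listing_meta(listing_id: int, cache: dict, load_from_db: Callable[[int], dict]) -> dict:
    """Cache-aside read: try the cache, fall back to the DB, then populate the cache."""
    key = f"listing_meta:{listing_id}"
    if key in cache:
        return cache[key]
    value = load_from_db(listing_id)
    cache[key] = value  # with real Redis this write would carry the 1 hour TTL
    return value


july = mark_booked(0, 10, 12)  # nights of the 10th through the 12th are taken

db_calls = []
cache = {}


def fake_db(listing_id: int) -> dict:
    db_calls.append(listing_id)
    return {"id": listing_id, "title": "example listing"}


get_listing_meta(7, cache, fake_db)
get_listing_meta(7, cache, fake_db)  # second read is served from the cache
```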

Messaging

Purpose: Decouple the source of truth (DB) from the search index.
Event Schema: listing_updated or booking_confirmed. Includes listing_id, availability_dates, price.
Technical Selection: Kafka. High throughput and message retention allow for re-indexing if the search cluster needs to be rebuilt.
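One way to pin down the event schema is a small typed record plus a serializer. The sketch below is an assumption about the wire format (JSON, with listing_id as the message key); it does not use a Kafka client. Keying by listing_id means all events for a listing go to the same partition, where Kafka preserves their order.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class ListingEvent:
    event_type: str                # "listing_updated" or "booking_confirmed"
    listing_id: int
    availability_dates: List[str]  # ISO dates; the exact encoding is assumed
    price: float


def encode(event: ListingEvent) -> Tuple[bytes, bytes]:
    """Return (key, value) bytes suitable for handing to a producer."""
    key = str(event.listing_id).encode("utf-8")
    value = json.dumps(asdict(event), sort_keys=True).encode("utf-8")
    return key, value


key, value = encode(
    ListingEvent(
        event_type="booking_confirmed",
        listing_id=42,
        availability_dates=["2025-07-01", "2025-07-02"],
        price=180.0,
    )
)
```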

Data Processing

Processing Model: Streaming via Flink.
DAG: Consumes from Kafka -> Enriches listing data with host info -> Formats for OpenSearch -> Bulk index update.
Late Data: Flink watermarks handle out-of-order events from different regions.
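The enrich and format steps of the DAG can be sketched in plain Python. This is not Flink code; it only shows the per-record transformations and the newline-delimited body that the OpenSearch bulk API expects. The host_id field, the hosts lookup table, and the listings index name are assumptions.

```python
import json
from typing import Dict, Iterable, List


def enrich(event: dict, hosts: Dict[int, dict]) -> dict:
    """Attach host info to a listing event (empty dict if the host is unknown)."""
    return {**event, "host": hosts.get(event["host_id"], {})}


def to_bulk_body(docs: Iterable[dict], index: str = "listings") -> str:
    """Format documents as an NDJSON bulk request: one action line, one source line."""
    lines: List[str] = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc["listing_id"])}}))
        lines.append(json.dumps(doc))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline


hosts = {9: {"name": "Ada", "superhost": True}}
events = [
    {"listing_id": 1, "host_id": 9, "price": 120.0},
    {"listing_id": 2, "host_id": 5, "price": 95.0},
]
body = to_bulk_body(enrich(e, hosts) for e in events)
```

Indexing with the listing ID as the document ID makes replays idempotent, which fits the re-indexing use of Kafka retention described under Messaging.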

Infrastructure (Optional)

Observability: Prometheus for metrics (QPS, Latency), Grafana for dashboards.
Distributed Coordination: Not used in MVP; standard DB locking suffices.
Wrap Up

Advanced Topics

Trade-offs: We chose Eventual Consistency for search. A listing might show as available in search for a few seconds after being booked. We mitigate this by re-validating availability in the Booking Service before finalizing.
Reliability: Multi-AZ deployment for all databases. Daily snapshots to S3.
Hotspot Management: Popular listings (e.g., "The Barbie Dreamhouse") can cause DB contention. Mitigation: Keep a frequently refreshed availability bitmask in Redis so that requests for already-booked dates are rejected before they hit the DB.
Security: PII (Guest emails/phones) is encrypted at rest using AES-256. Access restricted via IAM roles.
Staff-Level Insight: To handle "Global Search" efficiently, we can use Geo-Partitioning where search queries are routed to the regional OpenSearch cluster closest to the searched coordinates, rather than the user's location, reducing cross-continent data transfer.
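The routing rule in the Staff-Level Insight can be illustrated with a deliberately crude router. The three cluster names and the longitude bands below are hypothetical; a real deployment would more likely map S2 cell IDs or a country lookup to clusters.

```python
def route_search(lat: float, lon: float) -> str:
    """Pick a regional search cluster from the SEARCHED coordinates.

    The longitude bands are illustrative assumptions; latitude is validated
    but not used by this simple banding scheme.
    """
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lon <= 180.0):
        raise ValueError("coordinates out of range")
    if lon < -30.0:
        return "search-americas"
    if lon < 60.0:
        return "search-emea"
    return "search-apac"


# A user sitting in New York who searches for Paris is routed to the EMEA
# cluster, because routing depends on the searched location, not the user's.
paris_cluster = route_search(48.8566, 2.3522)
new_york_cluster = route_search(40.7128, -74.0060)
tokyo_cluster = route_search(35.6762, 139.6503)
```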