DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.DowngradedOur downstream service providers are currently experiencing outages, and our engineering team is actively working on a resolution. Some services—including the Solver, Partner, and Tools—are temporarily degraded with higher latency and lower bandwidth. Rest assured, Intervipedia, Solutions, and the Question Bank features are not impacted and remain fully operational.
The Question
Design

Design a Proximity-Based Discovery Service

Design a high-scale system similar to Yelp or Google Maps Places that allows users to discover businesses 'near me'. The system must support searching millions of businesses with sub-second latency, handling high-volume user reviews, and scaling to tens of thousands of requests per second. Explain your strategy for spatial indexing, how you handle data consistency between search results and reviews, and your approach to sharding a global geographical dataset.
PostgreSQL
PostGIS
Redis
Kafka
NoSQL
Geohash
CDN
API Gateway
Questions & Insights

Clarifying Questions

Scale: What is the expected number of businesses and daily active users (DAU)?
Assumption: 10 million businesses globally, 50 million DAU, and 100k search QPS at peak.
Search Requirements: Does the search need to support complex filters (e.g., "open now", price range, rating) or just proximity?
Assumption: Must support proximity (lat/long) plus basic filtering (category, rating).
Update Frequency: How often is business information (location, hours) updated vs. how often are reviews added?
Assumption: Business metadata is static (updates hourly/daily), but reviews are high-velocity (thousands per minute).
Geography: Is the search global or restricted to specific regions?
Assumption: Global support with sub-second latency for local results.

Thinking Process

Core Bottleneck: Efficiently querying spatial data (latitude/longitude) across millions of records without a full table scan.
Strategy Questions:
How do we convert 2D Earth coordinates into a 1D searchable index? (Geohash vs. Quadtree).
How do we handle the massive read-to-write ratio (100:1) for search vs. business updates?
How do we ensure search results are fresh when new reviews or businesses are added?
How do we shard the database to prevent "hotspots" in dense cities like New York or London?

Bonus Points

Precision vs. Recall: Implementing a multi-stage ranking pipeline—retrieving candidates via Geohash and re-ranking them based on a hybrid score (distance, rating, relevance).
Dynamic Grid Sizing: Adjusting Geohash precision based on density (e.g., use 6-character hashes for cities, 4-character for rural areas) to optimize query performance.
Write-Behind Caching: Using CDC (Change Data Capture) from the Business DB to update the Search Index asynchronously to keep the read-path ultra-fast.
Availability Zones: Multi-region deployment with latency-based routing to ensure users hit the closest data center for "Near Me" searches.
Design Breakdown

Functional Requirements

Core Use Cases:
Users can search for businesses based on current location or a specific city.
Users can view business details (photos, hours, description).
Users can post reviews and ratings.
Business owners can add/update their business profile.
Scope Control:
In-scope: Proximity search, Business metadata management, Review system.
Out-of-scope: Reservation system, Social feed (friends), Real-time map navigation, Advertising platform.

Non-Functional Requirements

Scale: Support 10M businesses and 100k peak QPS.
Latency: Search results must return in < 200ms.
Availability & Reliability: 99.99% availability (critical for a discovery app).
Consistency: Eventual consistency is acceptable for reviews and business updates; high availability is prioritized (AP over CP).
Fault Tolerance: No single point of failure; regional failover capability.

Estimation

Traffic:
50M DAU * 5 searches/day = 250M searches/day.
Average QPS: ~3,000. Peak QPS: ~10,000 - 15,000.
Storage:
Business: 10M * 2KB = 20GB.
Reviews: 10M businesses 100 reviews/business 1KB = 1TB.
Total Storage: ~1.1TB (easily fits in modern distributed DBs).
Bandwidth:
Peak Search: 15k QPS * 5KB (response) = 75MB/s outgoing.

Blueprint

Concise Summary: A geo-distributed architecture utilizing Geohashing for spatial indexing and a Read/Write split to handle high search volumes.
Major Components:
API Gateway: Entry point for authentication, rate limiting, and request routing.
Search Service: Computes Geohashes and queries the spatial index to find nearby business IDs.
Business Service: Manages CRUD operations for business metadata and media.
Review Service: Handles the high-volume ingestion of user ratings and text reviews.
Geohash Index (Redis): In-memory store for mapping Geohashes to lists of Business IDs for sub-millisecond lookups.
Simplicity Audit: This design avoids complex Quadtree rebalancing in-memory by using Geohashes, which can be stored in standard, horizontally scalable KV stores.
Architecture Decision Rationale:
Why Geohash?: It allows us to treat spatial search as a simple prefix-match or KV lookup, which is highly compatible with cloud-native caching and sharding.
Functional Satisfaction: Covers search, view, and review flows.
Non-functional Satisfaction: Redis ensures low latency, while sharded PostgreSQL/NoSQL ensures scalability and storage reliability.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing:
CDN: Use Cloudflare/CloudFront to cache business photos and static metadata.
DNS: Latency-based routing to direct users to the nearest AWS/GCP region.
Security & Perimeter:
API Gateway: Handles JWT-based AuthN/AuthZ.
Rate Limiting: 100 requests/minute per user to prevent scraping of business data.

Service

Topology & Scaling:
Stateless Services: All services (Search, Business, Review) are deployed as Docker containers on Kubernetes (EKS), scaling based on CPU and Request Count.
API Schema Design:
GET /v1/search?lat=x&long=y&radius=5km&category=restaurant
Protocol: REST
Idempotency: Yes (GET)
POST /v1/reviews
Request: { business_id: uuid, rating: int, comment: string }
Idempotency: Supported via client_request_id header.
Resilience & Reliability:
Retries: Exponential backoff for Review submission.
Circuit Breaker: If the Search Cache (Redis) is down, the Search Service falls back to querying the Primary DB with a strict timeout.

Storage

Access Pattern:
Business Data: Heavy Read, Low Write. Relational integrity needed.
Reviews: High Write, High Read. Time-series nature.
Database Table Design:
Business Table: id (PK), name, address, lat, long, geohash, rating_avg.
Review Table: id (PK), business_id (FK), user_id, rating, comment, created_at.
Technical Selection:
PostgreSQL (with PostGIS): Chosen for the Business DB for ACID compliance and powerful spatial indexing functions.
Cassandra / DynamoDB: Chosen for the Review Store to handle massive write throughput and horizontal scaling.
Distribution Logic:
Shard Business DB by Geohash prefix (e.g., first 3 characters) to keep local businesses on the same shard.

Cache

Purpose & Justification: Redis is used to store Geohash -> [Business IDs] mappings to avoid expensive PostGIS queries on every search.
Key-Value Schema:
Key: geo:v1:<geohash_prefix> (e.g., geo:v1:9q8yy)
Value: Sorted Set (ZSET) where Score = distance or rating, Value = BusinessID.
Failure Handling: If Redis fails, the system bypasses the cache and queries PostGIS directly (Graceful Degradation).

Messaging

Purpose & Decoupling: Kafka decouples the Review Service from the Review Storage and Search Index updates.
Event Schema: ReviewCreatedEvent { business_id, new_rating, timestamp }.
Failure Handling: Dead-letter queues (DLQ) for failed rating aggregations.
Technical Selection: Kafka for high-throughput and replayability (re-calculating average ratings if needed).

Data Processing

Processing Model: Stream processing using Kafka Streams or a lightweight worker (Review Processor).
Processing DAG:
Review Stream -> Update Average Rating in Business DB -> Invalidate/Update Search Cache.
Technical Selection: A Custom Service (Go/Java) is sufficient for MVP instead of a heavy Spark cluster.
Wrap Up

Advanced Topics

Trade-offs:
Geohash vs. Quadtree: Geohash was chosen for its simplicity in sharding. Quadtrees offer better precision for varying densities but are harder to maintain in a distributed environment.
Consistency: We chose Eventual Consistency for reviews. A user might not see their review immediately in the "Average Rating," but this allows the system to remain highly available.
Bottleneck Analysis:
Hotspots: A celebrity restaurant might get thousands of reviews. Sharding by business_id in the Review Store mitigates this.
Dense Areas: Manhattan has more businesses than the Sahara. We use variable-length Geohash prefixes to ensure shards remain balanced.
Security:
Use TLS 1.3 for all transit.
PII (user emails in reviews) is encrypted at rest in the NoSQL store.