The Question

Design a Scalable Autocomplete System

Design a search autocomplete (typeahead) system similar to Google Search. The system must support 5 billion searches per day with sub-100ms latency. Your design should handle prefix matching, suggest the top 5 most frequent completions, and provide a mechanism for updating suggestions as new search trends emerge. Explain your choices for data structures, storage, and how you would maintain performance at peak loads of 600,000 QPS.

Redis

Apache Kafka

Apache Spark

Cassandra

NoSQL

Trie

gRPC

Anycast DNS

Questions & Insights

Clarifying Questions

Scale & Traffic: What is the scale of the system? (Assumption: Google scale, ~5 billion searches per day, ~300k QPS average, ~600k QPS peak).

Latency Requirement: What is the target P99 latency for suggestions? (Assumption: Real-time feel requires < 100ms latency from client to response).

Freshness: How quickly should new trending queries appear in suggestions? (Assumption: T+1 hour for general updates; real-time not required for MVP).

Personalization: Do we need to account for user history? (Assumption: Out of scope for MVP; focus on global/regional popularity).

Language/Localization: Does the system support multiple languages and regions? (Assumption: Global support, but results are partitioned by language/locale).

Thinking Process

The core challenge is balancing extreme read throughput with a massive, evolving dataset.

Core Bottleneck: How to retrieve top-k suggestions for a prefix in < 10ms (server-side processing time).

Data Structure: Use a Trie (Prefix Tree) for logical representation, but optimize storage as a Key-Value pair (Prefix -> List of Suggestions) for O(1) retrieval.

Update Strategy: Decouple the query path from the ingestion path. Use an offline/near-real-time pipeline to aggregate logs and rebuild the Trie/KV store.

Scaling: Shard the Trie based on the prefix (e.g., hash of the first 2-3 characters) to distribute load across multiple nodes.

Bonus Points

Browser-Side Optimization: Implement "debouncing" on the client to prevent firing requests on every keystroke and leverage local localStorage for the top 100 most frequent prefixes.

Probabilistic Data Structures: Use Count-Min Sketch during the ingestion phase to estimate heavy hitters (trending queries) with minimal memory before full aggregation.

Secondary Indexing for Trending: Implement a "Hot-Prefix" cache that bypasses the standard Trie for rapid injection of breaking news or viral topics.

Trie Compaction: Use Radix trees or DAWG (Directed Acyclic Word Graph) to reduce memory footprint by 40% compared to a standard Trie.

Design Breakdown

Functional Requirements

Core Use Cases:

Given a prefix, return the top 5 most popular completions.

Filter offensive or blacklisted content.

Support for multi-word queries.

Scope Control:

In-Scope: Global popularity, basic prefix matching, high-scale read/write.

Out-of-Scope: Personalization (user history), spell check (fuzzy matching), and ads.

Non-Functional Requirements

Scale: Support 5B+ searches daily.

Latency: P99 < 100ms (end-to-end).

Availability: 99.99% availability (High Availability is critical for search).

Consistency: Eventual consistency is acceptable; it is okay if a new popular query takes 30-60 minutes to appear.

Fault Tolerance: The system must function (perhaps with stale data) if the ingestion pipeline fails.

Estimation

Traffic:

5 Billion searches/day

\approx

60,000 queries per second (QPS).

Average 5-20 characters per search. Assuming 5 requests per search (after debouncing), Peak QPS

\approx

300,000 - 600,000.

Storage:

100 million unique search terms.

Each term average 20 bytes.

Trie/KV Store size: 100M * 20 bytes + pointers/metadata

\approx

20-50 GB (easily fits in distributed RAM).

Bandwidth:

Outgoing: 600k QPS * 500 bytes (JSON payload)

\approx

300 MB/s.

Blueprint

Concise Summary: A read-optimized architecture using an In-memory Cache (Redis) to serve pre-calculated suggestions, fueled by an offline Spark pipeline that processes search logs.

Major Components:

Query Service: Stateless service that fetches suggestions from Redis for a given prefix.

Log Collector: Captures every search query and pushes to a message queue.

Data Processing Layer (Spark): Aggregates query frequencies over a window (e.g., hourly) and updates the Trie database.

Trie Storage (NoSQL): Persistent storage for the full prefix-to-suggestion mapping.

Simplicity Audit: This design uses a KV-store approach instead of a live-updated Trie. Building a Trie in real-time is complex and prone to locking issues; pre-calculating the top-k suggestions for every prefix makes the read path extremely fast and simple.

Architecture Decision Rationale:

Why this?: Decoupling ingestion and queries prevents heavy write loads from impacting search latency.

Functional: Meets prefix-matching and filtering requirements.

Non-functional: Horizontal scaling of Query Service and Redis ensures 600k+ QPS capability.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Use Anycast DNS to route users to the nearest Data Center.

API Gateway: Handles SSL termination and Rate Limiting.

Debouncing: Client-side logic waits for 100-200ms of inactivity before sending a request to reduce server load by ~70%.

Service

Query Service:

Protocol: gRPC for internal service communication; REST/WebSockets for client.

Logic: 1. Check Redis for prefix. 2. If miss, fetch from NoSQL. 3. Filter against Blacklist.

Scaling: Horizontal scaling via CPU-based triggers.

Log Collector:

High-throughput ingestion service that buffers logs before sending to Kafka to minimize I/O overhead.

Storage

Access Pattern: Heavy read, batch write.

Database Table Design:

Table: Suggestions

Fields: prefix (PK), suggestions (List<String>), last_updated (Timestamp).

Technical Selection: Cassandra or DynamoDB.

Rationale: High write throughput for bulk updates from the Spark job and predictable single-digit ms lookups for the prefix key.

Distribution Logic: Shard by prefix. Using a consistent hash of the prefix ensures even distribution across nodes.

Cache

Purpose: Reducing latency to < 10ms for the most common prefixes.

Key-Value Schema:

Key: autocomplete::[locale]::[prefix]

Value: JSON list of strings

Technical Selection: Redis Cluster.

Failure Handling: If Redis fails, Query Service falls back to NoSQL DB (Graceful degradation).

Messaging

Purpose: Decouples the live query traffic from the heavy aggregation logic.

Technical Selection: Kafka.

Partitioning: Partition by query_string to ensure all instances of the same query land in the same Spark partition for easy aggregation.

Data Processing

Processing Model: Batch processing (Lambda architecture).

Processing DAG:

Consumer reads from Kafka.

Map-Reduce job calculates query frequency.

Filter offensive queries.

Build prefix-to-suggestion list (Top 5).

Bulk load into NoSQL/Redis.

Technical Selection: Apache Spark.

Infrastructure (Optional)

Observability: Prometheus for monitoring QPS and Latency.

Platform Security: mTLS between Query Service and NoSQL/Redis.

Wrap Up

Advanced Topics

Trade-offs: We trade off Freshness for Latency. By pre-calculating suggestions, we cannot show a trending topic the second it happens (unless using a separate "Hot-Path" service).

Reliability: If the Spark pipeline dies, the system continues to serve stale (but valid) suggestions from Redis/NoSQL.

Hot Spot Shards: Common prefixes like 'a', 's', or 't' might cause hot spots. Solution: Use sub-sharding or replicate the most common 10,000 prefixes across all Redis nodes.

Security: A Blacklist Filter is placed in the Query Service to ensure that if a term is banned, it is removed immediately from suggestions without waiting for a re-index.