The Question
Design a Scalable Retrieval-Augmented Generation (RAG) System
Design a production-ready RAG system capable of indexing millions of enterprise documents and answering user queries with high relevance and low latency. The system must handle document ingestion asynchronously, support semantic search across billions of text chunks, and ensure strict document-level access control. Discuss your strategies for chunking, hybrid retrieval, re-ranking, and handling LLM service limits while maintaining a P99 latency under 2.5 seconds.
Vector Database
PostgreSQL
Redis
S3
SQS
LLM API
HNSW
Hybrid Search
Cross-Encoder
JWT
Questions & Insights
Clarifying Questions
What is the scale of the document corpus and the expected growth? (Assumption: 1 million documents, roughly 1 billion chunks, with 10k new documents added daily).
What are the latency requirements for the end-to-end generation? (Assumption: P99 latency under 2.5 seconds for the complete retrieval and generation cycle).
What types of data are being ingested, and do they require OCR? (Assumption: Mostly text-heavy PDFs and Markdown; complex image-to-text OCR is out of scope for MVP).
Are there strict data privacy or access control requirements (ACLs)? (Assumption: Users can only retrieve information from documents they have uploaded or have explicit permission to view).
What is the target query volume? (Assumption: 100 QPS average, 500 QPS peak).
Thinking Process
Chunking Strategy: How do we split documents to maintain semantic meaning while fitting into the LLM context window?
Retrieval Quality: How do we ensure the most relevant context is retrieved to minimize hallucinations?
Latency Optimization: How do we orchestrate vector search, re-ranking, and LLM calls to meet the 2.5s SLA?
End-to-End Flow:
How is the raw data transformed into searchable vectors?
How does the system handle a user query to fetch the right context?
How is the context injected into the prompt and served by the LLM?
Bonus Points
Hybrid Search: Combining Dense Retrieval (Vector/Embeddings) with Sparse Retrieval (BM25/Keyword) to handle both semantic meaning and specific technical terms.
Two-Stage Retrieval (Re-ranking): Using a fast vector search for the top 100 candidates, followed by a computationally expensive Cross-Encoder re-ranker for the top 5 to significantly boost precision.
Semantic Caching: Implementing a cache (e.g., GPTCache) to store and retrieve responses for semantically similar questions, reducing LLM costs and latency.
Query Rewriting: Using a small LLM to transform vague user queries into optimized search terms before hitting the vector database.
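The Hybrid Search item above does not say how the dense and sparse result lists are merged. One common option is reciprocal rank fusion (RRF), sketched below as an assumption rather than the design's stated method. The function name, the sample chunk IDs, and the constant k=60 are illustrative.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of chunk IDs into one list, best first.

    Each chunk scores sum(1 / (k + rank)) over the lists that contain it,
    with rank starting at 1. k=60 is a conventional default, not a tuned value.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by chunk ID for determinism.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


dense_hits = ["c7", "c2", "c9"]   # hypothetical vector-search results
sparse_hits = ["c2", "c4", "c7"]  # hypothetical BM25 results
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # chunks found by both retrievers rise to the top
```

RRF uses only ranks, so it sidesteps the problem that cosine similarities and BM25 scores live on different scales. The fused list would then feed the Cross-Encoder re-ranker.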
Design Breakdown
Functional Requirements
Core Use Cases:
Users can upload documents (PDF, Text).
Users can query the system in natural language.
The system provides a generated answer with citations/references to source documents.
Scope Control:
In-Scope: Text extraction, chunking, embedding generation, vector storage, and RAG orchestration.
Out-of-Scope: Multi-modal RAG (video/audio), training custom base models, and real-time web-crawling.
Non-Functional Requirements
Scale: Support up to 1 million documents and 500 peak QPS.
Latency: P99 < 2.5 seconds.
Availability & Reliability: 99.9% uptime; ingestion failures should be retriable via dead-letter queues.
Consistency: Eventual consistency for document updates (searchable within 1 minute of upload).
Security & Privacy: Document-level ACLs; data encryption at rest and in transit.
Estimation
Storage: 1M docs * 1000 chunks/doc * 1536 dimensions * 4 bytes (Float32) \approx 6 TB for vector storage.
Traffic: 100 QPS * 86,400 seconds \approx 8.6M queries/day.
Bandwidth:
Inbound: 10k docs/day * 5MB/doc \approx 50 GB/day.
Outbound: 100 QPS * 1KB response \approx 100 KB/s.
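The estimates above can be re-derived with a few lines of arithmetic. This sketch uses only the figures stated in this section and decimal units (1 TB = 10^12 bytes); the variable names are illustrative.

```python
BYTES_PER_FLOAT32 = 4

docs = 1_000_000
chunks_per_doc = 1_000
dims = 1_536

# Raw embedding payload only: one Float32 per dimension per chunk.
vector_bytes = docs * chunks_per_doc * dims * BYTES_PER_FLOAT32
vector_tb = vector_bytes / 1e12

queries_per_day = 100 * 86_400           # average QPS * seconds per day
inbound_gb_per_day = 10_000 * 5 / 1_000  # docs/day * MB/doc, in decimal GB
outbound_kb_per_s = 100 * 1              # QPS * KB per response

print(f"vectors:  {vector_tb:.2f} TB")
print(f"queries:  {queries_per_day:,} per day")
print(f"inbound:  {inbound_gb_per_day:.0f} GB/day")
print(f"outbound: {outbound_kb_per_s} KB/s")
```

The storage figure covers raw vectors only. Index structures, metadata, and replication would add to it, so it is best read as a lower bound.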
Blueprint
Concise Summary: An asynchronous ingestion pipeline processes documents into a vector database, while a synchronous query service performs hybrid retrieval, re-ranking, and LLM generation.
Major Components:
Ingestion Service: Handles document parsing, chunking, and embedding generation via a worker pool.
Vector DB: Stores document embeddings and metadata for low-latency similarity search.
Query Service: Orchestrates the retrieval of context and interacts with the LLM API.
Blob Storage: Acts as the source of truth for raw uploaded documents.
Simplicity Audit: This architecture uses managed LLM APIs and a purpose-built vector database to avoid the operational overhead of hosting large models in-house for the MVP.
Architecture Decision Rationale:
Why this architecture: Decoupling ingestion from querying ensures that heavy document processing doesn't impact user query latency.
Functional Satisfaction: Meets the need for document-based Q&A with citations.
Non-functional Satisfaction: Scalable via horizontal scaling of workers and sharded vector storage.
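The synchronous query path described in this blueprint can be sketched as a small orchestration function. This is a minimal illustration, not the actual service: the retriever, re-ranker, and LLM client are passed in as plain callables, and every name here (answer_query, Chunk, the stub functions) is hypothetical. The defaults of 100 candidates and 5 kept chunks follow the two-stage retrieval figures given earlier.

```python
from typing import Callable, List, Tuple

Chunk = Tuple[str, str]  # (chunk_id, text); a simplified, hypothetical shape


def answer_query(
    query: str,
    retrieve: Callable[[str, int], List[Chunk]],
    rerank: Callable[[str, List[Chunk]], List[Chunk]],
    generate: Callable[[str], str],
    candidates: int = 100,
    keep: int = 5,
) -> dict:
    """Retrieve broadly, re-rank narrowly, then prompt the LLM with citations."""
    pool = retrieve(query, candidates)
    top = rerank(query, pool)[:keep]
    context = "\n".join(f"[{cid}] {text}" for cid, text in top)
    prompt = (
        "Answer using only the context below and cite chunk IDs.\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )
    return {"answer": generate(prompt), "sources": [cid for cid, _ in top]}


# Stubs standing in for the Vector DB, the Cross-Encoder, and the LLM API.
corpus = [
    ("c1", "HNSW is a graph index."),
    ("c2", "SQS is a queue."),
    ("c3", "HNSW search is approximate."),
]


def fake_retrieve(query, limit):
    return corpus[:limit]


def fake_rerank(query, chunks):
    words = set(query.lower().split())
    # Rank by word overlap with the query; sorted() keeps ties in input order.
    return sorted(chunks, key=lambda c: -len(words & set(c[1].lower().split())))


def fake_generate(prompt):
    return "stub answer"


result = answer_query("what is hnsw", fake_retrieve, fake_rerank, fake_generate, keep=2)
print(result)
```

Passing the three stages in as callables keeps the orchestration testable without network access and makes it easy to swap a provider later.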
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing: Global DNS with latency-based routing to the nearest regional API Gateway.
Security & Perimeter:
API Gateway: Handles JWT-based Authentication and Rate Limiting (1000 requests/min per user).
WAF: Standard protection against SQLi and Prompt Injection at the edge.
Service
Topology & Scaling: Stateless microservices deployed on Kubernetes (EKS/GKE). Horizontal Pod Autoscaler (HPA) triggered by CPU and Request Count.
API Schema Design:
POST /v1/query: { "query": "string", "filters": { "doc_ids": [] } } -> { "answer": "string", "sources": [] }.
POST /v1/documents: Multipart upload -> { "job_id": "uuid" }.
Resilience & Reliability:
LLM Fallback: If primary LLM (e.g., GPT-4) fails, fallback to a faster/cheaper model (e.g., GPT-3.5) or a secondary provider (Anthropic).
Circuit Breaker: Implemented for all external API calls (Embedding, LLM).
Observability: Prometheus for RED metrics; OpenTelemetry for tracing the "Query -> Retrieval -> Generation" spans.
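The circuit breaker listed under Resilience & Reliability can be illustrated with a small in-process version. This is a sketch under assumptions: the class name, the threshold of 3 failures, and the 30-second cool-down are illustrative, and a production service would more likely rely on an existing resilience library or a service mesh.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after_s=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: let one trial call through (half-open state).
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Any success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

A caller would wrap each external dependency in its own breaker, for example one for the Embedding API and one for the LLM API, and treat CircuitOpenError as the signal to switch to the fallback model.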
Storage
Access Pattern: Heavy write during ingestion; high-concurrency read during user queries.
Database Table Design (Metadata DB):
documents: { id, owner_id, s3_path, status, created_at }.
chunks: { id, doc_id, text_content, page_number }.
Technical Selection:
Vector DB: Pinecone or Milvus. Chosen for managed scaling and native support for metadata filtering (ACLs).
Metadata DB: PostgreSQL. Handles structured data and relational queries for permissions.
Blob Storage: AWS S3 for durability of raw files.
Distribution Logic: Vector DB sharding by owner_id to prevent "noisy neighbor" issues and ensure data isolation.
Cache
Purpose & Justification: Semantic Cache to store responses for frequent queries.
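Key-Value Schema: Key is hash(embedding(query)), Value is the JSON response. A hash matches only identical embeddings, so serving semantically similar queries also requires a nearest-neighbor lookup over the cached query embeddings with a similarity threshold.
Technical Selection: Redis.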
Failure Handling: If Redis is down, the Query Service bypasses the cache and proceeds to standard retrieval.
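A semantic cache lookup can be sketched as a similarity search over previously seen query embeddings. The version below is an in-memory stand-in that scans linearly, purely for illustration. The class name and the 0.95 cosine threshold are assumptions, and a Redis-backed deployment would need a vector index rather than a scan.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


class SemanticCache:
    """In-memory stand-in for the cache; linear scan for clarity, not speed."""

    def __init__(self, threshold=0.95):
        self.threshold = threshold  # assumed cut-off; needs tuning on real traffic
        self.entries = []           # list of (query_embedding, response)

    def get(self, embedding):
        best_score, best_response = 0.0, None
        for cached_embedding, response in self.entries:
            score = cosine(embedding, cached_embedding)
            if score > best_score:
                best_score, best_response = score, response
        # Return the closest cached response only if it is similar enough.
        return best_response if best_score >= self.threshold else None

    def put(self, embedding, response):
        self.entries.append((list(embedding), response))


cache = SemanticCache()
cache.put([1.0, 0.0], {"answer": "A", "sources": []})
hit = cache.get([0.99, 0.05])  # nearly the same direction as the cached query
miss = cache.get([0.0, 1.0])   # orthogonal to it
```

Because responses depend on what a user is allowed to read, cache entries would also need to be scoped by user or permission set so that a hit never leaks another user's documents.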
Messaging
Purpose & Decoupling: Decouples document upload from the long-running embedding/indexing process.
Event Schema: { "doc_id": "uuid", "action": "INDEX" }.
Throughput & Partitioning: SQS for simple scaling.
Failure Handling: Standard DLQ (Dead Letter Queue) for messages that fail processing after 3 retries.
Technical Selection: AWS SQS. Simple, serverless, and highly reliable for MVP.
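The retry-then-DLQ behavior described above can be simulated in memory to make the message lifecycle concrete. This sketch does not call SQS. The queue is a deque, the handler is hypothetical, and "3 retries" is interpreted as three delivery attempts in total, which is an assumption. In SQS itself this behavior is configured through a redrive policy with a maxReceiveCount setting rather than in application code.

```python
from collections import deque

MAX_RECEIVES = 3  # assumed reading of the "3 retries" figure: three attempts total


def drain(queue, dead_letters, handler):
    """Process messages until the queue is empty.

    A message whose handler raises is re-queued; once it has been received
    MAX_RECEIVES times without success it moves to dead_letters instead.
    """
    processed = []
    while queue:
        message = queue.popleft()
        message["receive_count"] = message.get("receive_count", 0) + 1
        try:
            handler(message)
            processed.append(message["doc_id"])
        except Exception:
            if message["receive_count"] >= MAX_RECEIVES:
                dead_letters.append(message)
            else:
                queue.append(message)
    return processed


def index_document(message):
    # Hypothetical handler: pretend one document always fails to parse.
    if message["doc_id"] == "bad-doc":
        raise ValueError("unparseable PDF")


queue = deque([
    {"doc_id": "doc-1", "action": "INDEX"},
    {"doc_id": "bad-doc", "action": "INDEX"},
])
dlq = []
done = drain(queue, dlq, index_document)
print(done, [m["doc_id"] for m in dlq])
```

Messages that land in the DLQ keep their payload, so an operator can inspect the failure, fix the parser, and replay them.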
Data Processing
Processing Model: Asynchronous worker-based processing.
Processing DAG:
Fetch from S3.
Extract Text.
Recursive Character Chunking (512 tokens, 10% overlap).
Generate Embeddings (batch call to Embedding API).
Upsert to Vector DB with metadata.
Technical Selection: Python-based Celery workers. High library support for LangChain/LlamaIndex.
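The chunking step in the DAG above can be approximated with a fixed-size sliding window. Note the swap: this is not recursive character splitting, which also tries to break on separators such as paragraphs and sentences. It only shows how a 512-token window with 10% overlap advances. The function name is illustrative, and the "tokens" here are placeholder strings rather than real tokenizer output.

```python
def chunk_tokens(tokens, chunk_size=512, overlap_ratio=0.10):
    """Split a token list into windows of chunk_size with fractional overlap."""
    if chunk_size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("chunk_size must be positive and 0 <= overlap_ratio < 1")
    overlap = int(chunk_size * overlap_ratio)  # 51 tokens for the defaults
    step = chunk_size - overlap                # always at least 1
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the document
    return chunks


tokens = [f"t{i}" for i in range(1200)]  # placeholder tokens for a short document
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Each chunk repeats the last 51 tokens of its predecessor, so a sentence that straddles a boundary still appears whole in at least one chunk, at the cost of roughly 10% more embeddings to store.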
Wrap Up
Advanced Topics
Trade-offs: We chose Eventual Consistency for document indexing. While users want immediate searchability, the async pipeline prevents the API from timing out on large PDF uploads.
Reliability: Exponential backoff is applied to LLM API calls to handle rate limits (429 Too Many Requests).
Bottleneck Analysis:
LLM Throughput: LLM APIs often have token-per-minute (TPM) limits. We implement a local token bucket rate limiter to stay within limits.
Vector Search Latency: As the index grows, search slows. We use HNSW (Hierarchical Navigable Small World) indexing in the Vector DB, whose approximate nearest-neighbor search scales roughly as O(\log N).
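The local token bucket mentioned under LLM Throughput can be sketched as follows. The class name and the non-blocking try_consume method are illustrative, and the numbers in the usage note are hypothetical rather than any provider's published limits.

```python
import time


class TokenBucket:
    """Token bucket where one bucket token represents one LLM token of budget."""

    def __init__(self, capacity, refill_per_s, clock=time.monotonic):
        self.capacity = capacity
        self.refill_per_s = refill_per_s
        self.clock = clock
        self.tokens = float(capacity)  # start full
        self.updated_at = clock()

    def try_consume(self, amount):
        """Take `amount` tokens if available; return False without blocking otherwise."""
        now = self.clock()
        elapsed = now - self.updated_at
        # Refill in proportion to elapsed time, never beyond capacity.
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_s)
        self.updated_at = now
        if amount <= self.tokens:
            self.tokens -= amount
            return True
        return False
```

For a hypothetical limit of 90,000 tokens per minute, a bucket could be created as TokenBucket(capacity=90_000, refill_per_s=1_500), with the estimated prompt plus completion size passed to try_consume before each call. A bucket held in process memory limits only that process, so a multi-replica Query Service would need either a per-replica share of the limit or a shared counter.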
Security: Prompt Injection mitigation is handled by sanitizing inputs and using strict "System Prompts" that define the LLM's boundaries.
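The exponential backoff described under Reliability can be sketched as a retry wrapper. This is an illustration under assumptions: RateLimitError is a hypothetical stand-in for whatever exception the LLM client raises on HTTP 429, the delay values are placeholders, and the random "full jitter" variant shown here is one common choice rather than something the design specifies.

```python
import random
import time


class RateLimitError(Exception):
    """Hypothetical stand-in for a provider's HTTP 429 error."""


def call_with_backoff(fn, max_attempts=5, base_delay_s=0.5, max_delay_s=8.0,
                      sleep=time.sleep, rng=random.random):
    """Call fn, retrying on RateLimitError with exponential backoff and full jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the error to the caller
            # The ceiling doubles each attempt, capped at max_delay_s.
            ceiling = min(max_delay_s, base_delay_s * 2 ** attempt)
            sleep(rng() * ceiling)  # full jitter: wait somewhere in [0, ceiling)
```

Retry delays count against the 2.5-second P99 target, so the synchronous query path would likely use a small attempt budget and fall back to another model sooner, while the ingestion workers can afford longer waits.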