The Question

Scalable Retrieval-Augmented Generation (RAG) System

Design a production-ready RAG system capable of indexing millions of enterprise documents and answering user queries with high relevance and low latency. The system must handle document ingestion asynchronously, support semantic search across billions of text chunks, and ensure strict document-level access control. Discuss your strategies for chunking, hybrid retrieval, re-ranking, and handling LLM service limits while maintaining a P99 latency under 2.5 seconds.

Vector Database

PostgreSQL

Redis

SQS

LLM API

HNSW

Hybrid Search

Cross-Encoder

JWT

Questions & Insights

Clarifying Questions

What is the scale of the document corpus and the expected growth? (Assumption: 1 million documents, roughly 1 billion chunks, with 10k new documents added daily).

What are the latency requirements for the end-to-end generation? (Assumption: P99 latency under 2.5 seconds for the complete retrieval and generation cycle).

What types of data are being ingested, and do they require OCR? (Assumption: Mostly text-heavy PDFs and Markdown; complex image-to-text OCR is out of scope for MVP).

Are there strict data privacy or access control requirements (ACLs)? (Assumption: Users can only retrieve information from documents they have uploaded or have explicit permission to view).

What is the target query volume? (Assumption: 100 QPS average, 500 QPS peak).

Thinking Process

Chunking Strategy: How do we split documents to maintain semantic meaning while fitting into the LLM context window?

Retrieval Quality: How do we ensure the most relevant context is retrieved to minimize hallucinations?

Latency Optimization: How do we orchestrate vector search, re-ranking, and LLM calls to meet the 2.5s SLA?

End-to-End Flow:

How is the raw data transformed into searchable vectors?

How does the system handle a user query to fetch the right context?

How is the context injected into the prompt and served by the LLM?

Bonus Points

Hybrid Search: Combining Dense Retrieval (Vector/Embeddings) with Sparse Retrieval (BM25/Keyword) to handle both semantic meaning and specific technical terms.

Two-Stage Retrieval (Re-ranking): Using a fast vector search for the top 100 candidates, followed by a computationally expensive Cross-Encoder re-ranker for the top 5 to significantly boost precision.

Semantic Caching: Implementing a cache (e.g., GPTCache) to store and retrieve responses for semantically similar questions, reducing LLM costs and latency.

Query Rewriting: Using a small LLM to transform vague user queries into optimized search terms before hitting the vector database.

Design Breakdown

Functional Requirements

Core Use Cases:

Users can upload documents (PDF, Text).

Users can query the system in natural language.

The system provides a generated answer with citations/references to source documents.

Scope Control:

In-Scope: Text extraction, chunking, embedding generation, vector storage, and RAG orchestration.

Out-of-Scope: Multi-modal RAG (video/audio), training custom base models, and real-time web-crawling.

Non-Functional Requirements

Scale: Support up to 1 million documents and 500 peak QPS.

Latency: P99 < 2.5 seconds.

Availability & Reliability: 99.9% uptime; ingestion failures should be retriable via dead-letter queues.

Consistency: Eventual consistency for document updates (searchable within 1 minute of upload).

Security & Privacy: Document-level ACLs; data encryption at rest and in transit.

Estimation

Storage: 1M docs 1000 chunks/doc 1536 dimensions (Float32)

\approx

6 TB for vector storage.

Traffic: 100 QPS * 86,400 seconds

\approx

8.6M queries/day.

Bandwidth:

Inbound: 10k docs/day * 5MB/doc

\approx

50 GB/day.

Outbound: 100 QPS * 1KB response

\approx

100 KB/s.

Blueprint

Concise Summary: An asynchronous ingestion pipeline processes documents into a vector database, while a synchronous query service performs hybrid retrieval, re-ranking, and LLM generation.

Major Components:

Ingestion Service: Handles document parsing, chunking, and embedding generation via a worker pool.

Vector DB: Stores document embeddings and metadata for low-latency similarity search.

Query Service: Orchestrates the retrieval of context and interacts with the LLM API.

Blob Storage: Acts as the source of truth for raw uploaded documents.

Simplicity Audit: This architecture uses managed LLM APIs and a purpose-built vector database to avoid the operational overhead of hosting large models in-house for the MVP.

Architecture Decision Rationale:

Why this architecture?: Decoupling ingestion from querying ensures that heavy document processing doesn't impact user query latency.

Functional Satisfaction: Meets the need for document-based Q&A with citations.

Non-functional Satisfaction: Scalable via horizontal scaling of workers and sharded vector storage.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Global DNS with latency-based routing to the nearest regional API Gateway.

Security & Perimeter:

API Gateway: Handles JWT-based Authentication and Rate Limiting (1000 requests/min per user).

WAF: Standard protection against SQLi and Prompt Injection at the edge.

Service

Topology & Scaling: Stateless microservices deployed on Kubernetes (EKS/GKE). Horizontal Pod Autoscaler (HPA) triggered by CPU and Request Count.

API Schema Design:

POST /v1/query: { "query": "string", "filters": { "doc_ids": [] } } -> { "answer": "string", "sources": [] }.

POST /v1/documents: Multipart upload -> { "job_id": "uuid" }.

Resilience & Reliability:

LLM Fallback: If primary LLM (e.g., GPT-4) fails, fallback to a faster/cheaper model (e.g., GPT-3.5) or a secondary provider (Anthropic).

Circuit Breaker: Implemented for all external API calls (Embedding, LLM).

Observability: Prometheus for RED metrics; OpenTelemetry for tracing the "Query -> Retrieval -> Generation" spans.

Storage

Access Pattern: Heavy write during ingestion; high-concurrency read during user queries.

Database Table Design (Metadata DB):

documents: { id, owner_id, s3_path, status, created_at }.

chunks: { id, doc_id, text_content, page_number }.

Technical Selection:

Vector DB: Pinecone or Milvus. Chosen for managed scaling and native support for metadata filtering (ACLs).

Metadata DB: PostgreSQL. Handles structured data and relational queries for permissions.

Blob Storage: AWS S3 for durability of raw files.

Distribution Logic: Vector DB sharding by owner_id to prevent "noisy neighbor" issues and ensure data isolation.

Cache

Purpose & Justification: Semantic Cache to store responses for frequent queries.

Key-Value Schema: Key is the hash(embedding(query)), Value is the JSON response.

Technical Selection: Redis.

Failure Handling: If Redis is down, the Query Service bypasses the cache and proceeds to standard retrieval.

Messaging

Purpose & Decoupling: Decouples document upload from the long-running embedding/indexing process.

Event Schema: { "doc_id": "uuid", "action": "INDEX" }.

Throughput & Partitioning: SQS for simple scaling.

Failure Handling: Standard DLQ (Dead Letter Queue) for messages that fail processing after 3 retries.

Technical Selection: AWS SQS. Simple, serverless, and highly reliable for MVP.

Data Processing

Processing Model: Asynchronous worker-based processing.

Processing DAG:

Fetch from S3.

Extract Text.

Recursive Character Chunking (512 tokens, 10% overlap).

Generate Embeddings (batch call to Embedding API).

Upsert to Vector DB with metadata.

Technical Selection: Python-based Celery workers. High library support for LangChain/LlamaIndex.

Wrap Up

Advanced Topics

Trade-offs: We chose Eventual Consistency for document indexing. While users want immediate searchability, the async pipeline prevents the API from timing out on large PDF uploads.

Reliability: Exponential backoff is applied to LLM API calls to handle rate limits (429 Too Many Requests).

Bottleneck Analysis:

LLM Throughput: LLM APIs often have token-per-minute (TPM) limits. We implement a local token bucket rate limiter to stay within limits.

Vector Search Latency: As the index grows, search slows. We use HNSW (Hierarchical Navigable Small World) indexing in the Vector DB for

O(\log N)

search time.

Security: Prompt Injection mitigation is handled by sanitizing inputs and using strict "System Prompts" that define the LLM's boundaries.