The Question
ML Design

Large-Scale Enterprise RAG Chatbot System

Design a high-scale, retrieval-augmented generation (RAG) chatbot system for an enterprise corpus of 10 million documents. The system must support 100 million DAU with sub-second response times. Detail the end-to-end ML lifecycle, specifically focusing on multi-stage retrieval strategies, latency optimization for LLM inference (quantization, caching), handling data freshness, and implementing robust evaluation frameworks (e.g., RAGAS) to mitigate hallucinations. Address the production trade-offs between model size, retrieval accuracy, and operational cost.
Tags: RAG, LLM, Transformers, vLLM, Milvus, Quantization, Cross-Encoder, DPO, SFT, Redis, Kafka, HNSW
Questions & Insights

Clarifying Questions

Business Goal: Is the primary objective to maximize user engagement, task completion rate (e.g., customer support resolution), or accuracy of information?
Assumption: The goal is an Enterprise RAG-based Chatbot focused on high-accuracy information retrieval and user satisfaction.
Constraints & Scale: What is the scale of the document corpus and the expected traffic?
Assumption: Corpus of 10M documents, 100M DAU, 2,000 peak QPS, and a P99 Time-To-First-Token (TTFT) budget of <200ms.
Data Freshness: How quickly must new information be available to the chatbot?
Assumption: Near real-time (minutes) for document updates.
Edge Cases: How do we handle safety, toxic content, and hallucinations?
Assumption: We need a robust guardrail layer and a mechanism to cite sources to minimize hallucinations.
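As a rough sanity check on the scale assumption above, the sketch below converts between DAU and QPS. The helper names and the 5-queries-per-user figure are hypothetical and used only for illustration; the point is that 2,000 QPS sustained all day corresponds to fewer than two queries per user per day, so the assumption is worth confirming with the interviewer.

```python
SECONDS_PER_DAY = 86_400


def avg_qps(dau: int, queries_per_user_per_day: float) -> float:
    """Average queries per second implied by DAU and per-user volume."""
    return dau * queries_per_user_per_day / SECONDS_PER_DAY


def queries_per_user_at_qps(dau: int, qps: float) -> float:
    """Queries per user per day if `qps` were sustained for a full day."""
    return qps * SECONDS_PER_DAY / dau


DAU = 100_000_000
PEAK_QPS = 2_000

# Per-user volume consistent with sustaining the assumed 2,000 QPS all day.
print(round(queries_per_user_at_qps(DAU, PEAK_QPS), 2))  # 1.73

# Average QPS if each user sent 5 queries per day (hypothetical figure).
print(round(avg_qps(DAU, 5)))  # 5787
```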

Thinking Process

Identify the Core Pattern: This is a classic Retrieval-Augmented Generation (RAG) problem. Fine-tuning an LLM on the entire corpus is too expensive and brittle for frequent updates; RAG provides the necessary grounding and explainability.
Retrieval vs. Ranking: With 10M documents, a single-stage retrieval isn't enough. I need a multi-stage approach: fast vector search (Retrieval) followed by a cross-encoder (Re-ranking) to ensure the top-K context is highly relevant.
Latency Bottleneck: LLM generation is slow. I must use streaming, quantization (e.g., AWQ/FP8), and potentially a semantic cache to bypass the LLM for repeated queries.
Scalability: The system must handle high QPS. Decoupling the ingestion (indexing) pipeline from the inference (serving) pipeline is critical.

Elite Bonus Points

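Semantic Cache: Implementing a vector-similarity cache for common queries (e.g., "What is the return policy?") so that cache hits bypass the LLM entirely. The cost and latency savings scale with the hit rate, so a reduction on the order of 80% is realistic only when traffic is highly repetitive.
Speculative Decoding: Using a small "draft" model to propose several tokens and a large model to verify them in a single forward pass, which reduces per-request decoding latency when the draft model's acceptance rate is high.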
Query Rewriting/Expansion: Using a lightweight LLM to rewrite ambiguous user queries (e.g., "Tell me more") into standalone search queries based on conversation history.
Negative Feedback Loop: Implementing a "DPO (Direct Preference Optimization) on-the-fly" mechanism where user thumbs-down signals are collected as preference pairs for the generator and as hard negatives for the re-ranker or guardrail models, then applied in frequent automated update cycles.
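To make the speculative decoding idea concrete, here is a minimal sketch of the greedy variant, assuming both models are exposed as plain next-token functions. The toy `target` and `draft` functions are hypothetical stand-ins, not real LLMs, and a production system would verify all drafted positions in one batched forward pass and use rejection sampling for non-greedy decoding.

```python
from typing import Callable, List, Sequence

# A "model" here is just a greedy next-token function (an assumption for the sketch).
NextToken = Callable[[Sequence[int]], int]


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain autoregressive greedy decoding, used as the reference."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken,
                       prompt: List[int], n_new: int, k: int = 4) -> List[int]:
    """Greedy speculative decoding: draft proposes up to k tokens, target verifies."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    while len(seq) < goal:
        # 1) The cheap draft model proposes a short continuation.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2) The target model checks each proposed position in order.
        accepted: List[int] = []
        for tok in proposal:
            expected = target(seq + accepted)
            if expected == tok:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token and discard the rest.
                accepted.append(expected)
                break
        seq.extend(accepted)
    return seq


def target(seq: Sequence[int]) -> int:
    return (seq[-1] * 3 + 1) % 7


def draft(seq: Sequence[int]) -> int:
    # Agrees with the target except after token 5, to force some rejections.
    return 0 if seq[-1] == 5 else target(seq)


# The speculative output matches plain greedy decoding with the target model.
print(speculative_decode(target, draft, [1], 10) == greedy_decode(target, [1], 10))
```

The speedup in practice comes from the target model scoring all drafted positions at once; this sketch only demonstrates that the accept/reject rule preserves the target model's greedy output.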
Design Breakdown

Requirements

Product Goal: Provide accurate, safe, and helpful responses grounded in a private document corpus.
Success Metrics:
Online: Task Success Rate (TSR), Average Session Length, User Rating (CSAT), TTFT.
Offline: RAGAS metrics (Faithfulness, Answer Relevancy, Context Precision), Retrieval Recall@K.
Guardrail Metrics: Toxicity rate, Hallucination rate (via NLI-based checking).
System Constraints: 10M documents, 2k peak QPS, P99 TTFT <200ms with the full streamed response completing in <2s, high availability (99.99%).
Data Availability: Corporate wikis, PDF manuals, customer interaction logs, real-time news feeds.

ML Problem Framing

ML Task Type: Two-stage retrieval (Bi-encoder search + Cross-encoder re-ranking) followed by Conditional Text Generation.
Prediction Target: Generate a sequence of tokens $Y = \{y_1, y_2, \ldots, y_n\}$ that maximizes $P(Y \mid \text{Query}, \text{Context}, \text{History})$.
Inputs:
User: Current query + Session history (last N turns).
Context: Top-K retrieved document chunks.
Metadata: User permissions, location, language.
Outputs: Streamed text response + Source citations.
ML Challenges: Context window limits, hallucination, keeping embeddings in sync with model updates (Embedding Versioning).
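The prediction target above can be written out with the standard autoregressive factorization. The shorthand $Q$, $C$, and $H$ for query, retrieved context, and session history is introduced here for brevity and is not part of the original notation.

```latex
P(Y \mid Q, C, H) = \prod_{t=1}^{n} P\left(y_t \mid y_{<t},\, Q,\, C,\, H\right)
```

Each factor is one decoding step, which is why generation latency grows with the number of output tokens while retrieval cost is paid once per query.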

Design Summary & MVP

Concise Summary: A RAG-based architecture utilizing a Vector Database for retrieval, a Cross-Encoder for re-ranking, and a Quantized LLM (vLLM-backed) for streaming generation.
Model Architecture & Selection:
Baseline: BM25 (Keyword Search) + GPT-3.5 API.
Target: BGE-Embeddings (Retrieval) + BGE-Reranker + Llama-3-70B (8-bit quantized) for Generation.
Choice Rationale: BGE models are among the strongest open-source retrievers on public benchmarks; Llama-3-70B approaches GPT-4-class quality on many tasks while allowing internal hosting for privacy and lower TCO.
ML Life Cycle Summary: Ingest docs -> Chunk -> Embed -> Store in Milvus. User Query -> Re-write -> Vector Search -> Re-rank -> Prompt LLM -> Stream to User.
Simplicity Audit: Avoids complex RLHF in the MVP. Relies on SFT and high-quality prompt engineering.
Architecture Decision Rationale: Two-stage retrieval balances the "speed of Bi-encoders" with the "precision of Cross-encoders," ensuring the context window isn't filled with noise.
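The two-stage flow can be sketched in a few lines, assuming precomputed embeddings and a pluggable pair scorer. Everything below is illustrative: the toy vectors, the `token_overlap` scorer (a stand-in for a real cross-encoder), and the function names are hypothetical, and a production system would use an ANN index instead of a full sort.

```python
import math
from typing import Callable, Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(query_vec: Sequence[float], index: Dict[str, Sequence[float]],
             top_n: int) -> List[str]:
    """Stage 1: cheap vector similarity over the whole index."""
    ranked = sorted(index, key=lambda d: cosine(query_vec, index[d]), reverse=True)
    return ranked[:top_n]


def rerank(query: str, candidates: List[str], texts: Dict[str, str],
           score_pair: Callable[[str, str], float], top_k: int) -> List[str]:
    """Stage 2: expensive pairwise scoring over the shortlist only."""
    ranked = sorted(candidates, key=lambda d: score_pair(query, texts[d]), reverse=True)
    return ranked[:top_k]


def token_overlap(query: str, passage: str) -> float:
    """Toy stand-in for a cross-encoder: fraction of query tokens in the passage."""
    q, p = set(query.lower().split()), set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0


texts = {
    "a": "refund policy for online orders",
    "b": "return policy and refund window",
    "c": "office parking rules",
}
index = {"a": [0.9, 0.1], "b": [0.8, 0.3], "c": [0.0, 1.0]}

shortlist = retrieve([1.0, 0.2], index, top_n=2)
best = rerank("what is the refund window", shortlist, texts, token_overlap, top_k=1)
print(shortlist, best)  # ['a', 'b'] ['b']
```

Note how the second stage reverses the order of the shortlist: the vector stage slightly prefers "a", while the pairwise scorer sees the exact term match in "b".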

System Architecture

Pipeline Deep Dive

Data Pipeline

Data Source: Mix of streaming (Slack/Teams messages) via Kafka and batch (Confluence/SharePoint) via Airflow.
Data Ingestion: Use Unstructured.io for parsing complex PDFs/Tables. Batch ingestion for history; Change Data Capture (CDC) for real-time document updates.
Data Storage:
Bronze: Raw files in S3.
Silver: Cleaned text and metadata (Parquet).
Gold: Vectorized chunks in Milvus with HNSW indexing for approximate nearest-neighbor search that scales roughly as $O(\log N)$.
Data Quality: Detect duplicate documents using MinHash LSH to prevent redundant context in the LLM.
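A minimal sketch of the MinHash part of this check is shown below, using only the standard library. The shingle size, the number of hash functions, and the function names are assumptions chosen for illustration, and the LSH banding step that makes the comparison sub-quadratic at 10M documents is omitted.

```python
import hashlib
from typing import List, Set


def shingles(text: str, k: int = 3) -> Set[str]:
    """Word-level k-shingles of a document."""
    tokens = text.lower().split()
    return {" ".join(tokens[i:i + k]) for i in range(max(1, len(tokens) - k + 1))}


def minhash_signature(items: Set[str], num_perm: int = 128) -> List[int]:
    """One minimum per salted hash function; the salt plays the role of a permutation."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(4, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimated_jaccard(sig_a: List[int], sig_b: List[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    matches = sum(1 for x, y in zip(sig_a, sig_b) if x == y)
    return matches / len(sig_a)


doc_a = "the quick brown fox jumps over the lazy dog near the river bank today"
doc_b = "the quick brown fox jumps over the lazy dog near the river bank tomorrow"
doc_c = "quarterly revenue guidance was revised after the board meeting last week"

sig_a, sig_b, sig_c = (minhash_signature(shingles(d)) for d in (doc_a, doc_b, doc_c))
print(estimated_jaccard(sig_a, sig_b))  # high: near-duplicate pair
print(estimated_jaccard(sig_a, sig_c))  # low: unrelated pair
```

Pairs whose estimated similarity exceeds a chosen threshold would be collapsed before embedding, so near-identical chunks do not crowd the top-K context.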

Feature Pipeline

Feature Definition: Primary feature is the Text Embedding. Secondary features include document freshness, "PageRank-style" document authority, and user intent.
Feature Engineering:
Chunking: Overlapping recursive character splitting (e.g., 512 tokens with 50-token overlap).
Late Interaction: Using ColBERT embeddings if accuracy lags, allowing for better alignment between query and doc terms.
Online Feature Pipeline: Real-time embedding of the user query using a GPU-accelerated embedding microservice (TEI - Text Embeddings Inference).
Training/Serving Skew: Ensure the same tokenizer and embedding model version are used for indexing and query-time search.
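The overlapping chunking step can be illustrated with a fixed sliding window over tokens. This is a simpler variant than recursive character splitting, which also tries to respect paragraph and sentence boundaries; the function name is hypothetical and tokens are assumed to be pre-split.

```python
from typing import List


def chunk_tokens(tokens: List[str], size: int = 512, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must satisfy 0 <= overlap < size")
    step = size - overlap
    chunks: List[List[str]] = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the document
    return chunks


tokens = [str(i) for i in range(1000)]
chunks = chunk_tokens(tokens)
print(len(chunks))                          # 3
print(chunks[0][-50:] == chunks[1][:50])    # True: 50-token overlap
```

The overlap keeps a sentence that straddles a boundary retrievable from at least one chunk, at the cost of roughly 10% more vectors to store for these settings.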

Model Architecture

Problem Formulation: Retrieval as a Maximum Inner Product Search (MIPS) problem; Generation as an Autoregressive language modeling task.
Candidate Model Families:
Retrieval: Bi-Encoders (Sentence-BERT, BGE, OpenAI text-embedding-3-small).
Generation: Mistral-7B (Fast/Cheap), Llama-3-70B (High Intelligence).
Architecture Design:
Bi-Encoder: BGE-large-en-v1.5 for the initial retrieval of top 100 docs.
Cross-Encoder: BGE-reranker-v2-m3 to narrow the top 100 candidates down to the best 5.
Architecture Optimization:
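Quantization: Use 4-bit (AWQ) or 8-bit (FP8) to fit 70B models on fewer GPUs (A100/H100) while retaining nearly all of the full-precision model's quality; the >99% retention target should be verified on the offline evaluation set before rollout.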
Continuous Batching: Use vLLM to maximize throughput by dynamically batching incoming requests.
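The GPU savings from quantization follow from simple arithmetic on weight storage. The sketch below counts weights only; it ignores KV cache, activations, and per-group quantization metadata, so real memory requirements are higher. The helper name is hypothetical.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Memory for model weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


N_PARAMS_70B = 70e9

for label, bits in [("FP16", 16), ("8-bit", 8), ("4-bit", 4)]:
    print(label, weight_memory_gb(N_PARAMS_70B, bits))
# FP16 140.0
# 8-bit 70.0
# 4-bit 35.0
```

At FP16 the weights alone exceed a single 80 GB GPU, while at 8-bit they fit with little headroom left for the KV cache, which is why the 4-bit option matters for long chat histories and high concurrency.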

Training Pipeline

Dataset Construction: Use an LLM to generate synthetic (Query, Relevant Context, Response) triplets from raw docs for SFT, then use LLM-as-a-judge to filter out low-quality or unfaithful triplets.
Data Splitting: Split by document category to ensure the model generalizes to new topics not seen during fine-tuning.
Training Infrastructure: PyTorch FSDP (Fully Sharded Data Parallel) for fine-tuning Llama-3 on a cluster of A100s.
Retraining Strategy: The embedding model stays frozen, since changing it requires re-embedding and re-indexing the whole corpus. The re-ranker is retrained monthly on user click/feedback data.
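Splitting by document category can be made deterministic by hashing the category name, so every example from a category lands in the same split across reruns. This is a minimal sketch; the function names, the 20% evaluation fraction, and the sample categories are assumptions for illustration.

```python
import hashlib
from typing import Dict, List, Tuple

Example = Tuple[str, str]  # (category, text)


def assign_split(category: str, eval_fraction: float = 0.2) -> str:
    """Map a category to 'train' or 'eval' using a stable hash of its name."""
    digest = hashlib.sha256(category.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "eval" if bucket < eval_fraction else "train"


def split_by_category(examples: List[Example],
                      eval_fraction: float = 0.2) -> Dict[str, List[Example]]:
    """Group-aware split: a category never appears in both train and eval."""
    splits: Dict[str, List[Example]] = {"train": [], "eval": []}
    for category, text in examples:
        splits[assign_split(category, eval_fraction)].append((category, text))
    return splits


examples = [
    ("hr", "How many vacation days do I get?"),
    ("hr", "What is the parental leave policy?"),
    ("it", "How do I reset my VPN token?"),
    ("legal", "Who approves vendor contracts?"),
    ("it", "Where do I request a laptop?"),
]
splits = split_by_category(examples)
train_cats = {c for c, _ in splits["train"]}
eval_cats = {c for c, _ in splits["eval"]}
print(train_cats & eval_cats)  # set(): no category leaks across splits
```

With only a handful of categories the realized evaluation fraction can differ a lot from the target, so it is worth checking split sizes after assignment.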

Serving Pipeline

Serving Pattern: Online Inference with streaming (Server-Sent Events) to minimize perceived latency.
Serving Architecture: K8s-based deployment with NVIDIA Triton or vLLM.
Latency Optimization:
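KV Cache: Critical for chat history performance. Reusing cached attention keys and values for the shared prefix (system prompt and earlier turns) avoids re-encoding the whole conversation on every request.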
Semantic Caching: Store query-response pairs in Redis. If a new query has >0.95 cosine similarity to a cached query, return the cached result.
Reliability: Fallback to a smaller, faster model (Mistral-7B) if the primary 70B model times out or hits OOM.
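The semantic caching rule above can be sketched with an in-memory store. This is an illustration under stated assumptions: the `SemanticCache` class is hypothetical, a plain list stands in for Redis, the linear scan stands in for a vector index, and the toy two-dimensional vectors stand in for real query embeddings.

```python
import math
from typing import List, Optional, Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    """Returns a cached response when a new query is close enough to a stored one."""

    def __init__(self, threshold: float = 0.95) -> None:
        self.threshold = threshold
        self._entries: List[Tuple[List[float], str]] = []

    def put(self, query_vec: Sequence[float], response: str) -> None:
        self._entries.append((list(query_vec), response))

    def get(self, query_vec: Sequence[float]) -> Optional[str]:
        best_score, best_response = -1.0, None
        for vec, response in self._entries:
            score = cosine(query_vec, vec)
            if score > best_score:
                best_score, best_response = score, response
        # Only serve from cache above the similarity threshold.
        return best_response if best_score > self.threshold else None


cache = SemanticCache(threshold=0.95)
cache.put([1.0, 0.0], "Returns are accepted within 30 days.")

print(cache.get([0.99, 0.05]))  # hit: near-identical query embedding
print(cache.get([0.5, 0.5]))    # None: falls through to the full RAG pipeline
```

In an enterprise setting the cache key would also need to include the user's permission scope, since the inputs listed earlier include user permissions and a cached answer may cite documents another user cannot see.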

Evaluation Pipeline

Offline Evaluation: Use RAGAS (Retrieval Augmented Generation Assessment).
Faithfulness: Does the answer use only the provided context? (Prevents hallucination).
Answer Relevance: Does the answer address the query?
Online Evaluation: A/B test the "Re-ranker vs. No Re-ranker" and measure the impact on "Helpful" ratings and CTR on citations.
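Alongside the RAGAS metrics, the Retrieval Recall@K metric listed under Success Metrics is simple enough to compute directly. The sketch below assumes a labeled set of relevant chunk IDs per query; the function names and sample IDs are hypothetical.

```python
from typing import List, Sequence, Set, Tuple


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the relevant items that appear in the top-k retrieved list."""
    if not relevant:
        return 0.0
    hits = len(set(retrieved[:k]) & relevant)
    return hits / len(relevant)


def mean_recall_at_k(runs: List[Tuple[Sequence[str], Set[str]]], k: int) -> float:
    """Average Recall@K over a list of (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(recall_at_k(r, rel, k) for r, rel in runs) / len(runs)


runs = [
    (["d1", "d7", "d3"], {"d3", "d9"}),   # one of two relevant chunks retrieved
    (["d2", "d4", "d5"], {"d2"}),         # the only relevant chunk is at rank 1
]
print(recall_at_k(["d1", "d7", "d3"], {"d3", "d9"}, k=3))  # 0.5
print(mean_recall_at_k(runs, k=3))                          # 0.75
```

Tracking this for the bi-encoder stage at K=100 separately from the re-ranked top 5 shows whether a bad answer came from a retrieval miss or from the re-ranker discarding the right chunk.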

Monitoring Pipeline

System Monitoring: Track tokens per second (TPS) and GPU memory utilization.
Data Monitoring: Track "Retrieval Misses" (queries where the vector DB returned low-similarity scores), indicating a gap in the knowledge base.
Model Monitoring: Log "Hallucination Score" by comparing LLM output against retrieved context using a secondary "NLI Model."
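The "Retrieval Misses" signal can be tracked with a rolling window over recent queries. This is a minimal sketch: the `RetrievalMissMonitor` class, the 0.5 similarity floor, and the window size are assumptions, and the right threshold depends on the embedding model's score distribution.

```python
from collections import deque
from typing import Deque, Sequence


class RetrievalMissMonitor:
    """Rolling miss rate: a query is a miss when its best similarity is below a floor."""

    def __init__(self, min_score: float = 0.5, window: int = 1000) -> None:
        self.min_score = min_score
        self._flags: Deque[bool] = deque(maxlen=window)

    def record(self, top_scores: Sequence[float]) -> bool:
        """Record one query's retrieval scores; returns True if it counts as a miss."""
        best = max(top_scores) if top_scores else float("-inf")
        miss = best < self.min_score
        self._flags.append(miss)
        return miss

    def miss_rate(self) -> float:
        return sum(self._flags) / len(self._flags) if self._flags else 0.0


monitor = RetrievalMissMonitor(min_score=0.5, window=1000)
monitor.record([0.91, 0.84, 0.80])  # well-covered query
monitor.record([0.22, 0.19])        # likely knowledge-base gap
monitor.record([])                  # nothing retrieved at all
print(round(monitor.miss_rate(), 2))  # 0.67
```

Logging the query text for each recorded miss gives content owners a concrete list of topics that the knowledge base does not yet cover.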
Wrap Up

Final Evaluation

Observability: Implement distributed tracing (OpenTelemetry) to track a request from Query-Rewrite -> Retrieval -> Rerank -> LLM.
Feedback Loop: Negative feedback (thumbs down) triggers an automated flow to human annotators for "Golden Set" updates.
Edge Cases:
Cold Start: For new docs, ensure they are indexed within 5 minutes.
Prompt Injection: Use a dedicated "System Prompt Guard" to prevent users from leaking system instructions.
Trade-offs:
Accuracy vs. Latency: A re-ranker adds roughly 100ms per query but significantly improves the relevance of the final top-K context.
Cost vs. Quality: 70B models are roughly 10x more expensive to serve than 8B models but make noticeably fewer logical errors.
Distinguishing Insights: For a true Principal signal, discuss Multi-Vector Retrieval (ColBERT), which stores embeddings for every token in a document, allowing for very granular retrieval at the cost of higher storage. Also, mention GraphRAG for queries requiring multi-hop reasoning (e.g., "How does Project X relate to Employee Y's work on Project Z?").