The Question
ML Design
Large Language Model Chatbot System
Design a scalable, production-grade conversational AI system similar to ChatGPT. The system must support multi-turn dialogue, ground its responses with Retrieval-Augmented Generation (RAG) to reduce hallucinations, and include a multi-stage alignment pipeline (SFT and DPO/RLHF). Constraints: a peak load of 5,000 queries per second (QPS), a P99 Time to First Token (TTFT) under 200 ms, and a robust safety moderation framework. Explain the end-to-end lifecycle, from data curation and tokenization to high-throughput inference with memory management techniques such as PagedAttention.
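To get a feel for why the throughput constraint pushes toward paged KV-cache management, a back-of-envelope sizing can help. The sketch below is only an illustration under stated assumptions: the model shape (32 layers, 8 KV heads, head dimension 128, 2-byte values), the 4-second mean request latency, the 1,000-token context, and the 16-token block size are all hypothetical numbers chosen for the example, not part of the question. Only the 5,000 QPS figure comes from the constraints above.

```python
import math

def kv_bytes_per_token(n_layers, n_kv_heads, head_dim, dtype_bytes=2):
    """Bytes of KV-cache stored per token across all layers."""
    # Both keys and values are cached, hence the leading factor of 2.
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

def blocks_needed(seq_len, block_size=16):
    """Fixed-size cache blocks for one sequence (paged allocation)."""
    # Only the last block can be partially filled, so waste is bounded
    # by block_size - 1 tokens per sequence.
    return math.ceil(seq_len / block_size)

def concurrent_requests(qps, mean_latency_s):
    """Little's law: requests in flight = arrival rate * time in system."""
    return qps * mean_latency_s

# Hypothetical model shape and workload (assumptions, see lead-in).
per_token = kv_bytes_per_token(n_layers=32, n_kv_heads=8, head_dim=128)
in_flight = concurrent_requests(qps=5000, mean_latency_s=4.0)
blocks = blocks_needed(seq_len=1000, block_size=16)

# Fleet-wide KV-cache footprint if every in-flight request holds
# a 1,000-token context.
total_bytes = in_flight * blocks * 16 * per_token
print(f"KV bytes per token: {per_token}")
print(f"Requests in flight: {in_flight:.0f}")
print(f"Blocks per sequence: {blocks}")
print(f"Total KV-cache: {total_bytes / 2**40:.2f} TiB")
```

Under these assumed numbers the cache alone runs to terabytes across the fleet, which is why an answer usually needs to discuss block-level allocation, sharing, and eviction rather than reserving a contiguous maximum-length buffer per request.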
LLM
SFT
DPO
RLHF
RAG
vLLM
PagedAttention
KV-Cache
FlashAttention
LoRA
Quantization
BPE
VectorDB
March 16, 2026