The Question
Design: Domain-Specific LLM Fine-Tuning Platform
Design a system for fine-tuning large language models on domain-specific datasets. The system should support dataset versioning and preprocessing, distributed training job orchestration, evaluation pipelines, and model registry with deployment capabilities for iterative improvement.
S3
Redis Queue
PostgreSQL
GPU Workers
PEFT/LoRA
Questions & Insights
Thinking Process
Fine-tuning an LLM is a compute-intensive, long-running asynchronous process. The core challenge is not just the "tuning" itself, but the orchestration of heavy data, state management of long-running jobs, and efficient GPU utilization.
Progressive Discovery Questions:
The Bottleneck: Fine-tuning takes hours/days; how do we prevent the API from timing out? (Answer: Decouple via an asynchronous Task Queue).
Data Gravity: We are moving GBs of domain data and model weights; how do we store and access them efficiently? (Answer: Centralized Object Storage with local SSD caching on GPU nodes).
The Compute Cost: Full parameter fine-tuning is expensive; how do we make this feasible for an MVP? (Answer: Implement PEFT/LoRA to reduce memory footprint and training time).
The Reliability: What happens if a GPU node fails 10 hours into a 12-hour job? (Answer: Implement frequent weight checkpointing to S3).
Bonus Points
PEFT (LoRA/QLoRA): Instead of tuning billions of parameters, we tune a small adapter (~1% of weights), drastically reducing VRAM requirements and storage costs.
Spot Instance Orchestration: Implementing checkpoint-and-resume logic allows using AWS Spot or GCP Preemptible GPUs, cutting compute costs by up to 90%.
Data Lineage: Versioning datasets and model weights together (e.g., using DVC or S3 Versioning) to ensure reproducibility for regulatory/audit compliance.
Triton/Flash Attention: Integrating hardware-specific kernels in the training loop to maximize TFLOPS utilization on A100/H100 clusters.
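To make the "~1% of weights" figure in the PEFT point concrete, here is a back-of-the-envelope count. LoRA replaces the update to a d_out x d_in weight matrix with two low-rank factors of shape d_out x r and r x d_in, so each adapted matrix trains r * (d_in + d_out) parameters. The dimensions below (hidden size 4096, 32 layers, rank 16, four square attention projections per layer, 8B total parameters) are illustrative assumptions, not the configuration of any specific model.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one LoRA-adapted weight matrix."""
    # Factor A is (rank x d_in), factor B is (d_out x rank).
    return rank * (d_in + d_out)


def adapter_size(hidden: int, layers: int, rank: int,
                 matrices_per_layer: int, total_params: int):
    """Return (trainable parameter count, fraction of the full model)."""
    trainable = layers * matrices_per_layer * lora_params(hidden, hidden, rank)
    return trainable, trainable / total_params


# Assumed, illustrative dimensions (see the note above).
trainable, fraction = adapter_size(
    hidden=4096, layers=32, rank=16,
    matrices_per_layer=4, total_params=8_000_000_000,
)
print(f"{trainable:,} trainable parameters ({fraction:.2%} of the model)")
```

Under these assumptions the adapter comes to about 16.8M parameters, roughly 0.2% of the model. Higher ranks, or adapting the MLP projections as well, push the share toward the ~1% mentioned above.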
Design Breakdown
Functional Requirements
Users can upload domain-specific datasets (PDF, JSONL, TXT).
Users can trigger a fine-tuning job specifying a base model (e.g., Llama-3).
System must provide real-time status updates (Queued, Training, Completed, Failed).
Users can download the fine-tuned adapter/weights.
System must perform a basic evaluation (loss/perplexity) post-training.
Non-Functional Requirements
Scalability: Support horizontal scaling of GPU worker nodes.
Reliability: Persistence of job state and automated retries for transient infrastructure failures.
Efficiency: Minimize data transfer time between storage and GPU.
Security: Isolation of tenant data during the training process.
Estimation
Dataset Size: Average domain corpus is 500MB - 5GB.
Model Size: Llama-3 8B (FP16) ~15GB; 70B ~140GB.
Throughput: 10 concurrent fine-tuning jobs for an MVP.
Storage: 10 jobs * (5GB data + 15GB base + 1GB adapter) = ~210GB active storage.
Compute: 1x A100 (80GB) per 8B model job; multi-GPU for larger models.
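The storage figure above is simple enough to script, which makes it easy to re-run when the assumptions change. The sketch below reuses the numbers from the estimate; the shared_base flag is an added assumption that models keeping a single copy of the base model instead of one per job.

```python
def active_storage_gb(jobs: int, dataset_gb: float, base_model_gb: float,
                      adapter_gb: float, shared_base: bool = False) -> float:
    """Active storage in GB for a number of concurrent fine-tuning jobs."""
    per_job = dataset_gb + adapter_gb
    if shared_base:
        # Assumption: every job reads the same stored copy of the base model.
        return jobs * per_job + base_model_gb
    return jobs * (per_job + base_model_gb)


# Numbers taken from the estimate above: 10 jobs, 5GB data, 15GB base, 1GB adapter.
per_job_copy = active_storage_gb(10, 5, 15, 1)
shared = active_storage_gb(10, 5, 15, 1, shared_base=True)
print(per_job_copy, shared)
```

With a per-job copy of the base model this reproduces the ~210GB figure; if the base model is stored once and shared, the same workload needs about 75GB.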
Blueprint
Concise Summary: A queue-based asynchronous architecture where a lightweight API manages job metadata and high-performance GPU workers consume tasks to perform LoRA fine-tuning using Object Storage for persistence.
Major Components:
API Gateway & Service: Handles user requests, data uploads, and job orchestration.
Metadata Database: Tracks job states, hyper-parameters, and file locations.
Task Queue (Redis): Decouples the API from long-running GPU processes.
Object Store (S3): Acts as the "source of truth" for raw data, base models, and tuned weights.
GPU Training Worker: Dedicated compute nodes that execute the training loop and evaluate the model.
Simplicity Audit: This design avoids complex Kubernetes operators (KubeRay/Kubeflow) in favor of a simple worker-consumer pattern, which is sufficient for an MVP and easier to debug.
High Level Architecture
Sub-system Deep Dive
Service
Topology & Scaling: The API and Job Manager are deployed as stateless microservices in a standard container environment. They scale based on CPU/Request count.
API Spec:
POST /v1/datasets: Upload data (Multipart or S3 Presigned URL).
POST /v1/jobs: Trigger FT (Params: base_model, dataset_id, hyperparams).
GET /v1/jobs/{id}: Poll for status and metrics (loss/accuracy).
Communication: Asynchronous. The API returns a job_id immediately after persisting the request and pushing it to the queue.
Storage
Data Model: PostgreSQL stores:
Jobs: id, user_id, status, base_model_path, dataset_path, output_path, config_json.
Datasets: id, s3_uri, version, token_count.
Database Logic: Read-heavy for status polling; Write-heavy only when workers heartbeat metrics.
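The Jobs and Datasets tables, together with the "persist, enqueue, return job_id" submit path, can be sketched as follows. This is a minimal illustration, not the production schema: SQLite stands in for PostgreSQL, an in-process deque stands in for the Redis queue, and the column types, the status CHECK constraint, and the example S3 paths are assumptions.

```python
import json
import sqlite3
import uuid
from collections import deque

# Column names follow the data model above; types are assumed.
SCHEMA = """
CREATE TABLE datasets (
    id TEXT PRIMARY KEY,
    s3_uri TEXT NOT NULL,
    version INTEGER NOT NULL,
    token_count INTEGER
);
CREATE TABLE jobs (
    id TEXT PRIMARY KEY,
    user_id TEXT NOT NULL,
    status TEXT NOT NULL
        CHECK (status IN ('Queued', 'Training', 'Completed', 'Failed')),
    base_model_path TEXT NOT NULL,
    dataset_path TEXT NOT NULL,
    output_path TEXT,
    config_json TEXT NOT NULL
);
"""


def submit_job(db, queue, user_id, base_model_path, dataset_path, config):
    """Persist the job first, then enqueue it, then return the id."""
    job_id = str(uuid.uuid4())
    with db:  # commits on success, rolls back on error
        db.execute(
            "INSERT INTO jobs (id, user_id, status, base_model_path,"
            " dataset_path, config_json) VALUES (?, ?, 'Queued', ?, ?, ?)",
            (job_id, user_id, base_model_path, dataset_path, json.dumps(config)),
        )
    queue.append(job_id)  # stand-in for pushing to the Redis queue
    return job_id


db = sqlite3.connect(":memory:")
db.executescript(SCHEMA)
queue = deque()

# Hypothetical paths and hyperparameters, for illustration only.
job_id = submit_job(db, queue, "user-1", "s3://models/base",
                    "s3://datasets/ds-1/v1", {"lora_rank": 16})
status = db.execute(
    "SELECT status FROM jobs WHERE id = ?", (job_id,)
).fetchone()[0]
print(job_id, status)
```

Writing the row before enqueueing means a worker never sees a job id that has no metadata; the reverse order would open that window.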
Cache
Usage: Redis is used for ephemeral state tracking.
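TTL: Job status keys (e.g., job:status:123) have a TTL of 24 hours post-completion to prevent memory bloat.
Eviction: Redis does not evict keys by default (noeviction). If a memory cap is configured, a volatile-lru policy restricts eviction to keys that carry a TTL, so queue entries stored in the same instance are not dropped. The volume of status keys is low enough that memory is rarely a constraint.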
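A small sketch of the TTL rule: the status key gets no expiry while the job is active and a 24-hour expiry once it reaches a terminal state. The function only assumes a client exposing a redis-py style set(name, value, ex=...) method; FakeClient is a hypothetical stand-in so the example runs without a Redis server, and the JSON payload shape is an assumption.

```python
import json

TERMINAL_STATES = {"Completed", "Failed"}
STATUS_TTL_SECONDS = 24 * 60 * 60  # 24 hours, as described above


def cache_job_status(client, job_id, status, metrics=None):
    """Write the status key; attach a TTL only for terminal states."""
    key = f"job:status:{job_id}"
    payload = json.dumps({"status": status, "metrics": metrics or {}})
    ttl = STATUS_TTL_SECONDS if status in TERMINAL_STATES else None
    client.set(key, payload, ex=ttl)  # ex=None means "no expiry"
    return key


class FakeClient:
    """Hypothetical in-memory stand-in that records set() calls."""

    def __init__(self):
        self.calls = []

    def set(self, name, value, ex=None):
        self.calls.append((name, value, ex))


client = FakeClient()
cache_job_status(client, 123, "Training", {"loss": 1.9})
cache_job_status(client, 123, "Completed", {"loss": 0.7})
for name, _, ex in client.calls:
    print(name, ex)
```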
Messaging
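Topology: A Redis List or Stream is used as a FIFO queue. Streams with consumer groups support acknowledgement natively; a plain List needs a separate in-flight structure to achieve the same effect.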
Delivery Guarantees: "At least once" delivery. Workers acknowledge (ACK) the task only after the final weights are successfully uploaded to S3.
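Visibility Timeout: If a worker crashes, the task reappears in the queue after a timeout so another worker can claim it. Because jobs run for hours, the worker must periodically extend the timeout (heartbeat); otherwise a healthy job would be redelivered mid-run.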
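The claim, acknowledge, and redelivery behavior described above can be illustrated with a small in-memory queue. This is a sketch of the semantics only, not a Redis client: VisibilityQueue and its method names are hypothetical, and the clock is injectable so the timeout can be simulated without sleeping.

```python
import time
from collections import deque


class VisibilityQueue:
    """At-least-once FIFO queue with a visibility timeout (in-memory sketch)."""

    def __init__(self, visibility_timeout, clock=time.monotonic):
        self._ready = deque()
        self._in_flight = {}  # task_id -> deadline
        self._timeout = visibility_timeout
        self._clock = clock

    def push(self, task_id):
        self._ready.append(task_id)

    def _requeue_expired(self):
        now = self._clock()
        expired = [t for t, deadline in self._in_flight.items() if deadline <= now]
        for task_id in expired:
            del self._in_flight[task_id]
            self._ready.append(task_id)

    def claim(self):
        """Hand out the next task and hide it until its deadline passes."""
        self._requeue_expired()
        if not self._ready:
            return None
        task_id = self._ready.popleft()
        self._in_flight[task_id] = self._clock() + self._timeout
        return task_id

    def heartbeat(self, task_id):
        """Extend the deadline of a task that is still being worked on."""
        if task_id in self._in_flight:
            self._in_flight[task_id] = self._clock() + self._timeout

    def ack(self, task_id):
        """Remove a finished task; returns False if it was not in flight."""
        return self._in_flight.pop(task_id, None) is not None


now = [0.0]  # fake clock so the timeout can be simulated
q = VisibilityQueue(visibility_timeout=60, clock=lambda: now[0])
q.push("job-1")
first = q.claim()    # worker A claims the task
now[0] = 61.0        # worker A crashed and never acknowledged
second = q.claim()   # worker B receives the same task again
acked = q.ack(second)
third = q.claim()    # nothing left after the ACK
print(first, second, acked, third)
```

With Redis Streams, comparable behavior is usually built from consumer groups: XACK acknowledges an entry, and XAUTOCLAIM lets another consumer take over entries that have been idle longer than a threshold.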
Data Processing
Component: GPU Training Worker.
Execution Flow:
Initialization: Pull base model (Llama-3) and dataset from S3 to local NVMe SSD.
Pre-processing: Tokenize the dataset using the model's tokenizer.
Training Loop: Execute PEFT/LoRA using libraries like HuggingFace peft and bitsandbytes (for 4-bit quantization).
Checkpointing: Every N steps, upload partial weights to S3.
Evaluation: Run a validation set; generate a JSON report.
Finalization: Merge (optional) and upload final LoRA adapters.
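The checkpointing step in the flow above is what makes a 10-hour failure cheap to recover from. The sketch below shows the checkpoint-and-resume pattern in isolation, under loud assumptions: a local temporary directory stands in for S3, a JSON file stands in for adapter weights and optimizer state, the "training step" is a placeholder, and fail_at is a hypothetical hook for simulating a node failure.

```python
import json
import tempfile
from pathlib import Path


def save_checkpoint(ckpt_dir: Path, state: dict) -> None:
    """Write the checkpoint to a temp file, then rename it into place."""
    path = ckpt_dir / f"step-{state['step']:08d}.json"
    tmp = path.with_suffix(".tmp")
    tmp.write_text(json.dumps(state))
    tmp.replace(path)  # a crash mid-write never leaves a partial .json file


def latest_checkpoint(ckpt_dir: Path):
    """Return the newest checkpoint state, or None if there is none."""
    files = sorted(ckpt_dir.glob("step-*.json"))  # zero-padded names sort by step
    return json.loads(files[-1].read_text()) if files else None


def train(ckpt_dir: Path, total_steps: int, checkpoint_every: int, fail_at=None) -> int:
    """Run (or resume) training; returns the number of steps executed this run."""
    state = latest_checkpoint(ckpt_dir) or {"step": 0, "loss": None}
    step = state["step"]
    executed = 0
    while step < total_steps:
        if fail_at is not None and step == fail_at:
            raise RuntimeError("simulated node failure")
        step += 1
        executed += 1
        loss = 1.0 / step  # placeholder for a real optimizer step
        if step % checkpoint_every == 0 or step == total_steps:
            save_checkpoint(ckpt_dir, {"step": step, "loss": loss})
    return executed


with tempfile.TemporaryDirectory() as tmp_dir:
    ckpt_dir = Path(tmp_dir)
    try:
        train(ckpt_dir, total_steps=100, checkpoint_every=20, fail_at=45)
    except RuntimeError:
        pass  # the worker died; a new worker picks the job up
    resumed_from = latest_checkpoint(ckpt_dir)["step"]
    executed = train(ckpt_dir, total_steps=100, checkpoint_every=20)

print(resumed_from, executed)
```

The replacement worker restarts from step 40 and repeats only the five steps lost since the last checkpoint, so the choice of N is a trade-off between upload overhead and the amount of work redone after a failure.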
Wrap Up
Advanced Topics
Trade-offs:
Consistency vs. Availability: We favor eventual consistency. The API might still show "Queued" while the worker has already claimed the job and is downloading the model.
LoRA vs. Full Tuning: We trade some potential model quality for a large cost reduction (on the order of 90%) and significantly lower hardware requirements.
Bottlenecks:
Data I/O: Downloading 100GB models from S3 to Workers. Optimization: Keep a local cache of popular base models on the worker's persistent disk.
Cold Starts: GPU nodes take time to spin up. Optimization: Maintain a "warm pool" of workers.
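The local base-model cache mentioned under Data I/O can be as simple as a directory check before downloading. In this sketch, ensure_local, the ".complete" marker file, and the download callback are all hypothetical names; the callback stands in for whatever actually copies the model from S3.

```python
import tempfile
from pathlib import Path


def ensure_local(model_id: str, cache_dir: Path, download):
    """Return (local path, cache_hit) for a base model, downloading on a miss."""
    target = cache_dir / model_id.replace("/", "--")
    marker = target / ".complete"  # written only after a full download
    if marker.exists():
        return target, True
    target.mkdir(parents=True, exist_ok=True)
    download(model_id, target)  # stand-in for the S3 transfer
    marker.touch()
    return target, False


downloads = []


def fake_download(model_id, target):
    """Hypothetical downloader that just records the call."""
    downloads.append(model_id)
    (target / "weights.bin").write_bytes(b"placeholder")


with tempfile.TemporaryDirectory() as tmp_dir:
    cache = Path(tmp_dir)
    _, first_hit = ensure_local("org/base-model", cache, fake_download)
    _, second_hit = ensure_local("org/base-model", cache, fake_download)

print(first_hit, second_hit, len(downloads))
```

The marker file is there so that a download interrupted halfway is not mistaken for a usable cached model on the next job.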
Failure Handling:
OOM (Out of Memory): If a training job hits CUDA OOM, the worker catches the exception, updates the Job DB status to "Failed," and logs the error.
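A minimal sketch of that failure path is shown below. OutOfMemory is a hypothetical stand-in for the framework's CUDA OOM exception (in PyTorch this would likely be torch.cuda.OutOfMemoryError), and a dict and a list stand in for the Job DB and the log sink.

```python
class OutOfMemory(RuntimeError):
    """Hypothetical stand-in for the training framework's CUDA OOM error."""


def run_job(job_id, train_fn, statuses, log):
    """Run one job and record its terminal status; returns True on success."""
    statuses[job_id] = "Training"
    try:
        train_fn()
    except OutOfMemory as exc:
        # Mark the job Failed instead of re-queueing it.
        statuses[job_id] = "Failed"
        log.append(f"job {job_id}: out of memory: {exc}")
        return False
    statuses[job_id] = "Completed"
    return True


def oom_training():
    raise OutOfMemory("batch does not fit in GPU memory")


statuses, log = {}, []
ok = run_job("job-1", lambda: None, statuses, log)
failed = run_job("job-2", oom_training, statuses, log)
print(statuses, log)
```

An OOM is typically deterministic for a given model, batch size, and sequence length, so an automatic retry with the same configuration would usually fail again; that is why this path ends in "Failed" rather than going back through the retry logic used for transient infrastructure failures.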
Alternatives:
Serverless GPUs: Using services like RunPod or Modal instead of managing EC2 G5/P4 instances to simplify infrastructure at the cost of higher per-minute rates.