The Question

Scalable Machine Learning Training & Evaluation Platform

Design a distributed system for managing and executing machine learning training and evaluation jobs. The platform must support heterogeneous hardware requirements (CPUs and GPUs), handle long-running tasks (up to 24 hours), and provide real-time log streaming and artifact management. Key challenges include job scheduling, fault tolerance for worker failures, resource isolation, and handling high-volume log data at scale. Define the end-to-end flow from job submission to result retrieval, ensuring high reliability and observability.

PostgreSQL

Redis

Kafka

Kubernetes

gRPC

Docker

FluentBit

Questions & Insights

Clarifying Questions

What is the nature and duration of the jobs? Are we talking about short-lived evaluation scripts (seconds/minutes) or long-running deep learning training jobs (hours/days)? Assumption: Jobs are heterogeneous, ranging from 5 minutes to 24 hours.

What is the scale of the system? How many concurrent jobs and total jobs per day? Assumption: 10,000 concurrent jobs, 100,000 jobs created per day.

What are the resource requirements? Do jobs require specific hardware like GPUs or high-memory instances? Assumption: Support for heterogeneous worker pools (CPU/GPU) is required.

How should we handle job logs and metrics? Do users need real-time streaming of logs during training? Assumption: Near real-time log streaming is required for monitoring training progress.

What is the failure recovery policy? Should the system automatically retry failed jobs? Assumption: At-least-once execution with configurable retry policies per job.

Thinking Process

Core Bottleneck: Efficiently scheduling and tracking long-running, resource-intensive tasks across a distributed worker fleet while maintaining state consistency and observability.

Key Progressive Questions:

How do we decouple job submission from execution to ensure the API remains responsive? (Async Task Queue)

How do we ensure workers are healthy and handle job timeouts or preemption? (Heartbeat & State Machine)

How do we efficiently stream and store massive volumes of training logs without choking the database? (Distributed Logging & Object Storage)

How do we manage resource isolation and heterogeneous hardware (GPU vs. CPU)? (Container Orchestration/K8s)

Bonus Points

Fair-Share Scheduling: Implementing a multi-tenant scheduler that prevents a single user or team from exhausting the entire cluster's GPU capacity.

Spot Instance Integration: Designing a "preemption-aware" worker that can checkpoint model state to S3 when a cloud provider reclaims spot instances, reducing costs by up to 70%.

Data Locality Optimization: Intelligence in the scheduler to place jobs on workers physically closer to the training data (S3 bucket region) to minimize egress costs and latency.

Sidecar Log Aggregation: Using a sidecar pattern (FluentBit/Vector) on workers to stream logs directly to Object Storage and a Search Index, bypassing the primary application database.

Design Breakdown

Functional Requirements

Core Use Cases:

Users can submit training or evaluation jobs with specific resource requirements (vCPU, RAM, GPU).

Users can monitor job status (Pending, Running, Succeeded, Failed).

Users can view/stream real-time logs and download output artifacts (model weights, report files).

Admin can cancel or restart jobs.

Scope Control:

In-Scope: Job lifecycle management, worker orchestration, log aggregation, and artifact storage.

Out-of-Scope: Building the actual ML training frameworks (TensorFlow/PyTorch), data labeling tools, or model serving (Inference).

Non-Functional Requirements

Scale: Support for 10k+ concurrent active jobs and 100k daily submissions.

Latency: Job state transitions (e.g., status updates) should reflect in the UI within < 1 second.

Availability & Reliability: 99.9% availability for the API; jobs must be persisted and retried upon infrastructure failure.

Consistency: Strong consistency for job state transitions to prevent duplicate executions (Idempotency).

Fault Tolerance: Handle worker crashes via heartbeats and timeouts.

Security: Isolation between jobs (containers); secure access to user data.

Estimation

Traffic Estimation:

Job Submissions: 100k/day ≈ 1.2 jobs/sec (Avg). Peak ≈ 10-20 jobs/sec.

Status Polls/Websocket Updates: If 10k users poll every 5s, that's 2,000 QPS.

Storage Estimation:

Metadata: 100k jobs/day * 2KB/job = 200MB/day. 5-year retention ≈ 365GB (Small).

Logs: 100MB logs per job * 100k jobs = 10TB/day (High - requires Object Storage).

Artifacts: 1GB per training job * 10k training jobs = 10TB/day (Significant).

Bandwidth Estimation:

Outgoing: Streaming logs and downloading models. 10k active jobs streaming logs ≈ 100MB/s total.

Blueprint

Concise Summary: A microservices-based architecture using a persistent Task Queue (Kafka) to decouple job submission from worker execution, managed by a centralized Job Orchestrator.

Major Components:

API Gateway: Entry point for authentication, rate limiting, and job submission.

Job Service (Orchestrator): Manages the state machine, metadata, and interacts with the scheduler.

Task Queue (Kafka): Distributed log for reliable job distribution and worker load leveling.

Worker Fleet: Containerized nodes (K8s) that pull tasks, execute the code, and stream logs.

Object Storage (S3): Source of truth for large artifacts and training logs.

Simplicity Audit: This design avoids complex distributed locking by using Kafka's consumer group mechanics for task distribution and a relational database for state management.

Architecture Decision Rationale:

Why this architecture?: Decoupling via Kafka allows the system to handle bursts in job submissions without overwhelming workers.

Functional Satisfaction: Covers the full lifecycle from submission to artifact retrieval.

Non-functional Satisfaction: Scalable worker fleet (K8s), durable storage (S3), and reliable state (PostgreSQL).

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Global DNS with latency-based routing. Static assets (UI) served via CDN.

Security & Perimeter:

API Gateway: Handles JWT validation.

Rate Limiting: Tiered limits (e.g., Free users: 5 jobs/day; Enterprise: Unlimited).

SSL/TLS: Terminated at the Gateway.

Service

Topology & Scaling:

Job Service: Stateless, horizontally scaled deployment in multiple Availability Zones (AZs).

Worker Fleet: Auto-scaled based on Kafka lag and resource utilization (CPU/GPU).

API Schema Design:

POST /v1/jobs: Create job. (Body: image_url, resources, cmd, env_vars).

GET /v1/jobs/{id}: Returns status, metadata, and log pointers.

DELETE /v1/jobs/{id}: Cancellation request.

Idempotency: client_token in headers to prevent duplicate submissions.

Resilience & Reliability:

Heartbeats: Workers update a TTL-based key in Redis every 30s. If a heartbeat is missed, the Job Service re-queues the task.

Graceful Shutdown: Workers catch SIGTERM to checkpoint and update status to PREEMPTED.

Observability:

Metrics: Track job_wait_time, job_failure_rate, and gpu_utilization.

Storage

Access Pattern: Heavy writes for job creation; heavy reads for status monitoring. Large sequential writes/reads for logs/artifacts.

Database Table Design (PostgreSQL):

Jobs: id (UUID), user_id, status (enum), priority, resource_spec (JSONB), created_at, started_at, finished_at.

Job_Retries: job_id, attempt_number, error_log_snippet.

Technical Selection:

PostgreSQL: Chosen for ACID compliance on state transitions.

S3: Chosen for virtually infinite scaling of large artifacts and logs.

Distribution Logic:

Sharding by user_id if the job count exceeds 100M rows, but for MVP, a single RDS instance with read replicas is sufficient.

Cache

Purpose & Justification: Reduces DB load for high-frequency status checks and manages ephemeral worker heartbeats.

Key-Value Schema:

job:status:{id} -> String value. TTL 24h.

worker:heartbeat:{worker_id} -> Timestamp. TTL 90s.

Failure Handling: If Redis fails, the Job Service falls back to the Postgres DB.

Messaging

Purpose & Decoupling: Acts as a buffer between the API and the compute-heavy workers.

Event / Topic Schema:

job.submitted: Contains job metadata and resource requirements.

job.events: Status updates (Started, Progress, Finished) consumed by the Job Service to update the DB.

Technical Selection: Kafka. Required for high throughput and the ability to replay events if the Job Service or DB goes down.

Infrastructure (Optional)

Distributed Coordination:

Kubernetes (K8s): Used for orchestrating containerized workers. K8s Jobs or Custom Controllers can be used to manage the worker lifecycle and resource scheduling (e.g., NVIDIA GPU operator).

Wrap Up

Advanced Topics

Trade-offs: We choose Availability over Consistency (AP) for the log streaming, but Consistency over Availability (CP) for the job state transitions to avoid double-billing or double-running.

Reliability: Uses a "Reconciliation Loop" (similar to K8s controllers). A background task checks for "zombie" jobs (status = RUNNING but no heartbeat) and moves them back to PENDING.

Scalability: The use of Kafka allows us to add thousands of workers without modifying the API layer.

Security: Workers should run in a restricted VPC with no access to the internal Metadata DB; they only communicate via the Task Queue and Object Storage using pre-signed URLs.