The Question

Large-Scale Distributed ML Checkpointing System

Design a highly performant and reliable checkpointing system for a distributed machine learning cluster consisting of 10,000+ GPUs. The system must minimize the time training is paused (blocking time) while ensuring that multi-terabyte model states are durably stored. Address the challenges of massive synchronized I/O, network congestion, and high failure rates in large clusters, and explain how you would handle recovery and model resharding.

NVMe

etcd

gRPC

CRC32C

mTLS

Copy-on-Write

Questions & Insights

Clarifying Questions

What is the average model size and state? For a 175B parameter model, total state (parameters + gradients + optimizer states) is roughly 2.5TB–3TB.

What is the network topology and local storage availability? We assume a high-performance compute (HPC) environment with InfiniBand/RoCE and at least 1-2TB of local NVMe SSD per training node.

What is the target "blocking time" for GPUs? Ideally, the time GPUs are stalled for checkpointing should be < 5% of total training time.

What is the failure frequency? In a cluster of 10,000+ GPUs, we expect a Mean Time Between Failures (MTBF) of 12-24 hours; thus, checkpoints must be frequent and reliable.

Assumptions:

Model Size: 2TB per snapshot.

Scale: 1,250 nodes (8 GPUs each = 10,000 GPUs).

Backend: S3-compatible Object Storage or a Parallel File System (Lustre/WEKA).

Strategy: Asynchronous multi-staged checkpointing.

Thinking Process

The core bottleneck is the "I/O Wall": 10,000 GPUs trying to write TBs of data to a shared storage system simultaneously will cause massive congestion and idle time.

How do we eliminate the synchronous I/O bottleneck? Implement a 2-tier persistence strategy: Snapshot to local NVMe (fast) and offload to global storage in the background.

How do we handle the RAM-to-NVMe transfer without stalling the next training step? Use a "Copy-on-Write" or "In-memory double buffering" approach to allow GPUs to resume training while data is flushed.

How do we coordinate 10,000+ agents without a central bottleneck? Use a decentralized metadata service (etcd/Redis) and a peer-to-peer or staggered upload schedule.

How do we optimize storage costs and transfer speed? Implement incremental checkpointing (only saving changed weights/states) and sharding-aware writes.

Bonus Points

Zero-copy Snapshots: Utilizing kernel-level features or specialized ML frameworks (like PyTorch Distributed Checkpoint) to move data from GPU HBM to Host RAM without CPU-intensive serialization.

Topology-Aware Resumption: On failure, the system should ideally reschedule the job on the same physical topology to reuse locally cached checkpoints, minimizing cross-rack traffic during recovery.

Adaptive Staggering: Automatically jittering the upload start times for nodes to avoid saturating the top-of-rack (ToR) switches or the global storage frontend.

Delta-Compression: Storing only the differences between training steps for optimizer states, which change predictably, reducing storage by up to 40%.

Design Breakdown

Functional Requirements

Core Use Cases:

Save model states, optimizer states, and dataloader positions across 10k GPUs.

Restore the full cluster state to a specific "Global Step" after a crash.

Support "Resharding" (e.g., resuming on a different number of GPUs).

Scope Control:

In-Scope: Local buffering, background transfer logic, metadata management, and recovery orchestration.

Out-of-Scope: Model training loops, dataset management, and hyperparameter logging.

Non-Functional Requirements

Scale: Must handle 10,000+ concurrent GPU streams and TB-scale single-point-in-time snapshots.

Latency: Minimize GPU "stalling" (blocking) time to seconds, even for TB-scale models.

Availability & Reliability: The checkpoint service must be highly available; a failure in the checkpointing system should not crash the training job.

Consistency: Strong consistency for "Global Step" metadata; atomic commits for multi-node checkpoints.

Security: Encryption at rest and in transit for sensitive model weights.

Estimation

Data Size: 2 TB per checkpoint.

Frequency: Every 2 hours.

Aggregate Write Bandwidth (Raw): If writing to S3 directly over 10 minutes:

2TB / 600s \approx 3.3 GB/s

(sustainable), but bursty writes from 1,250 nodes can hit

1250 \times 1 Gbps \approx 150 GB/s

Storage Need: 10 checkpoints/day = 20TB/day. 30-day retention = 600TB.

Local NVMe Throughput: ~3-5 GB/s per node (easily handles the local buffer).

Blueprint

The design uses a Two-Tiered Asynchronous Checkpointing architecture.

Checkpoint Agent: A sidecar on every training node that captures GPU HBM data into local RAM/NVMe.

Checkpoint Controller: A central service managing metadata and orchestrating the "Global Commit."

Persistent Storage: S3/Object Storage for long-term durability.

Simplicity Audit: We avoid complex streaming frameworks. The MVP relies on the filesystem and a simple background process to move files, which is robust and easy to debug in HPC environments.

Architecture Decision Rationale:

Why this?: Moving data to local NVMe is orders of magnitude faster than the network. By treating local storage as a write-back cache, we decouple GPU training from network I/O.

Functional Satisfaction: Covers save/restore and enables scaling to 10k GPUs via distributed local writes.

Non-functional Satisfaction: Provides high availability via S3 and low latency via local buffering.

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:

Checkpoint Controller: A stateless gRPC service deployed in a High-Availability (HA) pair.

Agent: Runs as a daemonset or sidecar on every node.

Scaling is linear; as nodes are added, more Agents are added.

API Schema Design:

POST /v1/checkpoint/start: (Controller -> Agents) Trigger snapshot. Includes GlobalStepID.

POST /v1/checkpoint/status: (Agents -> Controller) Report local flush completion.

GET /v1/checkpoint/latest: (Job Scheduler -> Controller) Fetch metadata for recovery.

Resilience & Reliability:

Lease Mechanism: Agents take a lease on a checkpoint task. If an Agent dies, the Controller marks the global checkpoint as "Partial/Failed."

Staggered Uploads: The Controller provides each Agent a "Delay Window" for uploading to S3 to prevent a DDoS on the storage backend.

Storage

Access Pattern: Heavy burst writes (checkpointing), infrequent heavy reads (recovery).

Database Table Design (etcd/Metadata):

Checkpoints: checkpoint_id (PK), global_step, status (Pending/Committed/Failed), timestamp.

NodeSlices: node_id (PK), checkpoint_id, local_path, s3_uri, checksum.

Technical Selection:

Local: NVMe (XFS/Ext4) for low-latency buffering.

Global: S3/Object Store for high aggregate throughput and durability.

Distribution Logic:

Sharding: Data is naturally sharded by the model's Parallelism strategy (Data/Pipeline/Tensor Parallelism). Each GPU saves its own rank.

Cache

Purpose: Local NVMe acts as a "Cache" for writes.

Key-Value Schema: Not applicable (File-based).

Failure Handling: If NVMe is full, the Agent triggers an alert and falls back to synchronous network write (slow-path).

Data Processing

Processing Model: Asynchronous background file transfer.

Technical Selection: rclone or a custom Go-based multi-part uploader with integrity checks (CRC32C).

Correctness: Checkpoints are not marked "Committed" in the Controller until all Agents report successful upload and checksum verification.

Infrastructure (Optional)

Observability:

Metrics: CheckpointBlockingTime (seconds GPUs are idle), UploadThroughput (GB/s), LocalDiskUsage.

Alerting: Alert if LocalDiskUsage > 80% or if BackgroundUploadTime > TrainingStepTime.

Wrap Up

Advanced Topics

Trade-offs (Consistency vs. Availability): We choose Consistency (CP). A partial checkpoint is useless for training. If one node fails to save its state, the entire global checkpoint is discarded to prevent model divergence.

Optimization (Incremental Checkpointing): For large models, optimizer states (like Adam's

m

and

v

) don't need to be saved every time if we can reconstruct them or if they change slowly. We can implement a policy to save "Heavy" states every 10 steps and "Light" states (weights) every step.

Bottleneck Analysis:

S3 Ingress: Even S3 has limits. We mitigate this using a "Prefix Sharding" strategy (writing to s3://bucket/folder1/, s3://bucket/folder2/) to spread load across S3 partitions.

Security: All checkpoint data is encrypted using KMS-managed keys. mTLS is used for Agent-to-Controller communication to prevent unauthorized model weight access.