The Question
DesignScalable Ephemeral CI/CD Orchestration System
Design a highly available and scalable CI/CD system capable of handling 100,000 builds per day. The system should support event-driven triggers from version control providers, manage the lifecycle of thousands of ephemeral containerized runners, and provide efficient mechanisms for log streaming and artifact storage. Focus on job scheduling, isolation between builds, and the storage strategy for large-scale build metadata and binary data.
Kubernetes
S3
PostgreSQL
Kafka
Redis
Docker
Vault
gVisor
Vector
Questions & Insights
Clarifying Questions
Scale & Load: What is the expected scale in terms of developers, repositories, and concurrent builds? (e.g., 10,000 developers, 100,000 builds/day, 5,000 concurrent jobs).
Environment Support: Should the system support heterogeneous runners (Windows, MacOS, ARM) or focus on containerized (Docker/K8s) workloads for the MVP?
Artifact Management: Are we responsible for hosting a full-scale container registry/package manager, or just storing build logs and temporary binaries?
Security Requirements: How should secrets (API keys, SSH keys) be injected into the build environment?
Answers (Assumptions):
Scale: 10k devs, 100k builds/day, peak 5k concurrent jobs.
Environment: MVP focuses on Docker-based runners using Kubernetes.
Artifacts: Store build logs and zip/tar artifacts in Object Storage; external registry for Docker images.
Security: Integration with a Secret Management service (e.g., HashiCorp Vault).
Thinking Process
Event-Driven Triggering: How to reliably capture and queue build requests from VCS providers (GitHub/GitLab) without losing events during traffic spikes?
Resource Orchestration: How to manage the lifecycle of ephemeral runners to ensure job isolation, cost efficiency, and low "cold-start" latency?
State Management: How to track complex build DAGs (Directed Acyclic Graphs) and provide real-time status updates to the UI?
Data Persistence Strategy: How to handle the high-write volume of build logs and large binary artifacts without bottlenecking the primary database?
Bonus Points
DAG Optimization: Implementing "Incremental Builds" and "Remote Caching" (e.g., Bazel-style) to skip redundant steps by hashing inputs/outputs.
Fair-Share Scheduling: Implementing a scheduler that prevents a single large monorepo or team from hogging all runner resources.
Log Streaming Architecture: Using a sidecar pattern in runners to stream logs to a distributed log-buffer (Kafka) to ensure logs are preserved even if a runner node crashes.
Ephemeral Security: Implementing "Just-In-Time" (JIT) short-lived credentials for runners to interact with cloud resources, minimizing the blast radius of a compromised build.
Design Breakdown
Functional Requirements
Core Use Cases:
Define pipelines via YAML configuration.
Trigger builds on git push/pull-request events.
Execute multi-step jobs (Build, Test, Lint, Deploy).
View real-time build logs and status updates.
Store and download build artifacts.
Scope Control:
In-Scope: Webhook handling, Job scheduling, Ephemeral runner management, Log/Artifact storage.
Out-of-Scope: Native Git hosting (assume GitHub/Bitbucket), IDE integrations, Managed Kubernetes service.
Non-Functional Requirements
Scale: Support 5,000+ concurrent runners and millions of historical build records.
Latency: Low overhead for job scheduling (< 5 seconds from webhook to runner start).
Availability & Reliability: 99.9% availability for the API; 99.99% durability for artifacts.
Consistency: Strong consistency for build status updates to avoid "Build Succeeded" UI flickers.
Fault Tolerance: Automatic job retries on infrastructure failure (not test failure).
Security: Strict isolation between runners; encrypted secrets at rest and in transit.
Estimation
Traffic Estimation:
100k builds/day \approx 1.15 builds/second average.
Peak QPS (Webhooks): Assuming 10x average = 12 QPS.
Log ingestion: 100k builds * 10MB logs/build \approx 1TB log data/day.
Storage Estimation:
Metadata: 100k builds * 1KB/record \approx 100MB/day \approx 36GB/year.
Artifacts: 100k builds * 100MB/build (50% produce artifacts) \approx 5TB/day.
Bandwidth Estimation:
Inbound: 5TB/day \approx 460 Mbps (primarily artifacts and logs).
Outbound: Significant during deployment phases, varying by target.
Blueprint
Concise Summary: An event-driven microservices architecture where an API Gateway ingests VCS webhooks, a Build Service manages state, and a Scheduler dispatches jobs to a pool of ephemeral Kubernetes runners.
Major Components:
API Gateway: Entry point for webhooks and developer UI/CLI.
Build Service: The "Brain" that parses YAML and manages the build state machine.
Job Queue: Decouples build orchestration from execution.
Runner Manager: Auto-scales Kubernetes pods based on queue depth.
Artifact/Log Store: Durable storage for binary outputs and logs.
Simplicity Audit: This design uses a standard producer-consumer pattern with a managed Kubernetes cluster for runners, avoiding the complexity of building custom VM management or proprietary scheduling logic.
Architecture Decision Rationale:
Why this architecture?: Kubernetes provides the best-in-class primitives for container isolation and scaling, which is the industry standard for modern CI/CD.
Functional Requirement Satisfaction: Meets the need for pipeline execution, log viewing, and artifact storage.
Non-functional Requirement Satisfaction: Uses horizontal scaling (K8s) for scale and S3 for cost-effective, high-durability storage.
High Level Architecture
Sub-system Deep Dive
Edge (Optional)
Content Delivery & Traffic Routing:
API Gateway: Handles SSL termination and routes traffic to the Build Service.
Rate Limiting: Applied per GitHub Organization/User to prevent webhook floods.
Security:
Webhook Validation: HMAC signature verification for incoming VCS events.
WAF: Standard protection against common web exploits.
Service
Build Service:
YAML Parser: Fetches
.ci.yaml from the repo and converts it into a Job DAG.State Machine: Tracks build status (Queued, Running, Succeeded, Failed).
API: RESTful endpoints for the UI to query build status.
Runner Manager:
Scaling Logic: Monitors
Job Queue depth. Uses K8s Horizontal Pod Autoscaler (HPA) or Cluster Autoscaler to provision nodes/pods.Cleanup: Implements a "Reaper" to kill stale pods or runners that exceed a timeout (e.g., 60 minutes).
API Schema:
POST /webhooks/github: Processes push events.GET /builds/{id}: Returns status and metadata.GET /builds/{id}/logs: Returns log stream/URL.Storage
Access Pattern: Metadata is Read-Heavy (UI polling); Logs/Artifacts are Write-Heavy during build and Read-Light afterward.
Database Table Design (PostgreSQL):
Projects: id, repo_url, secret_id, created_at.Builds: id, project_id, commit_sha, status, trigger_type, start_time, end_time.Jobs: id, build_id, name, status, runner_id, log_path.Technical Selection:
PostgreSQL: Chosen for transactional integrity of build states.
Object Storage (S3): Chosen for massive scalability and cost-efficiency for artifacts and logs.
Cache
Purpose: Reducing DB load and managing transient runner state.
Key-Value Schema:
build_status:{id}: TTL 5 mins. Fast lookup for UI.runner_heartbeat:{id}: TTL 30s. Tracks health of active runners.Technical Selection: Redis. High performance and built-in TTL support.
Messaging
Purpose: Asynchronous job dispatching and decoupling.
Event Schema:
JobEvent { build_id, job_id, image, commands, env_vars }.Throughput & Partitioning: Kafka partitions based on
project_id to ensure ordered execution if needed, though CI jobs are typically concurrent.Technical Selection: Kafka or AWS SQS. Kafka is preferred for replaying events and high throughput.
Data Processing
Log Aggregator:
Model: A Sidecar container in the Runner Pod tails log files and pushes chunks to the Log Aggregator.
Output Sink: Batches logs and writes to S3 every 5 seconds or 1MB.
Technical Selection: Vector or Fluent-bit for high-performance log shipping.
Wrap Up
Advanced Topics
Trade-offs:
Isolation vs. Speed: We choose Docker-in-Docker or K8s pods for isolation, which has a higher startup time compared to pre-warmed shared runners but offers better security.
Reliability:
Idempotency: If the Build Service sends the same Job ID to the Queue twice, the Runner Manager ensures only one pod is created using K8s Job constraints.
Bottleneck Analysis:
K8s API Server: Under 5,000 concurrent jobs, the K8s API might become a bottleneck. Solution: Use multiple K8s clusters and shard the Runner Manager.
Artifact Bandwidth: Massive artifacts can saturate NICs. Solution: Use VPC endpoints for S3 and direct-connect.
Security:
Runner Hardening: Use
gVisor or Kata Containers to provide stronger isolation than standard Docker containers, preventing container escape attacks.Optimization:
Layer Caching: Provide a shared SSD volume or a local Docker registry cache for runners to speed up
docker build steps.