The Question
Design

Distributed Cron Job Scheduler

Design a highly available and scalable distributed system to schedule and execute millions of recurring jobs (cron tasks). The system must support high-precision triggering (within seconds of the scheduled time), handle massive execution spikes at specific time boundaries (e.g., top of the hour), and ensure that jobs are triggered even in the event of partial system failure. Detail your strategy for job metadata storage, the mechanism for polling or triggering due jobs, and how you ensure tasks are distributed to workers efficiently while maintaining at-least-once execution guarantees.
PostgreSQL
Kafka
Etcd
Timing Wheel
CDC
gRPC
Docker
Questions & Insights

Clarifying Questions

Scale: How many total scheduled jobs are we supporting, and what is the peak execution rate per second?
Precision: Does the system need second-level precision, or is minute-level (standard Unix cron) sufficient?
Execution Guarantees: Is "at-least-once" execution acceptable, or do we strictly require "exactly-once" (which involves heavy distributed locking)?
Job Duration: Are jobs typically short-lived (e.g., calling a webhook) or long-running (e.g., data processing tasks)?
Assumptions:
10 million total registered jobs.
10,000 jobs triggered per second at peak.
Minute-level precision is acceptable for MVP.
At-least-once execution is the target (idempotency handled by the user's task).

Thinking Process

Core Bottleneck: How to efficiently "scan" millions of jobs to find which ones are due without melting the database or missing a window.
Progressive Approach:
Metadata Storage: How do we store cron expressions and the "next run time"?
The Poller/Scheduler: How do we pick up jobs? (Leader election vs. Sharding).
Task Distribution: How do we decouple the "decision to run" from the "actual execution"?
Reliability: How do we handle worker crashes or scheduler failovers?

Bonus Points

Hierarchical Timing Wheels: Implementing a timing wheel for sub-second precision and O(1) task scheduling.
Sharded Polling: Partitioning the job metadata by next_run_time and job_id to allow multiple scheduler instances to work in parallel without overlap.
Execution Isolation: Using containerized runners (e.g., temporary K8s pods) to ensure one rogue job doesn't consume all worker resources.
Missed Window Strategy: Defining a "misfire instruction" (e.g., skip, run once immediately, or catch up) for jobs that were missed during a system outage.
Design Breakdown

Functional Requirements

Core Use Cases:
Users can create, update, and delete cron jobs (CRUD).
Users can view execution history/status.
System automatically triggers jobs based on Cron expressions.
Scope Control:
In-scope: Scheduling logic, task distribution, basic retry logic, and metadata management.
Out-of-scope: Complex workflow orchestration (DAGs), UI development, and direct job execution logic (the system triggers, it doesn't "be" the job).

Non-Functional Requirements

Scale: Support for 10^7 jobs and 10^4 QPS execution.
Latency: Triggering delay should be < 10 seconds from the scheduled time.
Availability & Reliability: 99.99% availability. No single point of failure (SPOF).
Consistency: High consistency for job metadata; eventual consistency for execution logs.
Fault Tolerance: If a scheduler node dies, another must take over its shard immediately.

Estimation

Traffic:
10M total jobs.
If each job runs once a day: 10M / 86400 \approx 115 executions/sec.
Peak: Assume 100x average for "top of the hour" spikes \approx 11,500 executions/sec.
Storage:
Job Metadata: 1KB per job \times 10M = 10GB.
Execution Logs (30 days): 10M \times 30 \times 500B \approx 150GB.
Bandwidth:
10,000 triggers/sec \times 1KB \approx 10MB/s (Minimal).

Blueprint

Concise Summary: A leader-sharded polling architecture where a set of Scheduler nodes query a partitioned PostgreSQL database for upcoming jobs and push them into a Kafka cluster for distributed execution by a Worker fleet.
Major Components:
API Gateway: Handles job CRUD and authentication.
Metadata DB (PostgreSQL): Stores job definitions and the computed next_run_time.
Scheduler Nodes: Act as the "brains" that poll the DB and enqueue tasks.
Message Queue (Kafka): Acts as a buffer to handle execution spikes and decouple scheduling from execution.
Worker Fleet: Consumes tasks from Kafka and performs the actual job (e.g., HTTP POST, RPC call).
Simplicity Audit: This design avoids complex distributed locking for every job trigger by using database-level sharding and "claim" mechanics.
Architecture Decision Rationale:
Why this?: PostgreSQL provides ACID for job state transitions. Kafka handles the 10k+ QPS bursts effectively and provides persistence if workers are slow.
Functional Satisfaction: Covers the full lifecycle from creation to execution.
Non-functional Satisfaction: Horizontally scalable through DB sharding and Kafka partitioning.

High Level Architecture

Sub-system Deep Dive

Edge (Optional)

Content Delivery & Traffic Routing: Not majorly impactful for this backend-heavy system.
Security & Perimeter:
API Gateway: Standard JWT-based AuthN/AuthZ.
Rate Limiting: Applied per-user to prevent a single tenant from flooding the scheduler with millions of jobs.

Service

Topology & Scaling:
Job Management Service: Stateless, scaled horizontally based on CPU.
Scheduler Nodes: Use Etcd for leader election and shard allocation. Each node is responsible for a subset of job_id shards (e.g., node 1 handles job_id % 10 == 0..2).
API Schema Design:
POST /v1/jobs: Create job. Body: {cron: "*/5 * * * *", target: "URL", payload: "{}"}.
GET /v1/jobs/{id}/history: Returns execution status.
Resilience & Reliability:
Lease Mechanism: Schedulers maintain a heartbeat in Etcd. If a node fails, its shards are redistributed.

Storage

Access Pattern: Heavy read-modify-write for the Scheduler (updating next_run_time).
Database Table Design:
jobs: job_id (PK), user_id, cron_expression, next_run_time (Indexed), status (active/paused), shard_id.
execution_logs: log_id, job_id, status, start_time, end_time, response.
Technical Selection: PostgreSQL.
Rationale: Excellent support for indices on next_run_time. SELECT ... FOR UPDATE SKIP LOCKED allows multiple workers to pick up tasks safely within a shard if needed.
Distribution Logic:
Hash-based sharding on job_id to distribute load across multiple DB instances if one becomes a bottleneck.

Messaging

Purpose & Decoupling: Kafka buffers the high-volume spikes that occur exactly at the turn of an hour/minute.
Event / Topic Schema: Topic job-triggers. Key: job_id (ensures ordering if a job has multiple instances, though rare for cron). Payload: {job_id, target, payload, scheduled_time}.
Technical Selection: Kafka.
Rationale: High throughput, allows replaying triggers if the worker fleet fails.
Wrap Up

Advanced Topics

Trade-offs: We choose At-Least-Once over Exactly-Once. If a worker executes a job but fails to commit to Kafka, the job might run again. This is a standard trade-off for high-scale distributed systems; users are expected to make their endpoints idempotent.
Reliability:
Two-Phase Lock Avoidance: To avoid locking the whole DB, we use the "Polling with Shards" approach.
Backoff: Workers use exponential backoff when calling external services.
Bottleneck Analysis:
DB Scanning: As the job count grows, SELECT next_run_time < NOW() slows down. Optimization: Move "soon-to-run" jobs (next 1 hour) into an in-memory priority queue/Timing Wheel in the Scheduler nodes.
Security: Workers should run in a restricted network or sandbox (Docker/Firecracker) if they are executing user-provided scripts, though the MVP assumes predefined HTTP/RPC targets.