The Question
Design

Distributed Cron Job Scheduler

Design a highly available and scalable distributed cron job scheduler capable of managing 100 million scheduled tasks. The system must support standard cron expressions for periodic execution and ensure at-least-once delivery with a peak throughput of 50,000 executions per second. Focus on the decoupling of job discovery and job execution, handling massive bursts of tasks scheduled at the same time (e.g., top of the hour), and providing a reliable mechanism to recover from system downtime without missing task windows.
PostgreSQL
Kafka
REST
JSONB
HPA
JWT
Exponential Backoff
Distributed Locking
Questions & Insights

Clarifying Questions

Scale and Throughput: What is the expected number of total scheduled jobs and the peak execution rate (Jobs Per Second)?
Precision: Does the system require second-level precision, or is minute-level granularity sufficient?
Execution Guarantee: Is "at-least-once" delivery acceptable, or do we need strict "exactly-once" (which involves heavy distributed locking)?
Job Type: Are jobs predominantly short-lived (e.g., calling a webhook) or long-running compute tasks?
Assumptions:
100 million total registered jobs.
Peak 50,000 job executions per second (EPS).
Minute-level precision is the primary requirement.
At-least-once delivery; idempotency is handled by the job consumer.

Thinking Process

How to identify due jobs efficiently? Instead of scanning a massive table, we use a time-bucketed index and a dedicated "Scheduler" service that moves jobs from the database to a high-speed Task Queue.
How to prevent double-scheduling? We implement a state-transition mechanism in the database (e.g., status: QUEUED) and use optimistic locking or distributed locks during the dispatch phase.
How to handle execution? We decouple "Scheduling" from "Execution" using a Message Queue (Kafka/SQS) to ensure that slow-running jobs do not block the scheduler's ability to trigger new ones.
How to handle scale? We shard the job metadata by job_id and the scheduling lookups by next_run_time buckets.
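As a rough illustration of the state-transition idea above, the sketch below claims a job with a conditional UPDATE, so that only one of two competing schedulers moves it to QUEUED. It uses SQLite from the Python standard library as a stand-in for Postgres. The `version` column and the `claim` function are assumptions made for this example and are not part of the schema described later.

```python
import sqlite3


def init_jobs(conn: sqlite3.Connection) -> None:
    # Hypothetical, minimal jobs table; `version` is an assumed optimistic-lock column.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        "job_id TEXT PRIMARY KEY, status TEXT NOT NULL, version INTEGER NOT NULL)"
    )


def claim(conn: sqlite3.Connection, job_id: str, expected_version: int) -> bool:
    """Try to move a job from ACTIVE to QUEUED; True only for the winning caller."""
    with conn:  # commits on success, rolls back on exception
        cur = conn.execute(
            "UPDATE jobs SET status = 'QUEUED', version = version + 1 "
            "WHERE job_id = ? AND status = 'ACTIVE' AND version = ?",
            (job_id, expected_version),
        )
    # Exactly one row changes for the first claimant; later claimants match nothing.
    return cur.rowcount == 1
```

A scheduler that gets False back simply skips the job, because another node has already queued it.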

Bonus Points

Hierarchical Timing Wheels: Using an in-memory timing wheel for sub-second precision and low-overhead scheduling.
Clock Drift Mitigation: Discussing how NTP sync issues across distributed nodes can lead to double-firing and how to use logical "Time Windows" to mitigate this.
Partitioned Polling: Using a "Leader-Follower" or "Shared-Nothing" architecture where each scheduler node owns a specific shard of time-buckets to prevent contention.
Backpressure Handling: Implementing a feedback loop where the scheduler slows down if the execution queue depth exceeds a specific threshold.
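To make the timing-wheel idea more concrete, here is a minimal single-level wheel (not a full hierarchical one). Each slot holds tasks plus a count of remaining full rotations. The class name `TimingWheel` and its methods are illustrative assumptions, and a tick stands for whatever resolution the scheduler chooses.

```python
from __future__ import annotations


class TimingWheel:
    """Single-level timing wheel; a hierarchical wheel would chain several of these."""

    def __init__(self, slots: int) -> None:
        self.slots = slots
        self.current = 0
        # Each slot holds [remaining_rounds, task] pairs.
        self.buckets: list[list[list]] = [[] for _ in range(slots)]

    def schedule(self, delay_ticks: int, task: str) -> None:
        if delay_ticks < 1:
            raise ValueError("delay_ticks must be at least 1")
        slot = (self.current + delay_ticks) % self.slots
        rounds = (delay_ticks - 1) // self.slots  # full rotations to wait first
        self.buckets[slot].append([rounds, task])

    def tick(self) -> list[str]:
        """Advance one tick and return the tasks that are due now."""
        self.current = (self.current + 1) % self.slots
        due, keep = [], []
        for rounds, task in self.buckets[self.current]:
            if rounds == 0:
                due.append(task)
            else:
                keep.append([rounds - 1, task])
        self.buckets[self.current] = keep
        return due
```

Scheduling and firing are both O(1) per task apart from the scan of the current slot, which is why the structure suits in-memory dispatch of jobs already loaded from the database.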
Design Breakdown

Functional Requirements

Users can create, update, delete, and pause cron jobs.
Support for standard five-field cron expressions (e.g., 0 0 * * * for daily at midnight).
Ability to track job execution history (Success/Failure/Retry).
Manual trigger (Run Now) capability.

Non-Functional Requirements

High Availability: The scheduler must be a multi-node system with no single point of failure.
Reliability: No missed executions due to node crashes.
Scalability: Support for 100M+ jobs and high burst execution.
Durability: Job definitions and logs must persist.

Estimation

Storage: 100M jobs * 1KB/job ≈ 100GB (Metadata).
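History/Logs: at a sustained peak of 50k EPS, 50,000 * 86,400 s/day ≈ 4.3B records/day. This is an upper bound; the daily average is lower if the peak is only a burst. Requires TTL-based retention or archival to cold storage (S3).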
Throughput: 50,000 EPS. If using a single SQL instance, polling becomes a bottleneck. Sharding/Bucketing is required.
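The figures above can be checked with a few lines of arithmetic. The sketch uses decimal units (1 KB = 1,000 bytes) and treats the 50k EPS peak as sustained, which gives an upper bound. The per-minute figure is derived here for illustration and is not stated in the original estimate.

```python
TOTAL_JOBS = 100_000_000
BYTES_PER_JOB = 1_000        # 1 KB per job, decimal units
PEAK_EPS = 50_000            # executions per second at peak
SECONDS_PER_DAY = 86_400

metadata_bytes = TOTAL_JOBS * BYTES_PER_JOB
max_log_records_per_day = PEAK_EPS * SECONDS_PER_DAY
# Derived: how many jobs one minute bucket must hold if the peak lasts a full minute.
jobs_per_minute_bucket_at_peak = PEAK_EPS * 60

print(f"metadata: {metadata_bytes / 1e9:.0f} GB")
print(f"log records/day at sustained peak: {max_log_records_per_day / 1e9:.2f} billion")
print(f"jobs per minute bucket at peak: {jobs_per_minute_bucket_at_peak:,}")
```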

Blueprint

Concise Summary: A distributed system where an API manages job metadata, a Poller Service identifies jobs due for execution using time-buckets, and a Task Queue decouples scheduling from actual execution by Workers.
Major Components:
API Service: Handles CRUD operations for job configurations.
Metadata DB (Postgres): Persistent storage for job definitions and schedules.
Scheduler Poller: Scans the DB for jobs due in the next minute and pushes them to a queue.
Task Queue (Kafka): Buffers jobs to be executed, providing durability and load leveling.
Worker Service: Consumes from the queue and executes the job (e.g., HTTP callback).
Simplicity Audit: This design avoids complex distributed coordination (like Akka actors) in favor of a proven DB-to-Queue pattern which is easier to debug and scale.
Architecture Decision Rationale:
Why this architecture?: Decoupling scheduling and execution is critical; if a worker hangs, the scheduler remains unaffected.
Functional Satisfaction: Covers the full lifecycle from creation to execution.
Non-functional Satisfaction: Scalable via DB sharding and Kafka partitioning; highly available through stateless service replication.
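The time-bucket lookup used by the Scheduler Poller might look roughly like the following in-memory sketch. It assumes one bucket per minute and timezone-aware timestamps. `BucketIndex` and `bucket_of` are hypothetical names, and a real deployment would back this with the indexed next_run_time column rather than a dictionary.

```python
from __future__ import annotations

from collections import defaultdict
from datetime import datetime, timezone

BUCKET_SECONDS = 60  # assumption: one bucket per minute


def bucket_of(run_time: datetime) -> int:
    """Floor a timestamp to the start of its minute, as epoch seconds."""
    epoch = int(run_time.timestamp())
    return epoch - epoch % BUCKET_SECONDS


class BucketIndex:
    def __init__(self) -> None:
        self._buckets: dict[int, set[str]] = defaultdict(set)

    def add(self, job_id: str, next_run_time: datetime) -> None:
        self._buckets[bucket_of(next_run_time)].add(job_id)

    def due(self, now: datetime) -> list[str]:
        """Pop every bucket at or before `now`, oldest first.

        Including older buckets means a poller that was down for a while
        picks up the windows it missed on its next pass.
        """
        cutoff = bucket_of(now)
        jobs: list[str] = []
        for bucket in sorted(b for b in self._buckets if b <= cutoff):
            jobs.extend(sorted(self._buckets.pop(bucket)))
        return jobs
```

The poller only touches buckets that are due, so its cost tracks the number of jobs firing now rather than the 100M jobs registered.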

High Level Architecture

Sub-system Deep Dive

Service

Topology & Scaling:
Stateless API and Worker services deployed across multiple Availability Zones (AZs).
Horizontal Pod Autoscaler (HPA) scaling on CPU for the API and on Kafka consumer lag for Workers.
API Schema Design:
POST /v1/jobs: Create job (returns job_id).
GET /v1/jobs/{id}: Fetch job status.
PUT /v1/jobs/{id}: Update cron expression or target.
Protocol: REST/JSON.
Idempotency: client_request_id header to prevent duplicate job creation.
Resilience & Reliability:
Retries: Workers use exponential backoff for failed HTTP callbacks.
Timeouts: Hard 30s timeout for job execution to prevent worker exhaustion.
Security:
JWT-based AuthN for API.
Target URL validation to prevent SSRF (Server-Side Request Forgery).
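The Worker retry policy mentioned above could be sketched as a capped exponential backoff with optional full jitter. The base delay of 1 second and the 30 second cap are assumptions chosen for the example. The cap is unrelated to the 30s execution timeout.

```python
from __future__ import annotations

import random


def backoff_delay(
    attempt: int,
    base: float = 1.0,
    cap: float = 30.0,
    rng: random.Random | None = None,
) -> float:
    """Delay in seconds before retry number `attempt` (0-based).

    Without `rng` the delay is the capped exponential itself; with `rng`
    it is drawn uniformly from [0, that value] ("full jitter"), which
    spreads out retries from workers that failed at the same moment.
    """
    ceiling = min(cap, base * (2 ** attempt))
    if rng is None:
        return ceiling
    return rng.uniform(0, ceiling)
```

For example, `[backoff_delay(i) for i in range(6)]` yields 1, 2, 4, 8, 16 and then the 30 second cap.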

Storage

Access Pattern:
Write-heavy during job creation.
Read-heavy for Poller (scans next_run_time).
High-volume writes for execution logs.
Database Table Design:
jobs: job_id (UUID, PK), user_id, cron_expr, target_config (JSONB), next_run_time (Indexed), status (Active/Paused).
job_execution_log: execution_id, job_id, start_time, end_time, status, response_code.
Technical Selection:
PostgreSQL: Robustness and ACID compliance for job metadata.
Partitioning: Range partitioning on next_run_time (hourly or daily partitions) to keep the active index small and performant.
Distribution Logic:
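Sharding by user_id to distribute load across multiple Postgres instances if 100M jobs exceed single-node IOPS. Note the trade-off against sharding by job_id, as proposed in the Thinking Process: job_id spreads load more evenly, while user_id keeps one tenant's jobs co-located but can create hot shards for very large users.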

Messaging

Purpose & Decoupling: Kafka acts as a buffer between the Scheduler (which finds jobs) and Workers (which execute them).
Throughput & Partitioning:
Topic: cron-task-executions.
Partition Key: job_id (keeps all executions of a job in one partition, so they are consumed in order if the job fires twice in quick succession, which is unlikely at minute precision).
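The effect of keying by job_id can be illustrated with a simple hash-mod mapping. This is a stand-in for the idea only: it uses SHA-256, whereas Kafka's own default partitioner uses a different hash, so the partition numbers below will not match a real cluster. `partition_for` is a hypothetical helper.

```python
import hashlib


def partition_for(job_id: str, num_partitions: int) -> int:
    """Map a job_id to a stable partition index in [0, num_partitions)."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions
```

Because the mapping depends only on job_id and the partition count, every execution of the same job lands in the same partition as long as the partition count is unchanged.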
Failure Handling:
Dead-letter Queue (DLQ) for jobs that fail after N retries (e.g., target URL 404s).
Technical Selection: Kafka. Rationale: High throughput and ability to replay events if workers fail.
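A worker's retry-then-DLQ flow might be sketched as below. The `execute` callable and the plain list used as the dead-letter queue are placeholders for the HTTP callback and a real DLQ topic. The wait between attempts is omitted to keep the example short.

```python
from __future__ import annotations

from typing import Callable


def process(
    task: dict,
    execute: Callable[[dict], bool],
    max_attempts: int,
    dlq: list[dict],
) -> bool:
    """Run `execute` up to `max_attempts` times; dead-letter the task if all fail."""
    for _ in range(max_attempts):
        if execute(task):
            return True
        # A real worker would sleep here using its backoff policy.
    dlq.append({**task, "attempts": max_attempts})
    return False
```

Recording the attempt count alongside the task makes it easier to decide later whether a dead-lettered job should be replayed or dropped.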
Wrap Up

Advanced Topics

Trade-offs (Consistency vs Availability): We prioritize "At-least-once" delivery (Availability). In the rare case of a network partition between the Poller and Kafka, we might re-poll and send the same job twice. Users must implement idempotent handlers.
Reliability & Failure Handling:
Lease Mechanism: To prevent multiple Poller nodes from picking up the same bucket, we use a Lease table in Postgres (Distributed Lock) so only one node processes a specific time-bucket.
Missed Slots: If the system goes down for 5 minutes, the Poller is designed to scan next_run_time < NOW() to catch up on missed jobs.
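One possible shape for the lease table is sketched below, again with SQLite standing in for Postgres. The `bucket_lease` table, its columns, and `try_acquire` are assumptions for illustration. The current time is passed in explicitly so the behaviour is deterministic; a production version would more likely rely on the database clock.

```python
import sqlite3


def init_leases(conn: sqlite3.Connection) -> None:
    conn.execute(
        "CREATE TABLE IF NOT EXISTS bucket_lease ("
        "bucket INTEGER PRIMARY KEY, owner TEXT NOT NULL, expires_at REAL NOT NULL)"
    )


def try_acquire(
    conn: sqlite3.Connection, bucket: int, owner: str, now: float, ttl: float
) -> bool:
    """Acquire or renew the lease on a time bucket; True if `owner` holds it."""
    with conn:  # one transaction per attempt
        # Renew our own lease, or take over one that has expired.
        cur = conn.execute(
            "UPDATE bucket_lease SET owner = ?, expires_at = ? "
            "WHERE bucket = ? AND (expires_at <= ? OR owner = ?)",
            (owner, now + ttl, bucket, now, owner),
        )
        if cur.rowcount == 1:
            return True
        # No usable row was updated: either no lease exists yet (insert wins)
        # or another node holds a live lease (insert is ignored).
        cur = conn.execute(
            "INSERT OR IGNORE INTO bucket_lease (bucket, owner, expires_at) "
            "VALUES (?, ?, ?)",
            (bucket, owner, now + ttl),
        )
        return cur.rowcount == 1
```

The expiry is what lets another Poller take over a bucket if the original owner crashes mid-scan, which ties the lease to the missed-slot recovery described above.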
Bottleneck Analysis:
The "Hot Minute" Problem: Millions of jobs scheduled at 0 0 * * * (midnight).
Optimization: The Scheduler Poller doesn't execute; it just enqueues. Kafka handles the burst. We can also "jitter" job execution times slightly if the user doesn't require precise second-zero firing.
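The jitter idea can be sketched as a deterministic per-job offset derived from a hash of the job_id, so a given job always fires at the same offset within its minute while the population as a whole is spread out. The helper name and the 30 second maximum used in the note below are assumptions.

```python
import hashlib


def jitter_offset_seconds(job_id: str, max_jitter_s: int) -> int:
    """Stable offset in [0, max_jitter_s] derived from the job_id."""
    digest = hashlib.sha256(job_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % (max_jitter_s + 1)
```

With `max_jitter_s=30`, a midnight burst is spread over roughly the first half minute instead of landing entirely on second zero. Using a hash instead of a random draw keeps the offset stable across restarts and retries.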
Security: Workers should run in a hardened VPC; outgoing requests to user-defined URLs must be proxied through an Egress Gateway to prevent internal network scanning.
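As a complement to the Egress Gateway, target URL validation could look something like the sketch below. It checks the scheme and rejects any target that resolves to a non-public address. The resolved IPs are passed in as an argument to keep the example free of network calls. In practice they would come from DNS resolution, and the connection should be pinned to the validated address to avoid DNS rebinding. Function names are hypothetical.

```python
from __future__ import annotations

import ipaddress
from urllib.parse import urlparse


def is_public_ip(addr: str) -> bool:
    """True if the address is not private, loopback, link-local, or otherwise special."""
    ip = ipaddress.ip_address(addr)
    return not (
        ip.is_private
        or ip.is_loopback
        or ip.is_link_local
        or ip.is_multicast
        or ip.is_reserved
        or ip.is_unspecified
    )


def validate_target(url: str, resolved_ips: list[str]) -> bool:
    """Accept only http(s) URLs whose every resolved address is public."""
    parsed = urlparse(url)
    if parsed.scheme not in ("http", "https") or not parsed.hostname:
        return False
    return bool(resolved_ips) and all(is_public_ip(ip) for ip in resolved_ips)
```

Requiring every resolved address to be public, not just the first one, closes the case where a hostname returns a mix of public and internal records.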