Distributed Job Scheduler: System Design

Requirements & API: Distributed Job Scheduler

The first move in any interview: define requirements and sketch the API before drawing a single box.

Functional requirements

• Define jobs with a cron expression and a handler (HTTP callback or function reference).
• Execute each job at its scheduled time (within a few seconds of the target time).
• Support one-off jobs (run once at a specific time) and recurring jobs.
• Provide job execution history: status, start time, end time, output/error.
• Allow jobs to be paused, resumed, or manually triggered.
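
To execute a job at its scheduled time, the scheduler has to turn a cron expression into a concrete next fire time. The sketch below is one minimal way to do that, assuming standard five-field cron syntax and supporting only `*`, `*/n`, and comma-separated integers. The names `field_matches`, `cron_matches`, and `next_run` are hypothetical, and a production system would more likely use an established cron library.

```python
from datetime import datetime, timedelta
from typing import Optional


def field_matches(spec: str, value: int, low: int) -> bool:
    """Match one cron field. Supports '*', '*/n', and comma-separated integers only."""
    for part in spec.split(","):
        if part == "*":
            return True
        if part.startswith("*/"):
            # Steps count from the start of the field's range (e.g. 1 for day-of-month).
            if (value - low) % int(part[2:]) == 0:
                return True
        elif int(part) == value:
            return True
    return False


def cron_matches(expr: str, t: datetime) -> bool:
    """Check a five-field expression: minute hour day-of-month month day-of-week.

    Simplification: all five fields must match (real cron ORs the two day fields
    when both are restricted).
    """
    minute, hour, dom, month, dow = expr.split()
    cron_dow = (t.weekday() + 1) % 7  # cron: 0 = Sunday; Python: 0 = Monday
    return (
        field_matches(minute, t.minute, 0)
        and field_matches(hour, t.hour, 0)
        and field_matches(dom, t.day, 1)
        and field_matches(month, t.month, 1)
        and field_matches(dow, cron_dow, 0)
    )


def next_run(expr: str, after: datetime, horizon_days: int = 366) -> Optional[datetime]:
    """Scan minute by minute for the first matching time strictly after `after`."""
    t = after.replace(second=0, microsecond=0) + timedelta(minutes=1)
    limit = after + timedelta(days=horizon_days)
    while t <= limit:
        if cron_matches(expr, t):
            return t
        t += timedelta(minutes=1)
    return None  # no match within the search horizon


# Every 15 minutes: the next run after 10:07 is 10:15.
print(next_run("*/15 * * * *", datetime(2024, 1, 1, 10, 7)))
```

The minute-by-minute scan is simple to reason about but slow for sparse schedules; it is meant to show the semantics, not to be efficient.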

Non-functional requirements

• At-least-once execution: a job must never be silently dropped if the scheduler crashes.
• Exactly-once semantics where required: billing jobs cannot run twice.
• Scale to millions of scheduled jobs while still firing each one within a few seconds of its target time.
• High availability: scheduler downtime must not cause missed runs.
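
At-least-once delivery and "cannot run twice" pull in opposite directions, and one common way to reconcile them is to let duplicates be delivered but make execution idempotent by keying on a run ID. The sketch below illustrates that idea only. `IdempotentExecutor` is a hypothetical name, and the in-memory dict stands in for a durable store that a real system would need.

```python
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Runs a handler at most once per run ID, even if the run is delivered twice."""

    def __init__(self) -> None:
        # run_id -> result. Assumption: a real system would keep this in a durable
        # store and make "check and record" atomic (e.g. via a unique constraint).
        self._completed: Dict[str, Any] = {}

    def execute(self, run_id: str, handler: Callable[[], Any]) -> Any:
        if run_id in self._completed:
            # Duplicate delivery: return the earlier result without re-running.
            return self._completed[run_id]
        result = handler()
        self._completed[run_id] = result
        return result


charges = []
executor = IdempotentExecutor()

def bill_customer() -> str:
    charges.append("invoice-sent")
    return "ok"

executor.execute("run-1", bill_customer)
executor.execute("run-1", bill_customer)  # redelivered after a retry; handler is skipped
print(len(charges))  # 1
```

This does not cover a crash between running the handler and recording the result; closing that gap usually requires the handler's side effect and the record to commit together.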

API contract

POST /jobs { name, cronExpr, handlerUrl, payload?, timeout?, retries? } → { jobId }

Create a job and return its ID.

GET /jobs/:jobId/runs?limit=50 → { runs: [{ runId, status, startedAt, finishedAt, error }] }

Execution history for a job.

POST /jobs/:jobId/trigger → { runId }

Manually trigger an immediate execution outside the schedule.

PATCH /jobs/:jobId { paused: true } → 200

Pause a job without deleting it; send { paused: false } to resume.
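
To make the contract concrete, here is a minimal in-memory sketch of the four endpoints as plain Python methods, with no HTTP layer. The class and method names (`JobService`, `create_job`, and so on), the `"queued"` status value, and the newest-first ordering are assumptions for illustration; the contract above does not specify them.

```python
import uuid
from dataclasses import dataclass, field
from datetime import datetime
from typing import Dict, List, Optional


@dataclass
class Run:
    run_id: str
    status: str  # assumed values, e.g. "queued"; the contract does not enumerate them
    started_at: Optional[datetime] = None
    finished_at: Optional[datetime] = None
    error: Optional[str] = None


@dataclass
class Job:
    job_id: str
    name: str
    cron_expr: str
    handler_url: str
    payload: Optional[dict] = None
    timeout: Optional[int] = None  # units are not specified in the contract
    retries: Optional[int] = None
    paused: bool = False
    runs: List[Run] = field(default_factory=list)


class JobService:
    def __init__(self) -> None:
        self._jobs: Dict[str, Job] = {}

    def create_job(self, name, cron_expr, handler_url,
                   payload=None, timeout=None, retries=None) -> dict:
        """POST /jobs"""
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = Job(job_id, name, cron_expr, handler_url,
                                 payload, timeout, retries)
        return {"jobId": job_id}

    def list_runs(self, job_id: str, limit: int = 50) -> dict:
        """GET /jobs/:jobId/runs?limit=50 (newest first is an assumption)"""
        newest_first = list(reversed(self._jobs[job_id].runs))[:limit]
        return {"runs": [
            {"runId": r.run_id, "status": r.status, "startedAt": r.started_at,
             "finishedAt": r.finished_at, "error": r.error}
            for r in newest_first
        ]}

    def trigger(self, job_id: str) -> dict:
        """POST /jobs/:jobId/trigger"""
        run = Run(run_id=str(uuid.uuid4()), status="queued")
        self._jobs[job_id].runs.append(run)
        return {"runId": run.run_id}

    def set_paused(self, job_id: str, paused: bool) -> int:
        """PATCH /jobs/:jobId"""
        self._jobs[job_id].paused = paused
        return 200


service = JobService()
job_id = service.create_job("nightly-report", "0 2 * * *", "https://example.com/hook")["jobId"]
run_id = service.trigger(job_id)["runId"]
print(service.list_runs(job_id)["runs"][0]["status"])  # queued
```

Whether a manual trigger should be allowed on a paused job is left open here, as it is in the contract.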

About Distributed Job Scheduler

A distributed job scheduler runs tasks on a schedule across a cluster of worker machines. It's the infrastructure behind nightly report generation, weekly billing runs, email digest delivery, data pipeline triggers, and any other work that needs to happen at a specific time or on a recurring schedule. At small scale this is a cron job. At the scale of Google's infrastructure or a large Airflow deployment, it's one of the more subtly difficult systems to design correctly.

The hard problems aren't obvious until you run in production. What happens if a scheduled job fires but no worker is available? What if the scheduler itself crashes right between deciding to run a job and actually dispatching it? What if a job fires twice because of a network partition during dispatch? These edge cases (missed runs, duplicate executions, scheduler downtime) are what separate a real distributed job scheduler from a cron job with aspirations.

The standard design separates the scheduler (which tracks what to run and when) from the workers (which actually execute jobs). A job queue (usually Kafka or SQS) acts as the buffer between them. The scheduler also writes a durable job execution record alongside each enqueue. Because a database write and a queue write cannot be truly atomic across two separate systems, the scheduler can crash after writing the record but before enqueuing, or vice versa; the record gives a recovery process something to compare against the queue so it can detect the inconsistency and repair it.
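
One way to sketch the record-then-enqueue flow and its recovery process is shown below. It assumes a particular ordering (record first, then enqueue, then mark the record as enqueued), uses an in-memory SQLite table as the record store, and uses a `deque` as a stand-in for Kafka or SQS. The function names and the `crash_before_enqueue` flag are hypothetical and exist only to simulate a failure.

```python
import sqlite3
from collections import deque
from typing import Deque, List


def open_store() -> sqlite3.Connection:
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE runs ("
        "run_id TEXT PRIMARY KEY, job_id TEXT NOT NULL, "
        "enqueued INTEGER NOT NULL DEFAULT 0)"
    )
    return db


def dispatch(db: sqlite3.Connection, queue: Deque[str], run_id: str, job_id: str,
             crash_before_enqueue: bool = False) -> None:
    # Step 1: durably record the run before touching the queue.
    with db:  # commits on success, rolls back on exception
        db.execute("INSERT INTO runs (run_id, job_id) VALUES (?, ?)", (run_id, job_id))
    if crash_before_enqueue:
        return  # simulated crash: record exists, message was never enqueued
    # Step 2: hand the run to the workers.
    queue.append(run_id)
    # Step 3: mark the record so recovery knows the enqueue happened.
    with db:
        db.execute("UPDATE runs SET enqueued = 1 WHERE run_id = ?", (run_id,))


def recover(db: sqlite3.Connection, queue: Deque[str]) -> List[str]:
    """Re-enqueue every recorded run that was never marked as enqueued."""
    rows = db.execute("SELECT run_id FROM runs WHERE enqueued = 0").fetchall()
    for (run_id,) in rows:
        queue.append(run_id)
        with db:
            db.execute("UPDATE runs SET enqueued = 1 WHERE run_id = ?", (run_id,))
    return [run_id for (run_id,) in rows]


db = open_store()
queue: Deque[str] = deque()
dispatch(db, queue, "run-1", "job-a")
dispatch(db, queue, "run-2", "job-a", crash_before_enqueue=True)
print(list(queue))         # ['run-1']
print(recover(db, queue))  # ['run-2']
print(list(queue))         # ['run-1', 'run-2']
```

With this ordering, a crash between steps 2 and 3 makes recovery enqueue the run a second time, so the result is at-least-once delivery and workers still need to deduplicate by run ID.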