The first move in any interview: define requirements and sketch the API before drawing a single box.
POST /jobs { name, cronExpr, handlerUrl, payload?, timeout?, retries? } → { jobId }
GET /jobs/:jobId/runs?limit=50 → { runs: [{ runId, status, startedAt, finishedAt, error }] }
POST /jobs/:jobId/trigger → { runId }
PATCH /jobs/:jobId { paused: true } → 200
A distributed job scheduler runs tasks on a schedule across a cluster of worker machines. It's the infrastructure behind nightly report generation, weekly billing runs, email digest delivery, data pipeline triggers, and any other work that needs to happen at a specific time or on a recurring schedule. At small scale this is a cron job. At Google or Airflow scale, it's one of the more subtly difficult systems to design correctly.
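To make the POST /jobs request shape from the API sketch concrete, here is a minimal Python sketch of the request body and some basic validation. The field names come from the sketch; everything else is an assumption for illustration: the class and function names are hypothetical, the cron expression is assumed to use the classic five-field format, timeout is assumed to be in seconds, and the job ID is assumed to be a random UUID.

```python
import uuid
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class CreateJobRequest:
    """Body of POST /jobs; field names follow the API sketch."""
    name: str
    cronExpr: str
    handlerUrl: str
    payload: Optional[dict[str, Any]] = None
    timeout: Optional[int] = None   # assumption: seconds
    retries: Optional[int] = None


def validate_create_job(req: CreateJobRequest) -> list[str]:
    """Return a list of validation errors; an empty list means the request is acceptable."""
    errors = []
    if not req.name:
        errors.append("name is required")
    # Assumption: five-field cron (minute hour day-of-month month day-of-week).
    if len(req.cronExpr.split()) != 5:
        errors.append("cronExpr must have 5 fields")
    if not req.handlerUrl.startswith(("http://", "https://")):
        errors.append("handlerUrl must be an http(s) URL")
    if req.retries is not None and req.retries < 0:
        errors.append("retries must be >= 0")
    if req.timeout is not None and req.timeout <= 0:
        errors.append("timeout must be > 0")
    return errors


def create_job(req: CreateJobRequest) -> dict[str, str]:
    """Hypothetical handler: validate, then return { jobId } as in the sketch."""
    errors = validate_create_job(req)
    if errors:
        raise ValueError("; ".join(errors))
    # Assumption: job IDs are random UUIDs; a real service would also persist the job.
    return {"jobId": str(uuid.uuid4())}
```

In an interview, calling out the optional fields (payload, timeout, retries) and their defaults at this stage tends to surface requirements such as retry policy and run time limits before any architecture is drawn.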
The hard problems aren't obvious until you run in production. What happens if a scheduled job fires but no worker is available? What if the scheduler itself crashes right between deciding to run a job and actually dispatching it? What if a job fires twice because of a network partition during dispatch? These edge cases (missed runs, duplicate executions, scheduler downtime) are what separate a real distributed job scheduler from a cron job with aspirations.
The standard design separates the scheduler (which tracks what to run and when) from the workers (which actually execute jobs). A job queue (usually Kafka or SQS) acts as the buffer between them. The scheduler writes a job execution record and enqueues the job as a single logical step, but the database and the queue are separate systems, so the two writes cannot be truly atomic. If the scheduler crashes after writing but before enqueuing, or vice versa, a recovery process can detect the inconsistency and fix it.
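The crash-and-recover behavior described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a definitive implementation: the store and queue are in-memory stand-ins for a database and Kafka or SQS, the scheduler is assumed to write the record first and enqueue second, and the names (RunRecord, Scheduler, dispatch, recover, the PENDING and ENQUEUED statuses, the 30-second staleness threshold) are all hypothetical.

```python
import time
import uuid
from dataclasses import dataclass


@dataclass
class RunRecord:
    run_id: str
    job_id: str
    status: str        # "PENDING" (recorded only) or "ENQUEUED"
    created_at: float


class Scheduler:
    def __init__(self):
        self.runs = {}   # stand-in for the durable execution-record store
        self.queue = []  # stand-in for the job queue (Kafka, SQS, ...)

    def dispatch(self, job_id, crash_before_enqueue=False, now=None):
        """Record the run first, then enqueue it."""
        now = time.time() if now is None else now
        run = RunRecord(str(uuid.uuid4()), job_id, "PENDING", now)
        self.runs[run.run_id] = run      # step 1: durable record
        if crash_before_enqueue:
            return run.run_id            # simulated crash between the two steps
        self._enqueue(run)               # step 2: hand off to the queue
        return run.run_id

    def _enqueue(self, run):
        self.queue.append(run.run_id)
        run.status = "ENQUEUED"

    def recover(self, stale_after=30.0, now=None):
        """Re-enqueue runs that were recorded but never reached the queue."""
        now = time.time() if now is None else now
        repaired = []
        for run in self.runs.values():
            if run.status == "PENDING" and now - run.created_at >= stale_after:
                self._enqueue(run)
                repaired.append(run.run_id)
        return repaired
```

Writing the record first means a crash leaves a visible PENDING row rather than a queue message with no record. The trade-off is that a crash after enqueuing but before the status update would cause the recovery sweep to enqueue the same run again, so this approach gives at-least-once dispatch and workers would need to deduplicate on the run ID.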