| name | scheduled-job-reliability |
|---|---|
| description | Use when a periodic, cron, or triggered compute job must run correctly despite missed runs, overlap, retries, or schedule drift |
NO SCHEDULED JOB WITHOUT IDEMPOTENCY, MISSED-RUN DETECTION, OVERLAP CONTROL, AND A COMPLETION SIGNAL
"It runs on a timer" is not a reliability design. A timer that silently skips, double-fires, or overlaps is an outage waiting for a quiet weekend.
A scheduled job is compute triggered by time or an external event rather than a user request. Its failure modes are distinct from request services and from data pipelines: the trigger can be missed, fire twice, drift across time zones, or overlap a still-running prior instance, and no user is watching when it does.
Core principle: make each run idempotent and bounded, detect when a run is missed or stuck, prevent overlap, and emit a completion signal that something alerts on.
- A job runs on a schedule (periodic, cron-style) or on an external trigger and must complete correctly.
- Runs can be missed, delayed, double-fired, or overlap a previous run.
- The job mutates state, sends messages, or produces artifacts other systems depend on.
- Schedule behavior depends on time zones, daylight-saving transitions, or leap handling.
- The work is data-pipeline freshness, lineage, or backfill of a dataset; use
data-pipeline-reliability. - The work is broker- or queue-driven event consumption; use
event-workflows. - The job is a one-shot destructive bulk mutation; use
configuration-and-automation-safety. - The concern is the container/runtime lifecycle, drain, or probes; use
container-runtime-and-orchestration.
- Current work phase, next decision, what is known, and assumptions where details are missing.
- Trigger source, schedule, time-zone basis, expected run frequency, and acceptable lateness.
- What the run mutates or emits, and which downstream systems depend on it.
- Run duration distribution, deadline, and behavior when a run exceeds the interval.
- Idempotency/dedup design for repeated or partial runs.
- Missed-run, failed-run, and stuck-run detection and alerting.
- Catch-up policy: backfill missed windows, run once, or skip.
- Concurrency/locking: singleton, leader election, or bounded parallelism.
- State the contract. Write what one successful run guarantees and the acceptable lateness in user or downstream terms.
- Make the run idempotent. Use an idempotency key or dedup record per logical window so a retried or double-fired run does not double-apply effects; checkpoint partial progress so a resumed run continues rather than restarts harmfully.
- Prevent overlap. Choose singleton execution, leader election, or bounded concurrency; decide whether a late run is skipped or queued when the prior run is still active.
- Pin the time basis. State the time zone and how daylight-saving and leap transitions are handled so the job neither double-fires nor skips a window at a transition.
- Bound the run. Set a per-run deadline and behavior on overrun (kill and retry, or let it finish and skip the next window); never let runs pile up unbounded.
- Detect missed and stuck runs. Alert on "expected a run by T and none completed" and on "a run started but did not finish within the deadline," beyond explicit errors; a job that never fires emits no error.
- Define catch-up. Decide whether missed windows are backfilled, collapsed to one run, or skipped, and make backfill idempotent and rate-bounded.
- Emit completion evidence. Record start, end, outcome, and window covered so success is observable and gaps are countable.
Run each job as an idempotent, deadline-bounded, single-writer operation over an explicit time window, with missed-run and stuck-run alerting and a recorded completion signal. Prefer skip-if-overlapping with catch-up over unbounded queuing. Treat a silently missed run as an incident-class defect, not a cosmetic gap.
- Ideation: identify risks, defaults, unknowns, options, and the next decision before code exists.
- Design: shape the target artifact, tradeoffs, checks, and details to gather.
- Development: guide sequencing, code boundaries, checks, and acceptance criteria.
- Testing: define release-blocking tests, evals, fixtures, and failure probes.
- Release: define rollout, observability, abort, rollback, and readiness details.
- Maintenance: define owners, drift checks, cleanup triggers, and refresh cadence.
- Existing artifact: use current code, schedules, logs, or incidents as context for the next engineering decision; do not wait for a finished artifact before guiding design, build, release, or operation.
- Missing details: state assumptions and say what to check next instead of blocking lifecycle guidance.
- A purely advisory, side-effect-free job may skip overlap control if a double run is provably harmless.
- A job whose missed run is recovered by the next run may use skip-with-no-catch-up if downstream tolerance is explicit.
- Very low-frequency jobs may rely on lighter detection if a missed run is caught by a human cadence that is named.
- Lead with the run contract, idempotency design, overlap policy, or missed-run detection requested.
- Cover idempotency, overlap, time basis, deadline, missed/stuck detection, catch-up, and completion evidence before optional scheduling breadth.
- Make recommendations actionable with concrete trigger, lock, deadline, alert, and catch-up decisions.
- Name the details to inspect, such as schedule, run durations, lock mechanism, dedup keys, and run history; do not state details you have not seen.
- Stay technology-agnostic by default: do not introduce provider, product, framework, database, protocol, or command names unless the user supplied them or explicitly requested tool-specific guidance.
- Stay inside scheduled/triggered compute reliability. Route data-pipeline freshness, broker workflows, one-shot mutations, and runtime lifecycle when those are central.
- Be concise: prefer a run-contract table and failure-mode list over generic cron explanation.
- Output shape: render the matching shared template headings or tables in the reply, or use the same shape.
- Run contract: what one successful run guarantees and acceptable lateness.
- Idempotency and partial-progress design.
- Overlap/concurrency policy and lock mechanism.
- Time-basis and daylight-saving/leap handling.
- Deadline and overrun behavior.
- Missed-run and stuck-run detection and alerting plan.
- Catch-up/backfill policy.
- Completion-evidence and run-history record.
idempotent_run: a repeated or double-fired run does not double-apply effects; partial progress resumes safely.overlap_control: singleton/leader/bounded concurrency is defined, with skip-or-queue behavior on overlap.time_basis: time zone and daylight-saving/leap behavior cannot double-fire or skip a window.run_deadline: a per-run deadline and overrun behavior exist; runs cannot pile up unbounded.missed_run_alert: missed-run and stuck-run conditions alert without relying on an explicit error.catch_up_policy: backfill/skip/collapse behavior for missed windows is explicit and idempotent.completion_signal: each run records start, end, outcome, and window covered.
- The job has retries but no idempotency, so a retry double-applies effects.
- Two runs can overlap with no lock, corrupting shared state.
- A missed run produces no signal because the trigger simply never fired.
- A run can exceed its interval and queue indefinitely behind itself.
- Daylight-saving or time-zone change double-fires or skips a window.
- "Success" is inferred from no error rather than a recorded completion.
| Mistake | Correction |
|---|---|
| Trusting the timer | Detect missed and stuck runs explicitly; a no-fire emits no error. |
| Retry without idempotency | Make each run idempotent before allowing retries. |
| Ignoring overlap | Enforce singleton/leader/bounded concurrency with skip-or-queue rules. |
| Wall-clock naivety | Pin the time zone and handle daylight-saving and leap transitions. |
| Unbounded catch-up | Make backfill idempotent and rate-bounded, or collapse missed windows. |