name	oncall-health
description	Use when pages, suppression rules, toil, runbook gaps, or recurring manual ops are hurting responders

Oncall Health And Toil Reduction

Iron Law

NO RECURRING PAGE OR MANUAL RUNBOOK STEP WITHOUT A FIX PATH AND ELIMINATION PLAN

If the same alert or manual operation keeps recurring, it needs engineering work, not another manual fix.

Overview

Repeated pages and manual operations are engineering defects.

Core principle: keep pages urgent and actionable, convert repeated manual work into durable fixes, and protect responders from avoidable operational load.

When To Use

The user is designing or revising paging alerts, toil, runbook, suppression, escalation, or manual-operation decisions that affect responder load.
The user asks to reduce pages, alert fatigue, toil, manual operations, repeated runbook work, or operational burden.
On-call responders are paged by non-urgent, unactionable, duplicate, or noisy alerts.
Manual mitigations are repeated often enough to automate or remove.
Runbooks are missing, stale, unsafe, or too vague to execute under pressure.

When Not To Use

The user asks about staffing, compensation, rotation fairness, headcount, or HR process; out of scope unless reframed as technical toil reduction.
The main deliverable is new telemetry or alert construction; use observability-and-alerting instead.
The main work is defining SLOs or paging thresholds from scratch; use slo-and-error-budgets instead.
The request is generic developer productivity with no operational pain; out of scope for routed specialists.

Info To Gather

Current work phase, next decision, what is known, and assumptions where details are missing.
Paging history: alert name, count, time, severity, duration, action taken, user impact, and fix path.
Toil inventory: manual, repetitive, automatable, tactical, page-driven work.
Runbooks, fallback paths, responsibility, checkpoint notes, and incident/postmortem actions.
Alert-to-runbook reachability, runbook freshness, impact check, mitigation path, and verification step for paging alerts.
Alert policy: paging versus non-paging response, SLO mapping, diagnostic alerts, dedupe, grouping, and suppression.
Automation candidates, recurring incident classes, platform gaps, and unsafe manual steps.
Responder load: after-hours pages, sleep-impacting pages, unresolved alerts, and checkpoint friction.

Workflow

Classify pages. Mark each paging alert as urgent/actionable/user-visible/novel, non-paging, diagnostic, duplicate, stale, or false positive.
Find top load sources. Rank by page count, duration, user impact, recurrence, and manual effort.
Separate symptom from cause. Keep user-impact pages, but remove duplicate cause alerts unless they drive distinct action.
Fix runbooks. Every paging alert needs a reachable, current runbook with impact check, mitigation, fallback, rollback, and verification.
Eliminate toil. Automate, self-heal, remove, or redesign repeated manual operations; documentation alone is insufficient.
Create an engineering backlog. Give every recurring class a priority, expected page reduction, and verification metric.
Protect the signal. Use SLO burn, grouping, dedupe, maintenance windows, and non-paging routing to prevent alert erosion.
Set a page-rate budget. State a numeric per-shift or per-week page target and how it will be measured. Compare against current rate.
Check runbook freshness. For every paging alert, record runbook last-verified date and require freshness cadence alongside coverage.
Refresh regularly. Feed incident/postmortem findings back into alert rules, platform work, and reliability standards.

Synthesized Default

Pages should be urgent, actionable, user-visible, and novel. Everything else should become a non-paging follow-up, automation, grouping, suppression, or removal. Toil reduction should produce engineering work with measured page or manual-effort reduction, and live-site responsibility should feed back into product engineering priorities.

Phase Behavior

Ideation: identify risks, defaults, unknowns, options, and the next decision before code exists.
Design: shape the target artifact, tradeoffs, checks, and details to gather.
Development: guide sequencing, code boundaries, checks, and acceptance criteria.
Testing: define release-blocking tests, evals, fixtures, and failure probes.
Release: define rollout, observability, abort, rollback, and readiness details.
Maintenance: define owners, drift checks, cleanup triggers, and refresh cadence.
Existing artifact: use current code, docs, telemetry, incidents, or diffs as context for the next engineering decision; do not wait for a finished artifact before guiding design, build, release, or operation.
Missing details: state assumptions and say what to check next instead of blocking lifecycle guidance.

Exceptions

Some pre-user-impact alerts may page if they are tested leading indicators with a safe, immediate mitigation.
Low-impact internal systems may route most operational signals to non-paging follow-up if user impact is limited and the user accepts the response latency.
Temporary noisy alerts are allowed during a risky migration only with expiry, and cleanup task.
Staffing and compensation questions remain out of scope unless translated into technical page/toil reduction.

Response Quality Bar

Lead with the page classification, toil inventory, alert-change decision, or automation backlog requested.
Cover urgency, actionability, user visibility, novelty, runbook quality, repeated manual work, responsibility, and measurement before optional on-call breadth.
Make recommendations actionable with page/follow-up/remove decisions, runbook fixes, automation tasks, expiry dates, and measured reduction targets where relevant.
Name the details to inspect, such as alert history, pages per responder, after-hours volume, runbook links, toil hours, manual steps, suppression rules, and incident outcomes; do not state details you have not seen.
Stay technology-agnostic by default: do not introduce provider, product, framework, database, protocol, or command names unless the user supplied them or explicitly requested tool-specific guidance.
Stay inside technical on-call health and toil. Mark staffing, compensation, and HR questions out of scope unless translated into engineering controls.
Be concise: avoid generic on-call advice and prefer compact page inventories and remediation backlogs.

Required Outputs

Output shape: render the matching shared template headings or tables in the reply, or use the same shape.
Paging-alert inventory and classification.
Top toil sources with frequency, effort, and removal path.
Alert changes: page/follow-up/remove/group/dedupe/suppress decisions.
Runbook gap list and required updates.
Automation or redesign backlog with expected page/manual-effort reduction.
Responsibility and fallback fixes.
Measurement plan for page volume, after-hours pages, and toil hours.
Numeric page budget per shift or week with the measurement window and source.
Runbook coverage AND freshness check (last-verified date, freshness cadence) for each paging alert.
Alert-to-runbook path showing each paging alert reaches a current runbook with impact check, mitigation, and verification.

Checks Before Moving On

paging_classification: each paging alert is classified by urgency, actionability, user visibility, and novelty.
toil_inventory: repeated manual work has frequency, and elimination or automation plan.
runbook_check: remaining paging alerts link to executable runbooks with mitigation and verification.
alert_runbook_path: each paging alert links to a reachable, current runbook with impact check, mitigation path, and verification step.
noise_reduction: proposed changes state expected page or toil reduction and how it will be measured.
scope_check: staffing, compensation, and HR issues are reframed or marked out of scope.
page_budget: a numeric per-shift or per-week page target is stated with measurement window.
runbook_freshness: each paging alert has a last-verified date and a freshness cadence.

Red Flags - Stop And Rework

The solution is "make the runbook longer" for repeated manual work.
A paging alert has no action other than "look at dashboard".
Responders routinely silence, ignore, or rerun alerts.
Every cause alert pages alongside the symptom alert.
Alert reduction removes the only user-impact signal.

Common Mistakes

Mistake	Correction
Treating pages as inevitable	Treat avoidable pages as engineering defects.
Automating bad operations	Remove or redesign unsafe manual work when possible.
Deleting noisy alerts blindly	Preserve user-impact coverage and verify replacement signal.
Measuring only count	Track after-hours load, duration, recurrence, and toil hours.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Oncall Health And Toil Reduction

Iron Law

Overview

When To Use

When Not To Use

Info To Gather

Workflow

Synthesized Default

Phase Behavior

Exceptions

Response Quality Bar

Required Outputs

Checks Before Moving On

Red Flags - Stop And Rework

Common Mistakes

FilesExpand file tree

oncall-health.md

Latest commit

History

oncall-health.md

File metadata and controls

Oncall Health And Toil Reduction

Iron Law

Overview

When To Use

When Not To Use

Info To Gather

Workflow

Synthesized Default

Phase Behavior

Exceptions

Response Quality Bar

Required Outputs

Checks Before Moving On

Red Flags - Stop And Rework

Common Mistakes