Operational playbooks for BrandBlitz on-call engineers. Each runbook covers one failure scenario: symptom → impact → diagnosis → mitigation → remediation.
| Runbook | Description |
|---|---|
| horizon-outage.md | Stellar Horizon API is degraded or unreachable — payouts blocked |
| hot-wallet-low-balance.md | Hot wallet USDC balance too low to process payouts |
| payout-stuck-in-queue.md | Payout jobs stuck in BullMQ — recipients not receiving USDC |
| cdn-purge.md | Force-invalidate a cached asset from Cloudflare R2 / nginx |
| leaked-secret.md | Secret committed to git or exposed in logs — contain and rotate |
| rotate-minio-certs.md | Rotate expired or compromised MinIO TLS certificates |
| rotate-s3-credentials.md | Rotate S3 / R2 access keys |
| rotate-secrets.md | General secret rotation procedure for all services |
| session-forced-signout.md | Force-revoke refresh tokens for a user or entire fleet |
- Copy the template below into a new
docs/runbooks/<name>.md. - Add a row to the table above.
- Link it from any related ADR or issue.
# Runbook: <Title>
## Symptom
What an engineer sees (alert text, log line, user report).
## Impact
Who is affected and how severely.
## Diagnosis
Step-by-step commands / queries to confirm root cause.
## Mitigation
Fast action to stop the bleeding (may be temporary).
## Remediation
Permanent fix and verification steps.
## Post-mortem
Link to the post-mortem template: [post-mortem template](../post-mortem-template.md)