Summary
The SDK currently lets a destructive Atlas action (pause-cluster, disconnect, delete) be configured against a dedicated production-tier (M10+) cluster with no guardrail. This sharp edge caused two production MongoDB outages for the Divinci dogfood account (2026-06-12 and 2026-06-20). In both, a placeholder mongodbStorageSizeGB: 10 threshold on an M30 cluster (~16.8 GB used) tripped a cost rule wired to pause-cluster, taking prod offline.
Crucially, pausing a fixed-tier (M10+) cluster does not save cost — the dedicated tier bills the same hourly rate regardless of storage/usage — so a destructive action there is all downside.
Proposed safe-by-default
Treat dedicated (M10+) Atlas clusters as protected-from-destructive-actions by default. A pause-cluster / disconnect / delete rule targeting a dedicated-tier cluster should be refused (or downgraded to alert-only) unless the operator sets an explicit, loud opt-in, e.g.:
{ type: "pause-cluster", target: "...", allowProductionPause: true }
- Shared/serverless (M0/M2/M5, Flex) tiers keep today's behavior — pausing there genuinely saves cost.
- Dedicated tiers default to alert-only (snapshot); destructive actions require the explicit flag.
This single default would have prevented both outages regardless of the threshold misconfiguration.
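One way the proposed default could look in code is sketched below. This is only an illustration under stated assumptions, not the SDK's actual implementation: the AtlasRule and Verdict shapes, the applyGuardrail and isDedicatedTier helpers, the tier strings, and the "prod-cluster" target are all hypothetical. Only allowProductionPause, the three destructive action names, and the alert-only snapshot action come from the proposal above.

```typescript
// Hypothetical shapes for illustration; the SDK's real rule types may differ.
type ActionType = "pause-cluster" | "disconnect" | "delete" | "snapshot";

interface AtlasRule {
  type: ActionType;
  target: string;
  allowProductionPause?: boolean; // explicit opt-in named in the proposal
}

interface Verdict {
  outcome: "allow" | "downgrade";
  rule: AtlasRule;
  reason?: string;
}

const DESTRUCTIVE: ReadonlySet<ActionType> = new Set<ActionType>([
  "pause-cluster",
  "disconnect",
  "delete",
]);

// Assumption: tiers arrive as strings such as "M0", "Flex", or "M30".
const SHARED_TIERS: ReadonlySet<string> = new Set(["M0", "M2", "M5", "FLEX"]);

// Anything not known to be shared is treated as dedicated (fails closed).
function isDedicatedTier(tier: string): boolean {
  return !SHARED_TIERS.has(tier.trim().toUpperCase());
}

function applyGuardrail(rule: AtlasRule, clusterTier: string): Verdict {
  const blocked =
    DESTRUCTIVE.has(rule.type) &&
    isDedicatedTier(clusterTier) &&
    rule.allowProductionPause !== true;

  if (!blocked) {
    return { outcome: "allow", rule };
  }

  // Downgrade to the alert-only action instead of touching the cluster.
  const downgraded: AtlasRule = { ...rule, type: "snapshot" };
  return {
    outcome: "downgrade",
    rule: downgraded,
    reason: `${rule.type} on ${clusterTier} needs allowProductionPause: true`,
  };
}

// The incident configuration: pause-cluster on an M30 without the opt-in.
const verdict = applyGuardrail({ type: "pause-cluster", target: "prod-cluster" }, "M30");
console.log(verdict.outcome, verdict.rule.type); // prints "downgrade snapshot"
```

The sketch downgrades rather than refuses, which is one of the two options the proposal leaves open; a refusing variant could throw at config-validation time instead. Treating unrecognized tier names as dedicated is a design choice of this sketch, made so that a new or misspelled tier cannot silently bypass the guardrail.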
Context
- Root-cause write-up: notebooks/2026-06-23-divinci-prod-atlas-pause-incident.md
- Live remediation (done): prod cluster added to protectedServices, all prod rules are alert-only snapshot, and the stray divinci-stage-atlas-* rules (one still carrying pause-cluster) have been deleted.