Workspace-scale adverse-event horizon scanner for clinical AI. Scans an entire workspace in one shot, runs a three-method statistical ensemble (Bayesian online change-point detection + Poisson z-score syndromic surveillance + CUSUM control chart) across every drug × event × stratum combination, applies Benjamini–Hochberg FDR control across the stratum grid, and emits signed FHIR DetectedIssue resources for each implicated patient — plus a workspace-level Composition with the statistical methodology travelling inline as a Provenance attachment.
The thesis: signal-significance decisions are made by statistics, not by an LLM. Three independent methods (BOCPD + Poisson-z + CUSUM) must all agree before any signal fires. The LLM is invoked only for clinician-facing narrative; CI greps the repo and fails the build if any LLM call escapes the vigil/vigil_core/explain/ allow-list.
| Surface | URL | What it serves |
|---|---|---|
| MCP server | https://vigil-mcp-ag707.fly.dev/mcp | Scan, detect, explain, and write tools over MCP |
| A2A workspace agent | https://vigil-a2a.fly.dev/ | Agent-to-Agent endpoint, workspace-scope (no per-patient selection required) |
Both are deployed on Fly.io. The local quick-start in DEMO.md runs a full workspace scan against 65 committed FHIR patients in ~3 seconds.
Every per-patient agent in this ecosystem requires a selected patient before it can do anything. Vigil does not. Vigil ships as a workspace-scope A2A agent and a workspace-scope MCP server: experimental.fhir_context_required.value=false, experimental.workspace_scope.value=true. The operator invokes Vigil with a single workspace-level prompt — "scan the workspace for emerging adverse-event clusters" — and receives a complete signal report: which clusters fired, which methods agreed, which patients are implicated, what the Benjamini–Hochberg q-value is, and an ed25519 signature on the whole packet.
- Cluster detection: 30/30 (100%) on the C30 corpus; 10/10 true clusters flagged, 0/20 false positives, expected FDR 0.000 (gate ≤ 0.10).
- Live demo path: 65 FHIR patients; 4/4 hero clusters fired through compose_workspace_signal_report.
- Calibration: 4/4 engineered hero clusters clear z-score + BOCPD + CUSUM with margin.
- B20/T15 corpora: 20 baseline + 15 temporal cases committed and schema/aggregate-validated offline.

65 committed eval cases + live FHIR workspace evidence:
python -m vigil.evals.runner --frozen v1
python -m vigil.scripts.calibrate_demo_thresholds
pytest vigil/tests/test_demo_workspace_live_path.py -q
C30 is the numeric runner (frozen 30/30). B20 and T15 are committed corpora with validator tests rather than part of the C30 runner command. No live openFDA calls at demo or judging time — frozen runs use only the committed seed parquet at data/faers_seed.parquet.
Five scenarios round-trip end-to-end against the 65-patient demo workspace at fhir-bundles/:
| # | Scenario | Signal |
|---|---|---|
| 1 | Warfarin + Amiodarone bleeding | 15 patients on combo, 15 major bleeding events. Calibration: z=15.88, BOCPD=1.000, CUSUM fired; BH q ≤ 0.05 live. |
| 2 | NSAID + ACE-I + Diuretic AKI | 12 patients on triple-whammy, 12 AKI events. Connects to Lumen's Margaret scenario. z=16.26, BOCPD=1.000, CUSUM fired. |
| 3 | Sulfonylurea ≥75y hypoglycaemia | 10 patients ≥75 on sulfonylurea, 10 severe hypoglycaemia events. Age-stratified detection. z=17.71, BOCPD=1.000, CUSUM fired. |
| 4 | Methotrexate hepatotoxicity | 8 patients on long-term MTX, 8 ALT/LFT elevations. Observation-based extraction (no Condition code needed). z=12.02, BOCPD=1.000, CUSUM fired. |
| 5 | No-signal control | 20-patient control cohort. Vigil scans, returns zero flagged clusters. The all-three-agree gate is not a rubber stamp. |
Three independent methods must all agree before any signal fires:
- Bayesian online change-point detection — Adams & MacKay 2007 (arXiv:0710.3742). Posterior over run-length; threshold default 0.95. Hazard rate recorded in signal metadata.
- Poisson z-score syndromic surveillance — CDC EARS (Hutwagner 2003) / FDA Sentinel (Platt 2012). Observed-vs-expected rate-ratio z on stratified workspace cohorts; threshold default 3.0.
- CUSUM control chart — Page 1954. Upward running cumulative-sum chart with reference value, slack k, and decision threshold h; threshold default 4.0.
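The two frequentist legs and the all-three-agree gate can be sketched in a few lines. This is a minimal illustration under stated assumptions, not Vigil's implementation: the function names, the standardized CUSUM form, the slack default k=0.5, and the use of `>=` at each threshold are all hypothetical; only the default thresholds (0.95, 3.0, 4.0) come from the list above. The BOCPD posterior is taken as an input here.

```python
import math

def poisson_z(observed, expected):
    """Z-score of an observed count against a Poisson expectation."""
    if expected <= 0:
        raise ValueError("expected count must be positive")
    return (observed - expected) / math.sqrt(expected)

def cusum_upper(values, reference, sigma, k=0.5):
    """Upward CUSUM path on standardized deviations from the reference value."""
    s, path = 0.0, []
    for x in values:
        # Accumulate only the excess beyond the slack k; never drop below zero.
        s = max(0.0, s + (x - reference) / sigma - k)
        path.append(s)
    return path

def ensemble_fires(bocpd_posterior, z, cusum_peak,
                   bocpd_min=0.95, z_min=3.0, h=4.0):
    """All-three-agree gate: one method below its threshold blocks the signal."""
    return bocpd_posterior >= bocpd_min and z >= z_min and cusum_peak >= h

# Hypothetical weekly event counts for one drug x event x stratum cell.
weekly = [2, 2, 2, 2, 5, 5]
path = cusum_upper(weekly, reference=2.0, sigma=1.0)
z = poisson_z(observed=sum(weekly[-2:]), expected=4.0)
print(max(path), z, ensemble_fires(0.99, z, max(path)))
```

Because the gate is a plain conjunction, a large z-score cannot compensate for a CUSUM chart that stayed under h, which is the behaviour the no-signal control scenario exercises.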
A signal fires only when all three clear their thresholds. The Benjamini & Hochberg 1995 q-value is computed once across the workspace stratum grid and reported on every fired signal (target FDR ≤ 0.05; over-target signals are surfaced but flagged q_above_target).
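The q-value step described above can be illustrated with the standard Benjamini-Hochberg step-up adjustment. A minimal sketch: the function names, the stratum labels, and the p-values are hypothetical, and how Vigil derives per-stratum p-values is not stated here; only the 0.05 target and the q_above_target flag name come from the text.

```python
def bh_qvalues(pvalues):
    """Benjamini-Hochberg adjusted p-values (q-values), returned in input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    qvalues = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotone q-values.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        qvalues[i] = running_min
    return qvalues

def annotate(strata_pvalues, target_fdr=0.05):
    """Attach a q-value and an over-target flag to every stratum in the grid."""
    names = list(strata_pvalues)
    qs = bh_qvalues([strata_pvalues[name] for name in names])
    return {
        name: {"q_value": q, "q_above_target": q > target_fdr}
        for name, q in zip(names, qs)
    }

# Hypothetical stratum grid; the adjustment runs once over all cells.
grid = {"warfarin|bleed|all": 0.01, "mtx|alt|all": 0.04,
        "su|hypo|75+": 0.03, "control|any|all": 0.5}
for name, result in annotate(grid).items():
    print(name, round(result["q_value"], 4), result["q_above_target"])
```

Adjusting once over the whole grid means each q-value depends on how many strata were tested, so adding strata makes every individual signal harder to clear.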
LLMs operate only inside vigil/vigil_core/explain/ for clinician-facing narrative. The CI invariant test_no_hardcoded_llm_model_strings rejects any LLM client invocation outside that allow-list.
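An allow-list check of this kind can be approximated by scanning file contents for model-name markers and ignoring anything under the permitted directory. The sketch below is illustrative only and is not the actual test_no_hardcoded_llm_model_strings: the function name, the in-memory `files` mapping, and the case-insensitive match are assumptions; the markers are the ones this README lists as grepped by CI.

```python
from pathlib import PurePosixPath

ALLOWED_DIR = PurePosixPath("vigil/vigil_core/explain")
MODEL_MARKERS = ("claude-", "gpt-", "o1-", "haiku", "gemini-")

def find_violations(files):
    """Return (path, marker) pairs for model strings found outside the allow-list.

    `files` maps repo-relative POSIX paths to file contents (hypothetical input
    shape; a real check would walk the working tree).
    """
    violations = []
    for path, text in sorted(files.items()):
        if ALLOWED_DIR in PurePosixPath(path).parents:
            continue  # inside the allow-listed directory
        lowered = text.lower()
        for marker in MODEL_MARKERS:
            if marker in lowered:
                violations.append((path, marker))
    return violations

repo = {
    "vigil/vigil_core/explain/narrative.py": 'MODEL = "claude-example"',
    "vigil/vigil_mcp/tools/detect/cusum.py": 'model = "gpt-example"',
    "vigil/evals/runner.py": "print('frozen run')",
}
print(find_violations(repo))
```

A CI job would fail the build whenever the returned list is non-empty.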
Full methodology with hazard-rate sensitivity analysis: docs/METHODOLOGY.md.
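For readers unfamiliar with the BOCPD leg, the Adams & MacKay run-length recursion can be sketched for count data with a conjugate Gamma prior on a Poisson rate. Everything specific here is an assumption for illustration: the hazard of 0.02, the Gamma(1, 1) prior, the 12-step window, and the choice to summarize the posterior as "mass on run lengths shorter than the window" are hypothetical and may differ from what Vigil records as its BOCPD statistic.

```python
import math

def nb_logpmf(x, alpha, beta):
    """Log predictive of count x under a Gamma(alpha, beta) belief about a Poisson rate."""
    return (math.lgamma(x + alpha) - math.lgamma(alpha) - math.lgamma(x + 1)
            + alpha * math.log(beta / (beta + 1.0)) - x * math.log(beta + 1.0))

def bocpd_recent_change(counts, hazard=0.02, alpha0=1.0, beta0=1.0, window=12):
    """Posterior probability that the current run is shorter than `window` steps."""
    run_probs = [1.0]               # run_probs[r] = P(run length == r | data so far)
    params = [(alpha0, beta0)]      # Gamma parameters for each run-length hypothesis
    for x in counts:
        pred = [math.exp(nb_logpmf(x, a, b)) for a, b in params]
        weighted = [p * w for p, w in zip(run_probs, pred)]
        change = hazard * sum(weighted)                   # a new run starts
        growth = [(1.0 - hazard) * w for w in weighted]   # each run grows by one
        joint = [change] + growth
        total = sum(joint)
        run_probs = [j / total for j in joint]
        # The new run restarts from the prior; surviving runs absorb the count.
        params = [(alpha0, beta0)] + [(a + x, b + 1.0) for a, b in params]
    return sum(run_probs[:window])

flat = [2] * 30                     # hypothetical stable baseline
shifted = [2] * 20 + [15] * 10      # hypothetical step change after week 20
print(bocpd_recent_change(flat), bocpd_recent_change(shifted))
```

With a constant hazard the posterior mass on run length zero always equals the hazard itself, which is why this sketch sums over a window of short run lengths rather than reading a single entry.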
flowchart TB
A[Workspace prompt:<br/>"scan for emerging AE clusters"] --> O[Vigil A2A Orchestrator<br/>workspace-scope agent]
O -->|self-fan| M[Vigil MCP<br/>FastMCP · workers=1]
M --> S[SCAN<br/>build cohort · stratify ·<br/>extract recent events]
S --> D[DETECT ensemble]
subgraph D[DETECT ensemble — all 3 must agree]
D1[BOCPD<br/>Adams & MacKay 2007]
D2[Poisson z-score<br/>CDC EARS · FDA Sentinel]
D3[CUSUM<br/>Page 1954]
end
D --> F[BH FDR control<br/>across stratum grid]
F --> E[EXPLAIN<br/>guideline cite · confounders<br/>· LLM narrative]
E --> W[COMPOSE + WRITE]
W --> R[Signed FHIR Bundle<br/>DetectedIssue × N patients +<br/>workspace Composition +<br/>Provenance + AuditEvent<br/>ed25519 · SHA-256 chain]
Module map:
- vigil/vigil_mcp/tools/scan/: workspace cohort builder, stratification, recent-event extraction
- vigil/vigil_mcp/tools/detect/: BOCPD, z-score, CUSUM, ensemble combiner, baseline store
- vigil/vigil_mcp/tools/explain/: guideline citation, confounder check, LLM narrative
- vigil/vigil_mcp/tools/write/: DetectedIssue, workspace Composition, audit Provenance
- vigil/vigil_mcp/tools/compose_workspace_signal_report.py: end-to-end composer
- vigil/vigil_orchestrator/: A2A endpoint with workspace-scope agent card
- vigil/vigil_core/explain/: explanation logic (sole LLM call site)
- vigil/evals/: C30 numeric runner + B20/T15 validators
- shared/: vendored FHIR client, ed25519 audit chain, LLM client, types
Full walkthrough: ARCHITECTURE.md.
Enforced by CI, not aspirational:
- Statistics decide signals, not LLMs. Three methods, all must agree, plus BH FDR control. The LLM only writes clinician-facing narrative on signals already fired.
- LLM call sites are an allow-list. vigil/vigil_core/explain/ only. CI greps the rest of the repo for claude-, gpt-, o1-, haiku, gemini- and fails the build on a hit outside the allow-list.
- uvicorn --workers 1 everywhere. FastMCP's session cache breaks under multi-worker. CI rejects any other value.
- Forward-hash audit chain. Every DetectedIssue and the workspace Composition are ed25519-signed and SHA-256 chained in canonical JSON. Chain.verify() raises on tamper.
- Frozen eval is hermetic. python -m vigil.evals.runner --frozen v1 uses only committed parquet data; no network, no openFDA hits, no API keys required.
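The forward-hash invariant can be illustrated with the standard library alone. This sketch covers only the SHA-256 chaining over canonical JSON; the ed25519 signing step is omitted because it needs a third-party library. The entry layout, the all-zero genesis value, and the function names are hypothetical and are not the API of the Chain class in shared/.

```python
import hashlib
import json

GENESIS = "0" * 64  # hypothetical starting value for the first entry

def canonical(obj):
    """Canonical JSON bytes: sorted keys, no insignificant whitespace."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"),
                      ensure_ascii=False).encode("utf-8")

def entry_hash(prev_hash, payload):
    """Hash of an entry binds its payload to the previous entry's hash."""
    return hashlib.sha256(prev_hash.encode("ascii") + canonical(payload)).hexdigest()

def append_entry(chain, payload):
    prev_hash = chain[-1]["hash"] if chain else GENESIS
    chain.append({"payload": payload, "prev_hash": prev_hash,
                  "hash": entry_hash(prev_hash, payload)})
    return chain

def verify(chain):
    """Recompute every link; raise ValueError at the first mismatch."""
    prev_hash = GENESIS
    for index, entry in enumerate(chain):
        expected = entry_hash(prev_hash, entry["payload"])
        if entry["prev_hash"] != prev_hash or entry["hash"] != expected:
            raise ValueError(f"audit chain broken at entry {index}")
        prev_hash = entry["hash"]
    return True

chain = []
append_entry(chain, {"resourceType": "DetectedIssue", "id": "example-1"})
append_entry(chain, {"resourceType": "Composition", "id": "example-report"})
print(verify(chain))
```

Because each hash covers the previous one, editing any earlier payload invalidates every later link, so a verifier only needs to walk the chain once from the start.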
git clone https://github.com/AbhinavGupta707/Vigil
cd Vigil
python3 -m venv .venv && source .venv/bin/activate
pip install -e '.[dev]'
pytest # full suite
python -m vigil.evals.runner --frozen v1 # eval headline
uvicorn vigil.vigil_mcp.main:app --workers 1 # MCP at :8000
uvicorn vigil.vigil_orchestrator.main:app --port 8080 --workers 1 # A2A at :8080

Hands-on hero scenario walkthrough: DEMO.md.
- A statistical ensemble that won't fire alone. Three independent methods, three different theoretical bases (Bayesian, frequentist, control-chart). Any one might be wrong; all three agreeing is a strong signal.
- FDR-controlled across the stratum grid. Benjamini–Hochberg q-value computed once over the full grid, not per-cluster, so multiple-testing inflation doesn't silently quadruple the false-positive rate.
- Workspace-scope agent capability. Most marketplace agents need a patient context. Vigil exposes experimental.workspace_scope=true and experimental.fhir_context_required=false, demonstrating the first surveillance-style agent against a real FHIR workspace.
- FHIR-native output. Not a JSON report, not a CSV, but signed FHIR R4 DetectedIssue resources, one per implicated patient, plus a workspace-level Composition. Slots directly into downstream EHR pipelines.
- Verifiable audit chain. ed25519 per entry, SHA-256 forward chain. External auditors can verify the entire chain with only the public key.
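To make the FHIR-native output concrete, here is a minimal sketch of what one per-patient resource might look like as a Python dict. The field names (status, severity, code, patient, implicated, detail) are standard FHIR R4 DetectedIssue elements, but the builder function, the chosen status and severity values, and the example identifiers are hypothetical; the exact payload Vigil writes is not specified in this README.

```python
import json

def build_detected_issue(patient_id, signal_label, q_value, implicated_refs):
    """Assemble a minimal FHIR R4 DetectedIssue for one implicated patient."""
    return {
        "resourceType": "DetectedIssue",
        "status": "preliminary",          # illustrative choice
        "severity": "high",               # illustrative choice
        "code": {"text": "Adverse-event cluster signal"},
        "patient": {"reference": f"Patient/{patient_id}"},
        "implicated": [{"reference": ref} for ref in implicated_refs],
        "detail": f"{signal_label}; Benjamini-Hochberg q={q_value:.3f}",
    }

issue = build_detected_issue(
    patient_id="example-001",
    signal_label="Warfarin + Amiodarone bleeding cluster",
    q_value=0.004,
    implicated_refs=["MedicationRequest/example-warfarin",
                     "MedicationRequest/example-amiodarone"],
)
print(json.dumps(issue, indent=2))
```

Emitting one resource per patient keeps each finding attached to the record a clinician already opens, while the workspace-level Composition carries the cluster-wide statistics.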
Vigil ships alongside Lumen — a formal-verification meta-layer for prospective clinical recommendations. Same author, shared infrastructure (shared/fhir/, shared/audit/, shared/llm/, shared/types/ are vendored in both repos). Different scope: Lumen proves safety for a single recommendation per patient; Vigil detects harm signals retrospectively across a workspace. Together they form a forward-proof + backward-surveillance stack.
MIT. See LICENSE. This is a research and engineering demonstration — not a medical device. It has not been evaluated by any regulatory authority and must not be used to make clinical decisions for real patients.
This work uses openFDA FAERS (FDA Adverse Event Reporting System) data as a prior baseline for surveillance comparison only — not as standalone clinical evidence. Source: U.S. FDA — openFDA Drug Adverse Event endpoint (https://api.fda.gov/drug/event.json). License: public domain. Use governed by openFDA terms. Disclaimer: "Do not rely on openFDA to make decisions regarding medical care. Always speak to your health provider about the risks and benefits of FDA-regulated products." Snapshot provenance: vigil/data/faers_seed_metadata.json.