Detect pharmacovigilance signals from FDA FAERS adverse event data.
Runs entirely on your laptop. No API keys required. No cloud. No licenses.
git clone https://github.com/al-Zamakhshari/drug-safety-signal-agent
cd drug-safety-signal-agent
cp .env.example .env # step 0: copy config (defaults work out of the box)
uv sync
docker compose up -d # start OpenSearch
ollama pull gemma4:12b-mlx # pull Gemma 4 12B MLX ~10GB (once)
./ingestion/download_faers.sh # downloads FAERS 2018–2026 to ~/faers_data/
uv run python -m ingestion.faers_zip_indexer --dir ~/faers_data --all-drugs
uv run python -m ingestion.discover_comparators --drug <your-drug> # per drug you want
uv run python -m ingestion.compute_class_ratio
uv run python -m ingestion.register_mcp_tools
uv run python -m app.server # → http://localhost:8080
Detects drug safety signals from FDA FAERS adverse event reports using a fully local pipeline: no cloud services, no API keys, no licenses.
| Stage | Method | Technology |
|---|---|---|
| Signal detection | PRR + ROR + 95% CI + BH-FDR + EBGM/EB05 | OpenSearch aggregations |
| Within-class comparison | Mantel–Haenszel stratified rate ratio + CI | OpenSearch faers_ml_rates |
| Label cross-reference | MedDRA LLT-expanded, negation-aware token-overlap | openFDA API |
| Literature evidence | PubMed search | NCBI eUtils |
| Investigation | Two-phase function calling — class effect / trend / temporal emergence | Qwen3.5-9B |
| Signal memory | Cross-run persistence | OpenSearch ML Memory (3.6+) |
| Web interface | Real-time streaming briefing | FastAPI + SSE |
Example output — semaglutide, 82,700 reports, 18M baseline (2004–2026):
### PRR + ROR + EBGM + BCPNN Signals (EMA/FDA/WHO standards)
| Reaction | PRR (95% CI) | ROR (95% CI) | EBGM / EB05 | IC / IC025 | n | Label? |
|---------------------------------|--------------------|--------------------|---------------|-------------|-------|--------|
| IMPAIRED GASTRIC EMPTYING | 84.94 (81.5–88.5) | 88.16 (84.5–91.9) | 57.0 / 55.2 ✓ | 5.8 / 5.8 ✓ | 3,057 | Yes |
| GLYCOSYLATED HB INCREASED | 11.80 (11.1–12.5) | 11.95 (11.2–12.7) | 11.2 / 10.7 ✓ | 3.5 / 3.4 ✓ | 1,111 | No ⚠️ |
| PANCREATITIS | 6.81 (6.47–7.16) | 6.91 (6.56–7.28) | 6.6 / 6.3 ✓ | 2.7 / 2.6 ✓ | 1,504 | Yes |
| BLOOD GLUCOSE DECREASED | 7.25 (6.87–7.66) | 7.36 (6.96–7.77) | 7.0 / 6.7 ✓ | 2.8 / 2.7 ✓ | 1,311 | No ⚠️ |
Risk: HIGH | Action: ESCALATE
flowchart TD
start([drug name]) --> RN
RN["🔍 resolve_names\nRxNorm API → all brand/generic names"]
RN --> LM
LM["🧠 load_memory\nOpenSearch ML Memory\nprior run findings → PERSISTENT tags"]
LM --> PRR
PRR["📊 calculate_prr\nPRR + ROR + 95% CI\nBH-FDR + EBGM/EB05\nOpenSearch filters agg"]
PRR --> AD
AD["📈 anomaly_detection\nMantel–Haenszel rate ratio\nvs comparator class\nquarterly strata"]
AD --> FL
FL["📋 fetch_label\nopenFDA label text\nMedDRA LLT synonyms cached"]
FL --> R1
R1{"robust signal\nAND unlabeled\nAND PRR >= 3?"}
R1 -->|Yes| SL
R1 -->|No| R2
SL["📚 search_lit\nPubMed API\ntop 3 unlabeled signals"]
SL --> R2
R2{"PRR >= 5\nAND n >= 10\nAND robust CI\nAND FDR q < 0.05\nAND unlabeled?"}
R2 -->|Yes| INV
R2 -->|No| CS
INV["🔬 investigate\nPhase 1: Python (tools + ratio)\nPhase 2: LLM free-form exploration"]
INV --> CS
CS["🔁 classify_signals\nCross-run lifecycle diff\nNEW - VALIDATED - DISMISSED\nCI-overlap test vs prior run\nagent-signal-runs index"]
CS --> WR
WR["✍️ write_report\nGemma 4 12B MLX\nnarrative prose only\nnumbers never re-typed by model"]
WR --> SM
SM["💾 save_memory\nOpenSearch ML Memory\npersist text trail for next run"]
SM --> out([Briefing])
style PRR fill:#dbeafe,stroke:#3b82f6
style AD fill:#dbeafe,stroke:#3b82f6
style FL fill:#dbeafe,stroke:#3b82f6
style SL fill:#dbeafe,stroke:#3b82f6
style RN fill:#dbeafe,stroke:#3b82f6
style LM fill:#dbeafe,stroke:#3b82f6
style CS fill:#dbeafe,stroke:#3b82f6
style SM fill:#dbeafe,stroke:#3b82f6
style INV fill:#fef3c7,stroke:#f59e0b
style WR fill:#fef3c7,stroke:#f59e0b
Blue nodes = deterministic Python — same input always produces the same output.
Yellow nodes = Gemma 4 12B MLX (via Ollama) — advisory, non-deterministic, clearly labelled in the briefing.
flowchart TB
subgraph INFRA["🐳 Infrastructure (Docker Compose)"]
OS[("OpenSearch 3.6.0\nfaers_reports — 18M docs (2004–2026)\nfaers_ml_rates — 17K class-ratio docs\nML Memory + agent-signal-runs — lifecycle registry")]
QW["Gemma 4 12B MLX\nOllama (native MLX)\n10 GB - Apple Silicon / GPU"]
PX["Arize Phoenix\nOTLP traces - port 4317"]
end
subgraph STATS["📐 Statistics Layer (Python — deterministic)"]
P1["PRR + ROR\nCorrect 2×2, non-exposed denominator\nper-reaction baseline via filters agg"]
P2["95% CI + BH-FDR\nEvans 2001 log-normal\nBenjamini-Hochberg q across all m tested"]
P3["EBGM / EB05\nGamma-Poisson Shrinker\nDuMouchel 1999 - FDA MGPS standard"]
P4["Mantel–Haenszel\nQuarterly strata\nRobins–Breslow–Greenland variance"]
P5["Label matching\nMedDRA LLT + negation + direction\n3-state: Yes / Possible / No ⚠️"]
end
subgraph LLM_LAYER["🤖 LLM Layer (Gemma 4 12B MLX — advisory)"]
L1["Investigation\nPhase 1: Python calls tools + ratio (deterministic)\nLLM: one INSIGHT sentence per signal\nPhase 2: free-form exploration\nTools: get_prr, check_class_effect\nget_signal_trend, compare_time_periods"]
L2["Report writing\nKey Findings narrative only\nNo numbers re-typed\n_DISCLAIMER always appended"]
end
OS -->|aggregations| P1
OS -->|per-quarter counts| P4
P1 --> P2 --> P3
P3 -->|signals + CI + EBGM| L1
P4 -->|within-class ratios| L1
P5 -->|labeled / unlabeled| L1
L1 --> L2
QW -.->|serves models| L1 & L2
PX -.->|traces all LLM calls| L1 & L2
Every candidate signal climbs five rungs before triggering investigation. Each rung addresses a different failure mode of the previous one.
flowchart LR
R0["FAERS reports\n(raw counts)"]
R0 -->|n >= 3\neliminate\nsingle-event noise| R1
R1["PRR / ROR\npoint estimate\n2×2 table"]
R1 -->|Evans 2001\nlog-normal CI\nHaldane +0.5 for zero cells| R2
R2["95% CI\nPRR lower > 1.0\n= robust gate\npenalises small n"]
R2 -->|Yates χ²\nper-test p-value| R3
R3["χ² significance\nannotation ✓/~\nnot a gate — surfaced\nfor human review"]
R3 -->|Benjamini-Hochberg\nm = ALL reactions tested\nnot just PRR>=2| R4
R4["BH q-value\n< 0.05 gate\nfalse discovery rate\nacross ~50 reactions"]
R4 -->|DuMouchel 1999\nGPS mixture prior\nfit across all drug reactions| R5
R5["EBGM / EB05\nEB05 >= 2 = FDA flag\nshrinks PRR=15 at n=3\ndown to EB05~1.1"]
R5 -->|All five passed| GATE{{"Investigation\ngate"}}
GATE -->|unlabeled\nPRR >= 5| INV["🔬 LLM investigates"]
GATE -->|all signals| TABLE["📊 Report table"]
style GATE fill:#dcfce7,stroke:#16a34a
For each strong unlabeled signal, the pipeline runs two sequential phases:
flowchart TD
SIG["Strong unlabeled signal\nPRR >= 5, n >= 10, robust CI, FDR q < 0.05"]
SIG --> LOOP
subgraph LOOP["Per-signal Python loop — one iteration per reaction"]
direction TB
subgraph P1["Phase 1: Python — always runs (no LLM for arithmetic)"]
T1["_get_prr(drug, reaction)\nDirect OpenSearch call"]
T2["_get_prr × comparators\nmin(comparator_prrs) → lowest"]
T3["get_signal_trend\nquarterly timeline"]
T1 --> T2 --> T3
T3 --> C1{"drug_prr / lowest_comp > 5?\n(pure Python arithmetic)"}
C1 -->|Yes| DS["DRUG_SPECIFIC"]
C1 -->|No| CE["CLASS_EFFECT"]
DS & CE --> INS["LLM: one INSIGHT sentence\n(thinking=OFF, 150 tokens)"]
end
INS --> P2CHECK
P2CHECK{"DRUG_SPECIFIC\nor ratio > 5?\n(Phase-1 gate)"}
subgraph P2["Phase 2: Free-form LLM — conditional"]
FT["Model chooses tools freely\ncompare_time_periods: EMERGING/GROWING\nAD tools: which time window spiked\nget_prr: alternative name variants\ncheck_class_effect: other drug classes"]
end
P2CHECK -->|Yes| P2
P2CHECK -->|No| OUT
P2 --> OUT
OUT["CLASSIFICATION / TREND / INSIGHT\nall numbers from Python"]
end
Why Python for Phase 1 arithmetic? Routing min(comparator_prrs) and ratio > 5 through an LLM introduces failure modes independent of model quality: any model can misidentify the minimum over 3 numbers under the right context conditions. Python is deterministic. Phase 2 (open-ended tool selection, time-window reasoning, hypothesis synthesis) is where LLM reasoning genuinely adds value and stays.
The briefing table is rendered by Python — write_report passes pre-computed numbers as a JSON struct to the model, explicitly instructing it never to re-type them. The model's job is prose structure, not arithmetic.
Why: Language models hallucinate numbers, especially ratios and small-n statistics. A PRR of 82.72 must be computed, not narrated. The _DISCLAIMER on every briefing makes the LLM/Python boundary explicit to the reader.
PRR alone: correct formula, but PRR=15 on n=3 looks as confident as PRR=15 on n=3000
+ 95% CI: lower bound penalises small n — PRR=15 on n=3 gets CI crossing 1.0 (not robust)
+ BH-FDR: corrects for testing ~50 reactions simultaneously — controls false discovery rate
+ EBGM: Bayesian shrinkage (FDA MGPS) — PRR=15 on n=3 → EB05=1.1 (not flagged)
PRR=15 on n=3000 → EB05=14.5 (correctly flagged)
+ BCPNN IC: WHO Uppsala complement — IC025 > 0 is the WHO signal flag
Both GPS and BCPNN agree on strong signals; disagreement at small n is informative
EBGM (DuMouchel 1999) fits a 2-component Gamma-Poisson mixture prior across all (observed, expected) pairs for the drug. BCPNN (Bate 1998 / Norén 2006) uses a Beta-Binomial prior via the closed-form IC formula. ROR is the WHO/Uppsala Monitoring Centre complement to PRR — they agree for rare reactions and diverge for common ones, which is itself a diagnostic signal.
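As a rough illustration of the BCPNN side, the sketch below computes an Information Component from an observed and an expected count. It is an assumption-laden sketch rather than the repository's code: the function names are hypothetical, the expected count uses the independence formula given elsewhere in this README, and the IC025 line uses a commonly cited closed-form approximation to the lower 95% credibility bound, which may differ from what the pipeline actually computes.

```python
import math


def expected_count(drug_total: int, reaction_baseline: int, faers_total: int) -> float:
    """Expected drug-reaction count under independence."""
    return drug_total * (reaction_baseline / faers_total)


def information_component(observed: int, expected: float) -> tuple[float, float]:
    """Return (IC, IC025) using the +0.5 shrinkage form.

    ASSUMPTION: the IC025 expression is a closed-form approximation to the
    lower 95% credibility bound; the repository may compute it differently.
    """
    o = observed + 0.5
    ic = math.log2(o / (expected + 0.5))
    ic025 = ic - 3.3 * o ** -0.5 - 2.0 * o ** -1.5
    return ic, ic025


# Hypothetical counts, not taken from FAERS.
e = expected_count(drug_total=10_000, reaction_baseline=5_000, faers_total=1_000_000)
ic, ic025 = information_component(observed=200, expected=e)
print(f"E={e:.1f}  IC={ic:.2f}  IC025={ic025:.2f}  flag={ic025 > 0}")
```

The penalty terms shrink as the observed count grows, so at small n the lower bound falls well below the point estimate and can drop under zero even when IC itself is not negative.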
Semaglutide's FAERS volume grew from ~100 reports/quarter (2018) to ~15,000/quarter (2025). Naive pooling sums counts across all quarters and divides — this weights high-volume recent quarters so heavily that they dominate the denominator, creating bias when comparator reporting rates also changed over the same period.
MH stratifies by quarter: each quarter is a stratum with its own (drug_count, drug_total, comp_count, comp_total). The Mantel–Haenszel weighted estimate RR_MH = Σ(a_k·n2_k/N_k) / Σ(c_k·n1_k/N_k) weights each stratum by its information content, not its volume. In a constructed confounded scenario (comp rate 20%→1% over 8 quarters), naive pooling gives RR=7.4 while MH correctly recovers RR=2.0.
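The difference between naive pooling and the Mantel–Haenszel estimate can be sketched in a few lines. The two strata below are made-up illustrative numbers (not the 8-quarter scenario quoted above, and not FAERS data), chosen so that the true rate ratio is 2.0 in each stratum while reporting volume shifts between drug and comparator. `Stratum`, `naive_rr`, and `mh_rr` are hypothetical names.

```python
from typing import NamedTuple


class Stratum(NamedTuple):
    drug_count: int  # a_k
    drug_total: int  # n1_k
    comp_count: int  # c_k
    comp_total: int  # n2_k


def naive_rr(strata: list[Stratum]) -> float:
    """Pool all counts first, then take a single ratio of rates."""
    a = sum(s.drug_count for s in strata)
    n1 = sum(s.drug_total for s in strata)
    c = sum(s.comp_count for s in strata)
    n2 = sum(s.comp_total for s in strata)
    return (a / n1) / (c / n2)


def mh_rr(strata: list[Stratum]) -> float:
    """RR_MH = sum(a_k * n2_k / N_k) / sum(c_k * n1_k / N_k)."""
    num = sum(s.drug_count * s.comp_total / (s.drug_total + s.comp_total) for s in strata)
    den = sum(s.comp_count * s.drug_total / (s.drug_total + s.comp_total) for s in strata)
    return num / den


# Stratum 1: drug rate 40% vs comparator 20%; stratum 2: 2% vs 1%.
strata = [
    Stratum(drug_count=4000, drug_total=10_000, comp_count=20, comp_total=100),
    Stratum(drug_count=2, drug_total=100, comp_count=100, comp_total=10_000),
]
print(f"naive RR = {naive_rr(strata):.2f}, MH RR = {mh_rr(strata):.2f}")
```

Because the drug's reports are concentrated in the high-rate stratum and the comparator's in the low-rate one, the pooled ratio is far above 2.0, while the stratified estimate recovers the within-stratum ratio.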
A common implementation error: apply BH only to signals that already passed the PRR≥2 filter. This understates the multiple-comparison burden — if you tested 50 reactions and only 20 passed PRR≥2, the correct m is 50, not 20. We compute χ² p-values for every reaction with n≥3, run BH across all m, then apply the PRR≥2 filter afterwards.
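A minimal sketch of that ordering, assuming p-values are already available for every tested reaction: `bh_q_values` is a hypothetical helper implementing the standard Benjamini–Hochberg step-up adjustment, and the p-values are invented for illustration.

```python
def bh_q_values(p_values: list[float]) -> list[float]:
    """Benjamini-Hochberg adjusted p-values (q-values), in input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Five reactions tested; only the first three pass the PRR >= 2 filter.
p_all = [0.001, 0.02, 0.04, 0.5, 0.9]
passed_prr = [True, True, True, False, False]

q_correct = bh_q_values(p_all)  # m = 5, filter applied afterwards
q_wrong = bh_q_values([p for p, ok in zip(p_all, passed_prr) if ok])  # m = 3

print([round(q, 3) for q in q_correct[:3]])
print([round(q, 3) for q in q_wrong])
```

With m taken from the filtered subset, the third reaction's q-value falls below 0.05; with the correct m it does not, which is exactly the understated burden described above.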
The ratio drug_prr / min(comparator_prrs) and the > 5 threshold are pure arithmetic over a 3-element dict. Routing them through an LLM introduces a whole class of failure (misidentified minimum, wrong threshold application) that is independent of model quality — any model can misclassify under the right context conditions. Python cannot.
Moving this to Python produced a concrete, measurable improvement: IMPAIRED GASTRIC EMPTYING (ratio 22.46x) and ERUCTATION (13.21x) were previously misclassified as CLASS_EFFECT by the LLM; Python correctly flags them DRUG_SPECIFIC, which triggers Phase 2 deep investigation on exactly the signals that warrant it.
The LLM's Phase 1 job is now exactly one sentence: the INSIGHT (clinical interpretation, grounded in the pre-computed facts). thinking=OFF, max_tokens=150. Phase 2 (free-form tool selection, hypothesis synthesis) is where reasoning genuinely earns its cost.
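The deterministic part of Phase 1 amounts to very little code. The sketch below is an illustration under assumptions, not the repository's implementation: `classify_phase1` is a hypothetical name, the comparator keys are placeholders, and the guard against an empty or non-positive comparator set is an added assumption.

```python
def classify_phase1(
    drug_prr: float,
    comparator_prrs: dict[str, float],
    threshold: float = 5.0,
) -> tuple[str, float]:
    """Return (label, ratio) where ratio = drug_prr / min(comparator_prrs)."""
    if not comparator_prrs:
        raise ValueError("at least one comparator PRR is required")
    lowest = min(comparator_prrs.values())
    if lowest <= 0:
        raise ValueError("comparator PRRs must be positive")
    ratio = drug_prr / lowest
    label = "DRUG_SPECIFIC" if ratio > threshold else "CLASS_EFFECT"
    return label, ratio


# Placeholder comparators with invented PRRs.
comparators = {"comparator_a": 2.0, "comparator_b": 5.0, "comparator_c": 8.0}
print(classify_phase1(20.0, comparators))  # ratio 10 -> DRUG_SPECIFIC
print(classify_phase1(6.0, comparators))   # ratio 3  -> CLASS_EFFECT
```

The label and ratio can then be handed to the model as fixed facts, leaving it only the INSIGHT sentence to write.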
The first implementation used a single prompt listing all signals. Thinking models pre-reason all of them in the <think> block and extrapolate results for signals 2–3 from training knowledge. The fix: one iteration per signal in a Python for loop. Phase 1 tool calls are now direct async Python calls (no LLM context involved). Phase 2 uses a fresh React agent context per signal, preventing cross-signal contamination.
OpenSearch 3.6 (Apache 2.0) provides everything the pipeline needs in a single, self-hosted, zero-cost stack: the filters aggregation for per-reaction baseline without top-N truncation, ML Memory for cross-run signal persistence, DataDistributionTool for time-period analysis, and a built-in MCP server for free-form investigation. Running locally means no data leaves the machine — important for any work involving patient-level adverse event reports.
The original code used class_ratio = 999.0 when a reaction appeared in the drug but in zero comparators. This was statistically meaningless — 999 is not a ratio, it's an error code stored as a float. It polluted the output table and made the sort order nonsensical.
Haldane–Anscombe replaces the zero comparator count with 0.5 before computing the ratio: class_rate = 0.5 / comp_total. This gives a finite, large ratio whose 95% CI is naturally wide (because the 1/c term in the SE formula becomes 1/0.5 = 2.0). The CI gate (lower > 1.0) then decides whether the signal is robust given the uncertainty — it usually isn't for zero-comparator reactions unless the drug count is very large.
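A small sketch of the correction, assuming a single-stratum log-normal standard error of the same form as SE_PRR (the within-class pipeline itself uses Mantel–Haenszel, so this only illustrates the zero-cell behaviour). `class_ratio_with_ci` and the counts are hypothetical.

```python
import math


def class_ratio_with_ci(
    drug_count: int,
    drug_total: int,
    comp_count: int,
    comp_total: int,
    z: float = 1.96,
) -> tuple[float, float, float]:
    """Return (ratio, ci_low, ci_high); a zero comparator count becomes 0.5."""
    c = comp_count if comp_count > 0 else 0.5  # Haldane-Anscombe correction
    ratio = (drug_count / drug_total) / (c / comp_total)
    # ASSUMPTION: same log-normal SE form as SE_PRR; 1/c is 2.0 when c = 0.5.
    se = math.sqrt(1 / drug_count - 1 / drug_total + 1 / c - 1 / comp_total)
    log_ratio = math.log(ratio)
    return ratio, math.exp(log_ratio - z * se), math.exp(log_ratio + z * se)


# Invented counts: 3 drug reports, zero comparator reports.
ratio, lo, hi = class_ratio_with_ci(3, 1_000, 0, 2_000)
print(f"ratio={ratio:.1f}  CI=({lo:.2f}, {hi:.1f})  robust={lo > 1.0}")
```

The ratio is finite and large, but the interval is wide enough that the lower bound falls below 1.0, so the CI gate rejects it rather than a sentinel value distorting the table.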
FDA label text uses clinical prose; FAERS uses MedDRA Preferred Terms. The mismatch creates false "No ⚠️" entries in the Label? column.
Example: FDA label says "delays gastric emptying" — MedDRA PT is "IMPAIRED GASTRIC EMPTYING". A pure string match misses this. The fix: fetch official MedDRA Lower-Level Terms from openFDA (free, no license), cache them locally, and try each LLT as a matching candidate. "Delays" matches "impaired" via the synonym dictionary + LLT expansion.
LLT mappings are cached in .meddra_llt_cache.json (repo root) after the first fetch, so subsequent runs work fully offline. This file is gitignored — it is populated automatically and safe to delete to force a refresh.
All external API calls (RxNorm, openFDA, PubMed/NCBI) retry up to 3× on transient failures (429, 500, 502, 503, 504, connection errors) with delays 1s → 2s → 4s. The pipeline has seen all of these in the wild — NCBI rate-limits free-tier requests, openFDA occasionally returns 503 under load.
No external dependency (tenacity is not required): implemented as a tight async loop in each _get() helper. Falls back cleanly on httpx.HTTPStatusError for non-retryable errors (e.g. 400 bad request, 404 not found).
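The retry loop described above might look roughly like the following. This is a standard-library sketch under assumptions: `HttpError` is a hypothetical stand-in for the HTTP client's status exception, `get_with_retry` is an invented name, and the injectable `sleep` parameter exists only so the loop can be exercised without real delays.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")
RETRYABLE_STATUS = {429, 500, 502, 503, 504}


class HttpError(Exception):
    """Hypothetical stand-in for an HTTP client's status error."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


async def get_with_retry(
    fetch: Callable[[], Awaitable[T]],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Call fetch(); retry transient failures with delays of 1s, 2s, 4s."""
    for attempt in range(retries + 1):
        try:
            return await fetch()
        except HttpError as exc:
            # Non-retryable statuses (400, 404, ...) propagate immediately.
            if exc.status not in RETRYABLE_STATUS or attempt == retries:
                raise
        except ConnectionError:
            if attempt == retries:
                raise
        await sleep(base_delay * 2 ** attempt)
    raise RuntimeError("unreachable")
```

A caller would wrap each request in a zero-argument coroutine function, for example `await get_with_retry(lambda: fetch_label(drug))`, where `fetch_label` is whatever helper performs the request.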
get_or_create_memory() uses PUT /{index}/_create/{drug} (idempotent, conflict-safe) rather than POST /_doc (auto-ID). Under concurrent uvicorn workers, two workers could race to create an ML Memory container for a new drug — the _create op returns 409 if a document already exists, and the loser re-reads the winner's memory_id. No orphaned containers, no duplicate state.
Everything runs locally via Docker. Zero external dependencies.
| Component | Technology | License |
|---|---|---|
| Database | OpenSearch 3.6.0 | Apache 2.0 |
| LLM | Gemma 4 12B MLX via Ollama (~10GB) via Docker Model Runner | Apache 2.0 |
| Agent framework | LangGraph | MIT |
| Web UI | FastAPI + SSE streaming | MIT |
| Ingestion | Polars — 3× less memory than pandas | MIT |
| Observability | Arize Phoenix (optional) | Apache 2.0 |
| Data | FDA openFDA API + PubMed + FDA FAERS ZIPs | Public domain |
- Docker Desktop with Model Runner enabled
- Python 3.11+ with uv
- 16GB RAM recommended (OpenSearch 1.5GB + Gemma 4 12B MLX ~10GB)
- Ollama installed (provides native MLX inference on Apple Silicon)
- ~10GB disk for full FAERS 2018–2026 dataset
Optional: Set GOOGLE_API_KEY (free at aistudio.google.com) to upgrade the investigator to a cloud LLM via Google AI Studio (e.g. gemini-2.0-flash). Set INVESTIGATOR_MODEL to override the local model path. If the Google API returns a 500/503/429 error, the investigator automatically falls back to the local Qwen3.5-9B for that signal — results remain valid but advisory quality may vary between runs.
# 1. Install dependencies
uv sync
# 2. Start infrastructure (pulls Qwen3.5-9B ~5.6GB on first run)
docker compose up -d
# 3. Load FAERS data
# Quick demo — 5 min via openFDA API, no download needed
uv run python -m ingestion.faers_indexer --drug semaglutide --limit 6000
uv run python -m ingestion.faers_indexer --drug rofecoxib --limit 2000
# Full dataset — 2018–2026, ~12M reports, ~1 hour + download
./ingestion/download_faers.sh
uv run python -m ingestion.faers_zip_indexer --dir ~/faers_data --all-drugs
# Full history — adds 2004–2017 (rofecoxib peak period, ~2.8GB more)
./ingestion/download_faers_historical.sh
uv run python -m ingestion.faers_zip_indexer --dir ~/faers_data --all-drugs
# 4. Compute within-class disproportionality (one-time)
# For new drugs: first auto-discover comparators via RxClass ATC, then build index
uv run python -m ingestion.discover_comparators --drug <your-drug> # optional: auto-populates config/comparators.yaml
uv run python -m ingestion.compute_class_ratio # builds faers_ml_rates for all drugs in comparators.yaml
# 5. Register OpenSearch MCP tools (one-time, enables free-form investigation)
uv run python -m ingestion.register_mcp_tools
# 6a. Web UI (SSE streaming dashboard)
uv run python -m app.server # → http://localhost:8080
# Web server endpoints:
# GET / → redirects to /static/index.html (streaming dashboard)
# GET /analyze?drug=semaglutide → SSE stream of pipeline events + final briefing
# GET /api/briefing/semaglutide → REST: full structured JSON response:
# {drug_name, drug_names, drug_total, faers_total,
# prr_signals, anomaly_signals, literature,
# investigation, signal_status, briefing, error}
# GET /health → {"status": "ok"}
# 6b. CLI
uv run python main.py semaglutide
uv run python main.py rofecoxib # retrospective: recalled 2004 for MI risk
a = drug reports with reaction b = drug reports without reaction
c = non-drug reports with reaction d = non-drug reports without reaction
PRR = (a/(a+b)) / (c/(c+d)) — EMA/813938/2011 standard
ROR = (a·d) / (b·c) — WHO/Uppsala standard
SE_PRR = √(1/a − 1/(a+b) + 1/c − 1/(c+d)) (Evans 2001)
SE_ROR = √(1/a + 1/b + 1/c + 1/d)
95% CI = exp(ln(estimate) ± 1.96 · SE)
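The formulas above translate directly into Python. The sketch below is illustrative rather than the repository's code: `prr_ror` and `Estimate` are hypothetical names, the counts are invented, and adding 0.5 to every cell when any cell is zero is one common reading of the Haldane correction, assumed here.

```python
import math
from typing import NamedTuple


class Estimate(NamedTuple):
    value: float
    ci_low: float
    ci_high: float


def _log_normal_ci(estimate: float, se: float, z: float = 1.96) -> Estimate:
    """95% CI = exp(ln(estimate) +/- z * SE)."""
    log_est = math.log(estimate)
    return Estimate(estimate, math.exp(log_est - z * se), math.exp(log_est + z * se))


def prr_ror(a: float, b: float, c: float, d: float) -> tuple[Estimate, Estimate]:
    """PRR and ROR with log-normal CIs from a 2x2 table."""
    # ASSUMPTION: if any cell is zero, add 0.5 to all four cells.
    if min(a, b, c, d) == 0:
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    prr = (a / (a + b)) / (c / (c + d))
    ror = (a * d) / (b * c)
    se_prr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    se_ror = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return _log_normal_ci(prr, se_prr), _log_normal_ci(ror, se_ror)


# Invented table: 20 of 100 drug reports vs 100 of 10,000 other reports.
prr, ror = prr_ror(a=20, b=80, c=100, d=9900)
print(f"PRR={prr.value:.2f} ({prr.ci_low:.2f}-{prr.ci_high:.2f})")
print(f"ROR={ror.value:.2f} ({ror.ci_low:.2f}-{ror.ci_high:.2f})")
```

Here the reaction is common among the drug's reports (20%), so ROR sits noticeably above PRR, which is the divergence for common reactions mentioned earlier.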
Signal criteria applied in sequence:
| Criterion | Threshold | Controls |
|---|---|---|
| Count | n ≥ 3 | Single-event noise (PRR signal table) |
| PRR | ≥ 2.0 | Effect size (EMA standard) |
| CI lower | > 1.0 | Small-n instability — PRR=15 at n=4 fails |
| BH q-value | < 0.05 | False discovery rate (m = all reactions tested, not just PRR≥2) |
| EBGM EB05 | ≥ 2.0 | FDA MGPS threshold — Bayesian lower bound |
Investigation gate additionally requires n ≥ 10 and PRR ≥ 5 (only strong, statistically robust, unlabeled signals trigger the LLM investigator). Within-class table requires n ≥ 5.
All thresholds are from published EMA/FDA standards — none were tuned on semaglutide.
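Applied in sequence, the criteria can be sketched as an ordered list of checks that reports the first failure. The `Signal` dataclass, the function names, and the boolean `labeled` field (a simplification of the three-state label match) are assumptions for illustration; the example values are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Signal:
    reaction: str
    n: int
    prr: float
    ci_lower: float
    q_value: float
    eb05: float
    labeled: bool  # simplification of the Yes / Possible / No label match


def first_failed_criterion(s: Signal) -> Optional[str]:
    """Apply the criteria in table order; return the first that fails, else None."""
    checks = [
        ("count", s.n >= 3),
        ("prr", s.prr >= 2.0),
        ("ci_lower", s.ci_lower > 1.0),
        ("bh_q", s.q_value < 0.05),
        ("eb05", s.eb05 >= 2.0),
    ]
    for name, ok in checks:
        if not ok:
            return name
    return None


def triggers_investigation(s: Signal) -> bool:
    """Investigation gate: all criteria plus n >= 10, PRR >= 5, unlabeled."""
    return (
        first_failed_criterion(s) is None
        and s.n >= 10
        and s.prr >= 5.0
        and not s.labeled
    )


small = Signal("EXAMPLE_SMALL_N", n=3, prr=15.0, ci_lower=0.8, q_value=0.2, eb05=1.1, labeled=False)
strong = Signal("EXAMPLE_STRONG", n=1200, prr=11.8, ci_lower=11.1, q_value=1e-6, eb05=10.7, labeled=False)
print(first_failed_criterion(small), triggers_investigation(strong))
```

Returning the name of the failed criterion, rather than a bare boolean, makes it easy to show a reviewer why a high-PRR, small-n signal was not escalated.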
E[drug, reaction] = drug_total × (baseline / faers_total) (expected under independence)
GPS mixture prior: P(O | E) = P · NB(α₁, β₁/(β₁+E)) + (1−P) · NB(α₂, β₂/(β₂+E))
Parameters θ = (α₁, β₁, α₂, β₂, P) fitted by MLE across all (O, E) pairs for the drug.
EBGM = exp(E[ln λ | O, E]) — posterior geometric mean
EB05 = 5th percentile of posterior lambda distribution
For each quarter k: a_k = drug_count, n1_k = drug_total
c_k = comp_count, n2_k = comp_total, N_k = n1_k + n2_k
RR_MH = Σ_k (a_k · n2_k / N_k) / Σ_k (c_k · n1_k / N_k)
Var[ln RR_MH] = Σ P_k R_k / (2R²) + Σ(P_k S_k + Q_k R_k)/(2RS) + Σ Q_k S_k / (2S²)
(Robins–Breslow–Greenland 1986)
Every pipeline run classifies each signal as NEW, VALIDATED, or DISMISSED by comparing against the prior run's stored data.
NEW — reaction not seen in any prior run for this drug
VALIDATED — 95% CIs overlap: the change from prior to current PRR is within
sampling noise — the signal persists
DISMISSED — current upper CI is entirely below prior lower CI:
the signal has genuinely collapsed, not just fluctuated
The comparison uses a CI-overlap test rather than a bare percentage cliff — PRR 8.0→3.9 can still be VALIDATED if the confidence intervals overlap, while a true collapse (current upper bound < prior lower bound) triggers DISMISSED regardless of the ratio.
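The CI-overlap rule can be expressed as a short function. This is a sketch with hypothetical names (`PrrCI`, `classify_lifecycle`) and invented intervals; the source does not say how a signal whose interval has moved entirely above the prior one is labelled, so treating that case as VALIDATED is an assumption.

```python
from typing import NamedTuple, Optional


class PrrCI(NamedTuple):
    prr: float
    ci_low: float
    ci_high: float


def classify_lifecycle(prior: Optional[PrrCI], current: PrrCI) -> str:
    """Return NEW, VALIDATED, or DISMISSED from a CI-overlap comparison."""
    if prior is None:
        return "NEW"  # reaction not seen in any prior run
    if current.ci_high < prior.ci_low:
        return "DISMISSED"  # current interval lies entirely below the prior one
    # ASSUMPTION: overlap, or a shift entirely upward, both count as VALIDATED.
    return "VALIDATED"


prior = PrrCI(prr=8.0, ci_low=3.5, ci_high=18.0)
print(classify_lifecycle(None, prior))                    # NEW
print(classify_lifecycle(prior, PrrCI(3.9, 2.0, 7.6)))    # VALIDATED
print(classify_lifecycle(prior, PrrCI(1.5, 1.1, 2.0)))    # DISMISSED
```

The second call shows the 8.0→3.9 case from the text: the point estimate roughly halves, yet the intervals still overlap, so the signal is kept.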
Per-run structured state (drug, run_ts, per-reaction PRR + CIs + effect + trend + status) is persisted to the agent-signal-runs OpenSearch index. The next run loads this via load_last_run() and passes per-reaction PRR trajectory deltas to the Phase-1 investigator prompt:
PRIOR RUN SIGNAL TRAJECTORY:
PANCREATITIS: DRUG_SPECIFIC PRR=8.2→9.1 (+11%) | PERSISTENT
NAUSEA: CLASS_EFFECT PRR=4.3→4.5 (+5%)
BLOOD_GLUCOSE_DECR: DRUG_SPECIFIC PRR=6.0 last run → RESOLVED (gone)
This lets the investigator reason about trajectory — whether a signal is newly emerging, persistently elevated, or resolving — not just its current point estimate.
The report header summarises the lifecycle for the run: NEW=2 · VALIDATED=5 · DISMISSED=1. PRR table reaction labels carry status badges: 🆕 (NEW), ✅ (VALIDATED), 📉 (DISMISSED).
| Feature | API | Purpose |
|---|---|---|
| filters aggregation | standard | Per-reaction baseline without top-N truncation |
| ML Memory | /_plugins/_ml/memory | Human-readable text trail across runs |
| agent-signal-runs index | standard | Structured per-run signal state for CI-overlap lifecycle diff |
| Built-in MCP server | /_plugins/_ml/mcp | Free-form investigation tools in Phase 2 |
This tool is a PRR + within-class disproportionality screener. It is comparable in statistical content to OpenVigil's PRR/ROR output, and exceeds it with EBGM/EB05 and Mantel-Haenszel. What it is not:
| Limitation | Impact |
|---|---|
| Single-variable stratification only | Stratified PRR (MH) runs by default on reporter_type (highest-yield FAERS confounder). Override by setting STRATIFY_PRR to age, sex, or reporter_type, or disable it by setting STRATIFY_PRR to an empty value. Cross-stratification (age × sex) is not supported. |
| Age banding requires year-unit age data | ZIP-ingested data now normalises age_cod (DEC/YR/MON/WK) to years. API-ingested data uses years natively. Data ingested before this fix should be re-indexed for accurate age bands. |
| No exposure normalisation | PRR measures reporting rate, not incidence |
| FAERS structural biases | Duplicate reports, Weber effect, notoriety/litigation bias (relevant for rofecoxib), stimulated reporting, co-medication confounding — all inherent to spontaneous reporting |
| Drug's top-N reactions capped at 50 | Reactions ranked >50 in the drug's profile are not tested |
| MH stratifies by quarter only (within-class) | Comparator drugs within a quarter are pooled. Full stratification by (quarter × comparator) requires per-comparator counts in the index. |
| BH-FDR uses Yates-conservative p-values | Over-conservative (fewer false signals) — exact Fisher p-values would be more standard. |
Two independent code paths (our OpenSearch pipeline vs the FDA's own API) computing the same PRR formula on overlapping data:
uv run python scripts/benchmark_vs_openvigil.py benchmark semaglutide
Note: The delta percentages below require the full ZIP-ingested 2004–2026 dataset (faers_zip_indexer --all-drugs). The quick-demo API load (faers_indexer --limit 6000) only covers a small subset and will produce different PRR values.
| Drug | Drug-specific signals | Median PRR Δ | Within 10% | Verdict |
|---|---|---|---|---|
| semaglutide (82K reports) | 2 | 1.5% | 100% | ✅ Formula validated |
| warfarin (135K reports) | 1 | 43.1% | 0% |
Semaglutide — mechanism-specific signals (100% within 10%):
| Reaction | PRR (ours) | PRR (openFDA) | Δ% |
|---|---|---|---|
| IMPAIRED GASTRIC EMPTYING | 87.38 | ~86 | ~1.6% ✅ |
| ERUCTATION | 17.04 | ~16.8 | ~1.5% ✅ |
Warfarin — INR INCREASED delta explained:
| Reaction | PRR (ours) | PRR (openFDA) | Δ% | Note |
|---|---|---|---|---|
| INR INCREASED | 71.53 | 125.79 | 43.1% | Pre-DOAC era effect |
The 43.1% delta on INR INCREASED is a known data-coverage characteristic, not a formula error:
- Our local extract covers 2004–2026 all-drugs, including the pre-DOAC era (2004–2012) when warfarin dominated anticoagulation and INR monitoring was ubiquitous across many patient populations.
- This inflates the background (non-warfarin) INR INCREASED rate in our denominator, lowering the PRR.
- openFDA's reference computes against its own backend which may weight the 2018+ data differently.
- Warfarin-specific clinical signals (HAEMORRHAGE, EPISTAXIS, ECCHYMOSIS) show Δ < 10% as expected — the formula is correct for non-monitoring reactions.
- Web UI: http://localhost:8080 (real-time streaming briefing)
- OpenSearch Dashboards: http://localhost:5601 (admin / Pharma@2024!)
- Phoenix traces (optional): instruments LangChain/LangGraph calls
# Enable Phoenix tracing
docker compose --profile observability up -d phoenix
uv sync --extra observability
# Traces appear at http://localhost:6006
Phoenix is not started by default: docker compose up -d runs the pipeline without it. The agent silently skips tracing if Phoenix is not reachable.
- PRR — correct 2×2 table, per-reaction baseline, no rank truncation
- ROR — WHO/Uppsala standard alongside PRR
- PRR/ROR 95% confidence intervals (log-normal, Evans 2001)
- Benjamini–Hochberg FDR correction (m = all reactions tested)
- EBGM / EB05 — Gamma-Poisson Shrinker (DuMouchel 1999)
- Yates χ² significance annotation
- Within-class disproportionality — Mantel–Haenszel + Robins–Breslow–Greenland CI
- FDA label cross-reference — MedDRA LLT-expanded, negation-aware, sentence-scoped
- Three-state label match (Yes / Possible / No)
- PubMed literature evidence
- Two-phase investigation — Python Phase 1 (deterministic) + LLM Phase 2 (free-form)
- Deterministic table rendering — numbers never re-typed by model
- Signal registry — OpenSearch ML Memory (cross-run persistence)
- Polars ingestion — 3× less memory, handles AERS + FAERS formats
- Full FAERS 2004–2026 (historical + current)
- openFDA independent benchmark (1.7% median Δ on mechanism signals)
- Web UI — FastAPI + SSE streaming, dark-mode
- GitHub Actions CI — 163 pure-function tests across 12 modules + schema smoke-import on every push
- Stratified PRR: Mantel-Haenszel by reporter_type (default), age or sex (set STRATIFY_PRR env var)
- BCPNN / IC / IC025: WHO Uppsala standard (Bate 1998 / Norén 2006)
- Configurable comparators: config/comparators.yaml + auto-discovery via RxClass ATC
- Cross-stratification (age × sex × reporter_type simultaneously)
Cloud / API version — built for the Google Cloud Rapid Agent Hackathon (Elastic track), uses managed cloud services and Gemini API:
→ google-cloud-rapid-agent-hackathon
For research purposes only. PRR signals are statistical associations, not causal evidence. No regulatory decisions should be made based solely on this tool's output. Requires clinical validation before any regulatory action.
Statistics: All numeric values (PRR, ROR, EBGM, 95% CI, BH q-value, MH rate ratio, counts) are computed deterministically by Python and are fully reproducible. Formulas follow EMA/813938/2011 and DuMouchel (1999).
LLM narrative: Classification labels (DRUG_SPECIFIC / CLASS_EFFECT) and Key Findings text are generated by Qwen3.5-9B. They are advisory and non-deterministic — the same data may produce different wording across runs.