al-Zamakhshari/drug-safety-signal-agent

Drug Safety Signal Agent — Local Edition

Detect pharmacovigilance signals from FDA FAERS adverse event data.
Runs entirely on your laptop. No API keys required. No cloud. No licenses.

git clone https://github.com/al-Zamakhshari/drug-safety-signal-agent
cd drug-safety-signal-agent
cp .env.example .env                    # step 0: copy config (defaults work out of the box)
uv sync
docker compose up -d                    # start OpenSearch
ollama pull gemma4:12b-mlx              # pull Gemma 4 12B MLX ~10GB (once)
./ingestion/download_faers.sh           # downloads FAERS 2018–2026 to ~/faers_data/
uv run python -m ingestion.faers_zip_indexer --dir ~/faers_data --all-drugs
uv run python -m ingestion.discover_comparators --drug <your-drug>  # per drug you want
uv run python -m ingestion.compute_class_ratio
uv run python -m ingestion.register_mcp_tools
uv run python -m app.server            # → http://localhost:8080

What It Does

Detects drug safety signals from FDA FAERS adverse event reports using a fully local pipeline — no cloud services, no API keys, no licenses.

| Stage | Method | Technology |
|---|---|---|
| Signal detection | PRR + ROR + 95% CI + BH-FDR + EBGM/EB05 | OpenSearch aggregations |
| Within-class comparison | Mantel–Haenszel stratified rate ratio + CI | OpenSearch faers_ml_rates |
| Label cross-reference | MedDRA LLT-expanded, negation-aware token-overlap | openFDA API |
| Literature evidence | PubMed search | NCBI eUtils |
| Investigation | Two-phase function calling: class effect / trend / temporal emergence | Qwen3.5-9B |
| Signal memory | Cross-run persistence | OpenSearch ML Memory (3.6+) |
| Web interface | Real-time streaming briefing | FastAPI + SSE |

Example output — semaglutide, 82,700 reports, 18M baseline (2004–2026):

### PRR + ROR + EBGM + BCPNN Signals (EMA/FDA/WHO standards)
| Reaction                        | PRR (95% CI)       | ROR (95% CI)       | EBGM / EB05   | IC / IC025  | n     | Label? |
|---------------------------------|--------------------|--------------------|---------------|-------------|-------|--------|
| IMPAIRED GASTRIC EMPTYING       | 84.94 (81.5–88.5)  | 88.16 (84.5–91.9)  | 57.0 / 55.2 ✓ | 5.8 / 5.8 ✓ | 3,057 | Yes    |
| GLYCOSYLATED HB INCREASED       | 11.80 (11.1–12.5)  | 11.95 (11.2–12.7)  | 11.2 / 10.7 ✓ | 3.5 / 3.4 ✓ | 1,111 | No ⚠️  |
| PANCREATITIS                    | 6.81  (6.47–7.16)  | 6.91  (6.56–7.28)  | 6.6  / 6.3  ✓ | 2.7 / 2.6 ✓ | 1,504 | Yes    |
| BLOOD GLUCOSE DECREASED         | 7.25  (6.87–7.66)  | 7.36  (6.96–7.77)  | 7.0  / 6.7  ✓ | 2.8 / 2.7 ✓ | 1,311 | No ⚠️  |

Risk: HIGH  |  Action: ESCALATE

Pipeline Flow

flowchart TD
    start([drug name]) --> RN

    RN["🔍 resolve_names\nRxNorm API → all brand/generic names"]
    RN --> LM

    LM["🧠 load_memory\nOpenSearch ML Memory\nprior run findings → PERSISTENT tags"]
    LM --> PRR

    PRR["📊 calculate_prr\nPRR + ROR + 95% CI\nBH-FDR + EBGM/EB05\nOpenSearch filters agg"]
    PRR --> AD

    AD["📈 anomaly_detection\nMantel–Haenszel rate ratio\nvs comparator class\nquarterly strata"]
    AD --> FL

    FL["📋 fetch_label\nopenFDA label text\nMedDRA LLT synonyms cached"]
    FL --> R1

    R1{"robust signal\nAND unlabeled\nAND PRR >= 3?"}
    R1 -->|Yes| SL
    R1 -->|No| R2

    SL["📚 search_lit\nPubMed API\ntop 3 unlabeled signals"]
    SL --> R2

    R2{"PRR >= 5\nAND n >= 10\nAND robust CI\nAND FDR q < 0.05\nAND unlabeled?"}
    R2 -->|Yes| INV
    R2 -->|No| CS

    INV["🔬 investigate\nPhase 1: Python (tools + ratio)\nPhase 2: LLM free-form exploration"]
    INV --> CS

    CS["🔁 classify_signals\nCross-run lifecycle diff\nNEW - VALIDATED - DISMISSED\nCI-overlap test vs prior run\nagent-signal-runs index"]
    CS --> WR

    WR["✍️ write_report\nGemma 4 12B MLX\nnarrative prose only\nnumbers never re-typed by model"]
    WR --> SM

    SM["💾 save_memory\nOpenSearch ML Memory\npersist text trail for next run"]
    SM --> out([Briefing])

    style PRR fill:#dbeafe,stroke:#3b82f6
    style AD fill:#dbeafe,stroke:#3b82f6
    style FL fill:#dbeafe,stroke:#3b82f6
    style SL fill:#dbeafe,stroke:#3b82f6
    style RN fill:#dbeafe,stroke:#3b82f6
    style LM fill:#dbeafe,stroke:#3b82f6
    style CS fill:#dbeafe,stroke:#3b82f6
    style SM fill:#dbeafe,stroke:#3b82f6
    style INV fill:#fef3c7,stroke:#f59e0b
    style WR fill:#fef3c7,stroke:#f59e0b

Blue nodes = deterministic Python — same input always produces the same output.
Yellow nodes = Gemma 4 12B MLX (via Ollama) — advisory, non-deterministic, clearly labelled in the briefing.


Architecture

flowchart TB
    subgraph INFRA["🐳 Infrastructure  (Docker Compose)"]
        OS[("OpenSearch 3.6.0\nfaers_reports — 18M docs (2004–2026)\nfaers_ml_rates — 17K class-ratio docs\nML Memory + agent-signal-runs — lifecycle registry")]
        QW["Gemma 4 12B MLX\nOllama (native MLX)\n10 GB - Apple Silicon / GPU"]
        PX["Arize Phoenix\nOTLP traces - port 4317"]
    end

    subgraph STATS["📐 Statistics Layer  (Python — deterministic)"]
        P1["PRR + ROR\nCorrect 2×2, non-exposed denominator\nper-reaction baseline via filters agg"]
        P2["95% CI + BH-FDR\nEvans 2001 log-normal\nBenjamini-Hochberg q across all m tested"]
        P3["EBGM / EB05\nGamma-Poisson Shrinker\nDuMouchel 1999 - FDA MGPS standard"]
        P4["Mantel–Haenszel\nQuarterly strata\nRobins–Breslow–Greenland variance"]
        P5["Label matching\nMedDRA LLT + negation + direction\n3-state: Yes / Possible / No ⚠️"]
    end

    subgraph LLM_LAYER["🤖 LLM Layer  (Gemma 4 12B MLX — advisory)"]
        L1["Investigation\nPhase 1: Python calls tools + ratio (deterministic)\nLLM: one INSIGHT sentence per signal\nPhase 2: free-form exploration\nTools: get_prr, check_class_effect\nget_signal_trend, compare_time_periods"]
        L2["Report writing\nKey Findings narrative only\nNo numbers re-typed\n_DISCLAIMER always appended"]
    end

    OS -->|aggregations| P1
    OS -->|per-quarter counts| P4
    P1 --> P2 --> P3
    P3 -->|signals + CI + EBGM| L1
    P4 -->|within-class ratios| L1
    P5 -->|labeled / unlabeled| L1
    L1 --> L2
    QW -.->|serves models| L1 & L2
    PX -.->|traces all LLM calls| L1 & L2

Signal Detection Ladder

Every candidate signal climbs five rungs before triggering investigation. Each rung addresses a different failure mode of the previous one.

flowchart LR
    R0["FAERS reports\n(raw counts)"]

    R0 -->|n >= 3\neliminate\nsingle-event noise| R1

    R1["PRR / ROR\npoint estimate\n2×2 table"]

    R1 -->|Evans 2001\nlog-normal CI\nHaldane +0.5 for zero cells| R2

    R2["95% CI\nPRR lower > 1.0\n= robust gate\npenalises small n"]

    R2 -->|Yates χ²\nper-test p-value| R3

    R3["χ² significance\nannotation ✓/~\nnot a gate — surfaced\nfor human review"]

    R3 -->|Benjamini-Hochberg\nm = ALL reactions tested\nnot just PRR>=2| R4

    R4["BH q-value\n< 0.05 gate\ncontrols FDR\nacross ~50 reactions"]

    R4 -->|DuMouchel 1999\nGPS mixture prior\nfit across all drug reactions| R5

    R5["EBGM / EB05\nEB05 >= 2 = FDA flag\nshrinks PRR=15 at n=3\ndown to EB05~1.1"]

    R5 -->|All five passed| GATE{{"Investigation\ngate"}}

    GATE -->|unlabeled\nPRR >= 5| INV["🔬 LLM investigates"]
    GATE -->|all signals| TABLE["📊 Report table"]

    style GATE fill:#dcfce7,stroke:#16a34a

Investigation Flow

For each strong unlabeled signal, the pipeline runs two sequential phases:

flowchart TD
    SIG["Strong unlabeled signal\nPRR >= 5, n >= 10, robust CI, FDR q < 0.05"]
    SIG --> LOOP

    subgraph LOOP["Per-signal Python loop — one iteration per reaction"]
        direction TB

        subgraph P1["Phase 1: Python — always runs (no LLM for arithmetic)"]
            T1["_get_prr(drug, reaction)\nDirect OpenSearch call"]
            T2["_get_prr × comparators\nmin(comparator_prrs) → lowest"]
            T3["get_signal_trend\nquarterly timeline"]
            T1 --> T2 --> T3
            T3 --> C1{"drug_prr / lowest_comp > 5?\n(pure Python arithmetic)"}
            C1 -->|Yes| DS["DRUG_SPECIFIC"]
            C1 -->|No| CE["CLASS_EFFECT"]
            DS & CE --> INS["LLM: one INSIGHT sentence\n(thinking=OFF, 150 tokens)"]
        end

        INS --> P2CHECK

        P2CHECK{"DRUG_SPECIFIC\nor ratio > 5?\n(Phase-1 gate)"}

        subgraph P2["Phase 2: Free-form LLM — conditional"]
            FT["Model chooses tools freely\ncompare_time_periods: EMERGING/GROWING\nAD tools: which time window spiked\nget_prr: alternative name variants\ncheck_class_effect: other drug classes"]
        end

        P2CHECK -->|Yes| P2
        P2CHECK -->|No| OUT
        P2 --> OUT

        OUT["CLASSIFICATION / TREND / INSIGHT\nall numbers from Python"]
    end

Why Python for Phase 1 arithmetic? Routing min(comparator_prrs) and ratio > 5 through an LLM introduces failure modes independent of model quality — any model can misidentify the minimum over 3 numbers under the right context conditions. Python is deterministic. Phase 2 (open-ended tool selection, time-window reasoning, hypothesis synthesis) is where LLM reasoning genuinely adds value and stays.


Design Decisions & Rationale

1. Python owns all statistics; LLM only writes prose

The briefing table is rendered by Python — write_report passes pre-computed numbers as a JSON struct to the model, explicitly instructing it never to re-type them. The model's job is prose structure, not arithmetic.

Why: Language models hallucinate numbers, especially ratios and small-n statistics. A PRR of 82.72 must be computed, not narrated. The _DISCLAIMER on every briefing makes the LLM/Python boundary explicit to the reader.


2. Four estimators (PRR + ROR + EBGM + BCPNN) — each fixes a failure mode of the others

PRR alone:   correct formula, but PRR=15 on n=3 looks as confident as PRR=15 on n=3000
+ 95% CI:   lower bound penalises small n — PRR=15 on n=3 gets CI crossing 1.0 (not robust)
+ BH-FDR:   corrects for testing ~50 reactions simultaneously — controls false discovery rate
+ EBGM:     Bayesian shrinkage (FDA MGPS) — PRR=15 on n=3 → EB05=1.1 (not flagged)
            PRR=15 on n=3000 → EB05=14.5 (correctly flagged)
+ BCPNN IC: WHO Uppsala complement — IC025 > 0 is the WHO signal flag
            Both GPS and BCPNN agree on strong signals; disagreement at small n is informative

EBGM (DuMouchel 1999) fits a 2-component Gamma-Poisson mixture prior across all (observed, expected) pairs for the drug. BCPNN (Bate 1998 / Norén 2006) uses a Beta-Binomial prior via the closed-form IC formula. ROR is the WHO/Uppsala Monitoring Centre complement to PRR — they agree for rare reactions and diverge for common ones, which is itself a diagnostic signal.


3. Mantel–Haenszel instead of naive count pooling

Semaglutide's FAERS volume grew from ~100 reports/quarter (2018) to ~15,000/quarter (2025). Naive pooling sums counts across all quarters and divides — this weights high-volume recent quarters so heavily that they dominate the denominator, creating bias when comparator reporting rates also changed over the same period.

MH stratifies by quarter: each quarter is a stratum with its own (drug_count, drug_total, comp_count, comp_total). The Mantel–Haenszel weighted estimate RR_MH = Σ(a_k·n2_k/N_k) / Σ(c_k·n1_k/N_k) weights each stratum by its information content, not its volume. In a constructed confounded scenario (comp rate 20%→1% over 8 quarters), naive pooling gives RR=7.4 while MH correctly recovers RR=2.0.
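A minimal sketch of the difference, using made-up counts rather than FAERS data. Each stratum is a tuple `(a, n1, c, n2)` in the notation above, and the drug's rate is exactly twice the comparator's in both quarters. The function names and the `quarters` data are illustrative assumptions, not the repository's code.

```python
from typing import Iterable, List, Tuple

# (a, n1, c, n2): drug count, drug total, comparator count, comparator total
Stratum = Tuple[int, int, int, int]


def mh_rate_ratio(strata: Iterable[Stratum]) -> float:
    """Mantel-Haenszel rate ratio: sum(a*n2/N) / sum(c*n1/N) over strata."""
    num = den = 0.0
    for a, n1, c, n2 in strata:
        n = n1 + n2
        if n == 0:
            continue  # an empty quarter contributes nothing
        num += a * n2 / n
        den += c * n1 / n
    if den == 0:
        raise ValueError("no comparator events in any stratum")
    return num / den


def pooled_rate_ratio(strata: Iterable[Stratum]) -> float:
    """Naive pooling: sum all counts first, then take one ratio of rates."""
    rows: List[Stratum] = list(strata)
    a = sum(r[0] for r in rows)
    n1 = sum(r[1] for r in rows)
    c = sum(r[2] for r in rows)
    n2 = sum(r[3] for r in rows)
    return (a / n1) / (c / n2)


# Hypothetical data: drug volume grows 100x while the comparator rate
# falls from 20% to 1%. Within each quarter the true ratio is 2.0.
quarters: List[Stratum] = [
    (40, 100, 2000, 10_000),     # early quarter: 40% vs 20%
    (200, 10_000, 100, 10_000),  # late quarter:   2% vs  1%
]

print(f"MH:     {mh_rate_ratio(quarters):.2f}")
print(f"pooled: {pooled_rate_ratio(quarters):.2f}")
```

In this toy case the stratified estimate recovers the within-quarter ratio of 2.0, while the pooled estimate falls below 1.0 because the drug's reports are concentrated in the low-rate quarter.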


4. BH-FDR at m = all reactions tested, not just PRR ≥ 2

A common implementation error: apply BH only to signals that already passed the PRR≥2 filter. This understates the multiple-comparison burden — if you tested 50 reactions and only 20 passed PRR≥2, the correct m is 50, not 20. We compute χ² p-values for every reaction with n≥3, run BH across all m, then apply the PRR≥2 filter afterwards.
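The effect of choosing m can be sketched with a small Benjamini–Hochberg helper. The p-values below are hypothetical and `bh_qvalues` is an illustrative name; the repository's implementation may differ.

```python
from typing import List, Sequence


def bh_qvalues(p_values: Sequence[float]) -> List[float]:
    """Benjamini-Hochberg q-values, returned in the same order as the input."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so q-values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        q[i] = running_min
    return q


# Hypothetical p-values: the first two reactions have PRR >= 2, the rest do not.
p_passed_filter = [0.01, 0.04]
p_all_tested = [0.01, 0.04, 0.50, 0.90]

q_wrong = bh_qvalues(p_passed_filter)   # m = 2: filter first, then BH
q_right = bh_qvalues(p_all_tested)[:2]  # m = 4: BH first, then filter

print(q_wrong, q_right)
```

With these numbers the second reaction clears q < 0.05 when m is understated and fails it when m counts every reaction tested.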


5. Phase 1 classification is Python, not LLM

The ratio drug_prr / min(comparator_prrs) and the > 5 threshold are pure arithmetic over a 3-element dict. Routing them through an LLM introduces a whole class of failure (misidentified minimum, wrong threshold application) that is independent of model quality — any model can misclassify under the right context conditions. Python cannot.

Moving this to Python produced a concrete, measurable improvement: IMPAIRED GASTRIC EMPTYING (ratio 22.46x) and ERUCTATION (13.21x) were previously misclassified as CLASS_EFFECT by the LLM; Python correctly flags them DRUG_SPECIFIC, which triggers Phase 2 deep investigation on exactly the signals that warrant it.

The LLM's Phase 1 job is now exactly one sentence: the INSIGHT (clinical interpretation, grounded in the pre-computed facts). thinking=OFF, max_tokens=150. Phase 2 (free-form tool selection, hypothesis synthesis) is where reasoning genuinely earns its cost.
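The arithmetic described above fits in a few lines. This is a sketch under assumptions: `classify_phase1`, the comparator names, and the comparator PRR values are hypothetical, and rejecting a non-positive comparator PRR is a choice made here for illustration.

```python
from typing import Mapping, Tuple


def classify_phase1(
    drug_prr: float,
    comparator_prrs: Mapping[str, float],
    threshold: float = 5.0,
) -> Tuple[str, float]:
    """Return (label, ratio) where ratio = drug_prr / min(comparator_prrs)."""
    if not comparator_prrs:
        raise ValueError("need at least one comparator PRR")
    lowest = min(comparator_prrs.values())
    if lowest <= 0:
        raise ValueError("comparator PRR must be positive")
    ratio = drug_prr / lowest
    label = "DRUG_SPECIFIC" if ratio > threshold else "CLASS_EFFECT"
    return label, ratio


# Hypothetical comparator PRRs for one reaction.
comparators = {"comparator_a": 4.0, "comparator_b": 9.5, "comparator_c": 6.1}

print(classify_phase1(84.94, comparators))  # ratio above 5
print(classify_phase1(12.0, comparators))   # ratio below 5
```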


6. Per-signal Python loop

The first implementation used a single prompt listing all signals. Thinking models pre-reason all of them in the <think> block and extrapolate results for signals 2–3 from training knowledge. The fix: one iteration per signal in a Python for loop. Phase 1 tool calls are now direct async Python calls (no LLM context involved). Phase 2 uses a fresh ReAct agent context per signal, preventing cross-signal contamination.


7. Why OpenSearch

OpenSearch 3.6 (Apache 2.0) provides everything the pipeline needs in a single, self-hosted, zero-cost stack: the filters aggregation for per-reaction baseline without top-N truncation, ML Memory for cross-run signal persistence, DataDistributionTool for time-period analysis, and a built-in MCP server for free-form investigation. Running locally means no data leaves the machine — important for any work involving patient-level adverse event reports.


8. Haldane–Anscombe +0.5 instead of a hard 999 sentinel

The original code used class_ratio = 999.0 when a reaction appeared in the drug but in zero comparators. This was statistically meaningless — 999 is not a ratio, it's an error code stored as a float. It polluted the output table and made the sort order nonsensical.

Haldane–Anscombe replaces the zero comparator count with 0.5 before computing the ratio: class_rate = 0.5 / comp_total. This gives a finite, large ratio whose 95% CI is naturally wide (because the 1/c term in the SE formula becomes 1/0.5 = 2.0). The CI gate (lower > 1.0) then decides whether the signal is robust given the uncertainty — it usually isn't for zero-comparator reactions unless the drug count is very large.
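A sketch of the correction with hypothetical counts. The standard error here reuses the log-normal form shown for PRR in Statistical Methods; that the within-class ratio uses the same form is an assumption, and `class_ratio_with_ci` is an illustrative name.

```python
import math
from typing import Tuple


def class_ratio_with_ci(
    drug_count: int,
    drug_total: int,
    comp_count: int,
    comp_total: int,
    z: float = 1.96,
) -> Tuple[float, float, float]:
    """Rate ratio vs comparators; a zero comparator count is replaced by 0.5."""
    if drug_count <= 0 or drug_total <= 0 or comp_total <= 0:
        raise ValueError("drug_count and both totals must be positive")
    c = comp_count if comp_count > 0 else 0.5  # Haldane-Anscombe correction
    ratio = (drug_count / drug_total) / (c / comp_total)
    # With c = 0.5 the 1/c term alone is 2.0, which widens the interval.
    se = math.sqrt(1 / drug_count - 1 / drug_total + 1 / c - 1 / comp_total)
    lower = math.exp(math.log(ratio) - z * se)
    upper = math.exp(math.log(ratio) + z * se)
    return ratio, lower, upper


# Hypothetical: 4 drug reports out of 1,000; 0 comparator reports out of 2,000.
ratio, lower, upper = class_ratio_with_ci(4, 1000, 0, 2000)
robust = lower > 1.0  # the CI gate decides, not the point estimate
print(f"ratio={ratio:.1f} CI=({lower:.2f}, {upper:.1f}) robust={robust}")
```

The point estimate is finite and large, but the interval spans 1.0, so this hypothetical signal would not pass the robust gate.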


9. MedDRA LLT expansion for label matching

FDA label text uses clinical prose; FAERS uses MedDRA Preferred Terms. The mismatch creates false "No ⚠️" flags that cascade through the pipeline: unlabeled signals trigger literature search and LLM investigation.

Example: FDA label says "delays gastric emptying" — MedDRA PT is "IMPAIRED GASTRIC EMPTYING". A pure string match misses this. The fix: fetch official MedDRA Lower-Level Terms from openFDA (free, no license), cache them locally, and try each LLT as a matching candidate. "Delays" matches "impaired" via the synonym dictionary + LLT expansion.

LLT mappings are cached in .meddra_llt_cache.json (repo root) after the first fetch, so subsequent runs work fully offline. This file is gitignored — it is populated automatically and safe to delete to force a refresh.


10. HTTP retry with exponential backoff

All external API calls (RxNorm, openFDA, PubMed/NCBI) retry up to 3× on transient failures (429, 500, 502, 503, 504, connection errors) with delays 1s → 2s → 4s. The pipeline has seen all of these in the wild — NCBI rate-limits free-tier requests, openFDA occasionally returns 503 under load.

No external dependency (tenacity is not required): implemented as a tight async loop in each _get() helper. Falls back cleanly on httpx.HTTPStatusError for non-retryable errors (e.g. 400 bad request, 404 not found).
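The retry loop can be sketched with the standard library alone. `with_retry` and `TransientError` are hypothetical names standing in for the repository's `_get()` helpers and for whatever the HTTP client raises on 429/5xx or connection errors; the injectable `sleep` exists only to make the sketch testable.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Stand-in for a retryable failure (429, 5xx, or a connection error)."""


async def with_retry(
    call: Callable[[], Awaitable[T]],
    retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], Awaitable[None]] = asyncio.sleep,
) -> T:
    """Await call(); on TransientError back off 1s, 2s, 4s and try again."""
    for attempt in range(retries + 1):
        try:
            return await call()
        except TransientError:
            if attempt == retries:
                raise  # retries exhausted, surface the last error
            await sleep(base_delay * 2 ** attempt)
    raise AssertionError("unreachable")
```

Any other exception type propagates immediately, which mirrors the non-retryable path (400, 404) described above.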


11. Multi-worker memory container safety

get_or_create_memory() uses PUT /{index}/_create/{drug} (idempotent, conflict-safe) rather than POST /_doc (auto-ID). Under concurrent uvicorn workers, two workers could race to create an ML Memory container for a new drug — the _create op returns 409 if a document already exists, and the loser re-reads the winner's memory_id. No orphaned containers, no duplicate state.


Stack

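The core pipeline runs locally. No paid or cloud services are required; beyond the one-time FAERS download, the only outbound calls are to free public APIs (RxNorm, openFDA, PubMed/NCBI).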

| Component | Technology | License |
|---|---|---|
| Database | OpenSearch 3.6.0 | Apache 2.0 |
| LLM | Gemma 4 12B MLX via Ollama (~10GB) via Docker Model Runner | Apache 2.0 |
| Agent framework | LangGraph | MIT |
| Web UI | FastAPI + SSE streaming | MIT |
| Ingestion | Polars (3× less memory than pandas) | MIT |
| Observability | Arize Phoenix (optional) | Apache 2.0 |
| Data | FDA openFDA API + PubMed + FDA FAERS ZIPs | Public domain |

Requirements

  • Docker Desktop with Model Runner enabled
  • Python 3.11+ with uv
  • 16GB RAM recommended (OpenSearch 1.5GB + Gemma 4 12B MLX ~10GB)
  • Ollama installed (provides native MLX inference on Apple Silicon)
  • ~10GB disk for full FAERS 2018–2026 dataset

Optional: Set GOOGLE_API_KEY (free at aistudio.google.com) to upgrade the investigator to a cloud LLM via Google AI Studio (e.g. gemini-2.0-flash). Set INVESTIGATOR_MODEL to override the local model path. If the Google API returns a 500/503/429 error, the investigator automatically falls back to the local Qwen3.5-9B for that signal — results remain valid but advisory quality may vary between runs.


Quick Start

# 1. Install dependencies
uv sync

# 2. Start infrastructure (pulls Qwen3.5-9B ~5.6GB on first run)
docker compose up -d

# 3. Load FAERS data

# Quick demo — 5 min via openFDA API, no download needed
uv run python -m ingestion.faers_indexer --drug semaglutide --limit 6000
uv run python -m ingestion.faers_indexer --drug rofecoxib --limit 2000

# Full dataset — 2018–2026, ~12M reports, ~1 hour + download
./ingestion/download_faers.sh
uv run python -m ingestion.faers_zip_indexer --dir ~/faers_data --all-drugs

# Full history — adds 2004–2017 (rofecoxib peak period, ~2.8GB more)
./ingestion/download_faers_historical.sh
uv run python -m ingestion.faers_zip_indexer --dir ~/faers_data --all-drugs

# 4. Compute within-class disproportionality (one-time)
# For new drugs: first auto-discover comparators via RxClass ATC, then build index
uv run python -m ingestion.discover_comparators --drug <your-drug>   # optional: auto-populates config/comparators.yaml
uv run python -m ingestion.compute_class_ratio                        # builds faers_ml_rates for all drugs in comparators.yaml

# 5. Register OpenSearch MCP tools (one-time, enables free-form investigation)
uv run python -m ingestion.register_mcp_tools

# 6a. Web UI  (SSE streaming dashboard)
uv run python -m app.server          # → http://localhost:8080

# Web server endpoints:
#   GET /                          → redirects to /static/index.html (streaming dashboard)
#   GET /analyze?drug=semaglutide  → SSE stream of pipeline events + final briefing
#   GET /api/briefing/semaglutide  → REST: full structured JSON response:
#                                    {drug_name, drug_names, drug_total, faers_total,
#                                     prr_signals, anomaly_signals, literature,
#                                     investigation, signal_status, briefing, error}
#   GET /health                    → {"status": "ok"}

# 6b. CLI
uv run python main.py semaglutide
uv run python main.py rofecoxib      # retrospective: recalled 2004 for MI risk

Statistical Methods

PRR / ROR — the 2×2 table

a = drug reports with reaction        b = drug reports without reaction
c = non-drug reports with reaction    d = non-drug reports without reaction

PRR = (a/(a+b)) / (c/(c+d))     — EMA/813938/2011 standard
ROR = (a·d) / (b·c)             — WHO/Uppsala standard

SE_PRR = √(1/a − 1/(a+b) + 1/c − 1/(c+d))     (Evans 2001)
SE_ROR = √(1/a + 1/b + 1/c + 1/d)
95% CI = exp(ln(estimate) ± 1.96 · SE)
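The formulas above translate directly into code. This sketch uses a hypothetical 2×2 table; `prr_ror` is an illustrative name, and adding 0.5 to every cell when any cell is zero is the usual Haldane–Anscombe convention, assumed here rather than taken from the repository.

```python
import math
from typing import NamedTuple, Tuple


class Estimate(NamedTuple):
    value: float
    lower: float
    upper: float


def _with_ci(value: float, se: float, z: float = 1.96) -> Estimate:
    """Log-normal interval: exp(ln(value) +/- z * se)."""
    log_v = math.log(value)
    return Estimate(value, math.exp(log_v - z * se), math.exp(log_v + z * se))


def prr_ror(a: float, b: float, c: float, d: float) -> Tuple[Estimate, Estimate]:
    """PRR and ROR with 95% CIs from the 2x2 cell counts a, b, c, d."""
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))  # continuity correction
    prr = (a / (a + b)) / (c / (c + d))
    ror = (a * d) / (b * c)
    se_prr = math.sqrt(1 / a - 1 / (a + b) + 1 / c - 1 / (c + d))
    se_ror = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return _with_ci(prr, se_prr), _with_ci(ror, se_ror)


# Hypothetical table: 10 of 100 drug reports vs 100 of 10,000 other reports.
prr, ror = prr_ror(10, 90, 100, 9900)
print(f"PRR {prr.value:.2f} ({prr.lower:.2f}-{prr.upper:.2f})")
print(f"ROR {ror.value:.2f} ({ror.lower:.2f}-{ror.upper:.2f})")
```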

Signal criteria applied in sequence:

| Criterion | Threshold | Controls |
|---|---|---|
| Count | n ≥ 3 | Single-event noise (PRR signal table) |
| PRR | ≥ 2.0 | Effect size (EMA standard) |
| CI lower | > 1.0 | Small-n instability (PRR=15 at n=4 fails) |
| BH q-value | < 0.05 | False discovery rate (m = all reactions tested, not just PRR≥2) |
| EBGM | EB05 ≥ 2.0 | FDA MGPS threshold, a Bayesian lower bound |

Investigation gate additionally requires n ≥ 10 and PRR ≥ 5 (only strong, statistically robust, unlabeled signals trigger the LLM investigator). Within-class table requires n ≥ 5.

All thresholds are from published EMA/FDA standards — none were tuned on semaglutide.
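Taken together, the criteria and the investigation gate amount to two predicates. The sketch below is an assumption about how they compose (in particular, that the investigation gate requires every criterion in the table plus the stricter n and PRR cut-offs); the `Signal` fields are illustrative, and the q-values are placeholders because the example table does not show them.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Signal:
    reaction: str
    n: int
    prr: float
    ci_lower: float
    q_value: float
    eb05: float
    labeled: bool


def passes_signal_criteria(s: Signal) -> bool:
    """All five criteria from the table, applied in sequence."""
    return (
        s.n >= 3
        and s.prr >= 2.0
        and s.ci_lower > 1.0
        and s.q_value < 0.05
        and s.eb05 >= 2.0
    )


def passes_investigation_gate(s: Signal) -> bool:
    """Stricter gate: strong, robust and unlabeled signals only."""
    return passes_signal_criteria(s) and s.n >= 10 and s.prr >= 5.0 and not s.labeled


# Values from the example output above; q_value=0.001 is a placeholder.
hb = Signal("GLYCOSYLATED HB INCREASED", 1111, 11.80, 11.1, 0.001, 10.7, labeled=False)
pancreatitis = Signal("PANCREATITIS", 1504, 6.81, 6.47, 0.001, 6.3, labeled=True)

for s in (hb, pancreatitis):
    print(s.reaction, passes_signal_criteria(s), passes_investigation_gate(s))
```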

EBGM — Gamma-Poisson Shrinker

E[drug, reaction] = drug_total × (baseline / faers_total)    (expected under independence)

GPS mixture prior:   P(O | E) = P · NB(α₁, β₁/(β₁+E))  +  (1−P) · NB(α₂, β₂/(β₂+E))
Parameters θ = (α₁, β₁, α₂, β₂, P) fitted by MLE across all (O, E) pairs for the drug.

EBGM = exp(E[ln λ | O, E])    — posterior geometric mean
EB05 = 5th percentile of posterior lambda distribution

Within-class Disproportionality — Mantel–Haenszel

For each quarter k:  a_k = drug_count,  n1_k = drug_total
                     c_k = comp_count,  n2_k = comp_total,  N_k = n1_k + n2_k

RR_MH = Σ_k (a_k · n2_k / N_k)  /  Σ_k (c_k · n1_k / N_k)

Var[ln RR_MH] = Σ P_k R_k / (2R²) + Σ(P_k S_k + Q_k R_k)/(2RS) + Σ Q_k S_k / (2S²)
                (Robins–Breslow–Greenland 1986)

Cross-run Signal Lifecycle

Every pipeline run classifies each signal as NEW, VALIDATED, or DISMISSED by comparing against the prior run's stored data.

NEW        — reaction not seen in any prior run for this drug
VALIDATED  — 95% CIs overlap: the change from prior to current PRR is within
             sampling noise — the signal persists
DISMISSED  — current upper CI is entirely below prior lower CI:
             the signal has genuinely collapsed, not just fluctuated

The comparison uses a CI-overlap test rather than a bare percentage cliff — PRR 8.0→3.9 can still be VALIDATED if the confidence intervals overlap, while a true collapse (current upper bound < prior lower bound) triggers DISMISSED regardless of the ratio.
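The CI-overlap rule can be sketched as a small function. The interval values are hypothetical and `classify_lifecycle` is an illustrative name. One case is not specified above: a current interval lying entirely above the prior one. This sketch treats it as VALIDATED, which is an assumption.

```python
from typing import Optional, Tuple

CI = Tuple[float, float]  # (lower, upper) bounds of the 95% CI


def classify_lifecycle(current: CI, prior: Optional[CI]) -> str:
    """NEW if no prior run, DISMISSED on a true collapse, else VALIDATED."""
    if prior is None:
        return "NEW"
    _, current_upper = current
    prior_lower, _ = prior
    if current_upper < prior_lower:
        return "DISMISSED"  # current interval sits entirely below the prior one
    # Overlapping intervals, or (by assumption) a current interval above the prior.
    return "VALIDATED"


# Hypothetical intervals: a drop in the point estimate with overlapping CIs.
print(classify_lifecycle(current=(2.0, 7.5), prior=(3.5, 18.0)))
# A genuine collapse: current upper bound below prior lower bound.
print(classify_lifecycle(current=(1.1, 2.4), prior=(6.0, 10.0)))
# First time this reaction is seen.
print(classify_lifecycle(current=(2.0, 7.5), prior=None))
```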

Per-run structured state (drug, run_ts, per-reaction PRR + CIs + effect + trend + status) is persisted to the agent-signal-runs OpenSearch index. The next run loads this via load_last_run() and passes per-reaction PRR trajectory deltas to the Phase-1 investigator prompt:

PRIOR RUN SIGNAL TRAJECTORY:
  PANCREATITIS: DRUG_SPECIFIC PRR=8.2→9.1 (+11%) | PERSISTENT
  NAUSEA: CLASS_EFFECT PRR=4.3→4.5 (+5%)
  BLOOD_GLUCOSE_DECR: DRUG_SPECIFIC PRR=6.0 last run → RESOLVED (gone)

This lets the investigator reason about trajectory — whether a signal is newly emerging, persistently elevated, or resolving — not just its current point estimate.

The report header summarises the lifecycle for the run: NEW=2 · VALIDATED=5 · DISMISSED=1. PRR table reaction labels carry status badges: 🆕 (NEW), ✅ (VALIDATED), 📉 (DISMISSED).


OpenSearch 3.6.0 Features Used

| Feature | API | Purpose |
|---|---|---|
| filters aggregation | standard | Per-reaction baseline without top-N truncation |
| ML Memory | /_plugins/_ml/memory | Human-readable text trail across runs |
| agent-signal-runs index | standard | Structured per-run signal state for CI-overlap lifecycle diff |
| Built-in MCP server | /_plugins/_ml/mcp | Free-form investigation tools in Phase 2 |

Known Limitations

This tool is a PRR + within-class disproportionality screener. It is comparable in statistical content to OpenVigil's PRR/ROR output, and exceeds it with EBGM/EB05 and Mantel-Haenszel. What it is not:

| Limitation | Impact |
|---|---|
| Single-variable stratification only | Stratified PRR (MH) runs by default on reporter_type (highest-yield FAERS confounder). Override with STRATIFY_PRR=age\|sex\|reporter_type or disable with STRATIFY_PRR=. Cross-stratification (age × sex) is not supported. |
| Age banding requires year-unit age data | ZIP-ingested data now normalises age_cod (DEC/YR/MON/WK) to years. API-ingested data uses years natively. Data ingested before this fix should be re-indexed for accurate age bands. |
| No exposure normalisation | PRR measures reporting rate, not incidence |
| FAERS structural biases | Duplicate reports, Weber effect, notoriety/litigation bias (relevant for rofecoxib), stimulated reporting, co-medication confounding; all inherent to spontaneous reporting |
| Drug's top-N reactions capped at 50 | Reactions ranked >50 in the drug's profile are not tested |
| MH stratifies by quarter only (within-class) | Comparator drugs within a quarter are pooled. Full stratification by (quarter × comparator) requires per-comparator counts in the index. |
| BH-FDR uses Yates-conservative p-values | Over-conservative (fewer false signals); exact Fisher p-values would be more standard. |

Validation Against openFDA (Independent Reference)

Two independent code paths (our OpenSearch pipeline vs the FDA's own API) computing the same PRR formula on overlapping data:

uv run python scripts/benchmark_vs_openvigil.py benchmark semaglutide

Note: The delta percentages below require the full ZIP-ingested 2004–2026 dataset (faers_zip_indexer --all-drugs). The quick-demo API load (faers_indexer --limit 6000) only covers a small subset and will produce different PRR values.

Results — full 18M-report dataset (2004–2026, June 2026)

| Drug | Drug-specific signals | Median PRR Δ | Within 10% | Verdict |
|---|---|---|---|---|
| semaglutide (82K reports) | 2 | 1.5% | 100% | ✅ Formula validated |
| warfarin (135K reports) | 1 | 43.1% | 0% | ⚠️ See INR note below |

Semaglutide — mechanism-specific signals (100% within 10%):

| Reaction | PRR (ours) | PRR (openFDA) | Δ% |
|---|---|---|---|
| IMPAIRED GASTRIC EMPTYING | 87.38 | ~86 | ~1.6% |
| ERUCTATION | 17.04 | ~16.8 | ~1.5% |

Warfarin — INR INCREASED delta explained:

| Reaction | PRR (ours) | PRR (openFDA) | Δ% | Note |
|---|---|---|---|---|
| INR INCREASED | 71.53 | 125.79 | 43.1% | Pre-DOAC era effect |

The 43.1% delta on INR INCREASED is a known data-coverage characteristic, not a formula error:

  • Our local extract covers 2004–2026 all-drugs, including the pre-DOAC era (2004–2012) when warfarin dominated anticoagulation and INR monitoring was ubiquitous across many patient populations.
  • This inflates the background (non-warfarin) INR INCREASED rate in our denominator, lowering the PRR.
  • openFDA's reference computes against its own backend which may weight the 2018+ data differently.
  • Warfarin-specific clinical signals (HAEMORRHAGE, EPISTAXIS, ECCHYMOSIS) show Δ < 10% as expected — the formula is correct for non-monitoring reactions.

Observability

  • Web UI: http://localhost:8080 — real-time streaming briefing
  • OpenSearch Dashboards: http://localhost:5601 (admin / Pharma@2024!)
  • Phoenix traces (optional): instruments LangChain/LangGraph calls
# Enable Phoenix tracing
docker compose --profile observability up -d phoenix
uv sync --extra observability
# Traces appear at http://localhost:6006

Phoenix is not started by default — docker compose up -d runs the pipeline without it. The agent silently skips tracing if Phoenix is not reachable.


Roadmap

  • PRR — correct 2×2 table, per-reaction baseline, no rank truncation
  • ROR — WHO/Uppsala standard alongside PRR
  • PRR/ROR 95% confidence intervals (log-normal, Evans 2001)
  • Benjamini–Hochberg FDR correction (m = all reactions tested)
  • EBGM / EB05 — Gamma-Poisson Shrinker (DuMouchel 1999)
  • Yates χ² significance annotation
  • Within-class disproportionality — Mantel–Haenszel + Robins–Breslow–Greenland CI
  • FDA label cross-reference — MedDRA LLT-expanded, negation-aware, sentence-scoped
  • Three-state label match (Yes / Possible / No)
  • PubMed literature evidence
  • Two-phase investigation — Python Phase 1 (deterministic) + LLM Phase 2 (free-form)
  • Deterministic table rendering — numbers never re-typed by model
  • Signal registry — OpenSearch ML Memory (cross-run persistence)
  • Polars ingestion — 3× less memory, handles AERS + FAERS formats
  • Full FAERS 2004–2026 (historical + current)
  • openFDA independent benchmark (1.7% median Δ on mechanism signals)
  • Web UI — FastAPI + SSE streaming, dark-mode
  • GitHub Actions CI — 163 pure-function tests across 12 modules + schema smoke-import on every push
  • Stratified PRR — Mantel-Haenszel by reporter_type (default), age or sex (set STRATIFY_PRR env var)
  • BCPNN / IC / IC025 — WHO Uppsala standard (Bate 1998 / Norén 2006)
  • Configurable comparators — config/comparators.yaml + auto-discovery via RxClass ATC
  • Cross-stratification (age × sex × reporter_type simultaneously)

Related

Cloud / API version — built for the Google Cloud Rapid Agent Hackathon (Elastic track), uses managed cloud services and Gemini API:
google-cloud-rapid-agent-hackathon


Disclaimer

For research purposes only. PRR signals are statistical associations, not causal evidence. No regulatory decisions should be made based solely on this tool's output. Requires clinical validation before any regulatory action.

Statistics: All numeric values (PRR, ROR, EBGM, 95% CI, BH q-value, MH rate ratio, counts) are computed deterministically by Python and are fully reproducible. Formulas follow EMA/813938/2011 and DuMouchel (1999).

LLM narrative: Classification labels (DRUG_SPECIFIC / CLASS_EFFECT) and Key Findings text are generated by Qwen3.5-9B. They are advisory and non-deterministic — the same data may produce different wording across runs.
