Yorkel/atlas-ed

AtlasED / EPOCH

Production NLP system that makes education-policy discourse comparable across England, Scotland and Ireland — using a deployed BERTopic pipeline as the production spine, with NMF as a convergent-validity baseline.

The headline isn't a topic model — it's the dual-pipeline design: two structurally different methods (semantic-embedding BERTopic and lexical NMF) are run over the same corpus, and they independently reproduce the same cross-national finding. That convergence is what makes the result trustworthy rather than an artefact of one model.

Core finding: the three systems foreground genuinely different things. England foregrounds accountability & structures (Ofsted, academies), Scotland equity & rights, and Ireland teaching, curriculum & inclusion. The difference is statistically strong (Cramér's V ≈ 0.28, p ≪ 0.001), survives controlling for who is speaking (V = 0.431 within the government-only stratum), and is reproduced by both methods.


Live

  • Inference API (FastAPI on Render): https://atlased-epoch-api.onrender.com/health
  • Weekly pipeline: GitHub Actions — self-healing scrape → LLM relevance gate → inference → drift monitoring
  • Dashboard: Chart.js front end served by the API (/ + /api/data)

Architecture

weekly scrape ──► Supabase (raw)
   │  self-healing watermark (GitHub Actions)
   ▼
LLM relevance gate (gpt-4o-mini, frozen rubric, cost-capped)
   │
   ▼
BERTopic inference  ──►  Supabase (epoch_* topic tables)  ──►  dashboard + /predict API
   │  cosine vs 138 frozen topic centroids                        │
   └──────────────────────────────────────────────────────────►  drift monitoring (monthly)

  • Canonical store: one Supabase (Postgres) source of truth that every component reads from.
  • Inference: documents are chunked (~100 words), embedded (MiniLM), and scored by cosine similarity to 138 frozen topic centroids — fast, CPU-only, no GPU.
  • Serving: a Dockerised FastAPI app on Render with the embedder baked into the image (no runtime model download).
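
The inference step described above (chunk, embed, score against frozen centroids) can be sketched roughly as follows. This is a minimal illustration, not the project's code: the function names `chunk_words`, `cosine` and `nearest_centroid` are hypothetical, and toy 3-dimensional vectors stand in for real MiniLM embeddings and the 138 frozen centroids.

```python
import math


def chunk_words(text, size=100):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def nearest_centroid(embedding, centroids):
    """Return (topic_id, similarity) for the centroid closest to `embedding`."""
    scored = ((topic_id, cosine(embedding, c)) for topic_id, c in centroids.items())
    return max(scored, key=lambda pair: pair[1])


# Toy stand-ins (assumption): real centroids would be MiniLM-sized vectors.
centroids = {"topic_a": [1.0, 0.0, 0.0], "topic_b": [0.0, 1.0, 0.0]}
chunk_embedding = [0.9, 0.1, 0.0]

best_topic, score = nearest_centroid(chunk_embedding, centroids)
print(best_topic, round(score, 3))
```

Because the centroids are frozen, scoring a new chunk is only a handful of dot products, which is consistent with the CPU-only claim above.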

The two pipelines

Production spine — BERTopic

MiniLM sentence-embeddings → UMAP → HDBSCAN → c-TF-IDF → frozen centroids → cosine inference

Three per-country models (England 75 / Scotland 30 / Ireland 33 topics) plus a combined model, rolled up to a curated 20-category crosswalk. Chosen for the comparative task: a shared embedding space lets equivalents align despite different national vocabulary (England SEND ↔ Scotland ASN at cosine 0.78).
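
As a rough sketch of how a crosswalk roll-up might work, the example below maps per-country topic IDs onto shared categories and reports within-country shares rather than raw counts. The topic IDs, category names and the `CROSSWALK` table are invented for illustration; the project's actual 20-category crosswalk is not reproduced here.

```python
from collections import Counter

# Hypothetical crosswalk: (country, topic_id) -> shared category.
# IDs and category names are illustrative assumptions only.
CROSSWALK = {
    ("eng", 12): "Special & additional needs",   # e.g. an England SEND topic
    ("sco", 4): "Special & additional needs",    # e.g. a Scotland ASN topic
    ("eng", 3): "Accountability & inspection",
    ("sco", 9): "Equity & rights",
}


def category_shares(assignments):
    """Turn (country, topic_id) assignments into {country: {category: share}}."""
    counts = {}
    for country, topic_id in assignments:
        category = CROSSWALK.get((country, topic_id))
        if category is None:
            continue  # skip topics with no crosswalk entry
        counts.setdefault(country, Counter())[category] += 1
    return {
        country: {cat: n / sum(c.values()) for cat, n in c.items()}
        for country, c in counts.items()
    }


docs = [("eng", 3), ("eng", 3), ("eng", 3), ("eng", 12), ("sco", 4), ("sco", 9)]
shares = category_shares(docs)
print(shares)
```

Normalising within each country keeps a large corpus from dominating the comparison simply because it contains more documents.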

Baseline / control — NMF

Country-specific TF-IDF + NMF models. Transparent, lexical, CPU-cheap. Retained deliberately — not as a discarded first attempt — as the independent check.

Why both

NMF, sharing no architecture with BERTopic, reproduces the same national fingerprints → the cross-national finding is method-robust, not a BERTopic artefact. This is convergent validity, and it's the project's real contribution.


Key results

  • Three national registers, triple-validated (by measure, by actor, by method).
  • Statistical: χ² p ≪ 0.001, Cramér's V ≈ 0.284, bootstrap CIs that don't overlap across countries.
  • Robustness: leave-one-source-out shifts no category > ~3pp; the country effect survives within the government-only stratum (V = 0.431).
  • Diversity: > 0.93 per country; inference agrees ~71% with an LLM judge (Claude Haiku, at category level) and ~86% with the model's own HDBSCAN labels.
  • Monitoring: content-drift vs model-drift separated — current verdict no retrain (fit stable across 14 quarters).
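
For readers unfamiliar with the effect size quoted above, Cramér's V for a country × category contingency table is V = sqrt(χ² / (n · (min(r, c) − 1))). The sketch below computes the plain (not bias-corrected) form; which variant the project uses is not stated here, and the `toy` counts are made up for illustration, not project data.

```python
import math


def cramers_v(table):
    """Cramér's V (uncorrected) for an r x c table of counts given as a list of rows.

    Assumes every row and column has a non-zero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    # Pearson chi-square: sum of (observed - expected)^2 / expected over all cells.
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected

    k = min(len(table), len(table[0])) - 1
    return math.sqrt(chi2 / (n * k))


# Illustrative counts only: rows are countries, columns are categories.
toy = [
    [60, 25, 15],
    [20, 55, 25],
    [15, 30, 55],
]
v = cramers_v(toy)
print(round(v, 3))
```

V ranges from 0 (category mix identical across countries) to 1 (category fully determined by country).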

Full write-up: docs/results_analysis_bertopic.md.


Production engineering

  • Self-healing weekly scrape (watermark-based; catches up after any missed run).
  • Decoupled LLM relevance gate — drains ungated rows independently; a failure just retries.
  • Idempotent Supabase upserts, versioned API I/O contract (HTTP 409 on model-version mismatch).
  • Monitoring + failure alerting (GitHub Actions opens an issue on any pipeline failure).
  • Tests + CI, SBOM, secrets baseline.
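
One way a watermark-based catch-up could work is sketched below: the scraper stores the date it last completed, and each run processes every weekly window between that watermark and today. This is an assumption about the general technique, not the project's implementation; `pending_windows` and its parameters are hypothetical names.

```python
from datetime import date, timedelta


def pending_windows(watermark, today, step_days=7):
    """Return [start, end) date windows covering watermark..today in weekly steps."""
    windows = []
    start = watermark
    while start < today:
        # The last window is clipped so it never runs past today.
        end = min(start + timedelta(days=step_days), today)
        windows.append((start, end))
        start = end
    return windows


# Two scheduled runs were missed, so three windows are pending.
windows = pending_windows(date(2025, 1, 6), date(2025, 1, 27))
for start, end in windows:
    print(start, "->", end)
```

Advancing the watermark only after a window has been stored, together with idempotent upserts, would make it safe to re-run a window that failed partway through.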

Repo structure

atlas-ed/
├── pipelines/
│   ├── bertopic_epoch/   # PRODUCTION — api, training, inference, monitoring, models, outputs, sql, tests, notebooks
│   └── nmf_baseline/     # BASELINE — NMF pipeline (convergent-validity control)
├── src/atlased/          # shared, installable package (inference core, preprocessing, path resolver)
├── ingestion/            # scrape → gate (feeds both pipelines)
├── dashboard/epoch/      # Chart.js dashboard + build_data.py (data.json generator)
├── requirements/         # api.txt, scraping.txt
├── docs/                 # architecture, governance (model card / datasheet / DPIA), methods, results
├── experiments/          # MLflow runs + scratch
├── Dockerfile · render.yaml · pyproject.toml
└── .github/workflows/    # weekly scrape · gate · inference · monthly monitor · alert

Run locally

pip install -e . -r requirements/api.txt           # install the `atlased` core + API deps
uvicorn main:app --app-dir pipelines/bertopic_epoch/api   # serves API + dashboard at /
# score a document
curl -s -X POST localhost:8000/predict -H 'Content-Type: application/json' \
  -d '{"docs":[{"doc_id":"t1","text":"Ofsted inspection of academy trusts...","country":"eng"}]}'

Quick tests: pip install -e . -r requirements/api.txt && pytest

Model/API tests load the frozen BERTopic artefacts and sentence-transformer, so they are opt-in: pytest -m model pipelines/bertopic_epoch/tests


Governance

  • Model Card · Datasheet · DPIA
  • Source discretion: the system surfaces analysis and links, never the source text or document URLs.
  • Fairness framing: representational (whose discourse is amplified); all comparison is within-country share, never raw counts.

Tech stack

Python · BERTopic · sentence-transformers (MiniLM) · UMAP / HDBSCAN · scikit-learn (NMF) · FastAPI · Supabase (Postgres) · Docker · Render · GitHub Actions · MLflow · Chart.js


Licence

MIT

UCL Institute of Education · Education Research Programme · funded by UCL Grand Challenges · Level 6 ML Engineering Apprenticeship · 2025–2026

About

Cross-jurisdictional NLP pipeline for education policy discourse analysis. Surfaces whose voices shape the debate and makes the specification choices behind the analysis visible.
