AtlasED / EPOCH

Production NLP system that makes education-policy discourse comparable across England, Scotland and Ireland — using a deployed BERTopic pipeline as the production spine, with NMF as a convergent-validity baseline.

The headline isn't a topic model — it's the dual-pipeline design: two structurally different methods (semantic-embedding BERTopic and lexical NMF) are run over the same corpus, and they independently reproduce the same cross-national finding. That convergence is what makes the result trustworthy rather than an artefact of one model.

Core finding: the three systems foreground genuinely different things. England foregrounds accountability & structures (Ofsted, academies); Scotland foregrounds equity & rights; Ireland foregrounds teaching, curriculum & inclusion. The difference is statistically strong (Cramér's V ≈ 0.28, p ≪ 0.001), survives controlling for who is speaking (V = 0.431 within government), and is reproduced by both methods.

Live

Inference API (FastAPI on Render): https://atlased-epoch-api.onrender.com/health
Weekly pipeline (GitHub Actions): self-healing scrape → LLM relevance gate → inference, with drift monitoring on a monthly schedule
Dashboard: Chart.js front end served by the API (/ + /api/data)

Architecture

weekly scrape ──► Supabase (raw)
   │  self-healing watermark (GitHub Actions)
   ▼
LLM relevance gate (gpt-4o-mini, frozen rubric, cost-capped)
   │
   ▼
BERTopic inference  ──►  Supabase (epoch_* topic tables)  ──►  dashboard + /predict API
   │  cosine vs 138 frozen topic centroids                        │
   └──────────────────────────────────────────────────────────►  drift monitoring (monthly)

Canonical store: one Supabase (Postgres) source of truth that every component reads from.
Inference: documents are chunked (~100 words), embedded (MiniLM), and scored by cosine similarity to 138 frozen topic centroids, which keeps inference fast and CPU-only.
Serving: a Dockerised FastAPI app on Render with the embedder baked into the image (no runtime model download).
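
The inference step above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the project's code: `chunk_words`, `cosine` and `score_chunk` are hypothetical helper names, the 3-dimensional vectors stand in for MiniLM embeddings, and the two named centroids stand in for the 138 frozen ones.

```python
import math

def chunk_words(text, size=100):
    """Split text into consecutive chunks of at most `size` words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def score_chunk(embedding, centroids):
    """Return (topic_id, similarity) for the centroid closest to the embedding."""
    scored = ((topic_id, cosine(embedding, c)) for topic_id, c in centroids.items())
    return max(scored, key=lambda pair: pair[1])

# Toy stand-ins (assumptions): real centroids would be loaded from the frozen model,
# and real embeddings would come from the sentence-transformer.
centroids = {"accountability": [1.0, 0.0, 0.0], "equity": [0.0, 1.0, 0.0]}
chunk_embedding = [0.9, 0.1, 0.0]
best_topic, similarity = score_chunk(chunk_embedding, centroids)
print(best_topic, round(similarity, 3))
```

Because the centroids are frozen, scoring a new document needs only an embedding pass and a set of dot products, which is why no GPU is required at inference time.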

The two pipelines

Production spine — BERTopic

MiniLM sentence embeddings → UMAP → HDBSCAN → c-TF-IDF → frozen centroids → cosine inference

Three per-country models (England 75 / Scotland 30 / Ireland 33 topics) plus a combined model, rolled up to a curated 20-category crosswalk. Chosen for the comparative task: a shared embedding space lets equivalents align despite different national vocabulary (England SEND ↔ Scotland ASN at cosine 0.78).

Baseline / control — NMF

Country-specific TF-IDF + NMF models. Transparent, lexical, CPU-cheap. Retained deliberately — not as a discarded first attempt — as the independent check.

Why both

NMF, sharing no architecture with BERTopic, reproduces the same national fingerprints → the cross-national finding is method-robust, not a BERTopic artefact. This is convergent validity, and it's the project's real contribution.

Key results

Three national registers, triple-validated (by measure, by actor, by method).
Statistical: χ² p ≪ 0.001, Cramér's V ≈ 0.284, bootstrap CIs that don't overlap across countries.
Robustness: leave-one-source-out shifts no category > ~3pp; the country effect survives within the government-only stratum (V = 0.431).
Diversity and agreement: diversity > 0.93 per country; inference agrees ~71% with an LLM judge (Claude Haiku, at category level) and ~86% with the model's own HDBSCAN labels.
Monitoring: content drift and model drift are tracked separately; current verdict: no retrain (fit stable across 14 quarters).
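
For readers unfamiliar with the effect-size measure quoted above, Cramér's V is derived from the χ² statistic of a country × category contingency table as V = sqrt(χ² / (n · (min(rows, cols) − 1))). The sketch below computes the plain (uncorrected) form; whether the project applies a bias correction is not stated here, and the counts in `toy_counts` are invented purely for illustration, not project data.

```python
import math

def cramers_v(table):
    """Return (chi2, V) for a contingency table given as a list of rows.

    Assumes every row and every column has a non-zero total.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (observed - expected) ** 2 / expected
    k = min(len(table), len(col_totals)) - 1
    return chi2, math.sqrt(chi2 / (n * k))

# Invented counts: rows are countries, columns are topic categories.
toy_counts = [
    [60, 25, 15],
    [20, 55, 25],
    [15, 30, 55],
]
chi2, v = cramers_v(toy_counts)
print(f"chi2={chi2:.2f}, V={v:.3f}")
```

V ranges from 0 (category mix is identical across countries) to 1 (each country uses its own categories exclusively), so it can be compared across tables of different sizes, unlike χ² itself.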

Full write-up: docs/results_analysis_bertopic.md.

Production engineering

Self-healing weekly scrape (watermark-based; catches up after any missed run).
Decoupled LLM relevance gate: drains ungated rows independently, so a failed run is simply retried.
Idempotent Supabase upserts, versioned API I/O contract (HTTP 409 on model-version mismatch).
Monitoring + failure alerting (GitHub Actions opens an issue on any pipeline failure).
Tests + CI, SBOM, secrets baseline.
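
One way a watermark-based scrape can catch up after missed runs is sketched below. This is an assumption about the general pattern, not the project's implementation: `missed_windows` is a hypothetical helper, and the fixed 7-day window size is chosen only to match the weekly schedule.

```python
from datetime import date, timedelta

def missed_windows(watermark, today, step_days=7):
    """List (start, end) scrape windows from the watermark up to today.

    Windows are half-open: start is inclusive, end is exclusive.
    """
    windows = []
    step = timedelta(days=step_days)
    start = watermark
    while start < today:
        end = min(start + step, today)
        windows.append((start, end))
        start = end
    return windows

# If the last successful scrape covered data up to 1 January and two runs were
# missed, the next run sees three windows and processes them in order.
for start, end in missed_windows(date(2025, 1, 1), date(2025, 1, 22)):
    print(start, "to", end)
```

In this pattern the stored watermark would advance only after a window has been written successfully, so a crash mid-run leaves the remaining windows for the next run to pick up.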

Repo structure

atlas-ed/
├── pipelines/
│   ├── bertopic_epoch/   # PRODUCTION — api, training, inference, monitoring, models, outputs, sql, tests, notebooks
│   └── nmf_baseline/     # BASELINE — NMF pipeline (convergent-validity control)
├── src/atlased/          # shared, installable package (inference core, preprocessing, path resolver)
├── ingestion/            # scrape → gate (feeds both pipelines)
├── dashboard/epoch/      # Chart.js dashboard + build_data.py (data.json generator)
├── requirements/         # api.txt, scraping.txt
├── docs/                 # architecture, governance (model card / datasheet / DPIA), methods, results
├── experiments/          # MLflow runs + scratch
├── Dockerfile · render.yaml · pyproject.toml
└── .github/workflows/    # weekly scrape · gate · inference · monthly monitor · alert

Run locally

pip install -e . -r requirements/api.txt           # install the `atlased` core + API deps
uvicorn main:app --app-dir pipelines/bertopic_epoch/api   # serves API + dashboard at /

# score a document
curl -s -X POST localhost:8000/predict -H 'Content-Type: application/json' \
  -d '{"docs":[{"doc_id":"t1","text":"Ofsted inspection of academy trusts...","country":"eng"}]}'
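
The same request can be built from Python using only the standard library. This is a sketch mirroring the curl call above: `build_predict_request` is a hypothetical helper name, and the response schema is not described here, so the example prints whatever JSON comes back.

```python
import json
import urllib.request

def build_predict_request(docs, base_url="http://localhost:8000"):
    """Build a POST request for /predict with the same JSON body as the curl example."""
    body = json.dumps({"docs": docs}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/predict",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

request = build_predict_request(
    [{"doc_id": "t1", "text": "Ofsted inspection of academy trusts...", "country": "eng"}]
)

# With the API running locally, send the request like this:
# with urllib.request.urlopen(request) as response:
#     print(json.load(response))
```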

Quick tests: pip install -e . -r requirements/api.txt && pytest

Model/API tests load the frozen BERTopic artefacts and sentence-transformer, so they are opt-in: pytest -m model pipelines/bertopic_epoch/tests

Governance

Model Card · Datasheet · DPIA
Source discretion: the system surfaces analysis and links, never the source text or document URLs.
Fairness framing: representational (whose discourse is amplified); all comparison is within-country share, never raw counts.
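
The within-country share rule can be illustrated with a short sketch. The function name `within_country_shares` and the counts are invented for illustration and are not taken from the project.

```python
def within_country_shares(counts):
    """Convert {country: {category: count}} into {country: {category: share}}."""
    shares = {}
    for country, by_category in counts.items():
        total = sum(by_category.values())
        shares[country] = {
            category: (n / total if total else 0.0)
            for category, n in by_category.items()
        }
    return shares

# Invented counts: one country contributes twice as many documents as the other.
toy_counts = {
    "eng": {"accountability": 30, "equity": 10},
    "sco": {"accountability": 5, "equity": 15},
}
shares = within_country_shares(toy_counts)
print(shares["eng"]["accountability"], shares["sco"]["equity"])
```

Normalising within each country means a larger national corpus cannot dominate the comparison simply by volume; what is compared is how each country distributes its own attention.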

Tech stack

Python · BERTopic · sentence-transformers (MiniLM) · UMAP / HDBSCAN · scikit-learn (NMF) · FastAPI · Supabase (Postgres) · Docker · Render · GitHub Actions · MLflow · Chart.js

Licence

MIT

UCL Institute of Education · Education Research Programme · funded by UCL Grand Challenges · Level 6 ML Engineering Apprenticeship · 2025–2026
