Architecture

Pipeline Overview

DOJ EFTA (DS1–DS12) / Kaggle / HuggingFace / Archive.org / justice.gov
    │
    ▼
┌──────────────────────────────────────────────────────────┐
│  OCR (multi-backend fallback chain)                      │
│  PyMuPDF → PaddleOCR → Granite-Docling → Surya → Docling │
│  Per-page confidence scoring, automatic backend selection│
└──────────────────────┬───────────────────────────────────┘
                       │
    ┌──────────────────┼──────────────────┐
    ▼                  ▼                  ▼
┌────────────┐  ┌────────────┐  ┌──────────────────┐
│ Transcribe │  │ NER        │  │ Classifier       │
│ WhisperX / │  │ spaCy trf  │  │ GLiClass-        │
│ faster-    │  │ + GLiNER   │  │ ModernBERT       │
│ whisper    │  │ + regex    │  │ (50x faster)     │
│ + pyannote │  │            │  │ 12 categories    │
└─────┬──────┘  └─────┬──────┘  └────────┬─────────┘
      │               │                  │
    ┌─┼───────────────┼──────────────────┘
    │ │               │
    ▼ ▼               ▼
┌────────────┐  ┌────────────┐  ┌──────────────────┐
│ Structured │  │ Dedup      │  │ Summarizer       │
│ Extraction │  │ Hash →     │  │ LLM-based        │
│ Instructor │  │ MinHash →  │  │ Redaction        │
│ + Pydantic │  │ Semantic   │  │ Analysis         │
└─────┬──────┘  └─────┬──────┘  └────────┬─────────┘
      │               │                  │
      └───────────────┼──────────────────┘
                      ▼
┌──────────────────────────────────────────────────────────┐
│  Semantic Chunker → Embeddings (nomic-embed-text-v2-moe) │
│  Paragraph-aware splitting, 768-dim / 256-dim Matryoshka │
└──────────────────────┬───────────────────────────────────┘
                       │
    ┌──────────────────┼──────────────────┐
    ▼                  ▼                  ▼
┌────────────┐  ┌────────────┐  ┌──────────────────┐
│ Neon PG    │  │ JSON/CSV   │  │ Knowledge Graph  │
│ + pgvector │  │ SQLite     │  │ GEXF + JSON      │
│ cosine ANN │  │ NDJSON     │  │ LLM extraction   │
└────────────┘  └────────────┘  └──────────────────┘

Directory Structure

src/epstein_pipeline/
├── cli.py                          # Click CLI entry point (all commands)
├── config.py                       # Pydantic settings (env vars, paths, model names)
├── state.py                        # Pipeline state tracking (processed files, hashes)
│
├── downloaders/                    # Data source fetchers
│   ├── doj.py                      # DOJ EFTA dataset downloads (DS1-DS12)
│   ├── kaggle.py                   # Kaggle Epstein Ranker dataset
│   ├── huggingface.py              # HuggingFace datasets (emails, filings)
│   ├── archive.py                  # Archive.org media collections
│   ├── video_depositions.py        # Video deposition downloader (justice.gov, C-SPAN, Archive.org)
│   ├── opensanctions.py            # OpenSanctions cross-reference data
│   ├── icij.py                     # ICIJ Offshore Leaks network data
│   ├── fec.py                      # FEC political donation records
│   ├── nonprofits.py               # IRS 990 tax-exempt organization data
│   ├── propublica_nonprofits.py    # ProPublica Nonprofit Explorer API (richer 990 metadata)
│   ├── courtlistener.py            # CourtListener / RECAP free-tier search API
│   ├── sec_edgar.py                # SEC EDGAR filings (JPM, Deutsche Bank, BBWI, etc.)
│   ├── house_oversight.py          # House Oversight releases (Drive + Dropbox scrapers)
│   └── archive_org.py              # Internet Archive mirror downloader (DS1-DS12 + Oversight)
│
├── processors/                     # Core processing pipeline
│   ├── ocr.py                      # Multi-backend OCR (PyMuPDF → PaddleOCR → Granite-Docling → Surya → Docling)
│   ├── pymupdf_extractor.py        # PyMuPDF-specific text/image extraction
│   ├── transcriber.py              # Audio/video transcription (faster-whisper / WhisperX + pyannote)
│   ├── entities.py                 # spaCy + GLiNER NER with person registry matching
│   ├── person_linker.py            # Fast substring person linking (rapidfuzz)
│   ├── structured_extractor.py     # LLM structured extraction (Instructor + Pydantic)
│   ├── classifier.py               # Document classification (GLiClass-ModernBERT / BART fallback)
│   ├── confidence.py               # Numeric confidence scores for entity matches
│   ├── dedup.py                    # Three-pass dedup (hash → MinHash → semantic)
│   ├── chunker.py                  # Semantic text chunking (paragraph-aware)
│   ├── embeddings.py               # nomic-embed-text-v2-moe vector generation
│   ├── knowledge_graph.py          # Entity relationship graph (JSON + GEXF + Neo4j)
│   ├── temporal_extractor.py       # LLM temporal event extraction (Instructor + Pydantic)
│   ├── entity_resolution.py        # Probabilistic entity resolution (Splink 4 + DuckDB)
│   ├── redaction.py                # Redaction detection + recovery analysis
│   ├── image_extractor.py          # PDF image extraction + optional AI description
│   ├── plist_forensics.py          # Apple Mail PLIST metadata extraction
│   └── summarizer.py               # AI document summarization (Ollama / OpenAI)
│
├── exporters/                      # Output format converters
│   ├── json_export.py              # JSON export (site-compatible camelCase)
│   ├── csv_export.py               # CSV export for researchers
│   ├── sqlite_export.py            # SQLite with FTS5 full-text search
│   ├── neon_export.py              # Neon Postgres with pgvector embeddings
│   ├── neon_schema.py              # Idempotent Neon schema migration SQL (v4)
│   ├── neo4j_export.py             # Neo4j graph database export (async MERGE)
│   └── site_sync.py                # Direct sync to epstein-index site data/
│
├── importers/                      # External data importers
│   └── sea_doughnut.py             # Import Sea_Doughnut research databases
│
├── models/                         # Pydantic data models
│   ├── document.py                 # Document, Page, Entity, Embedding models
│   ├── registry.py                 # Person registry (names, aliases, IDs)
│   ├── forensics.py                # Redaction, PLIST, image analysis models
│   └── temporal.py                 # Temporal event extraction models
│
├── validators/                     # Data quality enforcement
│   ├── schema.py                   # JSON schema validation
│   └── integrity.py                # Cross-reference integrity checks
│
└── utils/                          # Shared utilities
    ├── hashing.py                  # Content hashing (SHA-256, SimHash)
    ├── parallel.py                 # ProcessPoolExecutor wrapper
    └── progress.py                 # Rich progress bars

Key Design Decisions

Multi-Backend OCR with Fallback Chain

The pipeline supports seven OCR backends because no single engine handles all document types well:

Backend	Strengths	Weaknesses
PyMuPDF	Instant, extracts existing text layers	Cannot OCR scanned images
PaddleOCR PP-OCRv5	94.5% OmniDocBench; ~12s/page CPU; production-grade	Known Windows silent-exit bug after first call (see CLI workaround)
Granite-Docling-258M	VLM accuracy (OCRBench 500), ~500MB VRAM, structure-aware	Requires GPU for reasonable speed
SmolDocling-256M	Fast (0.35s/page), 500MB VRAM	Superseded by Granite-Docling; kept for compatibility
Surya	Fast, 90+ languages, good accuracy	Misses some complex layouts
olmOCR 2	Highest accuracy (VLM-based)	Requires 8GB+ GPU
Docling (IBM)	Understands tables/layout, no GPU	Slower than Surya

The default strategy (--backend auto) chains: PyMuPDF → PaddleOCR → Granite-Docling → Surya → Docling. Per-page confidence scoring triggers fallback when quality is low. olmOCR is excluded from auto mode due to GPU cost; select explicitly with --backend olmocr.
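
The fallback behaviour can be sketched roughly as follows. This is a minimal illustration, not the code in processors/ocr.py: the backend signature (page bytes in, text plus confidence out), the PageResult type, and the 0.8 threshold are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# Hypothetical backend signature: raw page bytes in, (text, confidence) out.
Backend = Callable[[bytes], Tuple[str, float]]


@dataclass
class PageResult:
    text: str
    confidence: float  # 0.0 to 1.0, as reported or estimated by the backend
    backend: str


def ocr_page(
    page: bytes,
    chain: List[Tuple[str, Backend]],
    min_confidence: float = 0.8,  # assumed threshold, not the pipeline's value
) -> PageResult:
    """Try each backend in order and stop at the first confident result."""
    best: Optional[PageResult] = None
    for name, backend in chain:
        try:
            text, confidence = backend(page)
        except Exception:
            continue  # a failing backend falls through to the next one
        if best is None or confidence > best.confidence:
            best = PageResult(text, confidence, name)
        if confidence >= min_confidence:
            break  # good enough: skip the slower backends
    if best is None:
        raise RuntimeError("every OCR backend failed for this page")
    return best
```

If no backend clears the threshold, the sketch returns the highest-scoring attempt rather than failing, which is one plausible policy among several.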

Windows PaddleOCR workaround: On Windows, PaddlePaddle 3.2.0 with PaddleOCR 3.4.0 silently exits after the first OCR call made through the Python API, because a native oneDNN std::abort bypasses the Python traceback. For reliable batch OCR, use scripts/ocr-via-cli.py, a thin wrapper that spawns the paddleocr ocr CLI in a fresh subprocess per document. It writes the same .txt + .meta.json sidecar cache that scripts/ingest-featured-releases.py reads on re-run.
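
The subprocess-per-document pattern might look something like the sketch below. It is not the contents of scripts/ocr-via-cli.py: the function name, the assumption that the OCR command prints recognized text to stdout, and the fields written to .meta.json are all hypothetical. The command is passed in by the caller, so no particular CLI flags are assumed.

```python
import json
import subprocess
import sys
from pathlib import Path
from typing import List


def ocr_in_subprocess(doc: Path, command: List[str], cache_dir: Path) -> str:
    """Run one OCR process per document and cache the text next to a metadata file."""
    txt_path = cache_dir / f"{doc.stem}.txt"
    meta_path = cache_dir / f"{doc.stem}.meta.json"
    if txt_path.exists() and meta_path.exists():
        return txt_path.read_text(encoding="utf-8")  # cache hit: skip OCR entirely

    # A fresh process per document means a native abort only kills the child,
    # not the batch loop that is driving it.
    result = subprocess.run(
        [*command, str(doc)], capture_output=True, text=True, timeout=600
    )
    if result.returncode != 0:
        raise RuntimeError(f"OCR failed for {doc.name}: exit code {result.returncode}")

    cache_dir.mkdir(parents=True, exist_ok=True)
    txt_path.write_text(result.stdout, encoding="utf-8")
    # Hypothetical metadata fields; the real sidecar schema is not shown here.
    meta = {"source": doc.name, "chars": len(result.stdout)}
    meta_path.write_text(json.dumps(meta), encoding="utf-8")
    return result.stdout
```

Because the cache is checked before anything is spawned, re-running a batch only pays the OCR cost for documents that have no sidecar files yet.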

Three-Pass Deduplication

Duplicate detection uses three complementary approaches:

1. Content hash: SHA-256 of normalized text catches exact duplicates (O(1) lookup per document)
2. MinHash/LSH: locality-sensitive hashing finds near-duplicates (O(n) total, sublinear per query)
3. Semantic embeddings: cosine similarity catches OCR-variant duplicates, where the same document was scanned differently

Results are stored in data/known-duplicates.json with human-reviewable match explanations.
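
To make the three passes concrete, here is a small standard-library sketch that compares a single pair of documents. It is an illustration under stated assumptions rather than the logic in processors/dedup.py: the normalization rule, the 5-word shingles, the 64 hash permutations, and both thresholds are invented for the example, and a real LSH index would bucket signatures instead of comparing pairs directly.

```python
import hashlib
import math
import re
from typing import List, Optional, Sequence, Set


def normalize(text: str) -> str:
    """Collapse whitespace and case so trivial formatting differences vanish."""
    return re.sub(r"\s+", " ", text).strip().lower()


def content_hash(text: str) -> str:
    """Pass 1: SHA-256 of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def shingles(text: str, k: int = 5) -> Set[str]:
    """Overlapping k-word windows; short texts yield a single shingle."""
    words = normalize(text).split()
    return {" ".join(words[i : i + k]) for i in range(max(1, len(words) - k + 1))}


def minhash_signature(text: str, num_perm: int = 64) -> List[int]:
    """Pass 2: for each salted hash function, keep the minimum over all shingles."""
    encoded = [s.encode("utf-8") for s in shingles(text)]
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "big")
        signature.append(
            min(
                int.from_bytes(
                    hashlib.blake2b(s, digest_size=8, salt=salt).digest(), "big"
                )
                for s in encoded
            )
        )
    return signature


def estimated_jaccard(sig_a: Sequence[int], sig_b: Sequence[int]) -> float:
    """The fraction of matching signature slots estimates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Pass 3: cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def classify_pair(text_a, text_b, emb_a=None, emb_b=None,
                  near: float = 0.8, semantic: float = 0.95) -> Optional[str]:
    """Return which pass flags the pair as duplicates, or None if none does."""
    if content_hash(text_a) == content_hash(text_b):
        return "exact"
    sig_a, sig_b = minhash_signature(text_a), minhash_signature(text_b)
    if estimated_jaccard(sig_a, sig_b) >= near:
        return "near"
    if emb_a is not None and emb_b is not None and cosine(emb_a, emb_b) >= semantic:
        return "semantic"
    return None
```

The passes run cheapest first, so the embedding comparison is only reached for pairs that the hash and MinHash checks did not already settle.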

Pydantic Models with camelCase Fields

Pydantic v2 models use camelCase field names (via alias_generator) to directly match the TypeScript interfaces on epsteinexposed.com. This means data flows from pipeline → JSON → site with zero transformation:

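from pydantic import BaseModel, ConfigDict, Field


class Document(BaseModel):
    # populate_by_name lets callers pass either "personIds" or "person_ids"
    model_config = ConfigDict(populate_by_name=True)

    id: str
    title: str
    source: str
    # model_dump() emits the camelCase field name; the snake_case alias is accepted on input
    personIds: list[str] = Field(alias="person_ids")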

Person Registry

The person registry (data/persons-registry.json) contains 1,723+ known persons with:

Canonical names, aliases, and spelling variations
Unique IDs matching the site's person pages
Categories (associate, legal, political, victim, etc.)

Entity extraction matches against this registry using rapidfuzz fuzzy matching with configurable confidence thresholds.
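
As a rough illustration of threshold-based registry matching, the sketch below scores a mention against every canonical name and alias and keeps the best hit. It deliberately swaps in difflib from the standard library for rapidfuzz so that it runs without extra dependencies. The registry entries, IDs, field names, and the 85-point threshold are all made up for the example.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional, Tuple

# Hypothetical registry entries; the real file is data/persons-registry.json.
REGISTRY: List[Dict] = [
    {"id": "p-0001", "name": "Jane Example", "aliases": ["J. Example"]},
    {"id": "p-0002", "name": "John Sample", "aliases": ["Jon Sample"]},
]


def similarity(a: str, b: str) -> float:
    """Case-insensitive similarity on a 0 to 100 scale (difflib stand-in for rapidfuzz)."""
    return 100.0 * SequenceMatcher(None, a.casefold(), b.casefold()).ratio()


def link_person(
    mention: str,
    registry: List[Dict] = REGISTRY,
    threshold: float = 85.0,  # assumed default; the pipeline makes this configurable
) -> Tuple[Optional[str], float]:
    """Return (person_id, score) for the best match, or (None, score) below threshold."""
    best_id, best_score = None, 0.0
    for person in registry:
        for candidate in [person["name"], *person["aliases"]]:
            score = similarity(mention, candidate)
            if score > best_score:
                best_id, best_score = person["id"], score
    if best_score < threshold:
        return None, best_score
    return best_id, best_score
```

Returning the score alongside the ID keeps borderline matches reviewable, which fits the numeric confidence scores described for processors/confidence.py.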

Neon Postgres with pgvector

The Neon exporter creates a production-ready database with:

pgvector for semantic search (cosine similarity on 768-dim embeddings)
pg_trgm for fuzzy text search (trigram similarity)
FTS via tsvector/GIN indexes for full-text search
IVFFlat indexes for approximate nearest neighbor queries
Idempotent schema migration (epstein-pipeline migrate)
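
The kind of SQL these features imply can be sketched as below, held in Python constants. This is not the schema in exporters/neon_schema.py: the chunks table, its columns, the index names, and the lists = 100 setting are assumptions for illustration, while the extension names, operator classes, and the <=> cosine-distance operator are standard pgvector and pg_trgm features.

```python
from typing import Sequence

EMBEDDING_DIM = 768

# Every statement uses IF NOT EXISTS so the migration can be re-run safely.
MIGRATION = [
    "CREATE EXTENSION IF NOT EXISTS vector",
    "CREATE EXTENSION IF NOT EXISTS pg_trgm",
    f"""CREATE TABLE IF NOT EXISTS chunks (
        id TEXT PRIMARY KEY,
        document_id TEXT NOT NULL,
        body TEXT NOT NULL,
        embedding vector({EMBEDDING_DIM})
    )""",
    # Approximate nearest neighbour index using cosine distance.
    "CREATE INDEX IF NOT EXISTS chunks_embedding_idx ON chunks "
    "USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100)",
    # Trigram index for fuzzy text search.
    "CREATE INDEX IF NOT EXISTS chunks_body_trgm_idx ON chunks "
    "USING gin (body gin_trgm_ops)",
    # Expression index for full-text search.
    "CREATE INDEX IF NOT EXISTS chunks_body_fts_idx ON chunks "
    "USING gin (to_tsvector('english', body))",
]

# <=> is pgvector's cosine distance, so similarity is 1 minus the distance.
# Placeholders follow the psycopg named-parameter style.
NEAREST_CHUNKS = (
    "SELECT id, 1 - (embedding <=> %(query)s::vector) AS cosine_similarity "
    "FROM chunks ORDER BY embedding <=> %(query)s::vector LIMIT %(k)s"
)


def to_vector_literal(values: Sequence[float]) -> str:
    """Format a Python sequence as a pgvector text literal such as [0.1,0.2]."""
    return "[" + ",".join(f"{v:.6f}" for v in values) + "]"
```

Ordering by the distance expression directly is what allows the IVFFlat index to be used for the query.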

Knowledge Graph

The knowledge graph processor builds weighted entity-relationship graphs:

Co-occurrence edges from documents mentioning multiple persons
Co-passenger edges from flight log data
Correspondence edges from email sender/recipient pairs
Optional LLM relationship extraction for relationship labeling

Output formats: JSON (for D3.js visualization) and GEXF (for Gephi analysis).
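
Co-occurrence weighting is simple enough to sketch in a few lines. The example below is an assumption-laden illustration, not the code in processors/knowledge_graph.py: the input shape (document ID mapped to person IDs) and the node-link JSON layout are chosen for the example because D3.js force layouts commonly consume that shape.

```python
import json
from collections import Counter
from itertools import combinations
from typing import Dict, List


def cooccurrence_edges(doc_persons: Dict[str, List[str]]) -> Counter:
    """Count, for every unordered pair of persons, the documents mentioning both."""
    edges: Counter = Counter()
    for person_ids in doc_persons.values():
        # set() so repeated mentions within one document count once;
        # sorted() so (a, b) and (b, a) collapse into the same key.
        for a, b in combinations(sorted(set(person_ids)), 2):
            edges[(a, b)] += 1
    return edges


def to_node_link(edges: Counter) -> dict:
    """Convert weighted edges to a node-link structure suitable for JSON export."""
    nodes = sorted({person for pair in edges for person in pair})
    return {
        "nodes": [{"id": person} for person in nodes],
        "links": [
            {"source": a, "target": b, "weight": weight}
            for (a, b), weight in sorted(edges.items())
        ],
    }


if __name__ == "__main__":
    # Hypothetical document-to-person mapping.
    docs = {"doc-1": ["p-1", "p-2", "p-3"], "doc-2": ["p-2", "p-1", "p-1"]}
    print(json.dumps(to_node_link(cooccurrence_edges(docs)), indent=2))
```

Co-passenger and correspondence edges could reuse the same counting with a different grouping key, such as a flight or an email thread, in place of the document.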

Data Flow

1. Download     Raw PDFs from DOJ, Kaggle, HuggingFace, Archive.org
                ↓
2. OCR          Extract text (PyMuPDF → PaddleOCR → Granite-Docling → Surya → Docling fallback chain)
                ↓
3. Entities     spaCy NER + GLiNER zero-shot + regex patterns
                ↓
4. Person Link  Match entity names → canonical person IDs (rapidfuzz)
                ↓
5. Classify     GLiClass-ModernBERT (zero-shot BART fallback) → 12 legal document categories
                ↓
6. Dedup        Hash → MinHash/LSH → semantic similarity
                ↓
7. Chunk        Semantic paragraph-aware text splitting
                ↓
8. Embed        nomic-embed-text-v2-moe → 768-dim vectors
                ↓
9. Validate     Schema checks, cross-reference integrity
                ↓
10. Export      JSON, CSV, SQLite, or Neon Postgres
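
Step 7 can be illustrated with a small paragraph-aware splitter. This is a simplified sketch, not the logic in processors/chunker.py: it packs whole paragraphs by character count only, and the 1200-character limit and the hard split for oversized paragraphs are assumptions made for the example.

```python
import re
from typing import List


def chunk_paragraphs(text: str, max_chars: int = 1200) -> List[str]:
    """Pack whole paragraphs into chunks of at most max_chars characters."""
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks: List[str] = []
    current = ""
    for para in paragraphs:
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= max_chars:
            current = candidate  # the paragraph still fits in the open chunk
            continue
        if current:
            chunks.append(current)  # close the open chunk at a paragraph boundary
        # A single paragraph longer than the limit is split at fixed offsets.
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max_chars:]
        current = para
    if current:
        chunks.append(current)
    return chunks
```

Splitting on blank lines keeps each chunk aligned with the document's own structure, which tends to give the embedding step in step 8 more coherent inputs than fixed-width windows.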

CI/CD

GitHub Actions Workflows

Workflow	Trigger	What It Does
`ci.yml`	Push/PR to main	Lint, test (3.10-3.13), typecheck, schema validation
`publish.yml`	Release tag	Build and publish to PyPI
`validate-data.yml`	PR with data changes	Validate contributed data files

Docker

Multi-stage build for smaller images:

Builder stage: Installs all dependencies with build tools
Runtime stage: Copies only installed packages + runtime deps
Includes spaCy en_core_web_sm model
Entry point: epstein-pipeline CLI

docker compose run pipeline ocr ./pdfs/ --output ./output/
docker compose run pipeline export neon --input-dir ./output/
