Your coding agent makes technical decisions. Paper Lantern reads hundreds of research papers per decision and turns them into implementation guidance.
We ran 9 everyday tasks - same agent, same model, same data. The only difference: one agent could call Paper Lantern before building its solution.
Test generation (+39% mutation score)
Writing tests that actually catch bugs. Paper Lantern found a technique that targets specific mutation points instead of guessing at coverage (MuTAP)
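
To make "targets specific mutation points" concrete, here is a minimal sketch that walks a function's AST and lists the expressions a mutation tool would typically alter, then folds them into a prompt. This is not the MuTAP implementation or the code in this repo; `find_mutation_points`, `build_prompt`, and the choice of node types are all hypothetical, illustrative names.

```python
import ast

# Illustrative subset of node types whose operators a mutation tool might flip.
# This list is an assumption, not taken from MuTAP or MUTGEN.
MUTABLE_NODES = (ast.Compare, ast.BinOp, ast.BoolOp)


def find_mutation_points(source: str) -> list[tuple[int, str]]:
    """Return sorted (line_number, expression) pairs that a mutant could alter."""
    tree = ast.parse(source)
    points = []
    for node in ast.walk(tree):
        if isinstance(node, MUTABLE_NODES):
            # ast.unparse (Python 3.9+) turns the node back into source text.
            points.append((node.lineno, ast.unparse(node)))
    return sorted(points)


def build_prompt(source: str) -> str:
    """Build a test-writing prompt that names each mutation point explicitly."""
    lines = [f"- line {lineno}: {expr}" for lineno, expr in find_mutation_points(source)]
    return (
        "Write tests that would fail if any of these expressions were changed:\n"
        + "\n".join(lines)
        + "\n\nCode under test:\n"
        + source
    )


SAMPLE = '''
def clamp(x, lo, hi):
    if x < lo:
        return lo
    if x > hi:
        return hi
    return x
'''
```

Calling `build_prompt(SAMPLE)` names the two comparisons in `clamp`, so the agent is asked for boundary tests at `x == lo` and `x == hi` rather than for generic coverage.
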
Code review (+13% F1)
Finding real issues in PRs, not hallucinated ones. Paper Lantern surfaced a multi-pass strategy - real bugs show up every time, false positives don't (Multi-Review Aggregation)
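
The multi-pass idea can be sketched as a majority vote over the findings of several independent review passes. This is a simplified illustration, not the code the agent wrote: `aggregate_reviews` is a hypothetical name, and it assumes each finding has already been normalized to a comparable string key (a real pipeline would need fuzzier matching between passes).

```python
from collections import Counter
from typing import Optional


def aggregate_reviews(passes: list[set[str]], min_votes: Optional[int] = None) -> set[str]:
    """Keep only findings reported by at least `min_votes` review passes.

    Each pass is a set, so a finding counts at most once per pass.
    By default a strict majority is required (2 of 3, 3 of 5, ...).
    """
    if not passes:
        return set()
    if min_votes is None:
        min_votes = len(passes) // 2 + 1
    counts = Counter(finding for review in passes for finding in review)
    return {finding for finding, votes in counts.items() if votes >= min_votes}


# Hypothetical findings from three passes over the same PR.
EXAMPLE_PASSES = [
    {"a.py:10:null-deref", "a.py:20:unused-var"},
    {"a.py:10:null-deref"},
    {"a.py:10:null-deref", "b.py:5:race"},
]
```

With `EXAMPLE_PASSES`, only `a.py:10:null-deref` survives: it appears in every pass, while the two one-off findings are dropped as likely false positives.
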
Text-to-SQL (+6% accuracy)
Generating SQL that returns the right answer. Paper Lantern surfaced 5 candidate approaches; the winner was self-consistency voting (SQL-PaLM)
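
Self-consistency voting can be sketched as follows: sample several candidate queries, execute each one, and keep a query whose result agrees with the most other candidates. The sketch below uses the standard-library `sqlite3` module; `run_candidate` and `vote_on_results` are hypothetical names, and this is an illustration of the general idea rather than the SQL-PaLM method or the code in this repo.

```python
import sqlite3
from collections import Counter


def run_candidate(conn: sqlite3.Connection, sql: str):
    """Execute one candidate; return an order-insensitive result, or None on error."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None  # invalid SQL gets no vote
    # Sort by repr so rows containing NULLs or mixed types can still be ordered.
    return tuple(sorted(rows, key=repr))


def vote_on_results(conn: sqlite3.Connection, candidates: list[str]):
    """Return the first candidate whose execution result is the most common one."""
    executed = [(sql, run_candidate(conn, sql)) for sql in candidates]
    valid = [(sql, result) for sql, result in executed if result is not None]
    if not valid:
        return None
    winning_result, _ = Counter(result for _, result in valid).most_common(1)[0]
    return next(sql for sql, result in valid if result == winning_result)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount REAL)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10.0), (2, 25.0), (3, 40.0)])
    candidates = [
        "SELECT COUNT(*) FROM orders WHERE amount > 20",
        "SELECT COUNT(id) FROM orders WHERE amount > 20.0",
        "SELECT COUNT(*) FROM orders WHERE amount >= 10",
        "SELECT COUNT(*) FROM missing_table",
    ]
    print(vote_on_results(conn, candidates))
```

Voting on execution results rather than on query text means two differently written queries that return the same rows reinforce each other.
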
Document extraction (+72% F1)
Pulling structured data from long documents. Paper Lantern found a retrieve-then-validate pipeline from papers published weeks before the experiment (BEAVER, PAVE)
npx paperlantern@latest
Setup guide and docs at paperlantern.ai/code
| Task | Baseline | With PL | Delta | What PL surfaced |
|---|---|---|---|---|
| Document extraction | 0.444 | 0.764 | +72% | BEAVER section selection + PAVE validation |
| PDF extraction | 0.318 | 0.572 | +80% | Section-level decomposition + self-verification (PARSE, Deep Reflective Reasoning) |
| Test generation | 0.625 | 0.870 | +39% | Mutation-aware prompting via AST analysis (MuTAP, MUTGEN) |
| Text classification | 0.505 | 0.666 | +32% | Retrieval-first classification (Retrieval-ICL, LLM-Select-P) |
| Prompt examples | 0.193 | 0.324 | +68% | MMR diversity + hierarchical parent-then-child prompting |
| Text-to-SQL | 0.650 | 0.690 | +6% | Self-consistency voting (SQL-PaLM, MCS-SQL) |
| Code review | 0.351 | 0.395 | +13% | Consensus aggregation - 3 passes, majority vote (Multi-Review Aggregation) |
| LLM routing | 0.744 | 0.761 | +2% | Cross-validated model selection (CARGO) |
| LLM-as-judge | 0.623 | 0.633 | +2% | Dimension-specific multi-pass evaluation (HypoEval, PAIRS) |
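
The Delta column reads as relative improvement over the baseline score; the figures in the table are consistent with that reading. A small sketch for recomputing it, where `relative_gain` and `format_delta` are illustrative names rather than functions from this repo:

```python
def relative_gain(baseline: float, with_pl: float) -> float:
    """Relative improvement of the with_pl score over the baseline score."""
    return (with_pl - baseline) / baseline


def format_delta(baseline: float, with_pl: float) -> str:
    """Format the gain as a signed whole-number percentage, e.g. '+72%'."""
    return f"{relative_gain(baseline, with_pl):+.0%}"


# (baseline, with_pl) pairs copied from the table above.
SCORES = {
    "Document extraction": (0.444, 0.764),
    "PDF extraction": (0.318, 0.572),
    "Test generation": (0.625, 0.870),
    "Text classification": (0.505, 0.666),
    "Prompt examples": (0.193, 0.324),
    "Text-to-SQL": (0.650, 0.690),
    "Code review": (0.351, 0.395),
    "LLM routing": (0.744, 0.761),
    "LLM-as-judge": (0.623, 0.633),
}

if __name__ == "__main__":
    for task, (baseline, with_pl) in SCORES.items():
        print(f"{task}: {format_delta(baseline, with_pl)}")
```
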
The biggest gains came when Paper Lantern surfaced an architecturally different approach - not a better prompt, but a different way to structure the solution. The smaller gains came on tasks where the baseline was already structurally sound. We're showing all 9 because that's the honest picture.
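
As one example of a structural change rather than a prompt tweak, the "MMR diversity" technique named in the Prompt examples row can be sketched as greedy Maximal Marginal Relevance selection: pick few-shot examples that are relevant to the query but not redundant with examples already chosen. This is a generic sketch over plain vectors, not the agent's code; `mmr_select`, the `lam` trade-off parameter, and the toy embeddings are assumptions for illustration.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query: list[float], candidates: list[list[float]], k: int, lam: float = 0.5) -> list[int]:
    """Greedily pick up to k candidate indices, trading relevance against redundancy.

    score = lam * sim(query, candidate) - (1 - lam) * max sim(candidate, already selected)
    lam = 1.0 reduces to plain relevance ranking.
    """
    selected: list[int] = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i: int) -> float:
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy

        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy embeddings: candidates 0 and 1 are near-duplicates, candidate 2 is different.
QUERY = [1.0, 0.2]
CANDIDATES = [[1.0, 0.0], [0.99, 0.1], [0.6, 0.8]]
```

With `lam=0.5`, `mmr_select(QUERY, CANDIDATES, k=2)` takes the most relevant candidate first and then skips its near-duplicate in favor of the dissimilar one; with `lam=1.0` it simply takes the two most relevant.
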
The baseline and with_pl prompts are identical except for one section - baseline says "use your own knowledge," with_pl says "use Paper Lantern MCP tools to research first." Same agent, same model (Gemini Flash 3), same data, same evaluation script. Everything else downstream was the agent's doing.
Every prompt, every line of agent code, every prediction is in this repo. Diff them yourself:
```
diff experiments/test_generation/prompts/baseline.md experiments/test_generation/prompts/with_pl.md
```
Results are already committed - no API keys needed to browse. But if you want to verify:
```
git clone https://github.com/paperlantern-ai/paper-lantern-challenges.git
cd paper-lantern-challenges
uv sync
uv run python run.py setup test_generation
uv run python run.py run test_generation baseline
uv run python run.py run test_generation with_pl
uv run python run.py compare test_generation
```
Prerequisites
- uv - `curl -LsSf https://astral.sh/uv/install.sh | sh`
- Claude Code - or any coding agent
- Gemini API key - `export GEMINI_API_KEY=your_key`
- Paper Lantern (for `with_pl` runs) - setup instructions
CLI reference
```
uv run python run.py list                                        # Show available challenges
uv run python run.py setup <challenge>                           # Download data
uv run python run.py run <challenge> <submission>                # Run + evaluate
uv run python run.py run <challenge> <submission> --dataset cfpb # Run one dataset
uv run python run.py eval <challenge> <submission>               # Evaluate only
uv run python run.py compare <challenge>                         # Side-by-side comparison
```
Repo structure
```
experiments/<challenge>/
├── CHALLENGE.md              # Problem description
├── evaluate.py               # Scoring script
├── setup.py                  # Data download
├── starter/                  # Starter code
├── prompts/
│   ├── baseline.md           # Exact prompt we gave the baseline agent
│   └── with_pl.md            # Exact prompt we gave the PL agent
├── data/                     # Challenge data (after setup)
└── submissions/
    ├── baseline/
    │   ├── *.py              # Code the agent wrote
    │   ├── predictions.jsonl
    │   └── approach.md       # Strategy, decisions, reasoning
    └── with_pl/
        ├── *.py
        ├── predictions.jsonl
        └── approach.md       # Strategy, decisions, papers cited
```
MIT