Paper Lantern Challenges

Your coding agent makes technical decisions. Paper Lantern reads hundreds of research papers per decision and turns them into implementation guidance.

We ran 9 everyday tasks - same agent, same model, same data. The only difference: one agent could call Paper Lantern before building its solution.

Test generation (+39% mutation score)
Writing tests that actually catch bugs. Paper Lantern found a technique that targets specific mutation points instead of guessing at coverage (MuTAP)
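
For readers new to the metric, here is a minimal sketch of how a mutation score can be computed. It is not MuTAP and not this repo's evaluation script. The `clamp` function, the three hand-written mutants, and both test suites are hypothetical and exist only to illustrate "fraction of mutants killed".

```python
def clamp(x, lo, hi):
    """Function under test: restrict x to the range [lo, hi]."""
    return max(lo, min(x, hi))


# Hypothetical mutants: each copies clamp with one deliberate bug.
def mutant_swapped(x, lo, hi):
    return min(lo, max(x, hi))  # min and max swapped


def mutant_no_upper(x, lo, hi):
    return max(lo, x)  # upper bound dropped


def mutant_no_lower(x, lo, hi):
    return min(x, hi)  # lower bound dropped


MUTANTS = [mutant_swapped, mutant_no_upper, mutant_no_lower]


def mutation_score(mutants, tests):
    """Fraction of mutants for which at least one test case fails.

    `tests` is a list of (args, expected) pairs that the original passes.
    """
    killed = 0
    for mutant in mutants:
        if any(mutant(*args) != expected for args, expected in tests):
            killed += 1
    return killed / len(mutants)


# A suite that only checks the happy path misses both boundary mutants.
weak_tests = [((5, 0, 10), 5)]
# Adding one case per boundary kills them.
strong_tests = weak_tests + [((-3, 0, 10), 0), ((42, 0, 10), 10)]

print(mutation_score(MUTANTS, weak_tests))
print(mutation_score(MUTANTS, strong_tests))
```

The two extra cases in `strong_tests` each target one specific mutation point, which is the general idea behind mutation-aware test writing.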

Code review (+13% F1)
Finding real issues in PRs, not hallucinated ones. Paper Lantern surfaced a multi-pass strategy - real bugs show up every time, false positives don't (Multi-Review Aggregation)
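
The multi-pass idea can be sketched as a simple vote over findings. This is an illustration under stated assumptions, not the code the agent wrote: the `consensus_findings` name, the `(file, line, kind)` finding keys, and matching findings by exact equality are all hypothetical choices.

```python
from collections import Counter


def consensus_findings(passes, min_votes=None):
    """Keep findings reported by at least `min_votes` review passes.

    `passes` is a list of iterables, one per review pass, holding hashable
    finding keys. By default a strict majority of passes is required.
    """
    passes = [set(p) for p in passes]  # a pass votes at most once per finding
    if min_votes is None:
        min_votes = len(passes) // 2 + 1
    votes = Counter(finding for p in passes for finding in p)
    return {finding for finding, n in votes.items() if n >= min_votes}


# Hypothetical findings from three independent passes over the same PR.
sqli = ("auth.py", 42, "sql-injection")
leak = ("db.py", 10, "unclosed-connection")
noise = ("utils.py", 7, "possible-null-deref")

review_passes = [{sqli, leak}, {sqli, noise}, {sqli, leak}]

print(consensus_findings(review_passes))               # majority of 3 passes
print(consensus_findings(review_passes, min_votes=3))  # unanimous only
```

Raising `min_votes` trades recall for precision. In practice findings from different passes rarely match character for character, so a real implementation would need some fuzzy matching step that this sketch leaves out.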

Text-to-SQL (+6% accuracy)
Generating SQL that returns the right answer. Paper Lantern surfaced 5 candidate approaches; the winner was self-consistency voting (SQL-PaLM)
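
As a rough illustration of self-consistency voting for SQL, the sketch below executes several candidate queries and keeps one whose result set is the most common. It is not the SQL-PaLM implementation or the repo's submission. The `vote_on_sql` helper, the `orders` table, and the candidate queries are hypothetical, and the example uses an in-memory SQLite database.

```python
import sqlite3
from collections import Counter


def vote_on_sql(conn, candidates):
    """Return the first candidate whose result set wins the vote.

    Candidates that fail to execute get no vote. Rows are sorted before
    comparison, so this sketch ignores ORDER BY differences.
    """
    executed = []
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # invalid SQL is dropped
        executed.append((sql, tuple(sorted(rows, key=repr))))
    if not executed:
        return None
    winning_result, _ = Counter(key for _, key in executed).most_common(1)[0]
    return next(sql for sql, key in executed if key == winning_result)


# Hypothetical database and candidates for "What is the total order amount?"
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])

candidates = [
    "SELECT SUM(amount) FROM orders",
    "SELECT SUM(amount) FROM orders WHERE id > 0",
    "SELECT COUNT(*) FROM orders",   # runs, but answers a different question
    "SELECT SUM(amt) FROM orders",   # fails: no such column
]

best = vote_on_sql(conn, candidates)
print(best)
```

Voting on execution results rather than on query text lets differently phrased but equivalent queries reinforce each other.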

Document extraction (+72% F1)
Pulling structured data from long documents. Paper Lantern found a retrieve-then-validate pipeline from papers published weeks before the experiment (BEAVER, PAVE)

Try it

npx paperlantern@latest

Setup guide and docs at paperlantern.ai/code

All results

Task	Baseline	With PL	Delta (relative)	What PL surfaced
Document extraction	0.444	0.764	+72%	BEAVER section selection + PAVE validation
PDF extraction	0.318	0.572	+80%	Section-level decomposition + self-verification (PARSE, Deep Reflective Reasoning)
Test generation	0.625	0.870	+39%	Mutation-aware prompting via AST analysis (MuTAP, MUTGEN)
Text classification	0.505	0.666	+32%	Retrieval-first classification (Retrieval-ICL, LLM-Select-P)
Prompt examples	0.193	0.324	+68%	MMR diversity + hierarchical parent-then-child prompting
Text-to-SQL	0.650	0.690	+6%	Self-consistency voting (SQL-PaLM, MCS-SQL)
Code review	0.351	0.395	+13%	Consensus aggregation - 3 passes, majority vote (Multi-Review Aggregation)
LLM routing	0.744	0.761	+2%	Cross-validated model selection (CARGO)
LLM-as-judge	0.623	0.633	+2%	Dimension-specific multi-pass evaluation (HypoEval, PAIRS)

The biggest gains came when Paper Lantern surfaced an architecturally different approach - not a better prompt, but a different way to structure the solution. The smaller gains came on tasks where the baseline was already structurally sound. We're showing all 9 because that's the honest picture.
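
The delta column in the table above is the relative improvement over the baseline score, not a difference in absolute points. The short check below recomputes it from the two score columns. The `RESULTS` dictionary just copies the table and `relative_delta` is a helper name introduced here.

```python
# (baseline, with_pl) scores copied from the results table.
RESULTS = {
    "Document extraction": (0.444, 0.764),
    "PDF extraction": (0.318, 0.572),
    "Test generation": (0.625, 0.870),
    "Text classification": (0.505, 0.666),
    "Prompt examples": (0.193, 0.324),
    "Text-to-SQL": (0.650, 0.690),
    "Code review": (0.351, 0.395),
    "LLM routing": (0.744, 0.761),
    "LLM-as-judge": (0.623, 0.633),
}


def relative_delta(baseline, with_pl):
    """Improvement as a fraction of the baseline score."""
    return (with_pl - baseline) / baseline


for task, (baseline, with_pl) in RESULTS.items():
    absolute = with_pl - baseline
    print(f"{task:<22} {relative_delta(baseline, with_pl):+.0%}  ({absolute:+.3f} absolute)")
```

For example, document extraction moves from 0.444 to 0.764, which is +0.320 in absolute F1 and +72% relative to the baseline.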

How the experiments work

The baseline and with_pl prompts are identical except for one section: baseline says "use your own knowledge," and with_pl says "use Paper Lantern MCP tools to research first." Same agent, same model (Gemini Flash 3), same data, same evaluation script. Any difference downstream of the prompt came from the agent's own choices.

Every prompt, every line of agent code, every prediction is in this repo. Diff them yourself:

diff experiments/test_generation/prompts/baseline.md experiments/test_generation/prompts/with_pl.md
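
If you would rather inspect the difference from Python, a sketch along these lines works with the standard library's difflib. The `prompt_diff` and `changed_lines` helpers are names introduced here, and the two sample prompts are invented stand-ins that only borrow the two quoted instructions, not the real prompt files.

```python
import difflib


def prompt_diff(baseline_text, with_pl_text):
    """Unified diff between two prompt texts, as a list of lines."""
    return list(
        difflib.unified_diff(
            baseline_text.splitlines(),
            with_pl_text.splitlines(),
            fromfile="baseline.md",
            tofile="with_pl.md",
            lineterm="",
        )
    )


def changed_lines(diff):
    """Only the added/removed lines, skipping the two file-header lines."""
    return [line for line in diff[2:] if line.startswith(("+", "-"))]


# Invented sample prompts; the real ones live under experiments/<challenge>/prompts/.
sample_baseline = "# Task\nWrite tests.\nUse your own knowledge.\n"
sample_with_pl = "# Task\nWrite tests.\nUse Paper Lantern MCP tools to research first.\n"

diff = prompt_diff(sample_baseline, sample_with_pl)
print("\n".join(diff))
```

To run it on the committed prompts, read the two files with `pathlib.Path(...).read_text()` and pass the contents to `prompt_diff`.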

Run it yourself

Results are already committed, so no API keys are needed to browse them. To reproduce a run yourself:

git clone https://github.com/paperlantern-ai/paper-lantern-challenges.git
cd paper-lantern-challenges
uv sync

uv run python run.py setup test_generation
uv run python run.py run test_generation baseline
uv run python run.py run test_generation with_pl
uv run python run.py compare test_generation

Prerequisites

uv - curl -LsSf https://astral.sh/uv/install.sh | sh
Claude Code - or any coding agent
Gemini API key - export GEMINI_API_KEY=your_key
Paper Lantern (for with_pl runs) - setup instructions at paperlantern.ai/code

CLI reference

uv run python run.py list                                          # Show available challenges
uv run python run.py setup <challenge>                             # Download data
uv run python run.py run <challenge> <submission>                  # Run + evaluate
uv run python run.py run <challenge> <submission> --dataset cfpb   # Run one dataset
uv run python run.py eval <challenge> <submission>                 # Evaluate only
uv run python run.py compare <challenge>                           # Side-by-side comparison

Repo structure

experiments/<challenge>/
├── CHALLENGE.md          # Problem description
├── evaluate.py           # Scoring script
├── setup.py              # Data download
├── starter/              # Starter code
├── prompts/
│   ├── baseline.md       # Exact prompt we gave the baseline agent
│   └── with_pl.md        # Exact prompt we gave the PL agent
├── data/                 # Challenge data (after setup)
└── submissions/
    ├── baseline/
    │   ├── *.py              # Code the agent wrote
    │   ├── predictions.jsonl
    │   └── approach.md       # Strategy, decisions, reasoning
    └── with_pl/
        ├── *.py
        ├── predictions.jsonl
        └── approach.md       # Strategy, decisions, papers cited
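
Given that layout, a small script can sanity-check what each submission produced. The sketch below counts the non-empty lines in every `predictions.jsonl` under one challenge. `count_predictions` is a hypothetical helper, not part of `run.py`, and it assumes nothing about the record schema beyond one record per line.

```python
from pathlib import Path


def count_predictions(challenge_dir):
    """Map each submission name to its number of prediction records.

    Expects <challenge_dir>/submissions/<name>/predictions.jsonl, as in the
    tree above. Blank lines are not counted.
    """
    counts = {}
    pattern = "submissions/*/predictions.jsonl"
    for pred_file in sorted(Path(challenge_dir).glob(pattern)):
        with pred_file.open(encoding="utf-8") as fh:
            counts[pred_file.parent.name] = sum(1 for line in fh if line.strip())
    return counts


if __name__ == "__main__":
    # Path taken from the examples in this README; adjust to the challenge you want.
    print(count_predictions("experiments/test_generation"))
```

If the baseline and with_pl counts differ, the two runs did not predict on the same set of examples, which is worth checking before comparing scores.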

License

MIT
