paperlantern-ai/paper-lantern-challenges

Paper Lantern Challenges

Your coding agent makes technical decisions. Paper Lantern reads hundreds of research papers per decision and turns them into implementation guidance.

We ran 9 everyday tasks with the same agent, the same model, and the same data. The only difference: one agent could call Paper Lantern before building its solution. Percentages are relative improvements over the baseline score.

Test generation (+39% mutation score)
Writing tests that actually catch bugs. Paper Lantern found a technique that targets specific mutation points instead of guessing at coverage (MuTAP)
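
Mutation score is conventionally the fraction of injected faults (mutants) that a test suite detects. The sketch below illustrates that metric only; it is not MuTAP and not this repo's evaluate.py. The function `clamp`, the three hand-written mutants, and both test suites are hypothetical examples.

```python
def clamp(x, lo, hi):
    """Original function under test (hypothetical example)."""
    return max(lo, min(x, hi))


# Hand-written mutants: each one alters a single piece of clamp's logic.
def mutant_swap_min_max(x, lo, hi):
    return min(lo, max(x, hi))


def mutant_drop_upper(x, lo, hi):
    return max(lo, x)


def mutant_drop_lower(x, lo, hi):
    return min(x, hi)


MUTANTS = [mutant_swap_min_max, mutant_drop_upper, mutant_drop_lower]


def weak_suite(f):
    """Checks only an in-range input, so boundary mutants survive."""
    return f(5, 0, 10) == 5


def strong_suite(f):
    """Also checks both boundaries, which is where the mutants differ."""
    return f(5, 0, 10) == 5 and f(-3, 0, 10) == 0 and f(42, 0, 10) == 10


def mutation_score(suite, mutants):
    """Fraction of mutants on which the suite fails (mutants 'killed')."""
    mutants = list(mutants)
    killed = sum(1 for m in mutants if not suite(m))
    return killed / len(mutants)
```

Here `mutation_score(weak_suite, MUTANTS)` is 1/3 and `mutation_score(strong_suite, MUTANTS)` is 1.0: tests aimed at the exact points where mutants diverge kill more of them than tests that only exercise the happy path.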

Code review (+13% F1)
Finding real issues in PRs, not hallucinated ones. Paper Lantern surfaced a multi-pass strategy - real bugs show up every time, false positives don't (Multi-Review Aggregation)
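
The multi-pass idea can be sketched as a majority vote over independent review passes. This is a minimal illustration, not the agent's code from the repo: the `(file, line, category)` finding key and the function name `aggregate_reviews` are assumptions made for the example.

```python
from collections import Counter
from typing import Optional


def aggregate_reviews(passes, min_votes: Optional[int] = None):
    """Keep findings reported by a majority of review passes.

    passes: list of sets of findings; each finding is a hashable key,
            here a hypothetical (file, line, category) tuple.
    min_votes: votes required; defaults to a strict majority.
    """
    if min_votes is None:
        min_votes = len(passes) // 2 + 1
    # Each pass is a set, so a finding gets at most one vote per pass.
    votes = Counter(finding for p in passes for finding in set(p))
    return sorted(f for f, n in votes.items() if n >= min_votes)


# Three passes over the same PR: one issue recurs, two appear only once.
passes = [
    {("a.py", 10, "null-deref"), ("a.py", 22, "style")},
    {("a.py", 10, "null-deref")},
    {("a.py", 10, "null-deref"), ("b.py", 5, "race")},
]
consensus = aggregate_reviews(passes)
```

With three passes the default threshold is two votes, so only `("a.py", 10, "null-deref")` survives. How findings are matched across passes (exact line, fuzzy span, normalized category) is a design choice this sketch does not settle.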

Text-to-SQL (+6% accuracy)
Generating SQL that returns the right answer. Paper Lantern surfaced 5 candidate approaches; the winner was self-consistency voting (SQL-PaLM)
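
Self-consistency voting for SQL is commonly implemented by executing several candidate queries and keeping one whose result agrees with the most other candidates. The sketch below assumes SQLite and hard-coded candidate strings standing in for sampled model outputs; `vote_on_sql`, `make_demo_db`, and the `orders` table are hypothetical names, not taken from the repo.

```python
import sqlite3


def vote_on_sql(conn, candidates):
    """Return a candidate query from the largest group of agreeing results.

    Queries that fail to execute are discarded. Results are compared
    ignoring row order. Ties go to the group seen first.
    """
    groups = {}
    for sql in candidates:
        try:
            rows = conn.execute(sql).fetchall()
        except sqlite3.Error:
            continue  # invalid SQL gets no vote
        key = tuple(sorted(rows, key=repr))
        groups.setdefault(key, []).append(sql)
    if not groups:
        return None
    return max(groups.values(), key=len)[0]


def make_demo_db():
    """Small in-memory database used only for this illustration."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE orders (id INTEGER, amount INTEGER)")
    conn.executemany("INSERT INTO orders VALUES (?, ?)", [(1, 10), (2, 20), (3, 30)])
    return conn


candidates = [
    "SELECT SUM(amount) FROM orders",
    "SELECT COUNT(amount) FROM orders",       # wrong aggregate
    "SELECT SUM(o.amount) FROM orders AS o",  # agrees with the first
    "SELECT SUM(amount) FROM order_items",    # no such table
]
winner = vote_on_sql(make_demo_db(), candidates)
```

Two of the four candidates return the same result, so the first of them is selected. Agreement is evidence of correctness, not proof: if most samples share the same mistake, the vote will pick it.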

Document extraction (+72% F1)
Pulling structured data from long documents. Paper Lantern found a retrieve-then-validate pipeline from papers published weeks before the experiment (BEAVER, PAVE)
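
The general shape of a retrieve-then-validate pipeline can be sketched in a few functions: pick the most relevant section, extract from that section only, then reject any value that is not grounded in the source text. This is not an implementation of BEAVER or PAVE. The keyword scoring, the regular expression (which stands in for a model call), and every name below are assumptions for illustration.

```python
import re


def select_sections(sections, keywords, top_k=1):
    """Rank section names by how often the keywords occur in their text."""
    def score(name):
        text = sections[name].lower()
        return sum(text.count(k.lower()) for k in keywords)
    return sorted(sections, key=score, reverse=True)[:top_k]


def extract_total(text):
    """Stand-in extractor; a real pipeline would call a model here."""
    m = re.search(r"total due:\s*\$?([\d,]+\.\d{2})", text, flags=re.IGNORECASE)
    return m.group(1) if m else None


def is_grounded(value, source_text):
    """Validation step: the extracted value must appear verbatim in the source."""
    return value is not None and value in source_text


def extract_field(sections, keywords):
    """Retrieve a section, extract from it, and keep only validated values."""
    for name in select_sections(sections, keywords, top_k=1):
        value = extract_total(sections[name])
        if is_grounded(value, sections[name]):
            return {"section": name, "value": value}
    return None


doc = {
    "intro": "This agreement covers services.",
    "billing": "Invoice details. Total due: $1,234.50 by March 1.",
    "legal": "Governing law applies.",
}
result = extract_field(doc, ["total", "invoice"])
```

Restricting extraction to the retrieved section keeps the extractor's input short, and the grounding check gives a cheap way to drop values that were never in the document.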

Try it

npx paperlantern@latest

Setup guide and docs at paperlantern.ai/code

All results

| Task | Baseline | With PL | Delta | What PL surfaced |
| --- | --- | --- | --- | --- |
| Document extraction | 0.444 | 0.764 | +72% | BEAVER section selection + PAVE validation |
| PDF extraction | 0.318 | 0.572 | +80% | Section-level decomposition + self-verification (PARSE, Deep Reflective Reasoning) |
| Test generation | 0.625 | 0.870 | +39% | Mutation-aware prompting via AST analysis (MuTAP, MUTGEN) |
| Text classification | 0.505 | 0.666 | +32% | Retrieval-first classification (Retrieval-ICL, LLM-Select-P) |
| Prompt examples | 0.193 | 0.324 | +68% | MMR diversity + hierarchical parent-then-child prompting |
| Text-to-SQL | 0.650 | 0.690 | +6% | Self-consistency voting (SQL-PaLM, MCS-SQL) |
| Code review | 0.351 | 0.395 | +13% | Consensus aggregation - 3 passes, majority vote (Multi-Review Aggregation) |
| LLM routing | 0.744 | 0.761 | +2% | Cross-validated model selection (CARGO) |
| LLM-as-judge | 0.623 | 0.633 | +2% | Dimension-specific multi-pass evaluation (HypoEval, PAIRS) |
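
The Delta column matches the relative change from the baseline score, (with_pl - baseline) / baseline, rounded to a whole percent. The short check below recomputes it from the Baseline and With PL columns; the scores are copied from the table, while the names `RESULTS` and `format_delta` are introduced here for illustration only.

```python
# (baseline, with_pl) scores copied from the results table.
RESULTS = {
    "Document extraction": (0.444, 0.764),
    "PDF extraction": (0.318, 0.572),
    "Test generation": (0.625, 0.870),
    "Text classification": (0.505, 0.666),
    "Prompt examples": (0.193, 0.324),
    "Text-to-SQL": (0.650, 0.690),
    "Code review": (0.351, 0.395),
    "LLM routing": (0.744, 0.761),
    "LLM-as-judge": (0.623, 0.633),
}


def relative_delta(baseline, with_pl):
    """Relative improvement over the baseline, as a fraction."""
    return (with_pl - baseline) / baseline


def format_delta(baseline, with_pl):
    """Format the relative improvement as a signed whole percent."""
    return f"{relative_delta(baseline, with_pl) * 100:+.0f}%"


deltas = {task: format_delta(b, w) for task, (b, w) in RESULTS.items()}
```

So +72% for document extraction means 0.764 is about 1.72 times 0.444, not a 72-point gain on a 0-to-1 scale.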

The biggest gains came when Paper Lantern surfaced an architecturally different approach - not a better prompt, but a different way to structure the solution. The smaller gains came on tasks where the baseline was already structurally sound. We're showing all 9 because that's the honest picture.
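
One example of a structural change from the results is MMR (maximal marginal relevance), listed for the prompt-examples task. MMR picks items that are relevant to a query while penalizing similarity to items already chosen. The sketch below is a generic MMR over toy 2-D vectors, not the agent's submission; the vectors, the `lam` trade-off parameter, and the function names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mmr_select(query, candidates, k, lam=0.5):
    """Greedily pick k candidate indices by maximal marginal relevance.

    score(i) = lam * sim(query, i) - (1 - lam) * max sim(i, already selected)
    lam = 1.0 reduces to plain relevance ranking.
    """
    selected = []
    remaining = list(range(len(candidates)))
    while remaining and len(selected) < k:
        def score(i):
            relevance = cosine(query, candidates[i])
            redundancy = max(
                (cosine(candidates[i], candidates[j]) for j in selected),
                default=0.0,
            )
            return lam * relevance - (1 - lam) * redundancy
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


query = [1.0, 0.0]
examples = [[1.0, 0.1], [1.0, 0.12], [0.8, -0.6]]  # first two are near-duplicates
diverse = mmr_select(query, examples, k=2, lam=0.5)
by_relevance = mmr_select(query, examples, k=2, lam=1.0)
```

Pure relevance ranking returns the two near-duplicates (indices 0 and 1), while `lam=0.5` swaps the second for the less similar example (indices 0 and 2), which is the diversity effect MMR is used for when choosing few-shot examples.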

How the experiments work

The baseline and with_pl prompts are identical except for one section - baseline says "use your own knowledge," with_pl says "use Paper Lantern MCP tools to research first." Same agent, same model (Gemini Flash 3), same data, same evaluation script. Everything else downstream was the agent's doing.

Every prompt, every line of agent code, every prediction is in this repo. Diff them yourself:

diff experiments/test_generation/prompts/baseline.md experiments/test_generation/prompts/with_pl.md
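
The same comparison can be done from Python with the standard library's difflib, which is handy when checking several challenges in a loop. This is a minimal sketch: `diff_prompts` is a hypothetical helper, not part of run.py, and it assumes the prompt files are UTF-8 text.

```python
import difflib
from pathlib import Path


def diff_prompts(baseline_path, with_pl_path):
    """Return unified-diff lines between two prompt files (empty if identical)."""
    a = Path(baseline_path).read_text(encoding="utf-8").splitlines()
    b = Path(with_pl_path).read_text(encoding="utf-8").splitlines()
    return list(
        difflib.unified_diff(
            a,
            b,
            fromfile=str(baseline_path),
            tofile=str(with_pl_path),
            lineterm="",
        )
    )
```

From the repo root, `print("\n".join(diff_prompts("experiments/test_generation/prompts/baseline.md", "experiments/test_generation/prompts/with_pl.md")))` prints the lines that differ between the two prompts.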

Run it yourself

Results are already committed - no API keys needed to browse. But if you want to verify:

git clone https://github.com/paperlantern-ai/paper-lantern-challenges.git
cd paper-lantern-challenges
uv sync

uv run python run.py setup test_generation
uv run python run.py run test_generation baseline
uv run python run.py run test_generation with_pl
uv run python run.py compare test_generation

Prerequisites

  • uv - curl -LsSf https://astral.sh/uv/install.sh | sh
  • Claude Code - or any coding agent
  • Gemini API key - export GEMINI_API_KEY=your_key
  • Paper Lantern (for with_pl runs) - setup instructions at paperlantern.ai/code

CLI reference

uv run python run.py list                                          # Show available challenges
uv run python run.py setup <challenge>                             # Download data
uv run python run.py run <challenge> <submission>                  # Run + evaluate
uv run python run.py run <challenge> <submission> --dataset cfpb   # Run one dataset
uv run python run.py eval <challenge> <submission>                 # Evaluate only
uv run python run.py compare <challenge>                           # Side-by-side comparison

Repo structure

experiments/<challenge>/
├── CHALLENGE.md          # Problem description
├── evaluate.py           # Scoring script
├── setup.py              # Data download
├── starter/              # Starter code
├── prompts/
│   ├── baseline.md       # Exact prompt we gave the baseline agent
│   └── with_pl.md        # Exact prompt we gave the PL agent
├── data/                 # Challenge data (after setup)
└── submissions/
    ├── baseline/
    │   ├── *.py              # Code the agent wrote
    │   ├── predictions.jsonl
    │   └── approach.md       # Strategy, decisions, reasoning
    └── with_pl/
        ├── *.py
        ├── predictions.jsonl
        └── approach.md       # Strategy, decisions, papers cited
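
Each submission stores its outputs in predictions.jsonl. Assuming that file follows the usual JSON Lines convention (one JSON value per line), it can be loaded for inspection as sketched below. The record schema is not documented here, so the helper makes no assumptions about field names; `load_jsonl` is a hypothetical name, not part of the repo.

```python
import json
from pathlib import Path


def load_jsonl(path):
    """Load a JSON Lines file into a list, skipping blank lines."""
    records = []
    with Path(path).open(encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            line = line.strip()
            if not line:
                continue
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                # Report the offending line so a bad record is easy to find.
                raise ValueError(f"{path}:{line_no}: invalid JSON") from exc
    return records
```

For example, `len(load_jsonl("experiments/test_generation/submissions/baseline/predictions.jsonl"))` gives the number of predictions in the baseline run once the repo is cloned.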

License

MIT
