A comprehensive repository dedicated to evaluating academic paper search engines. It provides test search queries and evaluation metrics.
The test search queries are distributed across many categories to enable fine-grained evaluations (see below).
The test sets are also available as Hugging Face Datasets: https://huggingface.co/paperlantern
- test_sets/computer_science_ai_search_queries.json - 200 search queries
  - covering AI, machine learning, and related computer science domains
  - including research areas like neural networks, reinforcement learning, computer vision, and natural language processing
- test_sets/computer_science_non_ai_search_queries.json - 200 search queries
  - covering all non-AI computer science domains
  - including research areas like algorithms, systems, theory, security, databases, cryptography, and software engineering
The test sets systematically cover multiple dimensions of academic search:
| Dimension | Value |
|---|---|
| Query Types | Natural Language Queries |
| | Specific methodologies/techniques |
| | Major concepts/theories |
| | How To |
| | Highly technical terminology |
| Query Length | Few words |
| | Sentence |
| | Multi-sentence |
| Specificity Levels | Very specific |
| | Focused |
| | Broad |
| Research Stages | Starting research |
| | Literature review |
| | Implementation focused |
| | Seeking comparisons |
| | Looking for gaps |
| Problem Framing | Problem to solve |
| | Technique to learn |
| | Comparison to make |
| | Gap to identify |
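These dimensions make it possible to slice a test set for fine-grained evaluation. The following is a minimal Python sketch, assuming the JSON layout this README documents for each test set (a top-level object keyed by query ID, where each entry holds a `settings` object and a `search_query` string). The helper names `load_test_set`, `filter_queries`, and `count_by_dimension` are hypothetical, and the `sample` data is illustrative only.

```python
import json
from collections import Counter


def load_test_set(path):
    """Read one test set file into a dict keyed by query ID."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)


def filter_queries(test_set, **criteria):
    """Keep entries whose settings equal every given key=value criterion."""
    return {
        query_id: entry
        for query_id, entry in test_set.items()
        if all(entry["settings"].get(key) == value for key, value in criteria.items())
    }


def count_by_dimension(test_set, dimension):
    """Count queries per value of one settings field, e.g. 'research_stage'."""
    return Counter(entry["settings"].get(dimension) for entry in test_set.values())


# Tiny in-memory stand-in for a loaded test set (values are illustrative only).
# With the real files: load_test_set("test_sets/computer_science_ai_search_queries.json")
sample = {
    "query_0": {
        "settings": {"length": "Multi-sentence", "specificity_level": "Focused"},
        "search_query": "example query A",
    },
    "query_1": {
        "settings": {"length": "Few words", "specificity_level": "Broad"},
        "search_query": "example query B",
    },
}

focused = filter_queries(sample, specificity_level="Focused")
print(sorted(focused))  # ['query_0']
print(count_by_dimension(sample, "length"))
```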
Each test set is a JSON object keyed by query ID (query_0, query_1, ...). Each key maps to a search query and its metadata:

    {
      "query_0": {
        "settings": {
          "query_type": "Niche Areas",
          "length": "Multi-sentence",
          "problem_framing": "Technique to learn",
          "specificity_level": "Focused",
          "research_stage": "Seeking comparisons"
        },
        "search_query": "Compare different architectural frameworks for large language model agents, focusing on their respective strengths in complex task execution. What are the trade-offs between single-LLM and multi-agent systems for real-world problem-solving, particularly concerning efficiency and robustness?"
      },
      "query_1": { ... },
      ...
    }

There is no existing approach for reliably evaluating the quality of academic paper search. We therefore created an easy-to-use Query-Paper Relevance Score that combines:
- 🧠 Semantic Understanding - Focuses on research intent and conceptual alignment, not keyword overlap
- 🔍 Relationship Directionality - Distinguishes subject vs object roles ("A influences B" ≠ "B influences A")
- 📚 Domain Expertise - Recognizes that technical terms have domain-specific meanings in academic contexts
- ⚖️ Nuanced Distinctions - 6-point rubric captures "directly addresses" vs "same area" vs "tangential"
- 🎯 Implicit Relationships - Considers both explicit mentions and implicit connections
For a given search engine and search query, we compute the Query-Paper Relevance Score for each (query, paper title, paper abstract) triplet formed from the papers that the search engine returns. To do this, we prompt an LLM with a detailed system prompt and pass the (search query, paper title, paper abstract) triplet as the prompt.
The system prompt defines a 0-to-5 Likert scale with rubrics that capture subtle but critical distinctions between papers that directly address a query and those that are merely in the same general area. This enables more precise differentiation of retrieval quality across diverse academic search scenarios. We multiply the returned relevance score by 20 to report it on a more intuitive 0-100 scale.
The system prompt also asks the LLM to produce a confidence score (0-10) and a one-sentence summary of its reasoning. We found that asking for these additional outputs significantly improved the quality and consistency of the reported Query-Paper Relevance Score.
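The per-triplet scoring step can be sketched as follows. This is a minimal illustration, not the repository's implementation: the layout produced by `build_user_prompt` is an assumption (the actual prompt format is defined by the system prompt, which is not reproduced here), both function names are hypothetical, and the JSON field names follow the sample output shown in this README. The score is only checked to be numeric, since the valid range depends on whether the model returns the raw or the scaled value.

```python
import json


def build_user_prompt(search_query, title, abstract):
    """Lay out one triplet as the prompt text; this exact layout is an assumption."""
    return (
        f"Search query: {search_query}\n"
        f"Paper title: {title}\n"
        f"Paper abstract: {abstract}"
    )


def _is_number(value):
    """True for int/float values, excluding booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def parse_judge_reply(reply_text):
    """Extract score, confidence, and summary from the judge's JSON reply."""
    payload = json.loads(reply_text)["paper_query_relevance"]
    score = payload["relevanceScore"]
    confidence = payload["confidenceLevel"]
    summary = payload["summaryStatement"]
    if not _is_number(score):
        raise ValueError(f"relevanceScore is not numeric: {score!r}")
    if not _is_number(confidence) or not 0 <= confidence <= 10:
        raise ValueError(f"confidenceLevel must be in 0-10: {confidence!r}")
    if not isinstance(summary, str) or not summary.strip():
        raise ValueError("summaryStatement must be a non-empty string")
    return {"score": score, "confidence": confidence, "summary": summary}
```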
| Score | Level | Description |
|---|---|---|
| 0 | No Relevance | No meaningful scholarly connection |
| 20 | Tangential Relevance | Minimal substantive connection |
| 40 | Peripheral Treatment | Secondary discussion of topic |
| 60 | Substantial Coverage | Significant component of paper's scope |
| 80 | Primary Topical Focus | Central theme aligns with query domain |
| 100 | Direct Correspondence | Paper directly addresses the research question |
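The scaling and rubric lookup can be sketched in a few lines of Python. The function names are hypothetical, and averaging per-paper scores into a single number per query is an assumption: this README does not specify an aggregation rule.

```python
# Rubric levels keyed by the scaled (0-100) score, copied from the table above.
LEVELS = {
    0: "No Relevance",
    20: "Tangential Relevance",
    40: "Peripheral Treatment",
    60: "Substantial Coverage",
    80: "Primary Topical Focus",
    100: "Direct Correspondence",
}


def scale_score(raw_score):
    """Convert a raw 0-5 Likert score to the reported 0-100 scale."""
    if isinstance(raw_score, bool) or not isinstance(raw_score, int):
        raise ValueError(f"raw score must be an integer: {raw_score!r}")
    if not 0 <= raw_score <= 5:
        raise ValueError(f"raw score must be in 0-5: {raw_score!r}")
    return raw_score * 20


def level_name(scaled_score):
    """Look up the rubric level for a scaled score (0, 20, ..., 100)."""
    return LEVELS[scaled_score]


def mean_scaled_score(raw_scores):
    """Average the scaled scores of one query's results (aggregation is an assumption)."""
    if not raw_scores:
        raise ValueError("no scores to average")
    scaled = [scale_score(raw) for raw in raw_scores]
    return sum(scaled) / len(scaled)


print(scale_score(4), level_name(scale_score(4)))  # 80 Primary Topical Focus
print(mean_scaled_score([3, 5]))                   # 80.0
```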
Settings for the LLM call:

    llm: gemini-2.5-flash-lite-preview-06-17  # can change to other models
    max_tokens: 4096
    temperature: 0.01  # Critical: ensures valid JSON output
    thinking_budget: 0  # Summary statement provides sufficient reasoning

Example output:

    {
      "paper_query_relevance": {
        "relevanceScore": 84,
        "confidenceLevel": 9,
        "summaryStatement": "Paper's central focus on transformer attention mechanisms directly relates to query about long sequence implementation."
      }
    }

We welcome contributions to improve this research resource! Please:
- Submit issues for additional query types or academic domains
- Propose new metadata dimensions for richer evaluation
- Share evaluation results and insights using this dataset
- Suggest improvements to the LLM evaluation methodology
To cite this dataset:

    @dataset{academic_paper_search_benchmarks_2025,
      title={Academic Paper Search Benchmarks 2025},
      author={Paper Lantern},
      year={2025},
      url={https://github.com/paperlantern-ai/academic_paper_search_benchmarks}
    }

License: Apache 2.0
🏮 Paper Lantern: https://paperlantern.ai/
🔗 Hugging Face Datasets: https://huggingface.co/paperlantern
📧 Contact: contact@paperlantern.ai
