Global History Bot - LLM Evaluation

Automated testing and evaluation framework for a Global History chatbot using promptfoo. Tests multiple LLMs for accuracy, tone, factual grounding, response time, and cost efficiency.

🎯 Project Overview

This project evaluates AI-powered history bots across different providers to ensure:

- Factual accuracy in historical responses
- Professional and friendly tone
- Concise, fact-based answers
- Acceptable response latency (< 12 seconds)
- Cost-effective API usage

🤖 Models Tested

| Provider | Model | Purpose |
|----------|-------|---------|
| OpenAI | GPT-4o | High-quality, detailed responses |
| OpenAI | GPT-4o-mini | Fast, cost-effective responses |
| Anthropic | Claude Sonnet 4 | Balanced performance and accuracy |
| Anthropic | Claude Opus 4 | Most capable historical analysis |
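
The four models above could be declared in promptfooconfig.yaml roughly as follows. This is a sketch, not the project's actual config: the Anthropic model IDs in particular are assumptions (dated snapshot names change over time), so check them against your provider account before running.

```yaml
# promptfooconfig.yaml (providers section only)
providers:
  - openai:gpt-4o                                # high-quality, detailed responses
  - openai:gpt-4o-mini                           # fast, cost-effective responses
  - anthropic:messages:claude-sonnet-4-20250514  # assumed ID for Claude Sonnet 4
  - anthropic:messages:claude-opus-4-20250514    # assumed ID for Claude Opus 4
```

Every prompt and test case is run against each provider in this list, which is what produces the side-by-side comparison.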

🚀 Quick Start

Prerequisites

# Install Node.js (if not already installed)
# Download from: https://nodejs.org

# Install promptfoo globally
npm install -g promptfoo

# Verify installation
promptfoo --version

Setup

Clone the repository:

git clone https://github.com/YOUR_USERNAME/global-history-bot.git
cd global-history-bot

Set up API keys:

Create a .env file in the project root:

OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxxx

⚠️ Security Notice: Never commit your .env file! It's already in .gitignore.

Run evaluation:

promptfoo eval

View results in browser:

promptfoo view

📝 Test Scenarios

1. African Civilization Question

Question: "What was the first African Civilization to inhabit Europe?"

Evaluation Method: Model-graded closed QA

Criteria: Response must be friendly and professional

Expected: Mentions the Moors, North African influence, or clarifies the historical context
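
A test case along these lines might look like the sketch below. The description text is illustrative, and the assertion value simply restates the criterion above; the real config may word it differently.

```yaml
tests:
  - description: African civilization question   # illustrative label
    vars:
      question: "What was the first African Civilization to inhabit Europe?"
    assert:
      # The grader model returns pass/fail against this criterion.
      - type: model-graded-closedqa
        value: Response must be friendly and professional
```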

2. Women in Military (WWI & WWII)

Question: "How many women served in the military during WW1 and WW2?"

Evaluation Method: LLM rubric grading

Criteria: Response is rooted in facts and is concise

Expected: Provides specific numbers or ranges (e.g., ~350,000 American women in WWII)
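
Expressed as a promptfoo test case, this scenario could be sketched as follows. The description is illustrative, and the rubric text is taken directly from the criteria above.

```yaml
tests:
  - description: Women in military (WWI and WWII)   # illustrative label
    vars:
      question: "How many women served in the military during WW1 and WW2?"
    assert:
      # The grader model judges the output against this free-form rubric.
      - type: llm-rubric
        value: Response is rooted in facts and is concise
```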

3. Performance Testing

Metric: Response time and cost tracking

Assertions:

- Latency threshold: 12 seconds maximum
- Cost threshold: $0.08 per request maximum
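
One way to apply both thresholds to every test is through defaultTest, as in this sketch. Placing them under defaultTest is an assumption about the project's layout; they could equally be attached to individual tests.

```yaml
defaultTest:
  assert:
    - type: latency
      threshold: 12000   # milliseconds (12 seconds)
    - type: cost
      threshold: 0.08    # US dollars per request
```

Latency is only meaningful for live calls, so the evaluation likely needs to run with caching disabled (for example, promptfoo eval --no-cache) for this assertion to measure real response time.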

🧪 Evaluation Methods

Model-Graded Closed QA

- Uses GPT-4 to evaluate response quality
- Checks for tone, professionalism, and friendliness
- Binary pass/fail based on criteria
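
If the grader should be pinned explicitly rather than left to promptfoo's default, something like the following could work. The provider ID openai:gpt-4 is an assumption chosen to match the "GPT-4" mentioned above; the project may rely on the default grader instead.

```yaml
defaultTest:
  options:
    # Grader used by model-graded assertions
    # (model-graded-closedqa, llm-rubric).
    provider: openai:gpt-4   # assumed grader ID
```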

LLM Rubric

- Evaluates factual accuracy
- Checks for conciseness
- Ensures responses are grounded in historical facts

Performance Metrics

- Latency: Tracks API response time
- Cost: Monitors spending per request

📁 Project Structure

global-history-bot/
├── promptfooconfig.yaml      # Main test configuration
├── prompts/
│   ├── bad_bot.txt           # Basic prompt template
│   └── nice_bot.txt          # (Optional) Enhanced prompt
├── .env                      # API keys (NOT committed)
├── .gitignore               # Excludes sensitive files
├── README.md                # This file
├── output/                  # Test results (auto-generated)
└── .promptfoo/              # Cache (auto-generated)

🔧 Configuration Details

Prompts Used

Inline prompt:

You are a history bot. Tell me about {{question}}.

File-based prompts:

prompts/bad_bot.txt - Basic history bot
prompts/nice_bot.txt - Enhanced, friendly history bot

Variables in Tests

vars:
  question: "Your historical question here"
  topic: "Historical topic or time period"

Assertion Types

| Type | Purpose | Example |
|------|---------|---------|
| `model-graded-closedqa` | AI evaluates response quality | Checks for friendly tone |
| `llm-rubric` | AI checks against criteria | Verifies factual accuracy |
| `latency` | Response time tracking | Max 12 seconds |
| `cost` | API expense monitoring | Max $0.08 per request |

📊 Sample Results

After running promptfoo eval, you'll see results like:

Test Results:
✓ GPT-4o: African Civilization - PASS (friendly and professional)
✓ Claude Sonnet 4: Women in Military - PASS (factual and concise)
✗ GPT-4o-mini: Latency Test - FAIL (13.2s exceeded 12s threshold)
✓ All models: Cost Test - PASS (under $0.08)

View Detailed Results

Run promptfoo view to open an interactive dashboard showing:

- Side-by-side model comparisons
- Full response outputs
- Performance metrics
- Cost breakdown
- Pass/fail statistics

🎓 Example Questions to Test

Add these to your config for comprehensive testing:

- "Who was the first female pharaoh of Egypt?"
- "What caused the fall of the Roman Empire?"
- "Explain the significance of the Silk Road."
- "What were the main causes of World War I?"
- "Describe the Renaissance period in Europe."
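
As a rough sketch, these questions could be added as test cases sharing one rubric. Reusing the "rooted in facts and concise" rubric for all of them is an assumption; individual questions may deserve their own criteria.

```yaml
defaultTest:
  assert:
    - type: llm-rubric
      value: Response is rooted in facts and is concise

tests:
  - vars:
      question: "Who was the first female pharaoh of Egypt?"
  - vars:
      question: "What caused the fall of the Roman Empire?"
  - vars:
      question: "Explain the significance of the Silk Road."
  - vars:
      question: "What were the main causes of World War I?"
  - vars:
      question: "Describe the Renaissance period in Europe."
```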

🐛 Troubleshooting

"API key not found"

# Verify .env file exists
ls -la .env

# Check format (no spaces around =)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

"No such file or directory: prompts/bad_bot.txt"

# Create the prompts directory
mkdir -p prompts

# Create the prompt file
echo "You are a history bot. Answer questions about {{topic}}." > prompts/bad_bot.txt

"Invalid configuration file"

- Check YAML indentation (use spaces, not tabs)
- Ensure assert is at the same level as vars within each test
- Run promptfoo eval --dry-run to validate

Slow responses or timeouts

- The first run is slower because no responses are cached yet
- Historical questions may naturally take longer
- Raise the latency threshold (12000 ms) if needed
- Check your internet connection

Clear cache if getting stale results

promptfoo cache clear
promptfoo eval

📈 Improving Your History Bot

Tips for Better Results:

Add more context to prompts:

You are an expert historian specializing in {{topic}}. 
Provide accurate, well-sourced information in 2-3 sentences.

Test with edge cases:

- Controversial historical events
- Questions with multiple interpretations
- Requests for specific dates/numbers

Adjust thresholds based on results:

- Lower latency for simple facts
- Higher latency for complex analysis
- Balance cost vs quality
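
Per-test thresholds are one way to express this. In the sketch below the two threshold values are hypothetical and only illustrate the idea of a tighter budget for simple facts and a looser one for analysis.

```yaml
tests:
  - description: Simple fact, tighter latency budget
    vars:
      question: "Who was the first female pharaoh of Egypt?"
    assert:
      - type: latency
        threshold: 5000    # hypothetical value, in milliseconds
  - description: Complex analysis, looser latency budget
    vars:
      question: "What caused the fall of the Roman Empire?"
    assert:
      - type: latency
        threshold: 20000   # hypothetical value, in milliseconds
```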

Use multiple prompts:

prompts:
  - file://prompts/concise_bot.txt
  - file://prompts/detailed_bot.txt
  - file://prompts/educational_bot.txt

🔒 Security Best Practices

✅ Never commit API keys - Always use .env and .gitignore
✅ Rotate keys regularly for security
✅ Monitor API usage to avoid unexpected charges
✅ Use read-only keys when possible
⚠️ If keys are exposed, regenerate immediately at:
- OpenAI: https://platform.openai.com/api-keys
- Anthropic: https://console.anthropic.com/settings/keys

📚 Resources

🔄 Continuous Testing (Optional)

Run tests automatically on every push and pull request:

Create .github/workflows/test.yml:

name: History Bot Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
      - run: npm install -g promptfoo
      - run: promptfoo eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Note: Add API keys as GitHub Secrets in your repository settings.

🤝 Contributing

Contributions welcome! You can:

- Add new historical test questions
- Test additional LLM providers
- Improve evaluation criteria
- Share interesting findings
- Report bugs or issues

📊 Benchmark History

Track your improvements over time:

| Date | GPT-4o Avg | Claude S4 Avg | Best Model | Notes |
|------|------------|---------------|------------|-------|
| Feb 2026 | 3.2s | 2.8s | Claude S4 | Initial baseline |
| - | - | - | - | Update after changes |

📄 License

MIT License - Feel free to use for educational or commercial purposes!

📧 Contact & Support

- Issues: Open a GitHub issue
- Questions: Start a discussion
- Email: your.email@example.com

Project Status: ✅ Active Development
Last Updated: February 2026
Promptfoo Version: Latest
Maintained by: [Your Name]

🎯 Next Steps

- Add more historical periods (Ancient Rome, Medieval Europe, etc.)
- Test with domain-specific models
- Implement automated daily testing
- Create visualization dashboard
- Add multi-language support
