Global History Bot - LLM Evaluation

Automated testing and evaluation framework for a Global History chatbot using promptfoo. Tests multiple LLMs for accuracy, tone, factual grounding, response time, and cost efficiency.

🎯 Project Overview

This project evaluates AI-powered history bots across different providers to ensure:

  • Factual accuracy in historical responses
  • Professional and friendly tone
  • Concise, fact-based answers
  • Acceptable response latency (< 12 seconds)
  • Cost-effective API usage

🤖 Models Tested

| Provider | Model | Purpose |
|---|---|---|
| OpenAI | GPT-4o | High-quality, detailed responses |
| OpenAI | GPT-4o-mini | Fast, cost-effective responses |
| Anthropic | Claude Sonnet 4 | Balanced performance and accuracy |
| Anthropic | Claude Opus 4 | Most capable historical analysis |
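
In promptfooconfig.yaml, the four models above might be declared along these lines. This is a sketch rather than the repository's actual config: the dated Anthropic model IDs and the label values are assumptions and should be checked against each provider's current model list.

```yaml
# Sketch of a providers section for promptfooconfig.yaml.
# Model ID strings are assumptions; confirm them with each provider.
providers:
  - id: openai:gpt-4o
    label: GPT-4o
  - id: openai:gpt-4o-mini
    label: GPT-4o-mini
  - id: anthropic:messages:claude-sonnet-4-20250514
    label: Claude Sonnet 4
  - id: anthropic:messages:claude-opus-4-20250514
    label: Claude Opus 4
```

The label field only changes how each model is named in the results view.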

🚀 Quick Start

Prerequisites

# Install Node.js (if not already installed)
# Download from: https://nodejs.org

# Install promptfoo globally
npm install -g promptfoo

# Verify installation
promptfoo --version

Setup

  1. Clone the repository:
git clone https://github.com/YOUR_USERNAME/global-history-bot.git
cd global-history-bot
  2. Set up API keys:

Create a .env file in the project root:

OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxxx

⚠️ Security Notice: Never commit your .env file! It's already in .gitignore.

  3. Run evaluation:
promptfoo eval
  4. View results in browser:
promptfoo view

📝 Test Scenarios

1. African Civilization Question

Question: "What was the first African Civilization to inhabit Europe?"

Evaluation Method: Model-graded closed QA

Criteria: Response must be friendly and professional

Expected: Mentions the Moors, North African influence, or clarifies the historical context
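
As a rough sketch, this scenario could be written as the following test entry. The description text is an assumption added for readability; the question and criterion are taken from the scenario above.

```yaml
# Sketch: scenario 1 as a promptfoo test case.
tests:
  - description: African Civilization Question   # hypothetical label
    vars:
      question: "What was the first African Civilization to inhabit Europe?"
    assert:
      # A grading model decides whether the output meets the criterion.
      - type: model-graded-closedqa
        value: Response must be friendly and professional
```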


2. Women in Military (WWI & WWII)

Question: "How many women served in the military during WW1 and WW2?"

Evaluation Method: LLM rubric grading

Criteria: Response is rooted in facts and is concise

Expected: Provides specific numbers or ranges (e.g., roughly 350,000 American women in WWII)
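
A minimal sketch of this scenario as a test entry, assuming the rubric text is passed verbatim from the criteria above (the description is a hypothetical label):

```yaml
# Sketch: scenario 2 as a promptfoo test case.
tests:
  - description: Women in Military (WWI & WWII)   # hypothetical label
    vars:
      question: "How many women served in the military during WW1 and WW2?"
    assert:
      # The rubric is free-form text that a grading model scores against.
      - type: llm-rubric
        value: Response is rooted in facts and is concise
```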


3. Performance Testing

Metric: Response time and cost tracking

Assertions:

  • Latency threshold: 12 seconds maximum
  • Cost threshold: $0.08 per request maximum
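
One way to apply both limits to every test is to put them under defaultTest, as sketched below. Whether the repository's config uses defaultTest or repeats the assertions per test is an assumption.

```yaml
# Sketch: performance assertions shared by all tests.
defaultTest:
  assert:
    - type: latency
      threshold: 12000   # milliseconds (12 seconds)
    - type: cost
      threshold: 0.08    # US dollars per request
```

Latency checks are only meaningful on fresh API calls, so it is probably worth running promptfoo eval --no-cache when these assertions matter.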

🧪 Evaluation Methods

Model-Graded Closed QA

  • Uses GPT-4 to evaluate response quality
  • Checks for tone, professionalism, and friendliness
  • Binary pass/fail based on criteria
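
If you want the grading model to be explicit rather than left to promptfoo's default, it can be pinned in the config. The sketch below assumes openai:gpt-4o as the grader purely for illustration; it is not necessarily what this project uses.

```yaml
# Sketch: pin the model that grades model-graded assertions.
defaultTest:
  options:
    provider: openai:gpt-4o   # illustrative choice of grader
```

If defaultTest already exists in your config, add options under that same key instead of declaring it twice.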

LLM Rubric

  • Evaluates factual accuracy
  • Checks for conciseness
  • Ensures responses are grounded in historical facts

Performance Metrics

  • Latency: Tracks API response time
  • Cost: Monitors spending per request

📁 Project Structure

global-history-bot/
├── promptfooconfig.yaml      # Main test configuration
├── prompts/
│   ├── bad_bot.txt           # Basic prompt template
│   └── nice_bot.txt          # (Optional) Enhanced prompt
├── .env                      # API keys (NOT committed)
├── .gitignore               # Excludes sensitive files
├── README.md                # This file
├── output/                  # Test results (auto-generated)
└── .promptfoo/              # Cache (auto-generated)

🔧 Configuration Details

Prompts Used

Inline prompt:

You are a history bot. Tell me about {{question}}.

File-based prompts:

  • prompts/bad_bot.txt - Basic history bot
  • prompts/nice_bot.txt - Enhanced, friendly history bot

Variables in Tests

vars:
  question: "Your historical question here"
  topic: "Historical topic or time period"

Assertion Types

| Type | Purpose | Example |
|---|---|---|
| model-graded-closedqa | AI evaluates response quality | Checks for friendly tone |
| llm-rubric | AI checks against criteria | Verifies factual accuracy |
| latency | Response time tracking | Max 12 seconds |
| cost | API expense monitoring | Max $0.08 per request |

📊 Sample Results

After running promptfoo eval, you'll see results like:

Test Results:
✓ GPT-4o: African Civilization - PASS (friendly and professional)
✓ Claude Sonnet 4: Women in Military - PASS (factual and concise)
✗ GPT-4o-mini: Latency Test - FAIL (13.2s exceeded 12s threshold)
✓ All models: Cost Test - PASS (under $0.08)

View Detailed Results

Run promptfoo view to open an interactive dashboard showing:

  • Side-by-side model comparisons
  • Full response outputs
  • Performance metrics
  • Cost breakdown
  • Pass/fail statistics

🎓 Example Questions to Test

Add these to your config for comprehensive testing:

- "Who was the first female pharaoh of Egypt?"
- "What caused the fall of the Roman Empire?"
- "Explain the significance of the Silk Road."
- "What were the main causes of World War I?"
- "Describe the Renaissance period in Europe."

🐛 Troubleshooting

"API key not found"

# Verify .env file exists
ls -la .env

# Check format (no spaces around =)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...

"No such file or directory: prompts/bad_bot.txt"

# Create the prompts directory
mkdir -p prompts

# Create the prompt file
echo "You are a history bot. Answer questions about {{topic}}." > prompts/bad_bot.txt

"Invalid configuration file"

  • Check YAML indentation (use spaces, not tabs)
  • Ensure assert is at same level as vars
  • Run promptfoo eval --dry-run to validate

Slow responses or timeouts

  • First run is slower (downloads configs)
  • Historical questions may naturally take longer
  • Raise threshold: 12000 (milliseconds) to a higher value if needed
  • Check your internet connection

Clear cache if getting stale results

promptfoo cache clear
promptfoo eval

📈 Improving Your History Bot

Tips for Better Results:

  1. Add more context to prompts:
You are an expert historian specializing in {{topic}}.
Provide accurate, well-sourced information in 2-3 sentences.
  2. Test with edge cases:
  • Controversial historical events
  • Questions with multiple interpretations
  • Requests for specific dates/numbers
  3. Adjust thresholds based on results:
  • Lower latency for simple facts
  • Higher latency for complex analysis
  • Balance cost vs quality
  4. Use multiple prompts:
prompts:
  - file://prompts/concise_bot.txt
  - file://prompts/detailed_bot.txt
  - file://prompts/educational_bot.txt

🔒 Security Best Practices

📚 Resources

🔄 Continuous Testing (Optional)

Run tests automatically on commit:

Create .github/workflows/test.yml:

name: History Bot Tests
on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
      - run: npm install -g promptfoo
      - run: promptfoo eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Note: Add API keys as GitHub Secrets in your repository settings.

🤝 Contributing

Contributions welcome! You can:

  • Add new historical test questions
  • Test additional LLM providers
  • Improve evaluation criteria
  • Share interesting findings
  • Report bugs or issues

📊 Benchmark History

Track your improvements over time:

| Date | GPT-4o Avg | Claude S4 Avg | Best Model | Notes |
|---|---|---|---|---|
| Feb 2026 | 3.2s | 2.8s | Claude S4 | Initial baseline |
| - | - | - | - | Update after changes |

📄 License

MIT License - Feel free to use for educational or commercial purposes!

📧 Contact & Support


Project Status: ✅ Active Development
Last Updated: February 2026
Promptfoo Version: Latest
Maintained by: [Your Name]

🎯 Next Steps

  • Add more historical periods (Ancient Rome, Medieval Europe, etc.)
  • Test with domain-specific models
  • Implement automated daily testing
  • Create visualization dashboard
  • Add multi-language support
