Automated testing and evaluation framework for a Global History chatbot using promptfoo. Tests multiple LLMs for accuracy, tone, factual grounding, response time, and cost efficiency.
This project evaluates AI-powered history bots across different providers to ensure:
- Factual accuracy in historical responses
- Professional and friendly tone
- Concise, fact-based answers
- Acceptable response latency (< 12 seconds)
- Cost-effective API usage
| Provider | Model | Purpose |
|---|---|---|
| OpenAI | GPT-4o | High-quality, detailed responses |
| OpenAI | GPT-4o-mini | Fast, cost-effective responses |
| Anthropic | Claude Sonnet 4 | Balanced performance and accuracy |
| Anthropic | Claude Opus 4 | Most capable historical analysis |
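
The four models in the table could be declared in promptfooconfig.yaml roughly as follows. This is a sketch, not the project's actual config: the dated Anthropic model IDs are assumptions, so check them against each provider's current model list before running.

```yaml
# promptfooconfig.yaml (providers section only)
providers:
  - openai:gpt-4o
  - openai:gpt-4o-mini
  # Assumed model identifiers; substitute the ones your account exposes.
  - anthropic:messages:claude-sonnet-4-20250514
  - anthropic:messages:claude-opus-4-20250514
```

Every prompt and test case is run against each provider listed, so adding a model here multiplies the number of API calls per evaluation.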
# Install Node.js (if not already installed)
# Download from: https://nodejs.org
# Install promptfoo globally
npm install -g promptfoo
# Verify installation
promptfoo --version
- Clone the repository:
git clone https://github.com/YOUR_USERNAME/global-history-bot.git
cd global-history-bot
- Set up API keys:
Create a .env file in the project root:
OPENAI_API_KEY=sk-proj-xxxxxxxxxxxxx
ANTHROPIC_API_KEY=sk-ant-xxxxxxxxxxxxx
Never commit your .env file! It's already in .gitignore.
- Run evaluation:
promptfoo eval
- View results in browser:
promptfoo view
Question: "What was the first African Civilization to inhabit Europe?"
Evaluation Method: Model-graded closed QA
Criteria: Response must be friendly and professional
Expected: Mentions ancient Moors, North African influence, or clarifies the historical context
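
A test case like this one might be written as below. It is a minimal sketch: the description text is illustrative, and the criterion is copied from the line above.

```yaml
tests:
  - description: Tone check on the African civilization question
    vars:
      question: "What was the first African Civilization to inhabit Europe?"
    assert:
      # The grading model returns pass/fail on whether the output meets this criterion.
      - type: model-graded-closedqa
        value: The response is friendly and professional
```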
Question: "How many women served in the military during WW1 and WW2?"
Evaluation Method: LLM rubric grading
Criteria: Response is rooted in facts and is concise
Expected: Provides specific numbers or ranges (e.g., ~350,000 in WWII)
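
The rubric-graded version of a test could look like this. Again a sketch under the assumption that the rubric text mirrors the criteria line; a stricter rubric could also ask for specific numbers or ranges.

```yaml
tests:
  - description: Factual grounding on women in the military
    vars:
      question: "How many women served in the military during WW1 and WW2?"
    assert:
      # The grading model judges the output against this free-form rubric.
      - type: llm-rubric
        value: The response is rooted in facts and is concise
```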
Metric: Response time and cost tracking
Assertions:
- Latency threshold: 12 seconds maximum
- Cost threshold: $0.08 per request maximum
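
Both thresholds can be expressed as assertions on a test case. The sketch below reuses one of the questions above purely for illustration; latency is given in milliseconds and cost in US dollars.

```yaml
tests:
  - description: Latency and cost guardrails
    vars:
      question: "How many women served in the military during WW1 and WW2?"
    assert:
      - type: latency
        threshold: 12000   # 12 seconds, in milliseconds
      - type: cost
        threshold: 0.08    # maximum dollars per request
```

Cached responses return almost instantly, so latency checks are only meaningful on uncached runs (for example, promptfoo eval --no-cache).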
- Uses GPT-4 to evaluate response quality
- Checks for tone, professionalism, and friendliness
- Binary pass/fail based on criteria
- Evaluates factual accuracy
- Checks for conciseness
- Ensures responses are grounded in historical facts
- Latency: Tracks API response time
- Cost: Monitors spending per request
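
If you want to pin which model does the grading rather than rely on the default, one option is to set it once for all tests. This is a sketch; the grader chosen here (openai:gpt-4o) is an assumption, not necessarily what this project uses.

```yaml
defaultTest:
  options:
    # Model used to grade model-graded-closedqa and llm-rubric assertions (assumed choice).
    provider: openai:gpt-4o
```

Grading calls are billed like any other API request, so a cheaper grader trades some judgment quality for lower evaluation cost.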
global-history-bot/
├── promptfooconfig.yaml    # Main test configuration
├── prompts/
│   ├── bad_bot.txt         # Basic prompt template
│   └── nice_bot.txt        # (Optional) Enhanced prompt
├── .env                    # API keys (NOT committed)
├── .gitignore              # Excludes sensitive files
├── README.md               # This file
├── output/                 # Test results (auto-generated)
└── .promptfoo/             # Cache (auto-generated)
Inline prompt:
You are a history bot. Tell me about {{question}}.
File-based prompts:
- prompts/bad_bot.txt - Basic history bot
- prompts/nice_bot.txt - Enhanced, friendly history bot
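
Inline and file-based prompts can sit side by side in the config. A minimal sketch, assuming both prompt files exist at the paths shown:

```yaml
prompts:
  # Inline prompt; {{question}} is filled from each test's vars.
  - "You are a history bot. Tell me about {{question}}."
  # File-based prompts, resolved relative to the config file.
  - file://prompts/bad_bot.txt
  - file://prompts/nice_bot.txt
```

Each prompt is evaluated against every test case, which makes it easy to compare the basic and enhanced bots on the same questions.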
vars:
  question: "Your historical question here"
  topic: "Historical topic or time period"

| Type | Purpose | Example |
|---|---|---|
| model-graded-closedqa | AI evaluates response quality | Checks for friendly tone |
| llm-rubric | AI checks against criteria | Verifies factual accuracy |
| latency | Response time tracking | Max 12 seconds |
| cost | API expense monitoring | Max $0.08 per request |
After running promptfoo eval, you'll see results like:
Test Results:
✓ GPT-4o: African Civilization - PASS (friendly and professional)
✓ Claude Sonnet 4: Women in Military - PASS (factual and concise)
✗ GPT-4o-mini: Latency Test - FAIL (13.2s exceeded 12s threshold)
✓ All models: Cost Test - PASS (under $0.08)
Run promptfoo view to open an interactive dashboard showing:
- Side-by-side model comparisons
- Full response outputs
- Performance metrics
- Cost breakdown
- Pass/fail statistics
Add these to your config for comprehensive testing:
- "Who was the first female pharaoh of Egypt?"
- "What caused the fall of the Roman Empire?"
- "Explain the significance of the Silk Road."
- "What were the main causes of World War I?"
- "Describe the Renaissance period in Europe."
# Verify .env file exists
ls -la .env
# Check format (no spaces around =)
OPENAI_API_KEY=sk-...
ANTHROPIC_API_KEY=sk-ant-...
# Create the prompts directory
mkdir -p prompts
# Create the prompt file
echo "You are a history bot. Answer questions about {{topic}}." > prompts/bad_bot.txt
- Check YAML indentation (use spaces, not tabs)
- Ensure assert is at the same level as vars
- Run promptfoo eval --dry-run to validate
- First run is slower (downloads configs)
- Historical questions may naturally take longer
- Adjust threshold: 12000 to a higher value if needed
- Check your internet connection
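
Raising the limit is a one-line change to the latency assertion. The 20000 value below is a hypothetical example, not a recommendation; pick a number based on your own measured response times.

```yaml
assert:
  - type: latency
    threshold: 20000   # hypothetical relaxed limit: 20 seconds, in milliseconds
```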
promptfoo cache clear
promptfoo eval
- Add more context to prompts:
You are an expert historian specializing in {{topic}}.
Provide accurate, well-sourced information in 2-3 sentences.
- Test with edge cases:
- Controversial historical events
- Questions with multiple interpretations
- Requests for specific dates/numbers
- Adjust thresholds based on results:
- Lower latency for simple facts
- Higher latency for complex analysis
- Balance cost vs quality
- Use multiple prompts:
prompts:
- file://prompts/concise_bot.txt
- file://prompts/detailed_bot.txt
- file://prompts/educational_bot.txt
- ✅ Never commit API keys - Always use .env and .gitignore
- ✅ Rotate keys regularly for security
- ✅ Monitor API usage to avoid unexpected charges
- ✅ Use read-only keys when possible
⚠️ If keys are exposed, regenerate immediately at:
- OpenAI: https://platform.openai.com/api-keys
- Anthropic: https://console.anthropic.com/settings/keys
- Promptfoo Documentation
- Promptfoo Assertions Guide
- OpenAI API Documentation
- Anthropic API Documentation
- YAML Syntax Guide
Create .github/workflows/test.yml:
name: History Bot Tests
on: [push, pull_request]
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - uses: actions/setup-node@v3
        with:
          node-version: '18'
      - run: npm install -g promptfoo
      - run: promptfoo eval
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

Note: Add API keys as GitHub Secrets in your repository settings.
Contributions welcome! You can:
- Add new historical test questions
- Test additional LLM providers
- Improve evaluation criteria
- Share interesting findings
- Report bugs or issues
Track your improvements over time:
| Date | GPT-4o Avg | Claude S4 Avg | Best Model | Notes |
|---|---|---|---|---|
| Feb 2026 | 3.2s | 2.8s | Claude S4 | Initial baseline |
| - | - | - | - | Update after changes |
MIT License - Feel free to use for educational or commercial purposes!
- Issues: Open a GitHub issue
- Questions: Start a discussion
- Email: your.email@example.com
Project Status: ✅ Active Development
Last Updated: February 2026
Promptfoo Version: Latest
Maintained by: [Your Name]
- Add more historical periods (Ancient Rome, Medieval Europe, etc.)
- Test with domain-specific models
- Implement automated daily testing
- Create visualization dashboard
- Add multi-language support