A focused portfolio of AI evaluation work: search-quality style ratings, LLM response grading, safety evaluation, and dataset labeling artifacts—organized like a real rater/evaluation ops project.
If you’re reviewing this repo for hiring: start with “Start Here” and the Case Study below.
- Search Quality Rater Simulation (rubric + templates + rated examples)
https://github.com/Parker-Bakken/search-quality-rater-simulation
- Search Intent Dataset (taxonomy + ambiguous set + calibration artifacts)
https://github.com/Parker-Bakken/search-intent-dataset
- AI Search Quality Evaluator (Python) (reproducible scoring baseline)
https://github.com/Parker-Bakken/ai-search-quality-evaluator
- LLM Hallucination Detection Benchmark (Mini) (labels + guidelines + tooling)
https://github.com/Parker-Bakken/llm-hallucination-detection-benchmark
- AI Response Evaluation (rubric + templates + examples)
https://github.com/Parker-Bakken/ai-response-evaluation
- AI Evaluation Examples (prompt quality / safety / hallucination cases)
https://github.com/Parker-Bakken/ai-evaluation-examples
- Dataset Labeling Examples (ML-ready CSVs + guidelines patterns)
https://github.com/Parker-Bakken/dataset-labeling-examples
- Rubric design for consistent, repeatable evaluation
- Intent interpretation (what the user is actually trying to do)
- Correctness + grounding (supported vs unverifiable vs fabricated claims)
- Safety evaluation (appropriate refusals, risk-aware grading)
- Operational thinking: calibration, QA logging, adjudication, and reporting
- Identify the task type (informational / navigational / transactional / local)
- Note constraints (format, tone, scope, safety, time sensitivity)
- Does the response solve the user’s problem?
- Is anything missing that a high-quality answer should include?
- Are claims supported by the prompt/context?
- If not verifiable, is uncertainty clearly stated (instead of guessing)?
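As a rough illustration of the grounding checks above, the sketch below tallies per-claim labels for a single response. The three label names follow the supported / unverifiable / fabricated split used in this portfolio; the `Grounding` enum and the `summarize_claims` helper are hypothetical names chosen for this example, not code taken from the linked repositories.

```python
from collections import Counter
from enum import Enum


class Grounding(Enum):
    """Per-claim grounding labels (hypothetical enum for illustration)."""
    SUPPORTED = "supported"        # backed by the prompt/context
    UNVERIFIABLE = "unverifiable"  # cannot be confirmed from the prompt/context
    FABRICATED = "fabricated"      # contradicted by or invented beyond the context


def summarize_claims(labels):
    """Return the share of claims carrying each label (0.0 for every label if the list is empty)."""
    counts = Counter(labels)
    total = len(labels)
    return {g.value: (counts[g] / total if total else 0.0) for g in Grounding}


if __name__ == "__main__":
    labels = [
        Grounding.SUPPORTED,
        Grounding.SUPPORTED,
        Grounding.FABRICATED,
        Grounding.UNVERIFIABLE,
    ]
    print(summarize_claims(labels))
```

A summary like this is only a starting point: a single fabricated claim may matter more than its share suggests, so a rater would normally still record which claim failed and why.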
Typical dimensions:
- Intent match
- Helpfulness / completeness
- Correctness
- Grounding / evidence quality
- Safety / refusal quality
- Clarity / structure
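One way to make these dimensions concrete is a small rating record that checks every dimension was scored before computing an overall value. This is a minimal sketch: the 1-5 scale, the equal weighting, and the names `DIMENSIONS` and `Rating` are assumptions made for illustration, not the scale or schema used in the linked rubrics.

```python
from dataclasses import dataclass

# Dimension keys mirror the list above; the snake_case names are an assumption.
DIMENSIONS = (
    "intent_match",
    "helpfulness",
    "correctness",
    "grounding",
    "safety",
    "clarity",
)


@dataclass
class Rating:
    """One rater's scores for one item (hypothetical structure)."""
    item_id: str
    scores: dict  # dimension name -> integer score, assumed 1..5

    def validate(self) -> None:
        """Raise ValueError if a dimension is missing, unknown, or out of range."""
        missing = [d for d in DIMENSIONS if d not in self.scores]
        if missing:
            raise ValueError(f"missing dimensions: {missing}")
        for name, value in self.scores.items():
            if name not in DIMENSIONS:
                raise ValueError(f"unknown dimension: {name}")
            if not isinstance(value, int) or not 1 <= value <= 5:
                raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")

    def overall(self) -> float:
        """Unweighted mean across all dimensions (equal weighting is an assumption)."""
        self.validate()
        return sum(self.scores[d] for d in DIMENSIONS) / len(DIMENSIONS)


if __name__ == "__main__":
    rating = Rating("example-001", {d: 4 for d in DIMENSIONS})
    print(rating.overall())
```

In practice a safety failure would likely override the average rather than be blended into it; the unweighted mean here is only the simplest possible aggregation.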
- Double-pass review on a subset
- Log disagreements + reasons
- Adjudicate edge cases
- Convert recurring disagreements into rubric clarifications + gold set examples
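The double-pass and disagreement-logging steps above could be sketched as follows. The input shape (a dict mapping item IDs to per-dimension scores), the `tolerance` parameter, and the function names are hypothetical choices for this example; the sketch reports simple percent agreement rather than a chance-corrected statistic.

```python
def find_disagreements(first_pass, second_pass, tolerance=0):
    """List (item_id, dimension, score_a, score_b) where two passes differ by more than tolerance.

    Only items and dimensions present in both passes are compared.
    """
    rows = []
    for item_id in sorted(first_pass.keys() & second_pass.keys()):
        a, b = first_pass[item_id], second_pass[item_id]
        for dim in sorted(a.keys() & b.keys()):
            if abs(a[dim] - b[dim]) > tolerance:
                rows.append((item_id, dim, a[dim], b[dim]))
    return rows


def agreement_rate(first_pass, second_pass, tolerance=0):
    """Fraction of shared (item, dimension) pairs within tolerance; None if nothing overlaps."""
    compared = 0
    agreed = 0
    for item_id in first_pass.keys() & second_pass.keys():
        a, b = first_pass[item_id], second_pass[item_id]
        for dim in a.keys() & b.keys():
            compared += 1
            if abs(a[dim] - b[dim]) <= tolerance:
                agreed += 1
    return agreed / compared if compared else None


if __name__ == "__main__":
    # Made-up scores purely to show the call shape.
    first = {"q1": {"correctness": 4, "clarity": 5}, "q2": {"correctness": 2}}
    second = {"q1": {"correctness": 2, "clarity": 5}, "q3": {"correctness": 3}}
    for row in find_disagreements(first, second):
        print(row)
    print(agreement_rate(first, second))
```

Each row returned by `find_disagreements` is a candidate for adjudication; the written reason for the final decision still has to come from a human reviewer.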
This repository is intended as an index + packaging layer for the portfolio.
If present, these folders support a rater ops workflow:
- CASE_STUDY.md: deeper narrative of my evaluation process
- rubrics/: reusable rubrics and scoring guidance
- templates/: copy/paste evaluation templates
- examples/: curated “best of” evaluation writeups
- qa/qa_log.csv: disagreement/adjudication tracking
- gold/gold_set.csv: finalized edge-case examples for calibration
- reports/scorecard.md: metrics snapshot for quick review
- This portfolio is designed to be auditable: clear criteria, labeled examples, and reproducible artifacts.
- Where uncertainty exists, I prefer transparent reasoning over confident guessing.
- I treat “edge cases” as first-class: they’re documented, adjudicated, and used to improve rubric clarity.
- GitHub: https://github.com/Parker-Bakken
- LinkedIn: linkedin.com/in/parkerbakken
- Email: Parkerbakken117@gmail.com