llm-redteam-harness

A reproducible harness for measuring how published LLM defences change the attack success rate of published adversarial prompts — and for measuring how much to trust those numbers. Built as a measurement tool, not a weapon — see ETHICS.md.

Headline finding

Across 2 target models, 2 benchmark families, and up to 4 composable defence configurations — 12 evaluation cells in total — published adversarial prompts succeed between 0% and 4% of the time, and a paranoid prompt-only defence stack does not measurably move that number.

Every attack-success verdict is scored by an LLM judge and then independently re-scored by a second judge model. Judge agreement on attack success is perfect — Cohen's κ = +1.00 in all 12 cells.

The honest reading: 2026-era instruction tuning already neutralises these published, static attacks on both a frontier model (Claude Sonnet 4.6) and a small local model (Llama 3.1 8B). Prompt-only defences change the style of a refusal, not the safety outcome. The open risk these static benchmarks under-measure is the full agentic loop — interactive, multi-step tool use — which is named explicitly as future work, not quietly omitted.

AdvBench — direct attacks (n = 100 per cell)

Target	Baseline ASR	Full-stack ASR
Claude Sonnet 4.6	0% [0, 0]	0% [0, 0]
Llama 3.1 8B (local)	1% [0, 3]	0% [0, 0]

AgentDojo — static indirect injection (n = 50 per cell)

Target	Baseline	+ Spotlighting	+ SecAlign	Full prompt stack
Claude Sonnet 4.6	0%	0%	0%	0%
Llama 3.1 8B (local)	4% [0, 10]	0%	0%	0%

All figures are attack success rate (ASR) from the LLM judge; brackets are 95% percentile-bootstrap confidence intervals. Cross-judge κ on ASR = +1.00 for every cell. Full numbers, validation, and limits in METHODOLOGY.md.

Why this reports ASR and not refusal rate

The harness was built to report how trustworthy its own metrics are. The cross-judge layer found that ASR is well-posed and refusal_rate is not: the two judges agree perfectly on whether an attack succeeded, but disagree — sometimes worse than chance — on whether a response was a "refusal", because an indirect-injection task has two things that can be refused (the user's request and the injected instruction). refusal_rate is therefore reported as a descriptive signal of response style only. This is documented, not hidden — see METHODOLOGY.md §7.

What this is

A reproducible benchmark that:

Loads published adversarial prompts from AdvBench, JailbreakBench, HarmBench, and AgentDojo, each pinned to an upstream commit.
Sends them through 2 target LLMs — Claude Sonnet 4.6 (frontier) and Llama 3.1 8B (local, via Ollama).
Toggles composable defence stacks on and off — paranoid system prompt, Constitutional critique-and-revise, Spotlighting, SecAlign-style structured queries (Llama Guard 4 pre/post-filters are implemented but excluded from the v1 matrix; see METHODOLOGY.md §4).
Scores responses with a rule-based pre-screen, then an LLM judge, then an independent cross-judge for validation.
Reports ASR with bootstrap confidence intervals, inter-judge agreement (Cohen's κ, Krippendorff's α), and real API cost per run.

Status

The v1 evaluation matrix is complete and the harness is feature-complete for v1 scope. The next tracks — the full AgentDojo agent loop, multi-turn attacks, and Inspect AI export — are listed in METHODOLOGY.md §12.

Getting started

git clone https://github.com/rosscyking1115/llm-redteam-harness.git
cd llm-redteam-harness
uv venv --python 3.13
source .venv/bin/activate           # macOS/Linux
# .venv\Scripts\activate            # Windows PowerShell
uv pip install -e ".[dev]"
cp .env.example .env                # fill in ANTHROPIC_API_KEY
pre-commit install
pytest tests/unit                   # should pass green
python -m redteam version           # prints 0.1.0

The CLI is invoked as python -m redteam .... Each run enforces a hard USD budget cap (set per config in configs/); set a matching console cap before the first run.

python -m redteam corpora download           # fetch + pin corpora
python -m redteam run --config configs/run_anthropic_baseline.yaml
python -m redteam score --run results/<run>.json
python -m redteam cross-judge --run results/<run>.judged.json

Inspect AI compatibility

Any run exports to a UK AI Security Institute Inspect eval log, so results open in inspect view or load with read_eval_log():

uv pip install -e ".[inspect]"     # optional extra
python -m redteam export-inspect --run results/<run>.json

Each case maps to an Inspect sample scored by the LLM-judge ASR verdict; cross-judge agreement, confidence intervals, and cost travel in the metadata.

Running CI locally before pushing

scripts/ci_local.ps1 (Windows) and scripts/ci_local.sh (Linux/macOS) run the exact same checks as .github/workflows/ci.yml — ruff lint, ruff format check, mypy, pytest. Activate the venv, then:

scripts\ci_local.ps1

If it exits green, CI on the PR will be green too.

Repository layout

See PROJECT-1-KIT.md §6 for the target layout. src/redteam/ holds the schemas, corpus loaders, target adapters, defences, orchestrator, scorers, and CLI; configs/ holds run configs and pinned dataset versions; results/ holds run artifacts (gitignored — re-creatable from configs).

Ethics

This project uses only published adversarial prompts. Excluded categories (CSAM, weapons-of-mass-destruction synthesis, detailed self-harm methods) are filtered at corpus-load time and verified by a CI test. Results are aggregate — no raw harmful outputs are committed to this repo.

If you are a model provider whose model is included and want example transcripts removed, email rosscyking@gmail.com — 24-hour removal commitment. See ETHICS.md.

Citation

@software{llm_redteam_harness_2026,
  title  = {llm-redteam-harness: Reproducible LLM defence evaluation},
  author = {Ross},
  year   = {2026},
  url    = {https://github.com/rosscyking1115/llm-redteam-harness}
}

Licence

MIT — see LICENSE.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

llm-redteam-harness

Headline finding

AdvBench — direct attacks (n = 100 per cell)

AgentDojo — static indirect injection (n = 50 per cell)

Why this reports ASR and not refusal rate

What this is

Status

Getting started

Inspect AI compatibility

Running CI locally before pushing

Repository layout

Ethics

Citation

Licence

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
.github/workflows		.github/workflows
configs		configs
docs		docs
scripts		scripts
src/redteam		src/redteam
tests		tests
.env.example		.env.example
.gitattributes		.gitattributes
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
ETHICS.md		ETHICS.md
LICENSE		LICENSE
METHODOLOGY.md		METHODOLOGY.md
README.md		README.md
pyproject.toml		pyproject.toml

Folders and files

Latest commit

History

Repository files navigation

llm-redteam-harness

Headline finding

AdvBench — direct attacks (n = 100 per cell)

AgentDojo — static indirect injection (n = 50 per cell)

Why this reports ASR and not refusal rate

What this is

Status

Getting started

Inspect AI compatibility

Running CI locally before pushing

Repository layout

Ethics

Citation

Licence

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages