Skip to content

sushmasai1704-web/llm-jailbreak-detector

Repository files navigation

LLM Jailbreak Detector — Formal Methods Approach

A hybrid jailbreak detection system that applies formal verification thinking to LLM prompt safety.

Architecture

Three-layer detection pipeline:

Layer 1 — Property Checker (BMC-style) Checks explicit invariant violations using regex assertions. Returns a counterexample trace when a property is violated. Handles ~80% of traffic instantly with zero API cost.

Layer 2 — Safe Allowlist Fast-path for obviously normal prompts — skip model entirely.

Layer 3 — Semantic Reasoner (k-induction-style) LLM call via Groq API for ambiguous prompts that rules cannot resolve. Only triggered for truly uncertain cases.

Formal Methods Connection

In hardware verification (my background), BMC checks bounded reachability and returns counterexamples when properties are violated. This project applies the same principle to prompt safety:

  • Each regex pattern = one SVA assertion
  • A match = property violated = counterexample found
  • Layer 3 = deeper inductive reasoning for ambiguous cases

Results

  • Accuracy: 10/10 (100%) on test suite
  • No-model rate: 100% of obvious cases handled free
  • Layer 3 (Groq) only activates for ambiguous prompts

Setup

export GROQ_API_KEY="your_key_here"
python3 detector_hybrid.py

Author

M Sai Sushma — ECE final year, RGMCET Published researcher | Formal verification background

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors

Languages