A hybrid jailbreak detection system that applies formal verification thinking to LLM prompt safety.
Three-layer detection pipeline:
Layer 1 — Property Checker (BMC-style) Checks explicit invariant violations using regex assertions. Returns a counterexample trace when a property is violated. Handles ~80% of traffic instantly with zero API cost.
Layer 2 — Safe Allowlist Fast-path for obviously normal prompts — skip model entirely.
Layer 3 — Semantic Reasoner (k-induction-style) LLM call via Groq API for ambiguous prompts that rules cannot resolve. Only triggered for truly uncertain cases.
In hardware verification (my background), BMC checks bounded reachability and returns counterexamples when properties are violated. This project applies the same principle to prompt safety:
- Each regex pattern = one SVA assertion
- A match = property violated = counterexample found
- Layer 3 = deeper inductive reasoning for ambiguous cases
- Accuracy: 10/10 (100%) on test suite
- No-model rate: 100% of obvious cases handled free
- Layer 3 (Groq) only activates for ambiguous prompts
export GROQ_API_KEY="your_key_here"
python3 detector_hybrid.pyM Sai Sushma — ECE final year, RGMCET Published researcher | Formal verification background