
Commit 1ca543d

feat: adversarial hardening — expand invisible chars, homoglyphs, and NFD bypass defense (#7)
* ci: add Grippy code review workflow

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: install grippy-code-review from GitHub repo

  Package is not yet published to PyPI, so install directly from the Project-Navi/grippy-code-review repository.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address code review feedback on Grippy workflow

  - Add concurrency group to cancel superseded runs
  - Skip job on fork PRs (secrets unavailable)
  - Set persist-credentials: false on checkout
  - Pin grippy-code-review to commit SHA
  - Use python -I to prevent module shadowing
  - Fix action version comments to match repo convention

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: strip C0/C1 control characters (terminal injection defense)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: strip 19 high-confidence invisible chars (math, deprecated, braille, hangul, mongolian)

  Add to the stripping pipeline:
  - U+2061-U+2064: invisible math operators (function application, times, separator, plus)
  - U+206A-U+206F: deprecated Unicode format controls
  - U+2800: braille pattern blank
  - U+1680: Ogham space mark
  - U+115F-U+1160: Hangul Choseong/Jungseong fillers
  - U+3164, U+FFA0: Hangul fillers (pre-NFKC forms that normalize to U+1160)
  - U+180B-U+180D, U+180F: Mongolian free variation selectors
  - U+061C: Arabic letter mark

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: NFD decompose before homoglyph scan to defeat NFKC composition bypass

  Combining marks could hide mapped homoglyph base characters when NFKC composed them into precomposed forms not in the map (e.g., Cyrillic U+0430 + breve -> U+04D1). Now _replace_homoglyphs decomposes to NFD first to expose base characters, then recomposes to NFC to maintain idempotency.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* feat: expand homoglyph map — 12 new pairs (Greek lowercase, Cyrillic extended, dotless i)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: update counts — invisible chars 411→492, homoglyphs 54→66, tests 309→357

  Update all documentation to reflect adversarial hardening (Tasks 1-5):
  - README.md: comparison table and key differences
  - CLAUDE.md: pipeline description and data file descriptions
  - whitepaper: abstract, pipeline table, curation section, verification, comparison, limitations, and commit hash on title page
  - _invisible.py: module docstring now lists all 8 categories

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: add .gitkeep to docs/plans for future collaboration

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* chore: bump version to 0.2.0

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add CHANGELOG entry for 0.2.0

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add persona-targeted example scripts (LLM, FastAPI, log sanitizer)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: restructure README — add invisible attack demo, reorder scenarios, link examples

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address audit findings — docs accuracy and packaging

  - Fix decode_evasion intermediate comment ("../etc/passwd" → "../../etc/passwd")
  - Qualify "never errors" claim with TypeError exception note
  - Fix warning log format (remove extra space after logger name)
  - Add coverage.xml to .gitignore
  - Add Changelog/Documentation URLs to pyproject.toml
  - Fix stage comment numbering (Stage 5→6 for escaper, add Stage 5 for re-NFKC)
  - Update stale homoglyph counts in test comments (42→66)
  - Fix tag block example to encode full "ignore previous instructions"
  - Fix examples/README.md dependency comment

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: address PR review feedback — deterministic regex, docstring, example

  - Sort set joins in regex construction for deterministic pattern order
  - Fix docstring: "ASCII equivalents" → "Latin equivalents"
  - Use `prompt` variable in llm_pipeline.py example (print final prompt)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* test: add regex coverage regression test and Unicode version assertion

  - Verify INVISIBLE_RE matches all 492 intended codepoints (guards against silent regex regressions from merge conflicts or range edits)
  - Verify regex does not false-positive on printable ASCII, TAB, LF, CR, NUL
  - Assert Unicode version >= 15 to catch normalization behavior changes
  - 361 tests total (up from 357)

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: final docs/code audit — fix counts, comments, and test accuracy

  Whitepaper:
  - Test count 357→361, "eight categories"→"nine categories" (4 locations)

  Source code:
  - _pipeline.py: module docstring "Five stages"→"Six stages", expand Stage 2 and Stage 4 descriptions to match actual coverage
  - _invisible.py: fix Tag block comment (U+E0001 is LANGUAGE TAG, not U+E0000), fix "no logic" claim, fix Mongolian FVS "identical"→"analogous"
  - path_escaper: document backslash normalization in docstring

  Tests:
  - Fix stage numbering: "stage_5"→"stage_6" for escaper (2 locations)
  - Fix stale "54-pair"→"66-pair" in test_adversarial_homoglyphs.py
  - Update "Gaps" class docstrings — chars now covered by map
  - Fix "all six zero-width"→"all eight" and add U+200E/U+200F to test strings
  - Fix Tag block comment U+E0001→U+E0000 range start

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 664f3ad commit 1ca543d

23 files changed

Lines changed: 737 additions & 64 deletions

.gitignore

Lines changed: 2 additions & 0 deletions
@@ -15,5 +15,7 @@ grippy-data/
 
 .benchmarks/
 
+coverage.xml
+
 # Internal docs — operational docs not for public tracking
 docs/internal/

CHANGELOG.md

Lines changed: 26 additions & 0 deletions
@@ -5,6 +5,32 @@ All notable changes to this project will be documented in this file.
 This changelog is automatically generated by [git-cliff](https://git-cliff.org/)
 from [conventional commits](https://www.conventionalcommits.org/).
 
+## [0.2.0] - 2026-03-07
+
+### Features
+
+- Strip C0/C1 control characters — terminal injection defense (BS, ESC, ANSI sequences)
+- Strip 19 high-confidence invisible chars (math invisible operators, deprecated format controls, braille blank, ogham space, hangul fillers, mongolian FVS, arabic letter mark)
+- Expand homoglyph map — 12 new pairs (Greek lowercase, Cyrillic extended, Latin dotless i)
+
+### Bug Fixes
+
+- NFD decompose before homoglyph scan to defeat NFKC composition bypass
+
+### Documentation
+
+- Update counts across all docs — invisible chars 411 to 492, homoglyphs 54 to 66, tests 309 to 357
+- Add NFKC side effects and Latin Small Capitals limitation to wiki Threat Model
+- Add C0/C1 controls and Mongolian FVS to wiki Pipeline Architecture and Character Reference
+
+### CI/CD
+
+- Add Grippy code review workflow
+
+### Testing
+
+- Add 48 adversarial tests for invisible char gaps, NFKC composition bypass, and new homoglyph pairs
+
 ## [0.1.1] - 2026-03-02
 
 ### Documentation
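Reviewer note on the "NFD decompose before homoglyph scan" entry: the bypass can be reproduced with the standard library alone. The sketch below is not the library's `_replace_homoglyphs`; the one-entry `HOMOGLYPHS` map and both function names are hypothetical, chosen only to show why a lookup on NFKC-composed text misses the base character and why decomposing first exposes it.

```python
import unicodedata

# Hypothetical one-entry map; the real map in _homoglyphs.py is not shown in this diff.
HOMOGLYPHS = {"\u0430": "a"}  # Cyrillic small a -> Latin a


def naive_replace(text: str) -> str:
    """Look up each character as-is (the assumed pre-fix behaviour)."""
    return "".join(HOMOGLYPHS.get(ch, ch) for ch in text)


def nfd_replace(text: str) -> str:
    """Decompose to NFD, replace base characters, then recompose to NFC."""
    decomposed = unicodedata.normalize("NFD", text)
    replaced = "".join(HOMOGLYPHS.get(ch, ch) for ch in decomposed)
    return unicodedata.normalize("NFC", replaced)


# NFKC composes Cyrillic a + combining breve into the precomposed U+04D1,
# which is not a key in the map.
composed = unicodedata.normalize("NFKC", "\u0430\u0306")
naive_result = naive_replace(composed)  # unchanged: the base letter is hidden
fixed_result = nfd_replace(composed)    # Latin a with breve (U+0103)
```

Recomposing to NFC at the end means a second pass over `fixed_result` returns the same string, which matches the idempotency goal stated in the commit message.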

CLAUDE.md

Lines changed: 4 additions & 4 deletions
@@ -4,7 +4,7 @@ This file provides guidance to Claude Code (claude.ai/code) when working with co
 
 ## Project
 
-navi-sanitize is a standalone, zero-dependency Python library extracted from navi-bootstrap. It provides deterministic input sanitization for untrusted text — no ML, no false positives. Python 3.12+, stdlib only.
+navi-sanitize is a standalone, zero-dependency Python library extracted from navi-bootstrap. It provides deterministic input sanitization for untrusted text — no ML, legitimate Unicode preserved by design. Python 3.12+, stdlib only.
 
 ## Commands

@@ -70,16 +70,16 @@ Six stages in strict order — reordering breaks security:
 1. **Null byte removal** — strip `\x00` (prevents C-extension truncation)
 2. **Invisible character stripping** — single compiled regex covering zero-width chars, format/control chars, variation selectors, Unicode Tag block (`U+E0000`-`U+E007F`), and bidi overrides
 3. **NFKC normalization** — collapses fullwidth ASCII and compatibility forms
-4. **Homoglyph replacement** — character-by-character scan against 54-pair map in `_homoglyphs.py`
+4. **Homoglyph replacement** — NFD decomposition then character-by-character scan against 66-pair map in `_homoglyphs.py`
 5. **Re-NFKC** (conditional) — re-normalize after homoglyph replacement to ensure idempotency
 6. **Escaper** (optional) — pluggable `Callable[[str], str]` runs last
 
 Each stage returns `(cleaned_string, changed: bool)`. Stages have no side effects — the orchestrator logs.
 
 ### Data files
 
-- `_homoglyphs.py` — 54 pairs: Cyrillic, Greek, Armenian, Cherokee, and typographic lookalikes
-- `_invisible.py` — zero-width, format/control (soft hyphen, thin/hair space, line/paragraph separators, etc.), variation selectors, Tag block, and bidi character sets
+- `_homoglyphs.py` — 66 pairs: Cyrillic, Greek, Armenian, Cherokee, and typographic lookalikes
+- `_invisible.py` — zero-width, format/control (soft hyphen, thin/hair space, line/paragraph separators, etc.), variation selectors, variation selector supplement, Mongolian FVS, Unicode Tag block, bidirectional controls, C0 controls, and C1 controls
 
 ### Escapers (`escapers/`)
 

README.md

Lines changed: 19 additions & 9 deletions
@@ -10,7 +10,7 @@
 [![Python 3.12+](https://img.shields.io/badge/python-3.12%2B-blue.svg)](https://www.python.org/downloads/)
 [![Ruff](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/astral-sh/ruff/main/assets/badge/v2.json)](https://github.com/astral-sh/ruff)
 
-Deterministic input sanitization for untrusted text. Zero dependencies, zero false positives.
+Deterministic input sanitization for untrusted text. Zero dependencies. Legitimate Unicode preserved by design.
 
 ```python
 from navi_sanitize import clean
@@ -20,6 +20,14 @@ clean("price:\u200b 0") # "price: 0" — zero-width space stripped
 clean("file\x00.txt") # "file.txt" — null byte removed
 ```
 
+See the invisible:
+
+```python
+evil = "system\u200b\u200cprompt" # looks like "systemprompt" but has 2 hidden chars
+len(evil) # 14 (not 12!)
+clean(evil) # "systemprompt" — hidden chars stripped
+```
+
 Opt-in utilities for deeper analysis: `decode_evasion()` peels nested URL/HTML/hex encodings, `detect_scripts()` and `is_mixed_script()` flag mixed-script spoofing.
 
 ## Why This Matters
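Reviewer note on the "See the invisible" demo: the same effect can be checked without the library. The pattern below is a small assumed subset for illustration only (the library's `INVISIBLE_RE` is stated to cover 492 codepoints); `SKETCH_INVISIBLE_RE` and `strip_invisible` are hypothetical names.

```python
import re

# Assumed subset only; not the library's INVISIBLE_RE.
SKETCH_INVISIBLE_RE = re.compile(
    "["
    "\u200b-\u200f"               # zero-width space/joiners, LRM, RLM
    "\u2061-\u2064"               # invisible math operators
    "\x01-\x08\x0b\x0c\x0e-\x1f"  # C0 controls except TAB, LF, CR (NUL is left to stage 1)
    "\x80-\x9f"                   # C1 controls
    "]"
)


def strip_invisible(text: str) -> str:
    """Remove every character matched by the sketch pattern."""
    return SKETCH_INVISIBLE_RE.sub("", text)


evil = "system\u200b\u200cprompt"
visible = strip_invisible(evil)

# Removing ESC defangs an ANSI colour sequence; the printable remainder stays.
defanged = strip_invisible("\x1b[31mred")
```

TAB, LF, and CR are deliberately outside the character class, mirroring the regression test described in the commit message.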
@@ -32,11 +40,11 @@ navi-sanitize fixes the text before it reaches your application. It doesn't dete
 
 **Web applications** — Jinja2 SSTI, path traversal, and fullwidth encoding bypasses are well-known but tedious to cover manually. A single `clean(user_input, escaper=jinja2_escaper)` call handles homoglyph-disguised payloads like `{{ cоnfig }}` (Cyrillic `о`) that naive escaping misses.
 
-**Config and data ingestion** — YAML, TOML, and JSON parsed from untrusted sources can carry null bytes that truncate C-extension processing, zero-width characters that break key matching, and homoglyphs that create near-duplicate keys. `walk(parsed_config)` sanitizes every string in a nested structure in one call.
+**Identity and anti-phishing** — `pаypal.com` (Cyrillic `а`) renders identically to `paypal.com` in most fonts. Homoglyph replacement normalizes display names, URLs, and email addresses to catch spoofing that visual inspection misses.
 
 **Log analysis and SIEM** — Attackers embed bidi overrides and zero-width characters in log entries to hide indicators of compromise from analysts and pattern-matching tools. Sanitizing log data on ingest ensures what you search is what's actually there.
 
-**Identity and anti-phishing** — `pаypal.com` (Cyrillic `а`) renders identically to `paypal.com` in most fonts. Homoglyph replacement normalizes display names, URLs, and email addresses to catch spoofing that visual inspection misses.
+**Config and data ingestion** — YAML, TOML, and JSON parsed from untrusted sources can carry null bytes that truncate C-extension processing, zero-width characters that break key matching, and homoglyphs that create near-duplicate keys. `walk(parsed_config)` sanitizes every string in a nested structure in one call.
 
 ## How It Compares
 
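Reviewer note on the anti-phishing scenario: the spoofed domain mixes two scripts, which is visible to code even when it is not visible to a reader. The sketch below is a rough stdlib approximation and not the library's `detect_scripts()`; `scripts_by_name` is a hypothetical helper that guesses a script from the first word of each letter's Unicode name.

```python
import unicodedata


def scripts_by_name(text: str) -> set[str]:
    """Approximate each letter's script by the first word of its Unicode name."""
    return {
        unicodedata.name(ch, "UNKNOWN").split()[0]
        for ch in text
        if ch.isalpha()
    }


spoofed = scripts_by_name("p\u0430ypal.com")  # second letter is Cyrillic U+0430
genuine = scripts_by_name("paypal.com")
```

A result with more than one entry is a cheap signal that a string deserves a closer look, though name prefixes are only a heuristic and not the Unicode Script property.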

@@ -45,8 +53,8 @@ navi-sanitize is the only library that combines invisible character stripping, h
 | | navi-sanitize | Unidecode / anyascii | confusable_homoglyphs | ftfy | MarkupSafe / nh3 |
 |---|---|---|---|---|---|
 | **Purpose** | Security sanitization | ASCII transliteration | Homoglyph detection | Encoding repair | HTML escaping |
-| **Invisible chars** | Strips 411 (bidi, tag block, ZW, VS) | Incidental | No | Partial (preserves bidi, ZW, VS) | No |
-| **Homoglyphs** | Replaces 54 curated pairs | Transliterates all non-ASCII | Detects only (no replace) | No | No |
+| **Invisible chars** | Strips 492 (bidi, tag block, ZW, VS, C0/C1) | Incidental | No | Partial (preserves bidi, ZW, VS) | No |
+| **Homoglyphs** | Replaces 66 curated pairs | Transliterates all non-ASCII | Detects only (no replace) | No | No |
 | **NFKC** | Yes | No | No | NFC (NFKC optional) | No |
 | **Null bytes** | Yes | No | No | No | No |
 | **Preserves Unicode** | Yes (CJK, Arabic, emoji intact) | No (destroys all non-ASCII) | Yes | Yes | Yes |
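Reviewer note on the "NFKC" and "Preserves Unicode" rows: both claims can be checked against the standard library's `unicodedata.normalize`, which is presumably what a stdlib-only pipeline relies on. The variable names below are illustrative.

```python
import unicodedata

# Fullwidth Latin letters collapse to plain ASCII under NFKC.
fullwidth = "\uff43\uff4f\uff4e\uff46\uff49\uff47"  # fullwidth "config"
collapsed = unicodedata.normalize("NFKC", fullwidth)

# CJK text has no compatibility mapping and passes through unchanged.
cjk = "\u4e2d\u6587"
cjk_after = unicodedata.normalize("NFKC", cjk)

# A known NFKC side effect: compatibility characters such as superscripts are folded.
superscript = unicodedata.normalize("NFKC", "x\u00b2")
```

The last line is the kind of side effect the changelog says was added to the wiki Threat Model: NFKC preserves scripts but does rewrite compatibility forms.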
@@ -55,7 +63,7 @@ navi-sanitize is the only library that combines invisible character stripping, h
 
 **Key differences:**
 
-- **Unidecode / anyascii** transliterate *all* non-ASCII to Latin. They turn `"中"` into `"Zhong"` and Cyrillic sentences into gibberish. navi-sanitize normalizes only the 54 highest-risk lookalikes and leaves legitimate Unicode intact.
+- **Unidecode / anyascii** transliterate *all* non-ASCII to Latin. They turn `"中"` into `"Zhong"` and Cyrillic sentences into gibberish. navi-sanitize normalizes only the 66 highest-risk lookalikes and leaves legitimate Unicode intact.
 - **confusable_homoglyphs** uses the full Unicode Consortium confusables dataset (thousands of pairs) but only *detects* — you'd need to write your own replacement layer. It's also archived.
 - **ftfy** is complementary, not competing. It fixes encoding corruption and explicitly *preserves* bidi overrides and zero-width characters that navi-sanitize strips. Different threat model.
 - **MarkupSafe / nh3** handle HTML structure; navi-sanitize handles the character-level content *inside* that structure. They compose naturally.
@@ -128,6 +136,8 @@ safe_context = {k: clean(v, escaper=jinja2_escaper) for k, v in user_data.items(
 template.render(**safe_context)
 ```
 
+See [examples/](examples/) for runnable scripts covering LLM pipelines, FastAPI/Pydantic, and log sanitization.
+
 ## Install
 
 ```
@@ -154,7 +164,7 @@ from navi_sanitize import decode_evasion, clean, detect_scripts, is_mixed_script
 raw = "%252e%252e%252fetc%252fpasswd"
 
 # 1. Peel nested encodings (URL → HTML entities → hex escapes)
-peeled = decode_evasion(raw) # "../etc/passwd"
+peeled = decode_evasion(raw) # "../../etc/passwd"
 
 # 2. Sanitize through the universal pipeline
 cleaned = clean(peeled, escaper=path_escaper) # "etc/passwd"
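Reviewer note on the "peel nested encodings" step: the URL layer alone can be peeled with `urllib.parse.unquote` applied until the text stops changing. This is a stdlib sketch, not `decode_evasion()` itself (which per its comment also handles HTML entities and hex escapes); `peel_url_encoding` and `max_rounds` are hypothetical names.

```python
from urllib.parse import unquote


def peel_url_encoding(text: str, max_rounds: int = 10) -> str:
    """Percent-decode repeatedly until a fixed point or the round limit is reached."""
    for _ in range(max_rounds):
        decoded = unquote(text)
        if decoded == text:
            break
        text = decoded
    return text


raw = "%252e%252e%252fetc%252fpasswd"
peeled = peel_url_encoding(raw)
```

For the `raw` value shown in the context line of this hunk, the loop stops at `../etc/passwd` after two decoding rounds, so the comment value in the added line may be worth checking against the full example in the README.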
@@ -183,14 +193,14 @@ These are different problems with mature, purpose-built solutions. navi-sanitize
 
 ## Warnings
 
-The pipeline never errors. It always produces output. When it changes something, it logs a warning.
+The pipeline never errors on valid string input. It always produces output. Non-string arguments raise `TypeError`. When it changes something, it logs a warning.
 
 ```python
 import logging
 logging.basicConfig()
 
 clean("pаypal.com")
-# WARNING:navi_sanitize: Replaced 1 homoglyph(s) in value
+# WARNING:navi_sanitize:Replaced 1 homoglyph(s) in value
 # Returns: "paypal.com"
 ```
 
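Reviewer note on the warning-format fix: the corrected comment matches what `logging.basicConfig()` produces by default, since the default format string is `%(levelname)s:%(name)s:%(message)s` with no space after the logger name. The sketch below reproduces the line without calling the library; the message text is copied from the diff, and emitting it by hand here is purely illustrative.

```python
import io
import logging

stream = io.StringIO()
handler = logging.StreamHandler(stream)
# logging.BASIC_FORMAT is the format basicConfig() applies when none is given.
handler.setFormatter(logging.Formatter(logging.BASIC_FORMAT))

logger = logging.getLogger("navi_sanitize")
logger.addHandler(handler)
logger.propagate = False  # keep the sketch's output out of the root logger

# Emitted by hand for illustration; in real use the library logs this itself.
logger.warning("Replaced %d homoglyph(s) in value", 1)
line = stream.getvalue().strip()
```

Capturing the output in a `StringIO` like this is also a convenient way to assert on sanitizer warnings in tests.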

docs/plans/.gitkeep

Whitespace-only changes.
