Commit 28d136c

fix: close bypass vectors from adversarial audit and harden pipeline
Security fixes (2 CRITICAL, 3 HIGH, 4 MEDIUM):
- Strip supplementary variation selectors U+E0100-U+E01EF (invisible bypass)
- Strip U+E0000 LANGUAGE TAG (expand tag block range to U+E0000-U+E007F)
- Strip LRM U+200E and RLM U+200F (zero-width directional marks)
- Add Greek homoglyphs: iota U+03B9→i, nu U+03BD→v, rho U+03C1→p
- Normalize backslashes in path escaper (Windows traversal bypass)
- Sanitize dictionary keys in walk() (previously only values)
- Add runtime TypeError on non-str input to clean()
- Validate escaper return type (must be str)
- Null byte warnings now include count per project convention

Tests: 44 new regression tests (240 total), placeholder tests filled, weak assertions fixed.

Classifiers and project.urls added to pyproject.toml.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent f3f81f0 commit 28d136c

9 files changed

Lines changed: 444 additions & 38 deletions

CLAUDE.md

Lines changed: 6 additions & 3 deletions
@@ -43,12 +43,15 @@ pre-commit run --all-files

 ## CI

-GitHub Actions runs on push to `main` and all PRs. Four parallel jobs:
+GitHub Actions runs on push to `main` and all PRs. Five parallel jobs:

 - **lint** — `ruff check` + `ruff format --check`
 - **typecheck** — `mypy --strict src/navi_sanitize/`
 - **test** — pytest across Python 3.12 + 3.13, `--benchmark-disable`
-- **build** — gates on all three above; builds wheel, smoke-tests public API, uploads artifact
+- **security** — `pip-audit` dependency vulnerability scan
+- **build** — gates on all four above; builds wheel, smoke-tests public API, uploads artifact
+
+Additional security workflows: Semgrep SAST, CodeQL (`python` + `actions`), OpenSSF Scorecard.

 Benchmarks run via manual dispatch only (`.github/workflows/benchmark.yml`).

@@ -96,7 +99,7 @@ Each stage returns `(cleaned_string, changed: bool)`. Stages have no side effect
 ## Gotchas

 - **`ruff` rules `RUF001`/`RUF003`** fire on intentional Cyrillic/Greek/Armenian/Cherokee in test and data files — use `# ruff: noqa: RUF001, RUF003` or `# ruff: noqa: RUF003` at top of those files
-- **Tag block range starts at `U+E0001`**, not `U+E0000`
+- **Tag block range starts at `U+E0000`** (includes the deprecated LANGUAGE TAG character)
 - **pytest-benchmark `pedantic()`** required for large payloads (100KB) — standard mode runs too many iterations
 - **No CLI, no config files, no framework dependencies** — this is a library only
 - **No LLM prompt escaper** — vendor syntax moves too fast; pluggable design lets users build their own

pyproject.toml

Lines changed: 15 additions & 0 deletions
@@ -18,6 +18,21 @@ authors = [
 maintainers = [
     { name = "Nelson Spence" }
 ]
+classifiers = [
+    "Development Status :: 3 - Alpha",
+    "Intended Audience :: Developers",
+    "License :: OSI Approved :: MIT License",
+    "Programming Language :: Python :: 3",
+    "Programming Language :: Python :: 3.12",
+    "Programming Language :: Python :: 3.13",
+    "Topic :: Security",
+    "Topic :: Text Processing :: Filters",
+    "Typing :: Typed",
+]
+
+[project.urls]
+Repository = "https://github.com/Project-Navi/navi-sanitize"
+Issues = "https://github.com/Project-Navi/navi-sanitize/issues"

 [dependency-groups]
 dev = [

src/navi_sanitize/_homoglyphs.py

Lines changed: 3 additions & 0 deletions
@@ -58,7 +58,10 @@
     "\u03a7": "X",
     # Greek → Latin (lowercase)
     "\u03b1": "a",
+    "\u03b9": "i",  # iota ι
+    "\u03bd": "v",  # nu ν
     "\u03bf": "o",
+    "\u03c1": "p",  # rho ρ
     # Typographic
     "\u2212": "-",  # minus sign
     "\u2013": "-",  # en dash

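To illustrate what the three new Greek entries change, here is a minimal sketch of homoglyph folding with `str.translate`. It is not the library's implementation: the names `GREEK_TO_LATIN` and `fold_homoglyphs` are hypothetical, and the table is only the lowercase subset visible in this hunk.

```python
# Hypothetical subset of the mapping: the lowercase Greek entries shown in
# the hunk above (three of them added by this commit).
GREEK_TO_LATIN = {
    "\u03b1": "a",  # alpha
    "\u03b9": "i",  # iota (added)
    "\u03bd": "v",  # nu (added)
    "\u03bf": "o",  # omicron
    "\u03c1": "p",  # rho (added)
}

# str.maketrans accepts a dict whose keys are single characters.
_TABLE = str.maketrans(GREEK_TO_LATIN)


def fold_homoglyphs(text: str) -> tuple[str, int]:
    """Return (folded_text, number_of_characters_replaced)."""
    replaced = sum(1 for ch in text if ch in GREEK_TO_LATIN)
    return text.translate(_TABLE), replaced


# "paypal.com" spelled with Greek rho in place of both Latin p's.
print(fold_homoglyphs("\u03c1ay\u03c1al.com"))  # ('paypal.com', 2)
```

Before this commit, a string using rho, iota, or nu as stand-ins would have passed through the table unchanged, which is the bypass the new entries close.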
src/navi_sanitize/_invisible.py

Lines changed: 17 additions & 6 deletions
@@ -14,6 +14,8 @@
     "\u200b",  # zero-width space
     "\u200c",  # zero-width non-joiner
     "\u200d",  # zero-width joiner
+    "\u200e",  # left-to-right mark
+    "\u200f",  # right-to-left mark
     "\u2060",  # word joiner
     "\ufeff",  # BOM / zero-width no-break space
     "\u180e",  # Mongolian vowel separator
@@ -34,14 +36,17 @@
     "\ufffc",  # object replacement character
 }

-# --- Variation selectors (U+FE00-U+FE0F) ---
-# Invisible modifiers that change glyph presentation.
+# --- Variation selectors ---
+# BMP range (U+FE00-U+FE0F) = VS1-VS16.
+# Supplementary range (U+E0100-U+E01EF) = VS17-VS256.
+# Both are invisible modifiers that change glyph presentation.
 VARIATION_SELECTOR_RANGE = (0xFE00, 0xFE0F)
+VARIATION_SELECTOR_SUPPLEMENT_RANGE = (0xE0100, 0xE01EF)

-# --- Unicode Tag block (U+E0001-U+E007F) ---
-# Encodes invisible ASCII that tokenizers read but humans can't see.
-# Used in tag smuggling attacks against LLMs.
-TAG_BLOCK_RANGE = (0xE0001, 0xE007F)
+# --- Unicode Tag block (U+E0000-U+E007F) ---
+# U+E0000 is the deprecated LANGUAGE TAG; U+E0001-U+E007F encode invisible
+# ASCII that tokenizers read but humans can't see (tag smuggling attacks).
+TAG_BLOCK_RANGE = (0xE0000, 0xE007F)

 # --- Bidirectional override/isolate characters ---
 # Used to reorder displayed text, hiding malicious content.
@@ -79,6 +84,12 @@
     + "-"
     + chr(TAG_BLOCK_RANGE[1])
     + "]"
+    # Variation selectors supplement (range)
+    + "|["
+    + chr(VARIATION_SELECTOR_SUPPLEMENT_RANGE[0])
+    + "-"
+    + chr(VARIATION_SELECTOR_SUPPLEMENT_RANGE[1])
+    + "]"
     # Bidi controls (individual chars)
     + "|["
     + "".join(BIDI_CONTROL_CHARS)
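The effect of the widened ranges can be checked with a small standalone pattern. The sketch below reuses the three range constants from the hunk above, but the single-character-class layout, the `DIRECTIONAL_MARKS` constant, and the helper names `_char_range` and `strip_invisible` are assumptions made for illustration; the real module builds its pattern differently and covers more characters.

```python
import re

# Range constants as they appear in the hunk above.
VARIATION_SELECTOR_RANGE = (0xFE00, 0xFE0F)
VARIATION_SELECTOR_SUPPLEMENT_RANGE = (0xE0100, 0xE01EF)
TAG_BLOCK_RANGE = (0xE0000, 0xE007F)
# Hypothetical constant: the two directional marks added by this commit.
DIRECTIONAL_MARKS = "\u200e\u200f"


def _char_range(bounds: tuple[int, int]) -> str:
    """Render (lo, hi) code points as a regex character-class range."""
    lo, hi = bounds
    return f"{chr(lo)}-{chr(hi)}"


INVISIBLE_RE = re.compile(
    "["
    + _char_range(VARIATION_SELECTOR_RANGE)
    + _char_range(VARIATION_SELECTOR_SUPPLEMENT_RANGE)
    + _char_range(TAG_BLOCK_RANGE)
    + DIRECTIONAL_MARKS
    + "]"
)


def strip_invisible(text: str) -> tuple[str, int]:
    """Return (cleaned_text, number_of_characters_removed)."""
    return INVISIBLE_RE.subn("", text)


# "hi" smuggled as tag characters, plus one supplementary variation
# selector and one left-to-right mark.
hidden = "".join(chr(0xE0000 + ord(c)) for c in "hi")
print(strip_invisible("safe" + hidden + "\U000E0100" + "\u200e"))  # ('safe', 4)
```

With the pre-commit ranges, the supplementary selector, the directional mark, and a bare `U+E0000` would each have survived this kind of filter.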

src/navi_sanitize/_pipeline.py

Lines changed: 19 additions & 9 deletions
@@ -24,11 +24,12 @@
 Escaper = Callable[[str], str]


-def _strip_null_bytes(s: str) -> tuple[str, bool]:
-    """Strip null bytes. Returns (cleaned, changed)."""
-    if "\x00" in s:
-        return s.replace("\x00", ""), True
-    return s, False
+def _strip_null_bytes(s: str) -> tuple[str, int]:
+    """Strip null bytes. Returns (cleaned, count_removed)."""
+    count = s.count("\x00")
+    if count:
+        return s.replace("\x00", ""), count
+    return s, 0


 def _strip_invisible(s: str) -> tuple[str, int]:
@@ -68,10 +69,13 @@ def clean(text: str, *, escaper: Escaper | None = None) -> str:

     Always returns output. Logs warnings when input is modified.
     """
+    if not isinstance(text, str):
+        raise TypeError(f"clean() requires str, got {type(text).__name__}")
+
     # Stage 1: Null bytes
-    text, had_nulls = _strip_null_bytes(text)
-    if had_nulls:
-        logger.warning("Removed null byte(s) from value")
+    text, null_count = _strip_null_bytes(text)
+    if null_count:
+        logger.warning("Removed %d null byte(s) from value", null_count)

     # Stage 2: Invisible characters
     text, invis_count = _strip_invisible(text)
@@ -91,6 +95,8 @@ def clean(text: str, *, escaper: Escaper | None = None) -> str:
     # Stage 5: Escaper
     if escaper is not None:
         text = escaper(text)
+        if not isinstance(text, str):
+            raise TypeError(f"Escaper must return str, got {type(text).__name__}")

     return text
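The two type guards and the counted null-byte warning from these hunks can be exercised in isolation. The following is a reduced sketch, not the real `clean()`: `clean_sketch`, `strip_null_bytes`, and the logger name are hypothetical, and the intermediate pipeline stages are omitted.

```python
import logging
from typing import Callable, Optional

logger = logging.getLogger("sanitize_sketch")  # hypothetical logger name


def strip_null_bytes(s: str) -> tuple[str, int]:
    """Return (cleaned, count_removed), mirroring the new return shape."""
    count = s.count("\x00")
    if count:
        return s.replace("\x00", ""), count
    return s, 0


def clean_sketch(
    text: str, *, escaper: Optional[Callable[[str], str]] = None
) -> str:
    """Reduced pipeline: input guard, null-byte stage, escaper guard."""
    # Guard 1: reject non-str input up front instead of failing later
    # with a less helpful error.
    if not isinstance(text, str):
        raise TypeError(f"clean() requires str, got {type(text).__name__}")

    text, null_count = strip_null_bytes(text)
    if null_count:
        logger.warning("Removed %d null byte(s) from value", null_count)

    if escaper is not None:
        text = escaper(text)
        # Guard 2: a misbehaving escaper must not leak a non-str result.
        if not isinstance(text, str):
            raise TypeError(f"Escaper must return str, got {type(text).__name__}")
    return text


print(clean_sketch("a\x00b\x00"))  # ab
```

Passing `b"raw"` as input, or an escaper such as `lambda s: s.encode()`, raises `TypeError` in this sketch at the first and second guard respectively.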

@@ -107,8 +113,12 @@ def walk[T](data: T, *, escaper: Escaper | None = None) -> T:
 def _walk_inner(obj: object, *, escaper: Escaper | None = None) -> object:
     """Walk and sanitize in place on the deep-copied structure."""
     if isinstance(obj, dict):
+        new_dict: dict[object, object] = {}
         for k, v in obj.items():
-            obj[k] = _walk_inner(v, escaper=escaper)
+            clean_key = clean(k, escaper=escaper) if isinstance(k, str) else k
+            new_dict[clean_key] = _walk_inner(v, escaper=escaper)
+        obj.clear()
+        obj.update(new_dict)
         return obj
     if isinstance(obj, list):
         for i, item in enumerate(obj):
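The key-rebuilding pattern in this hunk is worth seeing end to end: keys cannot be renamed while iterating over the same dict, so the cleaned entries are collected in a second dict and swapped in afterwards. The sketch below follows that shape under assumptions: `sanitize_structure` is a hypothetical name, and the `clean` argument is a stand-in callable rather than the library's `clean()`.

```python
from typing import Callable


def sanitize_structure(obj: object, clean: Callable[[str], str]) -> object:
    """Recursively clean str keys and str values, mutating containers in place."""
    if isinstance(obj, dict):
        rebuilt: dict[object, object] = {}
        for key, value in obj.items():
            # Only str keys are cleaned; other hashable keys pass through.
            new_key = clean(key) if isinstance(key, str) else key
            rebuilt[new_key] = sanitize_structure(value, clean)
        # Swap contents after iteration so the caller's dict object is kept.
        obj.clear()
        obj.update(rebuilt)
        return obj
    if isinstance(obj, list):
        for i, item in enumerate(obj):
            obj[i] = sanitize_structure(item, clean)
        return obj
    if isinstance(obj, str):
        return clean(obj)
    return obj


def drop_zero_width_space(s: str) -> str:
    """Stand-in cleaner for the example: removes U+200B only."""
    return s.replace("\u200b", "")


data = {"ro\u200ble": "adm\u200bin", "nested": [{"\u200bk": 1}], 7: "x"}
print(sanitize_structure(data, drop_zero_width_space))
# {'role': 'admin', 'nested': [{'k': 1}], 7: 'x'}
```

One consequence of this design: if two distinct keys clean to the same string, the later entry overwrites the earlier one, so `{"a\u200b": 1, "a": 2}` collapses to `{"a": 2}` in this sketch.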

src/navi_sanitize/escapers/_path.py

Lines changed: 1 addition & 0 deletions
@@ -10,6 +10,7 @@ def path_escaper(text: str) -> str:
     Strips ../ and ./ segments, leading /, and embedded .. within segments
     (which can appear when earlier pipeline stages concatenate fragments).
     """
+    text = text.replace("\\", "/")
     text = text.lstrip("/")
     parts = text.split("/")
     clean_parts: list[str] = []
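The body of `path_escaper` is cut off by the hunk, so the following is only a sketch of the behaviour its docstring describes, with the new backslash normalization as the first step. The loop body and the name `path_escaper_sketch` are assumptions, not the library's code.

```python
def path_escaper_sketch(text: str) -> str:
    """Strip traversal segments after normalizing Windows separators."""
    # New in this commit: treat backslashes as separators so that
    # "..\\..\\x" is split into segments like "../../x".
    text = text.replace("\\", "/")
    text = text.lstrip("/")
    clean_parts: list[str] = []
    for part in text.split("/"):
        # Assumed handling: drop embedded ".." inside a segment, then
        # skip segments that end up empty or equal to ".".
        part = part.replace("..", "")
        if part in ("", "."):
            continue
        clean_parts.append(part)
    return "/".join(clean_parts)


print(path_escaper_sketch("..\\..\\windows\\system32"))  # windows/system32
print(path_escaper_sketch("/etc/passwd"))  # etc/passwd
```

Without the first line, a backslash-separated input contains no `/` at all, so it is never split into segments and the per-segment checks have nothing to examine.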
