Commit 6cac6a5

feat: add decode_evasion, detect_scripts, is_mixed_script (#2)
* feat: add decode_evasion(), detect_scripts(), and is_mixed_script()

  Opt-in primitives for multi-encoding evasion decode and mixed-script detection. Core clean() pipeline is unchanged.

  - decode_evasion() iteratively peels URL, HTML entity, and hex escape layers (up to max_layers, default 3). Never logs decoded content.
  - detect_scripts() returns script buckets (latin, cyrillic, greek, etc.) present in text. Unknown scripts silently ignored.
  - is_mixed_script() returns True when 2+ scripts are detected.
  - 62 new tests (34 scripts, 28 decode) — 303 total, all passing.
  - Zero new dependencies (stdlib only: html, urllib.parse, unicodedata).

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* docs: add opt-in utilities to README

  Document decode_evasion(), detect_scripts(), and is_mixed_script() with usage examples and composition pattern.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

* fix: preserve malformed percent-encoded bytes in decode_evasion

  Use unquote_to_bytes() with surrogateescape to keep invalid UTF-8 sequences (e.g. %FF, lone %80) as literal %XX instead of replacing them with U+FFFD. Also adds test for unnamed-but-alphabetic chars in detect_scripts to close the codecov gap.

  Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 77260ae commit 6cac6a5

7 files changed

Lines changed: 528 additions & 1 deletion

README.md

Lines changed: 29 additions & 0 deletions
@@ -18,6 +18,8 @@ clean("price:\u200b 0") # "price: 0" — zero-width space stripped
 clean("file\x00.txt") # "file.txt" — null byte removed
 ```
 
+Opt-in utilities for deeper analysis: `decode_evasion()` peels nested URL/HTML/hex encodings, `detect_scripts()` and `is_mixed_script()` flag mixed-script spoofing.
+
 ## Why This Matters
 
 Untrusted text contains invisible attacks: homoglyph substitution, zero-width characters, null bytes, fullwidth encoding, template/prompt injection delimiters. These bypass validation, poison templates, and fool humans.
@@ -106,6 +108,33 @@ from navi_sanitize import walk
 spec = walk(untrusted_json)
 ```
 
+## Opt-in Utilities
+
+**These utilities are not part of `clean()` and are never run automatically.** You must call them explicitly.
+
+```python
+from navi_sanitize import decode_evasion, clean, detect_scripts, is_mixed_script, path_escaper
+
+# Double-encoded path traversal
+raw = "%252e%252e%252fetc%252fpasswd"
+
+# 1. Peel nested encodings (URL → HTML entities → hex escapes)
+peeled = decode_evasion(raw) # "../etc/passwd"
+
+# 2. Sanitize through the universal pipeline
+cleaned = clean(peeled, escaper=path_escaper) # "etc/passwd"
+
+# 3. Check for mixed-script spoofing (useful on raw or pre-clean input)
+if is_mixed_script(raw) or is_mixed_script(peeled):
+    flag_for_review(raw)
+```
+
+- **`decode_evasion(text, *, max_layers=3)`** — iterative URL/HTML/hex decoding; stops when a pass produces no change
+- **`detect_scripts(text)`** — returns script buckets present in text (`latin`, `cyrillic`, `greek`, etc.)
+- **`is_mixed_script(text)`** — `True` when 2+ scripts detected
+
+Script detection can be applied pre-clean too — most useful on raw input for phishing detection.
+
 ## Warnings
 
 The pipeline never errors. It always produces output. When it changes something, it logs a warning.

src/navi_sanitize/__init__.py

Lines changed: 12 additions & 1 deletion
@@ -6,13 +6,24 @@
 import logging
 from collections.abc import Callable
 
+from navi_sanitize._decode import decode_evasion
 from navi_sanitize._pipeline import clean, walk
+from navi_sanitize._scripts import detect_scripts, is_mixed_script
 from navi_sanitize.escapers import jinja2_escaper, path_escaper
 
 Escaper = Callable[[str], str]
 
 __version__ = "0.1.0"
-__all__ = ["Escaper", "clean", "jinja2_escaper", "path_escaper", "walk"]
+__all__ = [
+    "Escaper",
+    "clean",
+    "decode_evasion",
+    "detect_scripts",
+    "is_mixed_script",
+    "jinja2_escaper",
+    "path_escaper",
+    "walk",
+]
 
 # Library logging best practice: NullHandler
 logging.getLogger("navi_sanitize").addHandler(logging.NullHandler())
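
Because the package attaches only a `NullHandler`, anything it logs on the `navi_sanitize` logger stays silent until the application adds a handler of its own. Below is a minimal sketch of opting in with the standard `logging` module. `ListHandler` is a hypothetical helper written for this example, and the warning is emitted by hand here rather than by the library.

```python
import logging


class ListHandler(logging.Handler):
    """Hypothetical helper: collects formatted messages in a list."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


lib_logger = logging.getLogger("navi_sanitize")
# Mirrors what the package does at import time: a no-op handler.
lib_logger.addHandler(logging.NullHandler())

# Application side: attach a real handler to start seeing warnings.
capture = ListHandler()
lib_logger.addHandler(capture)

# Stand-in for a warning the library would emit (layer count only, no content).
lib_logger.warning("Decoded %d encoding layer(s) from value", 2)
```

In a real application a `StreamHandler` or the project's existing logging configuration would take the place of `ListHandler`.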

src/navi_sanitize/_decode.py

Lines changed: 97 additions & 0 deletions
@@ -0,0 +1,97 @@
+# SPDX-License-Identifier: MIT
+"""Multi-encoding evasion decoder for untrusted text.
+
+Opt-in pre-processor — not part of the default ``clean()`` pipeline.
+Callers compose it with ``clean()`` explicitly::
+
+    text = decode_evasion(user_input) # peel encoding layers
+    cleaned = clean(text, escaper=...) # sanitize
+
+Iteratively decodes URL encoding, HTML entities, and hex escapes
+(``\\xHH``). Stops when a full pass produces no changes or
+*max_layers* is reached.
+"""
+
+from __future__ import annotations
+
+import html
+import logging
+import re
+import urllib.parse
+
+logger = logging.getLogger("navi_sanitize")
+
+MAX_DECODE_LAYERS: int = 3
+
+_HEX_RE = re.compile(r"\\x([0-9a-fA-F]{2})")
+
+
+def _decode_url(s: str) -> str:
+    """Decode URL percent-encoding, preserving malformed byte sequences.
+
+    Valid UTF-8 percent-encoded sequences decode normally.
+    Invalid byte sequences (e.g. ``%FF``, lone ``%80``) are kept
+    as literal percent-encoded text instead of being replaced with
+    U+FFFD.
+    """
+    raw = urllib.parse.unquote_to_bytes(s)
+    # Decode valid UTF-8; map invalid bytes to surrogates so we can
+    # re-encode them back to %XX in the next step.
+    text = raw.decode("utf-8", errors="surrogateescape")
+    # Re-encode any lone surrogates (from invalid bytes) back to %XX
+    parts: list[str] = []
+    for ch in text:
+        if "\udc80" <= ch <= "\udcff":
+            parts.append(f"%{ord(ch) & 0xFF:02X}")
+        else:
+            parts.append(ch)
+    return "".join(parts)
+
+
+def _decode_html_entities(s: str) -> str:
+    """Decode HTML/XML character entities."""
+    return html.unescape(s)
+
+
+def _decode_hex_escapes(s: str) -> str:
+    r"""Decode literal ``\xHH`` escape sequences."""
+
+    def _replace(m: re.Match[str]) -> str:
+        return chr(int(m.group(1), 16))
+
+    return _HEX_RE.sub(_replace, s)
+
+
+def decode_evasion(text: str, *, max_layers: int = MAX_DECODE_LAYERS) -> str:
+    """Iteratively decode nested encodings from *text*.
+
+    Runs URL decoding, HTML entity unescaping, and hex escape decoding
+    in sequence as a single pass. A pass counts as one layer if the
+    output differs from the input. Stops when a pass produces no
+    changes or *max_layers* is reached.
+
+    Logs a warning with the layer count when decoding occurs. Never
+    includes decoded content in log messages.
+
+    Never errors on invalid or partial encodings — they pass through
+    unchanged.
+    """
+    if max_layers <= 0:
+        return text
+
+    layers = 0
+    for _ in range(max_layers):
+        # Run all three decoders in sequence (one pass)
+        decoded = _decode_url(text)
+        decoded = _decode_html_entities(decoded)
+        decoded = _decode_hex_escapes(decoded)
+        # "changed" = output differs from input for this pass
+        if decoded == text:
+            break
+        text = decoded
+        layers += 1
+
+    if layers:
+        logger.warning("Decoded %d encoding layer(s) from value", layers)
+
+    return text
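
The pass structure in this file can be tried out on its own. What follows is a standalone sketch built on the standard library, not the package's code: `unquote_preserving` and `peel` are hypothetical names, and `peel` returns the layer count alongside the text only so the example can display it.

```python
import html
import re
import urllib.parse

_HEX = re.compile(r"\\x([0-9a-fA-F]{2})")


def unquote_preserving(s: str) -> str:
    """Percent-decode, keeping bytes that are not valid UTF-8 as %XX."""
    raw = urllib.parse.unquote_to_bytes(s)
    # Invalid bytes become lone surrogates U+DC80..U+DCFF.
    text = raw.decode("utf-8", errors="surrogateescape")
    # Turn those surrogates back into their %XX spelling.
    return "".join(
        f"%{ord(ch) & 0xFF:02X}" if "\udc80" <= ch <= "\udcff" else ch
        for ch in text
    )


def peel(text: str, max_layers: int = 3) -> tuple[str, int]:
    """Run URL, HTML-entity and hex decoding as one pass, up to max_layers times."""
    layers = 0
    for _ in range(max_layers):
        step = unquote_preserving(text)
        step = html.unescape(step)
        step = _HEX.sub(lambda m: chr(int(m.group(1), 16)), step)
        if step == text:  # a pass with no change ends the loop
            break
        text, layers = step, layers + 1
    return text, layers


double_url = peel("%252e%252e%252fetc%252fpasswd")  # ("../etc/passwd", 2)
url_then_html = peel("%26lt%3Bscript%26gt%3B")      # ("<script>", 1)
capped = peel("%25252e", max_layers=1)              # ("%252e", 1)
malformed = peel("%FF")                             # ("%FF", 0)
lossy = urllib.parse.unquote("%FF")                 # "\ufffd"
```

The second case shows that a URL layer wrapping an HTML-entity layer is removed in a single pass, because the three decoders run back to back. The last two lines contrast the preserving decoder with plain `urllib.parse.unquote`, which substitutes U+FFFD for the invalid byte.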

src/navi_sanitize/_scripts.py

Lines changed: 63 additions & 0 deletions
@@ -0,0 +1,63 @@
+# SPDX-License-Identifier: MIT
+"""Mixed-script detection for untrusted text.
+
+Opt-in analysis primitive — no transformation, no blocking. Callers use the
+results to decide whether to warn, block, or confirm. Not part of the default
+``clean()`` pipeline.
+
+Only known script buckets are returned; characters whose Unicode name doesn't
+match any known prefix are silently ignored.
+"""
+
+from __future__ import annotations
+
+import unicodedata
+
+# First token of unicodedata.name() → bucket
+_SCRIPT_PREFIXES: dict[str, str] = {
+    "LATIN": "latin",
+    "CYRILLIC": "cyrillic",
+    "GREEK": "greek",
+    "ARABIC": "arabic",
+    "HEBREW": "hebrew",
+    "ARMENIAN": "armenian",
+    "CHEROKEE": "cherokee",
+    "CJK": "cjk",
+    "HIRAGANA": "cjk",
+    "KATAKANA": "cjk",
+    "HANGUL": "cjk",
+}
+
+
+def detect_scripts(text: str) -> set[str]:
+    """Return the set of script buckets present in *text*.
+
+    Only alphabetic characters are considered; digits, punctuation, emoji,
+    and characters with no Unicode name are skipped. Unknown scripts (not in
+    the prefix map) are silently ignored.
+
+    Buckets: ``latin``, ``cyrillic``, ``greek``, ``arabic``, ``hebrew``,
+    ``armenian``, ``cherokee``, ``cjk`` (covers CJK Unified, Hiragana,
+    Katakana, and Hangul).
+    """
+    scripts: set[str] = set()
+    for ch in text:
+        if not ch.isalpha():
+            continue
+        name = unicodedata.name(ch, "")
+        if not name:
+            continue
+        head = name.split(" ", 1)[0]
+        bucket = _SCRIPT_PREFIXES.get(head)
+        if bucket is not None:
+            scripts.add(bucket)
+    return scripts
+
+
+def is_mixed_script(text: str) -> bool:
+    """Return ``True`` if *text* contains characters from two or more scripts.
+
+    Non-alphabetic characters (digits, punctuation, emoji) are not counted,
+    so ``"hello 123"`` is *not* considered mixed.
+    """
+    return len(detect_scripts(text)) >= 2
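
The first-token lookup can be illustrated with a standalone sketch that assumes a trimmed prefix map. `script_buckets` is a hypothetical name, not the package API. The example also surfaces one consequence of keying on the first word of the Unicode name: fullwidth Latin letters are named `FULLWIDTH LATIN ...`, so on raw input they appear to fall outside every bucket.

```python
import unicodedata

# Trimmed, illustrative subset of the prefix map.
PREFIXES = {"LATIN": "latin", "CYRILLIC": "cyrillic", "GREEK": "greek"}


def script_buckets(text: str) -> set[str]:
    """Bucket alphabetic characters by the first word of their Unicode name."""
    found: set[str] = set()
    for ch in text:
        if not ch.isalpha():
            continue
        # Unnamed characters yield "" here and match no prefix.
        head = unicodedata.name(ch, "").split(" ", 1)[0]
        bucket = PREFIXES.get(head)
        if bucket is not None:
            found.add(bucket)
    return found


spoofed = script_buckets("p\u0430ypal")  # U+0430 is CYRILLIC SMALL LETTER A
plain = script_buckets("hello 123")      # digits and spaces are skipped
fullwidth = script_buckets("\uff21")     # FULLWIDTH LATIN CAPITAL LETTER A
is_mixed = len(spoofed) >= 2
```

Here `spoofed` is `{"latin", "cyrillic"}`, `plain` is `{"latin"}`, and `fullwidth` is the empty set. Whether to check raw text, cleaned text, or both is left to the caller.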

tests/test_audit_remediation.py

Lines changed: 3 additions & 0 deletions
@@ -348,6 +348,9 @@ def test_public_exports(self) -> None:
         assert hasattr(navi_sanitize, "__version__")
         assert set(navi_sanitize.__all__) == {
             "clean",
+            "decode_evasion",
+            "detect_scripts",
+            "is_mixed_script",
             "walk",
             "jinja2_escaper",
             "path_escaper",
