feat: add decode_evasion, detect_scripts, is_mixed_script #2
Opt-in primitives for multi-encoding evasion decode and mixed-script detection. Core clean() pipeline is unchanged.
- decode_evasion() iteratively peels URL, HTML entity, and hex escape layers (up to max_layers, default 3). Never logs decoded content.
- detect_scripts() returns script buckets (latin, cyrillic, greek, etc.) present in text. Unknown scripts silently ignored.
- is_mixed_script() returns True when 2+ scripts are detected.
- 62 new tests (34 scripts, 28 decode); 303 total, all passing.
- Zero new dependencies (stdlib only: html, urllib.parse, unicodedata).
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
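For readers skimming the description, here is a minimal sketch of the iterative peeling loop it describes. It is not the code in this PR: the name `decode_evasion_sketch`, the `\xNN` hex escape format, and the order of decoders within a pass are assumptions made for illustration, and malformed percent-encodings get no special handling.

```python
import html
import re
import urllib.parse

# Assumption: "hex escapes" means literal backslash-x sequences such as \x41.
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def _decode_hex(s: str) -> str:
    """Replace each literal \\xNN escape with the character it encodes."""
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)


def decode_evasion_sketch(text: str, *, max_layers: int = 3) -> str:
    """Peel URL, HTML entity, and hex escape layers until nothing changes."""
    if max_layers <= 0:
        # Fast path: zero or negative limits return the input untouched.
        return text
    for _ in range(max_layers):
        # One full pass applies every decoder once (order is an assumption).
        decoded = _decode_hex(html.unescape(urllib.parse.unquote(text)))
        if decoded == text:
            break  # a full pass produced no changes
        text = decoded
    return text


# Double URL encoding needs two passes to reach the payload.
peeled = decode_evasion_sketch("%253Cscript%253E")
```

With `max_layers=1` the same input stops after one pass, so one layer of percent-encoding is still in place.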
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 518537aceb
def _decode_url(s: str) -> str:
    """Decode URL percent-encoding."""
    return urllib.parse.unquote(s)
Preserve malformed percent-encodings in URL decode
decode_evasion() documents that invalid encodings should pass through unchanged, but _decode_url() calls urllib.parse.unquote() with default UTF-8 errors='replace', so malformed byte sequences are mutated (for example, %FF becomes �). This means attacker-controlled malformed inputs are silently rewritten instead of preserved, which can corrupt values and change downstream sanitization/audit behavior in flows that rely on exact pre-clean text.
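The behavior this comment describes can be reproduced with the standard library alone. The snippet below is only a demonstration of `urllib.parse.unquote()` defaults; the variable names are illustrative.

```python
import urllib.parse

# Default errors="replace": the invalid UTF-8 byte 0xFF becomes U+FFFD.
replaced = urllib.parse.unquote("%FF")

# errors="strict" surfaces the problem instead of rewriting the input.
try:
    urllib.parse.unquote("%FF", errors="strict")
    strict_raised = False
except UnicodeDecodeError:
    strict_raised = True

# Valid sequences decode the same way under either setting.
valid = urllib.parse.unquote("caf%C3%A9")
```

Neither mode leaves the original `%FF` text in place, which is the gap the review points at.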
Document decode_evasion(), detect_scripts(), and is_mixed_script() with usage examples and composition pattern. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Use unquote_to_bytes() with surrogateescape to keep invalid UTF-8 sequences (e.g. %FF, lone %80) as literal %XX instead of replacing them with U+FFFD. Also adds test for unnamed-but-alphabetic chars in detect_scripts to close the codecov gap. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
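A rough sketch of the approach named in this commit message, assuming the fix decodes to bytes first and then re-escapes whatever failed to decode. The function name `decode_url_preserving` is hypothetical, and the actual change in the repository may differ in detail (for example, this sketch re-emits preserved bytes with uppercase hex digits).

```python
import urllib.parse


def decode_url_preserving(s: str) -> str:
    """Percent-decode s, keeping bytes that are not valid UTF-8 as %XX."""
    raw = urllib.parse.unquote_to_bytes(s)
    # surrogateescape maps each undecodable byte 0x80..0xFF to a lone
    # surrogate in U+DC80..U+DCFF instead of replacing it with U+FFFD.
    text = raw.decode("utf-8", errors="surrogateescape")
    out = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Turn the escaped byte back into its percent-encoded form.
            out.append(f"%{code - 0xDC00:02X}")
        else:
            out.append(ch)
    return "".join(out)


preserved = decode_url_preserving("%FF")
decoded = decode_url_preserving("caf%C3%A9")
```

One limitation of this sketch: input that already contains lone surrogates would fail inside `unquote_to_bytes()`, which encodes `str` input as UTF-8 before decoding.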
Summary
- decode_evasion(text, *, max_layers=3): Standalone pre-processor that iteratively peels URL encoding, HTML entities, and hex escape layers from untrusted text. Stops when a full pass produces no changes or max_layers is reached. Never logs decoded content.
- detect_scripts(text): Returns the set of script buckets (latin, cyrillic, greek, arabic, hebrew, armenian, cherokee, cjk) present in text. Unknown scripts silently ignored.
- is_mixed_script(text): Returns True when 2+ scripts are detected. Pure analysis: no transformation, no blocking.

All three are opt-in primitives; the core clean() pipeline is unchanged. Zero new dependencies (stdlib only: html, urllib.parse, unicodedata).

Design decisions
- cjk bucket: Chinese, Japanese (Hiragana/Katakana), and Korean all map to one bucket
- max_layers <= 0 fast path: early return, no surprises with negative values
- name.split(" ", 1)[0]: dict lookup on first token of unicodedata.name(), not prefix loop

Test coverage
- pаypal.com, gооgle.com, аpple.com (Cyrillic lookalikes)
- decode_evasion → clean → detect_scripts full pipeline composition

Test plan
- uv run pytest tests/ -v --benchmark-disable: 303 passed
- uv run mypy --strict src/navi_sanitize/: no issues
- uv run ruff check src/ tests/: all checks passed
- uv run ruff format --check src/ tests/: all formatted

🤖 Generated with Claude Code
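The first-token lookup listed under Design decisions can be illustrated with a short sketch. The function names, the bucket table, and the `isalpha()` filter are assumptions for illustration, not the code in src/navi_sanitize/.

```python
import unicodedata

# Hypothetical table: first token of unicodedata.name() -> script bucket.
_BUCKETS = {
    "LATIN": "latin",
    "CYRILLIC": "cyrillic",
    "GREEK": "greek",
    "ARABIC": "arabic",
    "HEBREW": "hebrew",
    "ARMENIAN": "armenian",
    "CHEROKEE": "cherokee",
    # Chinese, Japanese kana, and Korean share a single bucket.
    "CJK": "cjk",
    "HIRAGANA": "cjk",
    "KATAKANA": "cjk",
    "HANGUL": "cjk",
}


def detect_scripts_sketch(text: str) -> set[str]:
    """Return the script buckets of the alphabetic characters in text."""
    found: set[str] = set()
    for ch in text:
        if not ch.isalpha():
            continue  # digits, punctuation, and whitespace carry no script
        name = unicodedata.name(ch, "")
        if not name:
            continue  # alphabetic but unnamed: skip
        # One dict lookup on the first token, e.g. "CYRILLIC" for U+0430.
        bucket = _BUCKETS.get(name.split(" ", 1)[0])
        if bucket is not None:
            found.add(bucket)  # unknown scripts are silently ignored
    return found


def is_mixed_script_sketch(text: str) -> bool:
    """True when two or more script buckets are present."""
    return len(detect_scripts_sketch(text)) >= 2


# "\u0430" is CYRILLIC SMALL LETTER A, a lookalike for Latin "a".
spoofed = detect_scripts_sketch("p\u0430ypal.com")
```

A plain ASCII domain such as `paypal.com` yields a single bucket, so `is_mixed_script_sketch` returns False for it.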