feat: add decode_evasion, detect_scripts, is_mixed_script by Fieldnote-Echo · Pull Request #2 · Project-Navi/navi-sanitize

Fieldnote-Echo · 2026-03-01T18:25:59Z

Summary

decode_evasion(text, *, max_layers=3) — Standalone pre-processor that iteratively peels URL encoding, HTML entities, and hex escape layers from untrusted text. Stops when a full pass produces no changes or max_layers is reached. Never logs decoded content.
detect_scripts(text) — Returns the set of script buckets (latin, cyrillic, greek, arabic, hebrew, armenian, cherokee, cjk) present in text. Unknown scripts silently ignored.
is_mixed_script(text) — Returns True when 2+ scripts are detected. Pure analysis — no transformation, no blocking.

All three are opt-in primitives — the core clean() pipeline is unchanged. Zero new dependencies (stdlib only: html, urllib.parse, unicodedata).
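The peeling loop described in the summary can be sketched with the same three stdlib modules. This is a minimal illustration under stated assumptions, not the code from this PR: the names `peel_layers` and `_decode_hex` are hypothetical, and "hex escape" is assumed here to mean `\xNN` sequences. It uses plain `urllib.parse.unquote` for brevity, which replaces invalid UTF-8 byte sequences with U+FFFD.

```python
import html
import re
import urllib.parse

# Assumption: "hex escape" means literal backslash-x sequences such as \x41.
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def _decode_hex(s: str) -> str:
    """Replace each \\xNN escape with the character it encodes."""
    return _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), s)


def peel_layers(text: str, *, max_layers: int = 3) -> str:
    """Hypothetical stand-in for decode_evasion(): peel encoding layers."""
    if max_layers <= 0:
        # Fast path: zero or negative limits return the input untouched.
        return text
    for _ in range(max_layers):
        # One layer = one full URL -> HTML -> hex pass.
        decoded = _decode_hex(html.unescape(urllib.parse.unquote(text)))
        if decoded == text:
            break  # a full pass changed nothing, so we are done
        text = decoded
    return text


print(peel_layers("%253Cscript%253E"))  # double URL encoding
print(peel_layers("%26lt%3B", max_layers=1))  # URL + HTML in a single pass
```

Because URL, HTML, and hex decoding run in sequence inside one iteration, a URL-encoded HTML entity such as `%26lt%3B` is fully decoded even with `max_layers=1`.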

Design decisions

No base64 in v1 — too high false-positive risk; URL + HTML + hex covers 90% of bypass vectors
Single cjk bucket — Chinese, Japanese (Hiragana/Katakana), and Korean all map to one bucket
Layer counting per pass — URL→HTML→hex in sequence = one layer, not one per decoder
max_layers <= 0 fast path — early return, no surprises with negative values
Script lookup via name.split(" ", 1)[0] — dict lookup on first token of unicodedata.name(), not prefix loop
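The first-token lookup and the single cjk bucket can be sketched as follows. This is an illustration only, assuming a hand-written mapping table; `_BUCKETS`, `scripts_in`, and `looks_mixed` are hypothetical names, and the PR's actual table and filtering rules may differ.

```python
import unicodedata

# Assumed mapping from the first token of a Unicode character name to a bucket.
# Hiragana, Katakana, and Hangul share the "cjk" bucket with CJK ideographs.
_BUCKETS = {
    "LATIN": "latin",
    "CYRILLIC": "cyrillic",
    "GREEK": "greek",
    "ARABIC": "arabic",
    "HEBREW": "hebrew",
    "ARMENIAN": "armenian",
    "CHEROKEE": "cherokee",
    "CJK": "cjk",
    "HIRAGANA": "cjk",
    "KATAKANA": "cjk",
    "HANGUL": "cjk",
}


def scripts_in(text: str) -> set[str]:
    """Hypothetical stand-in for detect_scripts()."""
    found: set[str] = set()
    for ch in text:
        if not ch.isalpha():
            continue  # digits, punctuation, and whitespace carry no script
        # unicodedata.name() returns the default "" for unnamed characters.
        first_token = unicodedata.name(ch, "").split(" ", 1)[0]
        bucket = _BUCKETS.get(first_token)
        if bucket is not None:  # unknown scripts are silently ignored
            found.add(bucket)
    return found


def looks_mixed(text: str) -> bool:
    """Hypothetical stand-in for is_mixed_script(): two or more buckets."""
    return len(scripts_in(text)) >= 2


# "p\u0430ypal.com" uses CYRILLIC SMALL LETTER A in place of Latin "a".
print(sorted(scripts_in("p\u0430ypal.com")))
print(looks_mixed("paypal.com"))
```

A single dict lookup per character avoids looping over every known prefix, which is the trade-off the design note above points at.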

Test coverage

62 new tests (34 script detection, 28 decode evasion) — 303 total, all passing
Phishing scenarios: pаypal.com, gооgle.com, аpple.com (Cyrillic lookalikes)
Multi-layer nesting: double/triple/quadruple URL encoding, URL+HTML nesting
Edge cases: invalid encodings, partial hex, negative max_layers, empty strings
Integration: decode_evasion → clean → detect_scripts full pipeline composition
Log safety: decoded content never appears in warning messages
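To make the multi-layer nesting cases concrete, the sketch below builds N-times URL-encoded inputs with `urllib.parse.quote` and shows what a layer-bounded decoder would leave behind. It assumes a decoder that stops after `max_layers` passes; `encode_n` and `decode_bounded` are hypothetical helpers, not fixtures from this PR's test suite.

```python
import urllib.parse


def encode_n(text: str, n: int) -> str:
    """Percent-encode text n times, escaping every reserved character."""
    for _ in range(n):
        text = urllib.parse.quote(text, safe="")
    return text


def decode_bounded(text: str, max_layers: int = 3) -> str:
    """Undo at most max_layers rounds of percent-encoding."""
    for _ in range(max(max_layers, 0)):
        decoded = urllib.parse.unquote(text)
        if decoded == text:
            break
        text = decoded
    return text


triple = encode_n("<", 3)
quadruple = encode_n("<", 4)
print(triple, "->", decode_bounded(triple))
print(quadruple, "->", decode_bounded(quadruple))
```

With a limit of three, triple encoding decodes fully, while quadruple encoding comes back with one layer still in place, so a caller can tell that the limit was hit by checking for remaining `%XX` sequences.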

Test plan

uv run pytest tests/ -v --benchmark-disable — 303 passed
uv run mypy --strict src/navi_sanitize/ — no issues
uv run ruff check src/ tests/ — all checks passed
uv run ruff format --check src/ tests/ — all formatted
Pre-commit hooks pass (trailing whitespace, ruff, mypy)

🤖 Generated with Claude Code

Opt-in primitives for multi-encoding evasion decode and mixed-script detection. Core clean() pipeline is unchanged.

- decode_evasion() iteratively peels URL, HTML entity, and hex escape layers (up to max_layers, default 3). Never logs decoded content.
- detect_scripts() returns script buckets (latin, cyrillic, greek, etc.) present in text. Unknown scripts silently ignored.
- is_mixed_script() returns True when 2+ scripts are detected.
- 62 new tests (34 scripts, 28 decode); 303 total, all passing.
- Zero new dependencies (stdlib only: html, urllib.parse, unicodedata).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 518537aceb

chatgpt-codex-connector · 2026-03-01T18:28:56Z

+
+def _decode_url(s: str) -> str:
+    """Decode URL percent-encoding."""
+    return urllib.parse.unquote(s)


Preserve malformed percent-encodings in URL decode

decode_evasion() documents that invalid encodings should pass through unchanged, but _decode_url() calls urllib.parse.unquote() with default UTF-8 errors='replace', so malformed byte sequences are mutated (for example, %FF becomes �). This means attacker-controlled malformed inputs are silently rewritten instead of preserved, which can corrupt values and change downstream sanitization/audit behavior in flows that rely on exact pre-clean text.
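The behavior the review describes is easy to reproduce with the standard library alone. The snippet below only demonstrates `urllib.parse.unquote` with its default arguments; the variable names are illustrative.

```python
import urllib.parse

# %FF and a lone %80 are not valid UTF-8, so the default errors="replace"
# substitutes U+FFFD (the replacement character) for each bad byte.
replaced_ff = urllib.parse.unquote("%FF")
replaced_80 = urllib.parse.unquote("%80abc")

# A well-formed sequence decodes normally.
valid = urllib.parse.unquote("%C3%A9")

print(repr(replaced_ff), repr(replaced_80), repr(valid))
```

Once the original byte has been replaced, the `%FF` text cannot be recovered from the output, which is why the mutation matters for audit trails.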

Document decode_evasion(), detect_scripts(), and is_mixed_script() with usage examples and composition pattern. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Use unquote_to_bytes() with surrogateescape to keep invalid UTF-8 sequences (e.g. %FF, lone %80) as literal %XX instead of replacing them with U+FFFD. Also adds test for unnamed-but-alphabetic chars in detect_scripts to close the codecov gap. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
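One way the approach in this commit message could look is sketched below. It is a reconstruction from the description, not the merged code: the function name `decode_url_preserving` is hypothetical, and the choice to re-emit invalid bytes as uppercase `%XX` is an assumption.

```python
import urllib.parse


def decode_url_preserving(s: str) -> str:
    """Percent-decode s, keeping invalid UTF-8 bytes as literal %XX."""
    raw = urllib.parse.unquote_to_bytes(s)
    # surrogateescape maps each undecodable byte 0xNN to the lone
    # surrogate U+DCNN instead of replacing it with U+FFFD.
    text = raw.decode("utf-8", errors="surrogateescape")
    parts = []
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Turn the escaped byte back into its percent-encoded form.
            parts.append(f"%{code - 0xDC00:02X}")
        else:
            parts.append(ch)
    return "".join(parts)


print(decode_url_preserving("%FF"))         # invalid byte stays as %FF
print(decode_url_preserving("caf%C3%A9"))   # valid UTF-8 decodes normally
```

Note that in this sketch a lowercase `%ff` in the input would come back as `%FF`, so the malformed sequence is preserved by value rather than byte-for-byte in its original casing.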

Nelson Spence (Fieldnote-Echo) and others added 2 commits March 1, 2026 12:37

docs: add opt-in utilities to README

27a069d

Nelson Spence (Fieldnote-Echo) merged commit 6cac6a5 into main Mar 1, 2026
13 checks passed

Nelson Spence (Fieldnote-Echo) deleted the feat/scripts-decode branch March 1, 2026 18:44

Nelson Spence (Fieldnote-Echo) merged 3 commits into main from feat/scripts-decode
