feat: add decode_evasion, detect_scripts, is_mixed_script#2

Merged: Nelson Spence (Fieldnote-Echo) merged 3 commits into main from feat/scripts-decode on Mar 1, 2026

Conversation

@Fieldnote-Echo

Summary

  • decode_evasion(text, *, max_layers=3) — Standalone pre-processor that iteratively peels URL encoding, HTML entities, and hex escape layers from untrusted text. Stops when a full pass produces no changes or max_layers is reached. Never logs decoded content.
  • detect_scripts(text) — Returns the set of script buckets (latin, cyrillic, greek, arabic, hebrew, armenian, cherokee, cjk) present in text. Unknown scripts silently ignored.
  • is_mixed_script(text) — Returns True when 2+ scripts are detected. Pure analysis — no transformation, no blocking.

All three are opt-in primitives — the core clean() pipeline is unchanged. Zero new dependencies (stdlib only: html, urllib.parse, unicodedata).
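
The iterative peeling described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the name `decode_evasion_sketch` and the `\xNN` hex-escape pattern are assumptions, since the description does not show which hex form the real decoder handles. One pass runs URL, HTML entity, and hex decoding in sequence and counts as a single layer.

```python
import html
import re
import urllib.parse

# Assumed hex-escape form (\xNN); the PR text does not show the real pattern.
_HEX_ESCAPE = re.compile(r"\\x([0-9A-Fa-f]{2})")


def decode_evasion_sketch(text: str, *, max_layers: int = 3) -> str:
    """Peel URL, HTML-entity, and hex-escape layers until the text is stable."""
    if max_layers <= 0:
        # Fast path: zero or negative values return the input untouched.
        return text
    for _ in range(max_layers):
        previous = text
        # One layer = URL -> HTML -> hex applied in sequence.
        text = urllib.parse.unquote(text)
        text = html.unescape(text)
        text = _HEX_ESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)
        if text == previous:
            # A full pass changed nothing, so there is nothing left to peel.
            break
    return text
```

With the default `max_layers=3`, a double URL-encoded `%253Cscript%253E` comes back as `<script>`, while `max_layers=1` stops at `%3Cscript%3E`. The sketch uses plain `urllib.parse.unquote`, which rewrites malformed sequences; that behaviour is the subject of the review comment later in this thread.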

Design decisions

  • No base64 in v1 — too high false-positive risk; URL + HTML + hex covers 90% of bypass vectors
  • Single cjk bucket — Chinese, Japanese (Hiragana/Katakana), and Korean all map to one bucket
  • Layer counting per pass — URL→HTML→hex in sequence = one layer, not one per decoder
  • max_layers <= 0 fast path — early return, no surprises with negative values
  • Script lookup via name.split(" ", 1)[0] — dict lookup on first token of unicodedata.name(), not prefix loop
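
To illustrate the last two design points, here is a hedged sketch of a first-token lookup. The bucket table and the `_sketch` function names are assumptions made for this example; the real mapping in `navi_sanitize` is not shown in the PR and may differ.

```python
import unicodedata

# Assumed table: first token of unicodedata.name() -> script bucket.
# Hiragana, Katakana, and Hangul share the single "cjk" bucket.
_SCRIPT_BUCKETS = {
    "LATIN": "latin",
    "CYRILLIC": "cyrillic",
    "GREEK": "greek",
    "ARABIC": "arabic",
    "HEBREW": "hebrew",
    "ARMENIAN": "armenian",
    "CHEROKEE": "cherokee",
    "CJK": "cjk",
    "HIRAGANA": "cjk",
    "KATAKANA": "cjk",
    "HANGUL": "cjk",
}


def detect_scripts_sketch(text: str) -> set[str]:
    """Return the script buckets present in text; unknown scripts are skipped."""
    found: set[str] = set()
    for ch in text:
        if not ch.isalpha():
            continue  # digits, punctuation, and whitespace carry no script
        # Unnamed characters yield "", which is simply absent from the table.
        first_token = unicodedata.name(ch, "").split(" ", 1)[0]
        bucket = _SCRIPT_BUCKETS.get(first_token)
        if bucket is not None:
            found.add(bucket)
    return found


def is_mixed_script_sketch(text: str) -> bool:
    """True when two or more script buckets are present."""
    return len(detect_scripts_sketch(text)) >= 2
```

A single dict lookup per character avoids looping over every known prefix, which is the trade-off the bullet describes. For example, `is_mixed_script_sketch("p\u0430ypal.com")` is `True` because U+0430 is Cyrillic.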

Test coverage

  • 62 new tests (34 script detection, 28 decode evasion) — 303 total, all passing
  • Phishing scenarios: pаypal.com, gооgle.com, аpple.com (Cyrillic lookalikes)
  • Multi-layer nesting: double/triple/quadruple URL encoding, URL+HTML nesting
  • Edge cases: invalid encodings, partial hex, negative max_layers, empty strings
  • Integration: decode_evasion → clean → detect_scripts full pipeline composition
  • Log safety: decoded content never appears in warning messages
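
The log-safety property in the last bullet can be checked along these lines. This is a self-contained sketch only: the logger name, the warning text, and the helper names are hypothetical, and the PR does not show when or what the real function logs. The idea is that a warning carries metadata such as a layer count, never the decoded payload.

```python
import logging
import urllib.parse

logger = logging.getLogger("decode_sketch")  # hypothetical logger name


def decode_with_safe_warning(text: str, *, max_layers: int = 3) -> str:
    """URL-decode up to max_layers times, logging only the layer count."""
    layers = 0
    while layers < max_layers:
        decoded = urllib.parse.unquote(text)
        if decoded == text:
            break
        text = decoded
        layers += 1
    if layers:
        # Metadata only: the decoded (attacker-controlled) text is never logged.
        logger.warning("peeled %d encoding layer(s)", layers)
    return text


class _ListHandler(logging.Handler):
    """Collects formatted log messages so a test can inspect them."""

    def __init__(self) -> None:
        super().__init__()
        self.messages: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.messages.append(record.getMessage())


def captured_warnings(text: str) -> tuple[str, list[str]]:
    """Run the decoder and return its result plus the messages it logged."""
    handler = _ListHandler()
    logger.addHandler(handler)
    try:
        result = decode_with_safe_warning(text)
    finally:
        logger.removeHandler(handler)
    return result, handler.messages
```

A test would then assert that no fragment of the decoded result appears in any captured message.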

Test plan

  • uv run pytest tests/ -v --benchmark-disable — 303 passed
  • uv run mypy --strict src/navi_sanitize/ — no issues
  • uv run ruff check src/ tests/ — all checks passed
  • uv run ruff format --check src/ tests/ — all formatted
  • Pre-commit hooks pass (trailing whitespace, ruff, mypy)

🤖 Generated with Claude Code

Opt-in primitives for multi-encoding evasion decode and mixed-script
detection. Core clean() pipeline is unchanged.

- decode_evasion() iteratively peels URL, HTML entity, and hex escape
  layers (up to max_layers, default 3). Never logs decoded content.
- detect_scripts() returns script buckets (latin, cyrillic, greek, etc.)
  present in text. Unknown scripts silently ignored.
- is_mixed_script() returns True when 2+ scripts are detected.
- 62 new tests (34 scripts, 28 decode) — 303 total, all passing.
- Zero new dependencies (stdlib only: html, urllib.parse, unicodedata).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 518537aceb

Comment thread src/navi_sanitize/_decode.py Outdated

def _decode_url(s: str) -> str:
    """Decode URL percent-encoding."""
    return urllib.parse.unquote(s)

P2: Preserve malformed percent-encodings in URL decode

decode_evasion() documents that invalid encodings should pass through unchanged, but _decode_url() calls urllib.parse.unquote() with its defaults (encoding='utf-8', errors='replace'), so malformed byte sequences are mutated (for example, %FF becomes U+FFFD, the replacement character). This means attacker-controlled malformed inputs are silently rewritten instead of preserved, which can corrupt values and change downstream sanitization/audit behavior in flows that rely on exact pre-clean text.
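
The behaviour the reviewer describes can be reproduced with the standard library alone. The snippet below only demonstrates `urllib.parse.unquote` defaults; the variable names are illustrative.

```python
import urllib.parse

# %FF is not a valid UTF-8 byte sequence. With the default errors="replace",
# unquote() substitutes U+FFFD instead of leaving the input untouched.
malformed = "%FF"
decoded = urllib.parse.unquote(malformed)
assert decoded == "\ufffd"
assert decoded != malformed

# A well-formed sequence decodes as expected (U+00E9, "e" with acute accent).
well_formed = urllib.parse.unquote("%C3%A9")
assert well_formed == "\u00e9"
```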

Document decode_evasion(), detect_scripts(), and is_mixed_script()
with usage examples and composition pattern.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Use unquote_to_bytes() with surrogateescape to keep invalid UTF-8
sequences (e.g. %FF, lone %80) as literal %XX instead of replacing
them with U+FFFD. Also adds test for unnamed-but-alphabetic chars
in detect_scripts to close the codecov gap.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
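
One way the approach named in this commit message could look is sketched below. The fixed `_decode_url` is not shown in this thread, so the function name `decode_url_preserving` and the re-encoding step are assumptions rather than the merged code.

```python
import urllib.parse


def decode_url_preserving(s: str) -> str:
    """Percent-decode s, but keep bytes that are not valid UTF-8 as %XX."""
    raw = urllib.parse.unquote_to_bytes(s)
    # surrogateescape maps each undecodable byte 0xNN to the lone
    # surrogate U+DCNN instead of replacing it with U+FFFD.
    text = raw.decode("utf-8", errors="surrogateescape")
    # Turn those marker surrogates back into literal percent escapes.
    return "".join(
        f"%{ord(ch) - 0xDC00:02X}" if 0xDC80 <= ord(ch) <= 0xDCFF else ch
        for ch in text
    )
```

With this sketch, `%FF` and a lone `%80` survive as `%FF` and `%80`, while valid sequences such as `%C3%A9` still decode. One caveat of this particular sketch is that it normalises preserved escapes to uppercase hex, so `%ff` would come back as `%FF`.
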
@Fieldnote-Echo Nelson Spence (Fieldnote-Echo) merged commit 6cac6a5 into main Mar 1, 2026
13 checks passed
@Fieldnote-Echo Nelson Spence (Fieldnote-Echo) deleted the feat/scripts-decode branch March 1, 2026 18:44