You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
docs: add Why This Matters and comparison sections to README
Cover use cases (LLM pipelines, web apps, config ingestion, log
analysis, anti-phishing) and comparison table vs Unidecode, ftfy,
confusable_homoglyphs, MarkupSafe, and pydantic.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Untrusted text contains invisible attacks: homoglyph substitution, zero-width characters, null bytes, fullwidth encoding, template/prompt injection delimiters. These bypass validation, poison templates, and fool humans.
24
24
25
25
navi-sanitize fixes the text before it reaches your application. It doesn't detect attacks — it removes them.
26
26
27
+
**LLM prompt pipelines** — User input flows into system prompts, RAG context, and tool calls. Invisible Unicode (tag block characters, bidi overrides) encodes instructions that tokenizers read but humans can't see. Homoglyphs bypass keyword filters. navi-sanitize strips these vectors before text reaches the model, and the pluggable escaper lets you add vendor-specific prompt escaping on top.
28
+
29
+
**Web applications** — Jinja2 SSTI, path traversal, and fullwidth encoding bypasses are well-known but tedious to cover manually. A single `clean(user_input, escaper=jinja2_escaper)` call handles homoglyph-disguised payloads like `{{ cоnfig }}` (Cyrillic `о`) that naive escaping misses.
30
+
31
+
**Config and data ingestion** — YAML, TOML, and JSON parsed from untrusted sources can carry null bytes that truncate C-extension processing, zero-width characters that break key matching, and homoglyphs that create near-duplicate keys. `walk(parsed_config)` sanitizes every string in a nested structure in one call.
32
+
33
+
**Log analysis and SIEM** — Attackers embed bidi overrides and zero-width characters in log entries to hide indicators of compromise from analysts and pattern-matching tools. Sanitizing log data on ingest ensures what you search is what's actually there.
34
+
35
+
**Identity and anti-phishing** — `pаypal.com` (Cyrillic `а`) renders identically to `paypal.com` in most fonts. Homoglyph replacement normalizes display names, URLs, and email addresses to catch spoofing that visual inspection misses.
36
+
37
+
## How It Compares
38
+
39
+
navi-sanitize is the only library that combines invisible character stripping, homoglyph replacement, NFKC normalization, and pluggable escaping in a single zero-dependency pipeline. Existing tools solve pieces of this problem:
|**Invisible chars**| Strips 411 (bidi, tag block, ZW, VS) | Incidental | No | Partial (preserves bidi, ZW, VS) | No |
45
+
|**Homoglyphs**| Replaces 51 curated pairs | Transliterates all non-ASCII | Detects only (no replace) | No | No |
46
+
|**NFKC**| Yes | No | No | NFC (NFKC optional) | No |
47
+
|**Null bytes**| Yes | No | No | No | No |
48
+
|**Preserves Unicode**| Yes (CJK, Arabic, emoji intact) | No (destroys all non-ASCII) | Yes | Yes | Yes |
49
+
|**Pluggable escaper**| Yes | No | No | No | N/A (HTML-specific) |
50
+
|**Dependencies**| Zero | Zero | Zero | wcwidth | C ext / Rust ext |
51
+
52
+
**Key differences:**
53
+
54
+
-**Unidecode / anyascii** transliterate *all* non-ASCII to Latin. They turn `"` into `"Zhong"` and Cyrillic sentences into gibberish. navi-sanitize normalizes only the 51 highest-risk lookalikes and leaves legitimate Unicode intact.
55
+
-**confusable_homoglyphs** uses the full Unicode Consortium confusables dataset (thousands of pairs) but only *detects* — you'd need to write your own replacement layer. It's also archived.
56
+
-**ftfy** is complementary, not competing. It fixes encoding corruption and explicitly *preserves* bidi overrides and zero-width characters that navi-sanitize strips. Different threat model.
57
+
-**MarkupSafe / nh3** handle HTML structure; navi-sanitize handles the character-level content *inside* that structure. They compose naturally.
58
+
-**pydantic / cerberus** are validation frameworks — call `navi_sanitize.clean()` inside a pydantic `AfterValidator` or cerberus coercion chain for validated, sanitized output.
59
+
27
60
## Pipeline
28
61
29
62
Every string passes through stages in order. Each stage returns clean output and a warning if it changed anything.
0 commit comments