Commit 77260ae

docs: add Why This Matters and comparison sections to README
Cover use cases (LLM pipelines, web apps, config ingestion, log analysis, anti-phishing) and comparison table vs Unidecode, ftfy, confusable_homoglyphs, MarkupSafe, and pydantic.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
1 parent 451e39f commit 77260ae

1 file changed

Lines changed: 34 additions & 1 deletion

README.md

@@ -18,12 +18,45 @@ clean("price:\u200b 0") # "price: 0" — zero-width space stripped
clean("file\x00.txt") # "file.txt" — null byte removed
```

-## Why
+## Why This Matters

Untrusted text contains invisible attacks: homoglyph substitution, zero-width characters, null bytes, fullwidth encoding, template/prompt injection delimiters. These bypass validation, poison templates, and fool humans.

navi-sanitize fixes the text before it reaches your application. It doesn't detect attacks — it removes them.

+**LLM prompt pipelines** — User input flows into system prompts, RAG context, and tool calls. Invisible Unicode (tag block characters, bidi overrides) encodes instructions that tokenizers read but humans can't see. Homoglyphs bypass keyword filters. navi-sanitize strips these vectors before text reaches the model, and the pluggable escaper lets you add vendor-specific prompt escaping on top.
+
+**Web applications** — Jinja2 SSTI, path traversal, and fullwidth encoding bypasses are well-known but tedious to cover manually. A single `clean(user_input, escaper=jinja2_escaper)` call handles homoglyph-disguised payloads like `{{ cоnfig }}` (Cyrillic `о`) that naive escaping misses.
+
+**Config and data ingestion** — YAML, TOML, and JSON parsed from untrusted sources can carry null bytes that truncate C-extension processing, zero-width characters that break key matching, and homoglyphs that create near-duplicate keys. `walk(parsed_config)` sanitizes every string in a nested structure in one call.
+
+**Log analysis and SIEM** — Attackers embed bidi overrides and zero-width characters in log entries to hide indicators of compromise from analysts and pattern-matching tools. Sanitizing log data on ingest ensures what you search is what's actually there.
+
+**Identity and anti-phishing** — `pаypal.com` (Cyrillic `а`) renders identically to `paypal.com` in most fonts. Homoglyph replacement normalizes display names, URLs, and email addresses to catch spoofing that visual inspection misses.
+
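A brief illustration of the homoglyph problem described in the use cases above. This sketch uses only the Python standard library and does not call navi-sanitize; `naive_filter` is a hypothetical keyword check invented for the example, not part of any library.

```python
import unicodedata

# Two strings that render alike in most fonts; the second hides a Cyrillic letter.
latin = "paypal.com"
spoofed = "p\u0430ypal.com"  # U+0430 CYRILLIC SMALL LETTER A

# A template payload with a Cyrillic "o" in place of the Latin one.
payload = "{{ c\u043enfig }}"  # U+043E CYRILLIC SMALL LETTER O


def naive_filter(text: str, keyword: str = "config") -> bool:
    """Hypothetical keyword filter: True if the keyword appears verbatim."""
    return keyword in text


print(latin == spoofed)              # False: different code points
print(naive_filter(payload))         # False: the filter misses the disguised keyword
print(unicodedata.name(spoofed[1]))  # CYRILLIC SMALL LETTER A
```

The comparison and the filter both operate on code points, so a lookalike passes unnoticed unless the text is normalized first.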
+## How It Compares
+
+navi-sanitize is the only library that combines invisible character stripping, homoglyph replacement, NFKC normalization, and pluggable escaping in a single zero-dependency pipeline. Existing tools solve pieces of this problem:
+
+| | navi-sanitize | Unidecode / anyascii | confusable_homoglyphs | ftfy | MarkupSafe / nh3 |
+|---|---|---|---|---|---|
+| **Purpose** | Security sanitization | ASCII transliteration | Homoglyph detection | Encoding repair | HTML escaping |
+| **Invisible chars** | Strips 411 (bidi, tag block, ZW, VS) | Incidental | No | Partial (preserves bidi, ZW, VS) | No |
+| **Homoglyphs** | Replaces 51 curated pairs | Transliterates all non-ASCII | Detects only (no replace) | No | No |
+| **NFKC** | Yes | No | No | NFC (NFKC optional) | No |
+| **Null bytes** | Yes | No | No | No | No |
+| **Preserves Unicode** | Yes (CJK, Arabic, emoji intact) | No (destroys all non-ASCII) | Yes | Yes | Yes |
+| **Pluggable escaper** | Yes | No | No | No | N/A (HTML-specific) |
+| **Dependencies** | Zero | Zero | Zero | wcwidth | C ext / Rust ext |
+
+**Key differences:**
+
+- **Unidecode / anyascii** transliterate *all* non-ASCII to Latin. They turn `"中"` into `"Zhong"` and Cyrillic sentences into gibberish. navi-sanitize normalizes only the 51 highest-risk lookalikes and leaves legitimate Unicode intact.
+- **confusable_homoglyphs** uses the full Unicode Consortium confusables dataset (thousands of pairs) but only *detects* — you'd need to write your own replacement layer. It's also archived.
+- **ftfy** is complementary, not competing. It fixes encoding corruption and explicitly *preserves* bidi overrides and zero-width characters that navi-sanitize strips. Different threat model.
+- **MarkupSafe / nh3** handle HTML structure; navi-sanitize handles the character-level content *inside* that structure. They compose naturally.
+- **pydantic / cerberus** are validation frameworks — call `navi_sanitize.clean()` inside a pydantic `AfterValidator` or cerberus coercion chain for validated, sanitized output.
+
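The table lists NFKC normalization and homoglyph replacement as separate rows. The sketch below shows one reason they are distinct, using only the standard library. The two-entry `HOMOGLYPHS` map is a hypothetical stand-in for illustration; it is not the library's curated list of 51 pairs.

```python
import unicodedata

fullwidth = "\uff43\uff4f\uff4e\uff46\uff49\uff47"  # "config" in fullwidth Latin letters
cyrillic = "c\u043enfig"                            # Cyrillic "o" in second position

# NFKC folds compatibility forms such as fullwidth letters to ASCII...
nfkc_fullwidth = unicodedata.normalize("NFKC", fullwidth)
# ...but leaves a Cyrillic letter alone: it is a distinct letter, not a compatibility variant.
nfkc_cyrillic = unicodedata.normalize("NFKC", cyrillic)

# Hypothetical two-entry lookalike map, applied after normalization.
HOMOGLYPHS = str.maketrans({"\u043e": "o", "\u0430": "a"})
replaced = nfkc_cyrillic.translate(HOMOGLYPHS)

print(nfkc_fullwidth == "config")  # True
print(nfkc_cyrillic == "config")   # False
print(replaced == "config")        # True
```

A targeted map like this changes only the listed lookalikes, which is how other non-ASCII text can pass through untouched.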
## Pipeline

Every string passes through stages in order. Each stage returns clean output and a warning if it changed anything.
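
The staged design described above could be sketched roughly as follows. This is not navi-sanitize's implementation: the stage names, their order, the warning strings, the small `INVISIBLE` set, and the `clean_sketch` function are all assumptions made for illustration, and only three stages are shown.

```python
import unicodedata
from typing import Callable, List, Optional, Tuple

# Assumed stage contract: take text, return (new_text, warning_or_None).
Stage = Callable[[str], Tuple[str, Optional[str]]]

# Illustrative subset only: a few zero-width and bidi control characters.
INVISIBLE = {
    "\u200b", "\u200c", "\u200d", "\ufeff",
    "\u202a", "\u202b", "\u202c", "\u202d", "\u202e",
}


def strip_null_bytes(text: str) -> Tuple[str, Optional[str]]:
    out = text.replace("\x00", "")
    return out, ("null bytes removed" if out != text else None)


def strip_invisible(text: str) -> Tuple[str, Optional[str]]:
    out = "".join(ch for ch in text if ch not in INVISIBLE)
    return out, ("invisible characters removed" if out != text else None)


def normalize_nfkc(text: str) -> Tuple[str, Optional[str]]:
    out = unicodedata.normalize("NFKC", text)
    return out, ("NFKC normalization applied" if out != text else None)


# Assumed order; further stages (homoglyphs, escaper) are omitted here.
STAGES: List[Stage] = [strip_null_bytes, strip_invisible, normalize_nfkc]


def clean_sketch(text: str) -> Tuple[str, List[str]]:
    """Run every stage in order, collecting a warning from each stage that changed the text."""
    warnings: List[str] = []
    for stage in STAGES:
        text, warning = stage(text)
        if warning is not None:
            warnings.append(warning)
    return text, warnings


print(clean_sketch("price:\u200b 0"))  # ('price: 0', ['invisible characters removed'])
print(clean_sketch("file\x00.txt"))    # ('file.txt', ['null bytes removed'])
```

Keeping each stage as a small function with the same signature means the caller can see exactly which stage altered the input, which is the property the paragraph above describes.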
