fix(relevance,dedupe): CJK-aware tokenization for Chinese sources #547
Relevance scoring and near-duplicate detection tokenize by splitting on
whitespace (`str.split()`). Chinese text has no spaces between words, so a
whole Chinese sentence collapses into a single token: token-overlap relevance
and Jaccard dedup effectively stop working for Chinese-language content. This
already affects the existing Xiaohongshu source and any future CJK source.
Add lib/cjk.py with `segment(text)`, which splits text into CJK and non-CJK
runs:
- Non-CJK (ASCII/Latin) runs keep the original `\w+` word behavior — English
is unchanged.
- CJK runs are segmented by jieba when it is installed, falling back to
character bigrams when it is not. Bigrams are dictionary-free and still give
robust overlap signal (e.g. query "大模型" -> {大模, 模型} overlaps text
"国产大模型评测").
jieba stays OPTIONAL — present: used; absent: bigram fallback. It is never
added to the hard dependency set, so the skill keeps its zero-dependency,
install-anywhere property (pyproject `dependencies = []`).
Wired into relevance.tokenize() and dedupe (_tokenize / token_jaccard); both
also union a small Chinese stopword set into their existing English stopwords.
Tests: tests/test_cjk.py covers segmentation, Chinese relevance match/no-match,
English non-regression, and Chinese near-duplicate detection.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
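The run-splitting and bigram fallback described in the commit message above can be sketched roughly as follows. This is a minimal illustration, not the contents of lib/cjk.py: the character range (CJK Unified Ideographs only), the regex names, and the `bigrams` helper are assumptions, and the optional jieba path is left out so the sketch runs with no dependencies.

```python
import re
from typing import List

# Assumption: only the CJK Unified Ideographs block counts as CJK here;
# the real module may cover more ranges.
_RUN_RE = re.compile(r"[\u4e00-\u9fff]+|[^\u4e00-\u9fff]+")
_CJK_RE = re.compile(r"[\u4e00-\u9fff]")
_WORD_RE = re.compile(r"\w+")


def bigrams(run: str) -> List[str]:
    """Overlapping character bigrams; a single character is returned as-is."""
    if len(run) <= 1:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


def segment(text: str) -> List[str]:
    """Split text into CJK and non-CJK runs and tokenize each run."""
    tokens: List[str] = []
    for run in _RUN_RE.findall(text):
        if _CJK_RE.match(run):
            tokens.extend(bigrams(run))           # dictionary-free fallback
        else:
            tokens.extend(_WORD_RE.findall(run))  # unchanged \w+ behavior
    return tokens


query_tokens = set(segment("大模型"))
text_tokens = set(segment("国产大模型评测"))
shared = query_tokens & text_tokens  # both query bigrams appear in the text
```

Because every bigram of the query also occurs in the longer text, token overlap is non-empty even without a dictionary, which is the property the fallback relies on.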
Greptile Summary
This PR fixes a long-standing bug where Chinese text from Xiaohongshu/Bilibili was tokenized as a single whitespace-split token, rendering token-overlap relevance scoring and Jaccard deduplication effectively non-functional for CJK sources.
Confidence Score: 5/5
Safe to merge: the English tokenization path is functionally unchanged, and the new CJK paths are well-isolated behind has_cjk checks with a zero-dependency bigram fallback.
The implementation is careful: thread safety is handled by eagerly importing jieba at module load time (commented and tested), the bigram fallback preserves the zero-dependency contract, and the has_cjk gate prevents any behavioral change on English text. The one note is a test assertion that is trivially true and does not exercise the guard it describes, but this is a test quality gap rather than a defect in the production code.
No files require special attention. tests/test_cjk.py has one weak assertion worth strengthening, but the production code in all four files is correct.
except Exception:
    _jieba = None
The `except Exception` clause silently swallows all errors: not only the expected ImportError for a missing package, but also real errors such as a corrupted jieba install or missing data files. Narrowing to ImportError is the conventional and safer choice here.
Suggested change:
- except Exception:
-     _jieba = None
+ except ImportError:
+     _jieba = None
def _cjk_tokens(run: str) -> List[str]:
    jieba = _get_jieba()
    if jieba is not None:
        return [w for w in jieba.cut(run) if w.strip() and _CJK_RE.search(w)]
    # Dictionary-free fallback: character bigrams (single char if run length 1).
    if len(run) <= 1:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]
Stopword-only bigrams inflate similarity for very short CJK strings
When jieba is absent, adjacent stopword characters produce bigrams that aren't in CHINESE_STOPWORDS. For example, 我们的 → {我们, 们的}, and both bigrams survive the len > 1 filter and the stopword check (only single-char stopwords like 的 are in the set). Two nearly-empty Chinese strings built entirely from function words would get artificially high token Jaccard similarity. This is mainly an edge case for very short texts (1–4 chars), but worth noting if the skill encounters user tags or very short titles from Xiaohongshu/Bilibili.
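The inflation described in this comment can be reproduced with a small sketch. The stopword set, the `tokens` helper, and the `jaccard` function below are hypothetical stand-ins for the project's CHINESE_STOPWORDS and token_jaccard, chosen only to show the mechanism on the bigram path.

```python
from typing import List, Set

# Hypothetical single-character stopword set (not the project's actual list).
STOPWORDS = {"的", "我", "们", "是"}


def bigrams(run: str) -> List[str]:
    """Overlapping character bigrams; a single character is returned as-is."""
    if len(run) <= 1:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


def tokens(text: str) -> Set[str]:
    """Bigram tokens with stopwords removed (bigrams never match 1-char stopwords)."""
    return {t for t in bigrams(text) if t not in STOPWORDS}


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Set Jaccard similarity; two empty sets score 0.0."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# Both strings consist only of function words, yet their bigrams survive filtering.
left = tokens("我们的")     # {"我们", "们的"}
right = tokens("是我们的")  # {"是我", "我们", "们的"}
score = jaccard(left, right)  # 2 shared / 3 total
```

Every character in both strings is a stopword, but the strings still score 2/3 because the filter only sees two-character tokens. Filtering bigrams whose characters are all stopwords would be one way to tighten this, at the cost of dropping some real two-character words.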
…rministic tests
- Resolve jieba once at module import instead of a lazy initializer with mutable globals. The pipeline scores relevance inside a ThreadPoolExecutor, so the lazy path had a benign-but-real init race; binding at import removes it. Kept the broad `except` deliberately (any jieba load failure must fall back to bigrams, never crash) and documented why.
- Phrase bonus now fires for multi-token Chinese queries. `_normalize_phrase` joins tokens with spaces, so a query like "国产大模型 测评" never matched the continuous source text verbatim and the 0.12–0.16 bonus was dead for Chinese. Retry the containment check with spaces removed, gated on has_cjk so English stays space-sensitive (no spurious "reacthooks" matches).
- Make the bigram-path tests force cjk._jieba = None so they're deterministic regardless of whether jieba is installed in CI (previously they silently assumed jieba's absence). Added phrase-bonus and import-binding tests.
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
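The space-insensitive retry for the phrase bonus might look roughly like the sketch below. `phrase_matches` is a hypothetical name, the `has_cjk` implementation is an assumption (CJK Unified Ideographs only), and the whitespace normalization stands in for whatever `_normalize_phrase` actually does; the example text is illustrative rather than taken from the tests.

```python
import re

# Assumption: only the CJK Unified Ideographs block is treated as CJK.
_CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def has_cjk(text: str) -> bool:
    """True if the text contains at least one CJK ideograph."""
    return bool(_CJK_RE.search(text))


def phrase_matches(query: str, text: str) -> bool:
    """Verbatim containment, retried without spaces only for CJK queries."""
    phrase = " ".join(query.split())     # collapse whitespace runs
    haystack = " ".join(text.split())
    if phrase in haystack:
        return True
    if has_cjk(phrase):
        # Chinese source text is continuous, so ignore the query's spaces.
        return phrase.replace(" ", "") in haystack.replace(" ", "")
    return False                         # English stays space-sensitive


cjk_hit = phrase_matches("国产大模型 测评", "国产大模型测评汇总")
english_hit = phrase_matches("react hooks", "guide to react hooks")
spurious = phrase_matches("react hooks", "reacthooks tutorial")
```

Gating the retry on `has_cjk` keeps the English behavior identical to a plain containment check, so "react hooks" still does not match "reacthooks".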
Thanks @greptile-apps, went through all five. Three fixed in e0a4e90, two I'm intentionally leaving (with reasoning):
1. jieba lazy-init thread-safety: fixed. Replaced the lazy
2. Phrase bonus dead for Chinese: fixed (with a small correction). It wasn't dead for every Chinese query: a single contiguous query (
3. Tests assumed jieba absent: fixed. Good catch. The bigram-path tests now force
4.
5. Stopword-only short-string bigram inflation: acknowledged as a known limitation, not fixing here. Correct that
(Also noted #5's sub-point about
All five responses are solid; the reasoning holds up across the board. The three fixes address the most impactful issues: thread safety was real (pipeline
On the two you're holding:
Bigram stopword inflation: the query-driven matching argument is sound. Short function-word-only strings causing false positives would require them to appear in the query, which is unlikely in practice. The PR description's Limitations section now documents it clearly. Good call deferring until there's evidence from real Xiaohongshu/Bilibili data.
LGTM. The core fix (CJK segmentation at all) and the follow-on correctness issues are all addressed or appropriately deferred with documented rationale.
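The query-driven argument accepted above can be illustrated with a toy coverage measure. `query_coverage` is a hypothetical stand-in, not the project's relevance formula, so its values will not reproduce the scores reported in the PR description; it only shows that noise bigrams on the text side do not create matches.

```python
from typing import List, Set


def bigrams(run: str) -> List[str]:
    """Overlapping character bigrams; a single character is returned as-is."""
    if len(run) <= 1:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


def bigram_tokens(text: str) -> Set[str]:
    """Bigram each whitespace-separated chunk (assumes pure-Chinese input)."""
    out: Set[str] = set()
    for chunk in text.split():
        out.update(bigrams(chunk))
    return out


def query_coverage(query: str, text: str) -> float:
    """Fraction of query tokens that also occur in the text (toy measure)."""
    q = bigram_tokens(query)
    if not q:
        return 0.0
    return len(q & bigram_tokens(text)) / len(q)


text_side = bigram_tokens("这个产品的真实测评")  # includes noise such as "品的"
match = query_coverage("产品 测评", "这个产品的真实测评")
unrelated = query_coverage("产品 测评", "今天天气很好")
```

The stopword-bearing bigrams "品的" and "的真" appear among the text tokens, but since neither is a query token they never contribute to the overlap; they would only matter for a symmetric measure such as Jaccard.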
Summary
Relevance scoring and near-duplicate detection tokenize by splitting on whitespace (`str.split()`). Chinese has no spaces between words, so a whole Chinese sentence collapses into a single token: token-overlap relevance and Jaccard dedup effectively stop working for Chinese-language content. This already affects the existing Xiaohongshu source (and any future CJK source).
Changes
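- Add `lib/cjk.py` with `segment(text)`, which splits text into CJK and non-CJK runs:
  - Non-CJK (ASCII/Latin) runs keep the original `\w+` word behavior; English is unchanged.
  - CJK runs are segmented by jieba when it is installed, falling back to character bigrams when it is not. Bigrams are dictionary-free and still give robust overlap signal (e.g. query `大模型` → `{大模, 模型}` overlaps text `国产大模型评测`).
- jieba stays optional: used when present, bigram fallback when absent. It is never added to the hard dependency set, so the skill keeps its zero-dependency, install-anywhere property (`pyproject dependencies = []`).
- Wired into `relevance.tokenize()` and `dedupe` (`_tokenize` / `token_jaccard`); both also union a small Chinese stopword set into their existing English stopwords.
Testing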
- `uv run pytest tests/test_cjk.py tests/test_relevance.py tests/test_dedupe_v3.py -q`: all green
- `tests/test_cjk.py`: segmentation, Chinese relevance match/no-match, English non-regression, Chinese near-duplicate detection.
Quick before/after (default build, no jieba installed → bigram path):
- `国产大模型 测评` vs `国产大模型最新测评`
- `国产大模型 测评` vs `今天天气很好`
- `react hooks` vs `guide to react hooks` (English)
Limitations (stopword behavior in the bigram fallback)
The Chinese stopword set is fully effective only on the jieba path, which segments real words. On the zero-dependency bigram fallback it is partial, by construction:
- 品的,的真) survive as mild noise tokens.
This does not affect correctness: matching is driven by the query's tokens, so stopword-bearing noise bigrams only appear on the text side and slightly dilute precision; they don't cause false matches. Verified: `产品 测评` vs `这个产品的真实测评` → 0.87 (correct match), unrelated text → 0.0. The core win is that Chinese is segmented at all (whole-sentence-as-one-token was the actual bug); the stopword union is a secondary refinement that mainly pays off when jieba is installed.
Related
Complements #514 (Bilibili adapter) and the existing Xiaohongshu source — both produce Chinese items that currently can't be deduped or ranked correctly. This is an independent, source-agnostic fix to the tokenization layer.
Related Issues
None filed.