fix(relevance,dedupe): CJK-aware tokenization for Chinese sources #547

Merged
tmchow merged 2 commits into mvanhorn:main from An-idd:fix/cjk-tokenization on Jun 16, 2026

Conversation

@An-idd An-idd commented Jun 11, 2026

Contributor

Summary

Relevance scoring and near-duplicate detection tokenize by splitting on whitespace (str.split()). Chinese has no spaces between words, so a whole Chinese sentence collapses into a single token — token-overlap relevance and Jaccard dedup effectively stop working for Chinese-language content. This already affects the existing Xiaohongshu source (and any future CJK source).
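
To make the failure mode concrete, here is a minimal sketch of whitespace tokenization feeding a Jaccard score. The helper names `whitespace_tokens` and `jaccard` are illustrative and not taken from the repository, and the stopword and length filters that the real `relevance.py` / `dedupe.py` apply are left out.

```python
from typing import Set


def whitespace_tokens(text: str) -> Set[str]:
    """Tokenize with plain str.split(), as the pre-fix code is described."""
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# English: spaces separate words, so the two token sets overlap.
english = jaccard(
    whitespace_tokens("guide to react hooks"),
    whitespace_tokens("react hooks guide"),
)

# Chinese: no spaces, so each sentence becomes exactly one token and two
# near-identical sentences share nothing.
chinese_a = whitespace_tokens("国产大模型最新测评")
chinese_b = whitespace_tokens("国产大模型最新评测")
chinese = jaccard(chinese_a, chinese_b)

print(english, len(chinese_a), chinese)
```

The two Chinese strings differ only in the order of their last two characters, yet they score as completely unrelated because each one is a single opaque token.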

Changes

  • Add lib/cjk.py with segment(text), which splits text into CJK and non-CJK runs:
    • Non-CJK (ASCII/Latin) runs keep the original \w+ word behavior — English is unchanged.
    • CJK runs are segmented by jieba when installed, falling back to character bigrams when it is not. Bigrams are dictionary-free and still give robust overlap signal (e.g. query 大模型 → {大模, 模型} overlaps text 国产大模型评测).
  • jieba stays optional — present: used; absent: bigram fallback. It is never added to the hard dependency set, so the skill keeps its zero-dependency, install-anywhere property (pyproject dependencies = []).
  • Wire into relevance.tokenize() and dedupe (_tokenize / token_jaccard); both also union a small Chinese stopword set into their existing English stopwords.
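
The run-splitting described above can be sketched roughly as follows. This is not the code from lib/cjk.py: it covers only the bigram fallback (no jieba path), and it assumes the CJK Unified Ideographs block `\u4e00-\u9fff` is a good enough stand-in for "CJK". The regex names mirror those shown in the review flowchart; everything else is illustrative.

```python
import re
from typing import List

# Assumption: one Unicode block stands in for "CJK"; the real module may
# cover more ranges (extensions, kana, hangul, ...).
_CJK_RUN_RE = re.compile(r"[\u4e00-\u9fff]+")
_LATIN_RE = re.compile(r"\w+")


def bigrams(run: str) -> List[str]:
    """Overlapping character bigrams; a 1-char run yields itself."""
    if len(run) <= 1:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


def segment(text: str) -> List[str]:
    """Split text into CJK and non-CJK runs and tokenize each run."""
    tokens: List[str] = []
    pos = 0
    for match in _CJK_RUN_RE.finditer(text):
        # Non-CJK slice before this CJK run keeps the \w+ word behavior.
        tokens.extend(_LATIN_RE.findall(text[pos:match.start()]))
        tokens.extend(bigrams(match.group()))
        pos = match.end()
    # Trailing non-CJK slice (this is the whole text when there is no CJK).
    tokens.extend(_LATIN_RE.findall(text[pos:]))
    return tokens


print(segment("大模型"))
print(segment("React 大模型 hooks"))
```

Because the bigrams of the query 大模型 are a subset of the bigrams of 国产大模型评测, a token-overlap score has something to match on without any dictionary.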

Testing

  • uv run pytest tests/test_cjk.py tests/test_relevance.py tests/test_dedupe_v3.py -q — all green
  • Full suite passes (1627 passed, 4 skipped)
  • New tests/test_cjk.py: segmentation, Chinese relevance match/no-match, English non-regression, Chinese near-duplicate detection.

Quick before/after (default build, no jieba installed → bigram path):

| query vs text | before | after |
| --- | --- | --- |
| 国产大模型 测评 vs 国产大模型最新测评 | ~0.0 (one giant token) | 0.93 |
| 国产大模型 测评 vs 今天天气很好 | | 0.0 (correctly rejected) |
| react hooks vs guide to react hooks (English) | 1.0 | 1.0 (unchanged) |

Limitations (stopword behavior in the bigram fallback)

The Chinese stopword set is fully effective only on the jieba path, which segments real words. On the zero-dependency bigram fallback it is partial, by construction:

  • Single-character stopwords (的 / 了 / 是 / 在 …) are never emitted as tokens (every CJK token is a 2-char bigram), and bigrams that contain a stopword char (e.g. 品的, 的真) survive as mild noise tokens.
  • Two-character stopwords (这个 / 什么 / 可以 …) are filtered only when they happen to align to a bigram boundary.

This does not affect correctness: matching is driven by the query's tokens, so stopword-bearing noise bigrams only appear on the text side and slightly dilute precision — they don't cause false matches. Verified: 产品 测评 vs 这个产品的真实测评 → 0.87 (correct match), unrelated text → 0.0. The core win is that Chinese is segmented at all (whole-sentence-as-one-token was the actual bug); the stopword union is a secondary refinement that mainly pays off when jieba is installed.
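
A small sketch of the bigram-path stopword behavior described above. The stopword set here is a hypothetical subset built from the examples quoted in this section, and `query_coverage` is an illustrative measure (the fraction of query tokens found in the text), not the project's relevance score. The helpers assume CJK-only input.

```python
from typing import List, Set

# Assumption: a tiny subset taken from the stopwords quoted above; the real
# CHINESE_STOPWORDS set may differ.
CHINESE_STOPWORDS: Set[str] = {"的", "了", "是", "在", "这个", "什么", "可以"}


def bigrams(run: str) -> List[str]:
    if len(run) <= 1:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


def content_bigrams(run: str) -> Set[str]:
    """Bigrams of a CJK run with exact stopword matches removed."""
    return {t for t in bigrams(run) if t not in CHINESE_STOPWORDS}


def query_coverage(query: str, text: str) -> float:
    """Fraction of the query's tokens that also occur in the text."""
    query_tokens: Set[str] = set()
    for part in query.split():
        query_tokens |= content_bigrams(part)
    if not query_tokens:
        return 0.0
    return len(query_tokens & content_bigrams(text)) / len(query_tokens)


text_tokens = content_bigrams("这个产品的真实测评")
print(sorted(text_tokens))
print(query_coverage("产品 测评", "这个产品的真实测评"))
print(query_coverage("产品 测评", "今天天气很好"))
```

In this sketch the bigram 这个 is dropped because it matches a stopword exactly, while 个产, 品的 and 的真 remain as noise on the text side. Both query tokens are still found in the related text and neither is found in the unrelated one.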

Related

Complements #514 (Bilibili adapter) and the existing Xiaohongshu source — both produce Chinese items that currently can't be deduped or ranked correctly. This is an independent, source-agnostic fix to the tokenization layer.

Related Issues

None filed.

Relevance scoring and near-duplicate detection tokenize by splitting on
whitespace (`str.split()`). Chinese text has no spaces between words, so a
whole Chinese sentence collapses into a single token: token-overlap relevance
and Jaccard dedup effectively stop working for Chinese-language content. This
already affects the existing Xiaohongshu source and any future CJK source.

Add lib/cjk.py with `segment(text)`, which splits text into CJK and non-CJK
runs:
- Non-CJK (ASCII/Latin) runs keep the original `\w+` word behavior — English
  is unchanged.
- CJK runs are segmented by jieba when it is installed, falling back to
  character bigrams when it is not. Bigrams are dictionary-free and still give
  robust overlap signal (e.g. query "大模型" -> {大模, 模型} overlaps text
  "国产大模型评测").

jieba stays OPTIONAL — present: used; absent: bigram fallback. It is never
added to the hard dependency set, so the skill keeps its zero-dependency,
install-anywhere property (pyproject `dependencies = []`).

Wired into relevance.tokenize() and dedupe (_tokenize / token_jaccard); both
also union a small Chinese stopword set into their existing English stopwords.

Tests: tests/test_cjk.py covers segmentation, Chinese relevance match/no-match,
English non-regression, and Chinese near-duplicate detection.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

greptile-apps Bot commented Jun 11, 2026

Contributor

Greptile Summary

This PR fixes a long-standing bug where Chinese text from Xiaohongshu/Bilibili was tokenized as a single whitespace-split token, rendering token-overlap relevance scoring and Jaccard deduplication effectively non-functional for CJK sources.

  • lib/cjk.py (new): Adds CJK-aware segment() that routes CJK runs through jieba (optional) or a bigram fallback; jieba is eagerly imported at module load to avoid per-call threading races in the ThreadPoolExecutor pipeline. English paths are unchanged.
  • relevance.py / dedupe.py: Minimal wiring — tokenize(), token_jaccard(), and _tokenize() now call cjk.segment() instead of .split(), and both STOPWORDS sets are unioned with CHINESE_STOPWORDS.
  • CJK phrase bonus: A new has_cjk-gated containment retry in token_overlap_relevance strips inter-word spaces before the substring check so the 0.12–0.16 phrase bonus is no longer permanently dead for Chinese queries.
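
The containment retry in the last bullet might look roughly like this. It is a sketch under assumptions: the function name `phrase_contained`, the single-block CJK regex, and the choice to strip spaces from both the query phrase and the text are illustrative, and the actual bonus arithmetic in `token_overlap_relevance` is not reproduced.

```python
import re

# Assumption: one Unicode block stands in for "CJK".
_CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def has_cjk(text: str) -> bool:
    return bool(_CJK_RE.search(text))


def phrase_contained(query_phrase: str, text: str) -> bool:
    """True when the normalized query phrase occurs in the text.

    The first check is space-sensitive. Only when the query contains CJK
    is the check retried with spaces removed, so English phrases never
    match across a missing space.
    """
    if not query_phrase:
        return False
    if query_phrase in text:
        return True
    if has_cjk(query_phrase):
        return query_phrase.replace(" ", "") in text.replace(" ", "")
    return False


print(phrase_contained("国产大模型 测评", "国产大模型测评来了"))
print(phrase_contained("react hooks", "reacthooks tutorial"))
```

Gating the retry on `has_cjk` is what keeps the English path unchanged: "react hooks" still requires the space.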

Confidence Score: 5/5

Safe to merge — the English tokenization path is functionally unchanged, and the new CJK paths are well-isolated behind has_cjk checks with a zero-dependency bigram fallback.

The implementation is careful: thread safety is handled by eagerly importing jieba at module load time (commented and tested), the bigram fallback preserves the zero-dependency contract, and the has_cjk gate prevents any behavioral change on English text. The one note is a test assertion that is trivially true and does not exercise the guard it describes, but this is a test quality gap rather than a defect in the production code.

No files require special attention. The tests/test_cjk.py has one weak assertion worth strengthening, but the production code in all four files is correct.

Important Files Changed

| Filename | Overview |
| --- | --- |
| skills/last30days/scripts/lib/cjk.py | New CJK tokenization module with eager jieba import (thread-safe), bigram fallback, and Chinese stopwords — well-structured with thorough inline comments explaining design decisions. |
| skills/last30days/scripts/lib/relevance.py | Wires in cjk.segment() for tokenization and adds a CJK-aware phrase bonus retry with a correct has_cjk gate; English path unchanged. |
| skills/last30days/scripts/lib/dedupe.py | Minimal, correct changes: replaces .split() with cjk.segment() in token_jaccard and _tokenize, and unions Chinese stopwords into the existing English STOPWORDS set. |
| tests/test_cjk.py | Solid coverage of bigram path, Chinese relevance, and dedup; one test assertion is trivially weak and does not actually exercise the has_cjk gate it is described as guarding. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["segment(text)"] --> B{has_cjk?}
    B -- No --> C["_LATIN_RE.findall(text)\n(unchanged English path)"]
    B -- Yes --> D["Iterate CJK runs via _CJK_RUN_RE"]
    D --> E["Latin inter-run slices\n→ _LATIN_RE.findall()"]
    D --> F["CJK run → _cjk_tokens(run)"]
    F --> G{_jieba bound?}
    G -- Yes --> H["jieba.cut(run)\nfilter non-CJK tokens"]
    G -- No --> I["Character bigrams\nlen≤1 → single char"]
    E & H & I --> J["Flat token list"]
    J --> K["relevance.tokenize()\nstopwords + len>1 filter + synonyms"]
    J --> L["dedupe.token_jaccard()\nstopwords + len>1 filter"]
    J --> M["dedupe._tokenize()\nstopwords + len>1 filter"]

Reviews (2): Last reviewed commit: "fix(cjk): address review — eager jieba i..." | Re-trigger Greptile

Comment thread skills/last30days/scripts/lib/cjk.py Outdated
Comment thread skills/last30days/scripts/lib/cjk.py Outdated
Comment on lines +64 to +65
    except Exception:
        _jieba = None

P2 The except Exception clause silently swallows all errors: not only the expected ImportError for a missing package, but also real errors such as a corrupted jieba install or missing data files. Narrowing it to ImportError is the conventional and safer choice here.

Suggested change
-    except Exception:
-        _jieba = None
+    except ImportError:
+        _jieba = None

Comment on lines +74 to +81
    def _cjk_tokens(run: str) -> List[str]:
        jieba = _get_jieba()
        if jieba is not None:
            return [w for w in jieba.cut(run) if w.strip() and _CJK_RE.search(w)]
        # Dictionary-free fallback: character bigrams (single char if run length 1).
        if len(run) <= 1:
            return [run] if run else []
        return [run[i:i + 2] for i in range(len(run) - 1)]

P2 Stopword-only bigrams inflate similarity for very short CJK strings

When jieba is absent, adjacent stopword characters produce bigrams that aren't in CHINESE_STOPWORDS. For example, 我们的 → {我们, 们的}, and both bigrams survive the len > 1 filter and the stopword check (only single-char stopwords are in the set). Two nearly-empty Chinese strings built entirely from function words would get artificially high token Jaccard similarity. This is mainly an edge case for very short texts (1–4 chars), but worth noting if the skill encounters user tags or very short titles from Xiaohongshu/Bilibili.
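
The effect can be reproduced with a short sketch. It takes the comment's premise at face value (only single-character stopwords in the set) and uses a hypothetical pair of short function-word strings; `bigram_tokens` and `token_jaccard` here are illustrative stand-ins, not the functions in dedupe.py.

```python
from typing import Set

# Assumption, following the review comment: only single-character stopwords.
SINGLE_CHAR_STOPWORDS: Set[str] = {"的", "了", "是", "在"}


def bigram_tokens(run: str) -> Set[str]:
    """Bigram fallback followed by the len > 1 and stopword filters."""
    grams = [run[i:i + 2] for i in range(len(run) - 1)] if len(run) > 1 else [run]
    return {g for g in grams if len(g) > 1 and g not in SINGLE_CHAR_STOPWORDS}


def token_jaccard(a: str, b: str) -> float:
    ta, tb = bigram_tokens(a), bigram_tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


print(bigram_tokens("我们的"))          # the bigram 们的 contains 的 but is kept
print(token_jaccard("我们的", "他们的"))  # similarity driven only by 们的
```

No bigram equals a single-character stopword, so the filter never fires on this path and the shared bigram 们的 contributes to the similarity.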

…rministic tests

- Resolve jieba once at module import instead of a lazy initializer with mutable
  globals. The pipeline scores relevance inside a ThreadPoolExecutor, so the lazy
  path had a benign-but-real init race; binding at import removes it. Kept the
  broad `except` deliberately (any jieba load failure must fall back to bigrams,
  never crash) and documented why.
- Phrase bonus now fires for multi-token Chinese queries. `_normalize_phrase`
  joins tokens with spaces, so a query like "国产大模型 测评" never matched the
  continuous source text verbatim and the 0.12–0.16 bonus was dead for Chinese.
  Retry the containment check with spaces removed, gated on has_cjk so English
  stays space-sensitive (no spurious "reacthooks" matches).
- Make the bigram-path tests force cjk._jieba = None so they're deterministic
  regardless of whether jieba is installed in CI (previously they silently
  assumed jieba's absence). Added phrase-bonus and import-binding tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

An-idd commented Jun 12, 2026

Contributor Author

Thanks @greptile-apps — went through all five. Three fixed in e0a4e90, two I'm intentionally leaving (with reasoning):

1. jieba lazy-init thread-safety — fixed. Replaced the lazy _get_jieba() + mutable globals with a one-time eager import at module load. Relevance scoring does run inside the pipeline's ThreadPoolExecutor, so the race was real; binding _jieba at import settles it before any worker thread runs. jieba guards its own dictionary build with an internal lock, so concurrent cut() is fine after that.

2. Phrase bonus dead for Chinese — fixed (with a small correction). It wasn't dead for every Chinese query — a single contiguous query (国产大模型) still substring-matches. It was dead specifically for multi-token queries (国产大模型 测评), where the space never appears in continuous source text. Fixed by retrying the containment with spaces stripped, gated on has_cjk so English stays space-sensitive (no spurious reacthooks match). Verified: contiguous CJK match now scores above the scattered case.

3. Tests assumed jieba absent — fixed. Good catch. The bigram-path tests now force cjk._jieba = None via a base class, so they're deterministic whether or not jieba is installed in CI. Added phrase-bonus and import-binding tests.

4. except Exception too broad — keeping it, intentionally. jieba is an optional enhancement; the contract is that any failure to load it (missing package, corrupted install, missing data files, a setLogLevel signature change across versions) must degrade to the bigram path rather than crash the skill. Narrowing to ImportError would let a corrupted-but-importable jieba throw AttributeError and take the whole run down. I documented this reasoning in a comment.

5. Stopword-only short-string bigram inflation — acknowledged as a known limitation, not fixing here. Correct that 我们的 → {我们, 们的} and single-char stopwords don't filter in the bigram path. It only matters for very short (1–4 char) function-word-only strings, and matching is query-driven so it doesn't cause false positives in relevance. I've written this up in the PR's "Limitations" section. A real fix needs stopword-aware bigram filtering, which is better as a follow-up if Xiaohongshu/Bilibili tags turn out to surface it in practice.

(Also noted #5's sub-point about normalize_text splitting CJK at full-width punctuation — agreed that's correct behavior; the double-lower() is harmless and I left it for minimal diff.)

greptile-apps Bot commented Jun 12, 2026

Contributor

All five responses are solid — the reasoning holds up across the board.

The three fixes address the most impactful issues: thread safety was real (pipeline ThreadPoolExecutor), the phrase bonus dead-zone for multi-token queries was a meaningful correctness gap, and deterministic tests are essential for a path that's supposed to test the no-jieba fallback.

On the two you're holding:

except Exception — the argument is correct. ImportError alone doesn't cover a corrupted-but-importable jieba. The documented comment makes the intent clear for future readers. Fine to keep.

Bigram stopword inflation — the query-driven matching argument is sound. Short function-word-only strings causing false positives would require them to appear in the query, which is unlikely in practice. The PR description's Limitations section now documents it clearly. Good call deferring until there's evidence from real Xiaohongshu/Bilibili data.

LGTM. The core fix (CJK segmentation at all) and the follow-on correctness issues are all addressed or appropriately deferred with documented rationale.

@tmchow tmchow merged commit 50df263 into mvanhorn:main Jun 16, 2026
4 checks passed
2 participants