fix(relevance,dedupe): CJK-aware tokenization for Chinese sources by An-idd · Pull Request #547 · mvanhorn/last30days-skill

An-idd · 2026-06-11T14:55:48Z

Summary

Relevance scoring and near-duplicate detection tokenize by splitting on whitespace (str.split()). Chinese has no spaces between words, so a whole Chinese sentence collapses into a single token — token-overlap relevance and Jaccard dedup effectively stop working for Chinese-language content. This already affects the existing Xiaohongshu source (and any future CJK source).
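The failure mode described above can be reproduced in a few lines. This is a minimal sketch, not the project's code: `whitespace_tokens` and `jaccard` are hypothetical stand-ins for the whitespace-based tokenizer and the token Jaccard measure the summary refers to.

```python
from typing import Set


def whitespace_tokens(text: str) -> Set[str]:
    # Hypothetical stand-in for the old behavior: lowercase, then str.split().
    return set(text.lower().split())


def jaccard(a: Set[str], b: Set[str]) -> float:
    # Plain Jaccard similarity: |intersection| / |union|.
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


# English: whitespace splitting yields real words, so overlap is meaningful.
english_score = jaccard(
    whitespace_tokens("guide to react hooks"),
    whitespace_tokens("react hooks guide"),
)

# Chinese: each sentence is one giant token, so near-identical titles share nothing.
chinese_score = jaccard(
    whitespace_tokens("国产大模型测评"),
    whitespace_tokens("国产大模型最新测评"),
)

print(english_score, chinese_score)
```

The two Chinese titles differ by only two characters, yet each becomes a single distinct token and the similarity drops to zero.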

Changes

Add lib/cjk.py with segment(text), which splits text into CJK and non-CJK runs:
- Non-CJK (ASCII/Latin) runs keep the original \w+ word behavior — English is unchanged.
- CJK runs are segmented by jieba when installed, falling back to character bigrams when it is not. Bigrams are dictionary-free and still give robust overlap signal (e.g. query 大模型 → {大模, 模型} overlaps text 国产大模型评测).
jieba stays optional — present: used; absent: bigram fallback. It is never added to the hard dependency set, so the skill keeps its zero-dependency, install-anywhere property (pyproject dependencies = []).
Wire into relevance.tokenize() and dedupe (_tokenize / token_jaccard); both also union a small Chinese stopword set into their existing English stopwords.
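The run-splitting and bigram fallback described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the contents of lib/cjk.py: the character range in `_CJK_RUN_RE` (CJK Unified Ideographs only) is an assumption, the helper `bigrams` is a hypothetical name, and the jieba path is left out so the example stays dependency-free.

```python
import re
from typing import List

# Assumption: only the basic CJK Unified Ideographs block counts as CJK here.
_CJK_RUN_RE = re.compile(r"[\u4e00-\u9fff]+")
_LATIN_RE = re.compile(r"\w+")


def bigrams(run: str) -> List[str]:
    # Dictionary-free fallback: overlapping character bigrams
    # (a run of length 1 is kept as a single character).
    if len(run) <= 1:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


def segment(text: str) -> List[str]:
    # Walk the CJK runs; everything between them goes through the \w+ path.
    tokens: List[str] = []
    pos = 0
    for match in _CJK_RUN_RE.finditer(text):
        tokens.extend(_LATIN_RE.findall(text[pos:match.start()]))
        tokens.extend(bigrams(match.group()))
        pos = match.end()
    tokens.extend(_LATIN_RE.findall(text[pos:]))
    return tokens


print(segment("react hooks"))
print(segment("大模型"))
print(segment("国产大模型评测"))
```

With this sketch, both bigrams of the query 大模型 also appear among the bigrams of 国产大模型评测, which is the overlap signal the description relies on.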

Testing

uv run pytest tests/test_cjk.py tests/test_relevance.py tests/test_dedupe_v3.py -q — all green
Full suite passes (1627 passed, 4 skipped)
New tests/test_cjk.py: segmentation, Chinese relevance match/no-match, English non-regression, Chinese near-duplicate detection.

Quick before/after (default build, no jieba installed → bigram path):

query vs text	before	after
`国产大模型测评` vs `国产大模型最新测评`	~0.0 (one giant token)	0.93
`国产大模型测评` vs `今天天气很好`	—	0.0 (correctly rejected)
`react hooks` vs `guide to react hooks` (English)	1.0	1.0 (unchanged)

Limitations (stopword behavior in the bigram fallback)

The Chinese stopword set is fully effective only on the jieba path, which segments real words. On the zero-dependency bigram fallback it is partial, by construction:

- Single-character stopwords (的 / 了 / 是 / 在 …) are never emitted as tokens (every CJK token is a 2-char bigram), and bigrams that contain a stopword char (e.g. 品的, 的真) survive as mild noise tokens.
- Two-character stopwords (这个 / 什么 / 可以 …) are filtered only when they happen to align to a bigram boundary.

This does not affect correctness: matching is driven by the query's tokens, so stopword-bearing noise bigrams only appear on the text side and slightly dilute precision — they don't cause false matches. Verified: 产品测评 vs 这个产品的真实测评 → 0.87 (correct match), unrelated text → 0.0. The core win is that Chinese is segmented at all (whole-sentence-as-one-token was the actual bug); the stopword union is a secondary refinement that mainly pays off when jieba is installed.
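The partial stopword behavior on the bigram path can be illustrated as follows. This is a sketch under assumptions: `SAMPLE_STOPWORDS` contains only the examples named in this section (the real Chinese stopword set is not shown in the PR text), and `bigrams` / `filtered_bigrams` are hypothetical helper names.

```python
from typing import List

# Assumption: a sample built from the stopwords named above, not the real set.
SAMPLE_STOPWORDS = {"的", "了", "是", "在", "这个", "什么", "可以"}


def bigrams(run: str) -> List[str]:
    # Overlapping character bigrams, as in the zero-dependency fallback.
    if len(run) <= 1:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


def filtered_bigrams(run: str) -> List[str]:
    # Stopword filtering applied to bigram tokens.
    return [tok for tok in bigrams(run) if tok not in SAMPLE_STOPWORDS]


text_tokens = filtered_bigrams("这个产品的真实测评")
query_tokens = filtered_bigrams("产品测评")

print(text_tokens)
print(query_tokens)
```

In this example the two-character stopword 这个 is removed, the single character 的 never appears as a token of its own, and the bigrams 品的 and 的真 remain on the text side as noise. The query tokens 产品 and 测评 still match.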

Related Issues

None filed.

Relevance scoring and near-duplicate detection tokenize by splitting on whitespace (`str.split()`). Chinese text has no spaces between words, so a whole Chinese sentence collapses into a single token: token-overlap relevance and Jaccard dedup effectively stop working for Chinese-language content. This already affects the existing Xiaohongshu source and any future CJK source.

Add lib/cjk.py with `segment(text)`, which splits text into CJK and non-CJK runs:
- Non-CJK (ASCII/Latin) runs keep the original `\w+` word behavior, so English is unchanged.
- CJK runs are segmented by jieba when it is installed, falling back to character bigrams when it is not. Bigrams are dictionary-free and still give robust overlap signal (e.g. query "大模型" -> {大模, 模型} overlaps text "国产大模型评测").

jieba stays OPTIONAL: used when present, bigram fallback when absent. It is never added to the hard dependency set, so the skill keeps its zero-dependency, install-anywhere property (pyproject `dependencies = []`).

Wired into relevance.tokenize() and dedupe (_tokenize / token_jaccard); both also union a small Chinese stopword set into their existing English stopwords.

Tests: tests/test_cjk.py covers segmentation, Chinese relevance match/no-match, English non-regression, and Chinese near-duplicate detection.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

greptile-apps · 2026-06-11T15:06:42Z

Greptile Summary

This PR fixes a long-standing bug where Chinese text from Xiaohongshu/Bilibili was tokenized as a single whitespace-split token, rendering token-overlap relevance scoring and Jaccard deduplication effectively non-functional for CJK sources.

lib/cjk.py (new): Adds CJK-aware segment() that routes CJK runs through jieba (optional) or a bigram fallback; jieba is eagerly imported at module load to avoid per-call threading races in the ThreadPoolExecutor pipeline. English paths are unchanged.
relevance.py / dedupe.py: Minimal wiring — tokenize(), token_jaccard(), and _tokenize() now call cjk.segment() instead of .split(), and both STOPWORDS sets are unioned with CHINESE_STOPWORDS.
CJK phrase bonus: A new has_cjk-gated containment retry in token_overlap_relevance strips inter-word spaces before the substring check so the 0.12–0.16 phrase bonus is no longer permanently dead for Chinese queries.
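The has_cjk-gated containment retry could be sketched as below. This is not the code in relevance.py: `phrase_contained` is a hypothetical helper, the CJK character range is an assumption, and the sketch strips spaces from the query phrase only, which is one reading of "strips inter-word spaces before the substring check".

```python
import re

# Assumption: basic CJK Unified Ideographs block only.
_CJK_RE = re.compile(r"[\u4e00-\u9fff]")


def has_cjk(text: str) -> bool:
    return bool(_CJK_RE.search(text))


def phrase_contained(query_phrase: str, text: str) -> bool:
    # First try the ordinary, space-sensitive substring check.
    if query_phrase in text:
        return True
    # Retry without spaces only for CJK phrases, so English stays space-sensitive.
    if has_cjk(query_phrase):
        return query_phrase.replace(" ", "") in text
    return False


print(phrase_contained("国产 大模型 测评", "国产大模型测评来了"))
print(phrase_contained("react hooks", "reacthooks tutorial"))
```

The gate matters for the second call: without it, stripping spaces would let "react hooks" match the unrelated string "reacthooks".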

Confidence Score: 5/5

Safe to merge — the English tokenization path is functionally unchanged, and the new CJK paths are well-isolated behind has_cjk checks with a zero-dependency bigram fallback.

The implementation is careful: thread safety is handled by eagerly importing jieba at module load time (commented and tested), the bigram fallback preserves the zero-dependency contract, and the has_cjk gate prevents any behavioral change on English text. The one note is a test assertion that is trivially true and does not exercise the guard it describes, but this is a test quality gap rather than a defect in the production code.

No files require special attention. The tests/test_cjk.py has one weak assertion worth strengthening, but the production code in all four files is correct.

Important Files Changed

Filename	Overview
skills/last30days/scripts/lib/cjk.py	New CJK tokenization module with eager jieba import (thread-safe), bigram fallback, and Chinese stopwords — well-structured with thorough inline comments explaining design decisions.
skills/last30days/scripts/lib/relevance.py	Wires in cjk.segment() for tokenization and adds a CJK-aware phrase bonus retry with a correct has_cjk gate; English path unchanged.
skills/last30days/scripts/lib/dedupe.py	Minimal, correct changes: replaces .split() with cjk.segment() in token_jaccard and _tokenize, and unions Chinese stopwords into the existing English STOPWORDS set.
tests/test_cjk.py	Solid coverage of bigram path, Chinese relevance, and dedup; one test assertion is trivially weak and does not actually exercise the has_cjk gate it is described as guarding.

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A["segment(text)"] --> B{has_cjk?}
    B -- No --> C["_LATIN_RE.findall(text)\n(unchanged English path)"]
    B -- Yes --> D["Iterate CJK runs via _CJK_RUN_RE"]
    D --> E["Latin inter-run slices\n→ _LATIN_RE.findall()"]
    D --> F["CJK run → _cjk_tokens(run)"]
    F --> G{_jieba bound?}
    G -- Yes --> H["jieba.cut(run)\nfilter non-CJK tokens"]
    G -- No --> I["Character bigrams\nlen≤1 → single char"]
    E & H & I --> J["Flat token list"]
    J --> K["relevance.tokenize()\nstopwords + len>1 filter + synonyms"]
    J --> L["dedupe.token_jaccard()\nstopwords + len>1 filter"]
    J --> M["dedupe._tokenize()\nstopwords + len>1 filter"]

Reviews (2): Last reviewed commit: "fix(cjk): address review — eager jieba i..."

greptile-apps · 2026-06-11T15:06:49Z

+        except Exception:
+            _jieba = None


The except Exception clause silently swallows all errors — including ImportError for a missing package (expected) but also real errors like a corrupted jieba install or missing data files. Using ImportError is the conventional and safer narrowing here.

Suggested change

-        except Exception:
-            _jieba = None
+        except ImportError:
+            _jieba = None

greptile-apps · 2026-06-11T15:06:50Z

+def _cjk_tokens(run: str) -> List[str]:
+    jieba = _get_jieba()
+    if jieba is not None:
+        return [w for w in jieba.cut(run) if w.strip() and _CJK_RE.search(w)]
+    # Dictionary-free fallback: character bigrams (single char if run length 1).
+    if len(run) <= 1:
+        return [run] if run else []
+    return [run[i:i + 2] for i in range(len(run) - 1)]


Stopword-only bigrams inflate similarity for very short CJK strings

When jieba is absent, adjacent stopword characters produce bigrams that aren't in CHINESE_STOPWORDS. For example, 我们的 → {我们, 们的}, and both bigrams survive the len > 1 filter and the stopword check (only single-char stopwords like 的 are in the set). Two nearly-empty Chinese strings built entirely from function words would get artificially high token Jaccard similarity. This is mainly an edge case for very short texts (1–4 chars), but worth noting if the skill encounters user tags or very short titles from Xiaohongshu/Bilibili.
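The inflation described in this comment can be shown with two short function-word strings. This is a sketch under assumptions: the stopword sample holds only single characters (matching the comment's premise that 我们 and 们的 are not in the set), and the helper names are hypothetical.

```python
from typing import List, Set

# Assumption: single-character stopwords only, per the comment's premise.
SINGLE_CHAR_STOPWORDS = {"的", "了", "是", "在"}


def bigrams(run: str) -> List[str]:
    if len(run) <= 1:
        return [run] if run else []
    return [run[i:i + 2] for i in range(len(run) - 1)]


def bigram_token_set(run: str) -> Set[str]:
    # Mirrors the described filters: drop stopwords and tokens of length 1.
    return {t for t in bigrams(run) if len(t) > 1 and t not in SINGLE_CHAR_STOPWORDS}


def jaccard(a: Set[str], b: Set[str]) -> float:
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


a = bigram_token_set("我们的")    # {我们, 们的}
b = bigram_token_set("是我们的")  # {是我, 我们, 们的}
score = jaccard(a, b)
print(a, b, score)
```

Neither string carries topical content, yet two of the three distinct bigrams are shared, so the similarity comes out at two thirds.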

…rministic tests

- Resolve jieba once at module import instead of a lazy initializer with mutable globals. The pipeline scores relevance inside a ThreadPoolExecutor, so the lazy path had a benign-but-real init race; binding at import removes it. Kept the broad `except` deliberately (any jieba load failure must fall back to bigrams, never crash) and documented why.
- Phrase bonus now fires for multi-token Chinese queries. `_normalize_phrase` joins tokens with spaces, so a query like "国产大模型测评" never matched the continuous source text verbatim and the 0.12–0.16 bonus was dead for Chinese. Retry the containment check with spaces removed, gated on has_cjk so English stays space-sensitive (no spurious "reacthooks" matches).
- Make the bigram-path tests force cjk._jieba = None so they're deterministic regardless of whether jieba is installed in CI (previously they silently assumed jieba's absence). Added phrase-bonus and import-binding tests.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

An-idd · 2026-06-12T00:13:37Z

Thanks @greptile-apps — went through all five. Three fixed in e0a4e90, two I'm intentionally leaving (with reasoning):

1. jieba lazy-init thread-safety — fixed. Replaced the lazy _get_jieba() + mutable globals with a one-time eager import at module load. Relevance scoring does run inside the pipeline's ThreadPoolExecutor, so the race was real; binding _jieba at import settles it before any worker thread runs. jieba guards its own dictionary build with an internal lock, so concurrent cut() is fine after that.

2. Phrase bonus dead for Chinese — fixed (with a small correction). It wasn't dead for every Chinese query — a single contiguous query (国产大模型) still substring-matches. It was dead specifically for multi-token queries (国产大模型测评), where the space never appears in continuous source text. Fixed by retrying the containment with spaces stripped, gated on has_cjk so English stays space-sensitive (no spurious reacthooks match). Verified: contiguous CJK match now scores above the scattered case.

3. Tests assumed jieba absent — fixed. Good catch. The bigram-path tests now force cjk._jieba = None via a base class, so they're deterministic whether or not jieba is installed in CI. Added phrase-bonus and import-binding tests.

4. except Exception too broad — keeping it, intentionally. jieba is an optional enhancement; the contract is that any failure to load it (missing package, corrupted install, missing data files, a setLogLevel signature change across versions) must degrade to the bigram path rather than crash the skill. Narrowing to ImportError would let a corrupted-but-importable jieba throw AttributeError and take the whole run down. I documented this reasoning in a comment.

5. Stopword-only short-string bigram inflation — acknowledged as a known limitation, not fixing here. Correct that 我们的 → {我们, 们的} and single-char stopwords don't filter in the bigram path. It only matters for very short (1–4 char) function-word-only strings, and matching is query-driven so it doesn't cause false positives in relevance. I've written this up in the PR's "Limitations" section. A real fix needs stopword-aware bigram filtering, which is better as a follow-up if Xiaohongshu/Bilibili tags turn out to surface it in practice.

(Also noted #5's sub-point about normalize_text splitting CJK at full-width punctuation — agreed that's correct behavior; the double-lower() is harmless and I left it for minimal diff.)

greptile-apps · 2026-06-12T00:14:09Z

All five responses are solid — the reasoning holds up across the board.

The three fixes address the most impactful issues: thread safety was real (pipeline ThreadPoolExecutor), the phrase bonus dead-zone for multi-token queries was a meaningful correctness gap, and deterministic tests are essential for a path that's supposed to test the no-jieba fallback.

On the two you're holding:

except Exception — the argument is correct. ImportError alone doesn't cover a corrupted-but-importable jieba. The documented comment makes the intent clear for future readers. Fine to keep.

Bigram stopword inflation — the query-driven matching argument is sound. Short function-word-only strings causing false positives would require them to appear in the query, which is unlikely in practice. The PR description's Limitations section now documents it clearly. Good call deferring until there's evidence from real Xiaohongshu/Bilibili data.

LGTM. The core fix (CJK segmentation at all) and the follow-on correctness issues are all addressed or appropriately deferred with documented rationale.

greptile-apps Bot reviewed Jun 11, 2026

tmchow merged commit 50df263 into mvanhorn:main Jun 16, 2026
4 checks passed

tmchow merged 2 commits into mvanhorn:main from An-idd:fix/cjk-tokenization
