fix(prepare): use true raw bytes for BPB token byte count calculation #607

Open
aniruddhaadak80 wants to merge 1 commit into karpathy:master from aniruddhaadak80:fix/issue-384-bpb-utf8-inflation
Conversation

@aniruddhaadak80

Computes the BPE token byte length using decode_single_token_bytes() instead of len(token_str.encode('utf-8')) to prevent metric inflation from UTF-8 replacement characters on non-UTF-8 tokens (Fixes #384).
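The inflation described above can be reproduced without a tokenizer. The sketch below assumes the pre-fix token string came from a lossy decode (`errors="replace"`), which is an assumption about the surrounding code rather than something shown in this PR; the helper names `reencoded_len` and `raw_len` are hypothetical.

```python
def reencoded_len(raw: bytes) -> int:
    """Byte length after a lossy decode/encode round trip (the pre-fix style of counting)."""
    # Invalid UTF-8 is replaced by U+FFFD, which itself encodes to 3 bytes.
    token_str = raw.decode("utf-8", errors="replace")
    return len(token_str.encode("utf-8"))


def raw_len(raw: bytes) -> int:
    """Byte length of the token's actual bytes (the post-fix style of counting)."""
    return len(raw)


samples = {
    "ascii": b"hello",                        # valid UTF-8, both counts agree
    "complete char": "€".encode("utf-8"),     # valid 3-byte sequence, both counts agree
    "lone continuation byte": b"\x80",        # invalid on its own: 1 raw byte, 3 after round trip
}

for name, raw in samples.items():
    print(f"{name}: raw={raw_len(raw)} reencoded={reencoded_len(raw)}")
```

Only tokens whose bytes are not valid UTF-8 on their own are affected; valid tokens give the same count either way.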

Copilot AI review requested due to automatic review settings June 13, 2026 04:30

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Updates tokenizer preprocessing to compute token byte-lengths using the tokenizer’s decoded raw bytes rather than re-encoding the token string representation.

Changes:

  • Compute per-token byte length via enc.decode_single_token_bytes(token_id) instead of len(token_str.encode("utf-8")).
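A minimal sketch of the resulting table-building loop, using a stub encoder so it runs without tiktoken. `StubEncoding`, its `special_ids` attribute, and `build_token_bytes` are hypothetical names for illustration; only `decode_single_token_bytes` and `n_vocab` mirror the tiktoken interface, and the condition that yields a zero count is assumed to be "special token".

```python
class StubEncoding:
    """Hypothetical stand-in for a tiktoken-style encoder (illustration only)."""

    def __init__(self, vocab, special_ids):
        self._vocab = list(vocab)            # token_id -> raw bytes
        self.special_ids = set(special_ids)  # assumed: ids that count as 0 bytes
        self.n_vocab = len(self._vocab)

    def decode_single_token_bytes(self, token_id: int) -> bytes:
        return self._vocab[token_id]


def build_token_bytes(enc) -> list:
    """Per-token byte lengths taken from raw token bytes, 0 for special tokens."""
    token_bytes_list = []
    for token_id in range(enc.n_vocab):
        if token_id in enc.special_ids:
            token_bytes_list.append(0)
        else:
            token_bytes_list.append(len(enc.decode_single_token_bytes(token_id)))
    return token_bytes_list


# "€" is b"\xe2\x82\xac"; here it is split across two tokens that are each invalid UTF-8.
enc = StubEncoding([b"a", b"\xe2\x82", b"\xac", b"<|bos|>"], special_ids=[3])
table = build_token_bytes(enc)
total = sum(table[i] for i in [1, 2])  # bytes covered by the two tokens spelling "€"
print(table, total)
```

With raw byte lengths, the per-token counts of a split character add up to the byte length of the original text, which is the property a bytes-based metric relies on.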
Comment thread prepare.py
          token_bytes_list.append(0)
      else:
-         token_bytes_list.append(len(token_str.encode("utf-8")))
+         token_bytes_list.append(len(enc.decode_single_token_bytes(token_id)))
Development

Successfully merging this pull request may close these issues.

Bug: BPB metric inflated by UTF-8 replacement characters in token byte count

2 participants