
refactor(spec-decode): simplify Qwen3.5 NextN attention path for #217 (2/3) #429

Open
rjzhb wants to merge 28 commits into lightseekorg:main from rjzhb:refactor/qwen-attention-hooks

Conversation

@rjzhb rjzhb commented Jun 12, 2026

Collaborator

Summary

This is the second PR in the series refactoring the spec-decode attention path from #217. It takes the base-hook pattern introduced for Llama Eagle3 in #390 (1/3) and applies it to Qwen3.5 NextN.

Qwen3_5DraftForCausalLM / Qwen3_5DraftAttentionDecoderLayer collapse the bespoke draft forward path into a two-method subclass of the base Qwen3_5AttentionDecoderLayer:

  • _attn overrides the draft first-step dispatch path: correction + q-slice + DECODE.
  • Inactive steps delegate to super()._attn.
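
The override-and-delegate shape described above can be sketched as follows. This is a minimal illustration, not the code in this PR: the classes are toy stand-ins, the `is_draft_first_step` field on `ForwardContext` is a hypothetical name, and the real first-step branch performs correction, q-slice and DECODE rather than returning a tag.

```python
from dataclasses import dataclass


@dataclass
class ForwardContext:
    # Hypothetical stand-in: the real ForwardContext carries more state.
    is_draft_first_step: bool


class BaseAttentionDecoderLayer:
    """Toy stand-in for the base decoder layer."""

    def forward(self, hidden, ctx):
        # The base forward owns the overall flow and calls the hook.
        return self._attn(hidden, ctx)

    def _attn(self, hidden, ctx):
        # Default attention path shared by every step.
        return ("base", hidden)


class DraftAttentionDecoderLayer(BaseAttentionDecoderLayer):
    """Toy draft layer that overrides only the attention hook."""

    def _attn(self, hidden, ctx):
        if not ctx.is_draft_first_step:
            # Every other step falls back to the base implementation.
            return super()._attn(hidden, ctx)
        # Placeholder for the first-step path (correction + q-slice + DECODE).
        return ("draft_first_step", hidden)


layer = DraftAttentionDecoderLayer()
first = layer.forward([1, 2], ForwardContext(is_draft_first_step=True))
later = layer.forward([1, 2], ForwardContext(is_draft_first_step=False))
```

Because only the hook is overridden, the draft layer inherits the rest of the base forward path unchanged.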

The correction logic, which trims draft_seq_lens_buf by spec_num_tokens - accept_lengths, now lives in a single _apply_correction method next to its only consumer and is plumbed through ForwardContext, mirroring #390 (1/3). _maybe_narrow_residual handles the NextN residual narrowing.
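
As a rough illustration of the trim arithmetic, the sketch below subtracts the number of rejected speculative tokens from each request's sequence length. It assumes the correction is a per-request in-place subtraction and uses plain Python lists in place of device buffers. The free function `apply_correction` is a hypothetical stand-in for `_apply_correction`, whose real signature is not shown here.

```python
def apply_correction(draft_seq_lens_buf, accept_lengths, spec_num_tokens):
    """Trim each sequence length by the number of rejected draft tokens.

    Assumption: every request speculated `spec_num_tokens` tokens and
    accepted `accept_lengths[i]` of them, so the remainder is rolled back.
    """
    for i, accepted in enumerate(accept_lengths):
        rejected = spec_num_tokens - accepted
        draft_seq_lens_buf[i] -= rejected  # in-place, like a shared buffer
    return draft_seq_lens_buf


buf = [10, 12]
apply_correction(buf, accept_lengths=[4, 2], spec_num_tokens=4)
# buf is now [10, 10]: request 0 kept all 4 tokens, request 1 dropped 2.
```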

This is restricted to single-layer drafts for now, asserted in __init__. _apply_correction mutates per-layer state, so multi-layer NextN support needs the trim to be hoisted before this restriction can be relaxed.
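
A guard of this kind might look like the sketch below. The class name and the `num_draft_layers` parameter are hypothetical and chosen only for illustration. The actual assertion in this PR may check a different config field.

```python
class SingleLayerDraft:
    """Toy model wrapper that refuses multi-layer draft configurations."""

    def __init__(self, num_draft_layers: int):
        # Fail fast at construction time instead of producing wrong
        # sequence lengths later in the forward pass.
        assert num_draft_layers == 1, (
            "only single-layer drafts are supported; "
            f"got num_draft_layers={num_draft_layers}"
        )
        self.num_draft_layers = num_draft_layers


draft = SingleLayerDraft(num_draft_layers=1)
```

Note that `assert` statements are skipped when Python runs with `-O`, so a guard that must hold in production would typically raise `ValueError` instead.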

rjzhb and others added 17 commits June 9, 2026 04:20
Signed-off-by: rjzhb <rjzhb222@163.com>
Signed-off-by: rjzhb <rjzhb222@163.com>
Signed-off-by: rjzhb <rjzhb222@163.com>
…tion-hooks

# Conflicts:
#	python/tokenspeed/runtime/execution/drafter/eagle.py
#	python/tokenspeed/runtime/models/llama_eagle3.py
Signed-off-by: rjzhb <rjzhb222@163.com>
Signed-off-by: rjzhb <rjzhb222@163.com>
@rjzhb rjzhb marked this pull request as ready for review June 12, 2026 23:26
@rjzhb rjzhb requested a review from a team as a code owner June 12, 2026 23:27
Comment thread python/tokenspeed/runtime/models/qwen3_5_nextn.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4d52f2f7bc

Comment thread python/tokenspeed/runtime/models/qwen3_5_nextn.py Outdated
2 participants