feat: support trtllm backend features in mha backend by borontion · Pull Request #445 · lightseekorg/tokenspeed

borontion · 2026-06-14T18:32:53Z

Summary

The goal of this PR is to make the default mha backend cover all features of the trtllm backend and reach the same performance, for models including GPT-OSS and Qwen3.5. Specifically:

Support FP8 kv cache and FP8 attention: the registered flashinfer (trtllm) kernel now allows FP8 QKV types. For FP8 kv cache, re-use the existing triton kernel fused_fp8_set_kv_buffer, which fuses the downcast and the kv cache write.
Add attention plan: introduce attn_plan, which chooses between two ways to run a prefill step: prewrite, which first writes to the KV cache and then runs mha_extend using the cached KV; and postwrite, which runs mha_prefill directly on QKV and then writes to the KV cache. The path is selected based on the registered kernels: currently postwrite is used only if there is a high-priority mha_prefill kernel and FP8 attention is not in use.
Resolve Qwen3.5 numerics issue: previously, running Qwen3.5 with the mha backend hit a numerics issue, so we always used the trtllm backend. The root cause seems to be that the current version of FA4 cannot correctly handle non-contiguous KV and head dim 256. Forcing KV to be contiguous can be expensive. For now, this PR just removes head dim 256 from FA4's trait.
Remove the split extend path (from feat: use split prefill for prefix cache in mha backend #178): previously there was an optional split extend path that used a prefill kernel + extend kernel for prefix cache. This creates many extra conditions to handle in the mha backend. This PR removes the optional split extend path, so the backend now always does prewrite kv cache + extend kernel. A later PR will add an optimized Gluon kernel for the extend kernel on AMD.
Remove FA3 scheduler metadata (from perf(qwen3): cut H100 decode kernel time -8% with fused stride-aware kernels #81): the decode scheduler metadata for FA3 on Hopper is only used in eager mode, which is not useful. This PR removes it from the mha backend and removes the exposed APIs in the kernel.
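
For readers unfamiliar with the prewrite/postwrite distinction, the selection rule described in the attention plan item could be sketched roughly as follows. This is only an illustration of the stated rule, not the code in this PR: `AttnPlan`, `PrefillContext`, `choose_attn_plan`, and `prefill_steps` are hypothetical names, and the step strings stand in for the real kernel calls.

```python
from dataclasses import dataclass
from enum import Enum


class AttnPlan(Enum):
    # Hypothetical enum; the PR only names the two paths "prewrite" and "postwrite".
    PREWRITE = "prewrite"    # write KV cache first, then run mha_extend on cached KV
    POSTWRITE = "postwrite"  # run mha_prefill on fresh QKV, then write KV cache


@dataclass(frozen=True)
class PrefillContext:
    # Hypothetical inputs; a real backend would derive these from its kernel registry.
    has_high_priority_prefill_kernel: bool
    use_fp8_attention: bool


def choose_attn_plan(ctx: PrefillContext) -> AttnPlan:
    """Use postwrite only with a high-priority prefill kernel and no FP8 attention."""
    if ctx.has_high_priority_prefill_kernel and not ctx.use_fp8_attention:
        return AttnPlan.POSTWRITE
    return AttnPlan.PREWRITE


def prefill_steps(plan: AttnPlan) -> list[str]:
    """Return the order of operations for one prefill step (labels only)."""
    if plan is AttnPlan.POSTWRITE:
        return ["mha_prefill(q, k, v)", "write_kv_cache(k, v)"]
    return ["write_kv_cache(k, v)", "mha_extend(q, kv_cache)"]


if __name__ == "__main__":
    ctx = PrefillContext(has_high_priority_prefill_kernel=True, use_fp8_attention=True)
    plan = choose_attn_plan(ctx)
    print(plan.value, prefill_steps(plan))
```

With FP8 attention enabled the sketch falls back to prewrite even when a high-priority prefill kernel exists, which matches the condition stated in the summary.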

Next steps:

Align the handling of spec decode step.
Add optimized mha_extend Gluon kernel.

Test Plan

# kernel test
pytest tokenspeed-kernel/test/ops/test_attention.py
# model test
python test/runtime/models/test_generation_models.py

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b566d7dfab

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2f4ebabac2


chatgpt-codex-connector · 2026-06-15T18:22:27Z

        solution="flashinfer",
        capability=CapabilityRequirement(
-            min_arch_version=ArchVersion(9, 0),
+            min_arch_version=ArchVersion(10, 0),


Restore Hopper support for flashinfer MHA

When users force --attention-backend flashinfer on Hopper/H100, MHAAttnBackend passes solution='flashinfer', but this registration is now gated at SM 10.0. I checked the attention flashinfer module (rg 'solution="flashinfer"' tokenspeed-kernel/python/tokenspeed_kernel/ops/attention) and the only MHA flashinfer registrations are these extend/decode TRTLLM ones, both with the same 10.0 capability, so select_kernel has no Hopper candidate and the request errors instead of using the previously registered FlashInfer MHA path.
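
The failure mode described in this comment can be illustrated with a small stand-alone sketch of capability-gated kernel selection. It assumes a simplified registry: `ArchVersion`, `Registration`, and `select_kernel` below are hypothetical stand-ins written for this example, not the actual tokenspeed_kernel classes or their signatures.

```python
from typing import List, NamedTuple, Optional


class ArchVersion(NamedTuple):
    # Stand-in for the ArchVersion seen in the diff; compared as a (major, minor) tuple.
    major: int
    minor: int


class Registration(NamedTuple):
    # Hypothetical registry entry for illustration only.
    name: str
    solution: str
    min_arch_version: ArchVersion


def select_kernel(
    registry: List[Registration], solution: str, device_arch: ArchVersion
) -> Optional[Registration]:
    """Return the first registration matching the solution that the device satisfies."""
    for reg in registry:
        if reg.solution == solution and device_arch >= reg.min_arch_version:
            return reg
    return None


if __name__ == "__main__":
    registry = [
        Registration("trtllm_extend", "flashinfer", ArchVersion(10, 0)),
        Registration("trtllm_decode", "flashinfer", ArchVersion(10, 0)),
    ]
    hopper = ArchVersion(9, 0)
    blackwell = ArchVersion(10, 0)
    print(select_kernel(registry, "flashinfer", hopper))     # no candidate on SM 9.0
    print(select_kernel(registry, "flashinfer", blackwell))  # first SM 10.0 registration
```

In this sketch, once every flashinfer registration requires SM 10.0, a forced `solution="flashinfer"` request on SM 9.0 finds no candidate, which is the situation the review comment flags.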

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54193ff48e


chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9938db492b


antiagainst

Nice unification!

borontion added 19 commits June 13, 2026 15:02

remote split extend

8b8836e

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

streamline import

d506bbc

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

cleanup

2fbd1e2

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

update flashinfer prefill api

2e9eafe

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

add comment splitter

feedffb

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

remove scheduler metadata

017d9e3

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

remove the registered cudnn impl

20aabd6

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

add extend fallback

5fdc7df

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

support fp8

76e98fc

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

update attention test

803734b

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

remove

351d492

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

gate flashinfer kernel and cuda on blackwell

baade05

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

move kv cache management kernel to tokenspeed-kernel

d7fc387

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

support fp8 kv cache

b5b5c04

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

drop moe backend

3d35211

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

add cu len for kv

bba4c8c

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

use auto moe and attn backend for qwen

90789b7

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

remove try-catch import

49c37b0

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

fix fa4 numerics

b566d7d

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

borontion marked this pull request as ready for review June 15, 2026 18:01

borontion requested a review from a team as a code owner June 15, 2026 18:01

borontion added 2 commits June 15, 2026 11:03

fix test

55931a7

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

drop dead code

6187d9d

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

chatgpt-codex-connector Bot reviewed Jun 15, 2026

Comment thread python/tokenspeed/runtime/layers/attention/backends/mha.py Outdated

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/attention/flash_attn/__init__.py

borontion added 2 commits June 15, 2026 11:13

fix

9667fd8

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

rename

2f4ebab

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

chatgpt-codex-connector Bot reviewed Jun 15, 2026

drop fa4 test

54193ff

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

chatgpt-codex-connector Bot reviewed Jun 15, 2026

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/attention/flash_attn/__init__.py

refactor

3d78eb0

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

borontion added 7 commits June 15, 2026 15:26

remove decode scheduler metadata

31fe731

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

update comments

63e520c

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

cleanup

bc88c4f

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

fix fp8 path

4ffd66a

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

remove redundant .contiguous()

d2bf4d1

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

remove tags

38fb5bb

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

introduce attn plan

9938db4

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

chatgpt-codex-connector Bot reviewed Jun 16, 2026

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/attention/flashinfer/__init__.py

borontion changed the title ~~[WIP] feat: support trtllm backend features in mha backend~~ feat: support trtllm backend features in mha backend Jun 16, 2026

inline kwargs

8500865

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

antiagainst approved these changes Jun 17, 2026

borontion merged commit 51d41e9 into main Jun 17, 2026
33 of 37 checks passed

borontion deleted the borontion/improve-mha-backend branch June 17, 2026 04:15

borontion mentioned this pull request Jun 17, 2026

[WIP] feat: use trtllm-style spec decode in mha backend #465

Draft

borontion merged 33 commits into main from borontion/improve-mha-backend
