feat: support trtllm backend features in mha backend#445

Merged
borontion merged 33 commits into
mainfrom
borontion/improve-mha-backend
Jun 17, 2026

@borontion borontion commented Jun 14, 2026

Contributor

Summary

The goal of this PR is to make the default mha backend cover all the features and reach the same performance as the trtllm backend, for models including GPT-OSS and Qwen3.5. Specifically:

  1. Support FP8 KV cache and FP8 attention: the flashinfer (trtllm) kernel registration now allows FP8 QKV types. For the FP8 KV cache, we reuse the existing Triton kernel fused_fp8_set_kv_buffer, which fuses the downcast with the KV cache write.
  2. Add attention plan: introduce attn_plan, which chooses between two paths for a prefill step. prewrite first writes to the KV cache and then runs mha_extend on the cached KV; postwrite runs mha_prefill directly on QKV and then writes to the KV cache. The choice depends on the registered kernels: currently postwrite is used only if there is a high-priority mha_prefill kernel and FP8 attention is not in use.
  3. Resolve Qwen3.5 numerics issue: previously, running Qwen3.5 on the mha backend hit a numerics issue, so we always used the trtllm backend. The root cause seems to be that the current version of FA4 cannot correctly handle non-contiguous KV and head dim 256. Forcing KV to be contiguous can be expensive. For now, this PR just removes head dim 256 from FA4's trait.
  4. Remove the split extend path (from feat: use split prefill for prefix cache in mha backend #178): previously there was an optional split extend path that used a prefill kernel plus an extend kernel for the prefix cache. This creates many extra conditions to handle in the mha backend. This PR removes the optional split extend path; the backend now always prewrites the KV cache and then runs the extend kernel. A later PR will add an optimized Gluon extend kernel for AMD.
  5. Remove FA3 scheduler metadata (from perf(qwen3): cut H100 decode kernel time -8% with fused stride-aware kernels #81): the decode scheduler metadata for FA3 on Hopper is only used in eager mode, which is not useful. This PR removes it from the mha backend and removes the exposed APIs in kernel.
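
For reviewers, the prewrite/postwrite rule from item 2 can be sketched roughly as below. This is an illustration only, not the code in this PR: `AttnPlan`, `choose_attn_plan`, and both boolean parameters are hypothetical names, and the real attn_plan may take more inputs into account.

```python
from enum import Enum


class AttnPlan(Enum):
    # Hypothetical names; the summary only describes the two paths in prose.
    PREWRITE = "prewrite"    # write KV cache first, then mha_extend on cached KV
    POSTWRITE = "postwrite"  # mha_prefill on QKV directly, then write KV cache


def choose_attn_plan(has_high_priority_prefill_kernel: bool,
                     use_fp8_attention: bool) -> AttnPlan:
    """Pick the prefill path using the rule stated in the summary."""
    if has_high_priority_prefill_kernel and not use_fp8_attention:
        return AttnPlan.POSTWRITE
    # Everything else falls back to writing the KV cache first.
    return AttnPlan.PREWRITE


if __name__ == "__main__":
    for has_kernel in (False, True):
        for fp8 in (False, True):
            plan = choose_attn_plan(has_kernel, fp8)
            print(f"prefill_kernel={has_kernel} fp8_attention={fp8} -> {plan.value}")
```

Under this rule prewrite is the default, so any configuration without a high-priority mha_prefill kernel, or with FP8 attention enabled, goes through the cached-KV extend path.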

Next steps:

  • Align the handling of spec decode step.
  • Add optimized mha_extend Gluon kernel.

Test Plan

# kernel test
pytest tokenspeed-kernel/test/ops/test_attention.py
# model test
python test/runtime/models/test_generation_models.py

borontion added 19 commits June 13, 2026 15:02
Signed-off-by: Pengzhan Zhao <borontion@gmail.com>
@borontion borontion marked this pull request as ready for review June 15, 2026 18:01
@borontion borontion requested a review from a team as a code owner June 15, 2026 18:01

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b566d7dfab

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/layers/attention/backends/mha.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 2f4ebabac2

      solution="flashinfer",
      capability=CapabilityRequirement(
-         min_arch_version=ArchVersion(9, 0),
+         min_arch_version=ArchVersion(10, 0),

[P2] Restore Hopper support for flashinfer MHA

When users force --attention-backend flashinfer on Hopper/H100, MHAAttnBackend passes solution='flashinfer', but this registration is now gated at SM 10.0. I checked the attention flashinfer module (rg 'solution="flashinfer"' tokenspeed-kernel/python/tokenspeed_kernel/ops/attention) and the only MHA flashinfer registrations are these extend/decode TRTLLM ones, both with the same 10.0 capability, so select_kernel has no Hopper candidate and the request errors instead of using the previously registered FlashInfer MHA path.
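
The failure mode described above can be reproduced with a toy registry. This is a minimal sketch under stated assumptions, not the tokenspeed-kernel API: `Registration`, `select`, and the kernel names are hypothetical stand-ins for the real registration and select_kernel logic, and the sketch returns `None` where the real path reportedly errors.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Arch = Tuple[int, int]  # (major, minor), e.g. (9, 0) for Hopper


@dataclass(frozen=True)
class Registration:
    # Hypothetical stand-in for a kernel registration entry.
    name: str
    solution: str
    min_arch: Arch


def select(regs: List[Registration], solution: str,
           device_arch: Arch) -> Optional[Registration]:
    """Return the first registration matching the solution and arch gate."""
    for reg in regs:
        # Tuples compare lexicographically, so (9, 0) < (10, 0).
        if reg.solution == solution and device_arch >= reg.min_arch:
            return reg
    return None


# Both flashinfer MHA entries gated at 10.0, as the review comment describes.
REGS = [
    Registration("trtllm_extend", "flashinfer", (10, 0)),
    Registration("trtllm_decode", "flashinfer", (10, 0)),
]

if __name__ == "__main__":
    print("arch 9.0 :", select(REGS, "flashinfer", (9, 0)))
    print("arch 10.0:", select(REGS, "flashinfer", (10, 0)))
```

With every flashinfer entry gated at (10, 0), a (9, 0) device has no candidate left, which matches the reported behavior on Hopper.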


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 54193ff48e


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9938db492b

@borontion borontion changed the title [WIP] feat: support trtllm backend features in mha backend feat: support trtllm backend features in mha backend Jun 16, 2026

@antiagainst antiagainst left a comment

Member

Nice unification!

@borontion borontion merged commit 51d41e9 into main Jun 17, 2026
33 of 37 checks passed
@borontion borontion deleted the borontion/improve-mha-backend branch June 17, 2026 04:15