feat: support trtllm backend features in mha backend #445
Signed-off-by: Pengzhan Zhao <borontion@gmail.com>
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b566d7dfab
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2f4ebabac2
      solution="flashinfer",
      capability=CapabilityRequirement(
-         min_arch_version=ArchVersion(9, 0),
+         min_arch_version=ArchVersion(10, 0),
Restore Hopper support for flashinfer MHA
When users force --attention-backend flashinfer on Hopper/H100, MHAAttnBackend passes solution='flashinfer', but this registration is now gated at SM 10.0. I checked the attention flashinfer module (rg 'solution="flashinfer"' tokenspeed-kernel/python/tokenspeed_kernel/ops/attention) and the only MHA flashinfer registrations are these extend/decode TRTLLM ones, both with the same 10.0 capability, so select_kernel has no Hopper candidate and the request errors instead of using the previously registered FlashInfer MHA path.
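The failure mode described in this comment can be reproduced with a toy registry. This is only a sketch under stated assumptions: the `KernelSpec` class, the kernel names, and this simplified `select_kernel` are hypothetical stand-ins and not the actual tokenspeed_kernel API. It shows that if every registration for a solution requires SM 10.0, a lookup on an SM 9.0 device has no candidate and fails.

```python
from dataclasses import dataclass
from typing import NamedTuple


class ArchVersion(NamedTuple):
    """Toy (major, minor) compute capability; compares like a tuple."""
    major: int
    minor: int


@dataclass(frozen=True)
class KernelSpec:
    # Hypothetical record, not the real registration type.
    name: str
    solution: str
    min_arch: ArchVersion


def select_kernel(registry, solution, device_arch):
    """Return the first kernel for `solution` whose minimum arch is met."""
    candidates = [
        k for k in registry
        if k.solution == solution and device_arch >= k.min_arch
    ]
    if not candidates:
        raise LookupError(
            f"no {solution!r} kernel for SM {device_arch.major}.{device_arch.minor}"
        )
    return candidates[0]


HOPPER = ArchVersion(9, 0)
BLACKWELL = ArchVersion(10, 0)

# Hypothetical kernel names; both registrations are gated at SM 10.0,
# mirroring the situation the review comment describes.
registry_after_pr = [
    KernelSpec("trtllm_mha_extend", "flashinfer", ArchVersion(10, 0)),
    KernelSpec("trtllm_mha_decode", "flashinfer", ArchVersion(10, 0)),
]

if __name__ == "__main__":
    print(select_kernel(registry_after_pr, "flashinfer", BLACKWELL).name)
    try:
        select_kernel(registry_after_pr, "flashinfer", HOPPER)
    except LookupError as err:
        print("Hopper lookup failed:", err)
```

In this toy model, keeping one registration with a 9.0 minimum would give the Hopper lookup a candidate again, which is the restoration the comment asks for.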
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 54193ff48e
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9938db492b
Summary
The goal of this PR is to make the default mha backend cover all features of the trtllm backend and reach the same performance, for models including GPT-OSS and Qwen3.5. Specifically:
- fused_fp8_set_kv_buffer, which fuses the downcast and the KV cache write.
- attn_plan, which decides between two ways of running a prefill step:
  - prewrite: first write to the KV cache, then run mha_extend using the cached KV;
  - postwrite: use QKV directly for mha_prefill, then write to the KV cache.

  Which path is taken depends on the registered kernels: currently postwrite is used only if there is a high-priority mha_prefill kernel and FP8 attention is not in use.

Next steps:
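As an illustration of the attn_plan decision described in the Summary, here is a minimal sketch. The `AttnPlan` enum, `choose_attn_plan`, `prefill_step`, and their parameters are hypothetical names introduced for this example and are not taken from the PR's code. Only the ordering of operations and the selection rule come from the description.

```python
from enum import Enum


class AttnPlan(Enum):
    PREWRITE = "prewrite"    # write KV cache first, then mha_extend on cached KV
    POSTWRITE = "postwrite"  # mha_prefill on raw QKV first, then write KV cache


def choose_attn_plan(has_high_priority_prefill_kernel: bool,
                     use_fp8_attention: bool) -> AttnPlan:
    """Pick postwrite only when a high-priority mha_prefill kernel is
    registered and FP8 attention is not in use; otherwise prewrite."""
    if has_high_priority_prefill_kernel and not use_fp8_attention:
        return AttnPlan.POSTWRITE
    return AttnPlan.PREWRITE


def prefill_step(plan: AttnPlan) -> list:
    """Return the order of operations for one prefill step (names only)."""
    if plan is AttnPlan.PREWRITE:
        return ["write_kv_cache", "mha_extend"]
    return ["mha_prefill", "write_kv_cache"]


if __name__ == "__main__":
    for has_kernel in (False, True):
        for fp8 in (False, True):
            plan = choose_attn_plan(has_kernel, fp8)
            print(has_kernel, fp8, plan.value, prefill_step(plan))
```

Under this rule, FP8 attention always takes the prewrite path in the sketch, regardless of which kernels are registered.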
Test Plan