feat(kernel): introduce unified moe kernel api #374

Merged
borontion merged 86 commits into main from borontion/refactor-moe-api
Jun 12, 2026

Conversation

@borontion borontion commented Jun 7, 2026

Contributor

Summary

This PR moves the MoE backend concepts into tokenspeed-kernel. The MoE layer becomes platform-agnostic and invokes the MoE kernel API.

MoE entrypoint

Introduce a single moe_apply API. Under the hood, it can launch one or more kernels depending on the implementation. This API is too coarse at the kernel level; moving forward, we will replace it with a kernel execution plan.

See tokenspeed-kernel/python/tokenspeed_kernel/ops/moe/__init__.py.

Supported type and backend

Currently I keep the following weight dtype and implementation combinations:

  • Unquant (fp16): flashinfer cutlass
  • fp8 (with scale): flashinfer cutlass
  • mxfp4: triton / gluon / flashinfer trtllm
  • nvfp4: flashinfer cutlass / flashinfer trtllm / flashinfer cutedsl

Other data types are not actively used in our current target models. This PR also drops the old triton backend for fp8/unquant, which has convoluted dependencies. A portable implementation for fp8/unquant that can run on AMD GPUs will be added later. The backend name 'triton_kernel' is also renamed to 'triton'.
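
The combinations listed above can be written down as a small lookup table. The dict below only restates the list from this description; the key and value spellings and the helper name backends_for are illustrative and are not identifiers taken from the codebase.

```python
# Weight dtype -> backends kept by this PR, transcribed from the list above.
# The key and value spellings are illustrative, not identifiers from the repo.
SUPPORTED_MOE_BACKENDS = {
    "unquant_fp16": ["flashinfer_cutlass"],
    "fp8": ["flashinfer_cutlass"],
    "mxfp4": ["triton", "gluon", "flashinfer_trtllm"],
    "nvfp4": ["flashinfer_cutlass", "flashinfer_trtllm", "flashinfer_cutedsl"],
}


def backends_for(weight_dtype: str) -> list:
    """Return the backends listed for a weight dtype, or fail loudly."""
    try:
        return SUPPORTED_MOE_BACKENDS[weight_dtype]
    except KeyError:
        raise ValueError(
            f"no MoE backend listed for dtype {weight_dtype!r}"
        ) from None
```

Failing loudly for an unlisted dtype matches the intent of the description, where other data types are deliberately left out for now.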

Kernel early selection

MoE differs from attention in that the weights need to be preshuffled after the model is loaded. The preshuffle and the actual MoE implementation are tightly coupled, so we need to determine the kernel implementation early, before the engine accepts incoming requests.

This is done by adding a dedicated moe_plan utility function. It selects the corresponding MoE kernel based on various features and the current platform. The selection is encoded into a dict that includes the target kernel name, and this dict is passed to moe_apply.
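
To make the early-selection flow concrete, here is a minimal sketch of a planning step that runs once at load time and yields a dict carrying the kernel name. Everything in it is assumed for illustration: the Candidate record, the priority numbers, the platform tags, and the function name moe_plan_sketch are hypothetical and do not mirror the real moe_plan signature.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Candidate:
    name: str                   # kernel name later handed to the apply step
    dtype: str                  # weight dtype the kernel handles
    platforms: Tuple[str, ...]  # platforms the kernel is assumed to run on
    priority: int               # higher wins; the values here are made up


# Hypothetical candidate table; priorities and platform tags are assumptions.
CANDIDATES: List[Candidate] = [
    Candidate("flashinfer_trtllm", "mxfp4", ("nvidia",), 20),
    Candidate("triton", "mxfp4", ("nvidia", "amd"), 10),
    Candidate("gluon", "mxfp4", ("amd",), 5),
]


def moe_plan_sketch(dtype: str, platform: str) -> Dict[str, str]:
    """Pick the highest-priority kernel matching dtype and platform."""
    matches = [
        c for c in CANDIDATES
        if c.dtype == dtype and platform in c.platforms
    ]
    if not matches:
        raise ValueError(f"no MoE kernel for dtype={dtype!r} on {platform!r}")
    best = max(matches, key=lambda c: c.priority)
    return {"kernel": best.name, "dtype": dtype, "platform": platform}


# Planning happens once, before weights are preshuffled and requests arrive.
plan_amd = moe_plan_sketch("mxfp4", "amd")
plan_nv = moe_plan_sketch("mxfp4", "nvidia")
```

Because the plan is fixed before any request is served, the weight preshuffle can be done for exactly the kernel named in the dict.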

AMD Gluon kernel

Decouple the AMD gluon kernel from other kernel utilities by copying the code it uses into the tokenspeed-kernel-amd module. This includes three parts:

  1. fp8 quant
  2. fallback triton kernel routing
  3. fallback triton kernel group gemm

See: tokenspeed-kernel-amd/python/tokenspeed_kernel_amd/ops/moe/utils.py for ported triton kernel code.

Currently I lower the priority of the gluon kernel due to a known performance issue.

borontion added 30 commits June 4, 2026 13:58
Signed-off-by: Pengzhan Zhao <borontion@gmail.com>
borontion added 12 commits June 10, 2026 22:23
Signed-off-by: Pengzhan Zhao <borontion@gmail.com>
@borontion borontion changed the title from "[WIP, DO NOT MERGE] feat(kernel): introduce moe kernel api" to "feat(kernel): introduce unified moe kernel api" on Jun 11, 2026

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 682fdb6fb2

Comment thread python/tokenspeed/runtime/layers/moe/expert.py Outdated
Signed-off-by: Pengzhan Zhao <borontion@gmail.com>
panditsa added a commit to panditsa/tokenspeed that referenced this pull request Jun 11, 2026
Squashed warp-decode work (coop-LDS stage1 + per-M split-K stage2, interleave/
K-tail/scale fixes) for rebase onto the PR lightseekorg#374 MoE-API refactor. Full
per-commit history preserved in backup/gptoss-warp-decode-moe-* .

Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>

@antiagainst antiagainst left a comment

Member

Awesome! This looks like a large change, but it is mostly moving code around and fleshing out the kernel interface and registration. Just two nits from me.

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/moe/gluon/mxfp4.py
Comment thread tokenspeed-kernel/python/tokenspeed_kernel/selection.py
Signed-off-by: Pengzhan Zhao <borontion@gmail.com>
@borontion borontion merged commit 38b3a35 into main Jun 12, 2026
4 of 36 checks passed
@borontion borontion deleted the borontion/refactor-moe-api branch June 12, 2026 00:35
panditsa added a commit to panditsa/tokenspeed that referenced this pull request Jun 12, 2026
Squashed warp-decode work (coop-LDS stage1 + per-M split-K stage2, interleave/
K-tail/scale fixes) for rebase onto the PR lightseekorg#374 MoE-API refactor. Full
per-commit history preserved in backup/gptoss-warp-decode-moe-* .

Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>

2 participants