feat(kernel): introduce unified moe kernel api by borontion · Pull Request #374 · lightseekorg/tokenspeed

borontion · 2026-06-07T09:06:16Z

Summary

This PR moves the MoE backend concepts into tokenspeed-kernel. The MoE layer becomes a platform-agnostic layer that invokes the MoE kernel API.

MoE entrypoint

Introduce a single moe_apply API. Under the hood it can launch one or more kernels, depending on the implementation. This API is too coarse at the kernel level; moving forward, we will replace it with a kernel execution plan.

See tokenspeed-kernel/python/tokenspeed_kernel/ops/moe/__init__.py.

Supported type and backend

Currently I keep the following weight dtype and implementation combinations:

Unquant (fp16): flashinfer cutlass
fp8 (with scale): flashinfer cutlass
mxfp4: triton / gluon / flashinfer trtllm
nvfp4: flashinfer cutlass / flashinfer trtllm / flashinfer cutedsl

Other data types are not actively used in our current target models. This PR also drops the old triton backend for fp8/unquantized, which has convoluted dependencies. A portable implementation for fp8/unquantized that can run on AMD GPUs will be added later. The backend name 'triton_kernel' is also renamed to 'triton'.
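The support matrix above can be written down as a lookup table. The sketch below is only an illustration: the identifier spellings (for example "flashinfer_cutlass"), the table name, and both helper functions are hypothetical, and whether the real code accepts the old 'triton_kernel' name as an alias is an assumption.

```python
# Weight dtype -> backends kept by this PR, per the list in the description.
# The string spellings are assumptions, not the names used in the code base.
SUPPORTED_MOE_BACKENDS = {
    "fp16": ("flashinfer_cutlass",),
    "fp8": ("flashinfer_cutlass",),
    "mxfp4": ("triton", "gluon", "flashinfer_trtllm"),
    "nvfp4": ("flashinfer_cutlass", "flashinfer_trtllm", "flashinfer_cutedsl"),
}

# Old backend name -> new backend name (alias handling is an assumption).
_RENAMED_BACKENDS = {"triton_kernel": "triton"}


def normalize_backend(name: str) -> str:
    """Map a renamed backend to its current name; leave other names alone."""
    return _RENAMED_BACKENDS.get(name, name)


def check_supported(weight_dtype: str, backend: str) -> str:
    """Return the normalized backend name or raise if the pair is not kept."""
    backend = normalize_backend(backend)
    allowed = SUPPORTED_MOE_BACKENDS.get(weight_dtype)
    if allowed is None:
        raise ValueError(f"unsupported weight dtype: {weight_dtype}")
    if backend not in allowed:
        raise ValueError(f"{backend} does not support {weight_dtype}")
    return backend
```

With this table, asking for the dropped triton backend on fp8 raises a ValueError, while the old 'triton_kernel' spelling on mxfp4 resolves to 'triton'.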

Kernel early selection

MoE differs from attention in that the weights need to be preshuffled after the model is loaded. The preshuffle and the actual MoE implementation are tightly coupled, so we need to determine the kernel implementation early, before the engine accepts incoming requests.

This is done by adding a dedicated moe_plan utility function. It selects the corresponding MoE kernel based on various features and the current platform. The selection is encoded into a dict that includes the target kernel name and is passed to moe_apply.
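To make the coupling between preshuffle and kernel concrete, here is a toy sketch. Everything in it is hypothetical: the MoeKernel dataclass, the kernel names "kernel_a" and "kernel_b", the load_weights helper, the selection rule, and the signatures of moe_plan and moe_apply are invented for illustration, and integer lists stand in for weight tensors.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class MoeKernel:
    preshuffle: Callable[[List[int]], List[int]]  # layout change, done once at load
    run: Callable[[List[int], List[int]], int]    # expects weights in that layout


def _dot(w: List[int], x: List[int]) -> int:
    return sum(wi * xi for wi, xi in zip(w, x))


KERNELS: Dict[str, MoeKernel] = {
    # Consumes weights in their original order.
    "kernel_a": MoeKernel(preshuffle=list, run=_dot),
    # Consumes weights stored reversed, so it needs the matching preshuffle.
    "kernel_b": MoeKernel(
        preshuffle=lambda w: w[::-1],
        run=lambda w, x: _dot(w[::-1], x),
    ),
}


def moe_plan(platform: str) -> Dict[str, str]:
    """Pick the kernel once, before serving (the rule here is made up)."""
    return {"kernel": "kernel_b" if platform == "amd" else "kernel_a"}


def load_weights(plan: Dict[str, str], raw: List[int]) -> List[int]:
    """Preshuffle with the layout of the kernel chosen in the plan."""
    return KERNELS[plan["kernel"]].preshuffle(raw)


def moe_apply(plan: Dict[str, str], w: List[int], x: List[int]) -> int:
    return KERNELS[plan["kernel"]].run(w, x)
```

Weights shuffled for one kernel give a wrong result when run with the other, which is why the plan has to exist before weight loading finishes rather than being chosen per request.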

AMD Gluon kernel

Decouple the AMD gluon kernel from other kernel utilities by copying the code it uses into the tokenspeed-kernel-amd module. This includes 3 parts:

fp8 quant
fallback triton kernel routing
fallback triton kernel group gemm

See tokenspeed-kernel-amd/python/tokenspeed_kernel_amd/ops/moe/utils.py for the ported triton kernel code.

Currently I lower the priority of the gluon kernel due to a known performance issue.
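One way to picture what lowering a kernel's priority means is a priority-ordered candidate list. This is a sketch under assumptions, not the logic in tokenspeed_kernel/selection.py: the Candidate dataclass, the priority values, the platform sets, and the select function are all hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable


@dataclass(frozen=True)
class Candidate:
    name: str
    platforms: FrozenSet[str]
    dtypes: FrozenSet[str]
    priority: int  # higher wins; the numbers below are made up


CANDIDATES = [
    Candidate("triton", frozenset({"amd"}), frozenset({"mxfp4"}), priority=10),
    # Lower priority: still registered and selectable, but not the default.
    Candidate("gluon", frozenset({"amd"}), frozenset({"mxfp4"}), priority=5),
]


def select(platform: str, dtype: str, exclude: Iterable[str] = ()) -> str:
    """Return the highest-priority candidate that matches the request."""
    excluded = set(exclude)
    eligible = [
        c for c in CANDIDATES
        if platform in c.platforms
        and dtype in c.dtypes
        and c.name not in excluded
    ]
    if not eligible:
        raise LookupError(f"no MoE kernel for {dtype} on {platform}")
    return max(eligible, key=lambda c: c.priority).name
```

Under these made-up priorities, the gluon kernel stays available as a fallback and can return to being the default by raising its number once the performance issue is resolved.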

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 682fdb6fb2



Squashed warp-decode work (coop-LDS stage1 + per-M split-K stage2, interleave/ K-tail/scale fixes) for rebase onto the PR lightseekorg#374 MoE-API refactor. Full per-commit history preserved in backup/gptoss-warp-decode-moe-* . Signed-off-by: Sanket Pandit <sanket.pandit@amd.com>

antiagainst

Awesome! Looks large change but mostly moving code around and fleshing out kernel interface and registration. Just two nits from me.


borontion added 30 commits June 4, 2026 13:58

rename

de7504a

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

clean up unused backends

629ba4a

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

remove unquantized backend

39925c6

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

inline fp8 impl

c212cc6

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

cleanup unused kernels

fd9197a

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

add fp16 moe back

5a21874

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

clean up deepep wrapper

e04c84f

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

cleanup import

00dcb67

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

update moe helpers

c9834d2

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

delete dead code

4f67b1e

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

move deep ep abstractions into kernel

cb3bc0f

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

tmp delete moe apis

54b8a89

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

cleanup dispatch info

cb3cf8a

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

delete triton_config

d21cb7a

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

cleanup

63dc1b0

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

fix import

43e3a6a

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

delete del slop

25140af

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

cleanup

9740f85

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

add ignore layer prop

716482a

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

cleanup

9420f33

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

fix import

697dccb

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

scaffold moe apis in v2

8851dfd

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

decompose flashinfer impl

ceaa344

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

decompose triton impl

a38bef6

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

delete triton moe backends

b025df3

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

decompose triton and gluon kernel

35da56e

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

drop misc moe related kernels

bfad5e0

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

clean up tests

f7514be

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

add trtllm mxfp4

2eeebe3

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

fix api

e691f48

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

borontion added 12 commits June 10, 2026 22:23

fix routing

6127bea

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

fix

9c298d5

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

fix

9c431e6

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

fix

8efbda8

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

fix

527d7a3

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

cleanup

1c09a46

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

use triton for backend

67d4368

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

adjust priority

ffbd54f

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

adjust priority

40e3d08

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

add capability

797a859

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

drop test

5a7a0f4

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

update tests

682fdb6

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

borontion changed the title ~~[WIP, DO NOT MERGE] feat(kernel): introduce moe kernel api~~ feat(kernel): introduce unified moe kernel api Jun 11, 2026

chatgpt-codex-connector Bot reviewed Jun 11, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/layers/moe/expert.py Outdated

borontion added 5 commits June 11, 2026 00:03

drop unused triat

43f084e

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

lower gluon kernel perf

21db936

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

revert backend selection

aa0e03f

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

fix

d4e10d6

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

Merge branch 'main' into borontion/refactor-moe-api

7433849

Max191 mentioned this pull request Jun 11, 2026

fix(kernel): Fix gluon MoE GEMM numerical bug #421

Merged

borontion added 4 commits June 11, 2026 13:09

Merge remote-tracking branch 'origin/main' into borontion/refactor-mo…

3d8a647

…e-api

cleanup tests

82c9ddb

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

fix test

3639ca0

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

fix test

df0faf1

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

antiagainst approved these changes Jun 11, 2026

View reviewed changes

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/ops/moe/gluon/mxfp4.py

Comment thread tokenspeed-kernel/python/tokenspeed_kernel/selection.py

add doc

95fbd00

Signed-off-by: Pengzhan Zhao <borontion@gmail.com>

borontion merged commit 38b3a35 into main Jun 12, 2026
4 of 36 checks passed

borontion deleted the borontion/refactor-moe-api branch June 12, 2026 00:35

borontion merged 86 commits into main from borontion/refactor-moe-api


2 participants
