
[Feature] SOTA LFM2 (hybrid short-conv + GQA) inference on gfx942 (CDNA3) and Blackwell #450

@vincentzed

Checklist

  • I searched existing issues and open PRs before filing this request.

Problem and motivation

TokenSpeed's stated mission is to be the most performant engine for production
agentic workloads. The LFM2 family (Liquid Foundation Models 2) is
purpose-built for exactly that regime — fast, cheap, low-latency generation —
but there is no first-class path to serve it on TokenSpeed today, and the two
hardware targets we care most about are only half-covered:

  • gfx942 (AMD CDNA3: MI300X / MI325X) is not on the supported list. The
    issue templates declare AMD support as MI350X / MI355X only (gfx950 /
    CDNA4). A large installed base of data-center inference capacity is
    gfx942, and that is where we want to run LFM2.
  • Blackwell (B200 / B300, SM100 / SM103) is supported and already hosts
    TokenSpeed's SOTA MLA path — but LFM2 is GQA-based, not MLA, and its
    dominant compute is a gated short convolution that has no kernel home in
    the registry yet.

Why LFM2 is a good fit for TokenSpeed specifically. LFM2 is a hybrid
architecture: a stack of double-gated short-range convolution blocks
interleaved with a minority of grouped-query attention (GQA) blocks (roughly
10 conv : 6 attention at the 1.2B scale). The consequences for an
inference engine are exactly the things TokenSpeed optimizes for:

  • Most layers are KV-cache-free. Only the GQA blocks hold a KV cache; the
    conv blocks carry a tiny fixed-size recurrent conv state that is independent
    of context length. That collapses KV memory and bandwidth — the primary
    decode-throughput / TPOT lever for agentic serving.
  • O(L) prefill, O(1)-state decode on the conv path → low TTFT and flat
    memory growth across long multi-turn agent transcripts.
  • The family spans the agentic-serving sweet spot: dense LFM2-1.2B / 2.6B,
    the LFM2-8B-A1B MoE variant, and multimodal LFM2-VL — all of which
    want the same conv + GQA kernels underneath.
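The KV-memory claim in the first bullet can be made concrete with a back-of-the-envelope sketch. The layer split (10 conv, 6 attention) is from the description above; the hidden size, KV-head count, head dimension, and kernel width below are illustrative assumptions, not values taken from a published LFM2 config.

```python
# Rough per-sequence memory arithmetic for a hybrid conv + GQA stack.
# All shape constants are ASSUMED for illustration only.
N_CONV, N_ATTN = 10, 6          # layer split from the issue text (1.2B scale)
HIDDEN = 2048                   # assumed hidden size
N_KV_HEADS, HEAD_DIM = 8, 64    # assumed GQA shape
KERNEL_WIDTH = 3                # assumed conv kernel width


def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, context_len,
                   batch=1, dtype_bytes=2):
    """K and V each hold context_len * n_kv_heads * head_dim values per layer."""
    return (2 * n_attn_layers * n_kv_heads * head_dim
            * context_len * batch * dtype_bytes)


def conv_state_bytes(n_conv_layers, hidden, kernel_width,
                     batch=1, dtype_bytes=2):
    """Each conv layer keeps the last (kernel_width - 1) inputs per channel."""
    return n_conv_layers * hidden * (kernel_width - 1) * batch * dtype_bytes


def summarize(context_len, batch=1):
    """Compare the hybrid stack with a same-depth all-attention stack."""
    return {
        "hybrid_kv": kv_cache_bytes(N_ATTN, N_KV_HEADS, HEAD_DIM,
                                    context_len, batch),
        "hybrid_conv_state": conv_state_bytes(N_CONV, HIDDEN,
                                              KERNEL_WIDTH, batch),
        "all_attention_kv": kv_cache_bytes(N_CONV + N_ATTN, N_KV_HEADS,
                                           HEAD_DIM, context_len, batch),
    }


if __name__ == "__main__":
    for ctx in (4096, 32768):
        print(ctx, summarize(ctx))
```

Under these assumed shapes, the conv state is a constant 80 KiB per sequence at 2 bytes per element, while the 6 GQA layers need 384 MiB at a 32k context, against 1 GiB if all 16 layers were attention.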

Net: LFM2 should be able to deliver higher tokens/s/GPU and lower TPOT than a
comparably capable dense-attention model on both gfx942 and Blackwell — but
only if the conv operator and the hybrid state cache are first-class citizens.

Proposed solution (RFC / design sketch)

This decomposes into a model definition plus a small number of new kernels that
slot into the existing registry and the existing hybrid-state-cache machinery.

1. Model definition: runtime/models/lfm2.py (+ lfm2_moe.py),
registered like the current hybrid models. The block stack mixes:

  • ShortConvBlock: input-varying, double-gated depthwise causal 1D conv
    (small kernel, e.g. width 3) + gating + RMSNorm. No state is shared across
    requests; during decode each sequence carries a (kernel_width - 1)-wide
    conv state.
  • GQAAttentionBlock: standard GQA — reuse existing attention backends as-is
    (FA4 / FlashInfer / TRT-LLM on Blackwell; Gluon on AMD).
  • SwiGLU MLP (dense) or MoE (reuse ops/moe/{flashinfer,gluon,triton}).
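To pin down what the ShortConvBlock operator would compute, here is a minimal pure-Python reference for the prefill path. It assumes "double-gated" means one input-dependent gate applied before the depthwise conv and one after. The names `b_gate` and `c_gate` are hypothetical, and the input/output projections and RMSNorm are left out. This is reference semantics for testing a kernel against, not a proposed implementation.

```python
def gated_short_conv_prefill(b_gate, c_gate, x, weight):
    """Reference double-gated causal depthwise conv over one sequence.

    b_gate, c_gate, x: lists shaped [seq_len][channels]
    weight:            list shaped [channels][kernel_width]; the LAST tap
                       multiplies the current position (left-padded conv).
    Returns a list shaped [seq_len][channels].
    """
    seq_len = len(x)
    kernel_width = len(weight[0])
    out = []
    for t in range(seq_len):
        row = []
        for ch, taps in enumerate(weight):
            acc = 0.0
            for j in range(kernel_width):
                src = t - (kernel_width - 1) + j  # causal: src <= t
                if src >= 0:                      # implicit zero padding
                    acc += taps[j] * b_gate[src][ch] * x[src][ch]
            row.append(c_gate[t][ch] * acc)       # second gate after the conv
        out.append(row)
    return out


if __name__ == "__main__":
    ones = [[1.0]] * 4
    xs = [[1.0], [2.0], [3.0], [4.0]]
    # With unit gates and unit taps this is a causal sliding sum of width 3.
    print(gated_short_conv_prefill(ones, ones, xs, [[1.0, 1.0, 1.0]]))
```

Each channel is convolved independently (depthwise), so the cost is O(L * channels * kernel_width), which is where the O(L) prefill claim comes from.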

2. Hybrid state cache. Reuse the constant-memory recurrent-state path
already built for Mamba/SSM hybrids (--mamba-ssm-dtype,
--mamba-full-memory-ratio; cf. longcat_flash, qwen3_5). The conv state is
simpler than an SSM state — a fixed (B, H, kernel_width - 1) ring buffer
living alongside the GQA KV cache. The placement compiler should size conv-state
and KV identically for TP / EP.
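A sketch of the per-sequence decode update this implies, under the same assumption as above (one gate before the conv, one after; names are hypothetical). For readability it shifts a small list rather than keeping a true ring index. A real kernel would presumably track a write position instead.

```python
def init_conv_state(channels, kernel_width):
    """Zero state: the last (kernel_width - 1) gated inputs per channel."""
    return [[0.0] * (kernel_width - 1) for _ in range(channels)]


def conv_decode_step(state, b_t, c_t, x_t, weight):
    """One decode step for a single sequence; updates `state` in place.

    state:  [channels][kernel_width - 1], oldest entry first
    b_t, c_t, x_t: per-channel values for the new token, each [channels]
    weight: [channels][kernel_width]; last tap multiplies the new token
    Returns the per-channel output for the new token.
    """
    out = []
    for ch, taps in enumerate(weight):
        gated = b_t[ch] * x_t[ch]              # first gate, before the conv
        window = state[ch] + [gated]           # kernel_width values
        acc = sum(w * v for w, v in zip(taps, window))
        out.append(c_t[ch] * acc)              # second gate, after the conv
        state[ch] = window[1:]                 # drop oldest, keep k - 1
    return out


if __name__ == "__main__":
    weight = [[1.0, 1.0, 1.0]]
    state = init_conv_state(channels=1, kernel_width=3)
    for value in (1.0, 2.0, 3.0, 4.0):
        print(conv_decode_step(state, [1.0], [1.0], [value], weight))
    print("final state:", state)
```

The state size depends only on channels and kernel width, never on how many tokens have been decoded, which is the property that would let it share the constant-memory path with the Mamba/SSM states.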

3. Kernels (the genuinely new work), following the repo's backend-choice
rule — CuTe DSL for NVIDIA, Triton Gluon for AMD, Triton for portable:

| op | Blackwell (SM100) | gfx942 (CDNA3) | portable fallback |
|---|---|---|---|
| gated short-conv prefill (causal depthwise) | CuTe DSL | Triton Gluon | Triton |
| gated short-conv decode (1-step state update) | CuTe DSL / fused | Triton Gluon | Triton |
| GQA attention | existing FA4 / FlashInfer / TRT-LLM | existing Gluon | existing Triton |
| fused gate + RMSNorm | reuse ops/layernorm + activation fusions | Gluon | Triton |

The conv prefill/decode kernels are the only new primitives; everything else is
wiring and reuse.
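As a way to read the backend-choice rule for the two new ops, here is a toy lookup with a portable fallback. Everything in it is hypothetical (the op names, arch keys, backend labels, and `select_backend` itself). It is not TokenSpeed's registry API, whose actual `register_kernel` signatures are one of the things to discuss.

```python
# HYPOTHETICAL dispatch table mirroring the conv rows of the kernel table.
# None of these identifiers come from the TokenSpeed codebase.
PORTABLE_BACKEND = "triton"

CONV_BACKENDS = {
    "short_conv_prefill": {"sm100": "cute_dsl", "gfx942": "triton_gluon"},
    "short_conv_decode": {"sm100": "cute_dsl", "gfx942": "triton_gluon"},
}


def select_backend(op, arch):
    """Return the specialized backend for (op, arch), else the portable one."""
    if op not in CONV_BACKENDS:
        raise KeyError(f"unknown op: {op}")
    return CONV_BACKENDS[op].get(arch, PORTABLE_BACKEND)


if __name__ == "__main__":
    for arch in ("sm100", "gfx942", "sm90"):
        print(arch, select_backend("short_conv_decode", arch))
```

The point of the fallback is that a portable Triton kernel can land first and stay as the correctness reference while the CuTe DSL and Gluon specializations are added.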

4. gfx942 enablement. Independent of LFM2: confirm/extend the AMD build to
emit gfx942 alongside gfx950, and gate the conv Gluon kernels for CDNA3. If
gfx942 is intentionally out of scope, please say so explicitly so we can plan a
fork target instead.

Definition of "SOTA" / done. At or above TRT-LLM / vLLM tokens-per-s-per-GPU
and lower TPOT for an agentic decode profile (short prompts, long multi-turn
decode, high concurrency), measured with the in-repo test/agentic_benchmark
harness on B200 / B300 and MI325X. Concrete targets to be set against a baseline
once the path runs end-to-end; the README's 580-TPS framing is the bar for what
a follow-up "SOTA LFM2" number should read like.
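For clarity on what would be compared, here is a minimal sketch of the metrics using their common definitions: TTFT as time to the first output token, TPOT as mean inter-token time after the first token, and throughput normalized by GPU count. The function names are hypothetical, and the in-repo harness may define or aggregate these differently.

```python
def request_latency(request_start, token_times):
    """Per-request TTFT and TPOT from wall-clock timestamps in seconds.

    request_start: time the request was submitted
    token_times:   arrival time of each output token, in order
    """
    if not token_times:
        raise ValueError("need at least one output token")
    ttft = token_times[0] - request_start
    n = len(token_times)
    # TPOT excludes the first token, which is dominated by prefill.
    tpot = (token_times[-1] - token_times[0]) / (n - 1) if n > 1 else 0.0
    return {"ttft_s": ttft, "tpot_s": tpot}


def tokens_per_s_per_gpu(total_output_tokens, wall_time_s, n_gpus):
    """Aggregate decode throughput normalized by the number of GPUs."""
    return total_output_tokens / wall_time_s / n_gpus


if __name__ == "__main__":
    print(request_latency(0.0, [0.5, 0.75, 1.0, 1.25, 1.5]))
    print(tokens_per_s_per_gpu(1000, 2.0, 4))
```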

Execution plan

I understand that core features are core-team-driven, so I am filing this as
an RFC to gauge roadmap fit before any implementation, per the project's
RFC-first guidance for larger features (#149). Happy to:

  1. Open a design discussion on the conv-state cache integration and the
    conv-kernel API surface (register_kernel signatures).
  2. Contribute the portable Triton conv prefill/decode kernels + a reference
    lfm2.py model definition as a first, verifiable slice (small, testable,
    matching the existing hybrid-model style), with CuTe DSL / Gluon
    specializations as follow-ups.
  3. Bring gfx942 (MI325X) and B200 hardware + the agentic-benchmark numbers for
    review.

Open questions for the core team:

  • Is gfx942 (CDNA3) on the roadmap at all, or is CDNA4-only (gfx950) a
    deliberate line?
  • Is the Mamba/SSM hybrid-state-cache path the intended home for a conv
    recurrent state, or is there a preferred generic recurrent-state abstraction?
  • Preferred order: dense LFM2 first, or go straight to LFM2-8B-A1B (MoE) to
    exercise the EP path?
