[Feature] SOTA LFM2 (hybrid short-conv + GQA) inference on gfx942 (CDNA3) and Blackwell

## Checklist

- [x] I searched existing issues and open PRs before filing this request.

## Problem and motivation

TokenSpeed's stated mission is to be the most performant engine for production
agentic workloads. The **LFM2** family (Liquid Foundation Models 2) is
purpose-built for exactly that regime — fast, cheap, low-latency generation —
but there is no first-class path to serve it on TokenSpeed today, and the two
hardware targets we care most about are only half-covered:

- **gfx942 (AMD CDNA3 — MI300X / MI325X)** is *not* on the supported list. The
  issue templates declare AMD support as **MI350X / MI355X only (gfx950 /
  CDNA4)**. A large installed base of data-center inference capacity is gfx942,
  and that is where we want to run LFM2.
- **Blackwell (B200 / B300, SM100 / SM103)** is supported and already hosts
  TokenSpeed's SOTA MLA path — but LFM2 is **GQA-based, not MLA**, and its
  dominant compute is a *gated short convolution* that has no kernel home in
  the registry yet.

**Why LFM2 is a good fit for TokenSpeed specifically.** LFM2 is a **hybrid
architecture** — a stack of *double-gated short-range convolution* blocks
interleaved with a minority of *grouped-query attention (GQA)* blocks (roughly
10 conv : 6 attention blocks at the 1.2B scale). The consequences for an
inference engine are exactly the things TokenSpeed optimizes for:

- **Most layers are KV-cache-free.** Only the GQA blocks hold a KV cache; the
  conv blocks carry a tiny fixed-size recurrent conv state that is independent
  of context length. That collapses KV memory and bandwidth — the primary
  decode-throughput / TPOT lever for agentic serving.
- **O(L) prefill, O(1)-state decode** on the conv path → low TTFT and flat
  memory growth across long multi-turn agent transcripts.
- The family spans the agentic-serving sweet spot: dense **LFM2-1.2B / 2.6B**,
  the **LFM2-8B-A1B MoE** variant, and multimodal **LFM2-VL** — all of which
  want the same conv + GQA kernels underneath.

Net: LFM2 should be able to deliver higher tokens/s/GPU and lower TPOT than a
comparably capable dense-attention model on both gfx942 and Blackwell — but
only if the conv operator and the hybrid state cache are first-class citizens.
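
To make the memory argument concrete, here is a back-of-the-envelope sketch of per-sequence cache size for a 10 conv : 6 attention stack versus a hypothetical 16-layer all-attention stack. The head count, head dimension, channel width, and kernel width below are placeholder values chosen for illustration, not the published LFM2 configuration.

```python
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim, context_len, dtype_bytes=2):
    """Bytes of K and V held per sequence; grows linearly with context_len."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * context_len * dtype_bytes


def conv_state_bytes(n_conv_layers, channels, kernel_width, dtype_bytes=2):
    """Bytes of conv state per sequence; independent of context length."""
    return n_conv_layers * channels * (kernel_width - 1) * dtype_bytes


# Placeholder shapes for illustration only (NOT the published LFM2 config).
N_KV_HEADS, HEAD_DIM, CHANNELS, KERNEL_WIDTH = 8, 64, 2048, 3

for ctx in (4_096, 32_768, 131_072):
    hybrid = (kv_cache_bytes(6, N_KV_HEADS, HEAD_DIM, ctx)
              + conv_state_bytes(10, CHANNELS, KERNEL_WIDTH))
    all_attn = kv_cache_bytes(16, N_KV_HEADS, HEAD_DIM, ctx)
    print(f"ctx={ctx:>7}: hybrid={hybrid / 2**20:8.1f} MiB  "
          f"all-attention={all_attn / 2**20:8.1f} MiB")
```

Under these assumptions the KV term scales with the 6/16 attention-layer fraction, while the conv-state term stays constant as the context grows.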

## Proposed solution (RFC / design sketch)

This decomposes into a model definition plus a small number of new kernels that
slot into the existing registry and the existing hybrid-state-cache machinery.

**1. Model definition** — `runtime/models/lfm2.py` (+ `lfm2_moe.py`),
registered like the current hybrid models. The block stack mixes:

- `ShortConvBlock`: input-varying, double-gated depthwise **causal 1D conv**
  (small kernel, e.g. width 3) + gating + RMSNorm. No state is shared across
  requests; during decode each sequence carries a `(kernel_width - 1)`-wide
  conv state.
- `GQAAttentionBlock`: standard GQA — reuse existing attention backends as-is
  (FA4 / FlashInfer / TRT-LLM on Blackwell; Gluon on AMD).
- SwiGLU MLP (dense) or MoE (reuse `ops/moe/{flashinfer,gluon,triton}`).
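
As a reference point for the `ShortConvBlock` prefill path, the following NumPy sketch shows one plausible reading of the double-gated causal depthwise conv. It assumes the gating order "input gate, then conv, then output gate" from the public LFM2 description. The function name, weight names, and shapes are hypothetical, and RMSNorm is omitted for brevity.

```python
import numpy as np


def gated_short_conv_prefill(x, w_in, w_conv, w_out):
    """Reference prefill for one sequence.

    x:      (T, D)   token activations
    w_in:   (D, 3D)  projection producing gate b, gate c, and conv input h
    w_conv: (D, K)   depthwise taps, column K-1 multiplies the current token
    w_out:  (D, D)   output projection
    """
    T, D = x.shape
    K = w_conv.shape[1]
    b, c, h = np.split(x @ w_in, 3, axis=-1)      # each (T, D)
    u = b * h                                     # first (input) gate
    # Left-pad with K-1 zeros so position t only sees positions <= t.
    u_pad = np.concatenate([np.zeros((K - 1, D)), u], axis=0)
    conv = np.zeros_like(u)
    for k in range(K):
        conv += u_pad[k:k + T] * w_conv[:, k]     # per-channel (depthwise) tap
    return (c * conv) @ w_out                     # second (output) gate


rng = np.random.default_rng(0)
T, D, K = 6, 4, 3
y = gated_short_conv_prefill(
    rng.standard_normal((T, D)),
    rng.standard_normal((D, 3 * D)),
    rng.standard_normal((D, K)),
    rng.standard_normal((D, D)),
)
print(y.shape)
```

A portable Triton kernel could be validated against a reference of this shape before any CuTe DSL or Gluon specialization is written.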

**2. Hybrid state cache.** Reuse the constant-memory recurrent-state path
already built for Mamba/SSM hybrids (`--mamba-ssm-dtype`,
`--mamba-full-memory-ratio`; cf. `longcat_flash`, `qwen3_5`). The conv state is
*simpler* than an SSM state — a fixed `(B, H, kernel_width - 1)` ring buffer
living alongside the GQA KV cache. The placement compiler should size conv-state
and KV identically for TP / EP.
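
The decode-time state update can be sketched as below. This is a minimal illustration that reads `H` as the conv channel dimension (an assumption) and uses a shift-style window rather than a true ring buffer with a moving write index. The function name is hypothetical and `u_t` stands for the already input-gated activation of the new token.

```python
import numpy as np


def conv_decode_step(u_t, state, w_conv):
    """One decode step of a causal depthwise conv.

    u_t:    (B, H)       gated input for the new token
    state:  (B, H, K-1)  previous K-1 inputs, oldest first
    w_conv: (H, K)       depthwise taps, column K-1 multiplies the current token
    Returns the conv output (B, H) and the updated state (B, H, K-1).
    """
    window = np.concatenate([state, u_t[:, :, None]], axis=-1)  # (B, H, K)
    y_t = np.einsum("bhk,hk->bh", window, w_conv)
    new_state = window[:, :, 1:]          # drop the oldest input, keep K-1
    return y_t, new_state


rng = np.random.default_rng(0)
B, H, K, T = 2, 4, 3, 5
u = rng.standard_normal((B, T, H))
w = rng.standard_normal((H, K))

state = np.zeros((B, H, K - 1))           # empty history at sequence start
outputs = []
for t in range(T):
    y_t, state = conv_decode_step(u[:, t], state, w)
    outputs.append(y_t)
print(np.stack(outputs, axis=1).shape, state.shape)
```

The state size depends only on `B`, `H`, and `kernel_width`, which is why it can sit next to the KV cache without growing with context length.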

**3. Kernels** (the genuinely new work), following the repo's backend-choice
rule — CuTe DSL for NVIDIA, Triton Gluon for AMD, Triton for portable:

| op | Blackwell (SM100) | gfx942 (CDNA3) | portable fallback |
|----|-------------------|----------------|-------------------|
| gated short-conv **prefill** (causal depthwise) | CuTe DSL | Triton Gluon | Triton |
| gated short-conv **decode** (1-step state update) | CuTe DSL / fused | Triton Gluon | Triton |
| GQA attention | existing FA4 / FlashInfer / TRT-LLM | existing Gluon | existing Triton |
| fused gate + RMSNorm | reuse `ops/layernorm` + activation fusions | Gluon | Triton |

The conv prefill/decode kernels are the only new primitives; everything else is
wiring and reuse.
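
The backend-choice rule in the table can be pictured as a lookup with a portable fallback. The sketch below is purely illustrative: the op names, arch keys, and `pick_backend` helper are hypothetical and do not reflect the actual `register_kernel` API.

```python
# Hypothetical lookup table; op and arch keys are illustrative names only.
PREFERRED = {
    ("short_conv_prefill", "sm100"): "cute_dsl",
    ("short_conv_decode", "sm100"): "cute_dsl",
    ("short_conv_prefill", "gfx942"): "triton_gluon",
    ("short_conv_decode", "gfx942"): "triton_gluon",
}
PORTABLE = "triton"


def pick_backend(op: str, arch: str) -> str:
    """Return the specialized backend if one exists, else the portable one."""
    return PREFERRED.get((op, arch), PORTABLE)


for arch in ("sm100", "gfx942", "sm90"):
    print(arch, pick_backend("short_conv_decode", arch))
```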

**4. gfx942 enablement.** Independent of LFM2: confirm/extend the AMD build to
emit **gfx942** alongside gfx950, and gate the conv Gluon kernels for CDNA3. If
gfx942 is intentionally out of scope, please say so explicitly so we can plan a
fork target instead.

**Definition of "SOTA" / done.** At or above TRT-LLM / vLLM tokens-per-s-per-GPU
and lower TPOT for an agentic decode profile (short prompts, long multi-turn
decode, high concurrency), measured with the in-repo `test/agentic_benchmark`
harness on B200 / B300 and MI325X. Concrete targets to be set against a baseline
once the path runs end-to-end; the README's 580-TPS framing is the bar for what
a follow-up "SOTA LFM2" number should read like.
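
For clarity on the two headline metrics, a minimal sketch using their common definitions follows. It is not taken from the `test/agentic_benchmark` harness, which may define them differently, and the function names are hypothetical.

```python
def tpot_seconds(e2e_latency_s, ttft_s, n_output_tokens):
    """Time per output token: decode time averaged over tokens after the first."""
    if n_output_tokens < 2:
        raise ValueError("TPOT needs at least two output tokens")
    return (e2e_latency_s - ttft_s) / (n_output_tokens - 1)


def tokens_per_s_per_gpu(total_output_tokens, wall_time_s, n_gpus):
    """Aggregate output throughput normalized by GPU count."""
    return total_output_tokens / wall_time_s / n_gpus


# Example with made-up numbers: one request, then one 8-GPU benchmark run.
print(tpot_seconds(e2e_latency_s=2.1, ttft_s=0.1, n_output_tokens=201))
print(tokens_per_s_per_gpu(total_output_tokens=80_000, wall_time_s=10.0, n_gpus=8))
```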

## Execution plan

I understand that core features are core-team-driven, so I am filing this as an
**RFC** to gauge roadmap fit before any implementation, per the project's
RFC-first guidance for larger features (#149). Happy to:

1. Open a design discussion on the conv-state cache integration and the
   conv-kernel API surface (`register_kernel` signatures).
2. Contribute the **portable Triton** conv prefill/decode kernels + a reference
   `lfm2.py` model definition as a first, verifiable slice (small, testable,
   matching the existing hybrid-model style), with CuTe DSL / Gluon
   specializations as follow-ups.
3. Bring gfx942 (MI325X) and B200 hardware + the agentic-benchmark numbers for
   review.

Open questions for the core team:

- Is **gfx942 (CDNA3)** on the roadmap at all, or is CDNA4-only (gfx950) a
  deliberate line?
- Is the Mamba/SSM hybrid-state-cache path the intended home for a conv
  recurrent state, or is there a preferred generic recurrent-state abstraction?
- Preferred order: dense LFM2 first, or go straight to **LFM2-8B-A1B (MoE)** to
  exercise the EP path?

