feat(logprobs): vLLM-style output logprobs (LogprobParams), spec-decode support by HJSang · Pull Request #337 · lightseekorg/tokenspeed

HJSang · 2026-06-02T03:38:23Z

Summary

Adds a vLLM-style logprobs API on a dedicated LogprobParams struct and a vLLM-style Logprob output shape, wired end-to-end (Python frontend → C++ scheduler → forward path). SamplingParams stays sampling-only.

This PR is scoped to output (generated-token) logprobs — the surface that is correct today. The prompt-logprob and top-k paths are rejected loudly by LogprobParams.verify() rather than silently returning wrong/partial values, and are left as follow-ups.

The deprecated request fields (return_logprob, logprob_start_len, top_logprobs_num, token_ids_logprob, return_text_in_logprobs) are still translated into LogprobParams for back-compat; requests that map to an unsupported mode get a clear error.

API

LogprobParams (engine/logprob_params.py) — verify() is the single gate:

field	status
`logprobs`	honored for `None` (off) and `0` (sampled token only). `N>0` (output top-k) and `-1` (full vocab) are rejected — not yet materialized.
`prompt_logprobs`	rejected — prompt path is only correct for single-chunk pure-extend prefill; chunked / mixed / prefix-cache paths would be silently wrong.
`logprob_token_ids`	rejected — only feeds the prompt/extend path.
`return_text`	decode each returned token to text.

Output shape (engine/logprobs.py): Logprob{logprob, rank, decoded_token} as dict[int, Logprob] per position, exposed as meta_info["logprobs"] (one entry per generated token).

Speculative decoding

Output logprobs now work under speculative decoding (MTP / EAGLE). Engine.generate / async_generate previously nulled all logprob requests whenever a spec algorithm was set, silently dropping them; the engine actually computes correct, accept-length-aligned output logprobs on the spec verify path (sampler verify() gathers them; the output processor slices them by output_length with spec_num_tokens stride), so that guard is removed and verify() is the gate.

Implementation

Python: LogprobParams + Logprob; legacy-field coercion in io_struct.py; request plumbing in entrypoints/engine*.py; per-request assembly in generation_output_processor.py / logprobs.py; output-logprob flow through execution/model_executor.py + execution/types.py::ModelExecutionResult + execution/cuda_graph_wrapper.py.
C++ scheduler: prompt_logprobs + logprob_token_ids carried on RequestSpec / ForwardOperation / FlatForwardOperation (+ pybind); MatchIntent::SkipRead for prompt-logprob requests. (Prompt path present but gated off behind verify() for now.)

Validation (nv2 B200)

Output logprobs=0 — Qwen2-1.5B-Instruct: per-token logprobs returned, aligned to generated tokens, within bf16 noise vs HF log_softmax.
Output logprobs=0 under MTP — Qwen3.5-397B-A17B-NVFP4 (tp4, trtllm + flashinfer_trtllm, eager + CUDA-graph): finite, ≤0 per-token logprobs (0 → present after removing the spec guard).
Rejections — prompt_logprobs, logprob_token_ids, logprobs>0, logprobs=-1 all raise a clear ValueError at the request entrypoint.
Rebased branch build — scheduler C++ recompiles on top of current main, Python imports clean, engine inits + generates with aligned output logprobs.

Follow-ups (not in this PR)

Re-enable prompt logprobs + output top-k (logprobs>0) once the chunked/mixed prompt path and CUDA-graph top-k buffers land; verify() is the single place to relax.
OpenAI HTTP serving path (ts serve → smg_grpc_servicer, external package) maps logprobs separately and still needs the matching change for served logprobs under spec.
Minor spec-path output-logprob alignment residual observed on the 397B (≈48 logprobs for 53 tokens — boundary/bonus tokens).
PD disaggregation (pd/mini_lb.py still merges the old input_token_logprobs meta key).

🤖 Generated with Claude Code

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 574f1eec93

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a392acbff6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 05a5437e5d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ca28d25dc2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 64b6500f69

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 04b1532a67

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

HJSang · 2026-06-02T16:09:54Z

Codex review triage

Prompt-only requests return empty & prompt top-k hardcoded empty — already fixed by the Phase B / top-k routing commits; _maybe_set_input_logprob_ctx sets extend_return_logprob independently and prompt top-k is now routed. Validated on B200 (prompt_logprobs=0 and =5 match HF).
logprobs/prompt_logprobs = -1 (full vocab) — was accepted by verify() but nothing materializes a vocab-sized result. Now rejected loudly until implemented (04b1532).
OUTPUT top-k (logprobs=N>0) / output logprob_token_ids — deferred (5e00c8c, documented at the hardcoded site). Output logprobs flow through the captured CUDA-graph decode path, so top-k must be computed in the sampler and captured into the graph output buffers — a separate milestone. Until then logprobs=N>0 returns the sampled token's logprob (correct value, no alternatives).
Mixed prefill/decode batches skip prompt logprobs — known pure-extend limitation (num_extends==bs); the input-path per-request slicing assumes no interleaved decode rows. Documented follow-up.

Validated end-to-end on nv2 B200 vs HF log_softmax: output logprobs=0 (2 models), prompt prompt_logprobs=0 (vLLM convention, mean Δ≈0.008–0.05), prompt top-k prompt_logprobs=5, prompt logprob_token_ids (24/24 ids match), and prefix-cache correctness (SkipRead).

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1c0ba37ec9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c58b71336

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9c61763179

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d2ee343cf6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6dbbb87031

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b58a7ed3bf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c774bf01ac

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

qywu · 2026-06-05T22:22:57Z

prompt_logprobs is only correct on the single-chunk path — worth making the unsupported cases loud (or scoping them out).

Verified locally: prompt_logprobs works for a single-chunk, pure-extend prefill (56-token prompt → 56 positions, [0]=None, values line up). But model_executor._maybe_set_input_logprob_ctx activates only for pure-extend batches and hardcodes start_lens=[0] with no cross-chunk offset — so a prompt that gets chunked (prompt_tokens > chunked_prefill_size, default 8192) or lands in a mixed extend+decode batch silently falls outside the validated path. SkipRead handles prefix-cache reuse but not chunking. The PR already lists this under follow-ups; the concern is it's currently silent rather than loud.

Two ways to make that safe:

Keep prompt_logprobs, add a guard — raise NotImplementedError (or reject in LogprobParams.verify()) when prompt_tokens > chunked_prefill_size / mixed batch / start_len > 0, until the chunked path lands. Preserves the common-case feature + vLLM parity, makes the edges safe.
Scope this PR to output logprobs only — reject prompt_logprobs in verify() with a clear "not supported yet" message and defer the prompt path (incl. the SkipRead recompute) to a follow-up. Smaller, fully-correct surface.

I'd lean (1), since the single-chunk path is already validated and prompt_logprobs matters for vLLM parity — but (2) is reasonable if it isn't needed yet. Either beats silently-wrong logprobs on long/chunked prompts.

Caveat: I verified single-chunk directly; I couldn't run a clean chunked repro because chunked prefill crashes independently on my older test base (unrelated to this PR), so I can't say whether the real-base chunked behavior is silently-wrong vs. an error.

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a0a2784efc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

qywu · 2026-06-08T03:49:25Z

there is merge conflict for python/tokenspeed/runtime/layers/logits_processor.py

HJSang · 2026-06-08T04:31:18Z

there is merge conflict for python/tokenspeed/runtime/layers/logits_processor.py

resolved

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 035f19e5b7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-08T04:31:46Z

+                # Output top-k unsupported -> honor return_logprob as the
+                # sampled-token logprob; top_logprobs_num is intentionally
+                # clamped to 0 rather than rejected for back-compat.
+                logprobs=0 if rl else None,


Preserve legacy top-logprobs results

Fresh evidence in the current diff is that the compatibility shim now always maps legacy return_logprob=True, top_logprobs_num=N requests to LogprobParams(logprobs=0), so callers that previously requested top-N alternatives no longer fail loudly but silently receive only the sampled token's logprob. This affects existing clients and in-repo callers that still use the deprecated fields for experiment scoring/comparison, because their result shape/content is downgraded even though the request is accepted.

Useful? React with 👍 / 👎.

…de support Add a dedicated LogprobParams request struct and a vLLM-style Logprob output shape (per-position {token_id: Logprob}), kept separate from SamplingParams. Scope to OUTPUT logprobs only for now: LogprobParams.verify() is the single gate and loudly rejects the not-yet-correct surface — prompt_logprobs and logprob_token_ids (prompt path is only valid for single-chunk pure-extend prefill; chunked/mixed/prefix-cache paths would be silently wrong), output top-k (logprobs>0, only the sampled token is materialized), and full-vocab (-1). Only logprobs in {None, 0} are honored. GPU parity runner updated to request logprobs=0. Also enable output logprobs under speculative decoding: Engine.generate / async_generate previously nulled all logprob requests whenever a spec algorithm was set, silently dropping them. The engine computes correct, accept-length-aligned output logprobs on the spec verify path, so the guard was overly conservative; remove it and rely on verify() as the gate. Back-compat / streaming hardening: - Legacy field coercion (io_struct) now builds a per-row LogprobParams for batched list inputs (e.g. return_logprob=[False, True]) instead of collapsing to row 0, and clamps legacy top_logprobs_num>0 to the sampled-token logprob (logprobs=0) rather than erroring. - RequestOutputCollector now sums cumulative_logprob across coalesced streamed frames (each frame's value is a per-frame delta) so it stays consistent with the appended per-position logprobs. Validated on B200 (Qwen2-1.5B and Qwen3.5-397B-A17B-NVFP4 MTP/tp4): output logprobs=0 returns finite, <=0 per-token logprobs; prompt/top-k surface rejected at the request entrypoint. Rebased branch rebuilds (scheduler C++ + Python import + engine run) on top of current main. Note: the OpenAI HTTP serving path (ts serve -> smg_grpc_servicer) maps logprobs in a separate, out-of-repo package and needs the matching change there; this covers the in-repo Engine/SDK path. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: Hejian Sang <sanghj0923@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 740d514720

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-06-12T03:06:28Z

+            logprobs_info["cumulative_logprob"] = logprobs_info.get(
+                "cumulative_logprob", 0.0
+            ) + self._sum_slot0(out_sampled_val)


Keep cumulative_logprob cumulative for streams

When stream=True and the client consumes each frame promptly, OutputProcessor passes a fresh logprobs_info dict for every frame, so this assignment reports only the current frame's logprob sum rather than the cumulative sum over the generated prefix. The collector now sums deltas only when multiple frames are coalesced before a read, which means fast streaming clients see cumulative_logprob reset per chunk while slow clients see a larger merged value for the same tokens; store the running scalar on the request state or emit a clearly named delta instead.

Useful? React with 👍 / 👎.

qywu

Requesting changes — keep this PR scoped to the functional surface (output logprobs).

The entire tokenspeed-scheduler/ C++ diff exists only to support the prompt-logprob path, which LogprobParams.verify() rejects in this PR. On the output-only path request->PromptLogprobs() is always -1, so MatchIntent::SkipRead never fires and the new prompt_logprobs / logprob_token_ids fields just carry -1/empty — it's dead code until the follow-up.

Could we drop the scheduler changes here and land them with the prompt-logprob follow-up, where they're actually exercised and testable? Output logprobs are computed entirely Python-side (model_executor / sampler verify() / output processor), so they need no scheduler change.

If removed, the coupled Python plumbing has to go too so nothing references the removed pybind fields:

scheduler_utils.py::make_spec — the prompt_logprobs / logprob_token_ids params + spec.* assignments
request_handler.py — the lp.num_prompt_logprobs() / logprob_token_ids -> make_spec(...) wiring
the C++ side: python_module.cpp, request_spec.h, request.{h,cpp}, forward.{h,cpp}, types.h, and the prefix-cache SkipRead handling

(Keeping them is defensible too — harmless no-op, saves a recompile later. But if the goal is a minimal, fully-exercised PR, splitting the scheduler work into the follow-up matches each diff to what it enables.)

qywu

Following up on the request-changes review with the specific prompt-logprob code to remove so this PR is scoped to output logprobs only. The output-logprobs API refactor stays (the LogprobParams/Logprob types, logprob_params plumbing in io_struct/input_processor/output_processor/entrypoints, the num_out path in logprobs.py, and the output-logprob tests). Each inline note marks code that exists solely for the prompt/input-logprob path — currently unreachable because LogprobParams.verify() rejects prompt_logprobs/logprob_token_ids, so it's untested dead code until the follow-up.

Grouped: scheduler-coupled (request_handler.py, scheduler_utils.py) · execution path (context.py, types.py, cuda_graph_wrapper.py, model_executor.py) · output-processor accumulation/shipping (generation_output_processor.py, collector.py) · assembly (logprobs.py prompt branch) · API/guard (logprob_params.py, io_struct.py legacy mapping) · tests (runners.py).

qywu · 2026-06-12T08:04:46Z

-    "input_token_ids_logprobs",
-    "output_token_ids_logprobs",
+    "logprobs",
+    "prompt_logprobs",


prompt_logprobs is the only prompt-specific entry in this merge policy — logprobs and cumulative_logprob are output-side and stay. Drop prompt_logprobs from _APPEND_META_KEYS when the prompt path is removed (the engine never emits a prompt_logprobs meta key once verify() rejects it).

qywu · 2026-06-12T08:04:46Z

        token_ids_logprob: list[int] | None = None,
        multimodal_inputs=None,
        prompt_input_ids_unpadded: list[int] | None = None,
+        return_prompt_logprob: bool = False,


return_prompt_logprob and every accumulator it gates (input_token_logprobs_val/idx, input_top_logprobs_val/idx, input_token_ids_logprobs_val/idx, sent_prompt_logprob_offset — lines 112–135) are prompt-path only. Remove them, plus the return_prompt_logprob derivation/passing in from_recv_req (~189, ~201).

qywu · 2026-06-12T08:04:46Z

            if model_execution_results.output_logprobs is not None
            else None
        )
+        _input_lp = getattr(model_execution_results, "input_token_logprobs", None)


This whole block through ~line 670 is prompt-path only: it reads ModelExecutionResult.input_* fields and forward_op.prompt_logprobs (both removed with the scheduler change), the ilp_pt accumulator, and the per-request seg_val/seg_idx/pl_req accumulation into request_state.input_*. Remove.

qywu · 2026-06-12T08:04:46Z

                output_token_logprobs_val.append([])
                output_token_logprobs_idx.append([])

+            # Prompt/input-token logprobs: ship the un-shipped tail of the


The prompt-logprob shipping block (937–978) and its wiring into BatchStrOut (999–1015) should revert to the prior empties: input_token_logprobs_val=[], input_top_logprobs_val=[], input_token_ids_logprobs_val=[] (idx likewise). The output_* fields and the output-top-k TODO (1005–1011) stay.

qywu · 2026-06-12T08:04:46Z

+                "cumulative_logprob", 0.0
+            ) + self._sum_slot0(out_sampled_val)
+
+        if num_prompt is not None:


if num_prompt is not None: … ~167 builds the prompt_logprobs meta entry — prompt path only; remove. The output block (num_out), _build_positions, build_position_logprobs, and the Logprob dataclass stay. Note: with prompt and output-token-id logprobs both deferred, want_token_ids is always False, so the tid_* args + the fold-in loop in _build_positions (196–211) are dead for this PR too.

qywu · 2026-06-12T08:04:47Z

+            ),
+        )
+
+    def _maybe_set_input_logprob_ctx(self, ctx, forward_op, bs, num_extends):


_maybe_set_input_logprob_ctx (672–707) is entirely the prompt-logprob activation path: it reads forward_op.prompt_logprobs / logprob_token_ids and sets the ctx.extend_* fields. Remove the method and its call site at ~1474.

qywu · 2026-06-12T08:04:47Z

                        ),
                        **mamba_kwargs,
                    )
+                    if _input_lp_bundle is None:


The _input_lp_bundle unpack (1547–1551, 1576–1589), the input_*=None inits (1416–1420), the CPU copy (1646–1656), and the ModelExecutionResult(input_*=…) kwargs (1665–1670) are all prompt path. Remove with the 4-tuple revert above.

qywu · 2026-06-12T08:04:47Z

        aux_hidden_states: torch.Tensor | None = None,
    ) -> LogitsProcessorOutput:
        # Get the last hidden states and last logits for the next token prediction
+        # NOTE: ``extend_return_logprob`` is only ever set True once the


Comment-only change. With the prompt path removed, extend_return_logprob is always False so this branch is always taken — the added 'Phase B' NOTE can be dropped (or reduced to a one-liner without the prompt-path framing).

qywu · 2026-06-12T08:04:47Z

+            # OUTPUT top-k logprobs are deferred (not produced yet), so only the
+            # prompt (prefill) top-k and prompt token-id logprobs are extracted;
+            # output top-k is left empty and skipped by check_close_model_outputs.
+            prompt_lp = response["meta_info"].get("prompt_logprobs") or []


Heads-up: once the prompt path is gone, meta_info has no prompt_logprobs key, so prompt_positions is always []. That makes the prefill-logprob comparison in check_close_model_outputs (704–708, n_common=0) and the prompt-token-id comparison (750–781) vacuous — they pass trivially. Better to drop the prompt/prefill comparisons and keep this runner asserting only the output (decode, logprobs=0) path, so the test actually exercises what the PR ships.

qywu · 2026-06-12T08:04:47Z

+            tid = list(tid) if tid else None
+            if not (rl or tid):
+                return None
+            want_prompt = isinstance(lsl, int) and lsl >= 0


For an output-only PR the legacy coercion only needs return_logprob → logprobs=0. want_prompt + prompt_logprobs=… (145, 151) and logprob_token_ids=tid (152) map to features verify() rejects, so a legacy caller passing logprob_start_len>=0 or token_ids_logprob would now hard-error. Either trim these mappings, or keep them and accept the reject — but call out the back-compat behavior change.

qywu

Two follow-ups in one review: (1) the net-new redundancy items from the offline pass, and (2) a careful SGLang logprobs compatibility check (this codebase is SGLang-derived).

Redundancy: two dead symbols (MAX_LOGPROB_TOKEN_IDS, LogprobsOnePosition) and the write-only ForwardContext fields.

SGLang compatibility — three concerns, all verified against main and the PR branch:

The public meta_info logprob keys are renamed + reshaped (six SGLang tuple-lists → logprobs/prompt_logprobs dict-of-Logprob). Breaking for any SGLang-shaped consumer; CI won't catch it since the tests were rewritten.
Logprob (dataclass) + int-keyed dicts aren't JSON/wire-native the way SGLang's lists were; no in-repo serializer handles them.
Legacy request-field coercion silently changes meaning: top_logprobs_num dropped, logprob_start_len/token_ids_logprob now hard-error.

None of these block the output-logprobs goal — but if SGLang wire-compat matters (router, eval harnesses, the external OpenAI/smg path), they need a conscious decision (compat shim vs. documented break).

qywu · 2026-06-12T08:13:34Z

+
+from dataclasses import dataclass
+
+MAX_LOGPROB_TOKEN_IDS = 128


Dead constant — MAX_LOGPROB_TOKEN_IDS is defined here and referenced nowhere on the branch (git grep finds only this line). Either wire it into verify() (e.g. cap len(logprob_token_ids)) or drop it.

qywu · 2026-06-12T08:13:34Z

+
+
+# One position maps token_id -> Logprob.
+LogprobsOnePosition = dict[int, "Logprob"]


Dead type alias — LogprobsOnePosition is defined and referenced nowhere (git grep finds only this line). _build_positions/build_position_logprobs annotate with the inline dict[int, Logprob] instead. Drop it or actually use it as the return annotation.

qywu · 2026-06-12T08:13:34Z

    accept_lengths: torch.Tensor | None = None
+
+    # --- input/prompt logprobs (off-policy); set by _maybe_set_input_logprob_ctx ---
+    extend_return_logprob: bool = False


Beyond being prompt-only: 8 of these 9 fields are write-only. _maybe_set_input_logprob_ctx assigns ctx.extend_return_logprob, extend_logprob_start_lens_cpu, extend_seq_lens_cpu, extend_logprob_pruned_lens_cpu, top_logprobs_nums, token_ids_logprobs, extend_return_top_logprob, extend_token_ids_logprob — but nothing ever reads them back off ctx (only extend_input_logprob_token_ids_gpu is read, in model_executor ~1651). The logits kernel reads LogitsMetadata.*, and the PR adds no ctx → LogitsMetadata bridge, so these don't even drive the existing prompt kernel. Dead regardless of the prompt-path decision.

qywu · 2026-06-12T08:13:34Z

-                    _get("output_top_logprobs_idx"),
-                    return_text_in_logprobs,
-                )
+            logprobs_info.setdefault("logprobs", []).extend(positions)


SGLang compat (breaking). This replaces the entire SGLang meta_info logprob surface — input_token_logprobs, output_token_logprobs, input_top_logprobs, output_top_logprobs, input_token_ids_logprobs, output_token_ids_logprobs (each a list[(logprob, token_id, decoded_token)]) — with meta_info["logprobs"] / ["prompt_logprobs"] as list[dict[int, Logprob]] (+ cumulative_logprob). Any SGLang-shaped consumer (the external smg_grpc_servicer OpenAI path, SGLang router/eval harnesses, the PD mini_lb merge of input_token_logprobs) reading the old keys silently gets nothing. The in-repo tests were rewritten to the new shape, so CI won't catch this. If SGLang wire-compat is a requirement, consider emitting both key sets during a deprecation window; if it's an intentional break, please call it out in the PR description as an API break.

qywu · 2026-06-12T08:13:34Z

+
+
+@dataclass
+class Logprob:


SGLang compat (serialization). SGLang returned plain tuples/lists, which JSON-serialize directly. dict[int, Logprob] does not: Logprob is a bare dataclass (no to_dict/__json__), and JSON object keys must be strings (int keys get coerced, changing the shape consumers see). No in-repo path serializes it (the only asdict/json.dumps hits are server_args + PD registry/transfer, unrelated). The OpenAI/HTTP mapping lives in the external smg package, which the PR notes "still needs the matching change" — so served logprobs are effectively unmapped for this shape until that lands. Worth a to_dict() / explicit wire form here so in-process and served outputs agree.

qywu · 2026-06-12T08:13:34Z

+                # Output top-k unsupported -> honor return_logprob as the
+                # sampled-token logprob; top_logprobs_num is intentionally
+                # clamped to 0 rather than rejected for back-compat.
+                logprobs=0 if rl else None,


SGLang compat (silent semantic changes). This legacy translation changes behavior for existing SGLang clients:

top_logprobs_num is dropped entirely — it isn't even a parameter of _row. A request with return_logprob=True, top_logprobs_num=5 silently returns only the sampled token (logprobs=0), no top-k, no error.

logprob_start_len >= 0 (the standard SGLang way to ask for prompt logprobs) → prompt_logprobs → verify() raises. Previously-working requests now hard-error.

token_ids_logprob → logprob_token_ids → verify() raises (even when return_logprob=False, since _row proceeds on tid alone).

return_logprob=True used to return both prompt and output logprobs; now it's output-only.

return_text_in_logprobs is the one preserved. Suggest at minimum erroring (not silently clamping) on top_logprobs_num > 0 so the data loss is visible, and documenting the prompt-field rejections as a back-compat break.

qywu · 2026-06-12T08:17:45Z

Design proposal: support SGLang and vLLM logprob formats together

Following up on the compatibility review. The good news is this PR is already most of the way there — both formats are just two views of the same per-position data, and the neutral wire arrays this PR keeps unchanged (*_token_logprobs_val/idx, *_top_logprobs_val/idx, *_token_ids_logprobs_val/idx) are already a superset of what either dialect needs. So we can serve both with one compute path, two renderers, chosen per request — no scheduler involvement, no double work.

Architecture

scheduler/model ──► neutral wire arrays (BatchStrOut: *_val / *_idx)   [UNCHANGED]
                              │
                    convert_logprob_style(params, recv_obj)            ◄── single dispatch point
                       │                         │
                 render_sglang()           render_vllm()
                       │                         │
   meta_info["output_token_logprobs"]    meta_info["logprobs"]
   meta_info["output_top_logprobs"]      meta_info["prompt_logprobs"]
   ... (6 tuple-list keys)               ... (dict[int, Logprob] + cumulative_logprob)

render_sglang is essentially the pre-PR convert_logprob_style (tuple-lists); render_vllm is the new code this PR already wrote. Both read the same neutral arrays.

Request side — a near 1:1 mapping (PR is close)

The PR's _coerce_legacy_logprob_fields just needs to be lossless instead of dropping fields:

SGLang request	vLLM `LogprobParams`	Note
`return_logprob=True, top_logprobs_num=0`	`logprobs=0`	sampled token only
`return_logprob=True, top_logprobs_num=N`	`logprobs=N`	currently dropped — should map N
`logprob_start_len>=0` (+ `top_logprobs_num`)	`prompt_logprobs=N`	`start_len` is the extra SGLang knob (caveat 2)
`token_ids_logprob=[...]`	`logprob_token_ids=[...]`	direct
`return_text_in_logprobs`	`return_text`	already preserved

Gate unsupported capabilities (output top-k N>0, prompt logprobs) as capabilities — not by silently clamping the API shape.

Output side — add a `format` flag, dispatch on it

@dataclass
class LogprobParams:
    logprobs: int | None = None
    prompt_logprobs: int | None = None
    logprob_token_ids: list[int] | None = None
    return_text: bool = False
    format: Literal["vllm", "sglang", "both"] = "vllm"   # new

Selection policy: the response matches the request dialect.

Built via legacy fields → format="sglang" → tuple-list keys (back-compat preserved exactly).
Built via logprob_params directly → format="vllm" (default) → dict[int, Logprob].
"both" available for migration/testing (opt-in; it doubles payload).

This gives full SGLang back-compat with zero ambiguity, computed once. collector._APPEND_META_KEYS becomes the union of both dialects' list keys (only one dialect is present per request, so the union is harmless).

Where each change lands

Change	File
`format` field; keep `render_sglang` (old code) + `render_vllm` (new); dispatch in `convert_logprob_style`	`engine/logprobs.py`
Set `format="sglang"` in legacy coercion; carry `top_logprobs_num` into the count	`engine/io_struct.py`
`_APPEND_META_KEYS` = union of both key sets	`engine/collector.py`
`Logprob.to_dict()` so the vLLM shape is JSON-safe; serving layer calls it	`engine/logprobs.py` + serving boundary
Parametrize logprob tests over `format ∈ {sglang, vllm}`	`test/runners.py`, `test/runtime/...`

Caveats to decide consciously

rank fidelity (vLLM): vLLM's Logprob.rank is the token's 1-based rank in the full vocab distribution; the PR uses slot index (0=sampled, 1..N). True parity needs the actual vocab rank from the sampler. Orthogonal to this architecture, but don't advertise vLLM-rank parity until it's real.
logprob_start_len: SGLang allows mid-prompt start; vLLM is whole-prompt. render_sglang honors start_len (slice), render_vllm returns the whole prompt. Doesn't bite now (prompt logprobs deferred) — just bake start_len into the neutral representation so it's there when the prompt path lands.
OpenAI/serving is a third renderer. Point the external smg servicer at the neutral arrays, not at either meta_info dialect, so it never depends on format.
"both" cost: always-emit-both doubles the logprob payload (heavy for RL/prompt-logprob workloads) — keep it opt-in, not the default.

Net

Keep the neutral wire arrays as the contract, split convert_logprob_style into two pure renderers, add a format flag defaulting to "match the request dialect," and make Logprob serializable. One compute path, both APIs satisfied, scheduler untouched. This also closes the CI blind spot from the review (tests would then assert both shapes).

HJSang requested a review from a team as a code owner June 2, 2026 03:38

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/engine/generation_output_processor.py

lightseek-bot requested a review from qywu June 2, 2026 03:42

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/engine/logprobs.py

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/engine/generation_output_processor.py Outdated

Comment thread python/tokenspeed/runtime/execution/model_executor.py

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/execution/model_executor.py

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/engine/logprob_params.py Outdated

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/engine/input_processor.py

HJSang force-pushed the hejian/vllm-logprobs branch 2 times, most recently from 0d1f63a to 1c0ba37 Compare June 2, 2026 16:15

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/entrypoints/engine.py

HJSang mentioned this pull request Jun 2, 2026

feat(engine): compute_log_probs API for RL sequence scoring (RL-plan M2) #321

Closed

HJSang force-pushed the hejian/vllm-logprobs branch from 1c0ba37 to 3c58b71 Compare June 2, 2026 17:24

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/execution/model_executor.py

HJSang force-pushed the hejian/vllm-logprobs branch from 3c58b71 to 9c61763 Compare June 2, 2026 17:32

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/engine/generation_output_processor.py

HJSang force-pushed the hejian/vllm-logprobs branch 2 times, most recently from 58a21fd to d2ee343 Compare June 2, 2026 21:06

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/engine/io_struct.py Outdated

HJSang force-pushed the hejian/vllm-logprobs branch from d2ee343 to 6dbbb87 Compare June 2, 2026 21:20

chatgpt-codex-connector Bot reviewed Jun 2, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/engine/io_struct.py Outdated

HJSang force-pushed the hejian/vllm-logprobs branch from 6dbbb87 to b58a7ed Compare June 3, 2026 03:37

chatgpt-codex-connector Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/engine/collector.py Outdated

HJSang force-pushed the hejian/vllm-logprobs branch 2 times, most recently from cc8bb2a to c774bf0 Compare June 3, 2026 04:27

chatgpt-codex-connector Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/engine/logprobs.py

qywu reviewed Jun 5, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/engine/logprob_params.py

chatgpt-codex-connector Bot reviewed Jun 7, 2026

View reviewed changes

Comment thread python/tokenspeed/runtime/engine/io_struct.py Outdated

HJSang force-pushed the hejian/vllm-logprobs branch from 670156e to 08360f7 Compare June 8, 2026 04:08

HJSang changed the title ~~feat(logprobs): vLLM-style logprob API (LogprobParams + Logprob output shape)~~ feat(logprobs): vLLM-style output logprobs (LogprobParams), spec-decode support Jun 8, 2026

HJSang force-pushed the hejian/vllm-logprobs branch from 08360f7 to 035f19e Compare June 8, 2026 04:28

chatgpt-codex-connector Bot reviewed Jun 8, 2026

View reviewed changes

HJSang force-pushed the hejian/vllm-logprobs branch from 035f19e to 740d514 Compare June 12, 2026 03:00

chatgpt-codex-connector Bot reviewed Jun 12, 2026

View reviewed changes

lightseek-bot requested a review from qywu June 12, 2026 04:47

qywu requested changes Jun 12, 2026

View reviewed changes

qywu reviewed Jun 12, 2026

View reviewed changes


		from dataclasses import dataclass

		MAX_LOGPROB_TOKEN_IDS = 128



		# One position maps token_id -> Logprob.
		LogprobsOnePosition = dict[int, "Logprob"]

Conversation

HJSang commented Jun 2, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

API

Speculative decoding

Implementation

Validation (nv2 B200)

Follow-ups (not in this PR)

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

HJSang commented Jun 2, 2026

Codex review triage

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

💡 Codex Review

Uh oh!

Uh oh!

qywu commented Jun 5, 2026

HJSang commented Jun 2, 2026 •

edited

Loading