feat(logprobs): vLLM-style output logprobs (LogprobParams), spec-decode support #337
HJSang wants to merge 1 commit into
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 574f1eec93
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a392acbff6
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 05a5437e5d
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: ca28d25dc2
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 64b6500f69
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 04b1532a67
Codex review triage
Validated end-to-end on nv2 B200 vs HF
0d1f63a to 1c0ba37
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1c0ba37ec9
1c0ba37 to 3c58b71
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 3c58b71336
3c58b71 to 9c61763
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 9c61763179
58a21fd to d2ee343
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: d2ee343cf6
d2ee343 to 6dbbb87
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6dbbb87031
6dbbb87 to b58a7ed
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: b58a7ed3bf
cc8bb2a to c774bf0
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c774bf01ac
Verified locally: Two ways to make that safe:
I'd lean (1), since the single-chunk path is already validated and
Caveat: I verified single-chunk directly; I couldn't run a clean chunked repro because chunked prefill crashes independently on my older test base (unrelated to this PR), so I can't say whether the real-base chunked behavior is silently wrong or an error.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a0a2784efc
There is a merge conflict in python/tokenspeed/runtime/layers/logits_processor.py.
670156e to 08360f7
08360f7 to 035f19e
resolved
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 035f19e5b7
# Output top-k unsupported -> honor return_logprob as the
# sampled-token logprob; top_logprobs_num is intentionally
# clamped to 0 rather than rejected for back-compat.
logprobs=0 if rl else None,
Preserve legacy top-logprobs results
Fresh evidence in the current diff is that the compatibility shim now always maps legacy return_logprob=True, top_logprobs_num=N requests to LogprobParams(logprobs=0), so callers that previously requested top-N alternatives no longer fail loudly but silently receive only the sampled token's logprob. This affects existing clients and in-repo callers that still use the deprecated fields for experiment scoring/comparison, because their result shape/content is downgraded even though the request is accepted.
…de support
Add a dedicated LogprobParams request struct and a vLLM-style Logprob output
shape (per-position {token_id: Logprob}), kept separate from SamplingParams.
Scope to OUTPUT logprobs only for now: LogprobParams.verify() is the single
gate and loudly rejects the not-yet-correct surface — prompt_logprobs and
logprob_token_ids (prompt path is only valid for single-chunk pure-extend
prefill; chunked/mixed/prefix-cache paths would be silently wrong), output
top-k (logprobs>0, only the sampled token is materialized), and full-vocab
(-1). Only logprobs in {None, 0} are honored. GPU parity runner updated to
request logprobs=0.
Also enable output logprobs under speculative decoding: Engine.generate /
async_generate previously nulled all logprob requests whenever a spec
algorithm was set, silently dropping them. The engine computes correct,
accept-length-aligned output logprobs on the spec verify path, so the guard
was overly conservative; remove it and rely on verify() as the gate.
Back-compat / streaming hardening:
- Legacy field coercion (io_struct) now builds a per-row LogprobParams for
batched list inputs (e.g. return_logprob=[False, True]) instead of
collapsing to row 0, and clamps legacy top_logprobs_num>0 to the
sampled-token logprob (logprobs=0) rather than erroring.
- RequestOutputCollector now sums cumulative_logprob across coalesced
streamed frames (each frame's value is a per-frame delta) so it stays
consistent with the appended per-position logprobs.
Validated on B200 (Qwen2-1.5B and Qwen3.5-397B-A17B-NVFP4 MTP/tp4): output
logprobs=0 returns finite, <=0 per-token logprobs; prompt/top-k surface
rejected at the request entrypoint. Rebased branch rebuilds (scheduler C++ +
Python import + engine run) on top of current main.
Note: the OpenAI HTTP serving path (ts serve -> smg_grpc_servicer) maps
logprobs in a separate, out-of-repo package and needs the matching change
there; this covers the in-repo Engine/SDK path.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Hejian Sang <sanghj0923@gmail.com>
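The gate described in the commit message above can be sketched as follows. This is a minimal illustration, assuming a simplified LogprobParams that carries only the fields named in the message; the real class in engine/logprob_params.py may have more fields, different defaults, and different error text.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class LogprobParams:
    """Hypothetical, simplified stand-in for the PR's request struct."""

    logprobs: int | None = None  # None = off, 0 = sampled token only
    prompt_logprobs: int | None = None  # prompt path: rejected for now
    logprob_token_ids: list[int] | None = None  # rejected for now
    return_text: bool = False

    def verify(self) -> None:
        """Raise ValueError for every surface that is not yet correct."""
        if self.prompt_logprobs is not None:
            raise ValueError("prompt_logprobs is not supported yet")
        if self.logprob_token_ids:
            raise ValueError("logprob_token_ids is not supported yet")
        if self.logprobs not in (None, 0):
            # Covers output top-k (N > 0) and full-vocab (-1).
            raise ValueError(
                f"logprobs={self.logprobs} is not supported; use None or 0"
            )
```

Keeping every rejection in one method means a follow-up only has to relax verify() when a deferred path becomes correct.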
035f19e to 740d514
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 740d514720
logprobs_info["cumulative_logprob"] = logprobs_info.get(
    "cumulative_logprob", 0.0
) + self._sum_slot0(out_sampled_val)
Keep cumulative_logprob cumulative for streams
When stream=True and the client consumes each frame promptly, OutputProcessor passes a fresh logprobs_info dict for every frame, so this assignment reports only the current frame's logprob sum rather than the cumulative sum over the generated prefix. The collector now sums deltas only when multiple frames are coalesced before a read, which means fast streaming clients see cumulative_logprob reset per chunk while slow clients see a larger merged value for the same tokens; store the running scalar on the request state or emit a clearly named delta instead.
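One way to act on the first suggestion (store the running scalar on the request state) is sketched below. RequestLogprobState and coalesce are hypothetical names for illustration, not the PR's classes; the point is only that each frame carries the already-cumulative value, so a collector merging frames takes the latest value instead of summing.

```python
class RequestLogprobState:
    """Hypothetical per-request state that owns the running sum."""

    def __init__(self) -> None:
        self.cumulative_logprob = 0.0

    def on_frame(self, frame_logprobs: list) -> dict:
        """Fold one streamed frame into the running sum and build its meta."""
        self.cumulative_logprob += sum(frame_logprobs)
        return {
            "logprobs": list(frame_logprobs),
            "cumulative_logprob": self.cumulative_logprob,
        }


def coalesce(frames: list) -> dict:
    """Merge frames that piled up before the client read them."""
    merged = {"logprobs": [], "cumulative_logprob": 0.0}
    for frame in frames:
        merged["logprobs"].extend(frame["logprobs"])
        # Each frame already holds the cumulative value, so the latest wins.
        merged["cumulative_logprob"] = frame["cumulative_logprob"]
    return merged
```

With this shape a fast client (one frame per read) and a slow client (several frames coalesced) see the same cumulative_logprob after the same number of tokens.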
qywu left a comment
Requesting changes — keep this PR scoped to the functional surface (output logprobs).
The entire tokenspeed-scheduler/ C++ diff exists only to support the prompt-logprob path, which LogprobParams.verify() rejects in this PR. On the output-only path request->PromptLogprobs() is always -1, so MatchIntent::SkipRead never fires and the new prompt_logprobs / logprob_token_ids fields just carry -1/empty — it's dead code until the follow-up.
Could we drop the scheduler changes here and land them with the prompt-logprob follow-up, where they're actually exercised and testable? Output logprobs are computed entirely Python-side (model_executor / sampler verify() / output processor), so they need no scheduler change.
If removed, the coupled Python plumbing has to go too so nothing references the removed pybind fields:
- scheduler_utils.py::make_spec: the prompt_logprobs / logprob_token_ids params + spec.* assignments
- request_handler.py: the lp.num_prompt_logprobs() / logprob_token_ids -> make_spec(...) wiring
- the C++ side: python_module.cpp, request_spec.h, request.{h,cpp}, forward.{h,cpp}, types.h, and the prefix-cache SkipRead handling
(Keeping them is defensible too — harmless no-op, saves a recompile later. But if the goal is a minimal, fully-exercised PR, splitting the scheduler work into the follow-up matches each diff to what it enables.)
qywu left a comment
Following up on the request-changes review with the specific prompt-logprob code to remove so this PR is scoped to output logprobs only. The output-logprobs API refactor stays (the LogprobParams/Logprob types, logprob_params plumbing in io_struct/input_processor/output_processor/entrypoints, the num_out path in logprobs.py, and the output-logprob tests). Each inline note marks code that exists solely for the prompt/input-logprob path — currently unreachable because LogprobParams.verify() rejects prompt_logprobs/logprob_token_ids, so it's untested dead code until the follow-up.
Grouped: scheduler-coupled (request_handler.py, scheduler_utils.py) · execution path (context.py, types.py, cuda_graph_wrapper.py, model_executor.py) · output-processor accumulation/shipping (generation_output_processor.py, collector.py) · assembly (logprobs.py prompt branch) · API/guard (logprob_params.py, io_struct.py legacy mapping) · tests (runners.py).
"input_token_ids_logprobs",
"output_token_ids_logprobs",
"logprobs",
"prompt_logprobs",
prompt_logprobs is the only prompt-specific entry in this merge policy — logprobs and cumulative_logprob are output-side and stay. Drop prompt_logprobs from _APPEND_META_KEYS when the prompt path is removed (the engine never emits a prompt_logprobs meta key once verify() rejects it).
token_ids_logprob: list[int] | None = None,
multimodal_inputs=None,
prompt_input_ids_unpadded: list[int] | None = None,
return_prompt_logprob: bool = False,
return_prompt_logprob and every accumulator it gates (input_token_logprobs_val/idx, input_top_logprobs_val/idx, input_token_ids_logprobs_val/idx, sent_prompt_logprob_offset — lines 112–135) are prompt-path only. Remove them, plus the return_prompt_logprob derivation/passing in from_recv_req (~189, ~201).
if model_execution_results.output_logprobs is not None
else None
)
_input_lp = getattr(model_execution_results, "input_token_logprobs", None)
This whole block through ~line 670 is prompt-path only: it reads ModelExecutionResult.input_* fields and forward_op.prompt_logprobs (both removed with the scheduler change), the ilp_pt accumulator, and the per-request seg_val/seg_idx/pl_req accumulation into request_state.input_*. Remove.
output_token_logprobs_val.append([])
output_token_logprobs_idx.append([])

# Prompt/input-token logprobs: ship the un-shipped tail of the
The prompt-logprob shipping block (937–978) and its wiring into BatchStrOut (999–1015) should revert to the prior empties: input_token_logprobs_val=[], input_top_logprobs_val=[], input_token_ids_logprobs_val=[] (idx likewise). The output_* fields and the output-top-k TODO (1005–1011) stay.
"cumulative_logprob", 0.0
) + self._sum_slot0(out_sampled_val)

if num_prompt is not None:
if num_prompt is not None: … ~167 builds the prompt_logprobs meta entry — prompt path only; remove. The output block (num_out), _build_positions, build_position_logprobs, and the Logprob dataclass stay. Note: with prompt and output-token-id logprobs both deferred, want_token_ids is always False, so the tid_* args + the fold-in loop in _build_positions (196–211) are dead for this PR too.
),
)

def _maybe_set_input_logprob_ctx(self, ctx, forward_op, bs, num_extends):
_maybe_set_input_logprob_ctx (672–707) is entirely the prompt-logprob activation path: it reads forward_op.prompt_logprobs / logprob_token_ids and sets the ctx.extend_* fields. Remove the method and its call site at ~1474.
),
**mamba_kwargs,
)
if _input_lp_bundle is None:
The _input_lp_bundle unpack (1547–1551, 1576–1589), the input_*=None inits (1416–1420), the CPU copy (1646–1656), and the ModelExecutionResult(input_*=…) kwargs (1665–1670) are all prompt path. Remove with the 4-tuple revert above.
aux_hidden_states: torch.Tensor | None = None,
) -> LogitsProcessorOutput:
# Get the last hidden states and last logits for the next token prediction
# NOTE: ``extend_return_logprob`` is only ever set True once the
Comment-only change. With the prompt path removed, extend_return_logprob is always False so this branch is always taken — the added 'Phase B' NOTE can be dropped (or reduced to a one-liner without the prompt-path framing).
# OUTPUT top-k logprobs are deferred (not produced yet), so only the
# prompt (prefill) top-k and prompt token-id logprobs are extracted;
# output top-k is left empty and skipped by check_close_model_outputs.
prompt_lp = response["meta_info"].get("prompt_logprobs") or []
Heads-up: once the prompt path is gone, meta_info has no prompt_logprobs key, so prompt_positions is always []. That makes the prefill-logprob comparison in check_close_model_outputs (704–708, n_common=0) and the prompt-token-id comparison (750–781) vacuous — they pass trivially. Better to drop the prompt/prefill comparisons and keep this runner asserting only the output (decode, logprobs=0) path, so the test actually exercises what the PR ships.
tid = list(tid) if tid else None
if not (rl or tid):
return None
want_prompt = isinstance(lsl, int) and lsl >= 0
For an output-only PR the legacy coercion only needs return_logprob → logprobs=0. want_prompt + prompt_logprobs=… (145, 151) and logprob_token_ids=tid (152) map to features verify() rejects, so a legacy caller passing logprob_start_len>=0 or token_ids_logprob would now hard-error. Either trim these mappings, or keep them and accept the reject — but call out the back-compat behavior change.
qywu left a comment
Two follow-ups in one review: (1) the net-new redundancy items from the offline pass, and (2) a careful SGLang logprobs compatibility check (this codebase is SGLang-derived).
Redundancy: two dead symbols (MAX_LOGPROB_TOKEN_IDS, LogprobsOnePosition) and the write-only ForwardContext fields.
SGLang compatibility — three concerns, all verified against main and the PR branch:
- The public meta_info logprob keys are renamed + reshaped (six SGLang tuple-lists → logprobs / prompt_logprobs dict-of-Logprob). Breaking for any SGLang-shaped consumer; CI won't catch it since the tests were rewritten.
- Logprob (dataclass) + int-keyed dicts aren't JSON/wire-native the way SGLang's lists were; no in-repo serializer handles them.
- Legacy request-field coercion silently changes meaning: top_logprobs_num dropped, logprob_start_len / token_ids_logprob now hard-error.
None of these block the output-logprobs goal — but if SGLang wire-compat matters (router, eval harnesses, the external OpenAI/smg path), they need a conscious decision (compat shim vs. documented break).
from dataclasses import dataclass

MAX_LOGPROB_TOKEN_IDS = 128
Dead constant — MAX_LOGPROB_TOKEN_IDS is defined here and referenced nowhere on the branch (git grep finds only this line). Either wire it into verify() (e.g. cap len(logprob_token_ids)) or drop it.
# One position maps token_id -> Logprob.
LogprobsOnePosition = dict[int, "Logprob"]
Dead type alias — LogprobsOnePosition is defined and referenced nowhere (git grep finds only this line). _build_positions/build_position_logprobs annotate with the inline dict[int, Logprob] instead. Drop it or actually use it as the return annotation.
accept_lengths: torch.Tensor | None = None

# --- input/prompt logprobs (off-policy); set by _maybe_set_input_logprob_ctx ---
extend_return_logprob: bool = False
Beyond being prompt-only: 8 of these 9 fields are write-only. _maybe_set_input_logprob_ctx assigns ctx.extend_return_logprob, extend_logprob_start_lens_cpu, extend_seq_lens_cpu, extend_logprob_pruned_lens_cpu, top_logprobs_nums, token_ids_logprobs, extend_return_top_logprob, extend_token_ids_logprob — but nothing ever reads them back off ctx (only extend_input_logprob_token_ids_gpu is read, in model_executor ~1651). The logits kernel reads LogitsMetadata.*, and the PR adds no ctx → LogitsMetadata bridge, so these don't even drive the existing prompt kernel. Dead regardless of the prompt-path decision.
_get("output_top_logprobs_idx"),
return_text_in_logprobs,
)
logprobs_info.setdefault("logprobs", []).extend(positions)
SGLang compat (breaking). This replaces the entire SGLang meta_info logprob surface — input_token_logprobs, output_token_logprobs, input_top_logprobs, output_top_logprobs, input_token_ids_logprobs, output_token_ids_logprobs (each a list[(logprob, token_id, decoded_token)]) — with meta_info["logprobs"] / ["prompt_logprobs"] as list[dict[int, Logprob]] (+ cumulative_logprob). Any SGLang-shaped consumer (the external smg_grpc_servicer OpenAI path, SGLang router/eval harnesses, the PD mini_lb merge of input_token_logprobs) reading the old keys silently gets nothing. The in-repo tests were rewritten to the new shape, so CI won't catch this. If SGLang wire-compat is a requirement, consider emitting both key sets during a deprecation window; if it's an intentional break, please call it out in the PR description as an API break.
@dataclass
class Logprob:
SGLang compat (serialization). SGLang returned plain tuples/lists, which JSON-serialize directly. dict[int, Logprob] does not: Logprob is a bare dataclass (no to_dict/__json__), and JSON object keys must be strings (int keys get coerced, changing the shape consumers see). No in-repo path serializes it (the only asdict/json.dumps hits are server_args + PD registry/transfer, unrelated). The OpenAI/HTTP mapping lives in the external smg package, which the PR notes "still needs the matching change" — so served logprobs are effectively unmapped for this shape until that lands. Worth a to_dict() / explicit wire form here so in-process and served outputs agree.
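A rough sketch of the suggested to_dict() / explicit wire form follows. The Logprob fields (logprob, rank, decoded_token) are taken from the PR description; to_dict() and positions_to_wire() are hypothetical helpers, and moving token_id into each entry is just one possible wire layout that avoids int keys being coerced to strings.

```python
from __future__ import annotations

import json
from dataclasses import asdict, dataclass


@dataclass
class Logprob:
    logprob: float
    rank: int | None = None
    decoded_token: str | None = None

    def to_dict(self) -> dict:
        """Plain-dict form that json.dumps accepts directly."""
        return asdict(self)


def positions_to_wire(positions: list[dict[int, Logprob]]) -> list[list[dict]]:
    """Hypothetical wire form: token_id becomes a field, so it stays an int."""
    return [
        [{"token_id": tid, **lp.to_dict()} for tid, lp in pos.items()]
        for pos in positions
    ]


positions = [{42: Logprob(logprob=-0.5, rank=1, decoded_token="hi")}]
wire = json.dumps(positions_to_wire(positions))
```

For comparison, json.dumps({42: 1}) produces an object keyed by the string "42", which is the shape change the comment warns about.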
# Output top-k unsupported -> honor return_logprob as the
# sampled-token logprob; top_logprobs_num is intentionally
# clamped to 0 rather than rejected for back-compat.
logprobs=0 if rl else None,
SGLang compat (silent semantic changes). This legacy translation changes behavior for existing SGLang clients:
- top_logprobs_num is dropped entirely: it isn't even a parameter of _row. A request with return_logprob=True, top_logprobs_num=5 silently returns only the sampled token (logprobs=0), no top-k, no error.
- logprob_start_len >= 0 (the standard SGLang way to ask for prompt logprobs) → prompt_logprobs → verify() raises. Previously-working requests now hard-error.
- token_ids_logprob → logprob_token_ids → verify() raises (even when return_logprob=False, since _row proceeds on tid alone).
- return_logprob=True used to return both prompt and output logprobs; now it's output-only.
return_text_in_logprobs is the one preserved. Suggest at minimum erroring (not silently clamping) on top_logprobs_num > 0 so the data loss is visible, and documenting the prompt-field rejections as a back-compat break.
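The "error instead of clamp" suggestion could look roughly like the sketch below. It is not the PR's current behavior (the PR clamps to logprobs=0), and coerce_legacy is a hypothetical helper: it returns one logprobs value per row for scalar or batched legacy inputs and raises when a row asks for top-k that would otherwise be dropped.

```python
def coerce_legacy(return_logprob, top_logprobs_num=0):
    """Map legacy fields to a per-row logprobs value (None = off, 0 = sampled)."""
    rows = return_logprob if isinstance(return_logprob, list) else [return_logprob]
    tops = (
        top_logprobs_num
        if isinstance(top_logprobs_num, list)
        else [top_logprobs_num] * len(rows)
    )
    if len(tops) != len(rows):
        raise ValueError("return_logprob and top_logprobs_num differ in length")

    out = []
    for i, (want, top) in enumerate(zip(rows, tops)):
        if want and top and top > 0:
            # Fail loudly: top-k alternatives would otherwise be silently lost.
            raise ValueError(
                f"row {i}: top_logprobs_num={top} is not supported; "
                "only the sampled-token logprob is available"
            )
        out.append(0 if want else None)
    return out
```

In this sketch a row with return_logprob=False ignores top_logprobs_num, since nothing was requested for it.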
Design proposal: support SGLang and vLLM logprob formats together
Following up on the compatibility review. The good news is this PR is already most of the way there: both formats are just two views of the same per-position data, and the neutral wire arrays this PR keeps unchanged (
Architecture
Request side: a near 1:1 mapping (PR is close)
The PR's
Gate unsupported capabilities (output top-k
Output side: add a
| Change | File |
|---|---|
| format field; keep render_sglang (old code) + render_vllm (new); dispatch in convert_logprob_style | engine/logprobs.py |
| Set format="sglang" in legacy coercion; carry top_logprobs_num into the count | engine/io_struct.py |
| _APPEND_META_KEYS = union of both key sets | engine/collector.py |
| Logprob.to_dict() so the vLLM shape is JSON-safe; serving layer calls it | engine/logprobs.py + serving boundary |
| Parametrize logprob tests over format ∈ {sglang, vllm} | test/runners.py, test/runtime/... |
Caveats to decide consciously
- rank fidelity (vLLM): vLLM's Logprob.rank is the token's 1-based rank in the full vocab distribution; the PR uses slot index (0=sampled, 1..N). True parity needs the actual vocab rank from the sampler. Orthogonal to this architecture, but don't advertise vLLM-rank parity until it's real.
- logprob_start_len: SGLang allows mid-prompt start; vLLM is whole-prompt. render_sglang honors start_len (slice), render_vllm returns the whole prompt. Doesn't bite now (prompt logprobs deferred); just bake start_len into the neutral representation so it's there when the prompt path lands.
- OpenAI/serving is a third renderer. Point the external smg servicer at the neutral arrays, not at either meta_info dialect, so it never depends on format.
- "both" cost: always-emit-both doubles the logprob payload (heavy for RL/prompt-logprob workloads); keep it opt-in, not the default.
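To make the rank caveat concrete, here is a small pure-Python sketch of a 1-based vocab rank, as opposed to the slot index. vocab_rank is a hypothetical helper operating on a plain list of per-token logprobs; a real sampler would compute this on the GPU tensor. Tie handling (tied tokens share the best rank) is an assumption of this sketch, not a statement about vLLM.

```python
def vocab_rank(logprobs: list, token_id: int) -> int:
    """1-based rank of token_id: 1 + number of tokens with a strictly higher logprob."""
    target = logprobs[token_id]
    return 1 + sum(1 for lp in logprobs if lp > target)


# Toy 3-token vocabulary; token 1 is the most likely.
toy = [-2.0, -0.5, -1.0]
sampled = 2  # the sampled token need not be the most likely one
rank_of_sampled = vocab_rank(toy, sampled)
```

Here the sampled token sits in slot 0 of the PR's layout, yet its vocab rank is 2, which is why the two notions should not be advertised as equivalent.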
Net
Keep the neutral wire arrays as the contract, split convert_logprob_style into two pure renderers, add a format flag defaulting to "match the request dialect," and make Logprob serializable. One compute path, both APIs satisfied, scheduler untouched. This also closes the CI blind spot from the review (tests would then assert both shapes).
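The split into two pure renderers can be sketched as below. The function names (render_sglang, render_vllm, convert_logprob_style) and the meta keys (output_token_logprobs, logprobs) come from the discussion above; the signatures, the decode callback, and the plain-dict stand-in for Logprob are assumptions made to keep the example self-contained.

```python
from __future__ import annotations

from typing import Callable, Optional

Decode = Optional[Callable[[int], str]]


def render_sglang(vals: list[float], idxs: list[int], decode: Decode = None) -> dict:
    """SGLang-style view: one (logprob, token_id, decoded_token) tuple per position."""
    text = decode or (lambda _tid: None)
    return {"output_token_logprobs": [(v, i, text(i)) for v, i in zip(vals, idxs)]}


def render_vllm(vals: list[float], idxs: list[int], decode: Decode = None) -> dict:
    """vLLM-style view: one {token_id: entry} dict per position."""
    text = decode or (lambda _tid: None)
    return {
        "logprobs": [
            # rank stays None: this sketch has no full-vocab rank to report.
            {i: {"logprob": v, "rank": None, "decoded_token": text(i)}}
            for v, i in zip(vals, idxs)
        ]
    }


def convert_logprob_style(vals, idxs, fmt: str = "vllm", decode: Decode = None) -> dict:
    """Dispatch on the requested dialect; "both" is opt-in and doubles the payload."""
    if len(vals) != len(idxs):
        raise ValueError("vals and idxs must have the same length")
    if fmt == "both":
        return {**render_sglang(vals, idxs, decode), **render_vllm(vals, idxs, decode)}
    renderers = {"sglang": render_sglang, "vllm": render_vllm}
    if fmt not in renderers:
        raise ValueError(f"unknown logprob format: {fmt!r}")
    return renderers[fmt](vals, idxs, decode)
```

Because both renderers read the same neutral arrays, a test parametrized over the format only needs one compute path as its fixture.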
Summary
Adds a vLLM-style logprobs API on a dedicated LogprobParams struct and a vLLM-style Logprob output shape, wired end-to-end (Python frontend → C++ scheduler → forward path). SamplingParams stays sampling-only.
This PR is scoped to output (generated-token) logprobs, the surface that is correct today. The prompt-logprob and top-k paths are rejected loudly by LogprobParams.verify() rather than silently returning wrong/partial values, and are left as follow-ups.
The deprecated request fields (return_logprob, logprob_start_len, top_logprobs_num, token_ids_logprob, return_text_in_logprobs) are still translated into LogprobParams for back-compat; requests that map to an unsupported mode get a clear error.
API
LogprobParams (engine/logprob_params.py): verify() is the single gate:
- logprobs: None (off) and 0 (sampled token only). N>0 (output top-k) and -1 (full vocab) are rejected: not yet materialized.
- prompt_logprobs
- logprob_token_ids
- return_text
Output shape (engine/logprobs.py): Logprob{logprob, rank, decoded_token} as dict[int, Logprob] per position, exposed as meta_info["logprobs"] (one entry per generated token).
Speculative decoding
Output logprobs now work under speculative decoding (MTP / EAGLE). Engine.generate / async_generate previously nulled all logprob requests whenever a spec algorithm was set, silently dropping them; the engine actually computes correct, accept-length-aligned output logprobs on the spec verify path (sampler verify() gathers them; the output processor slices them by output_length with spec_num_tokens stride), so that guard is removed and verify() is the gate.
Implementation
- LogprobParams + Logprob; legacy-field coercion in io_struct.py; request plumbing in entrypoints/engine*.py; per-request assembly in generation_output_processor.py / logprobs.py; output-logprob flow through execution/model_executor.py + execution/types.py::ModelExecutionResult + execution/cuda_graph_wrapper.py.
- prompt_logprobs + logprob_token_ids carried on RequestSpec / ForwardOperation / FlatForwardOperation (+ pybind); MatchIntent::SkipRead for prompt-logprob requests. (Prompt path present but gated off behind verify() for now.)
Validation (nv2 B200)
- logprobs=0, Qwen2-1.5B-Instruct: per-token logprobs returned, aligned to generated tokens, within bf16 noise vs HF log_softmax.
- logprobs=0 under MTP, Qwen3.5-397B-A17B-NVFP4 (tp4, trtllm + flashinfer_trtllm, eager + CUDA-graph): finite, ≤0 per-token logprobs (0 → present after removing the spec guard).
- prompt_logprobs, logprob_token_ids, logprobs>0, logprobs=-1 all raise a clear ValueError at the request entrypoint.
- main, Python imports clean, engine inits + generates with aligned output logprobs.
Follow-ups (not in this PR)
- logprobs>0) once the chunked/mixed prompt path and CUDA-graph top-k buffers land; verify() is the single place to relax.
- ts serve → smg_grpc_servicer, external package) maps logprobs separately and still needs the matching change for served logprobs under spec.
- pd/mini_lb.py still merges the old input_token_logprobs meta key).
🤖 Generated with Claude Code
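As an illustration of the accept-length-aligned slicing mentioned under Speculative decoding, the sketch below assumes a flat buffer holding spec_num_tokens logprob slots per request, of which only the first accept_length slots are valid. The buffer layout and the helper name slice_accepted are assumptions for illustration, not the engine's actual data structures.

```python
def slice_accepted(flat_logprobs: list, accept_lengths: list, spec_num_tokens: int) -> list:
    """Return, per request, the logprobs of the tokens that were accepted."""
    if len(flat_logprobs) != len(accept_lengths) * spec_num_tokens:
        raise ValueError("buffer size must equal num_requests * spec_num_tokens")

    per_request = []
    for row, accepted in enumerate(accept_lengths):
        if not 0 <= accepted <= spec_num_tokens:
            raise ValueError(f"row {row}: accept length {accepted} out of range")
        start = row * spec_num_tokens  # fixed stride per request
        per_request.append(flat_logprobs[start : start + accepted])
    return per_request


# Two requests, three verify slots each; the rejected slots are never emitted.
flat = [-0.1, -0.2, -0.3, -0.4, -0.5, -0.6]
aligned = slice_accepted(flat, accept_lengths=[2, 1], spec_num_tokens=3)
```

Each request then gets exactly one logprob per emitted token, which is the alignment property the validation section checks for.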