Skip to content

feat(logprobs): vLLM-style output logprobs (LogprobParams), spec-decode support#337

Open
HJSang wants to merge 1 commit into
mainfrom
hejian/vllm-logprobs
Open

feat(logprobs): vLLM-style output logprobs (LogprobParams), spec-decode support#337
HJSang wants to merge 1 commit into
mainfrom
hejian/vllm-logprobs

Conversation

@HJSang

@HJSang HJSang commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator

Summary

Adds a vLLM-style logprobs API on a dedicated LogprobParams struct and a vLLM-style Logprob output shape, wired end-to-end (Python frontend → C++ scheduler → forward path). SamplingParams stays sampling-only.

This PR is scoped to output (generated-token) logprobs — the surface that is correct today. The prompt-logprob and top-k paths are rejected loudly by LogprobParams.verify() rather than silently returning wrong/partial values, and are left as follow-ups.

The deprecated request fields (return_logprob, logprob_start_len, top_logprobs_num, token_ids_logprob, return_text_in_logprobs) are still translated into LogprobParams for back-compat; requests that map to an unsupported mode get a clear error.

API

LogprobParams (engine/logprob_params.py) — verify() is the single gate:

field status
logprobs honored for None (off) and 0 (sampled token only). N>0 (output top-k) and -1 (full vocab) are rejected — not yet materialized.
prompt_logprobs rejected — prompt path is only correct for single-chunk pure-extend prefill; chunked / mixed / prefix-cache paths would be silently wrong.
logprob_token_ids rejected — only feeds the prompt/extend path.
return_text decode each returned token to text.

Output shape (engine/logprobs.py): Logprob{logprob, rank, decoded_token} as dict[int, Logprob] per position, exposed as meta_info["logprobs"] (one entry per generated token).

Speculative decoding

Output logprobs now work under speculative decoding (MTP / EAGLE). Engine.generate / async_generate previously nulled all logprob requests whenever a spec algorithm was set, silently dropping them; the engine actually computes correct, accept-length-aligned output logprobs on the spec verify path (sampler verify() gathers them; the output processor slices them by output_length with spec_num_tokens stride), so that guard is removed and verify() is the gate.

Implementation

  • Python: LogprobParams + Logprob; legacy-field coercion in io_struct.py; request plumbing in entrypoints/engine*.py; per-request assembly in generation_output_processor.py / logprobs.py; output-logprob flow through execution/model_executor.py + execution/types.py::ModelExecutionResult + execution/cuda_graph_wrapper.py.
  • C++ scheduler: prompt_logprobs + logprob_token_ids carried on RequestSpec / ForwardOperation / FlatForwardOperation (+ pybind); MatchIntent::SkipRead for prompt-logprob requests. (Prompt path present but gated off behind verify() for now.)

Validation (nv2 B200)

  • Output logprobs=0 — Qwen2-1.5B-Instruct: per-token logprobs returned, aligned to generated tokens, within bf16 noise vs HF log_softmax.
  • Output logprobs=0 under MTP — Qwen3.5-397B-A17B-NVFP4 (tp4, trtllm + flashinfer_trtllm, eager + CUDA-graph): finite, ≤0 per-token logprobs (0 → present after removing the spec guard).
  • Rejectionsprompt_logprobs, logprob_token_ids, logprobs>0, logprobs=-1 all raise a clear ValueError at the request entrypoint.
  • Rebased branch build — scheduler C++ recompiles on top of current main, Python imports clean, engine inits + generates with aligned output logprobs.

Follow-ups (not in this PR)

  • Re-enable prompt logprobs + output top-k (logprobs>0) once the chunked/mixed prompt path and CUDA-graph top-k buffers land; verify() is the single place to relax.
  • OpenAI HTTP serving path (ts servesmg_grpc_servicer, external package) maps logprobs separately and still needs the matching change for served logprobs under spec.
  • Minor spec-path output-logprob alignment residual observed on the 397B (≈48 logprobs for 53 tokens — boundary/bonus tokens).
  • PD disaggregation (pd/mini_lb.py still merges the old input_token_logprobs meta key).

🤖 Generated with Claude Code

@HJSang HJSang requested a review from a team as a code owner June 2, 2026 03:38

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 574f1eec93

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/engine/generation_output_processor.py
@lightseek-bot lightseek-bot requested a review from qywu June 2, 2026 03:42

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a392acbff6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/engine/logprobs.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 05a5437e5d

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/engine/generation_output_processor.py Outdated
Comment thread python/tokenspeed/runtime/execution/model_executor.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: ca28d25dc2

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/execution/model_executor.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 64b6500f69

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/engine/logprob_params.py Outdated

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 04b1532a67

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/engine/input_processor.py
@HJSang

HJSang commented Jun 2, 2026

Copy link
Copy Markdown
Collaborator Author

Codex review triage

  • Prompt-only requests return empty & prompt top-k hardcoded empty — already fixed by the Phase B / top-k routing commits; _maybe_set_input_logprob_ctx sets extend_return_logprob independently and prompt top-k is now routed. Validated on B200 (prompt_logprobs=0 and =5 match HF).
  • logprobs/prompt_logprobs = -1 (full vocab) — was accepted by verify() but nothing materializes a vocab-sized result. Now rejected loudly until implemented (04b1532).
  • OUTPUT top-k (logprobs=N>0) / output logprob_token_idsdeferred (5e00c8c, documented at the hardcoded site). Output logprobs flow through the captured CUDA-graph decode path, so top-k must be computed in the sampler and captured into the graph output buffers — a separate milestone. Until then logprobs=N>0 returns the sampled token's logprob (correct value, no alternatives).
  • Mixed prefill/decode batches skip prompt logprobs — known pure-extend limitation (num_extends==bs); the input-path per-request slicing assumes no interleaved decode rows. Documented follow-up.

Validated end-to-end on nv2 B200 vs HF log_softmax: output logprobs=0 (2 models), prompt prompt_logprobs=0 (vLLM convention, mean Δ≈0.008–0.05), prompt top-k prompt_logprobs=5, prompt logprob_token_ids (24/24 ids match), and prefix-cache correctness (SkipRead).

@HJSang HJSang force-pushed the hejian/vllm-logprobs branch 2 times, most recently from 0d1f63a to 1c0ba37 Compare June 2, 2026 16:15

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1c0ba37ec9

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/entrypoints/engine.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 3c58b71336

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/execution/model_executor.py
@HJSang HJSang force-pushed the hejian/vllm-logprobs branch from 3c58b71 to 9c61763 Compare June 2, 2026 17:32

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 9c61763179

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/engine/generation_output_processor.py
@HJSang HJSang force-pushed the hejian/vllm-logprobs branch 2 times, most recently from 58a21fd to d2ee343 Compare June 2, 2026 21:06

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: d2ee343cf6

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/engine/io_struct.py Outdated
@HJSang HJSang force-pushed the hejian/vllm-logprobs branch from d2ee343 to 6dbbb87 Compare June 2, 2026 21:20

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 6dbbb87031

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/engine/io_struct.py Outdated
@HJSang HJSang force-pushed the hejian/vllm-logprobs branch from 6dbbb87 to b58a7ed Compare June 3, 2026 03:37

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: b58a7ed3bf

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/engine/collector.py Outdated
@HJSang HJSang force-pushed the hejian/vllm-logprobs branch 2 times, most recently from cc8bb2a to c774bf0 Compare June 3, 2026 04:27

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c774bf01ac

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/engine/logprobs.py
@qywu

qywu commented Jun 5, 2026

Copy link
Copy Markdown
Collaborator

prompt_logprobs is only correct on the single-chunk path — worth making the unsupported cases loud (or scoping them out).

Verified locally: prompt_logprobs works for a single-chunk, pure-extend prefill (56-token prompt → 56 positions, [0]=None, values line up). But model_executor._maybe_set_input_logprob_ctx activates only for pure-extend batches and hardcodes start_lens=[0] with no cross-chunk offset — so a prompt that gets chunked (prompt_tokens > chunked_prefill_size, default 8192) or lands in a mixed extend+decode batch silently falls outside the validated path. SkipRead handles prefix-cache reuse but not chunking. The PR already lists this under follow-ups; the concern is it's currently silent rather than loud.

Two ways to make that safe:

  1. Keep prompt_logprobs, add a guard — raise NotImplementedError (or reject in LogprobParams.verify()) when prompt_tokens > chunked_prefill_size / mixed batch / start_len > 0, until the chunked path lands. Preserves the common-case feature + vLLM parity, makes the edges safe.
  2. Scope this PR to output logprobs only — reject prompt_logprobs in verify() with a clear "not supported yet" message and defer the prompt path (incl. the SkipRead recompute) to a follow-up. Smaller, fully-correct surface.

I'd lean (1), since the single-chunk path is already validated and prompt_logprobs matters for vLLM parity — but (2) is reasonable if it isn't needed yet. Either beats silently-wrong logprobs on long/chunked prompts.

Caveat: I verified single-chunk directly; I couldn't run a clean chunked repro because chunked prefill crashes independently on my older test base (unrelated to this PR), so I can't say whether the real-base chunked behavior is silently-wrong vs. an error.

Comment thread python/tokenspeed/runtime/engine/logprob_params.py

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a0a2784efc

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread python/tokenspeed/runtime/engine/io_struct.py Outdated
@qywu

qywu commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator

there is merge conflict for python/tokenspeed/runtime/layers/logits_processor.py

@HJSang HJSang force-pushed the hejian/vllm-logprobs branch from 670156e to 08360f7 Compare June 8, 2026 04:08
@HJSang HJSang changed the title feat(logprobs): vLLM-style logprob API (LogprobParams + Logprob output shape) feat(logprobs): vLLM-style output logprobs (LogprobParams), spec-decode support Jun 8, 2026
@HJSang HJSang force-pushed the hejian/vllm-logprobs branch from 08360f7 to 035f19e Compare June 8, 2026 04:28
@HJSang

HJSang commented Jun 8, 2026

Copy link
Copy Markdown
Collaborator Author

there is merge conflict for python/tokenspeed/runtime/layers/logits_processor.py

resolved

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 035f19e5b7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

# Output top-k unsupported -> honor return_logprob as the
# sampled-token logprob; top_logprobs_num is intentionally
# clamped to 0 rather than rejected for back-compat.
logprobs=0 if rl else None,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve legacy top-logprobs results

Fresh evidence in the current diff is that the compatibility shim now always maps legacy return_logprob=True, top_logprobs_num=N requests to LogprobParams(logprobs=0), so callers that previously requested top-N alternatives no longer fail loudly but silently receive only the sampled token's logprob. This affects existing clients and in-repo callers that still use the deprecated fields for experiment scoring/comparison, because their result shape/content is downgraded even though the request is accepted.

Useful? React with 👍 / 👎.

…de support

Add a dedicated LogprobParams request struct and a vLLM-style Logprob output
shape (per-position {token_id: Logprob}), kept separate from SamplingParams.

Scope to OUTPUT logprobs only for now: LogprobParams.verify() is the single
gate and loudly rejects the not-yet-correct surface — prompt_logprobs and
logprob_token_ids (prompt path is only valid for single-chunk pure-extend
prefill; chunked/mixed/prefix-cache paths would be silently wrong), output
top-k (logprobs>0, only the sampled token is materialized), and full-vocab
(-1). Only logprobs in {None, 0} are honored. GPU parity runner updated to
request logprobs=0.

Also enable output logprobs under speculative decoding: Engine.generate /
async_generate previously nulled all logprob requests whenever a spec
algorithm was set, silently dropping them. The engine computes correct,
accept-length-aligned output logprobs on the spec verify path, so the guard
was overly conservative; remove it and rely on verify() as the gate.

Back-compat / streaming hardening:
- Legacy field coercion (io_struct) now builds a per-row LogprobParams for
  batched list inputs (e.g. return_logprob=[False, True]) instead of
  collapsing to row 0, and clamps legacy top_logprobs_num>0 to the
  sampled-token logprob (logprobs=0) rather than erroring.
- RequestOutputCollector now sums cumulative_logprob across coalesced
  streamed frames (each frame's value is a per-frame delta) so it stays
  consistent with the appended per-position logprobs.

Validated on B200 (Qwen2-1.5B and Qwen3.5-397B-A17B-NVFP4 MTP/tp4): output
logprobs=0 returns finite, <=0 per-token logprobs; prompt/top-k surface
rejected at the request entrypoint. Rebased branch rebuilds (scheduler C++ +
Python import + engine run) on top of current main.

Note: the OpenAI HTTP serving path (ts serve -> smg_grpc_servicer) maps
logprobs in a separate, out-of-repo package and needs the matching change
there; this covers the in-repo Engine/SDK path.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: Hejian Sang <sanghj0923@gmail.com>
@HJSang HJSang force-pushed the hejian/vllm-logprobs branch from 035f19e to 740d514 Compare June 12, 2026 03:00

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 740d514720

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines +144 to +146
logprobs_info["cumulative_logprob"] = logprobs_info.get(
"cumulative_logprob", 0.0
) + self._sum_slot0(out_sampled_val)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Keep cumulative_logprob cumulative for streams

When stream=True and the client consumes each frame promptly, OutputProcessor passes a fresh logprobs_info dict for every frame, so this assignment reports only the current frame's logprob sum rather than the cumulative sum over the generated prefix. The collector now sums deltas only when multiple frames are coalesced before a read, which means fast streaming clients see cumulative_logprob reset per chunk while slow clients see a larger merged value for the same tokens; store the running scalar on the request state or emit a clearly named delta instead.

Useful? React with 👍 / 👎.

@lightseek-bot lightseek-bot requested a review from qywu June 12, 2026 04:47

@qywu qywu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Requesting changes — keep this PR scoped to the functional surface (output logprobs).

The entire tokenspeed-scheduler/ C++ diff exists only to support the prompt-logprob path, which LogprobParams.verify() rejects in this PR. On the output-only path request->PromptLogprobs() is always -1, so MatchIntent::SkipRead never fires and the new prompt_logprobs / logprob_token_ids fields just carry -1/empty — it's dead code until the follow-up.

Could we drop the scheduler changes here and land them with the prompt-logprob follow-up, where they're actually exercised and testable? Output logprobs are computed entirely Python-side (model_executor / sampler verify() / output processor), so they need no scheduler change.

If removed, the coupled Python plumbing has to go too so nothing references the removed pybind fields:

  • scheduler_utils.py::make_spec — the prompt_logprobs / logprob_token_ids params + spec.* assignments
  • request_handler.py — the lp.num_prompt_logprobs() / logprob_token_ids -> make_spec(...) wiring
  • the C++ side: python_module.cpp, request_spec.h, request.{h,cpp}, forward.{h,cpp}, types.h, and the prefix-cache SkipRead handling

(Keeping them is defensible too — harmless no-op, saves a recompile later. But if the goal is a minimal, fully-exercised PR, splitting the scheduler work into the follow-up matches each diff to what it enables.)

@qywu qywu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Following up on the request-changes review with the specific prompt-logprob code to remove so this PR is scoped to output logprobs only. The output-logprobs API refactor stays (the LogprobParams/Logprob types, logprob_params plumbing in io_struct/input_processor/output_processor/entrypoints, the num_out path in logprobs.py, and the output-logprob tests). Each inline note marks code that exists solely for the prompt/input-logprob path — currently unreachable because LogprobParams.verify() rejects prompt_logprobs/logprob_token_ids, so it's untested dead code until the follow-up.

Grouped: scheduler-coupled (request_handler.py, scheduler_utils.py) · execution path (context.py, types.py, cuda_graph_wrapper.py, model_executor.py) · output-processor accumulation/shipping (generation_output_processor.py, collector.py) · assembly (logprobs.py prompt branch) · API/guard (logprob_params.py, io_struct.py legacy mapping) · tests (runners.py).

"input_token_ids_logprobs",
"output_token_ids_logprobs",
"logprobs",
"prompt_logprobs",

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

prompt_logprobs is the only prompt-specific entry in this merge policy — logprobs and cumulative_logprob are output-side and stay. Drop prompt_logprobs from _APPEND_META_KEYS when the prompt path is removed (the engine never emits a prompt_logprobs meta key once verify() rejects it).

token_ids_logprob: list[int] | None = None,
multimodal_inputs=None,
prompt_input_ids_unpadded: list[int] | None = None,
return_prompt_logprob: bool = False,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

return_prompt_logprob and every accumulator it gates (input_token_logprobs_val/idx, input_top_logprobs_val/idx, input_token_ids_logprobs_val/idx, sent_prompt_logprob_offset — lines 112–135) are prompt-path only. Remove them, plus the return_prompt_logprob derivation/passing in from_recv_req (~189, ~201).

if model_execution_results.output_logprobs is not None
else None
)
_input_lp = getattr(model_execution_results, "input_token_logprobs", None)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This whole block through ~line 670 is prompt-path only: it reads ModelExecutionResult.input_* fields and forward_op.prompt_logprobs (both removed with the scheduler change), the ilp_pt accumulator, and the per-request seg_val/seg_idx/pl_req accumulation into request_state.input_*. Remove.

output_token_logprobs_val.append([])
output_token_logprobs_idx.append([])

# Prompt/input-token logprobs: ship the un-shipped tail of the

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The prompt-logprob shipping block (937–978) and its wiring into BatchStrOut (999–1015) should revert to the prior empties: input_token_logprobs_val=[], input_top_logprobs_val=[], input_token_ids_logprobs_val=[] (idx likewise). The output_* fields and the output-top-k TODO (1005–1011) stay.

"cumulative_logprob", 0.0
) + self._sum_slot0(out_sampled_val)

if num_prompt is not None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if num_prompt is not None: … ~167 builds the prompt_logprobs meta entry — prompt path only; remove. The output block (num_out), _build_positions, build_position_logprobs, and the Logprob dataclass stay. Note: with prompt and output-token-id logprobs both deferred, want_token_ids is always False, so the tid_* args + the fold-in loop in _build_positions (196–211) are dead for this PR too.

),
)

def _maybe_set_input_logprob_ctx(self, ctx, forward_op, bs, num_extends):

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

_maybe_set_input_logprob_ctx (672–707) is entirely the prompt-logprob activation path: it reads forward_op.prompt_logprobs / logprob_token_ids and sets the ctx.extend_* fields. Remove the method and its call site at ~1474.

),
**mamba_kwargs,
)
if _input_lp_bundle is None:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The _input_lp_bundle unpack (1547–1551, 1576–1589), the input_*=None inits (1416–1420), the CPU copy (1646–1656), and the ModelExecutionResult(input_*=…) kwargs (1665–1670) are all prompt path. Remove with the 4-tuple revert above.

aux_hidden_states: torch.Tensor | None = None,
) -> LogitsProcessorOutput:
# Get the last hidden states and last logits for the next token prediction
# NOTE: ``extend_return_logprob`` is only ever set True once the

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment-only change. With the prompt path removed, extend_return_logprob is always False so this branch is always taken — the added 'Phase B' NOTE can be dropped (or reduced to a one-liner without the prompt-path framing).

Comment thread test/runners.py
# OUTPUT top-k logprobs are deferred (not produced yet), so only the
# prompt (prefill) top-k and prompt token-id logprobs are extracted;
# output top-k is left empty and skipped by check_close_model_outputs.
prompt_lp = response["meta_info"].get("prompt_logprobs") or []

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Heads-up: once the prompt path is gone, meta_info has no prompt_logprobs key, so prompt_positions is always []. That makes the prefill-logprob comparison in check_close_model_outputs (704–708, n_common=0) and the prompt-token-id comparison (750–781) vacuous — they pass trivially. Better to drop the prompt/prefill comparisons and keep this runner asserting only the output (decode, logprobs=0) path, so the test actually exercises what the PR ships.

tid = list(tid) if tid else None
if not (rl or tid):
return None
want_prompt = isinstance(lsl, int) and lsl >= 0

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For an output-only PR the legacy coercion only needs return_logprob → logprobs=0. want_prompt + prompt_logprobs=… (145, 151) and logprob_token_ids=tid (152) map to features verify() rejects, so a legacy caller passing logprob_start_len>=0 or token_ids_logprob would now hard-error. Either trim these mappings, or keep them and accept the reject — but call out the back-compat behavior change.

@qywu qywu left a comment

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Two follow-ups in one review: (1) the net-new redundancy items from the offline pass, and (2) a careful SGLang logprobs compatibility check (this codebase is SGLang-derived).

Redundancy: two dead symbols (MAX_LOGPROB_TOKEN_IDS, LogprobsOnePosition) and the write-only ForwardContext fields.

SGLang compatibility — three concerns, all verified against main and the PR branch:

  1. The public meta_info logprob keys are renamed + reshaped (six SGLang tuple-lists → logprobs/prompt_logprobs dict-of-Logprob). Breaking for any SGLang-shaped consumer; CI won't catch it since the tests were rewritten.
  2. Logprob (dataclass) + int-keyed dicts aren't JSON/wire-native the way SGLang's lists were; no in-repo serializer handles them.
  3. Legacy request-field coercion silently changes meaning: top_logprobs_num dropped, logprob_start_len/token_ids_logprob now hard-error.

None of these block the output-logprobs goal — but if SGLang wire-compat matters (router, eval harnesses, the external OpenAI/smg path), they need a conscious decision (compat shim vs. documented break).


from dataclasses import dataclass

MAX_LOGPROB_TOKEN_IDS = 128

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead constant — MAX_LOGPROB_TOKEN_IDS is defined here and referenced nowhere on the branch (git grep finds only this line). Either wire it into verify() (e.g. cap len(logprob_token_ids)) or drop it.



# One position maps token_id -> Logprob.
LogprobsOnePosition = dict[int, "Logprob"]

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dead type alias — LogprobsOnePosition is defined and referenced nowhere (git grep finds only this line). _build_positions/build_position_logprobs annotate with the inline dict[int, Logprob] instead. Drop it or actually use it as the return annotation.

accept_lengths: torch.Tensor | None = None

# --- input/prompt logprobs (off-policy); set by _maybe_set_input_logprob_ctx ---
extend_return_logprob: bool = False

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Beyond being prompt-only: 8 of these 9 fields are write-only. _maybe_set_input_logprob_ctx assigns ctx.extend_return_logprob, extend_logprob_start_lens_cpu, extend_seq_lens_cpu, extend_logprob_pruned_lens_cpu, top_logprobs_nums, token_ids_logprobs, extend_return_top_logprob, extend_token_ids_logprob — but nothing ever reads them back off ctx (only extend_input_logprob_token_ids_gpu is read, in model_executor ~1651). The logits kernel reads LogitsMetadata.*, and the PR adds no ctx → LogitsMetadata bridge, so these don't even drive the existing prompt kernel. Dead regardless of the prompt-path decision.

_get("output_top_logprobs_idx"),
return_text_in_logprobs,
)
logprobs_info.setdefault("logprobs", []).extend(positions)

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGLang compat (breaking). This replaces the entire SGLang meta_info logprob surface — input_token_logprobs, output_token_logprobs, input_top_logprobs, output_top_logprobs, input_token_ids_logprobs, output_token_ids_logprobs (each a list[(logprob, token_id, decoded_token)]) — with meta_info["logprobs"] / ["prompt_logprobs"] as list[dict[int, Logprob]] (+ cumulative_logprob). Any SGLang-shaped consumer (the external smg_grpc_servicer OpenAI path, SGLang router/eval harnesses, the PD mini_lb merge of input_token_logprobs) reading the old keys silently gets nothing. The in-repo tests were rewritten to the new shape, so CI won't catch this. If SGLang wire-compat is a requirement, consider emitting both key sets during a deprecation window; if it's an intentional break, please call it out in the PR description as an API break.



@dataclass
class Logprob:

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGLang compat (serialization). SGLang returned plain tuples/lists, which JSON-serialize directly. dict[int, Logprob] does not: Logprob is a bare dataclass (no to_dict/__json__), and JSON object keys must be strings (int keys get coerced, changing the shape consumers see). No in-repo path serializes it (the only asdict/json.dumps hits are server_args + PD registry/transfer, unrelated). The OpenAI/HTTP mapping lives in the external smg package, which the PR notes "still needs the matching change" — so served logprobs are effectively unmapped for this shape until that lands. Worth a to_dict() / explicit wire form here so in-process and served outputs agree.

# Output top-k unsupported -> honor return_logprob as the
# sampled-token logprob; top_logprobs_num is intentionally
# clamped to 0 rather than rejected for back-compat.
logprobs=0 if rl else None,

Copy link
Copy Markdown
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGLang compat (silent semantic changes). This legacy translation changes behavior for existing SGLang clients:

  • top_logprobs_num is dropped entirely — it isn't even a parameter of _row. A request with return_logprob=True, top_logprobs_num=5 silently returns only the sampled token (logprobs=0), no top-k, no error.
  • logprob_start_len >= 0 (the standard SGLang way to ask for prompt logprobs) → prompt_logprobsverify() raises. Previously-working requests now hard-error.
  • token_ids_logproblogprob_token_idsverify() raises (even when return_logprob=False, since _row proceeds on tid alone).
  • return_logprob=True used to return both prompt and output logprobs; now it's output-only.

return_text_in_logprobs is the one preserved. Suggest at minimum erroring (not silently clamping) on top_logprobs_num > 0 so the data loss is visible, and documenting the prompt-field rejections as a back-compat break.

@qywu

qywu commented Jun 12, 2026

Copy link
Copy Markdown
Collaborator

Design proposal: support SGLang and vLLM logprob formats together

Following up on the compatibility review. The good news is this PR is already most of the way there — both formats are just two views of the same per-position data, and the neutral wire arrays this PR keeps unchanged (*_token_logprobs_val/idx, *_top_logprobs_val/idx, *_token_ids_logprobs_val/idx) are already a superset of what either dialect needs. So we can serve both with one compute path, two renderers, chosen per request — no scheduler involvement, no double work.

Architecture

scheduler/model ──► neutral wire arrays (BatchStrOut: *_val / *_idx)   [UNCHANGED]
                              │
                    convert_logprob_style(params, recv_obj)            ◄── single dispatch point
                       │                         │
                 render_sglang()           render_vllm()
                       │                         │
   meta_info["output_token_logprobs"]    meta_info["logprobs"]
   meta_info["output_top_logprobs"]      meta_info["prompt_logprobs"]
   ... (6 tuple-list keys)               ... (dict[int, Logprob] + cumulative_logprob)

render_sglang is essentially the pre-PR convert_logprob_style (tuple-lists); render_vllm is the new code this PR already wrote. Both read the same neutral arrays.

Request side — a near 1:1 mapping (PR is close)

The PR's _coerce_legacy_logprob_fields just needs to be lossless instead of dropping fields:

SGLang request vLLM LogprobParams Note
return_logprob=True, top_logprobs_num=0 logprobs=0 sampled token only
return_logprob=True, top_logprobs_num=N logprobs=N currently dropped — should map N
logprob_start_len>=0 (+ top_logprobs_num) prompt_logprobs=N start_len is the extra SGLang knob (caveat 2)
token_ids_logprob=[...] logprob_token_ids=[...] direct
return_text_in_logprobs return_text already preserved

Gate unsupported capabilities (output top-k N>0, prompt logprobs) as capabilities — not by silently clamping the API shape.

Output side — add a format flag, dispatch on it

@dataclass
class LogprobParams:
    logprobs: int | None = None
    prompt_logprobs: int | None = None
    logprob_token_ids: list[int] | None = None
    return_text: bool = False
    format: Literal["vllm", "sglang", "both"] = "vllm"   # new

Selection policy: the response matches the request dialect.

  • Built via legacy fields → format="sglang" → tuple-list keys (back-compat preserved exactly).
  • Built via logprob_params directly → format="vllm" (default) → dict[int, Logprob].
  • "both" available for migration/testing (opt-in; it doubles payload).

This gives full SGLang back-compat with zero ambiguity, computed once. collector._APPEND_META_KEYS becomes the union of both dialects' list keys (only one dialect is present per request, so the union is harmless).

Where each change lands

Change File
format field; keep render_sglang (old code) + render_vllm (new); dispatch in convert_logprob_style engine/logprobs.py
Set format="sglang" in legacy coercion; carry top_logprobs_num into the count engine/io_struct.py
_APPEND_META_KEYS = union of both key sets engine/collector.py
Logprob.to_dict() so the vLLM shape is JSON-safe; serving layer calls it engine/logprobs.py + serving boundary
Parametrize logprob tests over format ∈ {sglang, vllm} test/runners.py, test/runtime/...

Caveats to decide consciously

  1. rank fidelity (vLLM): vLLM's Logprob.rank is the token's 1-based rank in the full vocab distribution; the PR uses slot index (0=sampled, 1..N). True parity needs the actual vocab rank from the sampler. Orthogonal to this architecture, but don't advertise vLLM-rank parity until it's real.
  2. logprob_start_len: SGLang allows mid-prompt start; vLLM is whole-prompt. render_sglang honors start_len (slice), render_vllm returns the whole prompt. Doesn't bite now (prompt logprobs deferred) — just bake start_len into the neutral representation so it's there when the prompt path lands.
  3. OpenAI/serving is a third renderer. Point the external smg servicer at the neutral arrays, not at either meta_info dialect, so it never depends on format.
  4. "both" cost: always-emit-both doubles the logprob payload (heavy for RL/prompt-logprob workloads) — keep it opt-in, not the default.

Net

Keep the neutral wire arrays as the contract, split convert_logprob_style into two pure renderers, add a format flag defaulting to "match the request dialect," and make Logprob serializable. One compute path, both APIs satisfied, scheduler untouched. This also closes the CI blind spot from the review (tests would then assert both shapes).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants