[Router] Preserve reasoning_content when caching streaming responses#2141

Merged
Xunzhuo merged 3 commits into
vllm-project:mainfrom
theohsiung:fix/streaming-cache-reasoning-content
Jun 17, 2026

@theohsiung

Contributor

What

Caching a streaming reasoning-model response dropped reasoning_content: the streaming accumulator captured only delta.content, so the reconstructed chat.completion written to the cache had no reasoning. A later cache hit then returned a response missing the reasoning the original live stream delivered (the non-streaming cache preserves it, since it stores the raw upstream body).

Fixes #2140.

Fix

  • request_context.go: add StreamingReasoning.
  • processor_res_body_streaming.go: accumulate delta.reasoning_content alongside delta.content.
  • processor_res_cache.go: emit message.reasoning_content in the reconstructed response when non-empty.

reasoning_content is already a recognized field (looper extraction reads choices[].reasoning_content and .message.reasoning_content; memory + anthropic-outbound handle it).
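
The three changes above can be sketched as follows. This is a minimal illustration and not the router's actual code: only the field name StreamingReasoning comes from this PR, while streamState, streamChunk, accumulate, reconstruct, and the cached* types are hypothetical names. The hard-coded finish_reason is also an assumption.

```go
package main

import (
	"encoding/json"
	"strings"
)

// streamState is a hypothetical stand-in for the router's request context.
// Only the name StreamingReasoning is taken from the PR description.
type streamState struct {
	StreamingContent   strings.Builder
	StreamingReasoning strings.Builder
}

// streamChunk models the subset of a chat.completion.chunk payload read here.
type streamChunk struct {
	Choices []struct {
		Delta struct {
			Content          string `json:"content"`
			ReasoningContent string `json:"reasoning_content"`
		} `json:"delta"`
	} `json:"choices"`
}

// accumulate appends delta.content and delta.reasoning_content from one SSE
// data payload. Missing or null fields decode to "" and append nothing.
func (s *streamState) accumulate(payload string) error {
	if payload == "[DONE]" {
		return nil
	}
	var c streamChunk
	if err := json.Unmarshal([]byte(payload), &c); err != nil {
		return err
	}
	for _, ch := range c.Choices {
		s.StreamingContent.WriteString(ch.Delta.Content)
		s.StreamingReasoning.WriteString(ch.Delta.ReasoningContent)
	}
	return nil
}

// cachedMessage omits reasoning_content entirely when no reasoning was streamed.
type cachedMessage struct {
	Role             string `json:"role"`
	Content          string `json:"content"`
	ReasoningContent string `json:"reasoning_content,omitempty"`
}

type cachedChoice struct {
	Index        int           `json:"index"`
	Message      cachedMessage `json:"message"`
	FinishReason string        `json:"finish_reason"`
}

type cachedCompletion struct {
	Object  string         `json:"object"`
	Choices []cachedChoice `json:"choices"`
}

// reconstruct builds the chat.completion body that would be written to the
// cache. finish_reason is fixed to "stop" purely for illustration.
func (s *streamState) reconstruct() (string, error) {
	resp := cachedCompletion{
		Object: "chat.completion",
		Choices: []cachedChoice{{
			Message: cachedMessage{
				Role:             "assistant",
				Content:          s.StreamingContent.String(),
				ReasoningContent: s.StreamingReasoning.String(),
			},
			FinishReason: "stop",
		}},
	}
	b, err := json.Marshal(resp)
	return string(b), err
}
```

Using omitempty on the reasoning field is one way to get the "emit only when non-empty" behavior, so responses from non-reasoning models keep the same shape they had before the fix.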

Test plan

  • processor_res_cache_reasoning_test.go: accumulator captures delta.reasoning_content across chunks; reconstruction includes message.reasoning_content; reconstruction omits the field when no reasoning was streamed. RED before fix → GREEN.
  • go test ./pkg/extproc/ full suite green (no regression).
  • gofmt/go vet clean; golangci-lint (repo config) → 0 issues.

Notes

  • DCO signed-off.
  • Out of scope: multi-choice (n>1) streaming is reconstructed as a single merged choice. That is a distinct fidelity gap, to be addressed in a separate change.
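
To illustrate the multi-choice gap noted above, here is a hedged sketch contrasting a single merged buffer with per-index accumulation. It assumes that "merged" means deltas are concatenated in arrival order regardless of choices[].index. All names (indexedChunk, choiceText, choiceBuf, mergeAll, splitByIndex) are hypothetical, and splitByIndex is only one possible direction for a follow-up, not something this PR implements.

```go
package main

import (
	"encoding/json"
	"sort"
	"strings"
)

// indexedChunk models the streaming-chunk fields that matter for n>1.
type indexedChunk struct {
	Choices []struct {
		Index int `json:"index"`
		Delta struct {
			Content          string `json:"content"`
			ReasoningContent string `json:"reasoning_content"`
		} `json:"delta"`
	} `json:"choices"`
}

// choiceText is the accumulated text for one choice index.
type choiceText struct {
	Index     int
	Content   string
	Reasoning string
}

// choiceBuf holds the growing buffers for one choice index.
type choiceBuf struct {
	content, reasoning strings.Builder
}

// mergeAll ignores choices[].index and appends every delta.content to one
// buffer, which interleaves text from different choices.
func mergeAll(payloads []string) (string, error) {
	var merged strings.Builder
	for _, p := range payloads {
		var c indexedChunk
		if err := json.Unmarshal([]byte(p), &c); err != nil {
			return "", err
		}
		for _, ch := range c.Choices {
			merged.WriteString(ch.Delta.Content)
		}
	}
	return merged.String(), nil
}

// splitByIndex keeps one buffer pair per choice index and returns the
// choices sorted by index.
func splitByIndex(payloads []string) ([]choiceText, error) {
	bufs := map[int]*choiceBuf{}
	for _, p := range payloads {
		var c indexedChunk
		if err := json.Unmarshal([]byte(p), &c); err != nil {
			return nil, err
		}
		for _, ch := range c.Choices {
			b, ok := bufs[ch.Index]
			if !ok {
				b = &choiceBuf{}
				bufs[ch.Index] = b
			}
			b.content.WriteString(ch.Delta.Content)
			b.reasoning.WriteString(ch.Delta.ReasoningContent)
		}
	}
	out := make([]choiceText, 0, len(bufs))
	for idx, b := range bufs {
		out = append(out, choiceText{
			Index:     idx,
			Content:   b.content.String(),
			Reasoning: b.reasoning.String(),
		})
	}
	sort.Slice(out, func(i, j int) bool { return out[i].Index < out[j].Index })
	return out, nil
}
```

With deltas for index 0 and index 1 arriving alternately, mergeAll yields one interleaved string, whereas splitByIndex recovers each choice's text separately.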

@netlify

netlify Bot commented Jun 10, 2026


Deploy Preview for vllm-semantic-router ready!

Name Link
🔨 Latest commit c719ae7
🔍 Latest deploy log https://app.netlify.com/projects/vllm-semantic-router/deploys/6a2d883081eea70008021e23
😎 Deploy Preview https://deploy-preview-2141--vllm-semantic-router.netlify.app

@github-actions

github-actions Bot commented Jun 10, 2026

Contributor

👥 vLLM Semantic Team Notification

The following members have been identified for the changed files in this PR and have been automatically assigned:

📁 src/semantic-router

Owners: @rootfs, @Xunzhuo, @szedan-rh, @yehuditkerido, @abdallahsamabd, @asaadbalum, @liavweiss, @noalimoy
Files changed:

  • src/semantic-router/pkg/extproc/processor_res_body_streaming.go
  • src/semantic-router/pkg/extproc/processor_res_cache.go
  • src/semantic-router/pkg/extproc/processor_res_cache_reasoning_test.go
  • src/semantic-router/pkg/extproc/request_context.go

🎉 Thanks for your contributions!

This comment was automatically generated based on the OWNER files in the repository.

@github-actions

github-actions Bot commented Jun 10, 2026

Contributor

✅ Supply Chain Security Report — All Clear

Scanner | Status | Findings
AST Codebase Scan (Py, Go, JS/TS, Rust) | | 19 finding(s) (MEDIUM: 12 · LOW: 7)
AST PR Diff Scan | | No issues detected
Regex Fallback Scan | | No issues detected

Scanned at 2026-06-13T16:42:17.542Z · View full workflow logs

The streaming accumulator only captured delta.content; the reconstructed
chat.completion written to the semantic cache therefore dropped the reasoning
that reasoning models stream under delta.reasoning_content. A later cache hit
for a semantically-similar request then returned a response WITHOUT the
reasoning the original live stream delivered — a silent fidelity loss (the
non-streaming cache preserves it because it stores the raw upstream body).

Accumulate delta.reasoning_content into ctx.StreamingReasoning and include it
as message.reasoning_content in the reconstructed response when present.
reasoning_content is already a recognized field elsewhere (looper extraction,
memory, anthropic outbound). Absent when no reasoning was streamed.

Signed-off-by: theohsiung <theobear870924@gmail.com>
@theohsiung theohsiung force-pushed the fix/streaming-cache-reasoning-content branch from 82dc59a to dbb3989 Compare June 10, 2026 23:45
@AayushSaini101 AayushSaini101 self-requested a review June 13, 2026 10:16
@theohsiung

Contributor Author

Hi @AayushSaini101, appreciate you spending your weekend time on this! hahaha 🥹

@Xunzhuo Xunzhuo left a comment

Member

LGTM

@Xunzhuo Xunzhuo merged commit a78a618 into vllm-project:main Jun 17, 2026
33 checks passed
wilsonwu pushed a commit to wilsonwu/semantic-router that referenced this pull request Jun 18, 2026
…llm-project#2141)

Signed-off-by: theohsiung <theobear870924@gmail.com>
Co-authored-by: Moderator <60972989+AayushSaini101@users.noreply.github.com>
Signed-off-by: Wilson Wu <iwilsonwu@gmail.com>
Development

Successfully merging this pull request may close these issues.

[Router] Streaming semantic cache drops reasoning_content (cache hit returns less than the live stream)

10 participants