[Router] Preserve reasoning_content when caching streaming responses #2141
Merged
Xunzhuo merged 3 commits on Jun 17, 2026
The streaming accumulator only captured delta.content; the reconstructed chat.completion written to the semantic cache therefore dropped the reasoning that reasoning models stream under delta.reasoning_content. A later cache hit for a semantically similar request then returned a response WITHOUT the reasoning the original live stream delivered, a silent fidelity loss (the non-streaming cache preserves it because it stores the raw upstream body).

Accumulate delta.reasoning_content into ctx.StreamingReasoning and include it as message.reasoning_content in the reconstructed response when present. reasoning_content is already a recognized field elsewhere (looper extraction, memory, anthropic outbound). The field is omitted when no reasoning was streamed.

Signed-off-by: theohsiung <theobear870924@gmail.com>
82dc59a to dbb3989
Hi @AayushSaini101, appreciate you spending your weekend time on this! hahaha 🥹
wilsonwu pushed a commit to wilsonwu/semantic-router that referenced this pull request on Jun 18, 2026
…llm-project#2141)
Co-authored-by: Moderator <60972989+AayushSaini101@users.noreply.github.com>
Signed-off-by: Wilson Wu <iwilsonwu@gmail.com>

What
Caching a streaming reasoning-model response dropped reasoning_content: the streaming accumulator captured only delta.content, so the reconstructed chat.completion written to the cache had no reasoning. A later cache hit then returned a response missing the reasoning the original live stream delivered (the non-streaming cache preserves it, since it stores the raw upstream body).

Fixes #2140.
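
To make the failure mode concrete, here is a minimal sketch of an accumulator that reads both delta fields from each streamed chunk. It is not the router's code: streamChunk, streamState, StreamingContent and accumulate are hypothetical names for illustration, and only StreamingReasoning is a name taken from this PR. The chunk shape assumes an OpenAI-style chat.completion.chunk payload.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// streamChunk mirrors only the parts of an OpenAI-style streaming chunk
// that this sketch needs (assumed shape, not the router's own type).
type streamChunk struct {
	Choices []struct {
		Delta struct {
			Content          string `json:"content"`
			ReasoningContent string `json:"reasoning_content"`
		} `json:"delta"`
	} `json:"choices"`
}

// streamState is a hypothetical stand-in for the request context.
// StreamingReasoning is the field name used in the PR; the rest is assumed.
type streamState struct {
	StreamingContent   strings.Builder
	StreamingReasoning strings.Builder
}

// accumulate parses one SSE "data:" JSON payload and appends both delta
// fields. Reading only Delta.Content here would reproduce the reported bug.
func (s *streamState) accumulate(payload string) error {
	var chunk streamChunk
	if err := json.Unmarshal([]byte(payload), &chunk); err != nil {
		return err
	}
	for _, choice := range chunk.Choices {
		s.StreamingContent.WriteString(choice.Delta.Content)
		s.StreamingReasoning.WriteString(choice.Delta.ReasoningContent)
	}
	return nil
}

func main() {
	var s streamState
	payloads := []string{
		`{"choices":[{"delta":{"reasoning_content":"2+2 "}}]}`,
		`{"choices":[{"delta":{"reasoning_content":"is 4."}}]}`,
		`{"choices":[{"delta":{"content":"4"}}]}`,
	}
	for _, p := range payloads {
		if err := s.accumulate(p); err != nil {
			panic(err)
		}
	}
	fmt.Printf("content=%q reasoning=%q\n",
		s.StreamingContent.String(), s.StreamingReasoning.String())
}
```

Keeping the two buffers separate means the reasoning never leaks into the answer text, and an empty reasoning buffer can later signal that the field should be left out.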
Fix
- request_context.go: add StreamingReasoning.
- processor_res_body_streaming.go: accumulate delta.reasoning_content alongside delta.content.
- processor_res_cache.go: emit message.reasoning_content in the reconstructed response when non-empty.

reasoning_content is already a recognized field (looper extraction reads choices[].reasoning_content and .message.reasoning_content; memory + anthropic-outbound handle it).

Test plan
- processor_res_cache_reasoning_test.go: accumulator captures delta.reasoning_content across chunks; reconstruction includes message.reasoning_content; reconstruction omits the field when no reasoning was streamed. RED before fix → GREEN.
- go test ./pkg/extproc/ full suite green (no regression).
- gofmt / go vet clean; golangci-lint (repo config) → 0 issues.

Notes