Did you check the docs?
Is your feature request related to a problem? Please describe.
When a request with tools reaches the /v1/chat/completions endpoint with passthrough: true and stream: true, the current behavior is to return a 422 error. This blocks clients that use streaming even when they only need tool calls surfaced at the end of the stream.
Root cause (nemoguardrails/server/api.py, nemoguardrails/rails/llm/llmrails.py, nemoguardrails/server/schemas/utils.py):
StreamingHandler is text-only. Tool call data populated in tool_calls_var during generation is never pushed to the streaming iterator.
format_streaming_chunk has no concept of delta.tool_calls. It only knows delta.content.
There is no mechanism to emit a terminal chunk with finish_reason: "tool_calls" before [DONE].
Describe the solution you'd like
After all text tokens are yielded, wrap tool calls as synthetic delta.tool_calls chunks in OpenAI streaming format, followed by a finish_reason: "tool_calls" chunk. Arguments are not streamed token-by-token but arrive as a single chunk at the end.
Describe alternatives you've considered
Forward raw SSE chunks from the LLM through the guardrails layer for passthrough mode, bypassing StreamingHandler. This would enable token-by-token argument streaming but introduce breaking changes.
Additional context
Relates to issue #1615 and builds on top of #1942.
Did you check the docs?
Is your feature request related to a problem? Please describe.
When a request with tools reaches the
/v1/chat/completions endpointwithpassthrough: trueandstream: true, the current behavior is to return a 422 error. This blocks clients that use streaming even when they only need tool calls surfaced at the end of the stream.Root cause (nemoguardrails/server/api.py, nemoguardrails/rails/llm/llmrails.py, nemoguardrails/server/schemas/utils.py):
StreamingHandleris text-only. Tool call data populated in tool_calls_var during generation is never pushed to the streaming iterator.format_streaming_chunkhas no concept of delta.tool_calls. It only knows delta.content.There is no mechanism to emit a terminal chunk with
finish_reason: "tool_calls"before[DONE].Describe the solution you'd like
After all text tokens are yielded, wrap tool calls as synthetic delta.tool_calls chunks in OpenAI streaming format, followed by a finish_reason: "tool_calls" chunk. Arguments are not streamed token-by-token but arrive as a single chunk at the end.
Describe alternatives you've considered
Forward raw SSE chunks from the LLM through the guardrails layer for passthrough mode, bypassing
StreamingHandler. This would enable token-by-token argument streaming but introduce breaking changes.Additional context
Relates to issue #1615 and builds on top of #1942.