fix: return MCP connection errors to LLM instead of raising#1531
Open
jsonmp-k8 wants to merge 1 commit intokagent-dev:mainfrom
Open
fix: return MCP connection errors to LLM instead of raising#1531jsonmp-k8 wants to merge 1 commit intokagent-dev:mainfrom
jsonmp-k8 wants to merge 1 commit intokagent-dev:mainfrom
Conversation
Contributor
There was a problem hiding this comment.
Pull request overview
Wraps MCP tools to gracefully surface persistent connection failures to the LLM as normal tool output (instead of raising), preventing tight retry loops and high CPU usage in static agent runs (Fixes #1530).
Changes:
- Add
ConnectionSafeMcpToolthat catches connection-related exceptions and returns an error payload instructing the LLM not to retry. - Update
KAgentMcpToolset.get_tools()to wrap returnedMcpToolinstances withConnectionSafeMcpTool. - Add unit tests covering connection vs non-connection error behavior (including
CancelledErrorpropagation).
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
python/packages/kagent-adk/src/kagent/adk/_mcp_toolset.py |
Introduces ConnectionSafeMcpTool and wraps MCP tools returned by KAgentMcpToolset to avoid raising persistent connection failures. |
python/packages/kagent-adk/tests/unittests/test_mcp_connection_error_handling.py |
Adds pytest coverage ensuring connection errors are returned as error text while other exceptions still propagate. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
python/packages/kagent-adk/tests/unittests/test_mcp_connection_error_handling.py
Show resolved
Hide resolved
c963afe to
41c3c17
Compare
When an MCP HTTP tool call fails with a persistent connection error (e.g. "connection reset by peer"), the error propagates to the LLM as a function error. The LLM interprets this as transient and retries the same tool call, creating a tight loop that burns 100% CPU for up to max_llm_calls (500) iterations. Wrap McpTool instances with ConnectionSafeMcpTool that catches connection errors (ConnectionError, TimeoutError, httpx.TransportError, McpError) and returns them as error text. This lets the LLM inform the user about the failure instead of retrying indefinitely. Fixes kagent-dev#1530 Signed-off-by: Jaison Paul <paul.jaison@gmail.com>
41c3c17 to
aa46632
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
McpToolinstances withConnectionSafeMcpToolthat catches persistent connection errors and returns them as error text to the LLMConnectionError(stdlib),TimeoutError(stdlib),httpx.TransportError(httpx network/timeout/protocol errors), andMcpError(MCP session stream drops and read timeouts)KAgentMcpToolset.get_tools()automatically wraps allMcpToolinstancesRoot cause
When an MCP HTTP tool call fails with "connection reset by peer", the error propagates up to the ADK flow handler, which sends it back to the LLM as a function error. The LLM interprets this as a transient failure and retries the same tool call — creating a tight loop of LLM call → tool call → connection error → LLM call for up to
max_llm_calls(500) iterations, burning 100% CPU.The MCP client wraps transport-level errors into
McpErrorviamcp.shared.session.send_request()before they reach the tool, so catching only stdlib/httpx errors is insufficient —McpErrormust also be handled.Testing
python -m pytest python/packages/kagent-adk/tests/unittests/test_mcp_connection_error_handling.py -v(10 tests)python -m pytest python/packages/kagent-adk/tests/unittests/ -v(170 passed)Test coverage:
ConnectionResetError,ConnectionRefusedError,TimeoutError— caught, returned as error dicthttpx.ConnectError,httpx.ReadError,httpx.ConnectTimeout— caught viahttpx.TransportErrorMcpError(session read timeout) — caught, returned as error dictValueError,CancelledError— still raised (not connection errors)KAgentMcpToolset.get_tools()wrapsMcpTool→ConnectionSafeMcpToolFixes #1530