fix: return MCP connection errors to LLM instead of raising #1531

Open
jsonmp-k8 wants to merge 1 commit into kagent-dev:main from jsonmp-k8:fix/1530-mcp-tool-call-cpu-spin
@jsonmp-k8 jsonmp-k8 commented Mar 20, 2026

Summary

  • Wrap McpTool instances with ConnectionSafeMcpTool that catches persistent connection errors and returns them as error text to the LLM
  • Catches ConnectionError (stdlib), TimeoutError (stdlib), httpx.TransportError (httpx network/timeout/protocol errors), and McpError (MCP session stream drops and read timeouts)
  • The error message includes the tool name, error type, and instructs the LLM not to retry
  • KAgentMcpToolset.get_tools() automatically wraps all McpTool instances
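
The wrapping described above might look roughly like the following. This is a minimal, stdlib-only sketch and not the PR's code: `FakeMcpTool` and `ConnectionSafeTool` are hypothetical stand-ins, the `run_async(args=...)` signature is a simplifying assumption, and `httpx.TransportError` and `McpError` are left out of the caught tuple so the example runs without third-party packages.

```python
from typing import Any, Dict, Optional


class FakeMcpTool:
    """Hypothetical stand-in for an MCP tool (not the real McpTool class)."""

    def __init__(self, name: str, exc: Optional[BaseException] = None) -> None:
        self.name = name
        self._exc = exc

    async def run_async(self, *, args: Dict[str, Any]) -> Dict[str, Any]:
        # Simulate either a failing or a successful remote call.
        if self._exc is not None:
            raise self._exc
        return {"result": "ok"}


# The PR also lists httpx.TransportError and McpError; they are omitted
# here only to keep the sketch free of third-party imports.
CONNECTION_ERRORS = (ConnectionError, TimeoutError)


class ConnectionSafeTool:
    """Hypothetical wrapper: turns connection failures into error text."""

    def __init__(self, inner: FakeMcpTool) -> None:
        self._inner = inner
        self.name = inner.name

    async def run_async(self, *, args: Dict[str, Any]) -> Dict[str, Any]:
        try:
            return await self._inner.run_async(args=args)
        except CONNECTION_ERRORS as exc:
            # Return the failure as ordinary tool output instead of raising.
            # asyncio.CancelledError derives from BaseException, so it is
            # not caught here and still propagates.
            return {
                "error": (
                    f"Tool '{self.name}' failed with "
                    f"{type(exc).__name__}: {exc}. "
                    "Do not retry this tool call; report the failure to the user."
                )
            }
```

Awaiting `ConnectionSafeTool(FakeMcpTool("demo", ConnectionResetError("connection reset by peer"))).run_async(args={})` yields a dict with an `error` key, while a `ValueError` raised by the inner tool still propagates to the caller.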

Root cause

When an MCP HTTP tool call fails with "connection reset by peer", the error propagates up to the ADK flow handler, which sends it back to the LLM as a function error. The LLM interprets this as a transient failure and retries the same tool call — creating a tight loop of LLM call → tool call → connection error → LLM call for up to max_llm_calls (500) iterations, burning 100% CPU.

The MCP client wraps transport-level errors into McpError via mcp.shared.session.send_request() before they reach the tool, so catching only stdlib/httpx errors is insufficient — McpError must also be handled.
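
To illustrate why the stdlib and httpx types alone are not enough, here is a small sketch of the wrapping effect. `FakeMcpError` and `send_request` below are hypothetical stand-ins, and the sketch assumes the session-level error type is unrelated to `ConnectionError` in the class hierarchy.

```python
from typing import Tuple, Type


class FakeMcpError(Exception):
    """Hypothetical stand-in for McpError; assumed unrelated to ConnectionError."""


def send_request() -> None:
    """Mimic a session layer that converts a transport failure into its own error."""
    try:
        raise ConnectionResetError("connection reset by peer")
    except ConnectionError as exc:
        raise FakeMcpError(f"session stream dropped: {exc}") from exc


def call_tool(caught: Tuple[Type[BaseException], ...]) -> str:
    """Call the tool, converting only the exception types in `caught` to text."""
    try:
        send_request()
    except caught as exc:
        return f"returned as text: {type(exc).__name__}"
    return "ok"
```

With `caught=(ConnectionError, TimeoutError)` the wrapped error escapes `call_tool` and is raised to the caller; adding `FakeMcpError` to the tuple is what lets it be returned as text.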

Testing

  • python -m pytest python/packages/kagent-adk/tests/unittests/test_mcp_connection_error_handling.py -v (10 tests)
  • python -m pytest python/packages/kagent-adk/tests/unittests/ -v (170 passed)

Test coverage:

  • ConnectionResetError, ConnectionRefusedError, TimeoutError — caught, returned as error dict
  • httpx.ConnectError, httpx.ReadError, httpx.ConnectTimeout — caught via httpx.TransportError
  • McpError (session read timeout) — caught, returned as error dict
  • ValueError, CancelledError — still raised (not connection errors)
  • KAgentMcpToolset.get_tools() wraps McpTool → ConnectionSafeMcpTool
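
The last item, the toolset-level wrapping, could be checked along these lines. This is only a sketch with hypothetical stand-in classes (`FakeMcpTool`, `OtherTool`, `SafeWrapper`, `FakeToolset`, `WrappingToolset`). It assumes `get_tools()` is a coroutine that takes no arguments, and that tools which are not MCP tools pass through unchanged; the PR text confirms neither.

```python
import asyncio
from typing import Any, List


class FakeMcpTool:
    """Hypothetical stand-in for McpTool."""

    def __init__(self, name: str) -> None:
        self.name = name


class OtherTool:
    """Hypothetical tool type that is not an MCP tool."""

    name = "other"


class SafeWrapper:
    """Hypothetical stand-in for ConnectionSafeMcpTool."""

    def __init__(self, inner: FakeMcpTool) -> None:
        self.inner = inner
        self.name = inner.name


class FakeToolset:
    """Hypothetical base toolset returning the tools it was given."""

    def __init__(self, tools: List[Any]) -> None:
        self._tools = tools

    async def get_tools(self) -> List[Any]:
        return list(self._tools)


class WrappingToolset(FakeToolset):
    """Wraps every FakeMcpTool; leaves other tools untouched (an assumption)."""

    async def get_tools(self) -> List[Any]:
        tools = await super().get_tools()
        return [
            SafeWrapper(t) if isinstance(t, FakeMcpTool) else t
            for t in tools
        ]


def wrapped_tools(tools: List[Any]) -> List[Any]:
    """Synchronous helper so the sketch can be exercised without an event loop."""
    return asyncio.run(WrappingToolset(tools).get_tools())
```

A test in this style would assert that each MCP tool comes back as a `SafeWrapper` with the same `name`, and that the tool order is preserved.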

Fixes #1530

Copilot AI review requested due to automatic review settings March 20, 2026 20:23
Copilot AI left a comment

Pull request overview

Wraps MCP tools to gracefully surface persistent connection failures to the LLM as normal tool output (instead of raising), preventing tight retry loops and high CPU usage in static agent runs (Fixes #1530).

Changes:

  • Add ConnectionSafeMcpTool that catches connection-related exceptions and returns an error payload instructing the LLM not to retry.
  • Update KAgentMcpToolset.get_tools() to wrap returned McpTool instances with ConnectionSafeMcpTool.
  • Add unit tests covering connection vs non-connection error behavior (including CancelledError propagation).

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| python/packages/kagent-adk/src/kagent/adk/_mcp_toolset.py | Introduces ConnectionSafeMcpTool and wraps MCP tools returned by KAgentMcpToolset to avoid raising persistent connection failures. |
| python/packages/kagent-adk/tests/unittests/test_mcp_connection_error_handling.py | Adds pytest coverage ensuring connection errors are returned as error text while other exceptions still propagate. |

@jsonmp-k8 jsonmp-k8 force-pushed the fix/1530-mcp-tool-call-cpu-spin branch 2 times, most recently from c963afe to 41c3c17 Compare March 20, 2026 21:06
When an MCP HTTP tool call fails with a persistent connection error
(e.g. "connection reset by peer"), the error propagates to the LLM as
a function error. The LLM interprets this as transient and retries the
same tool call, creating a tight loop that burns 100% CPU for up to
max_llm_calls (500) iterations.

Wrap McpTool instances with ConnectionSafeMcpTool that catches
connection errors (ConnectionError, TimeoutError, httpx.TransportError,
McpError) and returns them as error text. This lets the LLM inform the
user about the failure instead of retrying indefinitely.

Fixes kagent-dev#1530

Signed-off-by: Jaison Paul <paul.jaison@gmail.com>
@jsonmp-k8 jsonmp-k8 force-pushed the fix/1530-mcp-tool-call-cpu-spin branch from 41c3c17 to aa46632 Compare March 20, 2026 21:11
Development

Successfully merging this pull request may close these issues.

[BUG] kagent-adk static agent spins at 100% CPU on persistent HTTP MCP tool call failures
