feat!: telemetry metrics updates as per semantic convention#2566
feat!: telemetry metrics updates as per semantic convention#2566AjmeraParth132 wants to merge 1 commit intogoogleapis:mainfrom
Conversation
Summary of ChangesHello @AjmeraParth132, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly refactors the telemetry system to align with OpenTelemetry semantic conventions, enhancing observability and debuggability. By replacing a fragmented set of metrics with a unified, future-proof collection, it provides clearer insights into the performance and behavior of MCP operations, sessions, and backend tool executions. This change simplifies metric collection while offering more granular data for identifying bottlenecks and understanding system health. Highlights
Changelog
Activity
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request refactors the telemetry system to align with OpenTelemetry semantic conventions, removing outdated metrics and introducing a new, more descriptive set for MCP operations and sessions. However, a critical security concern has been identified: several new telemetry metrics, specifically labels like mcp.method.name, toolset.name, gen_ai.tool.name, and gen_ai.prompt.name, are populated directly from untrusted user input without prior validation. This poses a significant risk of metrics cardinality explosion, potentially leading to a Denial of Service by exhausting memory. Additionally, there are suggestions to improve code clarity and robustness regarding how metric attributes are handled to prevent potential slice aliasing issues.
| defer func() { | ||
| operationDuration := time.Since(operationStart).Seconds() | ||
| durationAttrs := []attribute.KeyValue{ | ||
| attribute.String("mcp.method.name", baseMessage.Method), |
There was a problem hiding this comment.
The mcp.method.name label is populated directly from the baseMessage.Method field of the JSON-RPC request body without any validation. Since this field is user-controlled, an attacker can send a large number of requests with unique, arbitrary method names, leading to a metrics cardinality explosion. This can exhaust the memory of the metrics collection system or the application itself, causing a Denial of Service (DoS). It is recommended to validate the method name against an allow-list of known methods before using it as a metric label.
| attribute.String("mcp.method.name", baseMessage.Method), | ||
| attribute.String("network.transport", networkTransport), | ||
| attribute.String("network.protocol.name", networkProtocolName), | ||
| attribute.String("toolset.name", toolsetName), |
There was a problem hiding this comment.
The toolset.name label is populated from the toolsetName URL parameter, which is untrusted. An attacker can send many requests with unique toolset names in the URL (e.g., /mcp/{toolsetName}), causing a label cardinality explosion in the metrics system. This can lead to a Denial of Service. Consider validating the toolset name against the list of configured toolsets before recording it in metrics.
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | ||
| genAIAttrs.OperationName = "execute_tool" | ||
| genAIAttrs.ToolName = toolName | ||
| } |
There was a problem hiding this comment.
The gen_ai.tool.name attribute is populated from the toolName variable before it is validated against the available tools. This attribute is later used as a label in the McpOperationDuration metric (recorded in internal/server/mcp.go). An attacker can send requests with arbitrary tool names, leading to a metrics cardinality explosion and potential Denial of Service. The assignment should be moved after the tool existence check on line 122.
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | |
| genAIAttrs.OperationName = "execute_tool" | |
| genAIAttrs.ToolName = toolName | |
| } | |
| tool, ok := resourceMgr.GetTool(toolName) | |
| if !ok { | |
| err = fmt.Errorf("invalid tool name: tool with name %q does not exist", toolName) | |
| return jsonrpc.NewError(id, jsonrpc.INVALID_PARAMS, err.Error(), nil), err | |
| } | |
| // Populate gen_ai attributes for operation duration metric | |
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | |
| genAIAttrs.OperationName = "execute_tool" | |
| genAIAttrs.ToolName = toolName | |
| } |
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | ||
| genAIAttrs.OperationName = "get_prompt" | ||
| genAIAttrs.PromptName = promptName | ||
| } |
There was a problem hiding this comment.
The gen_ai.prompt.name attribute is populated from the promptName variable before it is validated. This can lead to a metrics cardinality explosion if used as a label in metrics recorded for every request. Move the assignment after the prompt existence check on line 369.
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | |
| genAIAttrs.OperationName = "get_prompt" | |
| genAIAttrs.PromptName = promptName | |
| } | |
| prompt, ok := resourceMgr.GetPrompt(promptName) | |
| if !ok { | |
| err := fmt.Errorf("prompt with name %q does not exist", promptName) | |
| return jsonrpc.NewError(id, jsonrpc.INVALID_PARAMS, err.Error(), nil), err | |
| } | |
| // Populate gen_ai attributes for operation duration metric | |
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | |
| genAIAttrs.OperationName = "get_prompt" | |
| genAIAttrs.PromptName = promptName | |
| } |
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | ||
| genAIAttrs.OperationName = "execute_tool" | ||
| genAIAttrs.ToolName = toolName | ||
| } |
There was a problem hiding this comment.
The gen_ai.tool.name attribute is populated from the untrusted toolName variable before validation, which can lead to a metrics cardinality explosion. Move the assignment after the tool existence check on line 123.
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | |
| genAIAttrs.OperationName = "execute_tool" | |
| genAIAttrs.ToolName = toolName | |
| } | |
| tool, ok := resourceMgr.GetTool(toolName) | |
| if !ok { | |
| err = fmt.Errorf("invalid tool name: tool with name %q does not exist", toolName) | |
| return jsonrpc.NewError(id, jsonrpc.INVALID_PARAMS, err.Error(), nil), err | |
| } | |
| // Populate gen_ai attributes for operation duration metric | |
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | |
| genAIAttrs.OperationName = "execute_tool" | |
| genAIAttrs.ToolName = toolName | |
| } |
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | ||
| genAIAttrs.OperationName = "get_prompt" | ||
| genAIAttrs.PromptName = promptName | ||
| } |
There was a problem hiding this comment.
The gen_ai.prompt.name attribute is populated from the untrusted promptName variable before validation. Move the assignment after the prompt existence check on line 363.
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | |
| genAIAttrs.OperationName = "get_prompt" | |
| genAIAttrs.PromptName = promptName | |
| } | |
| prompt, ok := resourceMgr.GetPrompt(promptName) | |
| if !ok { | |
| err := fmt.Errorf("prompt with name %q does not exist", promptName) | |
| return jsonrpc.NewError(id, jsonrpc.INVALID_PARAMS, err.Error(), nil), err | |
| } | |
| // Populate gen_ai attributes for operation duration metric | |
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | |
| genAIAttrs.OperationName = "get_prompt" | |
| genAIAttrs.PromptName = promptName | |
| } |
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | ||
| genAIAttrs.OperationName = "execute_tool" | ||
| genAIAttrs.ToolName = toolName | ||
| } |
There was a problem hiding this comment.
The gen_ai.tool.name attribute is populated from the untrusted toolName variable before validation, which can lead to a metrics cardinality explosion. Move the assignment after the tool existence check on line 116.
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | |
| genAIAttrs.OperationName = "execute_tool" | |
| genAIAttrs.ToolName = toolName | |
| } | |
| tool, ok := resourceMgr.GetTool(toolName) | |
| if !ok { | |
| err = fmt.Errorf("invalid tool name: tool with name %q does not exist", toolName) | |
| return jsonrpc.NewError(id, jsonrpc.INVALID_PARAMS, err.Error(), nil), err | |
| } | |
| // Populate gen_ai attributes for operation duration metric | |
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | |
| genAIAttrs.OperationName = "execute_tool" | |
| genAIAttrs.ToolName = toolName | |
| } |
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | ||
| genAIAttrs.OperationName = "get_prompt" | ||
| genAIAttrs.PromptName = promptName | ||
| } |
There was a problem hiding this comment.
The gen_ai.prompt.name attribute is populated from the untrusted promptName variable before validation. Move the assignment after the prompt existence check on line 363.
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | |
| genAIAttrs.OperationName = "get_prompt" | |
| genAIAttrs.PromptName = promptName | |
| } | |
| prompt, ok := resourceMgr.GetPrompt(promptName) | |
| if !ok { | |
| err := fmt.Errorf("prompt with name %q does not exist", promptName) | |
| return jsonrpc.NewError(id, jsonrpc.INVALID_PARAMS, err.Error(), nil), err | |
| } | |
| // Populate gen_ai attributes for operation duration metric | |
| if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil { | |
| genAIAttrs.OperationName = "get_prompt" | |
| genAIAttrs.PromptName = promptName | |
| } |
| durationAttrs := fullAttrs | ||
| if sessionErr != nil && sessionErr != io.EOF { | ||
| durationAttrs = append(durationAttrs, attribute.String("error.type", sessionErr.Error())) | ||
| } |
There was a problem hiding this comment.
Assigning durationAttrs directly to fullAttrs and then conditionally appending to it can be confusing due to slice aliasing. While it doesn't cause a bug in the current code, creating an explicit copy makes the intent clearer and prevents potential bugs if the code is refactored in the future.
| durationAttrs := fullAttrs | |
| if sessionErr != nil && sessionErr != io.EOF { | |
| durationAttrs = append(durationAttrs, attribute.String("error.type", sessionErr.Error())) | |
| } | |
| durationAttrs := make([]attribute.KeyValue, len(fullAttrs)) | |
| copy(durationAttrs, fullAttrs) | |
| if sessionErr != nil && sessionErr != io.EOF { | |
| durationAttrs = append(durationAttrs, attribute.String("error.type", sessionErr.Error())) | |
| } |
| durationAttrs := sessionAttrs | ||
| if err != nil { | ||
| span.SetStatus(codes.Error, err.Error()) | ||
| durationAttrs = append(durationAttrs, attribute.String("error.type", err.Error())) | ||
| } |
There was a problem hiding this comment.
Similar to another comment, the direct assignment durationAttrs := sessionAttrs followed by a conditional append can be risky due to slice aliasing. sessionAttrs is used for decrementing the active sessions counter just before this. To improve code safety and clarity, it's better to work with an explicit copy.
| durationAttrs := sessionAttrs | |
| if err != nil { | |
| span.SetStatus(codes.Error, err.Error()) | |
| durationAttrs = append(durationAttrs, attribute.String("error.type", err.Error())) | |
| } | |
| durationAttrs := make([]attribute.KeyValue, len(sessionAttrs)) | |
| copy(durationAttrs, sessionAttrs) | |
| if err != nil { | |
| span.SetStatus(codes.Error, err.Error()) | |
| durationAttrs = append(durationAttrs, attribute.String("error.type", err.Error())) | |
| } |
Description
This PR updates MCP telemetry to align with OTel semantic conventions by removing all existing transport-specific and HTTP-API metrics and introducing a unified, future-proof set. It adds
mcp.server.operation.durationto capture per-method request rate, latency, and errors across all MCP transports;mcp.server.session.durationto track session lifecycles for stdio and SSE;toolbox.server.mcp.active_sessionsfor real-time capacity, saturation, and leak detection.toolbox.tool.operation.durationto isolate backend tool execution time from MCP/protocol overhead, enabling clearer attribution of performance bottlenecks between the MCP layer and tool runtime.Together, these changes simplify the metrics surface while significantly improving observability and debuggability with minimal runtime overhead.
PR Checklist
CONTRIBUTING.md
bug/issue
before writing your code! That way we can discuss the change, evaluate
designs, and agree on the general idea
!if this involve a breaking change🛠️ Fixes #2222