feat!: telemetry metrics updates as per semantic convention #2566

Open
AjmeraParth132 wants to merge 1 commit into googleapis:main from AjmeraParth132:feat-metrics-update

Conversation

@AjmeraParth132
Contributor

Description

This PR updates MCP telemetry to align with OTel semantic conventions by removing all existing transport-specific and HTTP-API metrics and introducing a unified, future-proof set. It adds:

  1. mcp.server.operation.duration to capture per-method request rate, latency, and errors across all MCP transports;
  2. mcp.server.session.duration to track session lifecycles for stdio and SSE;
  3. toolbox.server.mcp.active_sessions for real-time capacity, saturation, and leak detection;
  4. toolbox.tool.operation.duration to isolate backend tool execution time from MCP/protocol overhead, enabling clearer attribution of performance bottlenecks between the MCP layer and tool runtime.

Together, these changes simplify the metrics surface while significantly improving observability and debuggability with minimal runtime overhead.
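
As a rough illustration of how the session metrics fit together, the sketch below tracks an active-session count and a per-session duration around a single serve call. It uses only the standard library: `sessionStats`, `sessionSample`, and `track` are hypothetical stand-ins, not the PR's code or the OpenTelemetry API. The error handling mirrors the review snippets further down, where `io.EOF` is treated as a normal end of session.

```go
package main

import (
	"errors"
	"fmt"
	"io"
	"sync"
	"time"
)

// sessionSample is one recorded value for a session duration metric.
type sessionSample struct {
	Seconds   float64
	ErrorType string // empty when the session ended cleanly
}

// sessionStats is a hypothetical in-memory stand-in for the real instruments.
type sessionStats struct {
	mu      sync.Mutex
	active  int64           // plays the role of toolbox.server.mcp.active_sessions
	samples []sessionSample // plays the role of mcp.server.session.duration
}

// track runs one session and keeps the active count and the duration
// consistent even when serve returns an error.
func (s *sessionStats) track(serve func() error) {
	start := time.Now()
	s.mu.Lock()
	s.active++
	s.mu.Unlock()

	var err error
	defer func() {
		sample := sessionSample{Seconds: time.Since(start).Seconds()}
		// A client closing the stream (io.EOF) is a normal end, not an error.
		if err != nil && !errors.Is(err, io.EOF) {
			sample.ErrorType = err.Error()
		}
		s.mu.Lock()
		s.active--
		s.samples = append(s.samples, sample)
		s.mu.Unlock()
	}()
	err = serve()
}

// demo runs one clean and one failing session and returns the final state.
func demo() (active int64, errorTypes []string) {
	stats := &sessionStats{}
	stats.track(func() error { return io.EOF })
	stats.track(func() error { return errors.New("connection reset") })
	for _, sample := range stats.samples {
		errorTypes = append(errorTypes, sample.ErrorType)
	}
	return stats.active, errorTypes
}

func main() {
	active, errorTypes := demo()
	fmt.Printf("active=%d errorTypes=%q\n", active, errorTypes)
}
```

Because the decrement and the duration recording share one deferred function, a session that exits early cannot leave the active count stuck above zero, which is the leak-detection property the description mentions.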

PR Checklist

  • Make sure you reviewed CONTRIBUTING.md
  • Make sure to open an issue as a bug/issue before writing your code! That way we can discuss the change, evaluate designs, and agree on the general idea
  • Ensure the tests and linter pass
  • Code coverage does not decrease (if any source code was changed)
  • Appropriate docs were updated (if necessary)
  • Make sure to add ! if this involves a breaking change

🛠️ Fixes #2222

@AjmeraParth132 AjmeraParth132 requested a review from a team as a code owner February 26, 2026 01:16
@gemini-code-assist
Contributor

Summary of Changes

Hello @AjmeraParth132, I'm Gemini Code Assist[1]! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly refactors the telemetry system to align with OpenTelemetry semantic conventions, enhancing observability and debuggability. By replacing a fragmented set of metrics with a unified, future-proof collection, it provides clearer insights into the performance and behavior of MCP operations, sessions, and backend tool executions. This change simplifies metric collection while offering more granular data for identifying bottlenecks and understanding system health.

Highlights

  • New Metrics Introduced: Introduced four new OpenTelemetry semantic convention-aligned metrics: mcp.server.operation.duration for per-method request tracking, mcp.server.session.duration for session lifecycles, toolbox.server.mcp.active_sessions for real-time capacity, and toolbox.tool.operation.duration to isolate tool execution time.
  • Old Metrics Removed: Removed all existing transport-specific and HTTP-API metrics (toolbox.server.toolset.get.count, toolbox.server.tool.get.count, toolbox.server.tool.invoke.count, toolbox.server.mcp.sse.count, toolbox.server.mcp.post.count) to simplify the metrics surface.
  • Metric Attribute Propagation: Implemented a mechanism to propagate gen_ai and network-related attributes through the context for richer metric data, allowing for more detailed analysis of operations and sessions.
  • Session and Operation Duration Tracking: Added logic to track the duration of MCP sessions (stdio and SSE) and individual MCP JSON-RPC operations, including error types, providing comprehensive performance insights.
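
To make the split between whole-operation latency and tool execution time concrete, here is a minimal sketch of nested timing. It assumes a hypothetical `handle` function and an injected `fakeClock` so the numbers are deterministic; the real handlers presumably call `time.Now` and record into histograms instead.

```go
package main

import (
	"fmt"
	"time"
)

// fakeClock is a deterministic clock so the example has a stable result.
type fakeClock struct{ cur time.Time }

// Now returns the current fake time, then advances it by 5ms to model
// the small cost of protocol work between measurements.
func (c *fakeClock) Now() time.Time {
	t := c.cur
	c.cur = c.cur.Add(5 * time.Millisecond)
	return t
}

// Advance models time spent inside the backend tool.
func (c *fakeClock) Advance(d time.Duration) { c.cur = c.cur.Add(d) }

// handle (hypothetical) times the whole operation and, nested inside it,
// only the tool call.
func handle(now func() time.Time, invokeTool func()) (operation, tool time.Duration) {
	opStart := now()   // would feed mcp.server.operation.duration
	toolStart := now() // would feed the tool-level duration metric
	invokeTool()
	tool = now().Sub(toolStart)
	operation = now().Sub(opStart)
	return operation, tool
}

// demo returns both durations in milliseconds for a 100ms tool call.
func demo() (operationMs, toolMs int64) {
	clock := &fakeClock{cur: time.Unix(0, 0)}
	op, tool := handle(clock.Now, func() { clock.Advance(100 * time.Millisecond) })
	return op.Milliseconds(), tool.Milliseconds()
}

func main() {
	op, tool := demo()
	fmt.Printf("operation=%dms tool=%dms overhead=%dms\n", op, tool, op-tool)
}
```

The difference between the two values is what can be attributed to the MCP layer rather than to the tool runtime.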

Changelog

  • internal/server/api.go
    • Removed metric import.
    • Removed old ToolsetGet, ToolGet, and ToolInvoke metric recording from API handlers.
  • internal/server/mcp.go
    • Added sessionStart timestamp and sessionErr variable to track session duration and errors.
    • Introduced mcp.server.session.duration metric recording for stdio and SSE sessions, including error types.
    • Implemented toolbox.server.mcp.active_sessions metric to track active stdio and SSE sessions.
    • Added operationStart timestamp and metricErrorType variable for processMcpMessage.
    • Introduced mcp.server.operation.duration metric recording for MCP JSON-RPC operations, including gen_ai and network attributes.
    • Integrated GenAIMetricAttrs into the context for attribute propagation to method handlers.
    • Updated error handling to set metricErrorType for mcp.server.operation.duration.
  • internal/server/mcp/v20241105/method.go
    • Imported time and metric packages.
    • Populated GenAIMetricAttrs in the context for toolsCallHandler and promptsGetHandler.
    • Added toolbox.tool.execution.duration metric recording for tool invocations, including gen_ai and network attributes.
  • internal/server/mcp/v20250326/method.go
    • Imported time and metric packages.
    • Populated GenAIMetricAttrs in the context for toolsCallHandler and promptsGetHandler.
    • Added toolbox.tool.execution.duration metric recording for tool invocations, including gen_ai and network attributes.
  • internal/server/mcp/v20250618/method.go
    • Imported time and metric packages.
    • Populated GenAIMetricAttrs in the context for toolsCallHandler and promptsGetHandler.
    • Added toolbox.tool.execution.duration metric recording for tool invocations, including gen_ai and network attributes.
  • internal/server/mcp/v20251125/method.go
    • Imported time and metric packages.
    • Populated GenAIMetricAttrs in the context for toolsCallHandler and promptsGetHandler.
    • Added toolbox.tool.execution.duration metric recording for tool invocations, including gen_ai and network attributes.
  • internal/telemetry/instrumentation.go
    • Removed old metric constants (toolsetGetCountName, toolGetCountName, toolInvokeCountName, mcpSseCountName, mcpPostCountName).
    • Defined new OTel semantic convention metric constants (mcpOperationDurationName, mcpSessionDurationName, mcpActiveSessionsName, toolExecutionDurationName).
    • Updated Instrumentation struct to use new metric types (Float64Histogram, Int64UpDownCounter).
    • Replaced creation of old Int64Counter metrics with new Float64Histogram and Int64UpDownCounter metrics.
  • internal/util/util.go
    • Added GenAIMetricAttrs struct to hold generic AI and network metric attributes.
    • Implemented WithGenAIMetricAttrs to add GenAIMetricAttrs to the context.
    • Implemented GenAIMetricAttrsFromContext to retrieve GenAIMetricAttrs from the context.
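
The context-based propagation described for internal/util/util.go could look roughly like the sketch below. The field names come from the review snippets in this PR; the function signatures, the unexported key type, and `handleToolsCall` are assumptions made for illustration. Storing a pointer lets a method handler fill in attributes that the outer code reads back when it records the operation metric.

```go
package main

import (
	"context"
	"fmt"
)

// GenAIMetricAttrs holds attributes that handlers fill in during a request.
// The fields match those used in the review snippets; others may exist.
type GenAIMetricAttrs struct {
	OperationName string
	ToolName      string
	PromptName    string
}

// genAIAttrsKey is an unexported key type, which avoids context key collisions.
type genAIAttrsKey struct{}

// WithGenAIMetricAttrs stores a pointer so later mutations stay visible
// to whoever created the context. (Signature is an assumption.)
func WithGenAIMetricAttrs(ctx context.Context, attrs *GenAIMetricAttrs) context.Context {
	return context.WithValue(ctx, genAIAttrsKey{}, attrs)
}

// GenAIMetricAttrsFromContext returns nil when no attributes were attached.
func GenAIMetricAttrsFromContext(ctx context.Context) *GenAIMetricAttrs {
	attrs, _ := ctx.Value(genAIAttrsKey{}).(*GenAIMetricAttrs)
	return attrs
}

// handleToolsCall is a hypothetical handler that populates the attributes.
func handleToolsCall(ctx context.Context, toolName string) {
	if attrs := GenAIMetricAttrsFromContext(ctx); attrs != nil {
		attrs.OperationName = "execute_tool"
		attrs.ToolName = toolName
	}
}

// demo shows the caller observing values written by the handler.
func demo() GenAIMetricAttrs {
	attrs := &GenAIMetricAttrs{}
	ctx := WithGenAIMetricAttrs(context.Background(), attrs)
	handleToolsCall(ctx, "search-hotels")
	return *attrs
}

func main() {
	got := demo()
	fmt.Println(got.OperationName, got.ToolName)
}
```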

Activity

  • No human activity (comments, reviews) was detected on this pull request yet.

Footnotes

  1. Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request refactors the telemetry system to align with OpenTelemetry semantic conventions, removing outdated metrics and introducing a new, more descriptive set for MCP operations and sessions. However, a critical security concern has been identified: several labels on the new metrics, such as mcp.method.name, toolset.name, gen_ai.tool.name, and gen_ai.prompt.name, are populated directly from untrusted user input without prior validation. This risks a metrics cardinality explosion that could exhaust memory and lead to a Denial of Service. Additionally, there are suggestions to copy metric attribute slices explicitly before appending to them, which improves clarity and guards against slice aliasing issues.

defer func() {
    operationDuration := time.Since(operationStart).Seconds()
    durationAttrs := []attribute.KeyValue{
        attribute.String("mcp.method.name", baseMessage.Method),

security-medium medium

The mcp.method.name label is populated directly from the baseMessage.Method field of the JSON-RPC request body without any validation. Since this field is user-controlled, an attacker can send a large number of requests with unique, arbitrary method names, leading to a metrics cardinality explosion. This can exhaust the memory of the metrics collection system or the application itself, causing a Denial of Service (DoS). It is recommended to validate the method name against an allow-list of known methods before using it as a metric label.
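
One way to apply the allow-list recommendation is sketched below, using only the standard library. The set of methods and the `_OTHER` fallback label are illustrative assumptions, not the project's actual list or naming; `methodLabel` is a hypothetical helper.

```go
package main

import "fmt"

// otherMethod is a hypothetical fallback label for unrecognised methods.
const otherMethod = "_OTHER"

// knownMethods is an illustrative allow-list, not the project's actual one.
var knownMethods = map[string]struct{}{
	"initialize":   {},
	"ping":         {},
	"tools/list":   {},
	"tools/call":   {},
	"prompts/list": {},
	"prompts/get":  {},
}

// methodLabel returns a value that is safe to use as a metric label:
// the method itself if it is known, otherwise a single fixed fallback.
func methodLabel(method string) string {
	if _, ok := knownMethods[method]; ok {
		return method
	}
	return otherMethod
}

func main() {
	fmt.Println(methodLabel("tools/call"))
	fmt.Println(methodLabel("made-up-method-12345"))
}
```

With this shape, the number of distinct label values is bounded by the size of the allow-list plus one, regardless of what clients send. The same lookup pattern would apply to toolset names checked against the configured toolsets.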

        attribute.String("mcp.method.name", baseMessage.Method),
        attribute.String("network.transport", networkTransport),
        attribute.String("network.protocol.name", networkProtocolName),
        attribute.String("toolset.name", toolsetName),

security-medium medium

The toolset.name label is populated from the toolsetName URL parameter, which is untrusted. An attacker can send many requests with unique toolset names in the URL (e.g., /mcp/{toolsetName}), causing a label cardinality explosion in the metrics system. This can lead to a Denial of Service. Consider validating the toolset name against the list of configured toolsets before recording it in metrics.

Comment on lines +118 to +121
if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
    genAIAttrs.OperationName = "execute_tool"
    genAIAttrs.ToolName = toolName
}

security-medium medium

The gen_ai.tool.name attribute is populated from the toolName variable before it is validated against the available tools. This attribute is later used as a label in the McpOperationDuration metric (recorded in internal/server/mcp.go). An attacker can send requests with arbitrary tool names, leading to a metrics cardinality explosion and potential Denial of Service. The assignment should be moved after the tool existence check on line 122.

Suggested change
- if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
-     genAIAttrs.OperationName = "execute_tool"
-     genAIAttrs.ToolName = toolName
- }
+ tool, ok := resourceMgr.GetTool(toolName)
+ if !ok {
+     err = fmt.Errorf("invalid tool name: tool with name %q does not exist", toolName)
+     return jsonrpc.NewError(id, jsonrpc.INVALID_PARAMS, err.Error(), nil), err
+ }
+ // Populate gen_ai attributes for operation duration metric
+ if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
+     genAIAttrs.OperationName = "execute_tool"
+     genAIAttrs.ToolName = toolName
+ }

Comment on lines +364 to +367
if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
    genAIAttrs.OperationName = "get_prompt"
    genAIAttrs.PromptName = promptName
}

security-medium medium

The gen_ai.prompt.name attribute is populated from the promptName variable before it is validated. This can lead to a metrics cardinality explosion if used as a label in metrics recorded for every request. Move the assignment after the prompt existence check on line 369.

Suggested change
- if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
-     genAIAttrs.OperationName = "get_prompt"
-     genAIAttrs.PromptName = promptName
- }
+ prompt, ok := resourceMgr.GetPrompt(promptName)
+ if !ok {
+     err := fmt.Errorf("prompt with name %q does not exist", promptName)
+     return jsonrpc.NewError(id, jsonrpc.INVALID_PARAMS, err.Error(), nil), err
+ }
+ // Populate gen_ai attributes for operation duration metric
+ if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
+     genAIAttrs.OperationName = "get_prompt"
+     genAIAttrs.PromptName = promptName
+ }

Comment on lines +118 to +121
if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
    genAIAttrs.OperationName = "execute_tool"
    genAIAttrs.ToolName = toolName
}

security-medium medium

The gen_ai.tool.name attribute is populated from the untrusted toolName variable before validation, which can lead to a metrics cardinality explosion. Move the assignment after the tool existence check on line 123.

Suggested change
- if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
-     genAIAttrs.OperationName = "execute_tool"
-     genAIAttrs.ToolName = toolName
- }
+ tool, ok := resourceMgr.GetTool(toolName)
+ if !ok {
+     err = fmt.Errorf("invalid tool name: tool with name %q does not exist", toolName)
+     return jsonrpc.NewError(id, jsonrpc.INVALID_PARAMS, err.Error(), nil), err
+ }
+ // Populate gen_ai attributes for operation duration metric
+ if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
+     genAIAttrs.OperationName = "execute_tool"
+     genAIAttrs.ToolName = toolName
+ }

Comment on lines +358 to +361
if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
    genAIAttrs.OperationName = "get_prompt"
    genAIAttrs.PromptName = promptName
}

security-medium medium

The gen_ai.prompt.name attribute is populated from the untrusted promptName variable before validation. Move the assignment after the prompt existence check on line 363.

Suggested change
- if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
-     genAIAttrs.OperationName = "get_prompt"
-     genAIAttrs.PromptName = promptName
- }
+ prompt, ok := resourceMgr.GetPrompt(promptName)
+ if !ok {
+     err := fmt.Errorf("prompt with name %q does not exist", promptName)
+     return jsonrpc.NewError(id, jsonrpc.INVALID_PARAMS, err.Error(), nil), err
+ }
+ // Populate gen_ai attributes for operation duration metric
+ if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
+     genAIAttrs.OperationName = "get_prompt"
+     genAIAttrs.PromptName = promptName
+ }

Comment on lines +111 to +114
if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
    genAIAttrs.OperationName = "execute_tool"
    genAIAttrs.ToolName = toolName
}

security-medium medium

The gen_ai.tool.name attribute is populated from the untrusted toolName variable before validation, which can lead to a metrics cardinality explosion. Move the assignment after the tool existence check on line 116.

Suggested change
- if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
-     genAIAttrs.OperationName = "execute_tool"
-     genAIAttrs.ToolName = toolName
- }
+ tool, ok := resourceMgr.GetTool(toolName)
+ if !ok {
+     err = fmt.Errorf("invalid tool name: tool with name %q does not exist", toolName)
+     return jsonrpc.NewError(id, jsonrpc.INVALID_PARAMS, err.Error(), nil), err
+ }
+ // Populate gen_ai attributes for operation duration metric
+ if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
+     genAIAttrs.OperationName = "execute_tool"
+     genAIAttrs.ToolName = toolName
+ }

Comment on lines +358 to +361
if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
    genAIAttrs.OperationName = "get_prompt"
    genAIAttrs.PromptName = promptName
}

security-medium medium

The gen_ai.prompt.name attribute is populated from the untrusted promptName variable before validation. Move the assignment after the prompt existence check on line 363.

Suggested change
- if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
-     genAIAttrs.OperationName = "get_prompt"
-     genAIAttrs.PromptName = promptName
- }
+ prompt, ok := resourceMgr.GetPrompt(promptName)
+ if !ok {
+     err := fmt.Errorf("prompt with name %q does not exist", promptName)
+     return jsonrpc.NewError(id, jsonrpc.INVALID_PARAMS, err.Error(), nil), err
+ }
+ // Populate gen_ai attributes for operation duration metric
+ if genAIAttrs := util.GenAIMetricAttrsFromContext(ctx); genAIAttrs != nil {
+     genAIAttrs.OperationName = "get_prompt"
+     genAIAttrs.PromptName = promptName
+ }

Comment on lines +209 to +212
durationAttrs := fullAttrs
if sessionErr != nil && sessionErr != io.EOF {
    durationAttrs = append(durationAttrs, attribute.String("error.type", sessionErr.Error()))
}

medium

Assigning durationAttrs directly to fullAttrs and then conditionally appending to it can be confusing due to slice aliasing. While it doesn't cause a bug in the current code, creating an explicit copy makes the intent clearer and prevents potential bugs if the code is refactored in the future.

Suggested change
- durationAttrs := fullAttrs
- if sessionErr != nil && sessionErr != io.EOF {
-     durationAttrs = append(durationAttrs, attribute.String("error.type", sessionErr.Error()))
- }
+ durationAttrs := make([]attribute.KeyValue, len(fullAttrs))
+ copy(durationAttrs, fullAttrs)
+ if sessionErr != nil && sessionErr != io.EOF {
+     durationAttrs = append(durationAttrs, attribute.String("error.type", sessionErr.Error()))
+ }
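
For readers less familiar with slice aliasing, the following standalone sketch shows the failure mode the comment guards against. It uses plain strings instead of `attribute.KeyValue`, and `appendAliased` and `appendCopied` are hypothetical helpers written for this example. The problem only appears when the source slice has spare capacity, which is why it may not show up in the current code.

```go
package main

import "fmt"

// appendAliased mimics `durationAttrs := fullAttrs` followed by append:
// the result may share its backing array with base.
func appendAliased(base []string, extra string) []string {
	out := base
	return append(out, extra)
}

// appendCopied mimics the suggested change: copy first, then append,
// so the result never shares storage with base.
func appendCopied(base []string, extra string) []string {
	out := make([]string, len(base))
	copy(out, base)
	return append(out, extra)
}

// demo returns the third element of the first result for each approach
// after a second append was made from the same base slice.
func demo() (aliased, copied string) {
	// A slice with spare capacity, as can happen after earlier appends.
	base := make([]string, 2, 4)
	base[0], base[1] = "network.transport=stdio", "network.protocol.name=mcp"

	a := appendAliased(base, "error.type=first")
	_ = appendAliased(base, "error.type=second") // overwrites a[2] in place

	c := appendCopied(base, "error.type=first")
	_ = appendCopied(base, "error.type=second") // separate storage each time

	return a[2], c[2]
}

func main() {
	aliased, copied := demo()
	fmt.Println("aliased:", aliased)
	fmt.Println("copied: ", copied)
}
```

In the aliased case the first result silently changes when the second append runs, while the copied version keeps each result independent.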

Comment on lines +376 to 380
durationAttrs := sessionAttrs
if err != nil {
    span.SetStatus(codes.Error, err.Error())
    durationAttrs = append(durationAttrs, attribute.String("error.type", err.Error()))
}

medium

Similar to another comment, the direct assignment durationAttrs := sessionAttrs followed by a conditional append can be risky due to slice aliasing. sessionAttrs is used for decrementing the active sessions counter just before this. To improve code safety and clarity, it's better to work with an explicit copy.

Suggested change
- durationAttrs := sessionAttrs
- if err != nil {
-     span.SetStatus(codes.Error, err.Error())
-     durationAttrs = append(durationAttrs, attribute.String("error.type", err.Error()))
- }
+ durationAttrs := make([]attribute.KeyValue, len(sessionAttrs))
+ copy(durationAttrs, sessionAttrs)
+ if err != nil {
+     span.SetStatus(codes.Error, err.Error())
+     durationAttrs = append(durationAttrs, attribute.String("error.type", err.Error()))
+ }

Development

Successfully merging this pull request may close these issues.

Enhance Telemetry for Toolbox Servers

2 participants