Skip to content

Conversation

@mabdinur
Copy link
Contributor

@mabdinur mabdinur commented Nov 4, 2025

What does this PR do?

Adds OpenTelemetry metrics SDK integration with OTLP exporter support for gRPC and HTTP/protobuf protocols.

Motivation:
Enables customers to use the OpenTelemetry Metrics API and export metrics via OTLP to any compatible backend, including the Datadog Agent.

Change log entry
Adds OpenTelemetry metrics SDK integration with OTLP exporter support. Configure via DD_METRICS_OTEL_ENABLED and standard OpenTelemetry environment variables.

Key Features:

  • OTLP exporter support (gRPC and HTTP/protobuf)
  • Automatic endpoint resolution based on protocol and agent hostname
  • Datadog service metadata mapped to OpenTelemetry resource attributes
  • General OTLP settings (OTEL_EXPORTER_OTLP_*) as defaults for metrics-specific settings

Required Gems for metrics support:

  • opentelemetry-metrics-sdk (~> 0.8)
  • opentelemetry-exporter-otlp-metrics (~> 0.4)

Testing:

BUNDLE_GEMFILE=gemfiles/ruby_3.2_opentelemetry.gemfile bundle exec rspec spec/datadog/opentelemetry/metrics_spec.rb

@github-actions github-actions bot added core Involves Datadog core libraries otel OpenTelemetry-related changes labels Nov 4, 2025
@pr-commenter
Copy link

pr-commenter bot commented Nov 4, 2025

Benchmarks

Benchmark execution time: 2025-11-27 18:28:06

Comparing candidate commit bee73bb in PR branch munir/add-otel-metrics-configs with baseline commit 2c84fc0 in branch master.

Found 1 performance improvements and 0 performance regressions! Performance is the same for 43 metrics, 2 unstable metrics.

scenario:profiling - intern mixed existing and new

  • 🟩 throughput [+2.476op/s; +3.608op/s] or [+8.729%; +12.720%]

@mabdinur mabdinur changed the title chore(otel): add support for otel metrics [Part 1] chore(otel): add support for otel metrics Nov 4, 2025
@mabdinur mabdinur force-pushed the munir/add-otel-metrics-configs branch from a611845 to 9eb0a24 Compare November 5, 2025 00:29
gem 'opentelemetry-sdk', '~> 1.1'
gem 'opentelemetry-metrics-sdk', '>= 0.8'
gem 'opentelemetry-exporter-otlp-metrics', '>= 0.4'
# opentelemetry-metrics-sdk 0.11+ requires opentelemetry-common >= 0.23.0 (for time_in_nanoseconds)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where is time_in_nanoseconds needed? I can't find it in this PR.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's defined in opentelemetry-common and used in the opentelemetry metrics sdk: https://github.com/search?q=repo%3Aopen-telemetry%2Fopentelemetry-ruby%20time_in_nanoseconds&type=code. It's not directly used in our library.

This is just note for us so we do not install incompatible opentelemetry libraries.

@mabdinur mabdinur force-pushed the munir/add-otel-metrics-configs branch from cfcb81a to 3faa913 Compare November 21, 2025 19:40
rescue => e
Datadog.logger.error("Failed to export OpenTelemetry Metrics: #{e.message}")
telemetry&.inc('tracers', METRIC_EXPORT_FAILURES, 1, tags: @telemetry_tags)
raise
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we want to raise this error again so that the user receives it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a good question! @mabdinur, can you check if this API expects errors to propagate?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The metrics exporter does not have any error handling. If it raises an exception then the errors will bubble up. We're doing the same here: https://github.com/open-telemetry/opentelemetry-ruby/blob/8c02da8de62df1464a6e4c8cc35f838a0ccd0dd5/exporter/otlp-metrics/lib/opentelemetry/exporter/otlp/metrics/metrics_exporter.rb#L86.

Comment on lines 67 to 87
def self.resolve_metrics_endpoint
metrics_config = @settings.opentelemetry.metrics
exporter_config = @settings.opentelemetry.exporter

return metrics_config.endpoint.to_s if metrics_config.endpoint
scheme = @agent_ssl ? 'https' : 'http'

if metrics_config.protocol
protocol = metrics_config.protocol
port = (protocol == 'http/protobuf') ? 4318 : 4317
path = (protocol == 'http/protobuf') ? '/v1/metrics' : ''
return "#{scheme}://#{@agent_host}:#{port}#{path}"
end

return exporter_config.endpoint.to_s if exporter_config.endpoint

protocol = exporter_config.protocol || 'http/protobuf'
port = (protocol == 'http/protobuf') ? 4318 : 4317
path = (protocol == 'http/protobuf') ? '/v1/metrics' : ''
"#{scheme}://#{@agent_host}:#{port}#{path}"
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can definitely extract a function here that builds the full url from either metrics config or exporter config, and simply use it here:

metrics_endpoint(@settings.opentelemetry.metrics) || metrics_endpoint(@settings.opentelemetry.exporter)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I removed most of this logic. Since grpc is not supported yet. We can really simplify this

Comment on lines +95 to +98
timeout = metrics_config.timeout_millis || exporter_config.timeout_millis
headers = metrics_config.headers || exporter_config.headers || {}

protocol = metrics_config.protocol || exporter_config.protocol
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's better to check if all the needed settings are in metrics_config, if not - use exporter_config. With the way it is now, the user could configure timeout for metrics, headers for exporter and protocol for metrics - and it would somehow work.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This makes sense but it goes against the opentelemetry spec. Each configuration needs to be evaluated separately

@mabdinur mabdinur requested review from ivoanjo and marcotc November 26, 2025 16:58
Copy link
Member

@ivoanjo ivoanjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave it another pass! I think this is getting in quite good shape :)

Comment on lines 8 to 9
| Metrics SDK | https://rubygems.org/gems/opentelemetry-metrics-sdk | 2.25.0+ | >= 0.8 |
| OTLP Metrics Exporter | https://rubygems.org/gems/opentelemetry-exporter-otlp-metrics | 2.25.0+ | >= 0.4 |
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why 2.25.0? Are we not planning to ship this with the next 2.23.0 release, or at least the 2.24.0 after that? 👀

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wasn't sure when this PR would get merged so I picked a release that would likely go out at the end of Q4. Since we are close to getting this merged we can set it to 2.23.0 😄

Comment on lines 64 to 69
# Set environment variable before initializing metrics support
ENV['DD_METRICS_OTEL_ENABLED'] = 'true'
require 'opentelemetry/sdk'
require 'opentelemetry-metrics-sdk'
require 'opentelemetry/exporter/otlp_metrics'
require 'datadog/opentelemetry'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have you considered turning these 5 lines into a datadog/opentelemetry/metrics_autoload (or boot or some name) or something like that?

That way customers would only need to copy-paste the simpler require :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually only opentelemetry/sdk and datadog/opentelemetry are required. opentelemetry/sdk is required to call OpenTelemetry::SDK.configure and 'datadog/opentelemetry' is required to enable ddtrace-opentelemetry support for both metrics and traces.

We do not need to import 'opentelemetry/exporter/otlp_metrics' and 'opentelemetry/exporter/otlp_metrics' here. This is handled by 'datadog/opentelemetry'. I added these require statements to signal to users that they NEED to install the metrics-sdk and the metrics-exporter. I'll remove these imports and make the description a bit clearer.

Comment on lines 99 to 101
rescue => e
Datadog.logger.warn("Error while sending runtime metric. Cause: #{e.class.name} #{e.message}")
Datadog.logger.warn("Error while sending runtime metric. Cause: #{e.class.name} #{e.class}: #{e}")
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Why add both e.class and e.class.name? 👀

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was a product of my find and replace to address another comment. I missed the #{e.class.name} component. I will remove it.

describe 'Basic Functionality' do
it 'exports counter metrics' do
setup_metrics
provider = ::OpenTelemetry.meter_provider
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Consider making this into a let? lets are lazy, so they will only be called when needed, which I think is exactly what we need here (e.g. setup_metrics will still correctly run before)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup a let here is a good idea. Also the code when through a few iterations. We can simplify the tests and consolidate some of the helpers

Comment on lines 111 to 113
flush_and_wait(provider)

metric = find_metric_in_json(get_testagent_metrics, 'requests_myapp')
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Every call to get_testagent_metrics seems to be preceded by a flush_and_wait(provider) -- consider combining them to simplify the tests a bit (the less boilerplate they have, the clearer we can focus on what's actually being tested)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, just did this

Comment on lines 134 to 147
it 'exports gauge metrics' do
setup_metrics('OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE' => 'cumulative')
provider = ::OpenTelemetry.meter_provider
gauge = provider.meter('app').create_gauge('temperature')

gauge.record(72)
flush_and_wait(provider)
gauge.record(72)
flush_and_wait(provider)

metric = find_metric_in_json(get_testagent_metrics, 'temperature')
value = metric['gauge']['data_points'].first['as_int']&.to_i
expect(value).to eq(72)
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider turning some of ⬆️ into a comment in the test -- I think it's not obvious why the two records are there

Comment on lines 241 to 273
it 'metrics-specific configs take precedence over general OTLP configs' do
setup_metrics(
'DD_METRICS_OTEL_ENABLED' => 'true',
'OTEL_EXPORTER_OTLP_METRICS_ENDPOINT' => 'http://metrics:4318/v1/metrics',
'OTEL_EXPORTER_OTLP_METRICS_PROTOCOL' => 'http/protobuf',
'OTEL_EXPORTER_OTLP_METRICS_TIMEOUT' => '5000',
'OTEL_EXPORTER_OTLP_METRICS_HEADERS' => 'metrics=value',
'OTEL_METRIC_EXPORT_INTERVAL' => '4000',
'OTEL_METRIC_EXPORT_TIMEOUT' => '3000',
'OTEL_EXPORTER_OTLP_ENDPOINT' => 'http://general:4317',
'OTEL_EXPORTER_OTLP_PROTOCOL' => 'grpc',
'OTEL_EXPORTER_OTLP_TIMEOUT' => '2000',
'OTEL_EXPORTER_OTLP_HEADERS' => 'general=value'
)

provider = ::OpenTelemetry.meter_provider
reader = provider.metric_readers.first
exporter = reader.instance_variable_get(:@exporter)

expect(exporter.instance_variable_get(:@uri).to_s).to eq('http://metrics:4318/v1/metrics')
expect(exporter.instance_variable_get(:@timeout)).to eq(5.0)
expect(reader.instance_variable_get(:@export_interval)).to eq(4.0)
expect(reader.instance_variable_get(:@export_timeout)).to eq(3.0)
expect(exporter.instance_variable_get(:@headers)['metrics']).to eq('value')
end

it 'general OTLP configs work in isolation' do
setup_metrics(
'OTEL_EXPORTER_OTLP_ENDPOINT' => 'http://general:4317',
'OTEL_EXPORTER_OTLP_PROTOCOL' => 'http/protobuf',
'OTEL_EXPORTER_OTLP_TIMEOUT' => '8000',
'OTEL_EXPORTER_OTLP_HEADERS' => 'general=value'
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider factoring out the OTLP configs per-test. That way it's even more clear that they get overridden! Something like

describe 'configuration priority' do
   let(:env_vars) do
      # general otlp configs
   end

   before { setup_metrics(env_vars) }

   it 'uses the general OTLP configs in isolation' do
      ...the current test...
   end

   context 'when metrics-specific configs are provided' do
      let(:env_vars) do
         super().merge(
             # metrics configs only -- general come from super()
         )
      end

      it 'gives metrics-specific precendence' do
            ...etc...
       end

    end

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, this wouldve actually saved me a lot time. Debugging this test is hell

Comment on lines 288 to 289
reader = provider.metric_readers.first
exporter = reader.instance_variable_get(:@exporter)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: These two also could become lets after provider also becomes one :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup and we can also put attributes and resource in lets as well

@mabdinur mabdinur force-pushed the munir/add-otel-metrics-configs branch from a6ddfdc to bee73bb Compare November 27, 2025 17:57
@mabdinur
Copy link
Contributor Author

Gave it another pass! I think this is getting in quite good shape :)

Thanks so much for the thorough review and feedback. I really appreciate your time and patience. I learned a lot about ruby along the way 😸

@mabdinur mabdinur requested a review from ivoanjo November 27, 2025 19:42
Copy link
Member

@ivoanjo ivoanjo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍 Looks great to me! I really like how clean the tests ended up :)

Comment on lines +26 to +33
proc do |value|
if value && value.to_s.downcase != 'http/protobuf'
Datadog.logger.warn("#{env_var_name}=#{value} is not supported. Using http/protobuf instead.")
'http/protobuf'
else
value
end
end
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: I suggest this small tweak for two reasons

  1. It's clearer only protobuf is available
  2. In the prior version we actually did not downcase the actual value so I think we could end up with 'HtTp/PrOtObUf' which is kinda confusing?
Suggested change
proc do |value|
if value && value.to_s.downcase != 'http/protobuf'
Datadog.logger.warn("#{env_var_name}=#{value} is not supported. Using http/protobuf instead.")
'http/protobuf'
else
value
end
end
proc do |value|
if value && value.to_s.downcase != 'http/protobuf'
Datadog.logger.warn("#{env_var_name}=#{value} is not supported. Using http/protobuf instead.")
end
'http/protobuf'
end

@logger.warn("Could not load OTLP metrics exporter: #{e.class}: #{e}")
end

private :create_resource, :configure_metric_reader, :resolve_metrics_endpoint, :configure_otlp_exporter
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Since this is now a regular class, you can leave a private in a single line at the beginning of the private section of the methods and not need to list them here ;)

describe 'Basic Functionality' do
it 'exports counter metrics' do
setup_metrics
provider = ::OpenTelemetry.meter_provider
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: (Can be removed? Since it's a let now?)

Suggested change
provider = ::OpenTelemetry.meter_provider


it 'exports histogram metrics' do
setup_metrics
provider = ::OpenTelemetry.meter_provider
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: (Can be removed? Since it's a let now?)

Suggested change
provider = ::OpenTelemetry.meter_provider

'OTEL_METRIC_EXPORT_INTERVAL' => '4000',
'OTEL_METRIC_EXPORT_TIMEOUT' => '3000',
'OTEL_EXPORTER_OTLP_PROTOCOL' => 'grpc',
'OTEL_EXPORTER_OTLP_TIMEOUT' => '2000'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: I think this one can be removed, since it's already defined in the outer block (to 8000)

Suggested change
'OTEL_EXPORTER_OTLP_TIMEOUT' => '2000'

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

core Involves Datadog core libraries integrations Involves tracing integrations otel OpenTelemetry-related changes tracing

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants