-
Notifications
You must be signed in to change notification settings - Fork 399
feat(otel): add support for otel metrics #5021
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: master
Are you sure you want to change the base?
Conversation
BenchmarksBenchmark execution time: 2025-11-27 18:28:06 Comparing candidate commit bee73bb in PR branch Found 1 performance improvements and 0 performance regressions! Performance is the same for 43 metrics, 2 unstable metrics. scenario:profiling - intern mixed existing and new
|
a611845 to
9eb0a24
Compare
| gem 'opentelemetry-sdk', '~> 1.1' | ||
| gem 'opentelemetry-metrics-sdk', '>= 0.8' | ||
| gem 'opentelemetry-exporter-otlp-metrics', '>= 0.4' | ||
| # opentelemetry-metrics-sdk 0.11+ requires opentelemetry-common >= 0.23.0 (for time_in_nanoseconds) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Where is time_in_nanoseconds needed? I can't find it in this PR.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's defined in opentelemetry-common and used in the opentelemetry metrics sdk: https://github.com/search?q=repo%3Aopen-telemetry%2Fopentelemetry-ruby%20time_in_nanoseconds&type=code. It's not directly used in our library.
This is just note for us so we do not install incompatible opentelemetry libraries.
cfcb81a to
3faa913
Compare
Co-authored-by: Marco Costa <[email protected]>
| rescue => e | ||
| Datadog.logger.error("Failed to export OpenTelemetry Metrics: #{e.message}") | ||
| telemetry&.inc('tracers', METRIC_EXPORT_FAILURES, 1, tags: @telemetry_tags) | ||
| raise |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
do we want to raise this error again so that the user receives it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is a good question! @mabdinur, can you check if this API expects errors to propagate?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The metrics exporter does not have any error handling. If it raises an exception then the errors will bubble up. We're doing the same here: https://github.com/open-telemetry/opentelemetry-ruby/blob/8c02da8de62df1464a6e4c8cc35f838a0ccd0dd5/exporter/otlp-metrics/lib/opentelemetry/exporter/otlp/metrics/metrics_exporter.rb#L86.
lib/datadog/opentelemetry/metrics.rb
Outdated
| def self.resolve_metrics_endpoint | ||
| metrics_config = @settings.opentelemetry.metrics | ||
| exporter_config = @settings.opentelemetry.exporter | ||
|
|
||
| return metrics_config.endpoint.to_s if metrics_config.endpoint | ||
| scheme = @agent_ssl ? 'https' : 'http' | ||
|
|
||
| if metrics_config.protocol | ||
| protocol = metrics_config.protocol | ||
| port = (protocol == 'http/protobuf') ? 4318 : 4317 | ||
| path = (protocol == 'http/protobuf') ? '/v1/metrics' : '' | ||
| return "#{scheme}://#{@agent_host}:#{port}#{path}" | ||
| end | ||
|
|
||
| return exporter_config.endpoint.to_s if exporter_config.endpoint | ||
|
|
||
| protocol = exporter_config.protocol || 'http/protobuf' | ||
| port = (protocol == 'http/protobuf') ? 4318 : 4317 | ||
| path = (protocol == 'http/protobuf') ? '/v1/metrics' : '' | ||
| "#{scheme}://#{@agent_host}:#{port}#{path}" | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think we can definitely extract a function here that builds the full url from either metrics config or exporter config, and simply use it here:
metrics_endpoint(@settings.opentelemetry.metrics) || metrics_endpoint(@settings.opentelemetry.exporter)There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I removed most of this logic. Since grpc is not supported yet. We can really simplify this
| timeout = metrics_config.timeout_millis || exporter_config.timeout_millis | ||
| headers = metrics_config.headers || exporter_config.headers || {} | ||
|
|
||
| protocol = metrics_config.protocol || exporter_config.protocol |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's better to check if all the needed settings are in metrics_config, if not - use exporter_config. With the way it is now, the user could configure timeout for metrics, headers for exporter and protocol for metrics - and it would somehow work.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This makes sense but it goes against the opentelemetry spec. Each configuration needs to be evaluated separately
ivoanjo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Gave it another pass! I think this is getting in quite good shape :)
docs/OpenTelemetry.md
Outdated
| | Metrics SDK | https://rubygems.org/gems/opentelemetry-metrics-sdk | 2.25.0+ | >= 0.8 | | ||
| | OTLP Metrics Exporter | https://rubygems.org/gems/opentelemetry-exporter-otlp-metrics | 2.25.0+ | >= 0.4 | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why 2.25.0? Are we not planning to ship this with the next 2.23.0 release, or at least the 2.24.0 after that? 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wasn't sure when this PR would get merged so I picked a release that would likely go out at the end of Q4. Since we are close to getting this merged we can set it to 2.23.0 😄
| # Set environment variable before initializing metrics support | ||
| ENV['DD_METRICS_OTEL_ENABLED'] = 'true' | ||
| require 'opentelemetry/sdk' | ||
| require 'opentelemetry-metrics-sdk' | ||
| require 'opentelemetry/exporter/otlp_metrics' | ||
| require 'datadog/opentelemetry' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Have you considered turning these 5 lines into a datadog/opentelemetry/metrics_autoload (or boot or some name) or something like that?
That way customers would only need to copy-paste the simpler require :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually only opentelemetry/sdk and datadog/opentelemetry are required. opentelemetry/sdk is required to call OpenTelemetry::SDK.configure and 'datadog/opentelemetry' is required to enable ddtrace-opentelemetry support for both metrics and traces.
We do not need to import 'opentelemetry/exporter/otlp_metrics' and 'opentelemetry/exporter/otlp_metrics' here. This is handled by 'datadog/opentelemetry'. I added these require statements to signal to users that they NEED to install the metrics-sdk and the metrics-exporter. I'll remove these imports and make the description a bit clearer.
| rescue => e | ||
| Datadog.logger.warn("Error while sending runtime metric. Cause: #{e.class.name} #{e.message}") | ||
| Datadog.logger.warn("Error while sending runtime metric. Cause: #{e.class.name} #{e.class}: #{e}") | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Why add both e.class and e.class.name? 👀
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This was a product of my find and replace to address another comment. I missed the #{e.class.name} component. I will remove it.
| describe 'Basic Functionality' do | ||
| it 'exports counter metrics' do | ||
| setup_metrics | ||
| provider = ::OpenTelemetry.meter_provider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Consider making this into a let? lets are lazy, so they will only be called when needed, which I think is exactly what we need here (e.g. setup_metrics will still correctly run before)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup a let here is a good idea. Also the code when through a few iterations. We can simplify the tests and consolidate some of the helpers
| flush_and_wait(provider) | ||
|
|
||
| metric = find_metric_in_json(get_testagent_metrics, 'requests_myapp') |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Every call to get_testagent_metrics seems to be preceded by a flush_and_wait(provider) -- consider combining them to simplify the tests a bit (the less boilerplate they have, the clearer we can focus on what's actually being tested)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, just did this
| it 'exports gauge metrics' do | ||
| setup_metrics('OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE' => 'cumulative') | ||
| provider = ::OpenTelemetry.meter_provider | ||
| gauge = provider.meter('app').create_gauge('temperature') | ||
|
|
||
| gauge.record(72) | ||
| flush_and_wait(provider) | ||
| gauge.record(72) | ||
| flush_and_wait(provider) | ||
|
|
||
| metric = find_metric_in_json(get_testagent_metrics, 'temperature') | ||
| value = metric['gauge']['data_points'].first['as_int']&.to_i | ||
| expect(value).to eq(72) | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider turning some of ⬆️ into a comment in the test -- I think it's not obvious why the two records are there
| it 'metrics-specific configs take precedence over general OTLP configs' do | ||
| setup_metrics( | ||
| 'DD_METRICS_OTEL_ENABLED' => 'true', | ||
| 'OTEL_EXPORTER_OTLP_METRICS_ENDPOINT' => 'http://metrics:4318/v1/metrics', | ||
| 'OTEL_EXPORTER_OTLP_METRICS_PROTOCOL' => 'http/protobuf', | ||
| 'OTEL_EXPORTER_OTLP_METRICS_TIMEOUT' => '5000', | ||
| 'OTEL_EXPORTER_OTLP_METRICS_HEADERS' => 'metrics=value', | ||
| 'OTEL_METRIC_EXPORT_INTERVAL' => '4000', | ||
| 'OTEL_METRIC_EXPORT_TIMEOUT' => '3000', | ||
| 'OTEL_EXPORTER_OTLP_ENDPOINT' => 'http://general:4317', | ||
| 'OTEL_EXPORTER_OTLP_PROTOCOL' => 'grpc', | ||
| 'OTEL_EXPORTER_OTLP_TIMEOUT' => '2000', | ||
| 'OTEL_EXPORTER_OTLP_HEADERS' => 'general=value' | ||
| ) | ||
|
|
||
| provider = ::OpenTelemetry.meter_provider | ||
| reader = provider.metric_readers.first | ||
| exporter = reader.instance_variable_get(:@exporter) | ||
|
|
||
| expect(exporter.instance_variable_get(:@uri).to_s).to eq('http://metrics:4318/v1/metrics') | ||
| expect(exporter.instance_variable_get(:@timeout)).to eq(5.0) | ||
| expect(reader.instance_variable_get(:@export_interval)).to eq(4.0) | ||
| expect(reader.instance_variable_get(:@export_timeout)).to eq(3.0) | ||
| expect(exporter.instance_variable_get(:@headers)['metrics']).to eq('value') | ||
| end | ||
|
|
||
| it 'general OTLP configs work in isolation' do | ||
| setup_metrics( | ||
| 'OTEL_EXPORTER_OTLP_ENDPOINT' => 'http://general:4317', | ||
| 'OTEL_EXPORTER_OTLP_PROTOCOL' => 'http/protobuf', | ||
| 'OTEL_EXPORTER_OTLP_TIMEOUT' => '8000', | ||
| 'OTEL_EXPORTER_OTLP_HEADERS' => 'general=value' | ||
| ) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider factoring out the OTLP configs per-test. That way it's even more clear that they get overridden! Something like
describe 'configuration priority' do
let(:env_vars) do
# general otlp configs
end
before { setup_metrics(env_vars) }
it 'uses the general OTLP configs in isolation' do
...the current test...
end
context 'when metrics-specific configs are provided' do
let(:env_vars) do
super().merge(
# metrics configs only -- general come from super()
)
end
it 'gives metrics-specific precendence' do
...etc...
end
endThere was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup, this wouldve actually saved me a lot time. Debugging this test is hell
| reader = provider.metric_readers.first | ||
| exporter = reader.instance_variable_get(:@exporter) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: These two also could become lets after provider also becomes one :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yup and we can also put attributes and resource in lets as well
a6ddfdc to
bee73bb
Compare
Thanks so much for the thorough review and feedback. I really appreciate your time and patience. I learned a lot about ruby along the way 😸 |
ivoanjo
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 Looks great to me! I really like how clean the tests ended up :)
| proc do |value| | ||
| if value && value.to_s.downcase != 'http/protobuf' | ||
| Datadog.logger.warn("#{env_var_name}=#{value} is not supported. Using http/protobuf instead.") | ||
| 'http/protobuf' | ||
| else | ||
| value | ||
| end | ||
| end |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: I suggest this small tweak for two reasons
- It's clearer only protobuf is available
- In the prior version we actually did not
downcasethe actual value so I think we could end up with 'HtTp/PrOtObUf' which is kinda confusing?
| proc do |value| | |
| if value && value.to_s.downcase != 'http/protobuf' | |
| Datadog.logger.warn("#{env_var_name}=#{value} is not supported. Using http/protobuf instead.") | |
| 'http/protobuf' | |
| else | |
| value | |
| end | |
| end | |
| proc do |value| | |
| if value && value.to_s.downcase != 'http/protobuf' | |
| Datadog.logger.warn("#{env_var_name}=#{value} is not supported. Using http/protobuf instead.") | |
| end | |
| 'http/protobuf' | |
| end |
| @logger.warn("Could not load OTLP metrics exporter: #{e.class}: #{e}") | ||
| end | ||
|
|
||
| private :create_resource, :configure_metric_reader, :resolve_metrics_endpoint, :configure_otlp_exporter |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: Since this is now a regular class, you can leave a private in a single line at the beginning of the private section of the methods and not need to list them here ;)
| describe 'Basic Functionality' do | ||
| it 'exports counter metrics' do | ||
| setup_metrics | ||
| provider = ::OpenTelemetry.meter_provider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: (Can be removed? Since it's a let now?)
| provider = ::OpenTelemetry.meter_provider |
|
|
||
| it 'exports histogram metrics' do | ||
| setup_metrics | ||
| provider = ::OpenTelemetry.meter_provider |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: (Can be removed? Since it's a let now?)
| provider = ::OpenTelemetry.meter_provider |
| 'OTEL_METRIC_EXPORT_INTERVAL' => '4000', | ||
| 'OTEL_METRIC_EXPORT_TIMEOUT' => '3000', | ||
| 'OTEL_EXPORTER_OTLP_PROTOCOL' => 'grpc', | ||
| 'OTEL_EXPORTER_OTLP_TIMEOUT' => '2000' |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Minor: I think this one can be removed, since it's already defined in the outer block (to 8000)
| 'OTEL_EXPORTER_OTLP_TIMEOUT' => '2000' |
What does this PR do?
Adds OpenTelemetry metrics SDK integration with OTLP exporter support for gRPC and HTTP/protobuf protocols.
Motivation:
Enables customers to use the OpenTelemetry Metrics API and export metrics via OTLP to any compatible backend, including the Datadog Agent.
Change log entry
Adds OpenTelemetry metrics SDK integration with OTLP exporter support. Configure via
DD_METRICS_OTEL_ENABLEDand standard OpenTelemetry environment variables.Key Features:
OTEL_EXPORTER_OTLP_*) as defaults for metrics-specific settingsRequired Gems for metrics support:
opentelemetry-metrics-sdk(~> 0.8)opentelemetry-exporter-otlp-metrics(~> 0.4)Testing:
BUNDLE_GEMFILE=gemfiles/ruby_3.2_opentelemetry.gemfile bundle exec rspec spec/datadog/opentelemetry/metrics_spec.rb