
Add SDK span telemetry metrics #1631

Merged: 67 commits merged into open-telemetry:main on Feb 13, 2025

Conversation

@JonasKunz (Contributor) commented Nov 29, 2024

Changes

With this PR I'd like to start a discussion around adding SDK self-monitoring metrics to the semantic conventions.
The goal of these metrics is to give insights into how the SDK is performing, e.g. whether data is being dropped due to overload / misconfiguration or everything is healthy.
I'd like to add these to semconv to keep them language agnostic, so that for example a single dashboard can be used to visualize the health state of all SDKs used in a system.

We checked the SDK implementations; it seems that only the Java SDK currently has health metrics implemented.
This PR takes some inspiration from those and is intended to improve on and therefore supersede them.

I'd like to start out with just span-related metrics to keep the PR and discussions simpler here, but would follow up with similar PRs for logs and metrics based on the outcome of the discussions on this PR.

Prior work

This PR can be seen as a follow-up to the closed OTEP 259.

So we have kind of gone full circle: the discussion started with just SDK metrics (only for exporters), moved to an approach that unified the metrics across SDK exporters and the collector, and then ended up with just collector metrics.
So this PR can be seen as the required revival of #184 (see also this comment).

In my opinion, it is a good thing to separate the collector and SDK self-metrics:

  • There have been concerns about using the same metrics for both: how do you distinguish the metrics exposed by collector components from the self-monitoring metrics exposed by an OTel SDK used inside the collector, e.g. for tracing the collector itself?
  • Though many collector and SDK concepts share the same name, they are not the same thing (to my knowledge; I'm not a collector expert): for example, processors in the collector are designed to form pipelines and potentially mutate the data as it passes through. In contrast, SDK span processors don't form pipelines (at least none visible to the SDK; those would be hidden custom implementations); they are merely observers with multiple callbacks for the span lifecycle. So it would feel like "shoehorning" different concepts into the same metric.
  • Separating collector and SDK metrics makes evolving them and reaching agreement a lot easier: with separate metrics and namespaces, collector metrics can focus on the collector implementation and SDK metrics can be defined purely against the SDK spec. If both were combined into shared metrics, they would always have to be aligned with both the SDK spec and the collector implementation. I think this would make maintenance much harder for little benefit.
  • I have a hard time finding benefits of sharing metrics between the SDK and the collector: the main one would of course be easier dashboarding / analysis. However, having to look at two sets of metrics for that is a fine tradeoff, considering the difficulties with unification listed above and shown by the history of OTEP 259.

Existing Metrics in Java SDK

For reference, here is what the existing health metrics currently look like in the Java SDK:

Batch Span Processor metrics

  • Gauge queueSize, value is the current size of the queue
    • Attribute spanProcessorType=BatchSpanProcessor (there was a former ExecutorServiceSpanProcessor which has been removed)
    • This metric currently causes collisions if two BatchSpanProcessor instances are used
  • Counter processedSpans, value is the number of spans submitted to the processor
    • Attribute spanProcessorType=BatchSpanProcessor
    • Attribute dropped (boolean); dropped=true counts the spans that could not be processed due to a full queue

The SDK also implements pretty much the same metrics for the BatchLogRecordProcessor, with span replaced by log everywhere.
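
For illustration, here is a minimal sketch, assuming the OpenTelemetry Java metrics API, of how metrics shaped like the ones above could be recorded. It is not the SDK's actual internal code; the meter/scope name, the queue field, and the class itself are assumptions made for the example.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BatchProcessorHealthMetricsSketch {
  // Stand-in for the batch span processor's internal queue (assumption for this example).
  private final BlockingQueue<Object> queue = new ArrayBlockingQueue<>(2048);

  // Scope name is an assumption; the real SDK uses its own instrumentation scope.
  private final Meter meter = GlobalOpenTelemetry.getMeter("io.opentelemetry.sdk.trace");

  private final Attributes processorType =
      Attributes.of(AttributeKey.stringKey("spanProcessorType"), "BatchSpanProcessor");

  // Counter of spans submitted to the processor, split by the boolean "dropped" attribute.
  private final LongCounter processedSpans = meter.counterBuilder("processedSpans").build();

  BatchProcessorHealthMetricsSketch() {
    // Gauge reporting the current queue size, as described above.
    meter
        .gaugeBuilder("queueSize")
        .ofLongs()
        .buildWithCallback(measurement -> measurement.record(queue.size(), processorType));
  }

  // Called whenever a span is submitted to the processor.
  void onSpanSubmitted(boolean dropped) {
    processedSpans.add(
        1,
        processorType.toBuilder().put(AttributeKey.booleanKey("dropped"), dropped).build());
  }
}
```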

Exporter metrics

Exporter metrics are the same for spans, metrics and logs. They are distinguishable based on a type attribute.
The metric names also depend on a "name" and a "transport" defined by the exporter. For OTLP, those are:

  • exporterName=otlp
  • transport is one of grpc, http (= protobuf) or http-json

The transport is only used in the instrumentation scope name: io.opentelemetry.exporters.<exporterName>-<transport> (e.g. io.opentelemetry.exporters.otlp-grpc for the OTLP gRPC exporter).

Based on that, the following metrics are exposed:

Merge requirement checklist

@JonasKunz marked this pull request as ready for review November 29, 2024 10:40
@JonasKunz requested review from a team as code owners November 29, 2024 10:40
@lmolkova (Contributor) commented Dec 3, 2024

Related #1580

@lzchen commented Feb 10, 2025

@JonasKunz

Would a otel.sdk.exporter.span.exported.duration metric that tracks the average duration of exports be on the roadmap? This metric would be quite useful to us.

@JonasKunz (Contributor, Author) commented

> Would a otel.sdk.exporter.span.exported.duration metric that tracks the average duration of exports be on the roadmap?

@lzchen not in this PR, but I don't see why we wouldn't add something like this in the future. To me, this would fall into an exporter.request metric category for exporters that use network requests for their exports. There we should capture things such as request count, size, and duration.

This PR is currently about tracking data loss (plus, as a bonus, the effective sampling rate).

@dashpole (Contributor) commented Feb 11, 2025

> Would a otel.sdk.exporter.span.exported.duration metric that tracks the average duration of exports be on the roadmap? This metric would be quite useful to us.

@lzchen Would HTTP and gRPC instrumentation be good enough to solve this use case? Or do you think explicit additional metrics in the exporters are needed?

@lzchen commented Feb 11, 2025

@JonasKunz

> To me, this would fall into an exporter.request metric category for exporters that use network requests for their exports. There we should capture things such as request count, size, and duration.

For our use case in particular, tracking those things (request count, size, and duration) is exactly what we need. Speaking separately though, would "duration" be a useful metric for exporters in general, even for those that don't wind up using network requests?

@dashpole

> Would HTTP and gRPC instrumentation be good enough to solve this use case? Or do you think explicit additional metrics in the exporters are needed?

I believe certain implementations (like Python) have made it so that instrumentations do not track calls made by the SDK (and thus, the exporter) itself. I think explicit metrics related to SDK components are needed in that regard.

@dashpole (Contributor) commented

> I believe certain implementations (like Python) have made it so that instrumentations do not track calls made by the SDK (and thus, the exporter) itself. I think explicit metrics related to SDK components are needed in that regard.

That makes sense for tracing (where it is easy to produce an infinite export loop) but, IMO, makes less sense for metrics where that kind of feedback loop doesn't exist.

@lzchen commented Feb 11, 2025

> That makes sense for tracing (where it is easy to produce an infinite export loop) but, IMO, makes less sense for metrics where that kind of feedback loop doesn't exist.

That's a good point. Unfortunately, at least today, all our instrumentations behave that way. Hypothetically, if we were to change this behavior, the instrumentations wouldn't be able to differentiate between calls made by the SDK and ones made by the user's application, correct?

@dashpole (Contributor) commented

Yeah... people would need to use the server.port attribute, or maybe we could find a way to make the OTel HTTP trace exporters set url.template to /v1/traces.

@lmolkova (Contributor) commented Feb 12, 2025

A few points on duration:

  1. Export duration != HTTP/gRPC call duration: it's the duration of the logical operation after all retries and includes the time to transfer request and response payloads, i.e. it is useful independently of the underlying requests.
  2. Counts can be derived from duration (a histogram data point already carries a count), so if (once) we add duration, the count metrics would become redundant. We avoid this in semconv in general and stick to duration histograms.

So I think duration is the first and the most important choice.

@JonasKunz (Contributor, Author) commented

> Export duration != HTTP/gRPC call duration: it's the duration of the logical operation after all retries and includes the time to transfer request and response payloads, i.e. it is useful independently of the underlying requests.

So IIUC you are referring to a part of what I would call "pipeline latency": the total time a span takes from being ended to being successfully exported. The metric you are envisioning would be the portion of this latency spent in the exporter, ignoring e.g. the batching span processor delay.

> Counts can be derived from duration, so if (once) we add duration, the count metrics would become redundant. We avoid this in semconv in general and stick to duration histograms.

My main concern here would be storage overhead. Histograms are much more expensive than counters (a histogram data point carries a count per bucket plus sum and count, whereas a counter data point is a single value): at least 10x even with coarse buckets, and more if you use exponential histograms or properly fine-granular buckets. This would make it hard to justify having the health metrics enabled by default, while having them enabled by default gives users the best out-of-the-box experience.

At the same time, I can't really see the general importance / usefulness of having the exporter durations: it feels more like a nice-to-have. What conclusions does this metric allow you to draw? Do you have concrete examples?

@AlexanderWert (Member) commented Feb 12, 2025

Since all other comments and discussions above are resolved, and duration is something we would (if at all) want to do as a follow-up anyway and is additive, I'd propose we create a follow-up issue where the above discussion can continue, and merge this PR as is.

WDYT? @lmolkova @JonasKunz @dashpole @lzchen

@lmolkova (Contributor) commented Feb 13, 2025

> So IIUC you are referring to a part of what I would call "pipeline latency": the total time a span takes from being ended to being successfully exported. The metric you are envisioning would be the portion of this latency spent in the exporter, ignoring e.g. the batching span processor delay.

Pipeline latency is cool, but the moment you have it, you also need a way to break it down into pieces (exporter part, processor queue).

> At the same time, I can't really see the general importance / usefulness of having the exporter durations: it feels more like a nice-to-have. What conclusions does this metric allow you to draw? Do you have concrete examples?

Debugging connectivity with my backend - network issues, throttling, slow backend response, retries, retry backoff interval optimizations.
As @lzchen mentioned, the absolute majority of OTel SDKs suppress all instrumentation around exporter calls, so you wouldn't have protocol-level spans and/or metrics. A metric like this would therefore be the only way to know how good your connectivity is.

Counts are good, but they won't tell you that your P99 is 10 seconds after all retries because your backoff interval is wrong. You'd just see fewer of them and have no idea why.

I don't think it's just a nice-to-have.
Given it's experimental, we can go ahead with just counts and potentially deprecate them when/if histograms are added.

As a cost-mitigation strategy, we can use a small number of buckets by default, and users can always reconfigure them if they need fewer/more.
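
For illustration, a minimal sketch of that mitigation, assuming the Java SDK's view API and the hypothetical otel.sdk.exporter.span.exported.duration histogram discussed above (the bucket boundaries are arbitrary example values):

```java
import io.opentelemetry.sdk.metrics.Aggregation;
import io.opentelemetry.sdk.metrics.InstrumentSelector;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.View;
import java.util.Arrays;

public class SmallBucketDefaults {
  public static void main(String[] args) {
    SdkMeterProvider meterProvider =
        SdkMeterProvider.builder()
            // Select the hypothetical exporter-duration histogram discussed in this thread.
            .registerView(
                InstrumentSelector.builder()
                    .setName("otel.sdk.exporter.span.exported.duration")
                    .build(),
                // Keep only a handful of explicit buckets by default to limit storage cost;
                // users can replace this list with finer-grained boundaries if needed.
                View.builder()
                    .setAggregation(
                        Aggregation.explicitBucketHistogram(Arrays.asList(0.005, 0.05, 0.5, 5.0)))
                    .build())
            .build();
    // The resulting meterProvider would then be passed when building the OpenTelemetrySdk.
  }
}
```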

@AlexanderWert merged commit 87bd2c1 into open-telemetry:main on Feb 13, 2025 (15 checks passed)
@JonasKunz (Contributor, Author) commented

Issue for follow-up discussions around adding duration:
#1906

Labels: area:otel, enhancement