
Add SDK span telemetry metrics #1631

Merged: 67 commits merged into open-telemetry:main on Feb 13, 2025

Conversation

@JonasKunz (Contributor) commented Nov 29, 2024

Changes

With this PR I'd like to start a discussion around adding SDK self-monitoring metrics to the semantic conventions.
The goal of these metrics is to give insights into how the SDK is performing, e.g. whether data is being dropped due to overload / misconfiguration or everything is healthy.
I'd like to add these to semconv to keep them language agnostic, so that for example a single dashboard can be used to visualize the health state of all SDKs used in a system.

We checked the SDK implementations; it seems that only the Java SDK currently has health metrics implemented.
This PR takes some inspiration from those and is intended to improve on and therefore supersede them.

I'd like to start out with just span-related metrics to keep the PR and discussions simpler here, but would follow up with similar PRs for logs and metrics based on the outcome of the discussions on this PR.

Prior work

This PR can be seen as a follow-up to the closed OTEP 259.

So we have kind of gone full circle: the discussion started with just SDK metrics (only for exporters), moved to an approach that unified the metrics across SDK exporters and the collector, and then ended up with just collector metrics.
So this PR can be seen as the required revival of #184 (see also this comment).

In my opinion, it is a good thing to separate the collector and SDK self-metrics:

  • There have been concerns about using the same metrics for both: how do you distinguish the metrics exposed by collector components from the self-monitoring metrics exposed by an OTel SDK used inside the collector, e.g. for tracing the collector itself?
  • Though many collector and SDK concepts share the same name, they are not the same thing (to my knowledge; I'm not a collector expert): for example, processors in the collector are designed to form pipelines and potentially mutate the data as it passes through. In contrast, SDK span processors don't form pipelines (at least none visible to the SDK; those would be hidden custom implementations); they are merely observers with multiple callbacks for the span lifecycle. So it would feel like "shoehorning" different concepts into the same metric.
  • Separating collector and SDK metrics makes evolving them and reaching agreement a lot easier: with separate metrics and namespaces, collector metrics can focus on the collector implementation and SDK metrics can be defined purely against the SDK spec. If both were combined into shared metrics, they would always have to be aligned with both the SDK spec and the collector implementation. I think this would make maintenance much harder for little benefit.
  • I have a hard time finding benefits of sharing metrics between the SDK and the collector: the main one would of course be easier dashboarding / analysis. However, having to look at two sets of metrics for that is a fine tradeoff, considering the difficulties with unification listed above and shown by the history of OTEP 259.

Existing Metrics in Java SDK

For reference, here is what the existing health metrics currently look like in the Java SDK:

Batch Span Processor metrics

  • Gauge queueSize, value is the current size of the queue
    • Attribute spanProcessorType=BatchSpanProcessor (there was a former ExecutorServiceSpanProcessor which has been removed)
    • This metric currently causes collisions if two BatchSpanProcessor instances are used
  • Counter processedSpans, value is the number of spans submitted to the processor
    • Attribute spanProcessorType=BatchSpanProcessor
    • Attribute dropped (boolean); dropped=true counts the spans that could not be processed due to a full queue

The SDK also implements pretty much the same metrics for the BatchLogRecordProcessor, with span replaced by log everywhere.
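
For illustration, here is a minimal sketch, assuming the OpenTelemetry Java metrics API, of how metrics shaped like the ones above could be recorded. It is not the SDK's actual internal code; the meter/scope name, the queue field, and the class itself are assumptions made for the example.

```java
import io.opentelemetry.api.GlobalOpenTelemetry;
import io.opentelemetry.api.common.AttributeKey;
import io.opentelemetry.api.common.Attributes;
import io.opentelemetry.api.metrics.LongCounter;
import io.opentelemetry.api.metrics.Meter;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;

class BatchProcessorHealthMetricsSketch {
  // Stand-in for the batch span processor's internal queue (assumption for this example).
  private final BlockingQueue<Object> queue = new ArrayBlockingQueue<>(2048);

  // Scope name is an assumption; the real SDK uses its own instrumentation scope.
  private final Meter meter = GlobalOpenTelemetry.getMeter("io.opentelemetry.sdk.trace");

  private final Attributes processorType =
      Attributes.of(AttributeKey.stringKey("spanProcessorType"), "BatchSpanProcessor");

  // Counter of spans submitted to the processor, split by the boolean "dropped" attribute.
  private final LongCounter processedSpans = meter.counterBuilder("processedSpans").build();

  BatchProcessorHealthMetricsSketch() {
    // Gauge reporting the current queue size, as described above.
    meter
        .gaugeBuilder("queueSize")
        .ofLongs()
        .buildWithCallback(measurement -> measurement.record(queue.size(), processorType));
  }

  // Called whenever a span is submitted to the processor.
  void onSpanSubmitted(boolean dropped) {
    processedSpans.add(
        1,
        processorType.toBuilder().put(AttributeKey.booleanKey("dropped"), dropped).build());
  }
}
```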

Exporter metrics

Exporter metrics are the same for spans, metrics and logs. They are distinguishable based on a type attribute.
The metric names also depend on a "name" and a "transport" defined by the exporter. For OTLP, those are:

  • exporterName=otlp
  • transport is one of grpc, http (= protobuf) or http-json

The transport is only used in the instrumentation scope name: io.opentelemetry.exporters.<exporterName>-<transport> (e.g. io.opentelemetry.exporters.otlp-grpc for the OTLP gRPC exporter).

Based on that, the following metrics are exposed:

Merge requirement checklist

@JonasKunz marked this pull request as ready for review November 29, 2024 10:40
@JonasKunz requested review from a team as code owners November 29, 2024 10:40
@lmolkova (Contributor) commented Dec 3, 2024

Related #1580

@lzchen commented Feb 10, 2025

@JonasKunz

Would a otel.sdk.exporter.span.exported.duration metric that tracks the average duration of exports be on the roadmap? This metric would be quite useful to us.

@JonasKunz (Contributor, Author) commented

> Would a otel.sdk.exporter.span.exported.duration metric that tracks the average duration of exports be on the roadmap?

@lzchen not in this PR, but I don't see why we wouldn't add something like this in the future. To me, this would fall into an exporter.request metric category for exporters that use network requests for their exports. There we should capture things such as request count, size, and duration.

This PR is currently about tracking data loss (plus, as a bonus, the effective sampling rate).

@dashpole (Contributor) commented Feb 11, 2025

> Would a otel.sdk.exporter.span.exported.duration metric that tracks the average duration of exports be on the roadmap? This metric would be quite useful to us.

@lzchen Would HTTP and gRPC instrumentation be good enough to solve this use case? Or do you think explicit additional metrics in the exporters are needed?

@lzchen commented Feb 11, 2025

@JonasKunz

> To me, this would fall into an exporter.request metric category for exporters that use network requests for their exports. There we should capture things such as request count, size, and duration.

For our use case in particular, tracking those things (request count, size, and duration) is exactly what we need. Speaking separately though, would "duration" be a useful metric for exporters in general, even for those that don't wind up using network requests?

@dashpole

> Would HTTP and gRPC instrumentation be good enough to solve this use case? Or do you think explicit additional metrics in the exporters are needed?

I believe certain implementations (like Python) have made it so that instrumentations do not track calls made by the SDK (and thus, the exporter) itself. I think explicit metrics related to SDK components are needed in that regard.

@dashpole (Contributor) commented

> I believe certain implementations (like Python) have made it so that instrumentations do not track calls made by the SDK (and thus, the exporter) itself. I think explicit metrics related to SDK components are needed in that regard.

That makes sense for tracing (where it is easy to produce an infinite export loop) but, IMO, makes less sense for metrics where that kind of feedback loop doesn't exist.

@lzchen commented Feb 11, 2025

> That makes sense for tracing (where it is easy to produce an infinite export loop) but, IMO, makes less sense for metrics where that kind of feedback loop doesn't exist.

That's a good point. Unfortunately, at least today, all our instrumentations behave that way. Hypothetically, if we were to change this behavior, the instrumentations wouldn't be able to differentiate between calls made by the SDK and ones made by the user's application, correct?

@dashpole (Contributor) commented

Yeah... people would need to use the server.port attribute, or maybe we could find a way to make the OTel HTTP trace exporters set url.template to /v1/traces.

@lmolkova (Contributor) commented Feb 12, 2025

A few points on duration:

  1. Export duration != HTTP/gRPC call duration: it's the duration of the logical operation after all retries and includes the time to transfer request and response payloads, i.e. it is useful independently of the underlying requests.
  2. Counts can be derived from duration (a histogram data point already carries a count), so if (once) we add duration, the count metrics would become redundant. We avoid this in semconv in general and stick to duration histograms.

So I think duration is the first and the most important choice.

@JonasKunz (Contributor, Author) commented

> Export duration != HTTP/gRPC call duration: it's the duration of the logical operation after all retries and includes the time to transfer request and response payloads, i.e. it is useful independently of the underlying requests.

So IIUC you are referring to a part of what I would call "pipeline latency": the total time a span takes from being ended to being successfully exported. The metric you are envisioning would be the portion of this latency spent in the exporter, ignoring e.g. the batching span processor delay.

> Counts can be derived from duration, so if (once) we add duration, the count metrics would become redundant. We avoid this in semconv in general and stick to duration histograms.

My main concern here would be storage overhead. Histograms are much more expensive than counters (a histogram data point carries a count per bucket plus sum and count, whereas a counter data point is a single value): at least 10x even with coarse buckets, and more if you use exponential histograms or properly fine-granular buckets. This would make it hard to justify having the health metrics enabled by default, while having them enabled by default gives users the best out-of-the-box experience.

At the same time, I can't really see the general importance / usefulness of having the exporter durations: it feels more like a nice-to-have. What conclusions does this metric allow you to draw? Do you have concrete examples?

@AlexanderWert (Member) commented Feb 12, 2025

Since all other comments and discussions above are resolved, and duration is something we would (if at all) want to do as a follow-up anyway and is additive, I'd propose we create a follow-up issue where the above discussion can continue, and merge this PR as is.

WDYT? @lmolkova @JonasKunz @dashpole @lzchen

@lmolkova (Contributor) commented Feb 13, 2025

> So IIUC you are referring to a part of what I would call "pipeline latency": the total time a span takes from being ended to being successfully exported. The metric you are envisioning would be the portion of this latency spent in the exporter, ignoring e.g. the batching span processor delay.

Pipeline latency is cool, but the moment you have it, you also need a way to break it down into pieces (exporter part, processor queue).

> At the same time, I can't really see the general importance / usefulness of having the exporter durations: it feels more like a nice-to-have. What conclusions does this metric allow you to draw? Do you have concrete examples?

Debugging connectivity with my backend - network issues, throttling, slow backend response, retries, retry backoff interval optimizations.
As @lzchen mentioned, the absolute majority of OTel SDKs suppress all instrumentation around exporter calls, so you wouldn't have protocol-level spans and/or metrics. A metric like this would therefore be the only way to know how good your connectivity is.

Counts are good, but they won't tell you that your P99 is 10 seconds after all retries because your backoff interval is wrong. You'd just see fewer of them and have no idea why.

I don't think it's just a nice-to-have.
Given it's experimental, we can go ahead with just counts and potentially deprecate them when/if histograms are added.

As a cost-mitigation strategy, we can use a small number of buckets by default, and users can always reconfigure them if they need fewer/more.
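
For illustration, a minimal sketch of that mitigation, assuming the Java SDK's view API and the hypothetical otel.sdk.exporter.span.exported.duration histogram discussed above (the bucket boundaries are arbitrary example values):

```java
import io.opentelemetry.sdk.metrics.Aggregation;
import io.opentelemetry.sdk.metrics.InstrumentSelector;
import io.opentelemetry.sdk.metrics.SdkMeterProvider;
import io.opentelemetry.sdk.metrics.View;
import java.util.Arrays;

public class SmallBucketDefaults {
  public static void main(String[] args) {
    SdkMeterProvider meterProvider =
        SdkMeterProvider.builder()
            // Select the hypothetical exporter-duration histogram discussed in this thread.
            .registerView(
                InstrumentSelector.builder()
                    .setName("otel.sdk.exporter.span.exported.duration")
                    .build(),
                // Keep only a handful of explicit buckets by default to limit storage cost;
                // users can replace this list with finer-grained boundaries if needed.
                View.builder()
                    .setAggregation(
                        Aggregation.explicitBucketHistogram(Arrays.asList(0.005, 0.05, 0.5, 5.0)))
                    .build())
            .build();
    // The resulting meterProvider would then be passed when building the OpenTelemetrySdk.
  }
}
```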

@AlexanderWert merged commit 87bd2c1 into open-telemetry:main on Feb 13, 2025 (15 checks passed)
@JonasKunz (Contributor, Author) commented

Issue for follow-up discussions around adding duration:
#1906

Labels: area:otel, enhancement