Commit 0ee3eb1

reyang, cijothomas, and vishweshbankwar authored
Improve metrics docs - part 2 (#5284)
Co-authored-by: Cijo Thomas <[email protected]> Co-authored-by: Vishwesh Bankwar <[email protected]>
1 parent d2f9d05 commit 0ee3eb1

3 files changed: +265 -23 lines changed

docs/metrics/README.md

+263-21
@@ -55,6 +55,14 @@ readonly fields (e.g. [Program.cs](./getting-started-console/Program.cs)) or
5555
singleton via dependency injection (e.g.
5656
[Instrumentation.cs](../../examples/AspNetCore/Instrumentation.cs)).
5757

58+
:stop_sign: You should avoid invalid instrument names.
59+
60+
> [!NOTE]
61+
> OpenTelemetry will not collect metrics from instruments that are using invalid
62+
names. Refer to the [OpenTelemetry
63+
Specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/api.md#instrument-name-syntax)
64+
for the valid syntax.
65+
5866
:stop_sign: You should avoid changing the order of tags while reporting
5967
measurements.
6068

@@ -118,6 +126,261 @@ the hot-path. You SHOULD try to keep the number of tags less than or equal to 8.
118126
If you are exceeding this, check if you can model some of the tags as Resource,
119127
as [shown here](#modeling-static-tags-as-resource).
120128

129+
## MeterProvider Management
130+
131+
:stop_sign: You should avoid creating `MeterProvider` instances too frequently.
132+
`MeterProvider` is fairly expensive and meant to be reused throughout the
133+
application. For most applications, one `MeterProvider` instance per process
134+
would be sufficient.
135+
136+
```mermaid
137+
graph LR
138+
139+
subgraph Meter A
140+
InstrumentX
141+
end
142+
143+
subgraph Meter B
144+
InstrumentY
145+
InstrumentZ
146+
end
147+
148+
subgraph Meter Provider 2
149+
MetricReader2
150+
MetricExporter2
151+
MetricReader3
152+
MetricExporter3
153+
end
154+
155+
subgraph Meter Provider 1
156+
MetricReader1
157+
MetricExporter1
158+
end
159+
160+
InstrumentX --> | Measurements | MetricReader1
161+
InstrumentY --> | Measurements | MetricReader1 --> MetricExporter1
162+
InstrumentZ --> | Measurements | MetricReader2 --> MetricExporter2
163+
InstrumentZ --> | Measurements | MetricReader3 --> MetricExporter3
164+
```
165+
166+
:heavy_check_mark: You should properly manage the lifecycle of `MeterProvider`
167+
instances if they are created by you.
168+
169+
Here is the rule of thumb when managing the lifecycle of `MeterProvider`:
170+
171+
* If you are building an application with [dependency injection
172+
(DI)](https://learn.microsoft.com/dotnet/core/extensions/dependency-injection)
173+
(e.g. [ASP.NET Core](https://learn.microsoft.com/aspnet/core) and [.NET
174+
Worker](https://learn.microsoft.com/dotnet/core/extensions/workers)), in most
175+
cases you should create the `MeterProvider` instance and let DI manage its
176+
lifecycle. Refer to the [Getting Started with OpenTelemetry .NET Metrics in 5
177+
Minutes - ASP.NET Core Application](./getting-started-aspnetcore/README.md)
178+
tutorial to learn more.
179+
* If you are building an application without DI, create a `MeterProvider`
180+
instance and manage the lifecycle explicitly. Refer to the [Getting Started
181+
with OpenTelemetry .NET Metrics in 5 Minutes - Console
182+
Application](./getting-started-console/README.md) tutorial to learn more.
183+
* If you forget to dispose the `MeterProvider` instance before the application
184+
ends, metrics might get dropped due to the lack of proper flush.
185+
* If you dispose the `MeterProvider` instance too early, any subsequent
186+
measurements will not be collected.
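
The non-DI case from the list above can be sketched as follows. This is a minimal sketch, assuming the `OpenTelemetry` and `OpenTelemetry.Exporter.Console` NuGet packages are referenced; the meter name and the `MyFruitCounter` instrument are placeholders, not names required by the SDK.

```csharp
using System.Diagnostics.Metrics;
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Placeholder names; replace them with your own meter and instrument.
var meter = new Meter("MyCompany.MyProduct.MyLibrary");
var fruitCounter = meter.CreateCounter<long>("MyFruitCounter");

// Build one MeterProvider at startup and reuse it for the whole process.
var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyProduct.MyLibrary")
    .AddConsoleExporter()
    .Build();

try
{
    // Measurements reported while the provider is alive are collected.
    fruitCounter.Add(1, new("name", "apple"), new("color", "red"));
}
finally
{
    // Dispose at shutdown so that pending metrics get flushed.
    meterProvider.Dispose();
}

// Reported after Dispose: this measurement will not be collected.
fruitCounter.Add(1, new("name", "lemon"), new("color", "yellow"));
```

In a DI-based application the explicit `Dispose` call would normally be unnecessary, because the container owns the provider's lifetime.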
187+
188+
## Memory Management
189+
190+
In OpenTelemetry,
191+
[measurements](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/api.md#measurement)
192+
are reported via the metrics API. The SDK
193+
[aggregates](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#aggregation)
194+
metrics using specific algorithms and memory management strategies to achieve good
195+
performance and efficiency. Here are the rules which OpenTelemetry .NET follows
196+
while implementing the metrics aggregation logic:
197+
198+
1. [**Pre-Aggregation**](#pre-aggregation): aggregation occurs within the SDK.
199+
2. [**Cardinality Limits**](#cardinality-limits): the aggregation logic respects
200+
[cardinality
201+
limits](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#cardinality-limits),
202+
so the SDK does not use an unbounded amount of memory when there is a cardinality
203+
explosion.
204+
3. [**Memory Preallocation**](#memory-preallocation): the memory used by
205+
aggregation logic is allocated during the SDK initialization, so the SDK does
206+
not have to allocate memory on-the-fly. This is to avoid garbage collection
207+
being triggered on the hot code path.
208+
209+
### Example
210+
211+
Let's take the following example:
212+
213+
* During the time range (T0, T1]:
214+
* value = 1, name = `apple`, color = `red`
215+
* value = 2, name = `lemon`, color = `yellow`
216+
* During the time range (T1, T2]:
217+
* no fruit has been received
218+
* During the time range (T2, T3]:
219+
* value = 5, name = `apple`, color = `red`
220+
* value = 2, name = `apple`, color = `green`
221+
* value = 4, name = `lemon`, color = `yellow`
222+
* value = 2, name = `lemon`, color = `yellow`
223+
* value = 1, name = `lemon`, color = `yellow`
224+
* value = 3, name = `lemon`, color = `yellow`
225+
226+
If we aggregate and export the metrics using [Cumulative Aggregation
227+
Temporality](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/data-model.md#temporality):
228+
229+
* (T0, T1]
230+
* attributes: {name = `apple`, color = `red`}, count: `1`
231+
* attributes: {name = `lemon`, color = `yellow`}, count: `2`
232+
* (T0, T2]
233+
* attributes: {name = `apple`, color = `red`}, count: `1`
234+
* attributes: {name = `lemon`, color = `yellow`}, count: `2`
235+
* (T0, T3]
236+
* attributes: {name = `apple`, color = `red`}, count: `6`
237+
* attributes: {name = `apple`, color = `green`}, count: `2`
238+
* attributes: {name = `lemon`, color = `yellow`}, count: `12`
239+
240+
If we aggregate and export the metrics using [Delta Aggregation
241+
Temporality](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/data-model.md#temporality):
242+
243+
* (T0, T1]
244+
* attributes: {name = `apple`, color = `red`}, count: `1`
245+
* attributes: {name = `lemon`, color = `yellow`}, count: `2`
246+
* (T1, T2]
247+
* nothing since we don't have any measurement received
248+
* (T2, T3]
249+
* attributes: {name = `apple`, color = `red`}, count: `5`
250+
* attributes: {name = `apple`, color = `green`}, count: `2`
251+
* attributes: {name = `lemon`, color = `yellow`}, count: `10`
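
The two temporalities can be reproduced with a few lines of plain C#. This is only an illustrative sketch of the arithmetic in the fruit example, using ordinary dictionaries; it is not how the SDK stores or exports metric points.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The measurements from the fruit example, grouped by collection interval.
var intervals = new (string, string, long)[][]
{
    new[] { ("apple", "red", 1L), ("lemon", "yellow", 2L) },   // (T0, T1]
    Array.Empty<(string, string, long)>(),                      // (T1, T2]
    new[]                                                       // (T2, T3]
    {
        ("apple", "red", 5L), ("apple", "green", 2L),
        ("lemon", "yellow", 4L), ("lemon", "yellow", 2L),
        ("lemon", "yellow", 1L), ("lemon", "yellow", 3L),
    },
};

// Cumulative keeps a running total since T0; delta starts fresh per interval.
var cumulative = new Dictionary<(string, string), long>();
var deltas = new List<Dictionary<(string, string), long>>();

foreach (var interval in intervals)
{
    var delta = new Dictionary<(string, string), long>();
    foreach (var (name, color, value) in interval)
    {
        var key = (name, color);
        delta[key] = delta.GetValueOrDefault(key) + value;
        cumulative[key] = cumulative.GetValueOrDefault(key) + value;
    }

    deltas.Add(delta);
    Console.WriteLine($"delta: {Format(delta)} | cumulative: {Format(cumulative)}");
}

// Renders the points of one export as "name/color=count" pairs.
static string Format(Dictionary<(string, string), long> points) =>
    points.Count == 0
        ? "(nothing)"
        : string.Join(", ", points.Select(p => $"{p.Key.Item1}/{p.Key.Item2}={p.Value}"));
```

Note that the cumulative view still reports the earlier totals for the empty interval, while the delta view reports nothing for it.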
252+
253+
### Pre-Aggregation
254+
255+
Taking the [fruit example](#example), there are 6 measurements reported during
256+
`(T2, T3]`. Instead of exporting every individual measurement event, the SDK
257+
aggregates them and only exports the summarized results. This approach, as
258+
illustrated in the following diagram, is called pre-aggregation:
259+
260+
```mermaid
261+
graph LR
262+
263+
subgraph SDK
264+
Instrument --> | Measurements | Pre-Aggregation[Pre-Aggregation]
265+
end
266+
267+
subgraph Collector
268+
Aggregation
269+
end
270+
271+
Pre-Aggregation --> | Metrics | Aggregation
272+
```
273+
274+
Pre-aggregation brings several benefits:
275+
276+
1. Although the amount of calculation remains the same, the amount of data
277+
transmitted can be significantly reduced using pre-aggregation, thus
278+
improving the overall efficiency.
279+
2. Pre-aggregation makes it possible to apply [cardinality
280+
limits](#cardinality-limits) during SDK initialization. Combined with [memory
281+
preallocation](#memory-preallocation), they make the metrics data collection
282+
behavior more predictable (e.g. a server under denial-of-service attack would
283+
still produce a constant volume of metrics data, rather than flooding the
284+
observability system with a large volume of measurement events).
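
The data reduction can be illustrated with the `MeterListener` type from `System.Diagnostics.Metrics`. This is a conceptual sketch, not the SDK's aggregation code: the meter name `Example.Fruit`, the instrument name `fruit.count`, and the string key format are assumptions made for the example, and the dictionary is not thread-safe.

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Metrics;

// Hypothetical meter and instrument names, used only for this illustration.
using var meter = new Meter("Example.Fruit");
var counter = meter.CreateCounter<long>("fruit.count");

// Running sums keyed by the attribute combination.
var sums = new Dictionary<string, long>();
var measurementCount = 0;

using var listener = new MeterListener();
listener.InstrumentPublished = (instrument, l) =>
{
    // Subscribe only to instruments from the meter above.
    if (instrument.Meter.Name == "Example.Fruit")
    {
        l.EnableMeasurementEvents(instrument);
    }
};
listener.SetMeasurementEventCallback<long>((instrument, value, tags, state) =>
{
    measurementCount++;

    // Build a key from the tags in the order they were reported.
    var parts = new List<string>();
    foreach (var tag in tags)
    {
        parts.Add($"{tag.Key}={tag.Value}");
    }

    var key = string.Join(",", parts);
    sums[key] = sums.GetValueOrDefault(key) + value;
});
listener.Start();

// The six measurements reported during (T2, T3].
counter.Add(5, new("name", "apple"), new("color", "red"));
counter.Add(2, new("name", "apple"), new("color", "green"));
counter.Add(4, new("name", "lemon"), new("color", "yellow"));
counter.Add(2, new("name", "lemon"), new("color", "yellow"));
counter.Add(1, new("name", "lemon"), new("color", "yellow"));
counter.Add(3, new("name", "lemon"), new("color", "yellow"));

// Only the aggregated points would need to be exported.
Console.WriteLine($"{measurementCount} measurements -> {sums.Count} aggregated points");
foreach (var point in sums)
{
    Console.WriteLine($"{point.Key}: {point.Value}");
}
```

Because the key is built from the tags in reporting order, this sketch would treat the same tags in a different order as a different combination, which is one reason to keep tag order stable.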
285+
286+
There are cases where users might want to export raw measurement events instead
287+
of using pre-aggregation, as illustrated in the following diagram. OpenTelemetry
288+
does not support this scenario at the moment. If you are interested, please join
289+
the discussion by replying to this [feature
290+
ask](https://github.com/open-telemetry/opentelemetry-specification/issues/617).
291+
292+
```mermaid
293+
graph LR
294+
295+
subgraph SDK
296+
Instrument
297+
end
298+
299+
subgraph Collector
300+
Aggregation
301+
end
302+
303+
Instrument --> | Measurements | Aggregation
304+
```
305+
306+
### Cardinality Limits
307+
308+
The number of unique combinations of attributes is called cardinality. Taking
309+
the [fruit example](#example), if we know that we can only have apple/lemon as
310+
the name, red/yellow/green as the color, then we can say the cardinality is 6.
311+
No matter how many apples and lemons we have, we can always use the following
312+
table to summarize the total number of fruits based on the name and color.
313+
314+
| Name | Color | Count |
315+
| ----- | ------ | ----- |
316+
| apple | red | ? |
317+
| apple | yellow | ? |
318+
| apple | green | ? |
319+
| lemon | red | ? |
320+
| lemon | yellow | ? |
321+
| lemon | green | ? |
322+
323+
In other words, we know how much storage and network are needed to collect and
324+
transmit these metrics, regardless of the traffic pattern.
325+
326+
In real-world applications, the cardinality can be very high. Imagine if we have
327+
a long-running service and we collect metrics with 7 attributes and each
328+
attribute can have 30 different values. We might eventually end up having to
329+
remember the complete set of all 21,870,000,000 combinations! This cardinality
330+
explosion is a well-known challenge in the metrics space. For example, it can
331+
cause surprisingly high cost in the observability system, or even be leveraged
332+
by hackers to launch a denial-of-service attack.
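
The two cardinality figures quoted above follow from multiplying the number of possible values per attribute. A short sketch of that arithmetic (the variable names are illustrative only):

```csharp
using System;

// Fruit example: 2 possible names x 3 possible colors.
string[] names = { "apple", "lemon" };
string[] colors = { "red", "yellow", "green" };
long fruitCardinality = (long)names.Length * colors.Length;

// 7 attributes with 30 possible values each: 30^7 combinations.
long combinations = 1;
for (var i = 0; i < 7; i++)
{
    combinations *= 30;
}

Console.WriteLine($"fruit cardinality: {fruitCardinality}");
Console.WriteLine($"7 attributes x 30 values: {combinations}");
```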
333+
334+
[Cardinality
335+
limit](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#cardinality-limits)
336+
is a throttling mechanism which allows the metrics collection system to have a
337+
predictable and reliable behavior when excessive cardinality happens, whether it
338+
was due to a malicious attack or a developer making mistakes while writing code.
339+
340+
OpenTelemetry has a default cardinality limit of `2000` per metric. This limit
341+
can be configured at `MeterProvider` level using
342+
`SetMaxMetricPointsPerMetricStream` method, or at individual
343+
[view](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#view)
344+
level. Refer to this
345+
[doc](../../docs/metrics/customizing-the-sdk/README.md#changing-maximum-metricpoints-per-metricstream)
346+
for more information.
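
A provider-level limit could be configured along these lines. This is a minimal sketch, assuming the `OpenTelemetry` and `OpenTelemetry.Exporter.Console` packages are referenced; the meter name and the value `4000` are arbitrary choices for illustration, not recommendations.

```csharp
using OpenTelemetry;
using OpenTelemetry.Metrics;

// Raise the per-metric cardinality limit for every metric stream
// produced by this provider (the default mentioned above is 2000).
var meterProvider = Sdk.CreateMeterProviderBuilder()
    .AddMeter("MyCompany.MyProduct.MyLibrary")
    .SetMaxMetricPointsPerMetricStream(4000)
    .AddConsoleExporter()
    .Build();

// ... application code that reports measurements ...

meterProvider.Dispose();
```

Since memory for metric points is tied to this limit, a larger value trades memory for headroom; check the linked customization doc for the exact semantics in your SDK version.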
347+
348+
> [!NOTE]
349+
> Setting cardinality limit per view is not yet implemented in OpenTelemetry
350+
.NET. You can track the progress by following this
351+
[issue](https://github.com/open-telemetry/opentelemetry-dotnet/issues/5296).
352+
353+
Given a metric, once the cardinality limit is reached, any new measurement which
354+
cannot be independently aggregated because of the limit will be aggregated using
355+
the [overflow
356+
attribute](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#overflow-attribute).
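
The overflow behavior can be pictured with a small model. This is a conceptual sketch only, not the SDK's implementation: it uses a tiny limit of 2, keeps the overflow sum outside the limit for simplicity, and the `lemon`/`red` measurement is made up for the example. In the specification, the overflow point is identified by the attribute `otel.metric.overflow` set to `true`.

```csharp
using System;
using System.Collections.Generic;

const int CardinalityLimit = 2; // tiny limit, for illustration only

// Points that are aggregated independently, keyed by attribute combination.
var points = new Dictionary<string, long>();

// Sum of everything that could not get its own point.
long overflowSum = 0;

void Record(string attributes, long value)
{
    if (points.ContainsKey(attributes) || points.Count < CardinalityLimit)
    {
        // Known combination, or there is still room for a new one.
        points[attributes] = points.GetValueOrDefault(attributes) + value;
    }
    else
    {
        // Limit reached: fold the measurement into the overflow point.
        overflowSum += value;
    }
}

Record("name=apple,color=red", 5);
Record("name=apple,color=green", 2);
Record("name=lemon,color=yellow", 4); // limit reached, goes to overflow
Record("name=lemon,color=red", 1);    // hypothetical extra combination
Record("name=apple,color=red", 1);    // existing point, still aggregated

foreach (var point in points)
{
    Console.WriteLine($"{point.Key}: {point.Value}");
}

Console.WriteLine($"otel.metric.overflow=true: {overflowSum}");
```

The total across all points is preserved, but the breakdown by attributes is lost for any combination that arrives after the limit is reached.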
357+
358+
> [!NOTE]
359+
> The overflow attribute was introduced in OpenTelemetry .NET
360+
[1.6.0-rc.1](../../src/OpenTelemetry/CHANGELOG.md#160-rc1). It is currently an
361+
experimental feature which can be turned on by setting the environment
362+
variable `OTEL_DOTNET_EXPERIMENTAL_METRICS_EMIT_OVERFLOW_ATTRIBUTE=true`. Once
363+
the [OpenTelemetry
364+
Specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#overflow-attribute)
365+
becomes stable, this feature will be turned on by default.
366+
367+
When [Delta Aggregation
368+
Temporality](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/data-model.md#temporality)
369+
is used, it is possible to choose a smaller cardinality limit by allowing the
370+
SDK to reclaim unused metric points.
371+
372+
> [!NOTE]
373+
> The reclaim unused metric points feature was introduced in OpenTelemetry .NET
374+
[1.7.0-alpha.1](../../src/OpenTelemetry/CHANGELOG.md#170-alpha1). It is
375+
currently an experimental feature which can be turned on by setting the
376+
environment variable
377+
`OTEL_DOTNET_EXPERIMENTAL_METRICS_RECLAIM_UNUSED_METRIC_POINTS=true`. Once the
378+
[OpenTelemetry
379+
Specification](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/sdk.md#overflow-attribute)
380+
becomes stable, this feature will be turned on by default.
381+
382+
### Memory Preallocation
383+
121384
### Modeling static tags as Resource
122385

123386
Tags such as `MachineName`, `Environment` etc. which are static throughout the
@@ -130,24 +393,3 @@ each metric measurement. Refer to this
130393
* The `Meter` used to create the instruments is not added to the
131394
`MeterProvider`. Use `AddMeter` method to enable the processing for the
132395
required metrics.
133-
* Instrument name is invalid. When naming instruments, ensure that the name you
134-
choose meets the criteria defined in the
135-
[spec](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/metrics/api.md#instrument-name-syntax).
136-
A few notable characters that are not allowed in the instrument name: `/`
137-
(forward slash), `\` (backward slash), any space character in the name.
138-
* MetricPoint limit is reached. By default, the SDK limits the number of maximum
139-
MetricPoints (unique combination of keys and values for a given Metric stream)
140-
to `2000`. This limit can be configured using
141-
`SetMaxMetricPointsPerMetricStream` method. Refer to this
142-
[doc](../../docs/metrics/customizing-the-sdk/README.md#changing-maximum-metricpoints-per-metricstream)
143-
for more information. The SDK would not process any newer unique key-value
144-
combination that it encounters, once this limit is reached.
145-
* MeterProvider is disposed. You need to ensure that the `MeterProvider`
146-
instance is kept active for metrics to be collected. In a typical application,
147-
a single MeterProvider is built at application startup, and is disposed of at
148-
application shutdown. For an ASP.NET Core application, use `AddOpenTelemetry`
149-
and `WithMetrics` methods from the `OpenTelemetry.Extensions.Hosting` package
150-
to correctly setup `MeterProvider`. Here's a [sample ASP.NET Core
151-
app](../../examples/AspNetCore/Program.cs) for reference. For simpler
152-
applications such as Console apps, refer to this
153-
[example](../../docs/metrics/getting-started-console/Program.cs).

docs/metrics/getting-started-console/README.md

+1-1
@@ -76,7 +76,7 @@ is configured to subscribe to instruments from the Meter
7676
pre-aggregated metrics are exported to a `ConsoleExporter`.
7777

7878
```csharp
79-
using var meterProvider = Sdk.CreateMeterProviderBuilder()
79+
var meterProvider = Sdk.CreateMeterProviderBuilder()
8080
.AddMeter("MyCompany.MyProduct.MyLibrary")
8181
.AddConsoleExporter()
8282
.Build();

docs/trace/getting-started-console/README.md

+1-1
@@ -85,7 +85,7 @@ is configured to subscribe to the activities from the source
8585
`ConsoleExporter` simply displays it on the console.
8686

8787
```csharp
88-
using var tracerProvider = Sdk.CreateTracerProviderBuilder()
88+
var tracerProvider = Sdk.CreateTracerProviderBuilder()
8989
.AddSource("MyCompany.MyProduct.MyLibrary")
9090
.AddConsoleExporter()
9191
.Build();
