OpenTelemetry Readme #134

Merged
merged 14 commits into from
Feb 10, 2025
88 changes: 88 additions & 0 deletions doc/opentelemetry/README.md
@@ -0,0 +1,88 @@
# Databroker Tracing with OpenTelemetry

OpenTelemetry is an observability framework and toolkit designed to create and manage telemetry data such as traces, metrics, and logs.

Enabling the `otel` build feature turns on OpenTelemetry Traces in the databroker binary. When enabled, the databroker sends trace information to an OTLP endpoint, so that call traces can be analyzed in frontend tools such as Jaeger or Zipkin.

_Note: OpenTelemetry Logs and Metrics are not available._

# Manual infrastructure setup

To collect trace information and analyze the data, some infrastructure services are needed. For development and debugging purposes, the Databroker, the OpenTelemetry Collector and the frontend UI (e.g. Jaeger) can be started locally. In a remote scenario, the Databroker and the OpenTelemetry Collector would run in the target environment (e.g. in a virtual device or in a high-performance vehicle computer), whereas the backend collectors, their storage service and the frontend UI components for analysis would be deployed on a cloud backend.

## Prometheus

_Note: Prometheus will only be needed once Metrics become available._

```
curl --proto '=https' --tlsv1.2 -fOL https://github.com/prometheus/prometheus/releases/download/v3.1.0/prometheus-3.1.0.linux-amd64.tar.gz
tar xvfz prometheus-*.tar.gz
cd prometheus-*
./prometheus
```

## Jaeger

Jaeger is a frontend user interface to visualize call traces.

```
curl --proto '=https' --tlsv1.2 -fOL https://github.com/jaegertracing/jaeger/releases/download/v1.65.0/jaeger-2.2.0-linux-amd64.tar.gz
tar xzf jaeger-2.2.0-linux-amd64.tar.gz
cd jaeger-2.2.0-linux-amd64
./jaeger --config=config-jaeger.yaml
```

## OpenTelemetry Collector

The collector is the OTLP endpoint to which the databroker sends its OpenTelemetry data.

```
cd doc/opentelemetry
curl --proto '=https' --tlsv1.2 -fOL https://github.com/open-telemetry/opentelemetry-collector-releases/releases/download/v0.118.0/otelcol_0.118.0_linux_amd64.tar.gz
tar -xvf otelcol_0.118.0_linux_amd64.tar.gz
./otelcol --config=config-otel-collector.yaml
```

## Kuksa Databroker

Enable the `otel` feature and start the databroker binary with an increased buffer size for OTEL messages, as the trace information from the databroker is extensive.

```
# in $workspace
cargo build --features=otel
OTEL_BSP_MAX_QUEUE_SIZE=8192 target/debug/databroker --vss data/vss-core/vss_release_4.0.json --enable-databroker-v1 --insecure
```

Open the Jaeger UI at http://localhost:16686
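
Before opening the UI, it can help to confirm that each service is actually listening. The following is a minimal sketch, not part of the Kuksa tooling: `port_open` and `check_services` are hypothetical helper names, the ports are taken from the config files in `doc/opentelemetry` and the Jaeger UI URL above, and the `/dev/tcp` redirection assumes the script runs under bash.

```shell
#!/usr/bin/env bash
# port_open HOST PORT: hypothetical helper, succeeds if a TCP connection can be opened.
port_open() {
  # Open the connection in a subshell; the descriptor is closed when the subshell exits.
  (exec 3<>"/dev/tcp/$1/$2") 2>/dev/null
}

# check_services: probe the ports assumed from the configs in this directory.
check_services() {
  local entry name port
  for entry in "otel-collector:4317" "jaeger-otlp:4417" "jaeger-ui:16686"; do
    name="${entry%%:*}"   # text before the colon
    port="${entry##*:}"   # text after the colon
    if port_open 127.0.0.1 "$port"; then
      echo "$name on port $port: listening"
    else
      echo "$name on port $port: not reachable"
    fi
  done
}

check_services
```

If any line reports "not reachable", start the corresponding service before generating traces.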

# Testing

To test the OpenTelemetry Trace feature, invoke Kuksa API operations.
The simplest way to do this is to use the databroker-cli: subscribe to a vehicle signal, list metadata, and publish/actuate new data.

## Use databroker-cli to invoke some methods

```
databroker-cli
```

# Troubleshooting

## Channel is full
Error Message:
```
OpenTelemetry trace error occurred. cannot send span to the batch span processor because the channel is full
```
Solution:
- Increase `OTEL_BSP_MAX_QUEUE_SIZE` to 8192 or more, depending on the situation. The default is 2048, which is not enough for the amount of data being recorded during tracing.


## Connection refused

Repeated messages when the OTLP server is down:
```
OpenTelemetry trace error occurred. Exporter otlp encountered the following error(s): the grpc server returns error (The service is currently unavailable): , detailed error message: error trying to connect: tcp connect error: Connection refused (os error 111)
```
Solution:
- (Re)start the OpenTelemetry Collector.
- Ensure the hostname and port number are properly configured. The default is `localhost:4317`, the port on which the collector's OTLP gRPC receiver listens. Set the environment variable `OTEL_ENDPOINT` to override the default.
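
To narrow down a refused connection, the configured endpoint can be split into host and port and probed directly. This is a minimal sketch under stated assumptions: `split_endpoint` is a hypothetical helper, the handling of an optional scheme prefix is an assumption (the exact format expected by `OTEL_ENDPOINT` is not specified here), and the `/dev/tcp` probe requires bash.

```shell
#!/usr/bin/env bash
# split_endpoint ENDPOINT: hypothetical helper that prints "HOST PORT" for values
# such as localhost:4317 or http://localhost:4317. Assumes a port is present.
split_endpoint() {
  local ep="${1#*://}"   # drop an optional scheme prefix (assumption)
  ep="${ep%%/*}"         # drop any trailing path
  echo "${ep%:*} ${ep##*:}"
}

# Fall back to the documented default when OTEL_ENDPOINT is unset.
endpoint="$(split_endpoint "${OTEL_ENDPOINT:-localhost:4317}")"
host="${endpoint% *}"
port="${endpoint#* }"

# Probe the port in a subshell so the descriptor is closed automatically.
if (exec 3<>"/dev/tcp/${host}/${port}") 2>/dev/null; then
  echo "collector reachable at ${host}:${port}"
else
  echo "nothing listening at ${host}:${port}: (re)start the collector or fix OTEL_ENDPOINT"
fi
```

A successful probe only shows that something accepts TCP connections on that port; it does not prove that the listener speaks OTLP.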
76 changes: 76 additions & 0 deletions doc/opentelemetry/config-jaeger.yaml
@@ -0,0 +1,76 @@
service:
  extensions: [jaeger_storage, jaeger_query, remote_sampling, healthcheckv2]
  pipelines:
    traces:
      receivers: [otlp, jaeger, zipkin]
      processors: [batch, adaptive_sampling]
      exporters: [jaeger_storage_exporter]
  telemetry:
    resource:
      service.name: jaeger
    metrics:
      level: detailed
      address: 0.0.0.0:8888
    logs:
      level: debug
    # TODO Initialize telemetry tracer once OTEL released new feature.
    # https://github.com/open-telemetry/opentelemetry-collector/issues/10663

extensions:
  healthcheckv2:
    use_v2: true
    http:

  # pprof:
  #   endpoint: 0.0.0.0:1777
  # zpages:
  #   endpoint: 0.0.0.0:55679

  jaeger_query:
    storage:
      traces: some_store
      traces_archive: another_store
    # The maximum duration that is considered for clock skew adjustments.
    # Defaults to 0 seconds, which means it's disabled.
    max_clock_skew_adjust: 0s

  jaeger_storage:
    backends:
      some_store:
        memory:
          max_traces: 100000
      another_store:
        memory:
          max_traces: 100000

  remote_sampling:
    # You can either use file or adaptive sampling strategy in remote_sampling
    # file:
    #   path: ./cmd/jaeger/sampling-strategies.json
    adaptive:
      sampling_store: some_store
      initial_sampling_probability: 0.1
    http:
    grpc:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 127.0.0.1:4417

  jaeger:
    protocols:
      grpc:

  zipkin:

processors:
  batch:
  # Adaptive Sampling Processor is required to support adaptive sampling.
  # It expects remote_sampling extension with `adaptive:` config to be enabled.
  adaptive_sampling:

exporters:
  jaeger_storage_exporter:
    trace_storage: some_store
31 changes: 31 additions & 0 deletions doc/opentelemetry/config-otel-collector.yaml
@@ -0,0 +1,31 @@
receivers:
  otlp:
    protocols:
      grpc:
        endpoint: 0.0.0.0:4317
exporters:
  debug:
    # 'basic' or 'detailed'
    verbosity: basic
  # Data sources: metrics
  prometheusremotewrite:
    endpoint: http://localhost:9090/api/v1/write
    tls:
      insecure: true
  # Actually jaeger
  otlp:
    endpoint: localhost:4417
    tls:
      insecure: true

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [debug,otlp]
    metrics:
      receivers: [otlp]
      exporters: [debug,prometheusremotewrite]
    logs:
      receivers: [otlp]
      exporters: [debug]
4 changes: 4 additions & 0 deletions doc/opentelemetry/prometheus.yml
@@ -0,0 +1,4 @@
scrape_configs:
  - job_name: "otel"
    static_configs:
      - targets: ['localhost:8888']