
Broker metrics endpoint returns 404 when any data point is NaN/Infinity #1820

@titaniper

Description

Describe the bug

GET /api/clusters/{clusterName}/brokers/{id}/metrics returns HTTP 404 whenever the broker's Prometheus exposition contains even a single NaN (or Infinity) value, instead of returning the metrics that are available.

The frontend then shows the broker's Metrics tab as empty / errored.

This is the same outward symptom as #1630 — but with a deterministic, easily reproducible root cause that #1630 did not pin down. (#1630 also reports the silent variant where the response body is empty with HTTP 200; the underlying mapping path is the same.)

Root cause

io.kafbat.ui.mapper.ClusterMapper#convert(Stream<MetricSnapshot>) does:

.value(BigDecimal.valueOf(readPointValue(p)))

BigDecimal.valueOf(double) throws NumberFormatException whenever the argument is Double.NaN, Double.POSITIVE_INFINITY, or Double.NEGATIVE_INFINITY, since none of these has a BigDecimal representation.
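A minimal stand-alone demo of that behavior (plain JDK, no project code involved):

```java
import java.math.BigDecimal;

public class NanToBigDecimalDemo {
    public static void main(String[] args) {
        double[] badPoints = {Double.NaN, Double.POSITIVE_INFINITY, Double.NEGATIVE_INFINITY};
        for (double d : badPoints) {
            try {
                BigDecimal.valueOf(d); // throws for all three non-finite values
                System.out.println(d + " -> ok");
            } catch (NumberFormatException e) {
                System.out.println(d + " -> NumberFormatException");
            }
        }
    }
}
```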

BrokersController#getBrokersMetrics catches all errors with:

.onErrorReturn(ResponseEntity.notFound().build())

So a single NaN data point anywhere in the broker's metric stream collapses the whole response to 404. Nothing is logged.
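The swallowing behavior can be illustrated without Reactor: onErrorReturn acts much like CompletableFuture.exceptionally in the stdlib, replacing any upstream exception with the fallback value and discarding the cause. This is only an analogy to the controller's pipeline, not the actual kafbat-ui code:

```java
import java.util.concurrent.CompletableFuture;

public class SwallowDemo {
    public static void main(String[] args) {
        // Analogy for .onErrorReturn(ResponseEntity.notFound().build()):
        // whatever exception the mapper throws, the caller only ever sees the fallback.
        String result = CompletableFuture
                .<String>supplyAsync(() -> {
                    throw new NumberFormatException("from the metrics mapper");
                })
                .exceptionally(e -> "404 Not Found") // fallback; cause is never logged
                .join();
        System.out.println(result);
    }
}
```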

How NaN ends up in the exposition

JMX-Prometheus exporter (Strimzi's metricsConfig.type: jmxPrometheusExporter, MSK Open Monitoring, anything using io.prometheus.jmx) legitimately emits NaN for *_avg / *_max Kafka sensors when the underlying meter has never been hit. Concretely, on a fresh Strimzi broker we see ~24 NaN data points like:

kafka_server_socket_server_metrics_reauthentication_latency_avg{listener="PLAIN-9092",networkProcessor="3"} NaN
kafka_server_socket_server_metrics_reauthentication_latency_max{listener="REPLICATION-9091",networkProcessor="0"} NaN
kafka_server_socket_server_metrics_request_size_avg{listener="TLS-9093",networkProcessor="6"} NaN
...

Operators have no realistic way to make these non-NaN — the broker has simply never observed a reauthentication or a request on that listener. Today, that means the broker's Metrics tab never works.
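Whatever the exact internals of Kafka's sensor implementation, the arithmetic behind an untouched *_avg sensor reduces to an average over zero observations, which IEEE 754 defines as NaN:

```java
public class EmptyAvgDemo {
    public static void main(String[] args) {
        double sum = 0.0;
        long count = 0;              // the meter has never been hit
        double avg = sum / count;    // 0.0 / 0 -> NaN
        System.out.println("isNaN = " + Double.isNaN(avg));
    }
}
```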

Steps to reproduce

  1. Run a Kafka cluster whose Prometheus endpoint produces at least one NaN data point. The simplest reproduction is Strimzi with jmxPrometheusExporter and the default JMX ruleset, but any cluster with unused listeners/sensors works.
  2. Configure kafbat-ui:
    KAFKA_CLUSTERS_0_METRICS_TYPE: PROMETHEUS
    KAFKA_CLUSTERS_0_METRICS_PORT: 9404   # 11001 for MSK Open Monitoring
  3. Verify the endpoint serves data:
    wget -qO- http://broker-0:9404/metrics | grep -c ' NaN$'
    # > 0
    
  4. Open the broker detail → Metrics tab, or curl /api/clusters/<name>/brokers/0/metrics.

Expected behavior

Finite data points are returned; NaN/Infinity points are dropped (or otherwise represented in a JSON-safe way). A single bad point should not nuke the whole response.

Actual behavior

HTTP 404 Not Found (or empty body in some Spring/Reactor configurations — see #1630). No log line.

Environment

Fix

PR will follow this issue. The minimal fix is to filter non-finite data points in ClusterMapper#convert before they reach BigDecimal.valueOf. A separate concern — the silent onErrorReturn(notFound()) in BrokersController swallowing all errors — is intentionally left out of scope for this issue/PR.
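As a sketch of the intended filter (toMetricValues and the Double stream are hypothetical stand-ins for the real data-point type in ClusterMapper#convert, not the project's actual signatures):

```java
import java.math.BigDecimal;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class FilterFiniteSketch {
    // Drop non-finite points before BigDecimal.valueOf ever sees them.
    static List<BigDecimal> toMetricValues(Stream<Double> points) {
        return points
                .filter(p -> Double.isFinite(p)) // rejects NaN and +/-Infinity
                .map(BigDecimal::valueOf)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<BigDecimal> out = toMetricValues(
                Stream.of(1.5, Double.NaN, 3.0, Double.POSITIVE_INFINITY));
        System.out.println(out);
    }
}
```

With this in place, a stream containing NaN still yields the finite points instead of aborting the whole response.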
