Intermittent JVM SIGBUS (BUS_ADRERR) from DogStatsD-over-UDS: java-dogstatsd-client → jnr-unixsocket → jffi stub truncation (jnr/jffi#194) #11574

@Hexcles

Tracer Version(s)

1.62.0

Java Version(s)

21.0.6 (Azul Zulu 21.40+17-CA)

JVM Vendor

Azul Systems (Zulu OpenJDK)

Bug Report

Short-lived JVMs configured to reach the Agent over a UDS intermittently die with a JVM-level SIGBUS (si_code 2 BUS_ADRERR) — a native crash, not an application error — while the tracer's metrics subsystem brings up its DogStatsD connection.

The root cause is upstream in jnr/jffi (filed as jnr/jffi#194): when DogStatsD metrics are sent over a Unix socket, the bundled com.datadoghq:java-dogstatsd-client constructs a jnr.unixsocket.UnixSocketAddress, which initializes jnr-ffi → jffi. jffi's StubLoader.unpackLibrary extracts its native stub with an InputStream.available()-guarded copy loop that can silently truncate the .so (it performs no length/digest check on a fresh extraction). System.load() of the short stub then faults in ld.so past EOF → SIGBUS/BUS_ADRERR.
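To illustrate the failure mode described above, here is a minimal sketch of why an available()-guarded copy loop can stop early. It is not jffi's source: the class StubCopySketch, the StingyStream helper, and both method names are hypothetical. It relies only on the documented InputStream.available() contract, which returns an estimate of the bytes readable without blocking and may be 0 before end of stream.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

public class StubCopySketch {

    /**
     * Hypothetical stream that reports available() == 0 after the first
     * chunk has been read, even though more data remains. A buffered or
     * decompressing stream may legitimately behave this way before EOF.
     */
    public static final class StingyStream extends ByteArrayInputStream {
        private boolean firstChunkServed = false;

        public StingyStream(byte[] data) {
            super(data);
        }

        @Override
        public synchronized int available() {
            return firstChunkServed ? 0 : super.available();
        }

        @Override
        public synchronized int read(byte[] b, int off, int len) {
            firstChunkServed = true;
            // Serve at most 4 KiB per call to mimic page-sized chunks.
            return super.read(b, off, Math.min(len, 4096));
        }
    }

    /** Fragile pattern: stops as soon as available() reports 0. */
    public static long copyGuardedByAvailable(InputStream in, OutputStream out) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        while (in.available() > 0) {
            int n = in.read(buf);
            if (n < 0) {
                break;
            }
            out.write(buf, 0, n);
            total += n;
        }
        return total;
    }

    /** Robust pattern: read until read() returns -1, then verify the length. */
    public static long copyUntilEof(InputStream in, OutputStream out, long expectedLength) throws IOException {
        byte[] buf = new byte[8192];
        long total = 0;
        int n;
        while ((n = in.read(buf)) != -1) {
            out.write(buf, 0, n);
            total += n;
        }
        if (expectedLength >= 0 && total != expectedLength) {
            throw new IOException("short copy: " + total + " of " + expectedLength + " bytes");
        }
        return total;
    }

    public static void main(String[] args) throws IOException {
        byte[] stub = new byte[10_000]; // stand-in for the packaged native stub
        long fragile = copyGuardedByAvailable(new StingyStream(stub), new ByteArrayOutputStream());
        long robust = copyUntilEof(new StingyStream(stub), new ByteArrayOutputStream(), stub.length);
        System.out.println("available()-guarded copy wrote " + fragile + " bytes");
        System.out.println("read-until-EOF copy wrote " + robust + " bytes");
    }
}
```

The length check in copyUntilEof is the kind of post-extraction validation the report says is missing; a digest comparison would be a stricter variant.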

hs_err excerpt:

# SIGBUS (0x7) ...
# Problematic frame:
# C  [ld-linux-x86-64.so.2+0x26b4a]      (dlopen)
siginfo: si_signo: 7 (SIGBUS), si_code: 2 (BUS_ADRERR)
Current thread: JavaThread "dd-task-scheduler"
  C   [ld-linux-x86-64.so.2 ...]                              (dlopen)
  V   [libjvm.so ...] JVM_LoadLibrary
  j   com.kenai.jffi.internal.StubLoader.loadFromJar / <clinit>
  j   jnr.ffi.Runtime.getSystemRuntime
  j   jnr.unixsocket.UnixSocketAddress.<init>
  j   com.timgroup.statsd.NonBlockingStatsDClientBuilder.build
  j   datadog.metrics.impl.statsd.DDAgentStatsDConnection.doConnect
  j   datadog.trace.util.AgentTaskScheduler$PeriodicTask.run

The extracted …/jffi<rand>.so was truncated at a 4 KiB boundary; the dynamic linker's relocation write into the missing final page hit EOF → BUS_ADRERR.

Scope / notes:

  • Only the DogStatsD path is affected. The Agent's own trace/EVP transport over UDS uses the JDK-native socket (dd.jdk.socket.enabled, default true) and does not load jffi — consistent with jffi initializing ~20s in, inside the DogStatsD connect task, never at startup. The bundled java-dogstatsd-client is the sole remaining jnr/jffi consumer.
  • Tracer metrics are on by default, so this can fire in any UDS-configured JVM that lives long enough to run the periodic StatsD connect.
  • It is a timing race in jffi's copy loop — rare per-extraction, but frequent across a high-volume / CPU-saturated CI fleet (many thousands of short-lived JVMs). Long-lived production processes (one extraction at controlled startup) effectively never hit it.
  • The crash kills the JVM before data is flushed, so it is invisible in APM / CI Visibility and only recoverable from captured hs_err files.

Expected Behavior

Configuring the Agent connection as a UDS (and the tracer emitting its own metrics) must not be able to crash the host JVM. DogStatsD over UDS should not pull in a native FFI stub whose extraction can fail unsafely.

Reproduction Code

No deterministic repro — it is a timing race in jffi's stub extraction (see jnr/jffi#194). It reproduces statistically under load with:

  • dd-java-agent 1.62.0 attached, JDK 21
  • DD_TRACE_AGENT_URL=unix:///var/run/datadog/apm.socket (so DogStatsD also resolves to a UDS)
  • default tracer metrics (health metrics on)
  • many short-lived JVMs on CPU-saturated hosts

Diagnosed from captured -XX:ErrorFile hs_err logs showing the stack above and a truncated jffi*.so.
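For anyone triaging similar crashes, a rough way to spot a truncated extracted stub is to compare the file size with what the ELF header implies. The sketch below is a heuristic under stated assumptions, not a tool used in this diagnosis: the class and method names are hypothetical, it handles ELF64 only, and it assumes the file has a section header table (linker output normally places it at the end of the file, while stripped files may report no sections and will then never be flagged).

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.StandardOpenOption;

public class ElfTruncationCheck {

    /**
     * Returns the smallest file size consistent with the ELF64 header:
     * the end of the section header table (e_shoff + e_shentsize * e_shnum).
     */
    public static long minimumExpectedSize(ByteBuffer header) {
        if (header.remaining() < 64
                || header.get(0) != 0x7f || header.get(1) != 'E'
                || header.get(2) != 'L' || header.get(3) != 'F') {
            throw new IllegalArgumentException("not an ELF header");
        }
        if (header.get(4) != 2) { // EI_CLASS: 2 means ELFCLASS64
            throw new IllegalArgumentException("only ELF64 is handled in this sketch");
        }
        // EI_DATA: 1 means little-endian, 2 means big-endian.
        header.order(header.get(5) == 2 ? ByteOrder.BIG_ENDIAN : ByteOrder.LITTLE_ENDIAN);
        long shoff = header.getLong(0x28);             // e_shoff
        int shentsize = header.getShort(0x3A) & 0xFFFF; // e_shentsize
        int shnum = header.getShort(0x3C) & 0xFFFF;     // e_shnum
        return shoff + (long) shentsize * shnum;
    }

    /** True if the file is shorter than its own ELF header says it must be. */
    public static boolean looksTruncated(Path file) throws IOException {
        ByteBuffer header = ByteBuffer.allocate(64);
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            while (header.hasRemaining() && ch.read(header) != -1) {
                // keep reading until the 64-byte header is filled or EOF is hit
            }
            if (header.hasRemaining()) {
                return true; // shorter than an ELF64 header
            }
            header.flip();
            return ch.size() < minimumExpectedSize(header);
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java ElfTruncationCheck <path-to-extracted-stub> ...
        for (String arg : args) {
            Path p = Paths.get(arg);
            System.out.println(p + ": " + (looksTruncated(p) ? "TRUNCATED" : "size is consistent"));
        }
    }
}
```

A file cut at a 4 KiB boundary before the section header table would be reported as truncated by this check.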

Suggested fixes

  1. Upstream: the actual defect is in jffi's StubLoader.unpackLibrary, which uses InputStream.available() as its copy-loop guard and silently truncates the extracted stub (→ SIGBUS BUS_ADRERR / "failed to map segment"); see jnr/jffi#194. Pick up the fix / bump jffi once available.
  2. Decouple DogStatsD-over-UDS from jnr/jffi: have the bundled java-dogstatsd-client use the JDK-native java.net.UnixDomainSocketAddress (JDK 16+) for UDS, as the Agent transport already does via dd.jdk.socket.enabled. This removes jffi from the metrics path entirely (cf. java-dogstatsd-client#68, "Dependency on JFFI when sending metrics to Unix socket", and #85).
  3. Or pin / pre-extract the jffi stub (-Djffi.boot.library.path=…) in the agent so the buggy unpackLibrary copy is never exercised.
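As a rough illustration of option 2, the sketch below writes one metric line over a Unix domain socket using only JDK classes, so no jnr/jffi native stub is loaded. It is not java-dogstatsd-client code: the class and method names are hypothetical. One caveat worth flagging is that the JDK's Unix-domain support (JDK 16+) covers SocketChannel and ServerSocketChannel only, that is, stream sockets and not datagram sockets. The sketch therefore assumes a stream-mode listener at the socket path, and the newline framing it uses is also an assumption rather than a documented wire format.

```java
import java.io.IOException;
import java.net.StandardProtocolFamily;
import java.net.UnixDomainSocketAddress;
import java.nio.ByteBuffer;
import java.nio.channels.SocketChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.Paths;

public class JdkUdsStatsdSketch {

    /**
     * Connects to a stream-mode Unix domain socket and writes one metric
     * line using JDK-native classes only (requires JDK 16 or newer).
     * The trailing newline is assumed framing, not a confirmed protocol detail.
     */
    public static void sendMetric(Path socketPath, String metricLine) throws IOException {
        UnixDomainSocketAddress address = UnixDomainSocketAddress.of(socketPath);
        try (SocketChannel channel = SocketChannel.open(StandardProtocolFamily.UNIX)) {
            channel.connect(address);
            ByteBuffer payload = ByteBuffer.wrap((metricLine + "\n").getBytes(StandardCharsets.UTF_8));
            while (payload.hasRemaining()) {
                channel.write(payload);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Usage: java JdkUdsStatsdSketch <socket-path>
        // The metric name below is an example value, not taken from the report.
        sendMetric(Paths.get(args[0]), "example.metric:1|c");
    }
}
```

Because the JDK path does not offer Unix datagram sockets, a drop-in replacement for a datagram-based DogStatsD socket would need either a stream listener or a different mechanism; that trade-off is part of what option 2 would have to resolve.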

Related: jnr/jffi#194, jnr/jffi#46, jnr/jffi#158, DataDog/java-dogstatsd-client#68 / #85 / #258, #7643, #7165.
