Skip to content

[server][da-vinci][router][fast-client] Introduce self-contained control-plane signal stats#2839

Open
m-nagarajan wants to merge 1 commit into
linkedin:mainfrom
m-nagarajan:mnagaraj/replaceTehutiWithBaseClasses
Open

[server][da-vinci][router][fast-client] Introduce self-contained control-plane signal stats#2839
m-nagarajan wants to merge 1 commit into
linkedin:mainfrom
m-nagarajan:mnagaraj/replaceTehutiWithBaseClasses

Conversation

@m-nagarajan

Copy link
Copy Markdown
Contributor

Problem Statement

Control-plane signals used by Venice's adaptive throttler, Helix group
routing, and server load controller all read their values from Tehuti
sensors via MetricsRepository.getMetric(...).value(). This creates a
hard dependency on Tehuti being enabled: if Tehuti is removed, these
signals silently return zero and throttling/routing/load-shedding
decisions break.

Solution

Introduces a use.self.contained.stats config in VeniceMetricsConfig
(default false) that routes the three control signals to
library-independent implementations when enabled. Existing Tehuti paths
are completely unchanged when the config is false.

New classes (venice-client-common/utils/concurrent):

  • LatencyPercentileProvider — lock-free ring-buffer reservoir (4096
    slots per LatencyType) for the adaptive throttler's p99 read-latency
    signal. Write: O(1) atomic slot write. Read: O(n log n) sort of 4096
    longs, called only every 30s by the signal refresh scheduler.
  • SlidingWindowAverage — dual-bucket sliding window for the per-group
    routing average latency. Write: O(1) LongAdder increment. Read:
    O(cells) LongAdder.sum() merge, cached behind a single volatile
    CachedAverage reference for 100ms.

LoadController independent path: replaces SlidingWindowCount
with two plain LongAdders using sumThenReset(). The effective window
becomes rejectionRatioUpdateIntervalInSec (the existing cached
measurement cadence), making windowSizeInSec irrelevant for this path.

Wiring: all three components read the flag through
VeniceMetricsConfig.useSelfContainedStats(metricsRepository) — one
static helper call, no per-component config keys.

Code changes

  • Added new code behind a config: use.self.contained.stats in
    VeniceMetricsConfig (default false). When false, zero behavior
    change anywhere.
  • Introduced new log lines.

Concurrency-Specific Checks

  • Code has no race conditions or thread safety issues.
    LatencyPercentileProvider writes are lock-free (AtomicLongArray);
    concurrent reads tolerate a few in-flight writes for an approximate
    signal. SlidingWindowAverage uses double-checked locking for bucket
    rotation and a single volatile reference for cache atomicity.
  • Proper synchronization mechanisms are used where needed.
  • No blocking calls inside critical sections.
  • Thread-safe collections used (VeniceConcurrentHashMap).
  • Proper exception handling in multi-threaded code.

How was this PR tested?

  • New unit tests added: LatencyPercentileProviderTest,
    SlidingWindowAverageTest, SlidingWindowCountTest.
  • Modified or extended existing tests: LoadControllerTest (both
    counter paths parameterized), HelixGroupStatsTest, AdaptiveThrottlerSignalServiceTest.
  • Parity tests validate that new implementations produce equivalent
    results to the Tehuti counterparts for the same input sequences.
  • Verified backward compatibility: all tests pass with default
    use.self.contained.stats=false.

Does this PR introduce any user-facing or breaking changes?

  • No. You can skip the rest of this section.

control-plane signal stats

Adds USE_SELF_CONTAINED_STATS config in VeniceMetricsConfig to replace
Tehuti-backed control signals with library-independent implementations,
keeping throttling, routing, and load-shedding decisions functional when
the Tehuti dependency is removed. Default false — all existing Tehuti
paths are unchanged.

New classes (venice-client-common/utils/concurrent):
- LatencyPercentileProvider: lock-free ring-buffer reservoir (4096 slots
  per LatencyType) for adaptive throttler p99 signal. Moved from
  da-vinci.kafka.consumer to be accessible to all components.
- SlidingWindowAverage: dual-bucket sliding window for per-group routing
  avg with single-volatile-ref cache for per-request read performance.

LoadController independent path: two LongAdders with sumThenReset()
replacing SlidingWindowCount; window = rejectionRatioUpdateIntervalInSec.

HelixGroupStats/Selector/RoutingStrategy: useSelfContainedStats flag
selects SlidingWindowAverage (true) or Tehuti Avg metric (false).

ServerHttpRequestStats: observeLatencyForSignal() feeds the provider when
enabled; no-op null-check guard when disabled (zero overhead).
Copilot AI review requested due to automatic review settings June 1, 2026 09:03

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds an opt-in (use.self.contained.stats, default false) path to source key control-plane signals (adaptive throttler read-latency p99, Helix group routing avg latency, and server load-controller request/accept counts) from Venice-owned data structures instead of reading Tehuti sensor values via MetricsRepository.getMetric(...).value(). The intent is to keep throttling/routing/load-shedding decisions functional even if Tehuti is disabled/removed, while preserving existing behavior when the flag is off.

Changes:

  • Introduces self-contained signal primitives (LatencyPercentileProvider, SlidingWindowAverage) and wires them into server/router/fast-client paths behind VeniceMetricsConfig.useSelfContainedStats(metricsRepository).
  • Updates LoadController to support an independent counter backend (two LongAdders drained per update interval) and updates server handler wiring accordingly.
  • Adds/updates unit tests to validate correctness and parity across Tehuti vs self-contained paths.

Reviewed changes

Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
services/venice-server/src/test/java/com/linkedin/venice/listener/ServerLoadControllerHandlerTest.java Updates handler construction to pass a MetricsRepository.
services/venice-server/src/main/java/com/linkedin/venice/stats/ServerHttpRequestStats.java Adds optional latency-sample sink to feed adaptive-throttler p99 provider.
services/venice-server/src/main/java/com/linkedin/venice/stats/AggServerHttpRequestStats.java Plumbs LatencyPercentileProvider through stats supplier/constructors.
services/venice-server/src/main/java/com/linkedin/venice/server/VeniceServer.java Wires adaptive-throttler provider into listener initialization.
services/venice-server/src/main/java/com/linkedin/venice/listener/ServerLoadControllerHandler.java Enables independent load-controller counter path based on use.self.contained.stats.
services/venice-server/src/main/java/com/linkedin/venice/listener/ListenerService.java Adds constructor overload to pass LatencyPercentileProvider into channel initializer.
services/venice-server/src/main/java/com/linkedin/venice/listener/HttpChannelInitializer.java Plumbs provider into request stats + passes repo into load-controller handler.
services/venice-router/src/main/java/com/linkedin/venice/router/api/routing/helix/HelixGroupSelector.java Enables self-contained per-group avg read path for routing stats.
internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/integration/utils/TestVeniceServer.java Updates overridden createListenerService signature to accept provider.
internal/venice-common/src/main/java/com/linkedin/venice/ConfigKeys.java Notes config is defined cross-cutting in VeniceMetricsConfig.
internal/venice-client-common/src/test/java/com/linkedin/venice/utils/concurrent/SlidingWindowAverageTest.java Adds unit tests for SlidingWindowAverage behavior and Tehuti parity.
internal/venice-client-common/src/test/java/com/linkedin/venice/stats/routing/HelixGroupStatsTest.java Parameterizes tests for Tehuti vs self-contained avg read path + parity test.
internal/venice-client-common/src/test/java/com/linkedin/venice/reliability/LoadControllerTest.java Parameterizes tests for Tehuti vs independent counter backends + parity test.
internal/venice-client-common/src/main/java/com/linkedin/venice/utils/concurrent/SlidingWindowAverage.java Adds self-contained sliding-window average with cached reads.
internal/venice-client-common/src/main/java/com/linkedin/venice/utils/concurrent/LatencyPercentileProvider.java Adds self-contained ring-buffer reservoir p99 provider.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/VeniceMetricsConfig.java Adds use.self.contained.stats config + static helper to read it from VeniceMetricsRepository.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/routing/HelixGroupStats.java Adds self-contained per-group avg path backed by SlidingWindowAverage.
internal/venice-client-common/src/main/java/com/linkedin/venice/stats/OpenTelemetryMetricsSetup.java Threads useSelfContainedStats into setup info for downstream consumers.
internal/venice-client-common/src/main/java/com/linkedin/venice/reliability/LoadController.java Adds independent-counter backend option (LongAdder) alongside Tehuti windowed counts.
clients/venice-client/src/main/java/com/linkedin/venice/fastclient/meta/HelixGroupRoutingStrategy.java Uses cross-cutting flag to select Helix group avg latency read path.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/LatencyPercentileProviderTest.java Adds unit tests for reservoir percentile behavior + concurrency.
clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/AdaptiveThrottlerSignalServiceTest.java Splits tests by Tehuti vs provider path and adds parity coverage.
clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/AdaptiveThrottlerSignalService.java Adds self-contained p99 provider path controlled by cross-cutting flag.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +133 to +142
if (hostSingleGetLatencyP99Metric != null) {
hostSingleGetLatencyP99 = hostSingleGetLatencyP99Metric.value();
}
if (hostMultiGetLatencyP99Metric != null) {
hostMultiGetLatencyP99 = hostMultiGetLatencyP99Metric.value();
}
if (hostReadComputeLatencyP99Metric != null) {
hostReadComputeLatencyP99 = hostReadComputeLatencyP99Metric.value();
}
applyLatencySignals(hostSingleGetLatencyP99, hostMultiGetLatencyP99, hostReadComputeLatencyP99);
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants