[server][da-vinci][router][fast-client] Introduce self-contained control-plane signal stats#2839
Open
m-nagarajan wants to merge 1 commit into
Open
Conversation
control-plane signal stats Adds USE_SELF_CONTAINED_STATS config in VeniceMetricsConfig to replace Tehuti-backed control signals with library-independent implementations, keeping throttling, routing, and load-shedding decisions functional when the Tehuti dependency is removed. Default false — all existing Tehuti paths are unchanged. New classes (venice-client-common/utils/concurrent): - LatencyPercentileProvider: lock-free ring-buffer reservoir (4096 slots per LatencyType) for adaptive throttler p99 signal. Moved from da-vinci.kafka.consumer to be accessible to all components. - SlidingWindowAverage: dual-bucket sliding window for per-group routing avg with single-volatile-ref cache for per-request read performance. LoadController independent path: two LongAdders with sumThenReset() replacing SlidingWindowCount; window = rejectionRatioUpdateIntervalInSec. HelixGroupStats/Selector/RoutingStrategy: useSelfContainedStats flag selects SlidingWindowAverage (true) or Tehuti Avg metric (false). ServerHttpRequestStats: observeLatencyForSignal() feeds the provider when enabled; no-op null-check guard when disabled (zero overhead).
There was a problem hiding this comment.
Pull request overview
This PR adds an opt-in (use.self.contained.stats, default false) path to source key control-plane signals (adaptive throttler read-latency p99, Helix group routing avg latency, and server load-controller request/accept counts) from Venice-owned data structures instead of reading Tehuti sensor values via MetricsRepository.getMetric(...).value(). The intent is to keep throttling/routing/load-shedding decisions functional even if Tehuti is disabled/removed, while preserving existing behavior when the flag is off.
Changes:
- Introduces self-contained signal primitives (
LatencyPercentileProvider,SlidingWindowAverage) and wires them into server/router/fast-client paths behindVeniceMetricsConfig.useSelfContainedStats(metricsRepository). - Updates
LoadControllerto support an independent counter backend (twoLongAdders drained per update interval) and updates server handler wiring accordingly. - Adds/updates unit tests to validate correctness and parity across Tehuti vs self-contained paths.
Reviewed changes
Copilot reviewed 23 out of 23 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| services/venice-server/src/test/java/com/linkedin/venice/listener/ServerLoadControllerHandlerTest.java | Updates handler construction to pass a MetricsRepository. |
| services/venice-server/src/main/java/com/linkedin/venice/stats/ServerHttpRequestStats.java | Adds optional latency-sample sink to feed adaptive-throttler p99 provider. |
| services/venice-server/src/main/java/com/linkedin/venice/stats/AggServerHttpRequestStats.java | Plumbs LatencyPercentileProvider through stats supplier/constructors. |
| services/venice-server/src/main/java/com/linkedin/venice/server/VeniceServer.java | Wires adaptive-throttler provider into listener initialization. |
| services/venice-server/src/main/java/com/linkedin/venice/listener/ServerLoadControllerHandler.java | Enables independent load-controller counter path based on use.self.contained.stats. |
| services/venice-server/src/main/java/com/linkedin/venice/listener/ListenerService.java | Adds constructor overload to pass LatencyPercentileProvider into channel initializer. |
| services/venice-server/src/main/java/com/linkedin/venice/listener/HttpChannelInitializer.java | Plumbs provider into request stats + passes repo into load-controller handler. |
| services/venice-router/src/main/java/com/linkedin/venice/router/api/routing/helix/HelixGroupSelector.java | Enables self-contained per-group avg read path for routing stats. |
| internal/venice-test-common/src/integrationTest/java/com/linkedin/venice/integration/utils/TestVeniceServer.java | Updates overridden createListenerService signature to accept provider. |
| internal/venice-common/src/main/java/com/linkedin/venice/ConfigKeys.java | Notes config is defined cross-cutting in VeniceMetricsConfig. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/utils/concurrent/SlidingWindowAverageTest.java | Adds unit tests for SlidingWindowAverage behavior and Tehuti parity. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/stats/routing/HelixGroupStatsTest.java | Parameterizes tests for Tehuti vs self-contained avg read path + parity test. |
| internal/venice-client-common/src/test/java/com/linkedin/venice/reliability/LoadControllerTest.java | Parameterizes tests for Tehuti vs independent counter backends + parity test. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/utils/concurrent/SlidingWindowAverage.java | Adds self-contained sliding-window average with cached reads. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/utils/concurrent/LatencyPercentileProvider.java | Adds self-contained ring-buffer reservoir p99 provider. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/VeniceMetricsConfig.java | Adds use.self.contained.stats config + static helper to read it from VeniceMetricsRepository. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/routing/HelixGroupStats.java | Adds self-contained per-group avg path backed by SlidingWindowAverage. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/stats/OpenTelemetryMetricsSetup.java | Threads useSelfContainedStats into setup info for downstream consumers. |
| internal/venice-client-common/src/main/java/com/linkedin/venice/reliability/LoadController.java | Adds independent-counter backend option (LongAdder) alongside Tehuti windowed counts. |
| clients/venice-client/src/main/java/com/linkedin/venice/fastclient/meta/HelixGroupRoutingStrategy.java | Uses cross-cutting flag to select Helix group avg latency read path. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/LatencyPercentileProviderTest.java | Adds unit tests for reservoir percentile behavior + concurrency. |
| clients/da-vinci-client/src/test/java/com/linkedin/davinci/kafka/consumer/AdaptiveThrottlerSignalServiceTest.java | Splits tests by Tehuti vs provider path and adds parity coverage. |
| clients/da-vinci-client/src/main/java/com/linkedin/davinci/kafka/consumer/AdaptiveThrottlerSignalService.java | Adds self-contained p99 provider path controlled by cross-cutting flag. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+133
to
+142
| if (hostSingleGetLatencyP99Metric != null) { | ||
| hostSingleGetLatencyP99 = hostSingleGetLatencyP99Metric.value(); | ||
| } | ||
| if (hostMultiGetLatencyP99Metric != null) { | ||
| hostMultiGetLatencyP99 = hostMultiGetLatencyP99Metric.value(); | ||
| } | ||
| if (hostReadComputeLatencyP99Metric != null) { | ||
| hostReadComputeLatencyP99 = hostReadComputeLatencyP99Metric.value(); | ||
| } | ||
| applyLatencySignals(hostSingleGetLatencyP99, hostMultiGetLatencyP99, hostReadComputeLatencyP99); |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem Statement
Control-plane signals used by Venice's adaptive throttler, Helix group
routing, and server load controller all read their values from Tehuti
sensors via
MetricsRepository.getMetric(...).value(). This creates ahard dependency on Tehuti being enabled: if Tehuti is removed, these
signals silently return zero and throttling/routing/load-shedding
decisions break.
Solution
Introduces a
use.self.contained.statsconfig inVeniceMetricsConfig(default
false) that routes the three control signals tolibrary-independent implementations when enabled. Existing Tehuti paths
are completely unchanged when the config is
false.New classes (
venice-client-common/utils/concurrent):LatencyPercentileProvider— lock-free ring-buffer reservoir (4096slots per
LatencyType) for the adaptive throttler's p99 read-latencysignal. Write: O(1) atomic slot write. Read: O(n log n) sort of 4096
longs, called only every 30s by the signal refresh scheduler.
SlidingWindowAverage— dual-bucket sliding window for the per-grouprouting average latency. Write: O(1)
LongAdderincrement. Read:O(cells)
LongAdder.sum()merge, cached behind a single volatileCachedAveragereference for 100ms.LoadControllerindependent path: replacesSlidingWindowCountwith two plain
LongAdders usingsumThenReset(). The effective windowbecomes
rejectionRatioUpdateIntervalInSec(the existing cachedmeasurement cadence), making
windowSizeInSecirrelevant for this path.Wiring: all three components read the flag through
VeniceMetricsConfig.useSelfContainedStats(metricsRepository)— onestatic helper call, no per-component config keys.
Code changes
use.self.contained.statsinVeniceMetricsConfig(defaultfalse). Whenfalse, zero behaviorchange anywhere.
Concurrency-Specific Checks
LatencyPercentileProviderwrites are lock-free (AtomicLongArray);concurrent reads tolerate a few in-flight writes for an approximate
signal.
SlidingWindowAverageuses double-checked locking for bucketrotation and a single volatile reference for cache atomicity.
VeniceConcurrentHashMap).How was this PR tested?
LatencyPercentileProviderTest,SlidingWindowAverageTest,SlidingWindowCountTest.LoadControllerTest(bothcounter paths parameterized),
HelixGroupStatsTest,AdaptiveThrottlerSignalServiceTest.results to the Tehuti counterparts for the same input sequences.
use.self.contained.stats=false.Does this PR introduce any user-facing or breaking changes?