[server][fc] Fix gRPC read stats recording successful reads as failures#2847
Open
m-nagarajan wants to merge 1 commit into
Open
[server][fc] Fix gRPC read stats recording successful reads as failures#2847m-nagarajan wants to merge 1 commit into
m-nagarajan wants to merge 1 commit into
Conversation
GrpcOutboundStatsHandler chose between successRequest and errorRequest with:
!ctx.hasError() && !responseStatus.equals(OK) ||
responseStatus.equals(NOT_FOUND)
Java evaluates && before ||, so this parses as
(!hasError && !status.equals(OK)) || status.equals(NOT_FOUND). A successful
value-found read (no error, status OK) evaluates to false and falls into the
else branch, recording the read via errorRequest(). That emits
Read.CallCount{Http.Response.StatusCode=200,
Venice.Response.StatusCodeCategory=fail}
and increments the error_request sensor for every successful gRPC read. The
same
broken condition also recorded genuine 400/500 errors and 429 throttles as
successes.
Parenthesize the condition to mirror the Netty StatsHandler: record a success
only when no handler flagged an error and the status is OK or NOT_FOUND;
record
TOO_MANY_REQUESTS as neither (a throttled request is not a server-side
failure);
record everything else as an error.
Added GrpcOutboundStatsHandlerTest covering value-found -> success (regression
guard), key-absent -> success, error-flagged-with-OK -> error, 400/500 ->
error,
and 429 -> neither.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Pull request overview
This PR aims to correct gRPC read metric attribution in GrpcOutboundStatsHandler so successful reads are recorded as successes (matching the Netty StatsHandler behavior) instead of being incorrectly counted as failures.
Changes:
- Fixes the boolean condition used to categorize gRPC reads as success vs error (and to exclude throttles from both).
- Adds a parameterized unit test to cover key status/flag combinations and prevent regressions.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| services/venice-server/src/main/java/com/linkedin/venice/listener/grpc/handlers/GrpcOutboundStatsHandler.java | Updates success/error/throttle categorization logic for gRPC read stats. |
| services/venice-server/src/test/java/com/linkedin/venice/listener/grpc/handlers/GrpcOutboundStatsHandlerTest.java | Adds parameterized coverage for the new categorization behavior. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Comment on lines
+46
to
50
| if (!ctx.hasError() | ||
| && (responseStatus.equals(HttpResponseStatus.OK) || responseStatus.equals(HttpResponseStatus.NOT_FOUND))) { | ||
| statsContext.successRequest(serverHttpRequestStats, elapsedTime); | ||
| } else { | ||
| } else if (!responseStatus.equals(HttpResponseStatus.TOO_MANY_REQUESTS)) { | ||
| statsContext.errorRequest(serverHttpRequestStats, elapsedTime); |
| @DataProvider(name = "responseStatusCases", parallel = true) | ||
| public Object[][] responseStatusCases() { | ||
| return new Object[][] { { OK, false, Boolean.TRUE }, // value found -> success (regression guard for 200-as-fail) | ||
| { NOT_FOUND, false, Boolean.TRUE }, // key absent -> success |
| { OK, true, Boolean.FALSE }, // error flagged despite OK status -> error | ||
| { BAD_REQUEST, false, Boolean.FALSE }, // malformed request -> error | ||
| { INTERNAL_SERVER_ERROR, false, Boolean.FALSE }, // server failure -> error | ||
| { TOO_MANY_REQUESTS, false, null } }; // throttled -> neither success nor error |
Comment on lines
+49
to
50
| } else if (!responseStatus.equals(HttpResponseStatus.TOO_MANY_REQUESTS)) { | ||
| statsContext.errorRequest(serverHttpRequestStats, elapsedTime); |
Comment on lines
+41
to
+45
| /* | ||
| * Record a success only when no handler flagged an error and the response is OK (value found) or NOT_FOUND | ||
| * (key absent); otherwise record an error. TOO_MANY_REQUESTS is recorded as neither, since a throttled | ||
| * request is not a server-side failure. Mirrors the Netty StatsHandler categorization. | ||
| */ |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Problem Statement
GrpcOutboundStatsHandlerdecides whether a read is recorded as a success or an error with:Because Java binds
&&tighter than||, this parses as(!hasError && !status.equals(OK)) || status.equals(NOT_FOUND). For a successfulvalue-found read (
hasError == false, statusOK) it evaluates tofalse, so the readis recorded via
errorRequest(). Every successful gRPC read therefore emitsVenice.Server.Read.CallCount{Http.Response.StatusCode=200, Venice.Response.StatusCodeCategory=fail}and increments the
error_requestsensor. The same condition also records genuine400/500responses and
429throttles as successes.This silently inflates the gRPC read error rate (and deflates real errors), which can falsely
trip
error_request-based alerts on any store served over gRPC fast-client. The Netty path(
StatsHandler) already categorizes correctly; only the gRPC handler is affected.Solution
Parenthesize the condition to mirror
StatsHandler:OKorNOT_FOUND;TOO_MANY_REQUESTSas neither (a throttled request is not a server-side failure);Code changes
Concurrency-Specific Checks
How was this PR tested?
Added
GrpcOutboundStatsHandlerTest(parameterized) covering:200, no error) → success — regression guard for the reported(200, fail)series;404) → success;OKstatus → error;400/500→ error;429→ neither.Verified the test fails against the pre-fix condition (the value-found,
400,500, and429cases) and passes with the fix. SpotBugs on
:services:venice-serveris clean.Does this PR introduce any user-facing or breaking changes?
Metric-only correctness change — corrects success/error attribution for gRPC reads. No API or
runtime behavior change otherwise.