[server][fc] Fix gRPC read stats recording successful reads as failures #2847

Open
m-nagarajan wants to merge 1 commit into linkedin:main from m-nagarajan:mnagaraj/fix-grpc-stats-200-fail
@m-nagarajan

Contributor

Problem Statement

GrpcOutboundStatsHandler decides whether a read is recorded as a success or an error with:

!ctx.hasError() && !responseStatus.equals(OK) || responseStatus.equals(NOT_FOUND)

Because Java binds && tighter than ||, this parses as
(!hasError && !status.equals(OK)) || status.equals(NOT_FOUND). For a successful
value-found read (hasError == false, status OK) it evaluates to false, so the read
is recorded via errorRequest(). Every successful gRPC read therefore emits
Venice.Server.Read.CallCount{Http.Response.StatusCode=200, Venice.Response.StatusCodeCategory=fail}
and increments the error_request sensor. The same condition also records genuine 400/500
responses and 429 throttles as successes.

This silently inflates the gRPC read error rate (and deflates real errors), which can falsely
trip error_request-based alerts on any store served over gRPC fast-client. The Netty path
(StatsHandler) already categorizes correctly; only the gRPC handler is affected.
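
The precedence problem can be reproduced in isolation. The following is a minimal sketch, not the Venice handler itself: `PrecedenceDemo` and `oldIsSuccess` are hypothetical names, and HTTP statuses are modeled as plain ints instead of `HttpResponseStatus`. It evaluates the pre-fix expression for the cases discussed above.

```java
// Hypothetical standalone reproduction of the pre-fix condition.
// Statuses are plain ints here; the real handler compares HttpResponseStatus values.
class PrecedenceDemo {
  static final int OK = 200;
  static final int NOT_FOUND = 404;

  // Same shape as the pre-fix expression. Because && binds tighter than ||,
  // Java reads it as (!hasError && status != OK) || status == NOT_FOUND.
  static boolean oldIsSuccess(boolean hasError, int status) {
    return !hasError && status != OK || status == NOT_FOUND;
  }

  public static void main(String[] args) {
    System.out.println("200, no error -> " + oldIsSuccess(false, 200)); // false: recorded as an error
    System.out.println("404, no error -> " + oldIsSuccess(false, 404)); // true
    System.out.println("400, no error -> " + oldIsSuccess(false, 400)); // true: recorded as a success
    System.out.println("500, no error -> " + oldIsSuccess(false, 500)); // true: recorded as a success
    System.out.println("429, no error -> " + oldIsSuccess(false, 429)); // true: recorded as a success
  }
}
```

Only the key-absent case lands in the intended branch; the value-found case is the one that produces the (200, fail) series.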

Solution

Parenthesize the condition to mirror StatsHandler:

  • record a success only when no handler flagged an error and the status is OK or NOT_FOUND;
  • record TOO_MANY_REQUESTS as neither (a throttled request is not a server-side failure);
  • record everything else as an error.
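
The three rules above can be sketched as a small standalone classifier. This is an illustration under assumptions, not the code in the PR: `ReadOutcomeSketch`, `Outcome`, and `classify` are hypothetical names, and statuses are plain ints rather than `HttpResponseStatus`. The branch structure follows the parenthesized condition described here.

```java
// Hypothetical sketch of the three-way categorization; names are illustrative only.
class ReadOutcomeSketch {
  enum Outcome { SUCCESS, ERROR, NEITHER }

  static final int OK = 200;
  static final int NOT_FOUND = 404;
  static final int TOO_MANY_REQUESTS = 429;

  static Outcome classify(boolean hasError, int status) {
    // Explicit parentheses: the no-error check applies to both OK and NOT_FOUND.
    if (!hasError && (status == OK || status == NOT_FOUND)) {
      return Outcome.SUCCESS;
    } else if (status != TOO_MANY_REQUESTS) {
      return Outcome.ERROR;
    }
    // Throttled requests are counted as neither success nor error.
    return Outcome.NEITHER;
  }

  public static void main(String[] args) {
    System.out.println(classify(false, 200)); // SUCCESS
    System.out.println(classify(false, 404)); // SUCCESS
    System.out.println(classify(true, 200));  // ERROR
    System.out.println(classify(false, 500)); // ERROR
    System.out.println(classify(false, 429)); // NEITHER
  }
}
```

In the real handler the SUCCESS and ERROR outcomes would correspond to the successRequest and errorRequest calls, and NEITHER to making no call at all.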

Code changes

  • Added new code behind a config. — No.
  • Introduced new log lines. — No.

Concurrency-Specific Checks

  • Code has no race conditions or thread safety issues. The change is a pure boolean-condition correction; no shared state, locking, or threading is touched.

How was this PR tested?

  • New unit tests added.

Added GrpcOutboundStatsHandlerTest (parameterized) covering:

  • value-found (200, no error) → success — regression guard for the reported (200, fail) series;
  • key-absent (404) → success;
  • error flagged with OK status → error;
  • 400 / 500 → error;
  • 429 → neither.

Verified the test fails against the pre-fix condition (the value-found, 400, 500, and 429
cases) and passes with the fix. SpotBugs on :services:venice-server is clean.
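
As a rough cross-check of which cases the pre-fix condition gets wrong, the sketch below runs the six listed cases through the old expression without any test framework. `PreFixRegressionSketch`, `preFixOutcome`, and `failingCases` are hypothetical names; the expected column uses the same encoding as the test's data provider (TRUE = success, FALSE = error, null = neither), and statuses are plain ints.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical framework-free check; not the GrpcOutboundStatsHandlerTest from the PR.
class PreFixRegressionSketch {
  // { status, hasError, expected }: TRUE = success, FALSE = error, null = neither.
  static final Object[][] CASES = {
      { 200, false, Boolean.TRUE },  // value found
      { 404, false, Boolean.TRUE },  // key absent
      { 200, true, Boolean.FALSE },  // error flagged despite OK status
      { 400, false, Boolean.FALSE }, // malformed request
      { 500, false, Boolean.FALSE }, // server failure
      { 429, false, null } };        // throttled

  // The pre-fix handler had only two branches, so it never yields "neither".
  static Boolean preFixOutcome(boolean hasError, int status) {
    return !hasError && status != 200 || status == 404;
  }

  // Returns the status codes of the cases where the pre-fix outcome differs from the expectation.
  static List<String> failingCases() {
    List<String> failing = new ArrayList<>();
    for (Object[] c : CASES) {
      int status = (Integer) c[0];
      boolean hasError = (Boolean) c[1];
      Boolean expected = (Boolean) c[2];
      if (!preFixOutcome(hasError, status).equals(expected)) {
        failing.add(String.valueOf(status));
      }
    }
    return failing;
  }

  public static void main(String[] args) {
    System.out.println(failingCases()); // [200, 400, 500, 429]
  }
}
```

The mismatches are the same four cases reported above; the key-absent and error-flagged-with-OK cases already behaved as expected before the fix.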

Does this PR introduce any user-facing or breaking changes?

  • No.

Metric-only correctness change — corrects success/error attribution for gRPC reads. No API or
runtime behavior change otherwise.

GrpcOutboundStatsHandler chose between successRequest and errorRequest with:

    !ctx.hasError() && !responseStatus.equals(OK) || responseStatus.equals(NOT_FOUND)

Java evaluates && before ||, so this parses as
(!hasError && !status.equals(OK)) || status.equals(NOT_FOUND). A successful
value-found read (no error, status OK) evaluates to false and falls into the
else branch, recording the read via errorRequest(). That emits
Read.CallCount{Http.Response.StatusCode=200, Venice.Response.StatusCodeCategory=fail}
and increments the error_request sensor for every successful gRPC read. The
same broken condition also recorded genuine 400/500 errors and 429 throttles
as successes.

Parenthesize the condition to mirror the Netty StatsHandler: record a success
only when no handler flagged an error and the status is OK or NOT_FOUND;
record TOO_MANY_REQUESTS as neither (a throttled request is not a server-side
failure); record everything else as an error.

Added GrpcOutboundStatsHandlerTest covering value-found -> success (regression
guard), key-absent -> success, error-flagged-with-OK -> error, 400/500 ->
error, and 429 -> neither.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings June 5, 2026 22:51

Copilot AI left a comment

Pull request overview

This PR aims to correct gRPC read metric attribution in GrpcOutboundStatsHandler so successful reads are recorded as successes (matching the Netty StatsHandler behavior) instead of being incorrectly counted as failures.

Changes:

  • Fixes the boolean condition used to categorize gRPC reads as success vs error (and to exclude throttles from both).
  • Adds a parameterized unit test to cover key status/flag combinations and prevent regressions.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File and description:

  • services/venice-server/src/main/java/com/linkedin/venice/listener/grpc/handlers/GrpcOutboundStatsHandler.java: Updates success/error/throttle categorization logic for gRPC read stats.
  • services/venice-server/src/test/java/com/linkedin/venice/listener/grpc/handlers/GrpcOutboundStatsHandlerTest.java: Adds parameterized coverage for the new categorization behavior.

Comment on lines +46 to 50
    + if (!ctx.hasError()
    +     && (responseStatus.equals(HttpResponseStatus.OK) || responseStatus.equals(HttpResponseStatus.NOT_FOUND))) {
        statsContext.successRequest(serverHttpRequestStats, elapsedTime);
    - } else {
    + } else if (!responseStatus.equals(HttpResponseStatus.TOO_MANY_REQUESTS)) {
        statsContext.errorRequest(serverHttpRequestStats, elapsedTime);

    @DataProvider(name = "responseStatusCases", parallel = true)
    public Object[][] responseStatusCases() {
      return new Object[][] { { OK, false, Boolean.TRUE }, // value found -> success (regression guard for 200-as-fail)
          { NOT_FOUND, false, Boolean.TRUE }, // key absent -> success
          { OK, true, Boolean.FALSE }, // error flagged despite OK status -> error
          { BAD_REQUEST, false, Boolean.FALSE }, // malformed request -> error
          { INTERNAL_SERVER_ERROR, false, Boolean.FALSE }, // server failure -> error
          { TOO_MANY_REQUESTS, false, null } }; // throttled -> neither success nor error

Comment on lines +49 to 50
    + } else if (!responseStatus.equals(HttpResponseStatus.TOO_MANY_REQUESTS)) {
        statsContext.errorRequest(serverHttpRequestStats, elapsedTime);

Comment on lines +41 to +45
    /*
     * Record a success only when no handler flagged an error and the response is OK (value found) or NOT_FOUND
     * (key absent); otherwise record an error. TOO_MANY_REQUESTS is recorded as neither, since a throttled
     * request is not a server-side failure. Mirrors the Netty StatsHandler categorization.
     */