@sushantmane commented on Feb 12, 2026

Summary

  • Bug 1 (DispatchingAvroGenericStoreClient): When transportClient.get() throws synchronously, the catch block completes routeRequestFuture but not valueFuture, causing callers to hang forever.
  • Bug 2 (RetriableAvroGenericStoreClient - retry budget): When the long-tail retry fires but RetryManager.isRetryAllowed() returns false (budget exhausted), retryTask does nothing and retryFuture is never completed. If the original request subsequently fails, the error retry path is also skipped because timeoutFuture.isDone()==true. This leaves finalFuture (backed by allOf(original, retry)) hanging forever.
  • Bug 3 (RetriableAvroGenericStoreClient - 429 before long-tail): When a 429 is received before the long-tail retry timer fires, the timeout is cancelled but retryFuture is never completed since retryTask never executes.

All three bugs result in CompletableFutures that are never completed, which in sync-over-async callers leads to thread starvation and service-wide timeouts.
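The shape of Bug 1 can be sketched in isolation. This is a minimal illustration, not the actual DispatchingAvroGenericStoreClient code: the `Transport` interface, the `get` helper, and the `completeValueFuture` flag are hypothetical names introduced only to contrast the behavior before and after the fix.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class SyncThrowSketch {
  /** Hypothetical stand-in for a transport whose get() may throw before returning a future. */
  interface Transport {
    CompletableFuture<String> get(String key);
  }

  /**
   * Returns valueFuture, the future the caller waits on. When completeValueFuture is false,
   * the catch block completes only routeRequestFuture, mirroring the bug described above.
   */
  static CompletableFuture<String> get(Transport transport, String key, boolean completeValueFuture) {
    CompletableFuture<String> valueFuture = new CompletableFuture<>();
    CompletableFuture<Integer> routeRequestFuture = new CompletableFuture<>();
    try {
      transport.get(key).whenComplete((value, error) -> {
        if (error != null) {
          valueFuture.completeExceptionally(error);
        } else {
          valueFuture.complete(value);
        }
      });
      routeRequestFuture.complete(200);
    } catch (Exception e) {
      routeRequestFuture.completeExceptionally(e);
      if (completeValueFuture) {
        // The one-line fix: the caller-visible future must also be completed.
        valueFuture.completeExceptionally(e);
      }
    }
    return valueFuture;
  }

  /** True if the future completes (normally or exceptionally) within the given time. */
  static boolean completesWithin(CompletableFuture<?> future, long millis) {
    try {
      future.get(millis, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false;
    } catch (Exception e) {
      return true; // completed exceptionally, or the wait was interrupted
    }
  }

  public static void main(String[] args) {
    Transport throwing = key -> {
      throw new IllegalStateException("synchronous failure");
    };
    System.out.println(completesWithin(get(throwing, "k", false), 100)); // false: caller would hang
    System.out.println(completesWithin(get(throwing, "k", true), 100));  // true: caller sees the error
  }
}
```

A caller that blocks on the returned future without a timeout never returns in the first case, which is how a single missed completion turns into a permanently parked thread.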

Production evidence

Observed in a production service using Venice fast client with ~278 hosts. During a 10-minute incident window, storage node heartbeat health checks returned HTTP 500 ("hardware not healthy"), triggering InstanceHealthMonitor to mark nodes unhealthy.

Heap dump analysis

  • Custom HPROF parser found 5,086 CompletableFuture instances, of which 2,710 (53.3%) were incomplete
  • Zero R2TransportClientCallback instances — confirming no in-flight network requests. Every R2 request had completed, but the higher-level futures were abandoned
  • All 48 IO reactor threads were idle in EPoll.wait()
  • Full Venice client chain confirmed in the heap: R2TransportClient → DispatchingAvroSpecificStoreClient → RetriableAvroSpecificStoreClient → StatsAvroSpecificStoreClient

Metrics during incident (aggregated across fleet)

| Metric | Avg | Peak | Description |
|---|---|---|---|
| long_tail_retry_request | 0.36/s | 1.93/s | Long-tail retries fired |
| rejected_retry | 0.07/s | 1.19/s | Retries rejected due to budget exhaustion |
| retries_remaining (gauge) | 7.2 | min: -1 | Token bucket drained to negative on some hosts |
| error_retry_request | 0.005/s | 0.27/s | Error retries, near zero (path was skipped) |
| retry_request_win | 0.21/s | 1.32/s | Retries that beat original request |

Timeline reconstruction

  1. Storage node heartbeat returns 500, InstanceHealthMonitor marks nodes unhealthy
  2. Burst of request failures drains the RetryManager token bucket to 0 (some hosts reached -1)
  3. Long-tail retries fire but isRetryAllowed() returns false → retryTask does nothing → retryFuture never completed → finalFuture hangs → caller thread permanently blocked
  4. error_retry_request near-zero confirms the error retry path was skipped because timeoutFuture.isDone()==true (long-tail had already fired)
  5. Several hosts became completely unresponsive — zero metric telemetry — indicating full thread starvation
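Steps 3 and 4 of the timeline can be reproduced with plain CompletableFutures. The sketch below is an assumption-laden model, not Venice's implementation: the class, the `request` helper, and the `completeRetryOnDenial` flag are hypothetical, and the way `finalFuture` is wired (success on either path, failure only when both paths fail) is inferred from the description of `allOf(original, retry)` above.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RetryBudgetHangSketch {
  /** Models one request where the long-tail timer fires first and the original fails later. */
  static CompletableFuture<String> request(boolean completeRetryOnDenial) {
    CompletableFuture<String> original = new CompletableFuture<>();
    CompletableFuture<String> retry = new CompletableFuture<>();
    CompletableFuture<String> finalFuture = new CompletableFuture<>();

    // A success on either path completes finalFuture.
    original.thenAccept(finalFuture::complete);
    retry.thenAccept(finalFuture::complete);

    // A failure is reported only once BOTH paths have failed, so both must complete.
    CompletableFuture.allOf(original, retry).whenComplete((ignored, error) -> {
      if (original.isCompletedExceptionally() && retry.isCompletedExceptionally()) {
        finalFuture.completeExceptionally(error);
      }
    });

    // Timeline step 3: the long-tail timer fires, but the budget check denies the retry.
    boolean retryAllowed = false;
    if (retryAllowed) {
      retry.complete("value-from-retry");
    } else if (completeRetryOnDenial) {
      retry.completeExceptionally(new IllegalStateException("retry budget exhausted"));
    }
    // Without the else branch, retry is left incomplete at this point.

    // Timeline step 4: the original fails afterwards; no error retry is attempted
    // because the long-tail timer has already fired.
    original.completeExceptionally(new RuntimeException("storage node error"));
    return finalFuture;
  }

  /** True if the future completes (normally or exceptionally) within the given time. */
  static boolean completesWithin(CompletableFuture<?> future, long millis) {
    try {
      future.get(millis, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false;
    } catch (Exception e) {
      return true; // completed exceptionally, or the wait was interrupted
    }
  }

  public static void main(String[] args) {
    System.out.println(completesWithin(request(false), 100)); // false: finalFuture hangs
    System.out.println(completesWithin(request(true), 100));  // true: finalFuture fails fast
  }
}
```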

Fixes

  1. Add valueFuture.completeExceptionally(e) in the catch block of DispatchingAvroGenericStoreClient.get().
  2. Add an else clause in retryTask to complete retryFuture exceptionally when the retry budget is exhausted.
  3. Complete retryFuture exceptionally when 429 is received before the long-tail timer fires and the timeout is cancelled.
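Fix 3 covers the case where the timer is cancelled and the retry task therefore never runs. The following is a rough sketch under stated assumptions: `RateExceededException`, the 5-second threshold, the `request` helper, and the `completeRetryOnCancel` flag are all hypothetical and stand in for the real 429 handling.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

public class RateLimitedBeforeLongTailSketch {
  /** Hypothetical stand-in for the client's 429 exception type. */
  static class RateExceededException extends RuntimeException {
    RateExceededException() {
      super("429 too many requests");
    }
  }

  static CompletableFuture<String> request(ScheduledExecutorService scheduler, boolean completeRetryOnCancel) {
    CompletableFuture<String> original = new CompletableFuture<>();
    CompletableFuture<String> retry = new CompletableFuture<>();
    CompletableFuture<String> finalFuture = new CompletableFuture<>();

    // Long-tail timer: would launch the retry if the original were still pending at the threshold.
    ScheduledFuture<?> longTailTimer = scheduler.schedule(() -> {
      retry.complete("value-from-retry");
    }, 5, TimeUnit.SECONDS);

    original.thenAccept(finalFuture::complete);
    retry.thenAccept(finalFuture::complete);
    CompletableFuture.allOf(original, retry).whenComplete((ignored, error) -> {
      if (original.isCompletedExceptionally() && retry.isCompletedExceptionally()) {
        finalFuture.completeExceptionally(error);
      }
    });

    original.whenComplete((value, error) -> {
      if (error instanceof RateExceededException) {
        // Retrying a rate-limited request is pointless, so the timer is cancelled.
        longTailTimer.cancel(false);
        if (completeRetryOnCancel) {
          // The fix: the retry task will never run, so retry must be completed here.
          retry.completeExceptionally(error);
        }
      }
    });

    // The 429 arrives well before the long-tail threshold.
    original.completeExceptionally(new RateExceededException());
    return finalFuture;
  }

  /** True if the future completes (normally or exceptionally) within the given time. */
  static boolean completesWithin(CompletableFuture<?> future, long millis) {
    try {
      future.get(millis, TimeUnit.MILLISECONDS);
      return true;
    } catch (TimeoutException e) {
      return false;
    } catch (Exception e) {
      return true; // completed exceptionally, or the wait was interrupted
    }
  }

  public static void main(String[] args) {
    ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
    try {
      System.out.println(completesWithin(request(scheduler, false), 100)); // false: hangs
      System.out.println(completesWithin(request(scheduler, true), 100));  // true: fails with the 429
    } finally {
      scheduler.shutdownNow();
    }
  }
}
```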

Test plan

  • Added testGetValueFutureNeverCompletedWhenTransportClientThrowsSynchronously to DispatchingAvroGenericStoreClientTest
  • Added testFinalFutureHangsWhenOriginalFailsWith429BeforeLongTailRetry to RetriableAvroGenericStoreClientTest
  • Added testFinalFutureHangsWhenRetryBudgetExhaustedAndOriginalFails to RetriableAvroGenericStoreClientTest
  • Added testFinalFutureHangsWhenLongTailFiresBeforeOriginalFailsAndBudgetExhausted to RetriableAvroGenericStoreClientTest
  • All 4 tests fail before fixes and pass after fixes

Tests demonstrate that CompletableFutures are never completed in certain
code paths, causing finalFuture to hang forever:

1. DispatchingAvroGenericStoreClient: catch block completes
   routeRequestFuture but not valueFuture when transportClient throws
   synchronously.

2. RetriableAvroGenericStoreClient: when 429 is received before
   long-tail retry fires, timeoutFuture is cancelled but retryFuture
   is never completed.

3. RetriableAvroGenericStoreClient: when long-tail retry fires but
   retry budget is exhausted, retryTask does nothing and retryFuture
   is never completed. If original request then fails, error retry is
   skipped because timeoutFuture.isDone()==true.

1. DispatchingAvroGenericStoreClient: complete valueFuture in catch
   block when transportClient.get() throws synchronously.

2. RetriableAvroGenericStoreClient: add else clause in retryTask to
   complete retryFuture when retry budget is exhausted, preventing
   finalFuture from hanging on allOf().

3. RetriableAvroGenericStoreClient: complete retryFuture when 429 is
   received before long-tail retry fires and the timeout is cancelled,
   since retryTask will never execute.
Copilot AI review requested due to automatic review settings February 12, 2026 10:12
@sushantmane changed the title from "Fix 3 fast client bugs where CompletableFutures are never completed" to "[fast-client] Fix 3 fast client bugs where CompletableFutures are never completed" on Feb 12, 2026
Copilot AI left a comment

Pull request overview

Fixes multiple fast-client edge cases where internal CompletableFutures were never completed, causing sync-over-async callers to hang indefinitely and triggering thread starvation/timeouts.

Changes:

  • Ensure DispatchingAvroGenericStoreClient.get() completes valueFuture exceptionally when synchronous failures occur during request setup.
  • Ensure RetriableAvroGenericStoreClient completes retryFuture when retry budget is exhausted and when a 429 arrives before long-tail retry fires.
  • Add regression tests reproducing the previously-hanging scenarios.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| clients/venice-client/src/main/java/com/linkedin/venice/fastclient/DispatchingAvroGenericStoreClient.java | Completes valueFuture exceptionally in the outer catch to prevent hangs on synchronous failures. |
| clients/venice-client/src/main/java/com/linkedin/venice/fastclient/RetriableAvroGenericStoreClient.java | Completes retryFuture in additional edge cases (budget exhausted / 429-before-long-tail) to prevent allOf() from hanging. |
| clients/venice-client/src/test/java/com/linkedin/venice/fastclient/DispatchingAvroGenericStoreClientTest.java | Adds a test covering synchronous transportClient.get() failure and verifying the returned future completes. |
| clients/venice-client/src/test/java/com/linkedin/venice/fastclient/RetriableAvroGenericStoreClientTest.java | Adds tests reproducing hangs related to retry budget exhaustion and 429 timing. |


- Remove hard-coded line numbers from test failure messages and Javadoc,
  use semantic method/symbol references instead
- Remove redundant doReturn(null) stub in DispatchingAvro test
- Change retry-budget-exhausted fix to chain retryFuture to
  originalRequestFuture so the original error propagates instead of
  being masked by a synthetic "budget exhausted" exception
- Add package-private setSingleKeyLongTailRetryManager() with Venice's
  own @VisibleForTesting annotation for test access
- Remove all java.lang.reflect usage from tests
- Move inline type references to proper imports (RetryManager,
  ArgumentMatchers, doThrow)
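The chaining approach mentioned in these commit notes can be sketched as follows. This is only an illustration of the idea, assuming the same `allOf(original, retry)` wiring described in the PR summary; the class and method names are hypothetical and the real change in RetriableAvroGenericStoreClient may differ in detail.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class ChainRetryToOriginalSketch {
  /**
   * Denies the retry, chains retry to original, then fails the original.
   * Returns the cause that a blocking caller of finalFuture.get() would observe.
   */
  static Throwable failureCauseWithChaining() {
    CompletableFuture<String> original = new CompletableFuture<>();
    CompletableFuture<String> retry = new CompletableFuture<>();
    CompletableFuture<String> finalFuture = new CompletableFuture<>();

    original.thenAccept(finalFuture::complete);
    retry.thenAccept(finalFuture::complete);
    CompletableFuture.allOf(original, retry).whenComplete((ignored, error) -> {
      if (original.isCompletedExceptionally() && retry.isCompletedExceptionally()) {
        finalFuture.completeExceptionally(error);
      }
    });

    // Budget exhausted: instead of a synthetic exception, retry mirrors whatever
    // outcome the original eventually has.
    original.whenComplete((value, error) -> {
      if (error != null) {
        retry.completeExceptionally(error);
      } else {
        retry.complete(value);
      }
    });

    original.completeExceptionally(new RuntimeException("storage node error"));

    try {
      finalFuture.get();
      return null; // not reached in this scenario
    } catch (ExecutionException e) {
      return e.getCause();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(failureCauseWithChaining().getMessage()); // storage node error
  }
}
```

With this wiring the retry path can never outlive the original request, so `allOf()` always completes, and the caller sees the original failure.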
Copilot AI review requested due to automatic review settings February 12, 2026 10:58
Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.



Comment on lines 866 to 867
* We use reflection to replace the internal RetryManager with a mock that always denies retries,
* ensuring deterministic reproduction regardless of timing.
Copilot AI Feb 12, 2026

This comment says the test uses reflection to replace the internal RetryManager, but the implementation actually uses the new setSingleKeyLongTailRetryManager(...) test hook. Update the comment to match the current approach to avoid misleading future maintainers.

Suggested change
- * We use reflection to replace the internal RetryManager with a mock that always denies retries,
- * ensuring deterministic reproduction regardless of timing.
+ * This test uses the {@code setSingleKeyLongTailRetryManager(...)} test hook to install a
+ * {@link RetryManager} that always denies retries, ensuring deterministic reproduction
+ * of this scenario regardless of timing.

Contributor Author

Done. Updated the comment to reference setSingleKeyLongTailRetryManager(...) instead of reflection.

* BUG REPRODUCTION (variant with shorter threshold): Same as above but uses a 20ms threshold
* to clearly demonstrate the timing: long-tail fires at 20ms (budget denied → does nothing),
* original fails at 200ms (timeoutFuture.isDone()=true → error retry skipped).
* Uses reflection to mock the RetryManager for deterministic reproduction.
Copilot AI Feb 12, 2026

This comment says the test uses reflection to mock the RetryManager, but it actually uses setSingleKeyLongTailRetryManager(...). Consider updating the comment to avoid confusion.

Suggested change
- * Uses reflection to mock the RetryManager for deterministic reproduction.
+ * Uses setSingleKeyLongTailRetryManager(...) to inject a mock RetryManager for deterministic reproduction.

Contributor Author

Done. Updated the comment to reference setSingleKeyLongTailRetryManager(...) instead of reflection.

Comment on lines +924 to +927
} catch (ExecutionException e) {
// CORRECT: future completed with an exception (this means the bug is fixed)
assertTrue(e.getCause() instanceof VeniceClientException);
}
Copilot AI Feb 12, 2026

The assertions in these new tests assume result.get() fails with an ExecutionException whose direct cause is VeniceClientException. However, RetriableAvroGenericStoreClient#get() completes finalFuture from a CompletableFuture.allOf(...) callback, and allOf completes exceptionally with a CompletionException wrapper. As a result, e.getCause() is likely CompletionException (with the Venice exception as its cause). Update the assertion to unwrap CompletionException (or use ExceptionUtils.recursiveClassEquals) so the test matches the actual exception shape.

Contributor Author

The assertion is correct as-is, though not because allOf() skips wrapping. Per its Javadoc, CompletableFuture.allOf() completes exceptionally with a CompletionException holding the source exception as its cause, so the throwable passed to the allOf().whenComplete() callback is CompletionException(VeniceClientException). That throwable is then passed to finalFuture.completeExceptionally(), and CompletableFuture.get() unwraps a CompletionException before wrapping its cause in ExecutionException. So result.get() throws ExecutionException(VeniceClientException) and e.getCause() instanceof VeniceClientException is correct. All 4 tests pass with this assertion.
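The exception shape debated in this thread can be checked with a standalone JDK-only program. It is a simplified model, not the Venice code: it assumes the throwable from the `allOf().whenComplete()` callback is handed unchanged to `finalFuture.completeExceptionally()`, and it uses IllegalStateException in place of VeniceClientException. The `Observed` holder and `observe` method are hypothetical names.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;

public class AllOfExceptionShapeSketch {
  /** Records what each layer sees. */
  static final class Observed {
    Throwable seenByCallback; // throwable handed to the allOf().whenComplete() callback
    Throwable causeFromGet;   // getCause() of the ExecutionException thrown by finalFuture.get()
  }

  static Observed observe() {
    CompletableFuture<String> original = new CompletableFuture<>();
    CompletableFuture<String> retry = new CompletableFuture<>();
    CompletableFuture<String> finalFuture = new CompletableFuture<>();
    Observed observed = new Observed();

    CompletableFuture.allOf(original, retry).whenComplete((ignored, error) -> {
      observed.seenByCallback = error;
      finalFuture.completeExceptionally(error);
    });

    original.completeExceptionally(new IllegalStateException("original failed"));
    retry.completeExceptionally(new IllegalStateException("retry failed"));

    try {
      finalFuture.get();
    } catch (ExecutionException e) {
      observed.causeFromGet = e.getCause();
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
      throw new IllegalStateException(e);
    }
    return observed;
  }

  public static void main(String[] args) {
    Observed observed = observe();
    // allOf() wraps the source failure in a CompletionException.
    System.out.println(observed.seenByCallback.getClass().getSimpleName()); // CompletionException
    // get() unwraps that CompletionException before building the ExecutionException.
    System.out.println(observed.causeFromGet.getClass().getSimpleName());   // IllegalStateException
  }
}
```

In this model the callback does receive a wrapper, but a caller using `get()` still sees the underlying exception as the direct cause. A caller using `join()` would see the CompletionException instead.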

Comment on lines +985 to +988
} catch (ExecutionException e) {
// CORRECT: future completed with the 429 exception
assertTrue(e.getCause() instanceof VeniceClientRateExceededException);
}
Copilot AI Feb 12, 2026

Same issue as above: when both original and retry futures complete exceptionally, finalFuture is completed exceptionally with the throwable from CompletableFuture.allOf(...), which is typically a CompletionException wrapper. This assertion should unwrap CompletionException (or use ExceptionUtils.recursiveClassEquals) rather than checking only e.getCause()'s direct type.

Contributor Author

Same as above: get() unwraps the CompletionException produced by allOf(), so e.getCause() is the underlying exception. The assertion is correct and the test passes.

Comment on lines +1053 to +1056
} catch (ExecutionException e) {
// CORRECT behavior: future completed with exception
assertTrue(e.getCause() instanceof VeniceClientException);
}
Copilot AI Feb 12, 2026

Same issue as above: finalFuture is completed exceptionally from the CompletableFuture.allOf(...) callback, so the thrown cause is likely a CompletionException wrapper. Unwrap before asserting on the underlying VeniceClientException to avoid false failures.

Contributor Author

Same as above: get() unwraps the CompletionException produced by allOf(), so e.getCause() is the underlying exception. The assertion is correct and the test passes.

Updated two test method Javadoc comments that still referenced
reflection-based RetryManager replacement to reference the
setSingleKeyLongTailRetryManager(...) test hook instead.