-
Notifications
You must be signed in to change notification settings - Fork 114
[fast-client] Fix 3 fast client bugs where CompletableFutures are never completed #2478
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
[fast-client] Fix 3 fast client bugs where CompletableFutures are never completed #2478
Conversation
Tests demonstrate that CompletableFutures are never completed in certain code paths, causing finalFuture to hang forever: 1. DispatchingAvroGenericStoreClient: catch block completes routeRequestFuture but not valueFuture when transportClient throws synchronously. 2. RetriableAvroGenericStoreClient: when 429 is received before long-tail retry fires, timeoutFuture is cancelled but retryFuture is never completed. 3. RetriableAvroGenericStoreClient: when long-tail retry fires but retry budget is exhausted, retryTask does nothing and retryFuture is never completed. If original request then fails, error retry is skipped because timeoutFuture.isDone()==true.
1. DispatchingAvroGenericStoreClient: complete valueFuture in catch block when transportClient.get() throws synchronously. 2. RetriableAvroGenericStoreClient: add else clause in retryTask to complete retryFuture when retry budget is exhausted, preventing finalFuture from hanging on allOf(). 3. RetriableAvroGenericStoreClient: complete retryFuture when 429 is received before long-tail retry fires and the timeout is cancelled, since retryTask will never execute.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Fixes multiple fast-client edge cases where internal CompletableFutures were never completed, causing sync-over-async callers to hang indefinitely and triggering thread starvation/timeouts.
Changes:
- Ensure
DispatchingAvroGenericStoreClient.get()completesvalueFutureexceptionally when synchronous failures occur during request setup. - Ensure
RetriableAvroGenericStoreClientcompletesretryFuturewhen retry budget is exhausted and when a 429 arrives before long-tail retry fires. - Add regression tests reproducing the previously-hanging scenarios.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| clients/venice-client/src/main/java/com/linkedin/venice/fastclient/DispatchingAvroGenericStoreClient.java | Completes valueFuture exceptionally in the outer catch to prevent hangs on synchronous failures. |
| clients/venice-client/src/main/java/com/linkedin/venice/fastclient/RetriableAvroGenericStoreClient.java | Completes retryFuture in additional edge cases (budget exhausted / 429-before-long-tail) to prevent allOf() from hanging. |
| clients/venice-client/src/test/java/com/linkedin/venice/fastclient/DispatchingAvroGenericStoreClientTest.java | Adds a test covering synchronous transportClient.get() failure and verifying the returned future completes. |
| clients/venice-client/src/test/java/com/linkedin/venice/fastclient/RetriableAvroGenericStoreClientTest.java | Adds tests reproducing hangs related to retry budget exhaustion and 429 timing. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
...ient/src/test/java/com/linkedin/venice/fastclient/DispatchingAvroGenericStoreClientTest.java
Outdated
Show resolved
Hide resolved
...ice-client/src/main/java/com/linkedin/venice/fastclient/RetriableAvroGenericStoreClient.java
Outdated
Show resolved
Hide resolved
...ient/src/test/java/com/linkedin/venice/fastclient/DispatchingAvroGenericStoreClientTest.java
Outdated
Show resolved
Hide resolved
...ient/src/test/java/com/linkedin/venice/fastclient/DispatchingAvroGenericStoreClientTest.java
Outdated
Show resolved
Hide resolved
- Remove hard-coded line numbers from test failure messages and Javadoc, use semantic method/symbol references instead - Remove redundant doReturn(null) stub in DispatchingAvro test - Change retry-budget-exhausted fix to chain retryFuture to originalRequestFuture so the original error propagates instead of being masked by a synthetic "budget exhausted" exception
- Add package-private setSingleKeyLongTailRetryManager() with Venice's own @VisibleForTesting annotation for test access - Remove all java.lang.reflect usage from tests - Move inline type references to proper imports (RetryManager, ArgumentMatchers, doThrow)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| * We use reflection to replace the internal RetryManager with a mock that always denies retries, | ||
| * ensuring deterministic reproduction regardless of timing. |
Copilot
AI
Feb 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment says the test uses reflection to replace the internal RetryManager, but the implementation actually uses the new setSingleKeyLongTailRetryManager(...) test hook. Update the comment to match the current approach to avoid misleading future maintainers.
| * We use reflection to replace the internal RetryManager with a mock that always denies retries, | |
| * ensuring deterministic reproduction regardless of timing. | |
| * This test uses the {@code setSingleKeyLongTailRetryManager(...)} test hook to install a | |
| * {@link RetryManager} that always denies retries, ensuring deterministic reproduction | |
| * of this scenario regardless of timing. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Updated the comment to reference setSingleKeyLongTailRetryManager(...) instead of reflection.
| * BUG REPRODUCTION (variant with shorter threshold): Same as above but uses a 20ms threshold | ||
| * to clearly demonstrate the timing: long-tail fires at 20ms (budget denied → does nothing), | ||
| * original fails at 200ms (timeoutFuture.isDone()=true → error retry skipped). | ||
| * Uses reflection to mock the RetryManager for deterministic reproduction. |
Copilot
AI
Feb 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment says the test uses reflection to mock the RetryManager, but it actually uses setSingleKeyLongTailRetryManager(...). Consider updating the comment to avoid confusion.
| * Uses reflection to mock the RetryManager for deterministic reproduction. | |
| * Uses setSingleKeyLongTailRetryManager(...) to inject a mock RetryManager for deterministic reproduction. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done. Updated the comment to reference setSingleKeyLongTailRetryManager(...) instead of reflection.
| } catch (ExecutionException e) { | ||
| // CORRECT: future completed with an exception (this means the bug is fixed) | ||
| assertTrue(e.getCause() instanceof VeniceClientException); | ||
| } |
Copilot
AI
Feb 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertions in these new tests assume result.get() fails with an ExecutionException whose direct cause is VeniceClientException. However, RetriableAvroGenericStoreClient#get() completes finalFuture from a CompletableFuture.allOf(...) callback, and allOf completes exceptionally with a CompletionException wrapper. As a result, e.getCause() is likely CompletionException (with the Venice exception as its cause). Update the assertion to unwrap CompletionException (or use ExceptionUtils.recursiveClassEquals) so the test matches the actual exception shape.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The assertion is correct as-is. CompletableFuture.allOf() uses completeRelay internally which propagates the raw AltResult from the source future without wrapping in CompletionException. CompletionException wrapping only happens when exceptions propagate through transformation stages (thenApply, thenCompose, etc.), not through allOf. Since both originalRequestFuture and retryFuture are completed via completeExceptionally(), the throwable passed to the allOf().whenComplete() callback is the raw VeniceClientException, which is then passed to finalFuture.completeExceptionally(). So result.get() throws ExecutionException(VeniceClientException) and e.getCause() instanceof VeniceClientException is correct. All 4 tests pass with this assertion.
| } catch (ExecutionException e) { | ||
| // CORRECT: future completed with the 429 exception | ||
| assertTrue(e.getCause() instanceof VeniceClientRateExceededException); | ||
| } |
Copilot
AI
Feb 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue as above: when both original and retry futures complete exceptionally, finalFuture is completed exceptionally with the throwable from CompletableFuture.allOf(...), which is typically a CompletionException wrapper. This assertion should unwrap CompletionException (or use ExceptionUtils.recursiveClassEquals) rather than checking only e.getCause()'s direct type.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above — allOf() propagates the raw exception from completeExceptionally() without CompletionException wrapping. The assertion is correct and the test passes.
| } catch (ExecutionException e) { | ||
| // CORRECT behavior: future completed with exception | ||
| assertTrue(e.getCause() instanceof VeniceClientException); | ||
| } |
Copilot
AI
Feb 12, 2026
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same issue as above: finalFuture is completed exceptionally from the CompletableFuture.allOf(...) callback, so the thrown cause is likely a CompletionException wrapper. Unwrap before asserting on the underlying VeniceClientException to avoid false failures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above — allOf() propagates the raw exception from completeExceptionally() without CompletionException wrapping. The assertion is correct and the test passes.
Updated two test method Javadoc comments that still referenced reflection-based RetryManager replacement to reference the setSingleKeyLongTailRetryManager(...) test hook instead.
Summary
transportClient.get()throws synchronously, the catch block completesrouteRequestFuturebut notvalueFuture, causing callers to hang forever.RetryManager.isRetryAllowed()returns false (budget exhausted),retryTaskdoes nothing andretryFutureis never completed. If the original request subsequently fails, the error retry path is also skipped becausetimeoutFuture.isDone()==true. This leavesfinalFuture(backed byallOf(original, retry)) hanging forever.retryFutureis never completed sinceretryTasknever executes.All three bugs result in
CompletableFutures that are never completed, which in sync-over-async callers leads to thread starvation and service-wide timeouts.Production evidence
Observed in a production service using Venice fast client with ~278 hosts. During a 10-minute incident window, storage node heartbeat health checks returned HTTP 500 ("hardware not healthy"), triggering
InstanceHealthMonitorto mark nodes unhealthy.Heap dump analysis
CompletableFutureinstances, of which 2,710 (53.3%) were incompleteR2TransportClientCallbackinstances — confirming no in-flight network requests. Every R2 request had completed, but the higher-level futures were abandonedEPoll.wait()R2TransportClient→DispatchingAvroSpecificStoreClient→RetriableAvroSpecificStoreClient→StatsAvroSpecificStoreClientMetrics during incident (aggregated across fleet)
long_tail_retry_requestrejected_retryretries_remaining(gauge)error_retry_requestretry_request_winTimeline reconstruction
InstanceHealthMonitormarks nodes unhealthyRetryManagertoken bucket to 0 (some hosts reached -1)isRetryAllowed()returns false →retryTaskdoes nothing →retryFuturenever completed →finalFuturehangs → caller thread permanently blockederror_retry_requestnear-zero confirms the error retry path was skipped becausetimeoutFuture.isDone()==true(long-tail had already fired)Fixes
valueFuture.completeExceptionally(e)in the catch block ofDispatchingAvroGenericStoreClient.get().elseclause inretryTaskto completeretryFutureexceptionally when the retry budget is exhausted.retryFutureexceptionally when 429 is received before the long-tail timer fires and the timeout is cancelled.Test plan
testGetValueFutureNeverCompletedWhenTransportClientThrowsSynchronouslytoDispatchingAvroGenericStoreClientTesttestFinalFutureHangsWhenOriginalFailsWith429BeforeLongTailRetrytoRetriableAvroGenericStoreClientTesttestFinalFutureHangsWhenRetryBudgetExhaustedAndOriginalFailstoRetriableAvroGenericStoreClientTesttestFinalFutureHangsWhenLongTailFiresBeforeOriginalFailsAndBudgetExhaustedtoRetriableAvroGenericStoreClientTest