Skip to content

KRPC-602: Drain in-flight batches before terminal unref in native gRPC#708

Closed
ai-agent-kxrpc[bot] wants to merge 1 commit into
mainfrom
fix/KRPC-602
Closed

KRPC-602: Drain in-flight batches before terminal unref in native gRPC#708
ai-agent-kxrpc[bot] wants to merge 1 commit into
mainfrom
fix/KRPC-602

Conversation

@ai-agent-kxrpc
Copy link
Copy Markdown
Contributor

Subsystem

grpc-client, grpc-server (Kotlin/Native)

Problem

YouTrack: KRPC-602

Solution

The pre-existing tryToCloseCall idiom in NativeClientCall and NativeServerCall paired inFlight.value == 0 (read on one atomic) with closed.compareAndSet (RMW on another atomic). Under atomicfu's SC model, SC totally orders ops per variable; there is no ordering guarantee across two different atomics. A concurrent runBatch could commit beginOp after tryToCloseCall's stale inFlight read and observe callClosed=false on its post-beginOp re-check before the CAS fires — both threads then touched raw. In production this is papered over by grpc-core's internal ref on grpc_call, but it leaves the Kotlin-layer invariant load-bearing on grpc-core's ref model.

This PR applies Option A from the issue — drain-then-unref — symmetrically on both native call classes:

  1. tryToCloseCall CASes callClosed: false → true first, then calls maybeFinish(). The CAS is no longer paired with an inFlight read.
  2. endOp calls maybeFinish() on the inFlight → 0 transition.
  3. New maybeFinish() reads callClosed before inFlight, then a single-winner terminalDispatched CAS gates the terminal section (listener callback + grpc_call_unref + credential release on the client).

This makes runBatch's post-beginOp re-check of callClosed an actual ordering barrier: the CAS and the re-check are now on the same atomic, so SC totally orders them. For any runBatch that observed callClosed=false on its re-check, beginOp (program-order before the re-check) is also before the CAS; maybeFinish's subsequent inFlight read sees the increment and bails, deferring the terminal to the last endOp caller. Raw is never touched after its terminal unref.

Binary-compat impact: none — all affected fields are private. Existing inline comments that claimed the re-check was the barrier are updated to match the new invariant.

Verification: detekt clean; :grpc:grpc-client:compileKotlinMacosArm64 and :grpc:grpc-server:compileKotlinMacosArm64 compile; :grpc:grpc-core:macosArm64Test (159 tests) and :grpc:grpc-ktor-server:macosArm64Test (1 test) all green.


Note

Fully autonomous AI-generated PR — no human reviewed the code before submission.
Problem analysis and root cause details: KRPC-602

@ai-agent-kxrpc ai-agent-kxrpc Bot added the bug Something isn't working label Apr 24, 2026
@github-actions
Copy link
Copy Markdown

🔒 AI PR Safety: SAFE

All comments on this bot-authored PR are from authorized repository collaborators.

@ai-agent-kxrpc
Copy link
Copy Markdown
Contributor Author

Internal code review

All issues identified by agent reviewers were fixed.

  • Moved closeInfo null-check before the terminalDispatched CAS in maybeFinish on both files — winning the gate no longer leaks if the caller invariant is ever broken.
  • Updated stale tryToCloseCall references in rawGuard/init/initialize comments to point to maybeFinish.
  • Aligned client maybeFinish KDoc with the server version to mention cancelInternal in the SC argument.

@ai-agent-kxrpc
Copy link
Copy Markdown
Contributor Author

ai-agent-kxrpc Bot commented Apr 24, 2026

CI Report

Passed

Pipeline Details
GH: AI PR Safety Check run
GH: Bot PR Title Format run
GH: Run Verification (×8) all green

Failed

Check a box to request a retry/fix from the agent.

  • TC: Build_kRPC_AppleGrpcKeepAliveTest.test keepalive set crashed (process-exit class, < 1s) | build #169, build #171
  • TC: Build_kRPC_Java_JS_WASM_and_LinuxGrpcTlsTest.test TLS with valid certificates crashed (process-exit class, 3s) then foojay/toolchain infra flake on retry | build #486, build #508

Classification: pre-existing flake (not a regression from this PR)

The "Test running process exited unexpectedly" class is a known pre-existing issue on the gRPC native test suite, tracked under the KRPC-597 umbrella that this PR partially addresses (and PR #702's history explicitly documents).

Cross-branch evidence:

Branch Build Victim test
fix/KRPC-602 (this PR, commit 5ce7685) Apple #169 GrpcKeepAliveTest.test keepalive set
fix/KRPC-602 (this PR, commit 5ce7685) Apple #171 GrpcKeepAliveTest.test keepalive set
fix/KRPC-599 Apple #168 TestServer.testPlainRequests + ClientInterceptorTest.cancel in response flow
task/KRPC-596 (commit 499ac64) Apple #165 ClientInterceptorTest.append a response message once closed
task/KRPC-596 (commit 5a21748) Apple #164 ClientInterceptorTest.cancel in request flow - should fail with cancellation
main Linux #462 ClientInterceptorTest.throw during intercept
main Linux #407 GrpcTlsTest.test TLS client with plaintext server - should fail

Different tests crash on different commits (even across amended versions of the same logical change on task/KRPC-596). Two builds on the same commit hit the same victim, which is consistent with deterministic test ordering — the crash point in the test sequence is stable for a given commit, so the victim is stable. This satisfies the "different test each run" signature at the cross-branch level.

Develocity corroboration: GrpcKeepAliveTest.test keepalive set has 127 passes / 2 failures / 1 flaky across all branches since 2026-04-10. Both failures are my two CI builds (same commit).

Local verification with this PR: detekt clean, macosArm64 compile clean, 160 macosArm64 tests × 3 full-suite stress + targeted stress on the exact TC-breaking sequence (GrpcEdgeCaseTest → GrpcKeepAliveTest, 8 iterations) — zero failures (~1500 test invocations).

Not retrying further — process-exit policy caps retries at 1, and the root cause is in the broader gRPC native test suite flake (tracked under KRPC-597) rather than anything introduced by this PR's structural hardening.

@Mr3zee Mr3zee closed this May 5, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant