Skip to content

MINOR: Fix testRackAwareAssignment flake#22154

Merged
lianetm merged 1 commit intoapache:trunkfrom
confluentinc:squah-fix-testrackawareassignment-flake
Apr 28, 2026
Merged

MINOR: Fix testRackAwareAssignment flake#22154
lianetm merged 1 commit intoapache:trunkfrom
confluentinc:squah-fix-testrackawareassignment-flake

Conversation

@squah-confluent
Copy link
Copy Markdown
Contributor

@squah-confluent squah-confluent commented Apr 26, 2026

The last part of testRackAwareAssignment was found to be flaky. This
part moves all topic partitions to different racks and waits for
consumer assignments to settle. Each of the three consumers is expected
to revoke all its partitions and be assigned partitions previously held
by another within a 15 second timeout.

This timeout is not always sufficient. The consumer heartbeat interval
is left at the default of 5,000 ms and each consumer polls every
3,000 ms. In the worst case, it takes a consumer around 7,000 ms to
reconcile an assignment change. An additional 3,000 ms round of polling
may be required when a consumer needs to auto-commit offsets. Two rounds
of reconciliation must happen within 15,000 ms.

The timeline of an example failing run looks like:

-02.956 Group coordinator computes target assignment at epoch=6
consumer0=[0] consumer1=[1, 2] consumer2=[3, 4, 5]
+00.000 15 second timeout starts

+03.179 consumer0 heartbeats This is the first heartbeat since
the rack reassignments. +03.179 Group coordinator computes target
assignment at epoch=7 consumer0=[5] consumer1=[3, 4]
consumer2=[0, 1, 2] +03.186 consumer0 heartbeat receives assignment []

+04.151 consumer1 starts poll() +04.877 consumer1 heartbeats +04.878
consumer1 heartbeat receives assignment [] +05.155 consumer1 ends
poll()

+07.259 consumer1 starts poll() +07.259 consumer1 sends auto-commit
with offsets for [1, 2] +07.288 consumer1 receives auto-commit
response +08.263 consumer1 ends poll()

+10.379 consumer1 starts poll() +10.379 consumer1 calls
onPartitionsRevoked with [1, 2] +10.379 consumer1 calls
onPartitionsAssigned with [] +10.382 consumer1 heartbeats with
owned partitions [] +10.387 consumer1 heartbeat receives assignment
[3, 4] +10.483 consumer1 calls onPartitionsAssigned [3, 4] +10.483
consumer1 heartbeats with owned partitions [3, 4] +11.384 consumer1
ends poll()

+15.000 15 second timeout elapses and the test fails +15.300 consumer2
heartbeat receives assignment [0, 1, 2]

To fix the test we:

  • Make config changes to reduce the reconciliation time. This also
    reduces the test duration from 60 seconds to 20 seconds.
    • Disable auto-commit, since the consumers do not consume any records.
    • Reduce the heartbeat interval to 1,000 ms.
    • Reduce the poll timeouts to 100 ms, so that polls happen every
      300 ms.
  • Raise the final timeout to 30 seconds, since under heavy CI load, the
    reduced intervals above aren't effective.

Reviewers: Lianet Magrans lmagrans@confluent.io, David Jacot
djacot@confluent.io

@github-actions github-actions Bot added triage PRs from the community tests Test fixes (including flaky tests) clients small Small PRs labels Apr 26, 2026
@dajac dajac added ci-approved and removed triage PRs from the community labels Apr 27, 2026
@dajac dajac requested a review from lianetm April 27, 2026 06:11
@dajac
Copy link
Copy Markdown
Member

dajac commented Apr 27, 2026

@squah-confluent Should we reduce the heartbeat interval to make the test faster?

consumer1.assignment().equals(Set.of(new TopicPartition(topic, 3), new TopicPartition(topic, 4))) &&
consumer2.assignment().equals(Set.of(new TopicPartition(topic, 0), new TopicPartition(topic, 1), new TopicPartition(topic, 2)));
}, "Consumer with topic partition mapping should be 0 -> 5 | 1 -> 3, 4 | 2 -> 0, 1, 2");
}, 30000, "Consumer with topic partition mapping should be 0 -> 5 | 1 -> 3, 4 | 2 -> 0, 1, 2");
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, I would only add that if this is all because the consumer may take long to do the full reconciliation flow, we could disable auto-commit that is unrelated to this test (and would add an extra step when revoking partitions)

Copy link
Copy Markdown
Contributor Author

@squah-confluent squah-confluent Apr 27, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there could be more than once source of delay. For example, there could be two target assignments from the group coordinator, one halfway through the rack reassignments. And once we implement assignment offloading from KIP-1263, it can add another heartbeat interval's worth of delay to the test and we'll exceed the 15 second timeout again in the worst case. I haven't really checked how these all interact. Some of them might fold in with others.

I still think there's a case for making the test faster by disabling auto-commit, reducing the heartbeat interval and poll interval (@dajac's suggestion). Slow tests are really awful to work with as a developer.

@squah-confluent
Copy link
Copy Markdown
Contributor Author

Turning off auto-commit, reducing the heartbeat interval to 1,000 ms and reducing each poll to 100 ms (poll interval 300 ms overall) reduced the test runtime from 60 seconds to 20 seconds when there is no other system load.

Under loaded CI conditions, with assignor offloading implemented and assignment batching disabled, the flake rate drops from 40/100 failures to 1/100 failures. In the failing run, I saw that polls were happening a few seconds apart and not the expected 300 ms.

To summarize:

  • I believe reducing the heartbeat interval, etc improves the failure rate and test runtime a lot.
  • But when the machine running the test is overloaded, we can still see flakes.

So I think it's worth both adjusting the intervals and raising the timout.

The last part of testRackAwareAssignment was found to be flaky. This
part moves all topic partitions to different racks and waits for
consumer assignments to settle. Each of the three consumers is expected
to revoke all its partitions and be assigned partitions previously held
by another within a 15 second timeout.

This timeout is not always sufficient. The consumer heartbeat interval
is left at the default of 5,000 ms and each consumer polls every
3,000 ms. In the worst case, it takes a consumer around 7,000 ms to
reconcile an assignment change. An additional 3,000 ms round of polling
may be required when a consumer needs to auto-commit offsets. Two rounds
of reconciliation must happen within 15,000 ms.

The timeline of an example failing run looks like:

-02.956 Group coordinator computes target assignment at epoch=6
        consumer0=[0] consumer1=[1, 2] consumer2=[3, 4, 5]
+00.000 15 second timeout starts

+03.179 consumer0 heartbeats
        This is the first heartbeat since the rack reassignments.
+03.179 Group coordinator computes target assignment at epoch=7
        consumer0=[5] consumer1=[3, 4] consumer2=[0, 1, 2]
+03.186 consumer0 heartbeat receives assignment []

+04.151 consumer1 starts poll()
+04.877 consumer1 heartbeats
+04.878 consumer1 heartbeat receives assignment []
+05.155 consumer1 ends poll()

+07.259 consumer1 starts poll()
+07.259     consumer1 sends auto-commit with offsets for [1, 2]
+07.288     consumer1 receives auto-commit response
+08.263 consumer1 ends poll()

+10.379 consumer1 starts poll()
+10.379     consumer1 calls onPartitionsRevoked with [1, 2]
+10.379     consumer1 calls onPartitionsAssigned with []
+10.382     consumer1 heartbeats with owned partitions []
+10.387     consumer1 heartbeat receives assignment [3, 4]
+10.483     consumer1 calls onPartitionsAssigned [3, 4]
+10.483     consumer1 heartbeats with owned partitions [3, 4]
+11.384 consumer1 ends poll()

+15.000 15 second timeout elapses and the test fails
+15.300 consumer2 heartbeat receives assignment [0, 1, 2]

To fix the test we:
* Make config changes to reduce the reconciliation time. This also
  reduces the test duration from 60 seconds to 20 seconds.
  * Disable auto-commit, since the consumers do not consume any records.
  * Reduce the heartbeat interval to 1,000 ms.
  * Reduce the poll timeouts to 100 ms, so that polls happen every
    300 ms.
* Raise the final timeout to 30 seconds, since under heavy CI load, the
  reduced intervals above aren't effective.
@airlock-confluentinc airlock-confluentinc Bot force-pushed the squah-fix-testrackawareassignment-flake branch from 5568480 to c478e92 Compare April 27, 2026 16:27
Copy link
Copy Markdown
Member

@lianetm lianetm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM.
Will just wait in case @dajac has any other comments?

--update
synced offline, ok to merge

@lianetm lianetm merged commit 90a4c93 into apache:trunk Apr 28, 2026
24 checks passed
@lianetm lianetm deleted the squah-fix-testrackawareassignment-flake branch April 28, 2026 14:16
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ci-approved clients small Small PRs tests Test fixes (including flaky tests)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants