MINOR: Fix testRackAwareAssignment flake by squah-confluent · Pull Request #22154 · apache/kafka

squah-confluent · 2026-04-26T20:46:30Z

The last part of testRackAwareAssignment was found to be flaky. This
part moves all topic partitions to different racks and waits for
consumer assignments to settle. Each of the three consumers is expected
to revoke all its partitions and be assigned partitions previously held
by another within a 15 second timeout.

This timeout is not always sufficient. The consumer heartbeat interval
is left at the default of 5,000 ms and each consumer polls every
3,000 ms. In the worst case, it takes a consumer around 7,000 ms to
reconcile an assignment change. An additional 3,000 ms round of polling
may be required when a consumer needs to auto-commit offsets. Two rounds
of reconciliation must happen within 15,000 ms.

The timeline of an example failing run looks like:

-02.956 Group coordinator computes target assignment at epoch=6
consumer0=[0] consumer1=[1, 2] consumer2=[3, 4, 5]
+00.000 15 second timeout starts

+03.179 consumer0 heartbeats This is the first heartbeat since
the rack reassignments. +03.179 Group coordinator computes target
assignment at epoch=7 consumer0=[5] consumer1=[3, 4]
consumer2=[0, 1, 2] +03.186 consumer0 heartbeat receives assignment []

+04.151 consumer1 starts poll() +04.877 consumer1 heartbeats +04.878
consumer1 heartbeat receives assignment [] +05.155 consumer1 ends
poll()

+07.259 consumer1 starts poll() +07.259 consumer1 sends auto-commit
with offsets for [1, 2] +07.288 consumer1 receives auto-commit
response +08.263 consumer1 ends poll()

+10.379 consumer1 starts poll() +10.379 consumer1 calls
onPartitionsRevoked with [1, 2] +10.379 consumer1 calls
onPartitionsAssigned with [] +10.382 consumer1 heartbeats with
owned partitions [] +10.387 consumer1 heartbeat receives assignment
[3, 4] +10.483 consumer1 calls onPartitionsAssigned [3, 4] +10.483
consumer1 heartbeats with owned partitions [3, 4] +11.384 consumer1
ends poll()

+15.000 15 second timeout elapses and the test fails +15.300 consumer2
heartbeat receives assignment [0, 1, 2]

To fix the test we:

Make config changes to reduce the reconciliation time. This also
reduces the test duration from 60 seconds to 20 seconds.
- Disable auto-commit, since the consumers do not consume any records.
- Reduce the heartbeat interval to 1,000 ms.
- Reduce the poll timeouts to 100 ms, so that polls happen every
  300 ms.
Raise the final timeout to 30 seconds, since under heavy CI load, the
reduced intervals above aren't effective.

Reviewers: Lianet Magrans lmagrans@confluent.io, David Jacot
djacot@confluent.io

dajac · 2026-04-27T06:13:20Z

@squah-confluent Should we reduce the heartbeat interval to make the test faster?

lianetm · 2026-04-27T13:15:09Z

                    consumer1.assignment().equals(Set.of(new TopicPartition(topic, 3), new TopicPartition(topic, 4))) &&
                    consumer2.assignment().equals(Set.of(new TopicPartition(topic, 0), new TopicPartition(topic, 1), new TopicPartition(topic, 2)));
-            }, "Consumer with topic partition mapping should be 0 -> 5 | 1 -> 3, 4 | 2 -> 0, 1, 2");
+            }, 30000, "Consumer with topic partition mapping should be 0 -> 5 | 1 -> 3, 4 | 2 -> 0, 1, 2");


makes sense, I would only add that if this is all because the consumer may take long to do the full reconciliation flow, we could disable auto-commit that is unrelated to this test (and would add an extra step when revoking partitions)

I think there could be more than once source of delay. For example, there could be two target assignments from the group coordinator, one halfway through the rack reassignments. And once we implement assignment offloading from KIP-1263, it can add another heartbeat interval's worth of delay to the test and we'll exceed the 15 second timeout again in the worst case. I haven't really checked how these all interact. Some of them might fold in with others.

I still think there's a case for making the test faster by disabling auto-commit, reducing the heartbeat interval and poll interval (@dajac's suggestion). Slow tests are really awful to work with as a developer.

squah-confluent · 2026-04-27T16:19:01Z

Turning off auto-commit, reducing the heartbeat interval to 1,000 ms and reducing each poll to 100 ms (poll interval 300 ms overall) reduced the test runtime from 60 seconds to 20 seconds when there is no other system load.

Under loaded CI conditions, with assignor offloading implemented and assignment batching disabled, the flake rate drops from 40/100 failures to 1/100 failures. In the failing run, I saw that polls were happening a few seconds apart and not the expected 300 ms.

To summarize:

I believe reducing the heartbeat interval, etc improves the failure rate and test runtime a lot.
But when the machine running the test is overloaded, we can still see flakes.

So I think it's worth both adjusting the intervals and raising the timout.

The last part of testRackAwareAssignment was found to be flaky. This part moves all topic partitions to different racks and waits for consumer assignments to settle. Each of the three consumers is expected to revoke all its partitions and be assigned partitions previously held by another within a 15 second timeout. This timeout is not always sufficient. The consumer heartbeat interval is left at the default of 5,000 ms and each consumer polls every 3,000 ms. In the worst case, it takes a consumer around 7,000 ms to reconcile an assignment change. An additional 3,000 ms round of polling may be required when a consumer needs to auto-commit offsets. Two rounds of reconciliation must happen within 15,000 ms. The timeline of an example failing run looks like: -02.956 Group coordinator computes target assignment at epoch=6 consumer0=[0] consumer1=[1, 2] consumer2=[3, 4, 5] +00.000 15 second timeout starts +03.179 consumer0 heartbeats This is the first heartbeat since the rack reassignments. +03.179 Group coordinator computes target assignment at epoch=7 consumer0=[5] consumer1=[3, 4] consumer2=[0, 1, 2] +03.186 consumer0 heartbeat receives assignment [] +04.151 consumer1 starts poll() +04.877 consumer1 heartbeats +04.878 consumer1 heartbeat receives assignment [] +05.155 consumer1 ends poll() +07.259 consumer1 starts poll() +07.259 consumer1 sends auto-commit with offsets for [1, 2] +07.288 consumer1 receives auto-commit response +08.263 consumer1 ends poll() +10.379 consumer1 starts poll() +10.379 consumer1 calls onPartitionsRevoked with [1, 2] +10.379 consumer1 calls onPartitionsAssigned with [] +10.382 consumer1 heartbeats with owned partitions [] +10.387 consumer1 heartbeat receives assignment [3, 4] +10.483 consumer1 calls onPartitionsAssigned [3, 4] +10.483 consumer1 heartbeats with owned partitions [3, 4] +11.384 consumer1 ends poll() +15.000 15 second timeout elapses and the test fails +15.300 consumer2 heartbeat receives assignment [0, 1, 2] To fix the test we: * Make config changes to reduce the reconciliation time. This also reduces the test duration from 60 seconds to 20 seconds. * Disable auto-commit, since the consumers do not consume any records. * Reduce the heartbeat interval to 1,000 ms. * Reduce the poll timeouts to 100 ms, so that polls happen every 300 ms. * Raise the final timeout to 30 seconds, since under heavy CI load, the reduced intervals above aren't effective.

lianetm

Thanks, LGTM.
Will just wait in case @dajac has any other comments?

--update
synced offline, ok to merge

github-actions Bot added triage PRs from the community tests Test fixes (including flaky tests) clients small Small PRs labels Apr 26, 2026

dajac added ci-approved and removed triage PRs from the community labels Apr 27, 2026

dajac requested a review from lianetm April 27, 2026 06:11

lianetm reviewed Apr 27, 2026

View reviewed changes

airlock-confluentinc Bot force-pushed the squah-fix-testrackawareassignment-flake branch from 5568480 to c478e92 Compare April 27, 2026 16:27

lianetm approved these changes Apr 28, 2026

View reviewed changes

lianetm merged commit 90a4c93 into apache:trunk Apr 28, 2026
24 checks passed

lianetm deleted the squah-fix-testrackawareassignment-flake branch April 28, 2026 14:16

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MINOR: Fix testRackAwareAssignment flake#22154

MINOR: Fix testRackAwareAssignment flake#22154
lianetm merged 1 commit intoapache:trunkfrom
confluentinc:squah-fix-testrackawareassignment-flake

squah-confluent commented Apr 26, 2026 •

edited by github-actions Bot

Loading

Uh oh!

dajac commented Apr 27, 2026

Uh oh!

lianetm Apr 27, 2026

Uh oh!

squah-confluent Apr 27, 2026 •

edited

Loading

Uh oh!

squah-confluent commented Apr 27, 2026

Uh oh!

lianetm left a comment •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

squah-confluent commented Apr 26, 2026 • edited by github-actions Bot Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

dajac commented Apr 27, 2026

Uh oh!

lianetm Apr 27, 2026

Choose a reason for hiding this comment

Uh oh!

squah-confluent Apr 27, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

squah-confluent commented Apr 27, 2026

Uh oh!

lianetm left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

squah-confluent commented Apr 26, 2026 •

edited by github-actions Bot

Loading

squah-confluent Apr 27, 2026 •

edited

Loading

lianetm left a comment •

edited

Loading