MINOR: Fix testRackAwareAssignment flake#22154
Conversation
|
@squah-confluent Should we reduce the heartbeat interval to make the test faster? |
| consumer1.assignment().equals(Set.of(new TopicPartition(topic, 3), new TopicPartition(topic, 4))) && | ||
| consumer2.assignment().equals(Set.of(new TopicPartition(topic, 0), new TopicPartition(topic, 1), new TopicPartition(topic, 2))); | ||
| }, "Consumer with topic partition mapping should be 0 -> 5 | 1 -> 3, 4 | 2 -> 0, 1, 2"); | ||
| }, 30000, "Consumer with topic partition mapping should be 0 -> 5 | 1 -> 3, 4 | 2 -> 0, 1, 2"); |
There was a problem hiding this comment.
makes sense, I would only add that if this is all because the consumer may take long to do the full reconciliation flow, we could disable auto-commit that is unrelated to this test (and would add an extra step when revoking partitions)
There was a problem hiding this comment.
I think there could be more than once source of delay. For example, there could be two target assignments from the group coordinator, one halfway through the rack reassignments. And once we implement assignment offloading from KIP-1263, it can add another heartbeat interval's worth of delay to the test and we'll exceed the 15 second timeout again in the worst case. I haven't really checked how these all interact. Some of them might fold in with others.
I still think there's a case for making the test faster by disabling auto-commit, reducing the heartbeat interval and poll interval (@dajac's suggestion). Slow tests are really awful to work with as a developer.
|
Turning off auto-commit, reducing the heartbeat interval to 1,000 ms and reducing each poll to 100 ms (poll interval 300 ms overall) reduced the test runtime from 60 seconds to 20 seconds when there is no other system load. Under loaded CI conditions, with assignor offloading implemented and assignment batching disabled, the flake rate drops from 40/100 failures to 1/100 failures. In the failing run, I saw that polls were happening a few seconds apart and not the expected 300 ms. To summarize:
So I think it's worth both adjusting the intervals and raising the timout. |
The last part of testRackAwareAssignment was found to be flaky. This part moves all topic partitions to different racks and waits for consumer assignments to settle. Each of the three consumers is expected to revoke all its partitions and be assigned partitions previously held by another within a 15 second timeout. This timeout is not always sufficient. The consumer heartbeat interval is left at the default of 5,000 ms and each consumer polls every 3,000 ms. In the worst case, it takes a consumer around 7,000 ms to reconcile an assignment change. An additional 3,000 ms round of polling may be required when a consumer needs to auto-commit offsets. Two rounds of reconciliation must happen within 15,000 ms. The timeline of an example failing run looks like: -02.956 Group coordinator computes target assignment at epoch=6 consumer0=[0] consumer1=[1, 2] consumer2=[3, 4, 5] +00.000 15 second timeout starts +03.179 consumer0 heartbeats This is the first heartbeat since the rack reassignments. +03.179 Group coordinator computes target assignment at epoch=7 consumer0=[5] consumer1=[3, 4] consumer2=[0, 1, 2] +03.186 consumer0 heartbeat receives assignment [] +04.151 consumer1 starts poll() +04.877 consumer1 heartbeats +04.878 consumer1 heartbeat receives assignment [] +05.155 consumer1 ends poll() +07.259 consumer1 starts poll() +07.259 consumer1 sends auto-commit with offsets for [1, 2] +07.288 consumer1 receives auto-commit response +08.263 consumer1 ends poll() +10.379 consumer1 starts poll() +10.379 consumer1 calls onPartitionsRevoked with [1, 2] +10.379 consumer1 calls onPartitionsAssigned with [] +10.382 consumer1 heartbeats with owned partitions [] +10.387 consumer1 heartbeat receives assignment [3, 4] +10.483 consumer1 calls onPartitionsAssigned [3, 4] +10.483 consumer1 heartbeats with owned partitions [3, 4] +11.384 consumer1 ends poll() +15.000 15 second timeout elapses and the test fails +15.300 consumer2 heartbeat receives assignment [0, 1, 2] To fix the test we: * Make config changes to reduce the reconciliation time. This also reduces the test duration from 60 seconds to 20 seconds. * Disable auto-commit, since the consumers do not consume any records. * Reduce the heartbeat interval to 1,000 ms. * Reduce the poll timeouts to 100 ms, so that polls happen every 300 ms. * Raise the final timeout to 30 seconds, since under heavy CI load, the reduced intervals above aren't effective.
5568480 to
c478e92
Compare
The last part of testRackAwareAssignment was found to be flaky. This
part moves all topic partitions to different racks and waits for
consumer assignments to settle. Each of the three consumers is expected
to revoke all its partitions and be assigned partitions previously held
by another within a 15 second timeout.
This timeout is not always sufficient. The consumer heartbeat interval
is left at the default of 5,000 ms and each consumer polls every
3,000 ms. In the worst case, it takes a consumer around 7,000 ms to
reconcile an assignment change. An additional 3,000 ms round of polling
may be required when a consumer needs to auto-commit offsets. Two rounds
of reconciliation must happen within 15,000 ms.
The timeline of an example failing run looks like:
-02.956 Group coordinator computes target assignment at epoch=6
consumer0=[0] consumer1=[1, 2] consumer2=[3, 4, 5]
+00.000 15 second timeout starts
+03.179 consumer0 heartbeats This is the first heartbeat since
the rack reassignments. +03.179 Group coordinator computes target
assignment at epoch=7 consumer0=[5] consumer1=[3, 4]
consumer2=[0, 1, 2] +03.186 consumer0 heartbeat receives assignment []
+04.151 consumer1 starts poll() +04.877 consumer1 heartbeats +04.878
consumer1 heartbeat receives assignment [] +05.155 consumer1 ends
poll()
+07.259 consumer1 starts poll() +07.259 consumer1 sends auto-commit
with offsets for [1, 2] +07.288 consumer1 receives auto-commit
response +08.263 consumer1 ends poll()
+10.379 consumer1 starts poll() +10.379 consumer1 calls
onPartitionsRevoked with [1, 2] +10.379 consumer1 calls
onPartitionsAssigned with [] +10.382 consumer1 heartbeats with
owned partitions [] +10.387 consumer1 heartbeat receives assignment
[3, 4] +10.483 consumer1 calls onPartitionsAssigned [3, 4] +10.483
consumer1 heartbeats with owned partitions [3, 4] +11.384 consumer1
ends poll()
+15.000 15 second timeout elapses and the test fails +15.300 consumer2
heartbeat receives assignment [0, 1, 2]
To fix the test we:
reduces the test duration from 60 seconds to 20 seconds.
300 ms.
reduced intervals above aren't effective.
Reviewers: Lianet Magrans lmagrans@confluent.io, David Jacot
djacot@confluent.io