Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky tests fromSegmentReplicationAllocationIT #17429

Merged

Conversation

ltaragi
Copy link
Contributor

@ltaragi ltaragi commented Feb 23, 2025

Description

testAllocationAndRebalanceWithDisruption

  • In the original test, we needed to make sure that the cluster had more no. of data nodes than the number of shard copies but it was not actually a guarantee leading to the flaky nature of tests.
  • We are now pre-generating the random index/shard counts and then spawning the data nodes accordingly.
  • Also decreased the upper limit on number of indices because original limit would have caused spawning of as high as 60 data nodes (10 * 2 * (1 + 2)) which was causing transport related failures unrelated to the scope of the test.
  • Tested the new changes with ~3.4K consecutive IT runs without any failures
BUILD SUCCESSFUL in 18s
60 actionable tasks: 1 executed, 59 up-to-date
[2025-02-22 23:19:08] : ===================================================================
[2025-02-22 23:19:08] : counter=3412
[2025-02-22 23:19:08] : ===================================================================

testSingleIndexShardAllocation

  • Based on the failure logs from recent PR builds (examples), we can see that the test used to fail occasionally with
java.lang.AssertionError: timed out waiting for green state
	at __randomizedtesting.SeedInfo.seed([7386032F53A52312:A9C55268C678FC7D]:0)
	at org.junit.Assert.fail(Assert.java:89)
	at org.opensearch.test.OpenSearchIntegTestCase.ensureColor(OpenSearchIntegTestCase.java:981)
	at org.opensearch.test.OpenSearchIntegTestCase.ensureColor(OpenSearchIntegTestCase.java:933)
	at org.opensearch.test.OpenSearchIntegTestCase.ensureGreen(OpenSearchIntegTestCase.java:898)
	at org.opensearch.indices.replication.SegmentReplicationAllocationIT.testSingleIndexShardAllocation(SegmentReplicationAllocationIT.java:179)
  • The test creates an index with 50 shards, each with 1 replica. Thus upon removal/addition of a node in the test, the relocations end up taking longer than the previously stipulated 60 seconds:
ensureGreen(TimeValue.timeValueSeconds(60));
  • Confirmed this by lowering the await time and the test started failing more frequently.
  • As a fix we have increased the the await time and also added an extra check for cluster's stability:
internalCluster().validateClusterFormed();
ensureGreen(TimeValue.timeValueSeconds(100));
  • Tested the new changes with ~2.5K+ consecutive IT runs without any failures
BUILD SUCCESSFUL in 27s
60 actionable tasks: 1 executed, 59 up-to-date
[2025-02-24 23:44:43] : ===================================================================
[2025-02-24 23:44:43] : counter=2506
[2025-02-24 23:44:43] : ===================================================================

Related Issues

Resolves #14327

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Copy link
Contributor

✅ Gradle check result for b799100: SUCCESS

Copy link

codecov bot commented Feb 23, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 72.45%. Comparing base (5666982) to head (0651098).
Report is 7 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff              @@
##               main   #17429      +/-   ##
============================================
- Coverage     72.49%   72.45%   -0.05%     
+ Complexity    65717    65697      -20     
============================================
  Files          5303     5304       +1     
  Lines        304793   304981     +188     
  Branches      44202    44238      +36     
============================================
+ Hits         220966   220979      +13     
- Misses        65693    65881     +188     
+ Partials      18134    18121      -13     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@ltaragi ltaragi changed the title Flaky segment replication allocation it Fix flaky test SegmentReplicationAllocationIT.testAllocationAndRebalanceWithDisruption Feb 24, 2025
@github-actions github-actions bot added >test-failure Test failure from CI, local build, etc. autocut flaky-test Random test failure that succeeds on second run good first issue Good for newcomers Storage Issues and PRs relating to data and metadata storage Storage:Remote labels Feb 24, 2025
@ltaragi ltaragi marked this pull request as ready for review February 24, 2025 08:56
@ltaragi ltaragi force-pushed the flaky-SegmentReplicationAllocationIT branch from b799100 to 0651098 Compare February 25, 2025 08:14
@ltaragi ltaragi changed the title Fix flaky test SegmentReplicationAllocationIT.testAllocationAndRebalanceWithDisruption Fix flaky tests fromSegmentReplicationAllocationIT Feb 25, 2025
Copy link
Contributor

❕ Gradle check result for 0651098: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.remotestore.multipart.RemoteStoreMultipartIT.testCloseIndexWithNoOpSyncAndFlushForAsyncTranslog

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@ltaragi
Copy link
Contributor Author

ltaragi commented Feb 25, 2025

❕ Gradle check result for 0651098: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.remotestore.multipart.RemoteStoreMultipartIT.testCloseIndexWithNoOpSyncAndFlushForAsyncTranslog

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Flaky test from #16145

@andrross andrross merged commit e397903 into opensearch-project:main Feb 25, 2025
33 checks passed
@andrross andrross added the backport 2.x Backport to 2.x branch label Feb 25, 2025
opensearch-trigger-bot bot pushed a commit that referenced this pull request Feb 25, 2025
* Fix flaky tests in SegmentReplicationAllocationIT

Signed-off-by: Lakshya Taragi <[email protected]>

* Remove extra logs

Signed-off-by: Lakshya Taragi <[email protected]>

* Account for replicas as well

Signed-off-by: Lakshya Taragi <[email protected]>

* Reduce upper limit on no. of indices

Signed-off-by: Lakshya Taragi <[email protected]>

* Only verified changes

Signed-off-by: Lakshya Taragi <[email protected]>

* Fix testSingleIndexShardAllocation

Signed-off-by: Lakshya Taragi <[email protected]>

---------

Signed-off-by: Lakshya Taragi <[email protected]>
(cherry picked from commit e397903)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
vinaykpud pushed a commit to vinaykpud/OpenSearch that referenced this pull request Feb 26, 2025
…ect#17429)

* Fix flaky tests in SegmentReplicationAllocationIT

Signed-off-by: Lakshya Taragi <[email protected]>

* Remove extra logs

Signed-off-by: Lakshya Taragi <[email protected]>

* Account for replicas as well

Signed-off-by: Lakshya Taragi <[email protected]>

* Reduce upper limit on no. of indices

Signed-off-by: Lakshya Taragi <[email protected]>

* Only verified changes

Signed-off-by: Lakshya Taragi <[email protected]>

* Fix testSingleIndexShardAllocation

Signed-off-by: Lakshya Taragi <[email protected]>

---------

Signed-off-by: Lakshya Taragi <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
autocut backport 2.x Backport to 2.x branch flaky-test Random test failure that succeeds on second run good first issue Good for newcomers skip-changelog Storage:Remote Storage Issues and PRs relating to data and metadata storage >test-failure Test failure from CI, local build, etc.
Projects
Status: ✅ Done
Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for SegmentReplicationAllocationIT
2 participants