Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix freeze and errors in tests/test_oml/test_ddp #597

Merged
merged 13 commits into from
Jun 23, 2024
Merged

Conversation

DaloroAT
Copy link
Collaborator

@DaloroAT DaloroAT commented Jun 16, 2024

  • I've checked contribution guide.

There were problems that DDP tests in tests/test_oml/test_ddp would either freeze or crash with an error.

I made the following changes:

  1. The file for communication between ranks is formed based on the test case (name of the function being tested, Python version, a hash of case argument values), which is unique and known to all ranks. Previously, the file had the same name and we hoped it would be deleted in time or unused.
  2. Increased timeout for DDP communication. The previous 10 seconds may not be enough for DDP tests on a matrix of 4 versions of pythons.
  3. I discovered that if you run only DDP tests, they almost always freeze. Most likely, CI machines on GitHub cannot handle such a load. This does not happen if you force the number of threads to be 1 for torch.

Runs with DDP only after changes (pytest -sv tests/test_oml/test_ddp hardcoded for short_tests):

@DaloroAT DaloroAT marked this pull request as draft June 16, 2024 11:27
@DaloroAT DaloroAT self-assigned this Jun 22, 2024
@DaloroAT DaloroAT changed the title DDP tests errors Fix freeze and errors in ests/test_oml/test_ddp Jun 22, 2024
@DaloroAT DaloroAT changed the title Fix freeze and errors in ests/test_oml/test_ddp Fix freeze and errors in tests/test_oml/test_ddp Jun 22, 2024
@DaloroAT DaloroAT requested a review from AlekseySh June 22, 2024 18:27
@DaloroAT DaloroAT marked this pull request as ready for review June 22, 2024 18:27
Copy link
Contributor

@AlekseySh AlekseySh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Amazing! The problem was sooo annoying! Thank you :)

@AlekseySh AlekseySh merged commit d680ec6 into main Jun 23, 2024
8 checks passed
@AlekseySh AlekseySh deleted the check_ddp_errors branch June 23, 2024 17:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants