
Feature request: support for exporting environment variables with parallel launchers #3207


Open
casparvl opened this issue May 29, 2024 · 6 comments · May be fixed by #3464


casparvl commented May 29, 2024

Some software requires certain environment variables to be set in order to run - e.g. PyTorch's distributed framework requires MASTER_PORT (among others). As discussed on Slack, this is currently challenging if the test developer doesn't know in advance which launcher is configured.

I.e. if we know that OpenMPI's mpirun will be the launcher, we can do

self.job.launcher.options = ['-x MASTER_PORT']

But if we are writing a test meant to be reused (e.g. a test for the hpctestlib), it would be nice to have a launcher-agnostic way of specifying this. E.g.

self.env_vars['MASTER_PORT'] = '1234'
self.job.launcher.export_var = ['MASTER_PORT']

or

self.job.launcher.export_var['MASTER_PORT'] = '1234'

(the second is probably more convenient, but I'm not sure which API is easiest to support on the ReFrame side).

ReFrame would then abstract how each particular launcher exports environment variables. E.g. for OpenMPI, the ReFrame backend would add -x MASTER_PORT=1234 as an extra launcher argument, whereas for srun it would add --export=MASTER_PORT=1234.

Note that right now, I worked around this issue by making a wrapper shell script that sets the environment variables, similar to what is used here by CSCS in their PyTorch test.
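
For reference, here is a minimal sketch of that workaround as a ReFrame test; the class name, port value and benchmark script are placeholders, not the actual CSCS test:

```python
import reframe as rfm


@rfm.simple_test
class PyTorchDistributedTest(rfm.RunOnlyRegressionTest):
    valid_systems = ['*']
    valid_prog_environs = ['*']
    # Generate the wrapper on the head node before the launcher runs.
    # The quoted 'EOF' prevents variable expansion at generation time.
    prerun_cmds = [
        "cat > wrapper.sh << 'EOF'",
        '#!/bin/bash',
        'export MASTER_PORT=1234',
        'exec python pytorch_benchmark.py',
        'EOF',
        'chmod +x wrapper.sh',
    ]
    # The launcher starts the wrapper (e.g. 'mpirun ./wrapper.sh'), so every
    # rank exports the variable itself before exec'ing the real program.
    executable = './wrapper.sh'
```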

casparvl changed the title from "Feature request: support for exporting environment variables" to "Feature request: support for exporting environment variables with parallel launchers" on May 29, 2024
vkarak moved this to Todo in ReFrame Backlog on Jun 3, 2024

vkarak commented Jun 3, 2024

I think that

self.job.launcher.env_vars = {'MASTER_PORT': '1234'}

or

self.job.launcher.env_vars['MASTER_PORT'] = '1234'

is the best, as it matches the syntax of the test's env_vars.

vkarak added this to the ReFrame 4.8 milestone on Nov 20, 2024

vkarak commented Feb 15, 2025

@casparvl I guess that your suggestion is for cases where the scheduler does not forward the environment to the compute nodes, right? If so, maybe we would just need a configuration parameter telling ReFrame to export the test's env_vars when launching the job. Do you see a case where you would need to pass a different set of environment variables to the launcher?


casparvl commented Mar 5, 2025

> @casparvl I guess that your suggestion is for cases where the scheduler does not forward the environment to the compute nodes, right?

Hmm, that was actually not my problem. What you're talking about (I think) is exporting variables from the submission environment to the batch job. What I'm talking about is exporting the environment from the head node of the job's allocation to the processes launched by a parallel launcher like srun or mpirun. The trouble is that there is no launcher-agnostic way to do that, and it would be nice if ReFrame abstracted it away, so that we could define e.g. in the test:

self.job.launcher.env_vars['MASTER_PORT'] = '1234'
self.executable = 'python'
self.executable_opts = ['pytorch_benchmark.py']

Then, when configured with e.g. mpirun as the parallel launcher, ReFrame would create the following job script:

#SBATCH ...
module load ...
mpirun -x MASTER_PORT=1234 python pytorch_benchmark.py

But, when configured with srun as the parallel launcher, ReFrame would create:

#SBATCH ...
module load ...
srun --export=MASTER_PORT=1234 python pytorch_benchmark.py

I.e. in that way, we can write tests that tell the parallel launcher to export an environment variable to the parallel processes being launched, without the test developer having to know which parallel launcher will be used.
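
To make the intended abstraction concrete, here is a rough sketch of the translation ReFrame could perform internally; the function name and the mapping table are hypothetical, not existing ReFrame API:

```python
def launcher_export_options(launcher_name, env_vars):
    '''Translate {name: value} pairs into launcher-specific export flags.'''
    if launcher_name == 'mpirun':   # OpenMPI flavor
        return [f'-x {name}={value}' for name, value in env_vars.items()]

    if launcher_name == 'srun':     # Slurm
        pairs = ','.join(f'{name}={value}' for name, value in env_vars.items())
        # Keep 'ALL,' so the rest of the environment is still propagated;
        # a bare --export=NAME=VALUE would strip everything else.
        return [f'--export=ALL,{pairs}']

    raise NotImplementedError(f'env_vars not supported for {launcher_name!r}')


# launcher_export_options('mpirun', {'MASTER_PORT': '1234'})
# -> ['-x MASTER_PORT=1234']
```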


vkarak commented Mar 12, 2025

Thanks @casparvl for the clarification; it makes sense. I was confused at the beginning.


jack-morrison commented Mar 26, 2025

@vkarak, we discussed modifying the JobLauncher base class and implementing MPI-specific launchers, like an IntelMPI mpirun (impi_mpirun) and an OpenMPI mpirun (ompi_mpirun).

After thinking about this some more, doesn't this just shift the reuse problem from a test definition into the config files?

If I'm not mistaken, in order to use these MPI-specific launchers, a user would have to define multiple system.partition.launchers, meaning new system.partitions. This seems inconvenient.

Am I thinking about this correctly?

Edit: Maybe this approach is fine... system.partition.launcher already implies a single launcher is configured. I'm just thinking about situations where a user may want to use either OpenMPI or IntelMPI (or others) on a single partition.


vkarak commented Mar 27, 2025

> Edit: Maybe this approach is fine... system.partition.launcher already implies a single launcher is configured. I'm just thinking about situations where a user may want to use either OpenMPI or IntelMPI (or others) on a single partition.

Yes, this is a general assumption of the system partitions, so we are not doing anything different. The key problem we are trying to solve here is that exporting variables is specific to the MPI launcher flavor. We will not replace the generic mpirun launcher, but it won't support/interpret the discussed env_vars. If users want to export variables at the launcher level, they would then have to pick a more specific launcher for their partition.
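
For illustration, a rough sketch of what such flavor-specific launchers could look like, assuming the env_vars attribute proposed above; the class names are hypothetical and the actual implementation may differ:

```python
from reframe.core.launchers import JobLauncher


class OMPILauncher(JobLauncher):
    '''Hypothetical OpenMPI-specific mpirun (ompi_mpirun).'''

    def command(self, job):
        # self.env_vars is the attribute proposed in this issue
        # OpenMPI exports variables to the remote ranks with -x NAME=VALUE
        opts = [f'-x {name}={value}' for name, value in self.env_vars.items()]
        return ['mpirun', '-np', str(job.num_tasks), *opts]


class IMPILauncher(JobLauncher):
    '''Hypothetical Intel MPI-specific mpirun (impi_mpirun).'''

    def command(self, job):
        # Intel MPI uses -genv NAME VALUE for the same purpose
        opts = []
        for name, value in self.env_vars.items():
            opts += ['-genv', name, value]
        return ['mpirun', '-np', str(job.num_tasks), *opts]
```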

vkarak moved this from Todo to In Progress in ReFrame Backlog on Apr 9, 2025
vkarak modified the milestones: ReFrame 4.8 → ReFrame 4.9 on Apr 9, 2025