feat: parallel inference without slurm #121
base: main
Conversation
Codecov Report: All modified and coverable lines are covered by tests ✅

Additional details and impacted files:

```
@@ Coverage Diff @@
##             main     #121   +/-   ##
=======================================
  Coverage   98.03%   98.03%
=======================================
  Files           3        3
  Lines          51       51
=======================================
  Hits           50       50
  Misses          1        1
=======================================
```

☔ View full report in Codecov by Sentry.
```python
# This is identical to the 'run' method in commands/run.py except the config file is not loaded
def _run_subproc(config):
    runner = create_runner(config)

    input = runner.create_input()
    output = runner.create_output()

    input_state = input.create_input_state(date=config.date)

    output.write_initial_state(input_state)

    for state in runner.run(input_state=input_state, lead_time=config.lead_time):
        output.write_state(state)

    output.close()
```
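Each spawned process runs a copy of this function with the config preloaded. A minimal launcher sketch of that pattern (hypothetical: `_worker` and `launch` are illustrative stand-ins, not the PR's actual code; a CUDA-backed runner would typically need the "spawn" start method rather than "fork"):

```python
import multiprocessing as mp
import os


def _worker(rank, world_size, results):
    # Each rank would set up its distributed environment before calling
    # something like _run_subproc(config); here we only record the rank.
    os.environ["RANK"] = str(rank)
    os.environ["WORLD_SIZE"] = str(world_size)
    results.put(rank)


def launch(world_size):
    # "fork" keeps this sketch self-contained; real parallel inference
    # with CUDA would presumably use "spawn" instead.
    ctx = mp.get_context("fork")
    results = ctx.Queue()
    procs = [ctx.Process(target=_worker, args=(r, world_size, results))
             for r in range(world_size)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
    return sorted(results.get() for _ in range(world_size))
```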
I'm not sure how well this will generalise for users other than the command interface. I suppose those issues are out of scope, as this primarily concerns the cmd.
I'll need to think more on it.
```python
# Ensure each parallel model instance uses the same seed
if self.global_rank == 0:
    seed = torch.initial_seed()
```
Should a user be able to control the initial seed?
Thanks for all the feedback. I added a check for `ANEMOI_BASE_SEED`. Currently only the parallel runner looks for this, not the default runner. It would be better if this check were moved somewhere in the default runner.
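A minimal sketch of what such a check could look like (the helper name and fallback behaviour are assumptions; only the `ANEMOI_BASE_SEED` variable name comes from the comment above):

```python
import os


def resolve_base_seed(fallback_seed):
    # Hypothetical helper: prefer an explicit ANEMOI_BASE_SEED from the
    # environment, falling back to the seed chosen by rank 0 otherwise.
    env_value = os.environ.get("ANEMOI_BASE_SEED")
    if env_value is not None:
        return int(env_value)
    return fallback_seed
```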
note to self: update parallel docs to clearly state it requires a minimum version of anemoi-models
This PR extends the parallel inference added in #55 to work without slurm within 1 node. It's much nicer to debug and run now :D

When running parallel inference without slurm, you have to add `world_size` to the config. At the moment, this is ignored in favour of `SLURM_NTASKS` when running with srun. An example of a config running parallel inference across 4 processes is shown below. You can launch this job as normal with `anemoi-inference run parinf.yaml`.

How it works:

- The parallel runner spawns `config.world_size` processes
- `master_addr` is `localhost`
- `master_port` is a hash of the node name, within a range
- Each process runs a copy of `RunCmd.run`, but with the config preloaded

Issues:

- There is now a copy of `RunCmd.run` in `runners/parallel.py`. It would be nice to be able to use that code directly, rather than having to maintain a copy. To do this, I would just have to be able to pass a loaded config instead of a path.

📚 Documentation preview 📚: https://anemoi-inference--121.org.readthedocs.build/en/121/