
feat: parallel inference without slurm #121

Open · wants to merge 13 commits into main
Conversation

@cathalobrien (Contributor) commented Feb 3, 2025

This PR extends the parallel inference added in #55 to work without Slurm within a single node. It is much nicer to debug and run now :D

When running parallel inference without Slurm, you have to add world_size to the config. At the moment, this field is ignored in favour of SLURM_NTASKS when running with srun. An example config running parallel inference across 4 processes on a single node is shown below. You can launch this job as normal with anemoi-inference run parinf.yaml.

checkpoint: /path/to/inference-last.ckpt
lead_time: 60
runner: parallel
world_size: 4 # Only required if running parallel inference without Slurm
input:
  grib: /path/to/input.grib
output:
  grib: /path/to/output.grib
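The precedence between SLURM_NTASKS and the config's world_size can be sketched as follows; resolve_world_size is a hypothetical helper for illustration, not the PR's actual code:

```python
import os

# Hypothetical helper illustrating the precedence described above:
# under srun, SLURM_NTASKS overrides the config's world_size.
def resolve_world_size(config_world_size: int) -> int:
    return int(os.environ.get("SLURM_NTASKS", config_world_size))
```

Under srun, Slurm exports SLURM_NTASKS, so the config value is ignored; when launched directly, the variable is absent and the config value wins.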

How it works:

  • Check whether anemoi-inference was launched by srun.
  • If not, spawn config.world_size processes:
    • master_addr is localhost.
    • master_port is derived from a hash of the node name, constrained to a fixed range.
  • Each spawned process runs a slimmed-down version of RunCmd.run, but with the config preloaded.
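The steps above can be sketched as follows. This is illustrative only: the helper names, port range, and hash function are assumptions, not the PR's actual identifiers.

```python
import hashlib
import os
import socket

PORT_BASE = 10000   # assumed lower bound of the port range
PORT_RANGE = 10000  # assumed width of the range

def launched_by_srun() -> bool:
    # srun exports SLURM_NTASKS (among other variables)
    return "SLURM_NTASKS" in os.environ

def master_port(node_name: str) -> int:
    # Deterministic port derived from a hash of the node name,
    # kept within [PORT_BASE, PORT_BASE + PORT_RANGE)
    digest = hashlib.md5(node_name.encode()).hexdigest()
    return PORT_BASE + int(digest, 16) % PORT_RANGE

if not launched_by_srun():
    addr, port = "localhost", master_port(socket.gethostname())
    # here the runner would spawn config.world_size processes, each with
    # MASTER_ADDR=addr, MASTER_PORT=port, RANK=i, WORLD_SIZE=world_size
```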

Issues:

  • The master port calculation would lead to a clash if two parallel inference jobs ran on the same node at the same time.
  • At the moment, I have a copy of RunCmd.run in runners/parallel.py. It would be nice to use that code directly rather than having to maintain a copy. To do this, I would just have to be able to pass a loaded config instead of a path.
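The first issue can be demonstrated in a few lines: because the port is a pure function of the node name, a second job on the same node computes the same port, and its bind fails. This is a sketch of the failure mode, not the PR's code.

```python
import socket

def try_bind(port: int):
    # Returns a bound socket on success, None if the port is already taken
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("localhost", port))
        return s
    except OSError:
        s.close()
        return None

# Let the OS pick a free port; it stands in for the hash-derived master_port
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("localhost", 0))
port = first.getsockname()[1]

# The "second job" hashes the same node name, gets the same port, and fails
second = try_bind(port)
first.close()
```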

📚 Documentation preview 📚: https://anemoi-inference--121.org.readthedocs.build/en/121/

@codecov-commenter commented Feb 3, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.03%. Comparing base (90728d5) to head (9d22d57).
Report is 44 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #121   +/-   ##
=======================================
  Coverage   98.03%   98.03%           
=======================================
  Files           3        3           
  Lines          51       51           
=======================================
  Hits           50       50           
  Misses          1        1           

☔ View full report in Codecov by Sentry.

@HCookie HCookie changed the title parallel inference without slurm feat: parallel inference without slurm Feb 6, 2025
CHANGELOG.md — outdated review thread, resolved
src/anemoi/inference/config.py — outdated review thread, resolved
Comment on lines +30 to +46
# This is identical to the 'run' method in commands/run.py except the config file is not loaded
def _run_subproc(config):
    runner = create_runner(config)

    input = runner.create_input()
    output = runner.create_output()

    input_state = input.create_input_state(date=config.date)

    output.write_initial_state(input_state)

    for state in runner.run(input_state=input_state, lead_time=config.lead_time):
        output.write_state(state)

    output.close()


Member commented:
I'm not sure how well this will generalise for users other than the command interface. I suppose those issues are out of scope, as this primarily concerns the command.
I'll need to think more on it.

src/anemoi/inference/runners/parallel.py — outdated review thread, resolved

# Ensure each parallel model instance uses the same seed
if self.global_rank == 0:
    seed = torch.initial_seed()
Member commented:
Should a user be able to control the initial seed?

cathalobrien (Contributor Author) commented:
Thanks for all the feedback. I added a check for 'ANEMOI_BASE_SEED'. Currently only the parallel runner looks for this, not the default runner. It would be better if this check were moved somewhere in the default runner.
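A minimal sketch of such a check, assuming ANEMOI_BASE_SEED simply overrides the torch-derived seed when set; the helper name is hypothetical, not the PR's actual code:

```python
import os

def resolve_base_seed(default_seed: int) -> int:
    # Hypothetical helper: honour ANEMOI_BASE_SEED if set, otherwise fall
    # back to the seed obtained via torch.initial_seed() on rank 0
    env = os.environ.get("ANEMOI_BASE_SEED")
    return int(env) if env is not None else default_seed
```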

src/anemoi/inference/runners/parallel.py — outdated review thread, resolved
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 6, 2025
@cathalobrien (Contributor Author) commented Feb 6, 2025

Note to self: update the parallel docs to clearly state that this requires a minimum version of anemoi-models, and point out that, if you can't update anemoi-models because of breaking changes, you can cherry-pick this PR instead.

Labels
documentation Improvements or additions to documentation
Projects
Status: Under Review
3 participants