
feat: parallel inference without slurm #121

Open · wants to merge 13 commits into main
Conversation

@cathalobrien (Contributor) commented Feb 3, 2025

This PR extends the parallel inference added in #55 to work without Slurm within a single node. It is much nicer to debug and run now :D

When running parallel inference without Slurm, you have to add world_size to the config. At the moment, this field is ignored in favour of SLURM_NTASKS when running with srun. An example config running parallel inference across 4 processes on a single node is shown below. You can launch this job as normal with anemoi-inference run parinf.yaml.

checkpoint: /path/to/inference-last.ckpt
lead_time: 60
runner: parallel
world_size: 4 # Only required if running parallel inference without Slurm
input:
  grib: /path/to/input.grib
output:
  grib: /path/to/output.grib
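The precedence between SLURM_NTASKS and the config's world_size can be sketched as follows; resolve_world_size is a hypothetical helper for illustration, not the PR's actual code:

```python
import os

# Hypothetical helper illustrating the precedence described above:
# under srun, SLURM_NTASKS overrides the config's world_size.
def resolve_world_size(config_world_size: int) -> int:
    return int(os.environ.get("SLURM_NTASKS", config_world_size))
```

Under srun, Slurm exports SLURM_NTASKS, so the config value is ignored; when launched directly, the variable is absent and the config value wins.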

How it works:

  • Check whether anemoi-inference was launched by srun.
  • If not, spawn config.world_size processes:
    • master_addr is localhost.
    • master_port is derived from a hash of the node name, constrained to a fixed range.
  • Each spawned process runs a slimmed-down version of RunCmd.run, but with the config preloaded.
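The steps above can be sketched as follows. This is illustrative only: the helper names, port range, and hash function are assumptions, not the PR's actual identifiers.

```python
import hashlib
import os
import socket

PORT_BASE = 10000   # assumed lower bound of the port range
PORT_RANGE = 10000  # assumed width of the range

def launched_by_srun() -> bool:
    # srun exports SLURM_NTASKS (among other variables)
    return "SLURM_NTASKS" in os.environ

def master_port(node_name: str) -> int:
    # Deterministic port derived from a hash of the node name,
    # kept within [PORT_BASE, PORT_BASE + PORT_RANGE)
    digest = hashlib.md5(node_name.encode()).hexdigest()
    return PORT_BASE + int(digest, 16) % PORT_RANGE

if not launched_by_srun():
    addr, port = "localhost", master_port(socket.gethostname())
    # here the runner would spawn config.world_size processes, each with
    # MASTER_ADDR=addr, MASTER_PORT=port, RANK=i, WORLD_SIZE=world_size
```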

Issues:

  • The master port calculation would lead to a clash if two parallel inference jobs ran on the same node at the same time.
  • At the moment, I have a copy of RunCmd.run in runners/parallel.py. It would be nice to use that code directly rather than having to maintain a copy. To do this, I would just have to be able to pass a loaded config instead of a path.
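The first issue can be demonstrated in a few lines: because the port is a pure function of the node name, a second job on the same node computes the same port, and its bind fails. This is a sketch of the failure mode, not the PR's code.

```python
import socket

def try_bind(port: int):
    # Returns a bound socket on success, None if the port is already taken
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        s.bind(("localhost", port))
        return s
    except OSError:
        s.close()
        return None

# Let the OS pick a free port; it stands in for the hash-derived master_port
first = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
first.bind(("localhost", 0))
port = first.getsockname()[1]

# The "second job" hashes the same node name, gets the same port, and fails
second = try_bind(port)
first.close()
```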

📚 Documentation preview 📚: https://anemoi-inference--121.org.readthedocs.build/en/121/

@codecov-commenter commented Feb 3, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 98.03%. Comparing base (90728d5) to head (9d22d57).
Report is 44 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #121   +/-   ##
=======================================
  Coverage   98.03%   98.03%           
=======================================
  Files           3        3           
  Lines          51       51           
=======================================
  Hits           50       50           
  Misses          1        1           

☔ View full report in Codecov by Sentry.

@HCookie HCookie changed the title parallel inference without slurm feat: parallel inference without slurm Feb 6, 2025
CHANGELOG.md — outdated review thread, resolved
src/anemoi/inference/config.py — outdated review thread, resolved
Comment on lines +30 to +46
# This is identical to the 'run' method in commands/run.py except the config file is not loaded
def _run_subproc(config):
    runner = create_runner(config)

    input = runner.create_input()
    output = runner.create_output()

    input_state = input.create_input_state(date=config.date)

    output.write_initial_state(input_state)

    for state in runner.run(input_state=input_state, lead_time=config.lead_time):
        output.write_state(state)

    output.close()


Member commented:
I'm not sure how well this will generalise for users other than the command interface. I suppose those issues are out of scope, as this primarily concerns the command.
I'll need to think more on it.

src/anemoi/inference/runners/parallel.py — outdated review thread, resolved

# Ensure each parallel model instance uses the same seed
if self.global_rank == 0:
    seed = torch.initial_seed()
Member commented:
Should a user be able to control the initial seed?

cathalobrien (Contributor Author) commented:
Thanks for all the feedback. I added a check for 'ANEMOI_BASE_SEED'. Currently only the parallel runner looks for this, not the default runner. It would be better if this check were moved somewhere in the default runner.
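A minimal sketch of such a check, assuming ANEMOI_BASE_SEED simply overrides the torch-derived seed when set; the helper name is hypothetical, not the PR's actual code:

```python
import os

def resolve_base_seed(default_seed: int) -> int:
    # Hypothetical helper: honour ANEMOI_BASE_SEED if set, otherwise fall
    # back to the seed obtained via torch.initial_seed() on rank 0
    env = os.environ.get("ANEMOI_BASE_SEED")
    return int(env) if env is not None else default_seed
```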

src/anemoi/inference/runners/parallel.py — outdated review thread, resolved
@github-actions github-actions bot added the documentation Improvements or additions to documentation label Feb 6, 2025
@cathalobrien (Contributor Author) commented Feb 6, 2025

Note to self: update the parallel docs to clearly state that this requires a minimum version of anemoi-models, and point out that, if you can't update anemoi-models because of breaking changes, you can cherry-pick this PR instead.

Labels
documentation Improvements or additions to documentation
Projects
Status: Under Review
3 participants