run-anemoi is a collection of utility scripts and packages for using anemoi-training.
Use of virtual Python environments is strongly discouraged on LUMI, with a container-based approach being the preferred solution. We therefore use a Singularity container that contains the entire software environment except for the anemoi repositories themselves (training, graphs, models, datasets, utils). These are installed in a lightweight virtual environment that is loaded on top of the container, which lets us edit these packages without rebuilding the container.
- The virtual environment is set up by executing `bash make_env.sh` in /lumi. This will download the anemoi packages and install them in a `.venv` folder inside /lumi.
You can now train a model through the following steps:
- Set up the desired model config file and make sure it is placed in /lumi. This file should not be named `config.yaml` or any other config name already used in anemoi-training.
- Specify the config file name in `lumi_jobscript.sh`, along with the preferred sbatch settings for the job.
- Submit the job with `sbatch lumi_jobscript.sh`.
autorun-anemoi is a lightweight Python package for submitting Anemoi training runs to the SLURM queue.
Features:
- Chained dependency jobs for long training
- Auto-run inference after training is finalised
- Modify config on-the-fly for efficient testing
- Backs up config and jobscript to avoid overwriting
This package is not available on PyPI. To install, run:
```bash
pip install git+https://github.com/metno/run-anemoi.git
```

autorun-anemoi comes with a command-line interface and a Python interface. The examples below focus on the command-line interface, but the Python interface supports the same functionality.
The command-line interface takes two required arguments, config-name and sbatch-yaml:
```bash
run-anemoi <config-name> <sbatch-yaml>
```
The first is the path to the config to be used, and the second is a YAML file containing all SBATCH options to be used in the job script. An example file is provided as `job.yaml`:
```yaml
output: output.out
error: error.err
nodes: 1
ntasks-per-node: 4
gpus-per-node: 4
mem: 450G
account: DestE_330_24
partition: boost_usr_prod
job-name: test
exclusive: None
```

```bash
run-anemoi anemoi/config/config.yaml job.yaml
```
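Each entry in the sbatch YAML corresponds to one `#SBATCH` directive in the generated job script. The sketch below illustrates that mapping under the assumption of a direct key/value translation; it is not the package's actual implementation, and the `sbatch_header` helper is hypothetical:

```python
# Hypothetical sketch: how entries in job.yaml could translate into #SBATCH
# directives. Assumes a direct key/value mapping; not autorun-anemoi's code.
import yaml

def sbatch_header(job_yaml_path: str) -> str:
    """Render the sbatch YAML as a job-script header."""
    with open(job_yaml_path) as f:
        options = yaml.safe_load(f)

    lines = ["#!/bin/bash"]
    for key, value in options.items():
        if value in (None, "None"):
            # Flag-style options (e.g. 'exclusive') take no value.
            lines.append(f"#SBATCH --{key}")
        else:
            lines.append(f"#SBATCH --{key}={value}")
    return "\n".join(lines)

print(sbatch_header("job.yaml"))
# #SBATCH --output=output.out
# ...
# #SBATCH --exclusive
```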
The same operation can be done by creating an `AutoRunAnemoi` object in Python:
```python
from autorun_anemoi import AutoRunAnemoi
obj = AutoRunAnemoi('aifs/config/config.yaml', 'job.yaml')
obj.run()
```
If the total training time is longer than what is practical for a single job (due to system limitations or queue times), multiple dependency jobs can be submitted. This happens when `total_time`, the expected duration of the training procedure specified in the config, exceeds `max_time_per_job`. Set `total_time` with the `--total_time` or `-t` argument (it follows the SLURM time format):
```bash
run-anemoi anemoi/config/config.yaml job.yaml -t 3-00:00:00
```
The default `max_time_per_job` is set to the maximum running time for the specified partition. To override this, use the `--max_time_per_job` or `-m` argument:
```bash
run-anemoi anemoi/config/config.yaml job.yaml -t 3-00:00:00 -m 12:00:00
```
The command above will submit 6 jobs in total (one initial job and five dependency jobs), each with a maximum running time of 12 hours.
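The job count follows from a ceiling division of `total_time` by `max_time_per_job`, and chained jobs of this kind are typically linked through SLURM job dependencies. The sketch below illustrates both ideas; it is simplified (only `D-HH:MM:SS` and `HH:MM:SS` time strings are handled) and is not autorun-anemoi's actual submission code:

```python
# Hypothetical sketch: number of chained jobs and dependency-based submission.
# Simplified time parsing; not autorun-anemoi's actual implementation.
import math
import subprocess


def slurm_time_to_seconds(timestr: str) -> int:
    """Parse a SLURM time string such as '3-00:00:00' or '12:00:00'."""
    days = 0
    if "-" in timestr:
        day_part, timestr = timestr.split("-")
        days = int(day_part)
    hours, minutes, seconds = (int(x) for x in timestr.split(":"))
    return ((days * 24 + hours) * 60 + minutes) * 60 + seconds


def number_of_jobs(total_time: str, max_time_per_job: str) -> int:
    """One initial job plus as many dependency jobs as needed."""
    return math.ceil(
        slurm_time_to_seconds(total_time) / slurm_time_to_seconds(max_time_per_job)
    )


print(number_of_jobs("3-00:00:00", "12:00:00"))  # -> 6


def submit_chain(jobscript: str, n_jobs: int) -> None:
    """Submit n_jobs copies of a job script, each waiting for the previous one."""
    previous_id = None
    for _ in range(n_jobs):
        cmd = ["sbatch", "--parsable"]
        if previous_id is not None:
            # 'afterany' lets the next job start even if the previous one
            # ended by hitting its time limit.
            cmd.append(f"--dependency=afterany:{previous_id}")
        cmd.append(jobscript)
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
        previous_id = result.stdout.strip()
```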
We can also run inference after training is finalised. Similar to the training job, the inference job needs a config name and an sbatch YAML, specified with `--inference_config_name` (`-i`) and `--inference_job_yaml` (`-j`), respectively:
```bash
run-anemoi anemoi/config/config.yaml job.yaml -i inference.yaml -j inference_job.yaml
```
Use the `--inference_python_script` argument to change the name of the inference script from the default `inference.py`.
Config overrides can be passed as command-line arguments:
```bash
run-anemoi anemoi/config/config.yaml job.yaml diagnostics.plot.enabled=False
```
This is particularly useful when submitting a series of experiments with only small changes to the config (a sketch of how such dotted overrides act on the config follows the loop below):
```bash
for NCHANNELS in 256 512
do
run-anemoi aifs/config/config.yaml job.yaml model.num_channels=$NCHANNELS
done
```
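A dotted override addresses one entry in the nested config, and the value is parsed into the corresponding Python type. The sketch below only illustrates the idea; `apply_override` is a hypothetical helper, not how anemoi-training or autorun-anemoi actually applies overrides:

```python
# Hypothetical sketch of dotted config overrides; not the actual implementation.
import yaml

def apply_override(config: dict, override: str) -> None:
    """Set a nested config entry addressed by a dotted KEY=VALUE override."""
    dotted_key, value = override.split("=", 1)
    *parents, leaf = dotted_key.split(".")
    node = config
    for key in parents:
        node = node.setdefault(key, {})
    node[leaf] = yaml.safe_load(value)  # '512' -> 512, 'False' -> False

config = {"model": {"num_channels": 256}}
apply_override(config, "model.num_channels=512")
print(config)  # {'model': {'num_channels': 512}}
```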
In Python, use the `modify_config` method:
```python
from autorun_anemoi import AutoRunAnemoi
obj = AutoRunAnemoi('aifs/config/config.yaml', 'job.yaml')
for i in [256, 512]:
    obj.modify_config(f'model.num_channels={i}')
    obj.run()
```
For a full overview of available command-line options, run `run-anemoi --help`.