Commit e8c0772

[Env] SKYPILOT_JOB_ID for all tasks (skypilot-org#1377)
* Add run id for normal job
* add example for the run id
* fix env_check
* fix env_check
* fix
* address comments
* Rename to SKYPILOT_JOB_ID
* rename the controller's job id to avoid confusion
* rename env variables
* fix
1 parent a613b43 commit e8c0772

File tree

22 files changed, +127 -72 lines changed


.github/workflows/pytest.yml

Lines changed: 1 addition & 1 deletion
@@ -53,4 +53,4 @@ jobs:
         pip install pytest pytest-xdist pytest-env>=0.6
 
       - name: Run tests with pytest
-        run: SKY_DISABLE_USAGE_COLLECTION=1 pytest ${{ matrix.test-path }}
+        run: SKYPILOT_DISABLE_USAGE_COLLECTION=1 pytest ${{ matrix.test-path }}

docs/source/examples/spot-jobs.rst

Lines changed: 3 additions & 3 deletions
@@ -150,7 +150,7 @@ Below we show an `example <https://github.com/skypilot-org/skypilot/blob/master/
       --max_seq_length 384 \
       --doc_stride 128 \
       --report_to wandb \
-      --run_name $SKYPILOT_RUN_ID \
+      --run_name $SKYPILOT_JOB_ID \
       --output_dir /checkpoint/bert_qa/ \
       --save_total_limit 10 \
       --save_steps 1000
@@ -162,11 +162,11 @@ the output directory and frequency of checkpointing (see more
 on `Huggingface API <https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.save_steps>`_).
 You may also refer to another example `here <https://github.com/skypilot-org/skypilot/tree/master/examples/spot/resnet_ddp>`_ for periodically checkpointing with PyTorch.
 
-We also set :code:`--run_name` to :code:`$SKYPILOT_RUN_ID` so that the loggings will be saved
+We also set :code:`--run_name` to :code:`$SKYPILOT_JOB_ID` so that the loggings will be saved
 to the same run in Weights & Biases.
 
 .. note::
-  The environment variable :code:`$SKYPILOT_RUN_ID` can be used to identify the same job, i.e., it is kept identical across all
+  The environment variable :code:`$SKYPILOT_JOB_ID` (example: "sky-2022-10-06-05-17-09-750781_spot_id-22") can be used to identify the same job, i.e., it is kept identical across all
   recoveries of the job.
   It can be accessed in the task's :code:`run` commands or directly in the program itself (e.g., access
   via :code:`os.environ` and pass to Weights & Biases for tracking purposes in your training script). It is made available to
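
For reference, a minimal Python sketch (not code from this commit) of the pattern the note above describes: reading SKYPILOT_JOB_ID via os.environ and reusing it as the W&B run name, so every recovery of the spot job logs to the same run. The project name and fallback value are illustrative:

    import os

    import wandb

    # SKYPILOT_JOB_ID stays identical across spot recoveries; the fallback
    # is a placeholder for runs launched outside SkyPilot.
    job_id = os.environ.get("SKYPILOT_JOB_ID", "local-debug")
    wandb.init(project="bert-qa", name=job_id)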

docs/source/running-jobs/distributed-jobs.rst

Lines changed: 10 additions & 10 deletions
@@ -30,27 +30,27 @@ For example, here is a simple PyTorch Distributed training example:
   run: |
     cd pytorch-distributed-resnet
 
-    num_nodes=`echo "$SKY_NODE_IPS" | wc -l`
-    master_addr=`echo "$SKY_NODE_IPS" | head -n1`
-    python3 -m torch.distributed.launch --nproc_per_node=$SKY_NUM_GPUS_PER_NODE \
-      --nnodes=$num_nodes --node_rank=${SKY_NODE_RANK} --master_addr=$master_addr \
+    num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
+    master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
+    python3 -m torch.distributed.launch --nproc_per_node=$SKYPILOT_NUM_GPUS_PER_NODE \
+      --nnodes=$num_nodes --node_rank=${SKYPILOT_NODE_RANK} --master_addr=$master_addr \
       --master_port=8008 resnet_ddp.py --num_epochs 20
 
 In the above, :code:`num_nodes: 2` specifies that this task is to be run on 2
 nodes. The :code:`setup` and :code:`run` commands are executed on both nodes.
 
 SkyPilot exposes these environment variables that can be accessed in a task's ``run`` commands:
 
-- :code:`SKY_NODE_RANK`: rank (an integer ID from 0 to :code:`num_nodes-1`) of
+- :code:`SKYPILOT_NODE_RANK`: rank (an integer ID from 0 to :code:`num_nodes-1`) of
   the node executing the task.
-- :code:`SKY_NODE_IPS`: a string of IP addresses of the nodes reserved to execute
+- :code:`SKYPILOT_NODE_IPS`: a string of IP addresses of the nodes reserved to execute
   the task, where each line contains one IP address.
 
-  - You can retrieve the number of nodes by :code:`echo "$SKY_NODE_IPS" | wc -l`
-    and the IP address of the third node by :code:`echo "$SKY_NODE_IPS" | sed -n
+  - You can retrieve the number of nodes by :code:`echo "$SKYPILOT_NODE_IPS" | wc -l`
+    and the IP address of the third node by :code:`echo "$SKYPILOT_NODE_IPS" | sed -n
     3p`.
 
   - To manipulate these IP addresses, you can also store them to a file in the
-    :code:`run` command with :code:`echo $SKY_NODE_IPS >> ~/sky_node_ips`.
-- :code:`SKY_NUM_GPUS_PER_NODE`: number of GPUs reserved on each node to execute the
+    :code:`run` command with :code:`echo $SKYPILOT_NODE_IPS >> ~/sky_node_ips`.
+- :code:`SKYPILOT_NUM_GPUS_PER_NODE`: number of GPUs reserved on each node to execute the
   task; the same as the count in ``accelerators: <name>:<count>`` (rounded up if a fraction).
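
For reference, a minimal Python sketch (not part of this commit) of consuming the renamed variables from inside the program itself, mirroring the shell snippets above; the parsing details are an assumption, not SkyPilot API:

    import os

    # Each line of SKYPILOT_NODE_IPS holds one node's IP address.
    node_ips = os.environ["SKYPILOT_NODE_IPS"].strip().splitlines()
    num_nodes = len(node_ips)          # shell equivalent: wc -l
    master_addr = node_ips[0]          # shell equivalent: head -n1
    node_rank = int(os.environ["SKYPILOT_NODE_RANK"])
    gpus_per_node = int(os.environ["SKYPILOT_NUM_GPUS_PER_NODE"])

    print(f"node {node_rank} of {num_nodes}, master={master_addr}, "
          f"{gpus_per_node} GPU(s) per node")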

examples/env_check.yaml

Lines changed: 23 additions & 3 deletions
@@ -24,7 +24,27 @@ run: |
     exit 1
   fi
 
-  echo NODE ID: $SKY_NODE_RANK
-  echo NODE IPS: "$SKY_NODE_IPS"
-  worker_addr=`echo "$SKY_NODE_IPS" | sed -n 2p`
+  if [[ -z "$SKYPILOT_NODE_RANK" ]]; then
+    echo "SKYPILOT_NODE_RANK is not set"
+    exit 1
+  else
+    echo "SKYPILOT_NODE_RANK is set to ${SKYPILOT_NODE_RANK}"
+  fi
+
+  if [[ -z "$SKYPILOT_NODE_IPS" ]]; then
+    echo "SKYPILOT_NODE_IPS is not set"
+    exit 1
+  else
+    echo "SKYPILOT_NODE_IPS is set to ${SKYPILOT_NODE_IPS}"
+    echo "${SKYPILOT_NODE_IPS}"
+    echo "${SKYPILOT_NODE_IPS}" | wc -l | grep 2 || exit 1
+  fi
+  worker_addr=`echo "$SKYPILOT_NODE_IPS" | sed -n 2p`
   echo Worker IP: $worker_addr
+
+  if [[ -z "$SKYPILOT_JOB_ID" ]]; then
+    echo "SKYPILOT_JOB_ID is not set"
+    exit 1
+  else
+    echo "SKYPILOT_JOB_ID is set to ${SKYPILOT_JOB_ID}"
+  fi
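
A Python equivalent of these checks (a sketch, not part of the commit) can fail fast inside a program launched by SkyPilot:

    import os
    import sys

    # Mirrors the fail-fast checks in the YAML above.
    for var in ("SKYPILOT_NODE_RANK", "SKYPILOT_NODE_IPS", "SKYPILOT_JOB_ID"):
        value = os.environ.get(var)
        if not value:
            sys.exit(f"{var} is not set")
        print(f"{var} is set to {value}")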

examples/ray_tune_app.yaml

Lines changed: 1 addition & 1 deletion
@@ -11,6 +11,6 @@ setup: |
   pip3 install ray[tune] pytorch-lightning==1.4.9 lightning-bolts torchvision
 
 run: |
-  if [ "${SKY_NODE_RANK}" == "0" ]; then
+  if [ "${SKYPILOT_NODE_RANK}" == "0" ]; then
     python3 tune_ptl_example.py
   fi

examples/resnet_distributed_torch.yaml

Lines changed: 3 additions & 3 deletions
@@ -19,8 +19,8 @@ setup: |
 run: |
   cd pytorch-distributed-resnet
 
-  num_nodes=`echo "$SKY_NODE_IPS" | wc -l`
-  master_addr=`echo "$SKY_NODE_IPS" | head -n1`
+  num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
+  master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
   python3 -m torch.distributed.launch --nproc_per_node=1 \
-    --nnodes=$num_nodes --node_rank=${SKY_NODE_RANK} --master_addr=$master_addr \
+    --nnodes=$num_nodes --node_rank=${SKYPILOT_NODE_RANK} --master_addr=$master_addr \
     --master_port=8008 resnet_ddp.py --num_epochs 20

examples/resnet_distributed_torch_scripts/run.sh

Lines changed: 3 additions & 3 deletions
@@ -4,9 +4,9 @@ conda activate resnet
 conda env list
 
 cd pytorch-distributed-resnet
-num_nodes=`echo "$SKY_NODE_IPS" | wc -l`
-master_addr=`echo "$SKY_NODE_IPS" | head -n1`
+num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
+master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
 echo MASTER_ADDR $master_addr
 python -m torch.distributed.launch --nproc_per_node=1 \
-  --nnodes=$num_nodes --node_rank=${SKY_NODE_RANK} --master_addr=$master_addr \
+  --nnodes=$num_nodes --node_rank=${SKYPILOT_NODE_RANK} --master_addr=$master_addr \
   --master_port=8008 resnet_ddp.py --num_epochs 20

examples/spot/bert_qa.yaml

Lines changed: 1 addition & 1 deletion
@@ -36,7 +36,7 @@ run: |
     --max_seq_length 384 \
     --doc_stride 128 \
     --report_to wandb \
-    --run_name $SKYPILOT_RUN_ID \
+    --run_name $SKYPILOT_JOB_ID \
     --output_dir /checkpoint/bert_qa/ \
     --save_total_limit 10 \
     --save_steps 1000

examples/spot/resnet.yaml

Lines changed: 3 additions & 3 deletions
@@ -46,9 +46,9 @@ run: |
   # modify your run id for each different run!
   run_id="resnet-run-1"
 
-  num_nodes=`echo "$SKY_NODE_IPS" | wc -l`
-  master_addr=`echo "$SKY_NODE_IPS" | head -n1`
+  num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
+  master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
   python3 -m torch.distributed.launch --nproc_per_node=1 \
-    --nnodes=$num_nodes --node_rank=${SKY_NODE_RANK} --master_addr=$master_addr \
+    --nnodes=$num_nodes --node_rank=${SKYPILOT_NODE_RANK} --master_addr=$master_addr \
     --master_port=8008 resnet_ddp.py --num_epochs 100000 --model_dir /checkpoint/torch_ddp_resnet/ \
     --resume --model_filename resnet_distributed-with-epochs.pth --run_id $run_id --wandb_dir /checkpoint/

examples/storage/checkpointed_training.yaml

Lines changed: 3 additions & 3 deletions
@@ -43,9 +43,9 @@ run: |
   cd pytorch-distributed-resnet
   git pull
 
-  num_nodes=`echo "$SKY_NODE_IPS" | wc -l`
-  master_addr=`echo "$SKY_NODE_IPS" | head -n1`
+  num_nodes=`echo "$SKYPILOT_NODE_IPS" | wc -l`
+  master_addr=`echo "$SKYPILOT_NODE_IPS" | head -n1`
   python3 -m torch.distributed.launch --nproc_per_node=1 \
-    --nnodes=$num_nodes --node_rank=${SKY_NODE_RANK} --master_addr=$master_addr \
+    --nnodes=$num_nodes --node_rank=${SKYPILOT_NODE_RANK} --master_addr=$master_addr \
     --master_port=8008 resnet_ddp.py --num_epochs 100 --model_dir /checkpoints/torch_ddp_resnet/ \
     --resume --model_filename resnet_distributed-with-epochs.pth
