* Add run id for normal job
* add example for the run id
* fix env_check
* fix env_check
* fix
* address comments
* Rename to SKYPILOT_JOB_ID
* rename the controller's job id to avoid confusion
* rename env variables
* fix
docs/source/examples/spot-jobs.rst (+3 -3)

@@ -150,7 +150,7 @@ Below we show an `example <https://github.com/skypilot-org/skypilot/blob/master/
 --max_seq_length 384 \
 --doc_stride 128 \
 --report_to wandb \
---run_name $SKYPILOT_RUN_ID \
+--run_name $SKYPILOT_JOB_ID \
 --output_dir /checkpoint/bert_qa/ \
 --save_total_limit 10 \
 --save_steps 1000
@@ -162,11 +162,11 @@ the output directory and frequency of checkpointing (see more
 on `Huggingface API <https://huggingface.co/docs/transformers/main_classes/trainer#transformers.TrainingArguments.save_steps>`_).
 You may also refer to another example `here <https://github.com/skypilot-org/skypilot/tree/master/examples/spot/resnet_ddp>`_ for periodically checkpointing with PyTorch.

-We also set :code:`--run_name` to :code:`$SKYPILOT_RUN_ID` so that the loggings will be saved
+We also set :code:`--run_name` to :code:`$SKYPILOT_JOB_ID` so that the loggings will be saved
 to the same run in Weights & Biases.

 .. note::
-The environment variable :code:`$SKYPILOT_RUN_ID` can be used to identify the same job, i.e., it is kept identical across all
+The environment variable :code:`$SKYPILOT_JOB_ID` (example: "sky-2022-10-06-05-17-09-750781_spot_id-22") can be used to identify the same job, i.e., it is kept identical across all
 recoveries of the job.
 It can be accessed in the task's :code:`run` commands or directly in the program itself (e.g., access
 via :code:`os.environ` and pass to Weights & Biases for tracking purposes in your training script). It is made available to
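
The note in this hunk says the job ID can be read through os.environ inside the training script. A minimal sketch of that access pattern follows. The helper name get_run_name and the fallback value "local-run" are hypothetical, chosen only for illustration, and are not part of this change; the only thing taken from the diff is the variable name SKYPILOT_JOB_ID.

```python
import os


def get_run_name(default="local-run"):
    """Return the SkyPilot job ID if present, otherwise a fallback.

    `get_run_name` and `default` are hypothetical names for this sketch.
    The fallback covers runs started outside SkyPilot, where the
    SKYPILOT_JOB_ID environment variable is assumed to be unset.
    """
    return os.environ.get("SKYPILOT_JOB_ID", default)


if __name__ == "__main__":
    # Per the docs note, the value stays identical across recoveries of
    # the job, so it can serve as a stable run name for experiment tracking.
    print(get_run_name())
```

The returned string could then be passed as the name argument of wandb.init, mirroring what --run_name $SKYPILOT_JOB_ID does in the YAML above. That call is left out so the snippet runs without Weights & Biases installed.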
sky/templates/gcp-ray.yml.j2 (+4 -4)

@@ -171,15 +171,15 @@ head_start_ray_commands:
 # Line "which prlimit ..": increase the limit of the number of open files for the raylet process, as the `ulimit` may not take effect at this point, because it requires
 # all the sessions to be reloaded. This is a workaround.