Commit d5db2ac

Add SkyPilot Benchmark doc (skypilot-org#1066)
* Add benchmark doc * Address comments * Address comments
1 parent 6defe8c commit d5db2ac

9 files changed: +308 -8 lines changed

docs/source/index.rst

+1
Original file line number | Diff line number | Diff line change
@@ -49,6 +49,7 @@ Key features:
4949
reference/job-queue
5050
reference/auto-stop
5151
examples/spot-jobs
52+
reference/benchmark/index
5253

5354
.. toctree::
5455
:maxdepth: 1
@@ -0,0 +1,134 @@
1+
.. _benchmark-skycallback:
2+
3+
SkyCallback
4+
===========
5+
6+
SkyCallback is a simple Python library that works in conjunction with SkyPilot Benchmark.
7+
It enables SkyPilot to provide a more detailed benchmark report without waiting for the task to finish.
8+
9+
What SkyCallback is for
10+
--------------------------------------------
11+
12+
SkyCallback is designed for **machine learning tasks** which have a loop iterating many `steps`.
13+
SkyCallback measures the average time taken by each step, and extrapolates it to the total execution time of the task.
14+
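The extrapolation described above can be sketched in a few lines of Python. This only illustrates the arithmetic and is not SkyCallback's implementation: the function name ``extrapolate``, the sample step durations, the total step count, and the hourly price are all hypothetical.

```python
def extrapolate(step_seconds, total_steps, price_per_hour):
    """Extrapolate total time and cost from measured per-step durations."""
    # Average time per step over the steps measured so far.
    sec_per_step = sum(step_seconds) / len(step_seconds)
    # Scale the average up to the full run.
    est_hours = sec_per_step * total_steps / 3600
    est_cost = est_hours * price_per_hour
    return sec_per_step, est_hours, est_cost


# Hypothetical measurements: four steps timed so far, 7200 steps planned,
# on a VM assumed to cost $2.00 per hour.
sec_per_step, est_hours, est_cost = extrapolate(
    step_seconds=[0.5, 0.5, 0.5, 0.5], total_steps=7200, price_per_hour=2.0)
print(f"{sec_per_step:.2f} sec/step, {est_hours:.2f} hr, ${est_cost:.2f}")
# prints "0.50 sec/step, 1.00 hr, $2.00"
```

Because only an average over the completed steps is needed, an estimate like this is available long before the task finishes.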
15+
Installing SkyCallback
16+
--------------------------------------------
17+
18+
Unlike SkyPilot, SkyCallback must be installed and imported `in your program`.
19+
To install it, add the following line in the ``setup`` section of your task YAML.
20+
21+
.. code-block:: yaml
22+
23+
setup: |
24+
  # Activate conda or virtualenv if you use one
25+
  # Then, install SkyCallback
26+
  pip install "git+https://github.com/skypilot-org/skypilot.git#egg=sky-callback&subdirectory=sky/callbacks/"
27+
28+
29+
Using SkyCallback generic APIs
30+
--------------------------------------------
31+
32+
The SkyCallback generic APIs are for **PyTorch, TensorFlow, and JAX** programs that expose the training loop to the user.
33+
Below we provide the instructions for using the APIs.
34+
35+
First, import the SkyCallback package and initialize it using ``init``.
36+
37+
.. code-block:: python
38+
39+
import sky_callback
40+
sky_callback.init()
41+
42+
Next, mark the beginning and end of each step using one of the following three equivalent methods.
43+
44+
.. code-block:: python
45+
46+
# Method 1: wrap your iterable (e.g., dataloader) with `step_iterator`.
47+
from sky_callback import step_iterator
48+
for batch in step_iterator(train_dataloader):
49+
    ...
50+
51+
# Method 2: wrap your loop body with the `step` context manager.
52+
for batch in train_dataloader:
53+
    with sky_callback.step():
54+
        ...
55+
56+
# Method 3: call `step_begin` and `step_end` directly.
57+
for batch in train_dataloader:
58+
    sky_callback.step_begin()
59+
    ...
60+
    sky_callback.step_end()
61+
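To see why the three methods can be treated as equivalent, consider the following minimal sketch. ``ToyStepTimer`` is a hypothetical stand-in written for this illustration, not SkyCallback's actual implementation; it only mirrors the method names used above and assumes that each form reduces to a ``step_begin``/``step_end`` pair.

```python
import contextlib
import time


class ToyStepTimer:
    """Hypothetical stand-in for a step timer; not part of sky_callback."""

    def __init__(self):
        self.durations = []
        self._start = None

    def step_begin(self):
        self._start = time.perf_counter()

    def step_end(self):
        self.durations.append(time.perf_counter() - self._start)

    @contextlib.contextmanager
    def step(self):
        # Context-manager form: begin on entry, end on exit.
        self.step_begin()
        try:
            yield
        finally:
            self.step_end()

    def step_iterator(self, iterable):
        # Iterator form: the caller's loop body runs inside one step.
        for item in iterable:
            with self.step():
                yield item


timer = ToyStepTimer()
for batch in timer.step_iterator(range(3)):  # Method 1
    pass  # training work on `batch` would go here
for batch in range(3):  # Method 2
    with timer.step():
        pass
for batch in range(3):  # Method 3
    timer.step_begin()
    timer.step_end()
print(len(timer.durations))  # prints 9: three steps recorded per method
```

In this sketch the iterator and context-manager forms are thin wrappers around the explicit calls, so the choice between them is a matter of which fits the existing training loop best.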
62+
That's it.
63+
Now you can launch your task and get a detailed benchmark report using SkyPilot Benchmark CLI.
64+
`Here <https://github.com/skypilot-org/skypilot/blob/master/examples/benchmark/timm/callback.patch>`_ we provide an example of applying SkyCallback to PyTorch ImageNet training.
65+
66+
.. note::
67+
68+
Optionally, you can pass the total number of steps that the task will iterate through to ``sky_callback.init``.
69+
This information is needed to estimate the total execution time/cost of your task.
70+
71+
.. code-block:: python
72+
73+
sky_callback.init(
74+
    total_steps=num_epochs * len(train_dataloader),  # Optional
75+
)
76+
77+
.. note::
78+
In distributed training, also pass ``global_rank`` to ``sky_callback.init`` as follows:
79+
80+
.. code-block:: python
81+
82+
# PyTorch DDP users
83+
global_rank = torch.distributed.get_rank()
84+
85+
# Horovod users
86+
global_rank = hvd.rank()
87+
88+
sky_callback.init(
89+
    global_rank=global_rank,
90+
    total_steps=num_epochs * len(train_dataloader),  # Optional
91+
)
92+
93+
Integrations with ML frameworks
94+
----------------------------------------------------------
95+
96+
Using SkyCallback is even easier for **Keras, PyTorch Lightning, and HuggingFace Transformers** programs where trainer APIs are used.
97+
SkyCallback natively supports these frameworks through a simple interface.
98+
99+
* Keras example
100+
101+
.. code-block:: python
102+
103+
from sky_callback import SkyKerasCallback
104+
105+
# Add the callback to your Keras model.
106+
model.fit(..., callbacks=[SkyKerasCallback()])
107+
108+
`Here <https://github.com/skypilot-org/skypilot/blob/master/examples/benchmark/keras_asr/callback.patch>`_ you can find an example of applying SkyCallback to Keras ASR model training.
109+
110+
* PyTorch Lightning example
111+
112+
.. code-block:: python
113+
114+
from sky_callback import SkyLightningCallback
115+
116+
# Add the callback to your trainer.
117+
trainer = pl.Trainer(..., callbacks=[SkyLightningCallback()])
118+
119+
`Here <https://github.com/skypilot-org/skypilot/blob/master/examples/benchmark/lightning_gan/callback.patch>`_ you can find an example of applying SkyCallback to PyTorch Lightning GAN model training.
120+
121+
* HuggingFace Transformers example
122+
123+
.. code-block:: python
124+
125+
from sky_callback import SkyTransformersCallback
126+
127+
# Add the callback to your trainer.
128+
trainer = transformers.Trainer(..., callbacks=[SkyTransformersCallback()])
129+
130+
`Here <https://github.com/skypilot-org/skypilot/blob/master/examples/benchmark/transformers_qa/callback.patch>`_ you can find an example of applying SkyCallback to HuggingFace BERT fine-tuning.
131+
132+
.. note::
133+
When using the framework-integrated callbacks, do not call ``sky_callback.init`` for initialization.
134+
The callbacks will do it for you.
+78
@@ -0,0 +1,78 @@
1+
.. _benchmark-cli:
2+
3+
CLI
4+
=============
5+
6+
Workflow
7+
--------------------------------
8+
9+
You can use SkyPilot Benchmark by simply replacing your ``sky launch`` command with ``sky bench launch``:
10+
11+
.. code-block:: bash
12+
13+
# Launch mytask on a V100 VM and a T4 VM
14+
$ sky bench launch mytask.yaml --gpus V100,T4 --benchmark mybench
15+
16+
This command launches ``mytask.yaml`` on a V100 VM and a T4 VM simultaneously, under the benchmark name ``mybench``.
17+
After the task finishes, you can check the benchmark results using ``sky bench show``:
18+
19+
.. code-block:: bash
20+
21+
# Show the benchmark report on `mybench`
22+
$ sky bench show mybench
23+
24+
CLUSTER              RESOURCES                          STATUS    DURATION  SPENT($)  STEPS  SEC/STEP  $/STEP  EST(hr)  EST($)
25+
sky-bench-mybench-0  1x GCP(n1-highmem-8, {'V100': 1})  FINISHED  12m 51s   0.6317    -      -         -       -        -
26+
sky-bench-mybench-1  1x AWS(g4dn.xlarge, {'T4': 1})     FINISHED  16m 19s   0.1430    -      -         -       -        -
27+
28+
In the report, SkyPilot shows the duration and cost of ``mybench`` on each VM.
29+
The VMs can be terminated by either ``sky bench down`` or ``sky down``:
30+
31+
.. code-block:: bash
32+
33+
# Terminate all the clusters used for `mybench`
34+
$ sky bench down mybench
35+
36+
# Terminate all the clusters used for `mybench` except `sky-bench-mybench-0`
37+
$ sky bench down mybench --exclude sky-bench-mybench-0
38+
39+
# Terminate individual clusters as usual
40+
$ sky down sky-bench-mybench-0
41+
42+
.. note::
43+
44+
Each cluster launched by ``sky bench launch`` will automatically **stop** itself 5 minutes after the task is finished.
45+
However, you don't have to restart those clusters.
46+
Regardless of the status of the clusters, ``sky bench show`` will provide the benchmark results.
47+
48+
.. note::
49+
50+
SkyPilot Benchmark does not consider the time/cost of provisioning and setup.
51+
The columns (such as ``DURATION`` and ``SPENT($)``) in the report indicate the time/cost spent in executing the ``run`` section of your task YAML.
52+
53+
.. note::
54+
55+
Here, the columns other than ``DURATION`` and ``SPENT($)`` are empty.
56+
To get a complete benchmark report, please refer to :ref:`SkyCallback <benchmark-skycallback>`.
57+
58+
59+
Managing benchmark reports
60+
---------------------------
61+
62+
``sky bench ls`` shows the list of the benchmark reports you have:
63+
64+
.. code-block:: bash
65+
66+
# List all the benchmark reports
67+
$ sky bench ls
68+
69+
BENCHMARK  TASK     LAUNCHED             CANDIDATE 1                    CANDIDATE 2            CANDIDATE 3            CANDIDATE 4
70+
bert       bert_qa  2022-08-10 10:07:27  1x Standard_NC6_Promo (K80:1)  1x g4dn.xlarge (T4:1)  1x g5.xlarge (A10G:1)  1x n1-highmem-8 (V100:1)
71+
mybench    mytask   2022-08-10 11:24:27  1x n1-highmem-8 (V100:1)       1x g4dn.xlarge (T4:1)
72+
73+
To delete a benchmark report, use ``sky bench delete``:
74+
75+
.. code-block:: bash
76+
77+
# Delete the benchmark report on `mybench`
78+
$ sky bench delete mybench
+48
@@ -0,0 +1,48 @@
1+
.. _benchmark-yaml:
2+
3+
YAML Configuration
4+
===================
5+
6+
The resources to benchmark can be configured in the SkyPilot YAML interface.
7+
Below we provide an example:
8+
9+
.. code-block:: yaml
10+
11+
# Only `resources` is shown; the other fields are unchanged.
12+
resources:
13+
  cloud: gcp  # Works as a default value for `cloud`.
14+
15+
  # Added only for SkyPilot Benchmark.
16+
  candidates:
17+
  - {accelerators: A100}
18+
  - {accelerators: V100, instance_type: n1-highmem-16}
19+
  - {accelerators: T4, cloud: aws}  # Overrides `cloud` to `aws`.
20+
21+
For SkyPilot Benchmark, ``candidates`` is newly added under the ``resources`` field.
22+
``candidates`` is a list of dictionaries, each of which configures one set of resources to benchmark.
23+
Any subfield of ``resources`` (``accelerators``, ``instance_type``, etc.) can be re-defined in the dictionaries.
24+
Subfields defined outside ``candidates`` (e.g. ``cloud`` in this example) are used as default values and are overridden by those defined in the dictionaries.
25+
Thus, the above example can be interpreted as follows:
26+
27+
.. code-block:: yaml
28+
29+
# Configuration of the first candidate.
30+
resources:
31+
  cloud: gcp
32+
  accelerators: A100
33+
34+
# Configuration of the second candidate.
35+
resources:
36+
  cloud: gcp
37+
  accelerators: V100
38+
  instance_type: n1-highmem-16
39+
40+
# Configuration of the third candidate.
41+
resources:
42+
  cloud: aws
43+
  accelerators: T4
44+
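The override rule above amounts to a dictionary merge in which each candidate's subfields take precedence over the defaults. The following Python sketch illustrates that rule under this assumption; the function name ``expand_candidates`` is hypothetical, and this is not SkyPilot's actual parsing code.

```python
def expand_candidates(resources):
    """Merge the defaults in `resources` with each entry of `candidates`."""
    # Everything outside `candidates` acts as a default value.
    defaults = {k: v for k, v in resources.items() if k != "candidates"}
    # Later keys win, so a candidate's subfields override the defaults.
    return [{**defaults, **candidate}
            for candidate in resources.get("candidates", [])]


# The `resources` field from the example above, written as a Python dict.
resources = {
    "cloud": "gcp",
    "candidates": [
        {"accelerators": "A100"},
        {"accelerators": "V100", "instance_type": "n1-highmem-16"},
        {"accelerators": "T4", "cloud": "aws"},
    ],
}
expanded = expand_candidates(resources)
for config in expanded:
    print(config)
```

The three printed dictionaries correspond to the three candidate configurations listed above, with ``cloud`` set to ``aws`` only for the T4 candidate.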
45+
.. note::
46+
47+
Currently, SkyPilot Benchmark does not support on-prem jobs or managed spot jobs.
48+
While you can set ``use_spot: True`` to benchmark spot VMs, automatic recovery will not be provided when preemption occurs.
+43
@@ -0,0 +1,43 @@
1+
.. _benchmark-overview:
2+
3+
Benchmark
4+
================================================
5+
6+
SkyPilot allows **easy measurement of performance and cost of different kinds of cloud resources** through the benchmark feature.
7+
With minimal effort, you can find the right cloud resource for your task that fits your performance goals and budget constraints.
8+
9+
For example, say you want to fine-tune a BERT model and you do not know which GPU type is the best for you.
10+
With SkyPilot Benchmark, you can quickly run your task on different types of VMs and get a benchmark report like the following:
11+
12+
.. code-block:: bash
13+
14+
Legend:
15+
- STEPS: Number of steps taken.
16+
- SEC/STEP, $/STEP: Average time (cost) per step.
17+
- EST(hr), EST($): Estimated total time (cost) to complete the benchmark.
18+
19+
CLUSTER           RESOURCES                                 STATUS      DURATION  SPENT($)  STEPS  SEC/STEP  $/STEP    EST(hr)  EST($)
20+
sky-bench-bert-0  1x Azure(Standard_NC6_Promo, {'K80': 1})  TERMINATED  12m 48s   0.0384    1415   1.1548    0.000058  10.60    1.91
21+
sky-bench-bert-1  1x AWS(g4dn.xlarge, {'T4': 1})            TERMINATED  14m 2s    0.1230    2387   0.6429    0.000094  5.92     3.11
22+
sky-bench-bert-2  1x AWS(g5.xlarge, {'A10G': 1})            TERMINATED  13m 57s   0.2339    7423   0.1859    0.000052  1.75     1.76
23+
sky-bench-bert-3  1x GCP(n1-highmem-8, {'V100': 1})         TERMINATED  13m 45s   0.6768    7306   0.2005    0.000165  1.87     5.51
24+
25+
The report shows the benchmarking results of 4 VMs each with a different GPU type.
26+
Based on the report, you can pick the VM with either the lowest cost (``EST($)``) or the fastest execution time (``EST(hr)``), or find a sweet spot between them.
27+
In this example, AWS g5.xlarge (NVIDIA A10G GPU) turns out to be the best choice in terms of both cost and time.
28+
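The selection described above can be reproduced with a short Python sketch. The ``EST(hr)`` and ``EST($)`` values are copied from the example report; the tuple layout and variable names are assumptions made for illustration.

```python
# Rows copied from the example report: (cluster, EST(hr), EST($)).
report = [
    ("sky-bench-bert-0", 10.60, 1.91),
    ("sky-bench-bert-1", 5.92, 3.11),
    ("sky-bench-bert-2", 1.75, 1.76),
    ("sky-bench-bert-3", 1.87, 5.51),
]

# Pick the row with the smallest estimated time and the smallest estimated cost.
fastest = min(report, key=lambda row: row[1])
cheapest = min(report, key=lambda row: row[2])
print(fastest[0], cheapest[0])  # prints "sky-bench-bert-2 sky-bench-bert-2"
```

Here both criteria select the same cluster. When they disagree, the choice depends on how you weigh time against cost.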
29+
Using SkyPilot Benchmark
30+
------------------------
31+
32+
Part of the SkyPilot Benchmark report relies on the :ref:`SkyCallback <benchmark-skycallback>` library, which instruments the training code to report step completion.
33+
Depending on the level of detail you need in the benchmark report, SkyPilot Benchmark can be used in two modes:
34+
35+
1. Without SkyCallback - You can get a basic benchmark report using SkyPilot Benchmark :ref:`benchmark-cli`. **This requires zero changes in your code**.
36+
2. With SkyCallback - You can get a more detailed benchmark report **with a few lines of code changes**. Please refer to :ref:`SkyCallback <benchmark-skycallback>`.
37+
38+
Table of Contents
39+
-------------------
40+
.. toctree::
41+
cli
42+
config
43+
callback

examples/benchmark/keras_asr.yaml

+1-2
@@ -12,8 +12,7 @@ setup: |
1212
conda activate keras
1313
1414
# Install SkyCallback
15-
git clone git@github.com:skypilot-org/skypilot.git
16-
pip install skypilot/sky/callbacks/
15+
pip install "git+https://github.com/skypilot-org/skypilot.git#egg=sky-callback&subdirectory=sky/callbacks/"
1716
1817
# User setup
1918
pip install numpy pandas tensorflow

examples/benchmark/lightning_gan.yaml

+1-2
@@ -12,8 +12,7 @@ setup: |
1212
conda activate pl
1313
1414
# Install SkyCallback
15-
git clone git@github.com:skypilot-org/skypilot.git
16-
pip install skypilot/sky/callbacks/
15+
pip install "git+https://github.com/skypilot-org/skypilot.git#egg=sky-callback&subdirectory=sky/callbacks/"
1716
1817
# User setup
1918
pip install "torchvision" "pytorch-lightning>=1.4" "torch>=1.6, <1.9"

examples/benchmark/timm.yaml

+1-2
@@ -12,8 +12,7 @@ setup: |
1212
conda activate timm
1313
1414
# Install SkyCallback
15-
git clone git@github.com:skypilot-org/skypilot.git
16-
pip install skypilot/sky/callbacks/
15+
pip install "git+https://github.com/skypilot-org/skypilot.git#egg=sky-callback&subdirectory=sky/callbacks/"
1716
1817
# User setup
1918
git clone https://github.com/rwightman/pytorch-image-models.git timm

examples/benchmark/transformers_qa.yaml

+1-2
@@ -12,8 +12,7 @@ setup: |
1212
conda activate hf
1313
1414
# Install SkyCallback
15-
git clone git@github.com:skypilot-org/skypilot.git
16-
pip install skypilot/sky/callbacks/
15+
pip install "git+https://github.com/skypilot-org/skypilot.git#egg=sky-callback&subdirectory=sky/callbacks/"
1716
1817
# User setup
1918
pip install transformers
