SkyPilot supports running jobs on Google's `Cloud TPU <https://cloud.google.com/tpu>`_, a specialized hardware accelerator for ML workloads.


Free TPUs via TPU Research Cloud (TRC)
======================================

ML researchers and students are encouraged to apply for free TPU access through the `TPU Research Cloud (TRC) <https://sites.research.google/trc/about/>`_ program!


Getting TPUs in one command
===========================

Like :ref:`GPUs <interactive-nodes>`, SkyPilot provides a simple command to quickly get TPUs for development:

.. code-block:: bash

   sky tpunode                                 # Launch a TPU Node (the default)
   sky tpunode --instance-type n1-highmem-16   # Change the host VM type to n1-highmem-16
   sky tpunode --tpu-vm                        # Use TPU VM (instead of TPU Node)

After the command finishes, you will be dropped into a TPU host VM and can start developing code right away.
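
For example, on a TPU VM you can quickly sanity-check that the TPU chips are visible. This is a minimal check, assuming JAX with TPU support is installed on the VM (any TPU-aware framework works):

.. code-block:: console

   $ python3 -c "import jax; print(jax.device_count())"
   8
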
Below, we show examples of using SkyPilot to run MNIST training on (1) TPU VMs and (2) TPU Nodes.


TPU Architectures
=================

Two different TPU architectures are available on GCP:

- TPU VMs
- TPU Nodes

Both are supported by SkyPilot. We recommend TPU VMs, the newer architecture encouraged by GCP.

The two architectures differ as follows.
For TPU VMs, you can directly SSH into the "TPU host" VM that is physically connected to the TPU device.
For TPU Nodes, a user VM (an ``n1`` instance) must be separately provisioned to communicate with an inaccessible TPU host over gRPC.
More details can be found in the GCP `documentation <https://cloud.google.com/tpu/docs/system-architecture-tpu-vm#tpu-arch>`_.

TPU VMs
-------

To use TPU VMs, set the following in a task YAML's ``resources`` field:

.. code-block:: yaml

   resources:
      accelerators: tpu-v2-8
      accelerator_args:
         tpu_vm: True
         runtime_version: tpu-vm-base  # optional

The ``accelerators`` field specifies the TPU type, and the :code:`accelerator_args` dict includes the :code:`tpu_vm` bool (defaults to false, which means a TPU Node is used) and an optional TPU ``runtime_version`` field.
To show what TPU types are supported, run :code:`sky show-gpus`.
Here is a complete task YAML that runs `MNIST training <https://cloud.google.com/tpu/docs/run-calculation-jax#running_jax_code_on_a_tpu_vm>`_ on a TPU VM using JAX.

.. code-block:: yaml

   name: mnist-tpu-vm

   resources:
      accelerators: tpu-v2-8
      accelerator_args:
         tpu_vm: True
         runtime_version: tpu-vm-base

   setup: |
      git clone https://github.com/google/flax.git

      conda activate flax
      if [ $? -eq 0 ]; then
         echo 'conda env exists'
      else
         conda create -n flax python=3.8 -y
         conda activate flax
         # Make sure to install TPU related packages in a conda env to avoid package conflicts.
         # (The install and run commands below were truncated in the source;
         # they follow the officially documented JAX-on-TPU setup for the
         # Flax MNIST example.)
         pip install "jax[tpu]>=0.2.16" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
         pip install --upgrade clu
         pip install -e flax
      fi

   run: |
      conda activate flax
      cd flax/examples/mnist
      python3 main.py --workdir=/tmp/mnist \
         --config=configs/default.py \
         --config.learning_rate=0.05 \
         --config.num_epochs=10

This YAML lives under the `SkyPilot repo <https://github.com/skypilot-org/skypilot/tree/master/examples/tpu>`_ (``examples/tpu/tpuvm_mnist.yaml``), or you can paste it into a local file.
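
Launch it with:

.. code-block:: console

   $ sky launch examples/tpu/tpuvm_mnist.yaml


TPU Nodes
---------

To use a TPU Node, a separate host VM must be provisioned next to the TPU. The ``resources`` spec below is a sketch reconstructed from the surrounding text (the sentence after it names the host machine and TPU type; the runtime version mirrors the complete example further down):

.. code-block:: yaml

   resources:
      instance_type: n1-highmem-8  # Host VM that talks to the TPU over gRPC.
      accelerators: tpu-v2-8
      accelerator_args:
         runtime_version: 2.5.0  # TPU software version.
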
The above YAML uses :code:`n1-highmem-8` as the host machine and :code:`tpu-v2-8` as the TPU Node resource.
You can modify the host instance type or the TPU type.
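
For example, here is a sketch with a larger host VM and a v3 TPU; both values are illustrative (``n1-highmem-16`` appears in the ``sky tpunode`` examples above, and ``tpu-v3-8`` in the :code:`sky show-gpus` output below):

.. code-block:: yaml

   resources:
      instance_type: n1-highmem-16
      accelerators: tpu-v3-8
      accelerator_args:
         runtime_version: 2.5.0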
Here is a complete task YAML that runs `MNIST training <https://cloud.google.com/tpu/docs/tutorials/mnist-2.x>`_ on a TPU Node using TensorFlow.

.. code-block:: yaml

   name: mnist-tpu-node

   resources:
      accelerators: tpu-v2-8
      accelerator_args:
         runtime_version: 2.5.0  # TPU software version to be used.

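   # (The file_mounts/setup portions were truncated in the source. This is a
   # sketch: a SkyPilot Storage spec that creates a new GCS bucket, which the
   # note below refers to; the bucket name and package pins are illustrative.)
   file_mounts:
      /dataset:
         name: mnist-tpu-node  # Creates a new GCS bucket with this name.
         store: gcs
         mode: MOUNT

   setup: |
      git clone https://github.com/tensorflow/models.git
      conda create -n mnist python=3.8 -y
      conda activate mnist
      pip install tensorflow==2.5.0 tensorflow-datasets
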
   # The command to run. Will be run under the working directory.
   run: |
      conda activate mnist
      cd models/official/legacy/image_classification/
      # (Command sketch: the remaining lines were truncated in the source.
      # Flags follow the GCP MNIST tutorial; note the use of $TPU_NAME,
      # explained in the note below. Paths under /dataset are illustrative.)
      python3 mnist_main.py --tpu=${TPU_NAME} \
         --model_dir=/dataset/mnist-model \
         --data_dir=/dataset \
         --train_epochs=10 \
         --distribution_strategy=tpu \
         --download

.. note::

   TPU node requires loading data from a GCS bucket. The :code:`file_mounts` spec above simplifies this by using :ref:`SkyPilot Storage <sky-storage>` to create a new bucket or mount an existing one.
   If you encounter a bucket :code:`Permission denied` error,
   make sure the bucket is created in the same region as the host VM/TPU Node and that IAM permission for Cloud TPU is properly set up.

   The special environment variable :code:`$TPU_NAME` is automatically set by SkyPilot at run time, so it can be used in the ``run`` commands.

This YAML lives under the `SkyPilot repo <https://github.com/skypilot-org/skypilot/tree/master/examples/tpu>`_ (``examples/tpu/tpu_node_mnist.yaml``). Launch it with:
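
.. code-block:: console

   $ sky launch examples/tpu/tpu_node_mnist.yaml
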


Using TPU Pods
==============

A `TPU Pod <https://cloud.google.com/tpu/docs/training-on-tpu-pods>`_ is a collection of TPU devices connected by dedicated high-speed network interfaces for high-performance training.

To use a TPU Pod, simply change the ``accelerators`` field in the task YAML (e.g., :code:`v2-8` -> :code:`v2-32`).

.. code-block:: yaml
   :emphasize-lines: 2-2

   resources:
      accelerators: tpu-v2-32  # Pods have > 8 cores (the last number)
      accelerator_args:
         runtime_version: tpu-vm-base
         tpu_vm: True

.. note::

   Both TPU architectures, TPU VMs and TPU Nodes, can be used with TPU Pods. The example below is based on TPU VMs.

To show all available TPU Pod types, run :code:`sky show-gpus` (more than 8 cores means Pods):

.. code-block:: console

   GOOGLE_TPU    AVAILABLE_QUANTITIES
   tpu-v2-8      1
   tpu-v2-32     1
   tpu-v2-128    1
   tpu-v2-256    1
   tpu-v2-512    1
   tpu-v3-8      1
   tpu-v3-32     1
   tpu-v3-64     1
   tpu-v3-128    1
   tpu-v3-256    1
   tpu-v3-512    1
   tpu-v3-1024   1
   tpu-v3-2048   1

After creating a TPU Pod, multiple host VMs (e.g., :code:`v2-32` comes with 4 host VMs) are launched.
Normally, the user needs to SSH into all hosts (depending on the architecture used, either the ``n1`` user VMs or the TPU host VMs) to prepare files and set up environments, and
then launch the job on each host, which is a tedious and error-prone process.

SkyPilot automates away this complexity. From your laptop, a single :code:`sky launch` command will perform:

- workdir/file_mounts syncing; and
- executing the setup/run commands on every host of the pod.

Here is a task YAML for a CIFAR-10 training job on a :code:`v2-32` TPU Pod with JAX (`code repo <https://github.com/infwinston/tpu-example>`_):

.. code-block:: yaml

   name: cifar-tpu-pod

   resources:
      accelerators: tpu-v2-32
      accelerator_args:
         runtime_version: tpu-vm-base
         tpu_vm: True

   setup: |
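      # (The original setup/run commands were truncated in the source. A
      # plausible sketch follows, assuming the training script lives in the
      # linked code repo; the script name and package pins are illustrative,
      # while the JAX TPU wheel index is the officially documented one.)
      git clone https://github.com/infwinston/tpu-example.git
      pip install "jax[tpu]>=0.2.16" -f https://storage.googleapis.com/jax-releases/libtpu_releases.html
      pip install --upgrade clu tensorflow tensorflow_datasets

   run: |
      python3 tpu-example/train.py

As before, launch the job with :code:`sky launch`. SkyPilot syncs your files and runs the setup/run commands on every host of the Pod.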