
Commit c21253b

Storage mounting (skypilot-org#658)
* squash
* fix
* yapf workaround
1 parent 10b4b79 commit c21253b

24 files changed: +1704 −732 lines

docs/source/reference/storage.rst (+181 −85)
@@ -1,6 +1,7 @@
 .. _sky-storage:
+
 Sky Storage
-=======
+===========
 
 A Sky Storage object represents an abstract data store containing large data
 files required by the task. Compared to file_mounts, storage is faster and
@@ -9,13 +10,20 @@ Behind the scenes, storage automatically uploads all data in the source
 to a backing object store in a particular cloud (S3/GCS/Azure Blob).
 
 A storage object is used by "mounting" it to a task. On mounting, the data
-specified in the source becomes available at the destination mount_path.
-Please note that sky.Storage does not guarantee preservation of file
-permissions - you may need to set file permissions during task execution.
+specified in the source becomes available at the destination mount path.
+
+A storage object can be used in either :code:`MOUNT` mode or :code:`COPY` mode.
+
+* In :code:`MOUNT` mode, the backing store is directly "mounted" to the remote VM.
+  I.e., files are fetched when accessed by the task and files written to the
+  mount path are also written to the remote store.
+
+* In :code:`COPY` mode, the files are pre-fetched and cached on the local disk.
+  Writes are not replicated on the remote store.
 
 .. note::
-  Sky file mounting currently does not support syncing writes.
-  Any writes made at a mounted folder will not reflect at the mounting source.
+  sky.Storage does not guarantee preservation of file
+  permissions - you may need to set file permissions during task execution.
 
 Using Sky Storage
 -----------------
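The :code:`MOUNT`/:code:`COPY` distinction introduced in this hunk can be illustrated with a minimal :code:`file_mounts` fragment. This is a sketch only; the bucket name :code:`my-bucket` and both mount paths are hypothetical placeholders:

```yaml
# Sketch of the MOUNT vs. COPY behavior described above.
# `my-bucket` and the mount paths are hypothetical placeholders.
file_mounts:
  # MOUNT (default): files are streamed from the bucket on access;
  # writes under /data-mount are also written to the remote store.
  /data-mount:
    name: my-bucket
    mode: MOUNT

  # COPY: files are pre-fetched and cached on local disk;
  # writes under /data-copy are NOT replicated to the bucket.
  /data-copy:
    name: my-bucket
    mode: COPY
```

The same bucket can back both entries; only the access and write-replication behavior differs between the two modes.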
@@ -27,141 +35,229 @@ the files to a cloud store (e.g. S3, GCS) and have them persist there by
 specifying the :code:`name`, :code:`source` and :code:`persistent` fields. By
 enabling persistence, file_mount sync can be made significantly faster.
 
-.. note::
-  Symbolic links are handled differently in :code:`file_mounts` depending on whether Sky Storage is used. For mounts backed by Sky Storage, referenced data for all symbolic links is copied to remote. For mounts not using Sky Storage (e.g., those using rsync) the symbolic links are directly copied. Their targets must be separately mounted or else the symlinks may break.
+Your usage of sky storage can fall under four broad use cases:
+
+1. **You want to upload your local data to the remote VM -** specify the name and
+   source fields. Name sets the bucket name that will be used, and source
+   specifies the local path to be uploaded.
+
+2. **You want to mount an existing S3/GCS bucket to your remote VM -** specify
+   just the source field (e.g., s3://my-bucket/)
+
+3. **You want to have a writable path to directly write files to S3 buckets
+   -** specify a name (to create a bucket if it doesn't exist) and set the mode
+   to MOUNT. This is useful for writing code outputs, such as checkpoints or
+   logs, directly to an S3 bucket.
+
+4. **You want to have a shared file-system across workers running on different
+   nodes -** specify a name (to create a bucket if it doesn't exist) and set
+   the mode to MOUNT. This will create an empty scratch space that workers
+   can write to. Any writes will show up on all workers' mount points.
+
+When specifying a storage object, you can specify either of two modes:
+
+- :code:`mode: MOUNT` (default)
+  This mode directly mounts the bucket at the specified path on the VM.
+  In effect, files are streamed from the backing source bucket as and when
+  they are accessed by applications. This mode also allows applications to
+  write to the mount path. All writes are replicated to the remote bucket (and
+  to any other VMs mounting the same bucket). Please note that this mode
+  uses a close-to-open consistency model, which means a file write is
+  committed to the backing store only after :code:`close()` is called on it.
+
+- :code:`mode: COPY`
+  This mode pre-fetches your files from remote storage and caches them on the
+  local disk. Note that in this mode, any writes to the mount path are not
+  replicated to the source bucket.
+
+Here are a few examples covering a range of use cases for sky file_mounts
+and storage mounting:
 
 .. code-block:: yaml
 
   name: storage-demo
 
   resources:
     cloud: aws
-    instance_type: m5.2xlarge
+
 
   file_mounts:
-    # This uses rsync to directly copy files from your machine to the remote
-    # VM at /datasets. Since this uses rsync, the ~/datasets folder is
-    # uploaded on each execution.
+    # *** Copying files from local ***
+    #
+    # This uses rsync to directly copy files from your machine to the remote VM at
+    # /datasets.
     /datasets: ~/datasets
 
+    # *** Copying files from S3 ***
+    #
+    # This re-uses a predefined bucket (public bucket used here, but can be
+    # private) and copies its contents directly to /datasets-s3.
+    /datasets-s3: s3://enriched-topical-chat
+
+    # *** Copying files from GCS ***
+    #
+    # This copies a single object (train-00001-of-01024) from a remote cloud
+    # storage to local disk.
+    /train-00001-of-01024: gs://cloud-tpu-test-datasets/fake_imagenet/train-00001-of-01024
+
+    # *** Persistent Data Storage by copying from S3 ***
+    #
     # This uses sky Storage to first create a S3 bucket named sky-dataset,
     # copies the contents of ~/datasets to the remote bucket and makes the
     # bucket persistent (i.e., the bucket is not deleted after the completion of
     # this sky task, and future invocations of this bucket will be much faster).
-    # The bucket is mounted at /datasets-storage.
-    # If the bucket already exists, it is fetched and re-used.
+    # When the VM is initialized, the contents of the bucket are copied to
+    # /datasets-storage. If the bucket already exists, it is fetched and re-used.
     /datasets-storage:
-      name: sky-dataset
+      name: sky-dataset-romil  # Make sure this name is unique or you own this bucket
       source: ~/datasets
-      force_stores: [s3]  # Could be [s3, gcs], [gcs] default: None
+      store: s3  # Could be either of [s3, gcs]; default: None
       persistent: True  # Defaults to True, can be set to false.
+      mode: COPY  # Defaults to MOUNT if not specified
+
+    # *** Persistent Data Storage by MOUNTING S3 ***
+    #
+    # This uses the exact same storage object defined above, but uses the MOUNT
+    # mode. This means instead of copying contents of the remote bucket to the VM,
+    # sky "mounts" the bucket at /dataset-storage-mount. Files are streamed from
+    # S3 as they are read by the task. Any writes made at /dataset-storage-mount
+    # are also replicated on the remote S3 bucket and any other storage mounts
+    # using the same bucket with MOUNT mode. Note that the source is synced with
+    # the remote bucket every time this task is run.
+    /dataset-storage-mount:
+      name: sky-dataset-romil
+      source: ~/datasets
+      mode: MOUNT
+
+    # *** Mounting very large public buckets ***
+    #
+    # This uses the MOUNT mode to mount a 3.5 TB public bucket at the
+    # specified path. Since MOUNT mode is used, the bucket is not copied at init;
+    # instead, contents are streamed from S3 as they are requested. This saves disk
+    # space on the remote VM.
+    # Since this is a public bucket, any writes to the path will fail.
+    /huge-dataset-mount:
+      source: s3://digitalcorpora
+      mode: MOUNT
+
+    # *** Collecting outputs of tasks on S3 ***
+    #
+    # This uses the MOUNT mode to create an output mount path. This creates an
+    # empty bucket with the specified name and mounts it at the path.
+    # Any files written to /outputs-mount will also be synced to romil-output-bucket.
+    # This is useful when you want to collect outputs of your task directly in an
+    # S3 bucket and browse it from your laptop later.
+    #
+    # Since writes are synced across workers mounting the same bucket,
+    # this approach can also be used to create a shared filesystem across workers.
+    # See examples/storage/pingpong.yaml for an example.
+    /outputs-mount:
+      name: romil-output-bucket
+      mode: MOUNT
 
   run: |
     pwd
     ls -la /
 
-If you have files that already exist on remote cloud stores, you can also
-directly mount s3/gcs buckets and objects in your remote VM by providing the
-path to the s3/gcs bucket or object.
+  # Remember to run `sky storage ls` and `sky storage delete` to delete the
+  # created storage objects!
 
-.. code-block:: yaml
+.. note::
+  Stopping a running cluster will cause any Storage mounted with :code:`MOUNT`
+  mode to be unmounted. These mounts will not be re-mounted on running
+  :code:`sky start`, or even :code:`sky exec`. Please run :code:`sky launch`
+  again on the same cluster to ensure :code:`MOUNT` mode Storages are mounted
+  again.
 
-  name: storage-demo
+.. note::
+  Symbolic links are handled differently in :code:`file_mounts` depending on whether Sky Storage is used.
+  For mounts backed by Sky Storage, referenced data for all symbolic links is copied to remote.
+  For mounts not using Sky Storage (e.g., those using rsync) the symbolic links are directly copied.
+  Their targets must be separately mounted or else the symlinks may break.
 
-  resources:
-    cloud: aws
-    instance_type: m5.2xlarge
+Creating a shared file system
+-----------------------------
+
+Sky Storage can also be used to create a shared file-system that multiple tasks
+on different nodes can read and write to. This allows developers to pass files
+between workers and even use files as a medium for inter-process communication (IPC).
+
+To create a shared filesystem, simply create a Storage object without a source
+and use mount mode when attaching it to your tasks like so:
+
+.. code-block:: yaml
 
   file_mounts:
-    # This re-uses a predefined bucket (sky-dataset, defined above) and
-    # mounts it directly at /datasets-s3.
-    /datasets-s3: s3://sky-dataset
+    /sharedfs:
+      name: my-sky-sharedfs
+      mode: MOUNT
 
-    # This copies a single object (train-00001-of-01024) from a remote cloud
-    # storage to local disk.
-    /train-00001-of-01024: gs://cloud-tpu-test-datasets/fake_imagenet/train-00001-of-01024
 
-  run: |
-    pwd
-    ls -la /
+Here is a `simple example <https://github.com/sky-proj/sky/blob/master/examples/storage/pingpong.yaml>`_
+using sky storage to perform communication between processes using files.
 
-Alternate Usage - Declarative Storage API
-------------------------------------------
-.. warning::
-  The declarative storage YAML API has been deprecated.
-  If you need to create Storage objects but not mount them, use the storage
-  CLI once it is supported.
 
-Some power users may want to only upload their files to an object store
-without mounting it, while others may want to re-use pre-existing storage
-objects. They can do so using the storage and storage_mount fields, which are
-at 1:1 parity with the sky.Storage python API.
+Using Sky Storage CLI tools
+---------------------------
 
-Here's an example using the declarative API.
+To manage persistent Storage objects, the sky CLI provides two useful commands -
+:code:`sky storage ls` and :code:`sky storage delete`.
 
-.. code-block:: yaml
+1. :code:`sky storage ls` shows the currently provisioned Storage objects.
 
-  name: storage-demo
+.. code-block:: console
 
-  resources:
-    cloud: aws
-    instance_type: m5.2xlarge
+  $ sky storage ls
+  NAME               CREATED     STORE  COMMAND                                        STATUS
+  sky-dataset-romil  3 mins ago  S3     sky launch -c demo examples/storage_demo.yaml  READY
 
-  storage:
-    - name: sky-dataset-decl
-      source: ~/datasets
-      #force_stores: [s3] # Could be [s3, gcs], [gcs] default: None
-      persistent: True
+2. :code:`sky storage delete` allows you to delete any Storage objects managed
+   by sky.
 
-  storage_mounts:
-    - storage: sky-dataset-decl # Name of the storage defined above
-      mount_path: /datasets-decl # Path to mount the storage at
+.. code-block:: console
 
-  run: |
-    pwd
-    ls -la /
+  $ sky storage delete sky-dataset-romil
+  Deleting storage object sky-dataset-romil...
+  I 04-02 19:42:24 storage.py:336] Detected existing storage object, loading Storage: sky-dataset-romil
+  I 04-02 19:42:26 storage.py:683] Deleting S3 Bucket sky-dataset-romil
 
+.. note::
+  :code:`sky storage ls` only shows Storage objects whose buckets were created
+  by sky. Storage objects using externally managed buckets or public buckets
+  are not listed in :code:`sky storage ls` and cannot be managed through sky.
 
-Storage YAML field reference:
+Storage YAML reference
+----------------------
 
 ::
 
-  storage: List[sky.Storage]
+  sky.Storage
 
   Fields:
   sky.Storage.name: str
-    Identifier for the storage object, used as reference in storage_mount
+    Identifier for the storage object.
 
   sky.Storage.source: str
     The source attribute specifies the local path that must be made available
     in the storage object. It can either be a local path, in which case data
     is uploaded to the cloud to an appropriate object store (s3 or gcs), or it
-    can be a remote path (s3://, gs://), in which case it is mounted directly.
+    can be a remote path (s3://, gs://), in which case it is copied or mounted
+    directly (see the mode flag below).
 
-  sky.Storage.force_stores: List[str]
-    If you wish to force sky.Storage to be backed by specific cloud object
-    stores, you can specify them here. If the Storage object does not already
-    exist there, it will be replicated onto those clouds.
+  sky.Storage.store: str; either of 's3' or 'gcs'
+    If you wish to force sky.Storage to be backed by a specific cloud object
+    store, you can specify it here.
 
-  sky.Storage.persistent: str
+  sky.Storage.persistent: bool
     Whether the remote backing stores in the cloud should be deleted after
     execution of this task or not. Set to True to avoid uploading files again
     in subsequent runs (at the cost of storing your data in the cloud). If
     files change between runs, new files are synced to the bucket.
 
-
-Storage Mounts YAML field reference:
-
-::
-
-  storage_mounts: List[sky.storage_mounts]
-
-  Storage mounts specify where the storage objects defined above should be
-  mounted when the task is run.
-
-  Fields:
-  sky.StorageMount.storage: str
-    Name reference to the storage object being mounted
-
-  sky.StorageMount.mount_path: str
-    Path where the storage object is to be mounted
+  sky.Storage.mode: str; either of MOUNT or COPY, defaults to MOUNT
+    Whether to mount the storage object by copying files, or actually
+    mounting the remote storage object. With MOUNT mode, files are streamed
+    from the remote object store and writes are replicated to the object
+    store (and consequently, to other workers mounting the same Storage).
+    With COPY mode, files are copied at VM initialization and any writes to
+    the mount path will not be replicated on the object store.
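Taken together, the changes in this file replace the old :code:`force_stores` list with a single :code:`store` field and add a :code:`mode` field. A task YAML written against the old schema would be updated roughly as follows (a sketch; the bucket name is a placeholder):

```yaml
# Old schema (before this commit):
#   /datasets-storage:
#     name: sky-dataset
#     source: ~/datasets
#     force_stores: [s3]
#     persistent: True
#
# New schema (after this commit):
/datasets-storage:
  name: sky-dataset    # placeholder; bucket names must be unique
  source: ~/datasets
  store: s3            # single store instead of the force_stores list
  persistent: True
  mode: COPY           # new field; defaults to MOUNT if omitted
```

Mounting the storage at a path now happens directly in :code:`file_mounts`, since the separate :code:`storage_mounts` section is removed by this diff.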

docs/source/reference/yaml-spec.rst (+5 −4)
@@ -67,10 +67,11 @@ describe all fields available.
   #
   # Mounts the bucket at /datasets-storage on every node of the cluster.
   /datasets-storage:
-    name: sky-dataset
-    source: /local/path/datasets
-    force_stores: [s3]  # Could be [s3, gcs], [gcs]; default: None
-    persistent: True  # Defaults to True; can be set to false
+    name: sky-dataset  # Name of storage, optional when source is bucket URI
+    source: /local/path/datasets  # Source path, can be local or s3/gcs URL. Optional, do not specify to create an empty bucket.
+    store: s3  # Could be either 's3' or 'gcs'; default: None. Optional.
+    persistent: True  # Defaults to True; can be set to false. Optional.
+    mode: MOUNT  # Either MOUNT or COPY. Optional.
 
   # Copies a cloud object store URI to the cluster. Can be private buckets.
   /datasets-s3: s3://my-awesome-dataset
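Putting the yaml-spec fields together, a minimal end-to-end task file might look like the following. This is a sketch under the spec above; the task name, bucket name, and paths are hypothetical placeholders:

```yaml
name: storage-output-demo

resources:
  cloud: aws

file_mounts:
  # No source: an empty bucket is created (if needed) and mounted read-write.
  /outputs:
    name: my-unique-output-bucket  # placeholder; must be globally unique
    mode: MOUNT

run: |
  echo "hello" > /outputs/hello.txt  # write is replicated to the bucket
  ls -la /outputs
```

As the storage.rst changes above note, the bucket persists after the task finishes; it can be listed and removed later with :code:`sky storage ls` and :code:`sky storage delete`.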
