Commit 1f7a5cd

[Core] Port ray 2.0.1 (skypilot-org#1133)
* Update ray node provider to 2.0.0, update patches, adapt to ray functions in 2.0.0, update azure-cli version for faster installation, and format.

Squashed commits carried along with this port:

[Onprem] Automatically install sky dependencies (skypilot-org#1116)
* Remove root user, move ray cluster to admin
* Automatically install sky dependencies
* Fix admin alignment
* Address Romil's comments

Add `--retry-until-up`, `--region`, `--zone`, and `--idle-minutes-to-autostop` for interactive nodes (skypilot-org#1207)
* Add --retry-until-up flag for interactive nodes
* Add --region flag for interactive nodes
* Add --idle-minutes-to-autostop flag for interactive nodes
* Add --zone flag for interactive nodes
* Update help messages

Add all region option in catalog fetcher and speed up azure fetcher (skypilot-org#1204)
* Port changes and format; add t2a exclusion back; fix A100 for GCP; fix aws fetching for p4de.24xlarge; fill GpuInfo
* Add test for A100; patch GpuInfo; add generation info; add capabilities back to azure and fix aws
* Fix azure catalog; remove zone from azure; add catalog analysis; keep backward compatibility for azure_catalog
* Fix GCP catalog; fix A100-80GB; increase version number; only keep useful columns for aws; remove capabilities from azure; add az to AWS

Revert "Add `--retry-until-up`, `--region`, `--zone`, and `--idle-minutes-to-autostop` for interactive nodes (skypilot-org#1207)" (skypilot-org#1220). This reverts commit f06416d.

[Storage] Add `StorageMode` to __init__ (skypilot-org#1223)

[Example] Minimal containerized app example with explicit StorageMode (skypilot-org#1212)

Fix Mac version in setup.py (skypilot-org#1224)

Reduce iops for aws instances (skypilot-org#1221): set the default iops to be the same as the console for AWS.

Revert "Reduce iops for aws instances (skypilot-org#1221)" (skypilot-org#1229). This reverts commit 29f1458.

Update backward-compatibility test.

Support for autodown: change the API to terminate, then rename terminate to down; add a smoke test (using gcp for the autodown test); fix flags and syntax; address comments; finally switch back to terminate/tear down.

* Fix rebase issue
* Address comments
* Fix setup.py
* Upgrade to 2.0.1
* Fix docs for ray version
* Fix example
* Fix backward compatibility test
* Fix onprem job submission
* Add steps for backward compat test
1 parent 5f81002 commit 1f7a5cd

27 files changed: +192 -139 lines

docs/source/reference/local/setup.rst (+2 -2)

@@ -14,13 +14,13 @@ For further reference, `here <https://docs.ray.io/en/latest/ray-core/configure.h
 Installing SkyPilot dependencies
 -----------------------------------
 
-SkyPilot On-prem requires :code:`python3`, :code:`ray==1.13.0`, and :code:`sky` to be setup on all local nodes and globally available to all users.
+SkyPilot On-prem requires :code:`python3`, :code:`ray==2.0.1`, and :code:`sky` to be setup on all local nodes and globally available to all users.
 
 To install Ray and SkyPilot for all users, run the following commands on all local nodes:
 
 .. code-block:: console
 
-   $ pip3 install ray[default]==1.13.0
+   $ pip3 install ray[default]==2.0.1
 
    $ # SkyPilot requires python >= 3.6.
    $ pip3 install skypilot

examples/local/cluster-config.yaml (+1 -1)

@@ -4,7 +4,7 @@
 # The system administrator must have `sudo` access to the local nodes.
 # Requirements:
 # 1) Python (> 3.6) on all nodes.
-# 2) Ray CLI (= 1.13.0) on all nodes.
+# 2) Ray CLI (= 2.0.1) on all nodes.
 #
 # Example usage:
 # >> sky admin deploy cluster-config.yaml

sky/backends/backend_utils.py (+1 -1)

@@ -1418,7 +1418,7 @@ def _ray_launch_hash(cluster_name: str, ray_config: Dict[str, Any]) -> Set[str]:
         return set(ray_launch_hashes)
     with suppress_output():
         ray_config = ray_commands._bootstrap_config(ray_config)  # pylint: disable=protected-access
-    # Adopted from https://github.com/ray-project/ray/blob/ray-1.13.0/python/ray/autoscaler/_private/node_launcher.py#L56-L64
+    # Adopted from https://github.com/ray-project/ray/blob/ray-2.0.1/python/ray/autoscaler/_private/node_launcher.py#L87-L97
     # TODO(zhwu): this logic is duplicated from the ray code above (keep in sync).
     launch_hashes = set()
     head_node_type = ray_config['head_node_type']
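
For context, the launch hash groups nodes that were requested with an identical launch config, so one hash covers a batch of identical nodes. A rough sketch of the hashing idea (illustrative only; ray 2.0.1's node_launcher.py uses its own fields and encoding):

    import hashlib
    import json

    def launch_config_hash(node_config: dict) -> str:
        # Canonical JSON so dict key order does not affect the hash.
        canonical = json.dumps(node_config, sort_keys=True)
        return hashlib.sha1(canonical.encode()).hexdigest()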

sky/backends/cloud_vm_ray_backend.py (+8 -6)

@@ -154,7 +154,7 @@ def add_prologue(self,
         # Should use 'auto' or 'ray://<internal_head_ip>:10001' rather than
         # 'ray://localhost:10001', or 'ray://127.0.0.1:10001', for public cloud.
         # Otherwise, it will a bug of ray job failed to get the placement group
-        # in ray <= 2.0.0.
+        # in ray <= 2.0.1.
         # TODO(mluo): Check why 'auto' not working with on-prem cluster and
         # whether the placement group issue also occurs in on-prem cluster.
         ray_address = 'ray://localhost:10001' if is_local else 'auto'

@@ -401,6 +401,8 @@ def add_epilogue(self) -> None:
             # Need this to set the job status in ray job to be FAILED.
             sys.exit(1)
         else:
+            sys.stdout.flush()
+            sys.stderr.flush()
             job_lib.set_status({self.job_id!r}, job_lib.JobStatus.SUCCEEDED)
         # This waits for all streaming logs to finish.
         time.sleep(1)

@@ -1341,7 +1343,7 @@ def _ensure_cluster_ray_started(self,
         if isinstance(launched_resources.cloud, clouds.Local):
             raise RuntimeError(
                 'The command `ray status` errored out on the head node '
-                'of the local cluster. Check if ray[default]==1.13.0 '
+                'of the local cluster. Check if ray[default]==2.0.1 '
                 'is installed or running correctly.')
         backend.run_on_head(handle, 'ray stop', use_cached_head_ip=False)

@@ -2066,8 +2068,8 @@ def _exec_code_on_head(
         else:
             job_submit_cmd = (
                 f'{cd} && mkdir -p {remote_log_dir} && ray job submit '
-                f'--address=http://127.0.0.1:8265 --job-id {ray_job_id} '
-                '--no-wait -- '
+                f'--address=http://127.0.0.1:8265 --submission-id {ray_job_id} '
+                '--no-wait '
                 f'"{executable} -u {script_path} > {remote_log_path} 2>&1"')
 
         returncode, stdout, stderr = self.run_on_head(handle,

@@ -2151,8 +2153,8 @@ def _setup_and_create_job_cmd_on_local_head(
         switch_user_cmd = ' '.join(switch_user_cmd)
         job_submit_cmd = (
             'ray job submit '
-            f'--address=http://127.0.0.1:8265 --job-id {ray_job_id} --no-wait '
-            f'-- {switch_user_cmd}')
+            f'--address=http://127.0.0.1:8265 --submission-id {ray_job_id} '
+            f'--no-wait -- {switch_user_cmd}')
         return job_submit_cmd
 
     def _add_job(self, handle: ResourceHandle, job_name: str,
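
Background for the two job-submission hunks above: ray 2.x renamed the CLI flag --job-id to --submission-id, and the Python client mirrors this rename. A minimal sketch of the 2.0.1 client-side equivalent (the address, entrypoint, and id here are illustrative, not from this commit):

    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient('http://127.0.0.1:8265')
    # ray <= 1.13 passed job_id=...; ray 2.0.1 uses submission_id=...
    submission_id = client.submit_job(
        entrypoint='python -u script.py > run.log 2>&1',
        submission_id='sky-job-42',
    )
    print(submission_id)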

sky/design_docs/onprem-design.md (+2 -2)

@@ -9,7 +9,7 @@
 - Does not support different types of accelerators within the same node (intranode).
 
 ## Installing Ray and SkyPilot
-- Admin installs Ray==1.13.0 and SkyPilot globally on all machines. It is assumed that the admin regularly keeps SkyPilot updated on the cluster.
+- Admin installs Ray==2.0.1 and SkyPilot globally on all machines. It is assumed that the admin regularly keeps SkyPilot updated on the cluster.
 - Python >= 3.6 for all users.
 - When a regular user runs `sky launch`, a local version of SkyPilot will be installed on the machine for each user. The local installation of Ray is specified in `sky/templates/local-ray.yml.j2`.

@@ -36,7 +36,7 @@ ray.get(ray.remote(f).remote())
 ```
 
 - Therefore, SkyPilot On-prem transparently includes user-switching so that SkyPilot tasks are still run as the calling, unprivileged user. This user-switching (`sudo -H su --login [USER]` in appropriate places) works as follows:
-  - In `sky/backends/cloud_vm_ray_backend.py::_setup_and_create_job_cmd_on_local_head`, switching between users is called during Ray job submission. The command `ray job submit --address=http://127.0.0.1:8265 --job-id {ray_job_id} -- sudo -H su --login [SSH_USER] -c \"[JOB_COMMAND]\"` switches job submission execution from admin back to the original user `SSH_USER`. The `JOB_COMMAND` argument runs a bash script with the user's run commands.
+  - In `sky/backends/cloud_vm_ray_backend.py::_setup_and_create_job_cmd_on_local_head`, switching between users is called during Ray job submission. The command `ray job submit --address=http://127.0.0.1:8265 --submission-id {ray_job_id} -- sudo -H su --login [SSH_USER] -c \"[JOB_COMMAND]\"` switches job submission execution from admin back to the original user `SSH_USER`. The `JOB_COMMAND` argument runs a bash script with the user's run commands.
   - In `sky/skylet/log_lib.py::run_bash_command_with_log`, there is also another `sudo -H su` command to switch users. The function `run_bash_command_with_log` is part of the `RayCodeGen` job execution script uploaded to remote for job submission (located in `~/.sky/sky_app/sky_app_[JOB_ID].py`). This program initially runs under the calling user, but it executes the function `run_bash_command_with_log` from the context of the admin, as the function is executed within the Ray cluster as a Ray remote function (see above for why all Ray remote functions are run under admin).
 - SkyPilot ensures Ray-related environment variables (that are critical for execution) are preserved across switching users (check with `examples/env_check.yaml`).
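
For concreteness, a sketch of how the user-switching submit command can be assembled (make_job_submit_cmd and its arguments are hypothetical names, not SkyPilot's exact code); shlex.quote protects the inner bash command handed to su -c:

    import shlex

    def make_job_submit_cmd(ray_job_id: str, ssh_user: str, job_command: str) -> str:
        # Run the user's job as the original unprivileged user, not as admin.
        switch_user_cmd = f'sudo -H su --login {ssh_user} -c {shlex.quote(job_command)}'
        return ('ray job submit '
                f'--address=http://127.0.0.1:8265 --submission-id {ray_job_id} '
                f'--no-wait -- {switch_user_cmd}')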

sky/setup_files/setup.py (+5 -3)

@@ -56,7 +56,7 @@ def parse_footnote(readme: str) -> str:
 
 install_requires = [
     'wheel',
-    # NOTE: ray 1.13.0 requires click<=8.0.4,>=7.0; We disable the
+    # NOTE: ray 2.0.1 requires click<=8.0.4,>=7.0; We disable the
     # shell completion for click<8.0 for backward compatibility.
     'click<=8.0.4,>=7.0',
     'colorama',

@@ -70,14 +70,14 @@ def parse_footnote(readme: str) -> str:
     'PrettyTable',
     # Lower local ray version is not fully supported, due to the
     # autoscaler issues (also tracked in #537).
-    'ray[default]>=1.9.0,<=1.13.0',
+    'ray[default]>=1.9.0,<=2.0.1',
     'rich',
     'tabulate',
     'filelock',  # TODO(mraheja): Enforce >=3.6.0 when python version is >= 3.7
     # This is used by ray. The latest 1.44.0 will generate an error
     # `Fork support is only compatible with the epoll1 and poll
     # polling strategies`
-    'grpcio<=1.43.0',
+    'grpcio>=1.32.0,<=1.43.0',
     'packaging',
     # The latest 4.21.1 will break ray. Enforce < 4.0.0 until Ray releases the
     # fix.

@@ -98,6 +98,8 @@ def parse_footnote(readme: str) -> str:
     ],
     # TODO(zongheng): azure-cli is huge and takes a long time to install.
     # Tracked in: https://github.com/Azure/azure-cli/issues/7387
+    # azure-cli need to be pinned to 2.31.0 due to later versions
+    # do not have azure-identity (used in node_provider) installed
     'azure': ['azure-cli==2.31.0', 'azure-core'],
     'gcp': ['google-api-python-client', 'google-cloud-storage'],
     'docker': ['docker'],

sky/skylet/LICENCE (+6 -6)

@@ -203,16 +203,16 @@
 --------------------------------------------------------------------------------
 
 Code in providers/azure from
-https://github.com/ray-project/ray/tree/ray-1.13.0/python/ray/autoscaler/_private/_azure
-Git commit of the release 1.13.0: 4ce38d001dbbe09cd21c497fedd03d692b2be3e
+https://github.com/ray-project/ray/tree/ray-2.0.1/python/ray/autoscaler/_private/_azure
+Git commit of the release 2.0.1: 03b6bc7b5a305877501110ec04710a9c57011479
 
 Code in providers/gcp from
-https://github.com/ray-project/ray/tree/ray-1.13.0/python/ray/autoscaler/_private/gcp
-Git commit of the release 1.13.0: 4ce38d001dbbe09cd21c497fedd03d692b2be3e
+https://github.com/ray-project/ray/tree/ray-2.0.1/python/ray/autoscaler/_private/gcp
+Git commit of the release 2.0.1: 03b6bc7b5a305877501110ec04710a9c57011479
 
 Code in providers/aws from
-https://github.com/ray-project/ray/tree/ray-1.13.0/python/ray/autoscaler/_private/aws
-Git commit of the release 1.13.0: 4ce38d001dbbe09cd21c497fedd03d692b2be3e
+https://github.com/ray-project/ray/tree/ray-2.0.1/python/ray/autoscaler/_private/aws
+Git commit of the release 2.0.1: 03b6bc7b5a305877501110ec04710a9c57011479
 
 
 Copyright 2016-2022 Ray developers

sky/skylet/constants.py (+1 -1)

@@ -2,7 +2,7 @@
 
 SKY_LOGS_DIRECTORY = '~/sky_logs'
 SKY_REMOTE_WORKDIR = '~/sky_workdir'
-SKY_REMOTE_RAY_VERSION = '1.13.0'
+SKY_REMOTE_RAY_VERSION = '2.0.1'
 
 # TODO(mluo): Make explicit `sky launch -c <name> ''` optional.
 UNINITIALIZED_ONPREM_CLUSTER_MESSAGE = (

sky/skylet/job_lib.py (+16 -6)

@@ -7,6 +7,7 @@
 import pathlib
 import shlex
 import time
+import typing
 from typing import Any, Dict, List, Optional
 
 import filelock

@@ -17,6 +18,9 @@
 from sky.utils import db_utils
 from sky.utils import log_utils
 
+if typing.TYPE_CHECKING:
+    from ray.dashboard.modules.job import pydantic_models as ray_pydantic
+
 logger = sky_logging.init_logger(__name__)
 
 _JOB_STATUS_LOCK = '~/.sky/locks/.job_{}.lock'

@@ -331,7 +335,7 @@ def update_job_status(job_owner: str,
     we still need this to handle staleness problem, caused by instance
     restarting and other corner cases (if any).
 
-    This function should only be run on the remote instance with ray==1.13.0.
+    This function should only be run on the remote instance with ray==2.0.1.
     """
     if len(job_ids) == 0:
         return []

@@ -341,13 +345,19 @@ def update_job_status(job_owner: str,
 
     job_client = _create_ray_job_submission_client()
 
-    # In ray 1.13.0, job_client.list_jobs returns a dict of job_id to job_info,
-    # where job_info contains the job status (str).
-    ray_job_infos = job_client.list_jobs()
+    # In ray 2.0.1, job_client.list_jobs returns a list of JobDetails,
+    # which contains the job status (str) and submission_id (str).
+    job_detail_lists: List['ray_pydantic.JobDetails'] = job_client.list_jobs()
+
+    job_details = dict()
+    ray_job_ids_set = set(ray_job_ids)
+    for job_detail in job_detail_lists:
+        if job_detail.submission_id in ray_job_ids_set:
+            job_details[job_detail.submission_id] = job_detail
     job_statuses: List[JobStatus] = [None] * len(ray_job_ids)
     for i, ray_job_id in enumerate(ray_job_ids):
-        if ray_job_id in ray_job_infos:
-            ray_status = ray_job_infos[ray_job_id].status
+        if ray_job_id in job_details:
+            ray_status = job_details[ray_job_id].status
             job_statuses[i] = _RAY_TO_JOB_STATUS_MAP[ray_status]
 
     assert len(job_statuses) == len(job_ids), (job_statuses, job_ids)
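
The last hunk above adapts to the list_jobs API change between ray 1.13 and ray 2.0. A minimal sketch of the difference (the dashboard address and job ids are illustrative):

    from ray.job_submission import JobSubmissionClient

    client = JobSubmissionClient('http://127.0.0.1:8265')
    wanted = {'sky-job-1', 'sky-job-2'}

    # ray 1.13.0: list_jobs() returned a Dict[job_id, JobInfo], so lookups were
    # direct: client.list_jobs()[job_id].status
    # ray 2.0.1: list_jobs() returns a List[JobDetails]; key by .submission_id:
    statuses = {d.submission_id: d.status
                for d in client.list_jobs()
                if d.submission_id in wanted}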

sky/skylet/providers/aws/cloudwatch/cloudwatch_helper.py (+7 -5)

@@ -1,13 +1,15 @@
-import botocore
 import copy
+import hashlib
 import json
-import os
 import logging
+import os
 import time
-import hashlib
-from typing import Any, Dict, List, Union, Tuple
+from typing import Any, Dict, List, Tuple, Union
+
+import botocore
+
 from sky.skylet.providers.aws.utils import client_cache, resource_cache
-from ray.autoscaler.tags import TAG_RAY_CLUSTER_NAME, NODE_KIND_HEAD, TAG_RAY_NODE_KIND
+from ray.autoscaler.tags import NODE_KIND_HEAD, TAG_RAY_CLUSTER_NAME, TAG_RAY_NODE_KIND
 
 logger = logging.getLogger(__name__)
 
sky/skylet/providers/aws/config.py (+24 -23)

@@ -1,30 +1,29 @@
-from distutils.version import StrictVersion
-from functools import lru_cache
-from functools import partial
 import copy
 import itertools
 import json
+import logging
 import os
 import time
+from distutils.version import StrictVersion
+from functools import lru_cache, partial
 from typing import Any, Dict, List, Optional, Set, Tuple
-import logging
 
 import boto3
 import botocore
 
-from ray.autoscaler._private.util import check_legacy_fields
-from ray.autoscaler.tags import NODE_TYPE_LEGACY_HEAD, NODE_TYPE_LEGACY_WORKER
-from ray.autoscaler._private.providers import _PROVIDER_PRETTY_NAMES
+from sky.skylet.providers.aws.cloudwatch.cloudwatch_helper import (
+    CloudwatchHelper as cwh,
+)
 from sky.skylet.providers.aws.utils import (
     LazyDefaultDict,
     handle_boto_error,
     resource_cache,
 )
-from ray.autoscaler._private.cli_logger import cli_logger, cf
+from ray.autoscaler._private.cli_logger import cf, cli_logger
 from ray.autoscaler._private.event_system import CreateClusterEvent, global_event_system
-from sky.skylet.providers.aws.cloudwatch.cloudwatch_helper import (
-    CloudwatchHelper as cwh,
-)
+from ray.autoscaler._private.providers import _PROVIDER_PRETTY_NAMES
+from ray.autoscaler._private.util import check_legacy_fields
+from ray.autoscaler.tags import NODE_TYPE_LEGACY_HEAD, NODE_TYPE_LEGACY_WORKER
 
 logger = logging.getLogger(__name__)
 

@@ -33,20 +32,22 @@
 DEFAULT_RAY_IAM_ROLE = RAY + "-v1"
 SECURITY_GROUP_TEMPLATE = RAY + "-{}"
 
-DEFAULT_AMI_NAME = "AWS Deep Learning AMI (Ubuntu 18.04) V30.0"
+# V61.0 has CUDA 11.2
+DEFAULT_AMI_NAME = "AWS Deep Learning AMI (Ubuntu 18.04) V61.0"
 
-# Obtained from https://aws.amazon.com/marketplace/pp/B07Y43P7X5 on 8/4/2020.
+# Obtained from https://aws.amazon.com/marketplace/pp/B07Y43P7X5 on 6/10/2022.
+# NOTE(skypilot): these are not used; skypilot instead uses the default AMIs in aws.py.
 DEFAULT_AMI = {
-    "us-east-1": "ami-029510cec6d69f121",  # US East (N. Virginia)
-    "us-east-2": "ami-08bf49c7b3a0c761e",  # US East (Ohio)
-    "us-west-1": "ami-0cc472544ce594a19",  # US West (N. California)
-    "us-west-2": "ami-0a2363a9cff180a64",  # US West (Oregon)
-    "ca-central-1": "ami-0a871851b2ab39f01",  # Canada (Central)
-    "eu-central-1": "ami-049fb1ea198d189d7",  # EU (Frankfurt)
-    "eu-west-1": "ami-0abcbc65f89fb220e",  # EU (Ireland)
-    "eu-west-2": "ami-0755b39fd4dab7cbe",  # EU (London)
-    "eu-west-3": "ami-020485d8df1d45530",  # EU (Paris)
-    "sa-east-1": "ami-058a6883cbdb4e599",  # SA (Sao Paulo)
+    "us-east-1": "ami-0dd6adfad4ad37eec",  # US East (N. Virginia)
+    "us-east-2": "ami-0c77cd5ca05bf1281",  # US East (Ohio)
+    "us-west-1": "ami-020ab1b368a5ed1db",  # US West (N. California)
+    "us-west-2": "ami-0387d929287ab193e",  # US West (Oregon)
+    "ca-central-1": "ami-07dbafdbd38f18d98",  # Canada (Central)
+    "eu-central-1": "ami-0383bd0c1fc4c63ec",  # EU (Frankfurt)
+    "eu-west-1": "ami-0a074b0a311a837ac",  # EU (Ireland)
+    "eu-west-2": "ami-094ba2b4651f761ca",  # EU (London)
+    "eu-west-3": "ami-031da10fbf225bf5f",  # EU (Paris)
+    "sa-east-1": "ami-0be7c1f1dd96d7337",  # SA (Sao Paulo)
 }
 
 # todo: cli_logger should handle this assert properly
