Commit 36181a6

[Tests] Add options for test to select clouds (skypilot-org#1587)
* Add options for test to select clouds
* strip image_id
* strip
* not error out if azure status not exist
* fix gcp images
* reduce test name
* format
* fix
* fix gcp tests
* name length
* Fix smoke test
* fix gcp region
* use west3
* fix some more tests
* use ubuntu instead
* fix name
* fix
* Only refresh status for cluster
* remnant
* add spot queue waiting
* revert gcp
* use ":" instead for gcloud
* fix launch hash filter str
* Use our own lsof to avoid not installed issue
* instruction for rerun failed tests
* spot queue left
* fix templates
* wait longer for autostop
* Fix autodown test
* adopt the mylsof to azure
* lint
* change zone
* format
* reduce test name length
* wait longer for autostop
* longer autostop
* fix format exception
* shorter cluster name
* Update CONTRIBUTING.md
  Co-authored-by: Zongheng Yang <[email protected]>
* address comments
* format
* contributing
* Document
* pyproject
* revert gcloud filter
* fix
* fix
* format
* fix test_smoke matching
* fix
* reduce length
* reduce cluster name length
* increase cancel wait time
* Reset the meaning of --cloud
* fix comment
* format
* address comments
* yapf

Co-authored-by: Zongheng Yang <[email protected]>
1 parent 983f5fa commit 36181a6

31 files changed, +720 -279 lines changed

.github/pull_request_template.md

+2-2
@@ -8,6 +8,6 @@
 Tested (run the relevant ones):

 - [ ] Any manual or new tests for this PR (please specify below)
-- [ ] All smoke tests: `bash tests/run_smoke_tests.sh`
-- [ ] Relevant individual smoke tests: `bash tests/run_smoke_tests.sh test_fill_in_the_name`
+- [ ] All smoke tests: `pytest tests/test_smoke.py`
+- [ ] Relevant individual smoke tests: `pytest tests/test_smoke.py::test_fill_in_the_name`
 - [ ] Backward compatibility tests: `bash tests/backward_comaptibility_tests.sh`

CONTRIBUTING.md

+12-2
@@ -26,10 +26,20 @@ pip install -r requirements-dev.txt
 ### Testing
 To run smoke tests (NOTE: Running all smoke tests launches ~20 clusters):
 ```
-bash tests/run_smoke_tests.sh
+# Run all tests except for AWS
+pytest tests/test_smoke.py
+
+# Re-run last failed tests
+pytest --lf

 # Run one of the smoke tests
-bash tests/run_smoke_tests.sh test_minimal
+pytest tests/test_smoke.py::test_minimal
+
+# Only run test for AWS + generic tests
+pytest tests/test_smoke.py --aws
+
+# Change cloud for generic tests to aws
+pytest tests/test_smoke.py --generic-cloud aws
 ```

 For profiling code, use:
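
The `--aws` and `--generic-cloud` flags documented in the CONTRIBUTING.md hunk are ordinary command-line options. A minimal sketch of how such a selection could work is given here, written with `argparse` so it runs standalone; in a pytest suite the same options would be registered from a `pytest_addoption` hook in `conftest.py`. The cloud list, the default generic cloud, and the helper names are assumptions for illustration and are not taken from this commit.

```python
import argparse
from typing import List, Sequence

# Hypothetical values for illustration; the real test suite may differ.
ALL_CLOUDS = ('aws', 'gcp', 'azure')
DEFAULT_GENERIC_CLOUD = 'gcp'


def build_parser() -> argparse.ArgumentParser:
    """Register one boolean flag per cloud plus --generic-cloud."""
    parser = argparse.ArgumentParser()
    for cloud in ALL_CLOUDS:
        parser.add_argument(f'--{cloud}',
                            action='store_true',
                            help=f'Only run {cloud} tests and generic tests.')
    parser.add_argument('--generic-cloud',
                        choices=ALL_CLOUDS,
                        default=DEFAULT_GENERIC_CLOUD,
                        help='Cloud used for tests that are not cloud-specific.')
    return parser


def clouds_to_run(args: argparse.Namespace,
                  default_clouds: Sequence[str] = ALL_CLOUDS) -> List[str]:
    """Return the explicitly selected clouds, or the default set if none."""
    selected = [cloud for cloud in ALL_CLOUDS if getattr(args, cloud)]
    return selected or list(default_clouds)
```

The default set of clouds is left as a parameter because the diff does not show which clouds run when no flag is passed.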

examples/env_check.yaml

-3
@@ -1,6 +1,3 @@
-resources:
-  cloud: aws
-
 num_nodes: 2

 workdir: .

examples/job_queue/cluster.yaml

-1
@@ -7,5 +7,4 @@
 # sky exec jq job.yaml

 resources:
-  cloud: aws
   accelerators: K80

examples/job_queue/cluster_multinode.yaml

-1
@@ -8,7 +8,6 @@
 # sky exec mjq job.yaml

 resources:
-  cloud: aws
   accelerators: T4

 num_nodes: 2

examples/job_queue/job_multinode.yaml

+1-1
@@ -21,7 +21,7 @@ setup: |
 run: |
   timestamp=$(date +%s)
   conda env list
-  for i in {1..240}; do
+  for i in {1..360}; do
     echo "$timestamp $i"
     sleep 1
   done

examples/managed_spot.yaml

-5
@@ -1,10 +1,5 @@
 name: minimal

-resources:
-  cloud: aws
-  use_spot: true
-  spot_recovery: failover
-
 setup: |
   echo "running setup"
   pip install tqdm

examples/multi_echo.py

+10-3
@@ -8,7 +8,7 @@
 import sky


-def run(cluster: Optional[str] = None):
+def run(cluster: Optional[str] = None, cloud: Optional[str] = None):
     if cluster is None:
         # (username, last 4 chars of hash of hostname): for uniquefying users on
         # shared-account cloud providers.
@@ -17,9 +17,13 @@ def run(cluster: Optional[str] = None):
         _user_and_host = f'{getpass.getuser()}-{hostname_hash}'
         cluster = f'test-multi-echo-{_user_and_host}'

+    if cloud is None:
+        cloud = 'gcp'
+    cloud = sky.clouds.CLOUD_REGISTRY.from_str(cloud)
+
     # Create the cluster.
     with sky.Dag() as dag:
-        cluster_resources = sky.Resources(sky.AWS(), accelerators={'K80': 1})
+        cluster_resources = sky.Resources(cloud, accelerators={'K80': 1})
         task = sky.Task(num_nodes=2).set_resources(cluster_resources)
         # `detach_run` will only detach the `run` command. The provision and
         # `setup` are still blocking.
@@ -38,7 +42,10 @@ def _exec(i):

 if __name__ == '__main__':
     cluster = None
+    cloud = None
     if len(sys.argv) > 1:
         # For smoke test passing in a cluster name.
         cluster = sys.argv[1]
-    run(cluster)
+    if len(sys.argv) > 2:
+        cloud = sys.argv[2]
+    run(cluster, cloud)
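
The change to `multi_echo.py` replaces a hard-coded `sky.AWS()` with a cloud looked up by name, defaulting to `'gcp'` when no second argument is passed. The pattern can be sketched without SkyPilot as follows; `CloudRegistry` and `parse_args` are hypothetical stand-ins, and the real `CLOUD_REGISTRY.from_str` may behave differently (for example in how it treats unknown names).

```python
from typing import Dict, List, Optional, Tuple


class CloudRegistry:
    """Hypothetical stand-in for a registry that maps names to clouds."""

    def __init__(self) -> None:
        self._clouds: Dict[str, str] = {}

    def register(self, name: str) -> None:
        # Store under a lowercase key so lookups are case-insensitive.
        self._clouds[name.lower()] = name

    def from_str(self, name: Optional[str]) -> Optional[str]:
        if name is None:
            return None
        key = name.lower()
        if key not in self._clouds:
            raise ValueError(f'Unknown cloud: {name!r}')
        return self._clouds[key]


def parse_args(argv: List[str],
               default_cloud: str = 'gcp') -> Tuple[Optional[str], str]:
    """Read an optional cluster name and an optional cloud from argv."""
    cluster = argv[1] if len(argv) > 1 else None
    cloud = argv[2] if len(argv) > 2 else default_cloud
    return cluster, cloud
```

Keeping the cloud as a string until the last moment lets the smoke test pass it on the command line, as in `python multi_echo.py <cluster> <cloud>`.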

examples/multi_hostname.yaml

-3
@@ -1,8 +1,5 @@
 name: multi_hostname

-resources:
-  cloud: gcp
-
 num_nodes: 2

 # The run command will be run on *all* nodes.

examples/per_region_images.yaml

+3-1
@@ -1,7 +1,8 @@
 resources:
   cloud: aws
+  instance_type: g4dn.xlarge
   image_id:
-    us-east-1: ami-0729d913a335efca7 # Ubuntu 20.04
+    us-west-2: ami-0fe5af21074ad2a10 # Deep learning AMI with CUDA 11.6 without conda installed
     us-west-1: skypilot:gpu-ubuntu-1804


@@ -10,3 +11,4 @@ setup: |

 run: |
   conda env list
+  nvidia-smi

examples/resnet_distributed_tf_app.py

+8-3
@@ -7,7 +7,7 @@
 import sky


-def run(cluster: Optional[str] = None):
+def run(cluster: Optional[str] = None, cloud: Optional[str] = None):
     if cluster is None:
         # (username, last 4 chars of hash of hostname): for uniquefying users on
         # shared-account cloud providers.
@@ -75,14 +75,19 @@ def run_fn(node_rank: int, ip_list: List[str]) -> Optional[str]:
         train.set_inputs('gs://cloud-tpu-test-datasets/fake_imagenet',
                          estimated_size_gigabytes=70)
         train.set_outputs('resnet-model-dir', estimated_size_gigabytes=0.1)
-        train.set_resources(sky.Resources(sky.AWS(), accelerators='V100'))
+        train.set_resources(
+            sky.Resources(sky.clouds.CLOUD_REGISTRY.from_str(cloud),
+                          accelerators='V100'))

     sky.launch(dag, cluster_name=cluster, retry_until_up=True)


 if __name__ == '__main__':
     cluster = None
+    cloud = None
     if len(sys.argv) > 1:
         # For smoke test passing in a cluster name.
         cluster = sys.argv[1]
-    run(cluster)
+    if len(sys.argv) > 2:
+        cloud = sys.argv[2]
+    run(cluster, cloud)

examples/using_file_mounts.yaml

+1-1
@@ -81,7 +81,7 @@ file_mounts:

 setup: |
   sudo apt update
-  sudo apt install tree
+  sudo apt install -y tree

 run: |
   set -ex

pyproject.toml

+1
@@ -16,6 +16,7 @@ env = [
     "SKYPILOT_DEBUG=1",
     "SKYPILOT_DISABLE_USAGE_COLLECTION=1"
 ]
+addopts = "-s -n 16 -q --tb=short --disable-warnings"

 [tool.mypy]
 python_version = "3.8"

sky/backends/backend_utils.py

+5-2
@@ -1355,8 +1355,8 @@ def _get_tpu_vm_pod_ips(ray_config: Dict[str, Any],
     cluster_name = ray_config['cluster_name']
     zone = ray_config['provider']['availability_zone']
     query_cmd = (f'gcloud compute tpus tpu-vm list --filter='
-                 f'\\(labels.ray-cluster-name={cluster_name}\\) '
-                 f'--zone={zone} --format=value\\(name\\)')
+                 f'"(labels.ray-cluster-name={cluster_name})" '
+                 f'--zone={zone} --format="value(name)"')
     returncode, stdout, stderr = log_lib.run_with_log(query_cmd,
                                                       '/dev/null',
                                                       shell=True,
@@ -1678,6 +1678,9 @@ def _query_status_gcp(
     cluster: str,
     ray_config: Dict[str, Any],
 ) -> List[global_user_state.ClusterStatus]:
+    # Note: we use ":" for filtering labels for gcloud, as the latest gcloud (v393.0)
+    # fails to filter labels with "=".
+    # Reference: https://cloud.google.com/sdk/gcloud/reference/topic/filters
     launch_hashes = _ray_launch_hash(cluster, ray_config)
     assert launch_hashes is not None
     hash_filter_str = ' '.join(launch_hashes)
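
The first hunk in `backend_utils.py` swaps backslash-escaped parentheses for double-quoted arguments, because the command string is run with `shell=True` and the shell would otherwise interpret `(` and `)`. An alternative that avoids hand-written quoting, shown here only as a sketch and not what the commit does, is to let `shlex.quote` protect each argument. The function name `tpu_vm_list_cmd` is hypothetical.

```python
import shlex


def tpu_vm_list_cmd(cluster_name: str, zone: str) -> str:
    """Build a shell-safe `gcloud compute tpus tpu-vm list` command string."""
    filter_expr = f'(labels.ray-cluster-name={cluster_name})'
    return ' '.join([
        'gcloud compute tpus tpu-vm list',
        # shlex.quote wraps values containing shell metacharacters in
        # single quotes and leaves safe values untouched.
        f'--filter={shlex.quote(filter_expr)}',
        f'--zone={shlex.quote(zone)}',
        f'--format={shlex.quote("value(name)")}',
    ])
```

Splitting the result with `shlex.split` recovers the original arguments, which is a quick way to check that the parentheses reach `gcloud` intact.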

sky/backends/cloud_vm_ray_backend.py

+10-5
@@ -2897,10 +2897,9 @@ def teardown_no_lock(self,
                 terminate_cmd = tpu_utils.terminate_tpu_vm_cluster_cmd(
                     cluster_name, zone, log_abs_path)
             else:
-                query_cmd = (
-                    f'gcloud compute instances list --filter='
-                    f'\\(labels.ray-cluster-name={cluster_name}\\) '
-                    f'--zones={zone} --format=value\\(name\\)')
+                query_cmd = (f'gcloud compute instances list --filter='
+                             f'"(labels.ray-cluster-name={cluster_name})" '
+                             f'--zones={zone} --format=value\\(name\\)')
                 terminate_cmd = (
                     f'gcloud compute instances delete --zone={zone}'
                     f' --quiet $({query_cmd})')
@@ -2956,8 +2955,14 @@ def teardown_no_lock(self,
             # never launched and the errors are related to pre-launch
             # configurations (such as VPC not found). So it's safe & good UX
             # to not print a failure message.
+            #
+            # '(ResourceGroupNotFound)': this indicates the resource group on
+            # Azure is not found. That means the cluster is already deleted
+            # on the cloud. So it's safe & good UX to not print a failure
+            # message.
             elif ('TPU must be specified.' not in stderr and
-                  'SKYPILOT_ERROR_NO_NODES_LAUNCHED: ' not in stderr):
+                  'SKYPILOT_ERROR_NO_NODES_LAUNCHED: ' not in stderr and
+                  '(ResourceGroupNotFound)' not in stderr):
                 logger.error(
                     _TEARDOWN_FAILURE_MESSAGE.format(
                         extra_reason='',
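
The second hunk in `cloud_vm_ray_backend.py` adds a third substring to a chain of `not in stderr` checks. As the list of harmless teardown errors grows, the same test could be expressed as a lookup over a tuple of markers. The following is a sketch of that idea; the three marker strings come from the diff, while the constant and function names are hypothetical.

```python
# Substrings that indicate a teardown error is harmless (taken from the diff).
BENIGN_TEARDOWN_MARKERS = (
    'TPU must be specified.',
    'SKYPILOT_ERROR_NO_NODES_LAUNCHED: ',
    '(ResourceGroupNotFound)',
)


def should_report_teardown_failure(stderr: str) -> bool:
    """Return True only if stderr contains none of the benign markers."""
    return not any(marker in stderr for marker in BENIGN_TEARDOWN_MARKERS)
```

This is equivalent to the chained `and` conditions in the hunk and keeps each new marker to a one-line change.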

sky/cli.py

+13-2
@@ -1368,11 +1368,18 @@ def exec(
               is_flag=True,
               required=False,
               help='Query the latest cluster statuses from the cloud provider(s).')
+@click.argument('clusters',
+                required=False,
+                type=str,
+                nargs=-1,
+                **_get_shell_complete_args(_complete_cluster_name))
 @usage_lib.entrypoint
-def status(all: bool, refresh: bool): # pylint: disable=redefined-builtin
+def status(all: bool, refresh: bool, clusters: List[str]): # pylint: disable=redefined-builtin
     # NOTE(dev): Keep the docstring consistent between the Python API and CLI.
     """Show clusters.

+    If CLUSTERS is given, show those clusters. Otherwise, show all clusters.
+
     The following fields for each cluster are recorded: cluster name, time
     since last launch, resources, region, zone, hourly price, status, autostop,
     command.
@@ -1417,7 +1424,11 @@ def status(all: bool, refresh: bool): # pylint: disable=redefined-builtin
     or for autostop-enabled clusters, use ``--refresh`` to query the latest
     cluster statuses from the cloud providers.
     """
-    cluster_records = core.status(refresh=refresh)
+    if clusters:
+        clusters = _get_glob_clusters(clusters)
+    else:
+        clusters = None
+    cluster_records = core.status(cluster_names=clusters, refresh=refresh)
     nonreserved_cluster_records = []
     reserved_clusters = dict()
     for cluster_record in cluster_records:
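
With this change `sky status` accepts optional cluster names, expands them through `_get_glob_clusters`, and passes `None` to `core.status` when nothing was given. The body of `_get_glob_clusters` is not part of this diff, so the following is only a sketch of what glob expansion over known cluster names might look like, using the standard `fnmatch` module. `expand_cluster_globs` and its arguments are hypothetical.

```python
import fnmatch
from typing import List, Optional, Sequence


def expand_cluster_globs(patterns: Sequence[str],
                         known_clusters: Sequence[str]) -> Optional[List[str]]:
    """Expand shell-style patterns against known cluster names.

    Returns None when no pattern is given, mirroring the "show all
    clusters" branch in the hunk.
    """
    if not patterns:
        return None
    matched: List[str] = []
    for pattern in patterns:
        for name in fnmatch.filter(known_clusters, pattern):
            if name not in matched:  # keep first-seen order, no duplicates
                matched.append(name)
    return matched
```

Returning `None` rather than an empty list matters here: an empty list would mean "no clusters", while `None` means "do not filter".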

sky/clouds/gcp.py

+6-6
@@ -308,8 +308,7 @@ def make_deploy_resources_variables(
         # --no-standard-images
         # We use the debian image, as the ubuntu image has some connectivity
         # issue when first booted.
-        image_id = service_catalog.get_image_id_from_tag(
-            'skypilot:cpu-debian-10', clouds='gcp')
+        image_id = 'skypilot:cpu-debian-10'

         r = resources
         # Find GPU spec, if any.
@@ -353,22 +352,23 @@ def make_deploy_resources_variables(
                     # Though the image is called cu113, it actually has later
                     # versions of CUDA as noted below.
                     # CUDA driver version 470.57.02, CUDA Library 11.4
-                    image_id = service_catalog.get_image_id_from_tag(
-                        'skypilot:k80-debian-10', clouds='gcp')
+                    image_id = 'skypilot:k80-debian-10'
                 else:
                     # Though the image is called cu113, it actually has later
                     # versions of CUDA as noted below.
                     # CUDA driver version 510.47.03, CUDA Library 11.6
                     # Does not support torch==1.13.0 with cu117
-                    image_id = service_catalog.get_image_id_from_tag(
-                        'skypilot:gpu-debian-10', clouds='gcp')
+                    image_id = 'skypilot:gpu-debian-10'

         if resources.image_id is not None:
             if None in resources.image_id:
                 image_id = resources.image_id[None]
             else:
                 assert region_name in resources.image_id, resources.image_id
                 image_id = resources.image_id[region_name]
+        if image_id.startswith('skypilot:'):
+            image_id = service_catalog.get_image_id_from_tag(image_id,
+                                                             clouds='gcp')

         assert image_id is not None, (image_id, r)
         resources_vars['image_id'] = image_id
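
The `gcp.py` change defers tag resolution: defaults are now kept as `skypilot:` tags, a user-supplied `image_id` may override them, and a single lookup at the end resolves whichever tag remains. One effect is that a user-supplied `skypilot:` tag is resolved too. A minimal sketch of that flow follows; the catalog is modeled as a plain dict with placeholder values, and `resolve_image_id` is a hypothetical name.

```python
from typing import Dict, Optional


def resolve_image_id(default_tag: str,
                     user_image_ids: Optional[Dict[Optional[str], str]],
                     region: str,
                     catalog: Dict[str, str]) -> str:
    """Pick the default or user image, then resolve `skypilot:` tags once."""
    image_id = default_tag
    if user_image_ids is not None:
        if None in user_image_ids:
            # A None key means the same image applies to every region.
            image_id = user_image_ids[None]
        else:
            image_id = user_image_ids[region]
    if image_id.startswith('skypilot:'):
        # Single resolution point for both default and user-supplied tags.
        image_id = catalog[image_id]
    return image_id
```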

sky/clouds/service_catalog/gcp_catalog.py

+4-2
@@ -460,6 +460,8 @@ def get_image_id_from_tag(tag: str, region: Optional[str]) -> Optional[str]:
     return common.get_image_id_from_tag_impl(_image_df, tag, region)


-def validate_image_tag(tag: str, region: Optional[str]) -> bool:
+def is_image_tag_valid(tag: str, region: Optional[str]) -> bool:
     """Returns whether the image tag is valid."""
-    return common.is_image_tag_valid_impl(_image_df, tag, region)
+    # GCP images are not region-specific.
+    del region # Unused.
+    return common.is_image_tag_valid_impl(_image_df, tag, None)

sky/core.py

+4-1
@@ -33,7 +33,10 @@
 def status(cluster_names: Optional[Union[str, Sequence[str]]] = None,
            refresh: bool = False) -> List[Dict[str, Any]]:
     # NOTE(dev): Keep the docstring consistent between the Python API and CLI.
-    """Get all cluster statuses.
+    """Get cluster statuses.
+
+    If cluster_names is given, return those clusters. Otherwise, return all
+    clusters.

     Each returned value has the following fields:

sky/resources.py

+9-4
@@ -87,9 +87,14 @@ def __init__(
         # The key is None if the same image_id applies for all regions.
         self._image_id = image_id
         if isinstance(image_id, str):
-            self._image_id = {self._region: image_id}
-        elif isinstance(image_id, dict) and None in image_id:
-            self._image_id = {self._region: image_id[None]}
+            self._image_id = {self._region: image_id.strip()}
+        elif isinstance(image_id, dict):
+            if None in image_id:
+                self._image_id = {self._region: image_id[None].strip()}
+            else:
+                self._image_id = {
+                    k.strip(): v.strip() for k, v in image_id.items()
+                }

         self._set_accelerators(accelerators, accelerator_args)

@@ -486,7 +491,7 @@ def _try_validate_image_id(self) -> None:
                 region_str = f' ({region})' if region else ''
                 with ux_utils.print_exception_no_traceback():
                     raise ValueError(
-                        f'Image tag {image_id} is not valid, please make sure'
+                        f'Image tag {image_id!r} is not valid, please make sure'
                         f' the tag exists in {self._cloud}{region_str}.')

         if (self._cloud.is_same_cloud(clouds.AWS()) and
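
The `resources.py` hunks strip whitespace from image IDs and region keys, and switch the error message to `{image_id!r}` so that any stray whitespace becomes visible inside quotes. The normalization can be sketched as a standalone function that mirrors the three branches in the diff; `normalize_image_id` is a hypothetical name.

```python
from typing import Dict, Optional, Union

ImageIdInput = Union[str, Dict[Optional[str], str], None]


def normalize_image_id(image_id: ImageIdInput,
                       region: Optional[str]) -> Optional[Dict[Optional[str], str]]:
    """Return a {region: image_id} mapping with surrounding whitespace removed."""
    if image_id is None:
        return None
    if isinstance(image_id, str):
        return {region: image_id.strip()}
    if None in image_id:
        # A None key means the image applies to all regions.
        return {region: image_id[None].strip()}
    return {k.strip(): v.strip() for k, v in image_id.items()}


def invalid_tag_message(image_id: str) -> str:
    """Format an error the way the hunk does, using !r to expose whitespace."""
    return f'Image tag {image_id!r} is not valid'
```

With `!r`, a tag such as `'skypilot:gpu '` shows its trailing space in the message, which a plain `{image_id}` would hide.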

sky/spot/recovery_strategy.py

+3-1
@@ -169,7 +169,9 @@ def _launch(self, max_retry=3, raise_on_failure=True) -> Optional[float]:
                 # If the launch fails, it will be recovered by the following
                 # code.
                 logger.info('Failed to launch the spot cluster with error: '
-                            f'{type(e)}: {e}')
+                            f'{common_utils.format_exception(e)})')
+                import traceback # pylint: disable=import-outside-toplevel
+                logger.info(f' Traceback: {traceback.format_exc()}')
                 retry_launch = True
                 exception = e
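
The `recovery_strategy.py` hunk logs a formatted exception plus the full traceback when a spot launch fails. Note that the added f-string ends with `)` inside the quotes, so the logged message gains a trailing parenthesis. The logging pattern itself can be sketched with the standard `traceback` module; `describe_failure` is a hypothetical stand-in, since the body of `common_utils.format_exception` is not shown in this diff.

```python
import traceback
from typing import Callable


def describe_failure(exc: BaseException) -> str:
    """Hypothetical one-line summary of an exception."""
    return f'{type(exc).__name__}: {exc}'


def launch_with_logging(launch_fn: Callable[[], None],
                        log: Callable[[str], None]) -> bool:
    """Run launch_fn; on failure log a summary and the traceback."""
    try:
        launch_fn()
        return True
    except Exception as e:  # pylint: disable=broad-except
        log(f'Failed to launch with error: {describe_failure(e)}')
        # format_exc() must be called inside the except block so that it
        # sees the exception currently being handled.
        log(f'  Traceback: {traceback.format_exc()}')
        return False
```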

sky/templates/aws-ray.yml.j2

+7-4
@@ -118,13 +118,16 @@ setup_commands:
   # Line 'sudo grep ..': set the number of threads per process to unlimited to avoid ray job submit stucking issue when the number of running ray jobs increase.
   # Line 'mkdir -p ..': disable host key check
   # Line 'python3 -c ..': patch the buggy ray files and enable `-o allow_other` option for `goofys`
-  - sudo systemctl stop unattended-upgrades || true;
+  - function mylsof { p=$(for pid in /proc/{0..9}*; do i=$(basename "$pid"); for file in "$pid"/fd/*; do link=$(readlink -e "$file"); if [ "$link" = "$1" ]; then echo "$i"; fi; done; done); echo "$p"; };
+    sudo systemctl stop unattended-upgrades || true;
     sudo systemctl disable unattended-upgrades || true;
     sudo sed -i 's/Unattended-Upgrade "1"/Unattended-Upgrade "0"/g' /etc/apt/apt.conf.d/20auto-upgrades || true;
-    sudo kill -9 `sudo lsof /var/lib/dpkg/lock-frontend | awk '{print $2}' | tail -n 1` || true;
-    sudo pkill -9 apt-get;
+    p=$(mylsof "/var/lib/dpkg/lock-frontend"); echo "$p";
+    sudo kill -9 `echo "$p" | tail -n 1` || true;
+    sudo rm /var/lib/dpkg/lock-frontend;
     sudo pkill -9 dpkg;
-    sudo dpkg --configure -a;
+    sudo pkill -9 apt-get;
+    sudo dpkg --configure --force-overwrite -a;
     mkdir -p ~/.ssh; touch ~/.ssh/config;
     pip3 --version > /dev/null 2>&1 || (curl -sSL https://bootstrap.pypa.io/get-pip.py -o get-pip.py && python3 get-pip.py && echo "PATH=$HOME/.local/bin:$PATH" >> ~/.bashrc);
     (type -a python | grep -q python3) || echo 'alias python=python3' >> ~/.bashrc;
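
The `mylsof` shell function added to the template replaces `lsof`, which may not be installed on the image. It walks `/proc/<pid>/fd/*`, resolves each symlink, and prints the PIDs whose open file matches the given path. For readers who find the one-liner hard to parse, the same idea is sketched here in Python. It assumes a Linux-style `/proc` filesystem; `pids_holding` is a hypothetical name, and unlike the shell version it reports each PID at most once.

```python
import os
from typing import List


def pids_holding(path: str, proc_root: str = '/proc') -> List[int]:
    """Return PIDs that have `path` open, by scanning <proc_root>/<pid>/fd."""
    target = os.path.realpath(path)
    pids: List[int] = []
    if not os.path.isdir(proc_root):
        return pids  # No procfs available (for example on macOS).
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # Skip non-process entries such as 'meminfo'.
        fd_dir = os.path.join(proc_root, entry, 'fd')
        try:
            fds = os.listdir(fd_dir)
        except OSError:
            continue  # Process exited or we lack permission to inspect it.
        for fd in fds:
            try:
                link = os.path.realpath(os.path.join(fd_dir, fd))
            except OSError:
                continue
            if link == target:
                pids.append(int(entry))
                break  # One match per process is enough.
    return pids
```

Without root privileges the scan only sees processes owned by the current user, which is why the template invokes the surrounding `kill` and `rm` commands with `sudo`.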
