Commit c789566

Improve logging in error conditions & update auto-stop.rst (skypilot-org#675)
* Log error for HEAD_FAILED; don't duplicate logging for no_retry=True.
* Minor touches on auto-stop.rst
* Revert to only printing errors on GANG_FAILED
1 parent 699b025 commit c789566

2 files changed: +18 -17 lines changed

docs/source/reference/auto-stop.rst

+16 -12
@@ -3,40 +3,44 @@ Auto-stopping
  =========

  Sky's **auto-stopping** can automatically stop a cluster after a few minutes of idleness.
+ With auto-stopping, users can simply submit jobs and leave their laptops, while
+ **ensuring no unnecessary spending occurs**: after jobs have finished, the
+ cluster(s) used will be automatically stopped (and restarted later).

- To setup auto-stopping for a cluster, :code:`sky autostop` can be used.
+ To setup auto-stopping for a cluster, use :code:`sky autostop`:

  .. code-block:: bash

  # Launch a cluster with logging detached
- sky launch -c mycluster -d cluster.yaml
+ sky launch -d -c mycluster cluster.yaml

- # Set auto-stopping for the cluster, after cluster will be stopped 10 minutes of idleness
+ # Auto-stop the cluster after 10 minutes of idleness
  sky autostop mycluster -i 10

+ # Use the default, 5 minutes of idleness
+ # sky autostop mycluster
+
  The :code:`-d / --detach` flag detaches logging from the terminal.

- To cancel the auto-stop scheduled on the cluster:
+ To cancel a scheduled auto-stop on the cluster:

  .. code-block:: bash

- # Cancel auto-stop for the cluster
  sky autostop mycluster --cancel

- To view the status of the cluster:
+ To view the status of the cluster, use ``sky status [--refresh]``:

  .. code-block:: bash

  # Show a cluster's jobs (IDs, statuses).
  sky status
- NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
- mucluster 1 min ago 2x AWS(m4.2xlarge) UP 0 min sky launch -d -c ...
+ NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
+ mycluster 1 min ago 2x AWS(m4.2xlarge) UP 10 min sky launch -d -c ...

  # Refresh the status for auto-stopping
  sky status --refresh
- NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
- mucluster 1 min ago 2x AWS(m4.2xlarge) STOPPED - sky launch -d -c ...
-
+ NAME LAUNCHED RESOURCES STATUS AUTOSTOP COMMAND
+ mycluster 11 min ago 2x AWS(m4.2xlarge) STOPPED - sky launch -d -c ...

- The cluster status in :code:`sky status` shows the cached status of the cluster, which can be out-dated for clusters with auto-stopping scheduled. To view a real status of the cluster with auto-stopping scheduled, use :code:`sky status --refresh`.

+ :code:`sky status` shows the cached statuses, which can be outdated for clusters with auto-stopping scheduled. To query the real statuses of clusters with auto-stopping scheduled, use :code:`sky status --refresh`.
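
The idle-timeout rule that the doc describes (stop after N minutes with no activity, unless the auto-stop was cancelled) can be illustrated with a small Python sketch. This is not Sky's implementation: the function name `should_autostop`, its parameters, and the use of `None` to model a cancelled auto-stop are all hypothetical choices made for illustration.

```python
from datetime import datetime, timedelta
from typing import Optional


def should_autostop(last_active: datetime,
                    idle_minutes: Optional[int],
                    now: datetime) -> bool:
    """Return True once the idle period has reached idle_minutes.

    Hypothetical helper: idle_minutes=None models a cancelled (or never
    scheduled) auto-stop, in which case the cluster is never stopped.
    """
    if idle_minutes is None:
        return False
    return now - last_active >= timedelta(minutes=idle_minutes)


# Example: the last job finished at 12:00 and auto-stop is set to 10 minutes.
last_job_finished = datetime(2022, 1, 1, 12, 0)
print(should_autostop(last_job_finished, 10, datetime(2022, 1, 1, 12, 5)))
print(should_autostop(last_job_finished, 10, datetime(2022, 1, 1, 12, 11)))
```

Under these assumptions the first call reports that the cluster should keep running and the second that it should be stopped, which mirrors the `UP` and `STOPPED` rows in the example output of the doc.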

sky/backends/cloud_vm_ray_backend.py

+2 -5
@@ -660,7 +660,6 @@ def _yield_region_zones(self, to_provision: 'resources_lib.Resources',
  'It is possibly killed by cloud provider or manually '
  'in the cloud provider console. To remove the cluster '
  f'please run: sky down {cluster_name}')
- logger.error(message)
  # Reset to UP (rather than keeping it at INIT), as INIT
  # mode will enable failover to other regions, causing
  # data lose.
@@ -684,7 +683,6 @@ def _yield_region_zones(self, to_provision: 'resources_lib.Resources',
  'Failed to acquire resources to restart the stopped '
  f'cluster {cluster_name} on {region}. Please retry again '
  'later.')
- logger.error(message)

  # Reset to STOPPED (rather than keeping it at INIT), because
  # (1) the cluster is not up (2) it ensures future `sky start`
@@ -875,8 +873,6 @@ def _retry_region_zones(self,
  # ray up failed for the head node.
  self._update_blocklist_on_error(to_provision.cloud, region,
  zones, stdout, stderr)
- logger.error(
- f'*** HEAD_FAILED for {cluster_name} {region.name}')
  else:
  # gang scheduling failed.
  assert status == self.GangSchedulingStatus.GANG_FAILED, status
@@ -891,7 +887,8 @@ def _retry_region_zones(self,
  stderr=None)

  # Only log the errors for GANG_FAILED, since HEAD_FAILED may
- # not have created any resources (it can happen however).
+ # not have created any resources (it can happen however) and
+ # HEAD_FAILED can happen in "normal" failover cases.
  logger.error('*** Failed provisioning the cluster. ***')
  terminate_str = 'Terminating' if need_terminate else 'Stopping'
  logger.error(f'*** {terminate_str} the failed cluster. ***')
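
The "don't duplicate logging for no_retry=True" change follows a common pattern: the code that builds the error message only raises, and the caller that handles the exception logs it once. The sketch below illustrates that pattern in isolation. It assumes the removed `logger.error(message)` calls sat next to a raise whose exception carries the message and a `no_retry` flag, which this diff does not show; `ProvisionError`, `provision`, and `launch` are hypothetical names, not the backend's actual API.

```python
import logging

logger = logging.getLogger('sketch')


class ProvisionError(Exception):
    """Hypothetical stand-in for the error raised on provisioning failure."""

    def __init__(self, message: str, no_retry: bool = False) -> None:
        super().__init__(message)
        self.no_retry = no_retry


def provision(cluster_name: str) -> None:
    # Build the message and raise. Nothing is logged here, so the handler
    # in the caller is the only place the message gets reported.
    message = ('Failed to acquire resources to restart the stopped '
               f'cluster {cluster_name}.')
    raise ProvisionError(message, no_retry=True)


def launch(cluster_name: str) -> bool:
    try:
        provision(cluster_name)
    except ProvisionError as e:
        if e.no_retry:
            # Single logging site for non-retryable failures.
            logger.error(str(e))
            return False
        raise
    return True
```

If `provision` also called `logger.error(message)` before raising, every non-retryable failure would appear twice in the output, which is the duplication the commit message refers to.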
