Improve logging in error conditions & update auto-stop.rst (skypilot-org#675)

concretevitamin · web-flow · commit c7895665055a · 2022-03-31T22:55:56.000-07:00
* Log error for HEAD_FAILED; don't duplicate logging for no_retry=True.

* Minor touches on auto-stop.rst

* Revert to only printing errors on GANG_FAILED
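
For reviewers, the resulting logging policy can be sketched roughly as below. This is a minimal standalone illustration, not the backend code: the helper name `report_provision_failure`, the enum values, and the returned list are hypothetical and exist only so the behavior can be checked in isolation. The status names and log strings are taken from the diff.

```python
import enum
import logging

logger = logging.getLogger(__name__)


class GangSchedulingStatus(enum.Enum):
    # Names mirror the statuses referenced in the diff; the numeric values
    # are assumptions made for this sketch.
    CLUSTER_READY = 0
    GANG_FAILED = 1
    HEAD_FAILED = 2


def report_provision_failure(status, need_terminate):
    """Hypothetical helper: log errors only for GANG_FAILED.

    HEAD_FAILED stays quiet because it may not have created any resources
    and can occur during "normal" failover. Returns the messages that were
    logged so the policy is easy to assert on.
    """
    if status != GangSchedulingStatus.GANG_FAILED:
        return []
    terminate_str = 'Terminating' if need_terminate else 'Stopping'
    messages = [
        '*** Failed provisioning the cluster. ***',
        f'*** {terminate_str} the failed cluster. ***',
    ]
    for message in messages:
        logger.error(message)
    return messages
```

With this shape, a HEAD_FAILED attempt produces no error output, while a GANG_FAILED attempt logs two lines before the cluster is terminated or stopped.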
diff --git a/docs/source/reference/auto-stop.rst b/docs/source/reference/auto-stop.rst
@@ -3,40 +3,44 @@ Auto-stopping
 =========
 
 Sky's **auto-stopping** can automatically stop a cluster after a few minutes of idleness.
+With auto-stopping, users can simply submit jobs and leave their laptops, while
+**ensuring no unnecessary spending occurs**: after jobs have finished, the
+cluster(s) used will be automatically stopped (and can be restarted later).
 
-To setup auto-stopping for a cluster, :code:`sky autostop` can be used.
+To set up auto-stopping for a cluster, use :code:`sky autostop`:
 
 .. code-block:: bash
 
    # Launch a cluster with logging detached
-   sky launch -c mycluster -d cluster.yaml
+   sky launch -d -c mycluster cluster.yaml
 
-   # Set auto-stopping for the cluster, after cluster will be stopped 10 minutes of idleness
+   # Auto-stop the cluster after 10 minutes of idleness
    sky autostop mycluster -i 10
 
+   # Use the default, 5 minutes of idleness
+   # sky autostop mycluster
+
 The :code:`-d / --detach` flag detaches logging from the terminal.
 
-To cancel the auto-stop scheduled on the cluster:
+To cancel a scheduled auto-stop on the cluster:
 
 .. code-block:: bash
 
-   # Cancel auto-stop for the cluster
    sky autostop mycluster --cancel
 
-To view the status of the cluster:
+To view the status of the cluster, use ``sky status [--refresh]``:
 
 .. code-block:: bash
 
    # Show a cluster's jobs (IDs, statuses).
    sky status
-   NAME         LAUNCHED   RESOURCES           STATUS  AUTOSTOP  COMMAND
-   mucluster    1 min ago  2x AWS(m4.2xlarge)  UP      0 min     sky launch -d -c ...
+   NAME         LAUNCHED    RESOURCES            STATUS   AUTOSTOP  COMMAND
+   mycluster    1 min ago   2x AWS(m4.2xlarge)   UP       10 min    sky launch -d -c ...
 
    # Refresh the status for auto-stopping
    sky status --refresh
-   NAME         LAUNCHED   RESOURCES           STATUS  AUTOSTOP  COMMAND
-   mucluster    1 min ago  2x AWS(m4.2xlarge)  STOPPED -         sky launch -d -c ...
-
+   NAME         LAUNCHED    RESOURCES            STATUS   AUTOSTOP  COMMAND
+   mycluster    11 min ago  2x AWS(m4.2xlarge)   STOPPED  -         sky launch -d -c ...
 
-The cluster status in :code:`sky status` shows the cached status of the cluster, which can be out-dated for clusters with auto-stopping scheduled. To view a real status of the cluster with auto-stopping scheduled, use :code:`sky status --refresh`.
 
+:code:`sky status` shows the cached statuses, which can be outdated for clusters with auto-stopping scheduled. To query the real statuses of clusters with auto-stopping scheduled, use :code:`sky status --refresh`.
diff --git a/sky/backends/cloud_vm_ray_backend.py b/sky/backends/cloud_vm_ray_backend.py
@@ -660,7 +660,6 @@ def _yield_region_zones(self, to_provision: 'resources_lib.Resources',
                     'It is possibly killed by cloud provider or manually '
                     'in the cloud provider console. To remove the cluster '
                     f'please run: sky down {cluster_name}')
-                logger.error(message)
                 # Reset to UP (rather than keeping it at INIT), as INIT
                 # mode will enable failover to other regions, causing
                 # data lose.
@@ -684,7 +683,6 @@ def _yield_region_zones(self, to_provision: 'resources_lib.Resources',
                     'Failed to acquire resources to restart the stopped '
                     f'cluster {cluster_name} on {region}. Please retry again '
                     'later.')
-                logger.error(message)
 
                 # Reset to STOPPED (rather than keeping it at INIT), because
                 # (1) the cluster is not up (2) it ensures future `sky start`
@@ -875,8 +873,6 @@ def _retry_region_zones(self,
                 # ray up failed for the head node.
                 self._update_blocklist_on_error(to_provision.cloud, region,
                                                 zones, stdout, stderr)
-                logger.error(
-                    f'*** HEAD_FAILED for {cluster_name} {region.name}')
             else:
                 # gang scheduling failed.
                 assert status == self.GangSchedulingStatus.GANG_FAILED, status
@@ -891,7 +887,8 @@ def _retry_region_zones(self,
                     stderr=None)
 
                 # Only log the errors for GANG_FAILED, since HEAD_FAILED may
-                # not have created any resources (it can happen however).
+                # not have created any resources (it can happen however) and
+                # HEAD_FAILED can happen in "normal" failover cases.
                 logger.error('*** Failed provisioning the cluster. ***')
                 terminate_str = 'Terminating' if need_terminate else 'Stopping'
                 logger.error(f'*** {terminate_str} the failed cluster. ***')