
Commit 06e2b2b

mraheja and Michaelvll authored
Usage Collection & Runtimes (skypilot-org#872)
* Logging CLI & YAML
* Logging stack traces & added metrics
* yapf
* added privacy policy
* formatting
* make suggested changes + added setup documentation
* more edits
* address some comments
* Fix comments
* Fix user hash
* Rename
* refactor a bit
* fix return value not returned bug
* update instance and readme
* Fix transaction and cli logging
* Remove remnant
* options to turn off logging
* add SKY_DEV=1 before pytest
* format
* fix option
* push
* Fix redaction and add overrides
* fix typo
* remove remnant
* Add new lines
* fix
* refactor metrics and usage logging
* Disable logging for testing
* reformat
* Disable metrics
* make time stamp the same
* format
* Send json instead
* Add runtime
* Add runtime
* format
* Remove metric
* address comments
* update readme
* docstr
* restore timeline
* switch back the timeline.event
* Reorganize backend functions
* fix
* Fix smoke and spot status refresh
* order of __init__
* fix
* address comments
* Use cli for status showing
* Address comments
* Add runtime logging for backend
* format
* Disable logging for github
* Revert to two functions for file mounts
* Move comment lines out
* Add schema for the logs
* Add fields
* Add readme for setting loki
* format
* Add instructions for setting up loki service
* link to IAM role creation
* Address comments
* Add region
* format
* remove IAM key
* move update functions into the class
* fix set_new_cluster
* Catch unrecognized options and command
* Address comments
* Remove note
* Add development environment
* Switch to elastic IP
* update status
* Fix time and add readme
* Add callable handle for clean_yaml
* format
* Fix status
* Fix sky_dev, user for spot and env
* Use new space for release
* Add internal field
* increase sleep for spot recovery
* Fix logs for onprem
* fix instruction
* disable logging for smoke test
* Update docs and message reset
* format
* Address comments
* Fix readme
* Fix grafana screenshot

Co-authored-by: Zhanghao Wu <[email protected]>
1 parent 434ef3a commit 06e2b2b

36 files changed, with 924 additions and 175 deletions.

.github/workflows/pytest.yml

Lines changed: 1 addition & 1 deletion
@@ -51,4 +51,4 @@ jobs:
           pip install pytest

       - name: Run tests with pytest
-        run: pytest ${{ matrix.test-path }}
+        run: SKY_DISABLE_USAGE_COLLECTION=1 pytest ${{ matrix.test-path }}

README.md

Lines changed: 8 additions & 0 deletions
@@ -21,13 +21,21 @@ Use editable mode (`-e`) when installing:
 pip install -e ".[all]"
 pip install -r requirements-dev.txt
 ```
+IMPORTANT: Please `export SKY_DEV=1` before running the sky commands in the terminal, so that the developing log will not pollute the actual user logs.
+

 ### Submitting pull requests
 - After you commit, format your code with [`format.sh`](./format.sh).
 - In the PR description, write a `Tested:` section to describe relevant tests performed.
 - For changes that touch the core system, run the [smoke tests](#testing) and ensure they pass.
 - Follow the [Google style guide](https://google.github.io/styleguide/pyguide.html).

+
+### Environment Variable Options
+- `export SKY_DEV=1` to show debugging logs (logging.DEBUG) and send the logs to dev space.
+- `export SKY_DISABLE_USAGE_COLLECTION=1` to disable usage logging.
+- `export SKY_MINIMIZE_LOGGING=1` to minimize the sky outputs for demo purpose.
+
 ### Dump timeline

 Timeline is useful for performance analysis and debugging in Sky.

docs/source/examples/spot-jobs.rst

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ This feature **saves significant cost** (e.g., up to 70\% for GPU VMs) by making
 To maximize availability, Sky automatically finds available spot resources across regions and clouds.
 Here is an example of BERT training job failing over different regions across AWS and GCP.

-.. image:: ../imgs/spot-training.png
+.. image:: ../images/spot-training.png
    :width: 600
    :alt: BERT training on Spot V100

docs/source/index.rst

Lines changed: 1 addition & 0 deletions
@@ -54,6 +54,7 @@ Key features:
    reference/storage
    reference/local/index
    reference/quota
+   reference/logging
    reference/faq

 .. toctree::

docs/source/reference/logging.rst

Lines changed: 31 additions & 0 deletions
@@ -0,0 +1,31 @@
+.. _logging:
+
+Usage Collection
+=================
+
+Sky collects usage stats by default. This data will only be used by the Sky team to improve its services and for research purpose.
+We will **not** sell data or buy data about you.
+
+
+What data is collected?
+-----------------------
+
+We collect non-sensitive data that helps us understand how Sky is used. We will redact your ``setup``, ``run``, and ``env`` from the collected data.
+
+.. _usage-disable:
+
+How to disable it
+-----------------
+To disable usage collection, set the ``SKY_DISABLE_USAGE_COLLECTION`` environment variable by :code:`export SKY_DISABLE_USAGE_COLLECTION=1`.
+
+
+How does it work?
+-----------------
+
+When a Sky CLI or entrypoint function is called, Sky will do the following:
+
+#. Check the environment variable ``SKY_DISABLE_USAGE_COLLECTION`` is set: 1 means disabled and 0 means enabled.
+
+#. If the environment variable is not set or set to 0, it will collect information about the cluster and task resource requirements
+
+#. If the environment variable is set to 1, it will skip any message sending.
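
The three steps documented in `logging.rst` above can be sketched roughly as follows. This is a minimal illustration and not Sky's actual implementation: `usage_collection_disabled` and `maybe_send` are hypothetical names, and treating any value other than `1` as "enabled" is an assumption, since the document only defines the values 0 and 1.

```python
import os
from typing import Callable, Mapping, Optional

ENV_VAR = 'SKY_DISABLE_USAGE_COLLECTION'


def usage_collection_disabled(env: Optional[Mapping[str, str]] = None) -> bool:
    """Step 1: check the environment variable (1 = disabled, 0/unset = enabled)."""
    env = os.environ if env is None else env
    # Assumption: any value other than '1' leaves collection enabled.
    return env.get(ENV_VAR, '0') == '1'


def maybe_send(message: dict,
               send: Callable[[dict], None],
               env: Optional[Mapping[str, str]] = None) -> bool:
    """Steps 2 and 3: send the message unless collection is disabled.

    `send` is a hypothetical transport callback supplied by the caller.
    Returns True if the message was handed to `send`.
    """
    if usage_collection_disabled(env):
        return False  # Step 3: skip any message sending.
    send(message)  # Step 2: collection is enabled.
    return True
```

Passing the environment as a parameter keeps the check easy to exercise in tests without mutating `os.environ`.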

sky/backends/backend.py

Lines changed: 15 additions & 1 deletion
@@ -2,7 +2,9 @@
 import typing
 from typing import Dict, Optional

+import sky
 from sky.utils import timeline
+from sky.usage import usage_lib

 if typing.TYPE_CHECKING:
     from sky import resources
@@ -32,21 +34,28 @@ def check_resources_fit_cluster(self, handle: ResourceHandle,
         raise NotImplementedError

     @timeline.event
+    @usage_lib.messages.usage.update_runtime('provision')
     def provision(self,
                   task: 'task_lib.Task',
                   to_provision: Optional['resources.Resources'],
                   dryrun: bool,
                   stream_logs: bool,
                   cluster_name: Optional[str] = None,
                   retry_until_up: bool = False) -> ResourceHandle:
+        if cluster_name is None:
+            cluster_name = sky.backends.backend_utils.generate_cluster_name()
+        usage_lib.messages.usage.update_cluster_name(cluster_name)
+        usage_lib.messages.usage.update_actual_task(task)
         return self._provision(task, to_provision, dryrun, stream_logs,
                                cluster_name, retry_until_up)

     @timeline.event
+    @usage_lib.messages.usage.update_runtime('sync_workdir')
     def sync_workdir(self, handle: ResourceHandle, workdir: Path) -> None:
         return self._sync_workdir(handle, workdir)

     @timeline.event
+    @usage_lib.messages.usage.update_runtime('sync_file_mounts')
     def sync_file_mounts(
         self,
         handle: ResourceHandle,
@@ -56,15 +65,19 @@ def sync_file_mounts(
         return self._sync_file_mounts(handle, all_file_mounts, storage_mounts)

     @timeline.event
+    @usage_lib.messages.usage.update_runtime('setup')
     def setup(self, handle: ResourceHandle, task: 'task_lib.Task') -> None:
         return self._setup(handle, task)

     def add_storage_objects(self, task: 'task_lib.Task') -> None:
         raise NotImplementedError

     @timeline.event
+    @usage_lib.messages.usage.update_runtime('execute')
     def execute(self, handle: ResourceHandle, task: 'task_lib.Task',
                 detach_run: bool) -> None:
+        usage_lib.messages.usage.update_cluster_name(handle.get_cluster_name())
+        usage_lib.messages.usage.update_actual_task(task)
         return self._execute(handle, task, detach_run)

     @timeline.event
@@ -77,6 +90,7 @@ def teardown_ephemeral_storage(self, task: 'task_lib.Task') -> None:
         return self._teardown_ephemeral_storage(task)

     @timeline.event
+    @usage_lib.messages.usage.update_runtime('teardown')
     def teardown(self,
                  handle: ResourceHandle,
                  terminate: bool,
@@ -93,7 +107,7 @@ def _provision(self,
                    to_provision: Optional['resources.Resources'],
                    dryrun: bool,
                    stream_logs: bool,
-                   cluster_name: Optional[str] = None,
+                   cluster_name: str,
                    retry_until_up: bool = False) -> ResourceHandle:
         raise NotImplementedError
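
The diff above wraps each backend stage in `update_runtime('<stage>')`, but the body of that decorator lives in `sky/usage/usage_lib.py`, which is not shown in this section. A decorator factory of that shape might look like the sketch below. The `UsageMessage` class, its `runtimes` dict, and the choice of `time.perf_counter` are all assumptions made for illustration; only the decorator name and its string argument come from the diff.

```python
import functools
import time
from typing import Any, Callable, Dict


class UsageMessage:
    """Hypothetical stand-in for the object behind usage_lib.messages.usage."""

    def __init__(self) -> None:
        # Assumed storage: stage name -> elapsed seconds of the last call.
        self.runtimes: Dict[str, float] = {}

    def update_runtime(self, name: str) -> Callable:
        """Returns a decorator that records how long the wrapped call took."""

        def decorator(func: Callable[..., Any]) -> Callable[..., Any]:

            @functools.wraps(func)
            def wrapper(*args: Any, **kwargs: Any) -> Any:
                start = time.perf_counter()
                try:
                    return func(*args, **kwargs)
                finally:
                    # Record the duration even if the stage raised.
                    self.runtimes[name] = time.perf_counter() - start

            return wrapper

        return decorator


usage = UsageMessage()


@usage.update_runtime('provision')
def provision(cluster_name: str) -> str:
    # Placeholder body; a real backend would provision a cluster here.
    return f'handle-for-{cluster_name}'
```

Recording in a `finally` block means a failed stage still contributes a runtime, which is one plausible design choice; the real implementation may differ.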
99113

sky/backends/backend_utils.py

Lines changed: 8 additions & 59 deletions
@@ -5,13 +5,11 @@
 import difflib
 import enum
 import getpass
-import hashlib
 import json
 import os
 import pathlib
 import random
 import re
-import socket
 import subprocess
 import textwrap
 import threading
@@ -45,11 +43,13 @@
 from sky import spot as spot_lib
 from sky.backends import onprem_utils
 from sky.skylet import log_lib
+from sky.utils import common_utils
 from sky.utils import command_runner
 from sky.utils import subprocess_utils
 from sky.utils import timeline
 from sky.utils import ux_utils
 from sky.utils import validator
+from sky.usage import usage_lib

 if typing.TYPE_CHECKING:
     from sky import resources
@@ -613,7 +613,7 @@ def write_cluster_config(to_provision: 'resources.Resources',
                 # (username, last 4 chars of hash of hostname): for uniquefying
                 # users on shared-account cloud providers. Using uuid.getnode()
                 # is incorrect; observed to collide on Macs.
-                'security_group': f'sky-sg-{user_and_hostname_hash()}',
+                'security_group': f'sky-sg-{common_utils.user_and_hostname_hash()}',
                 # Azure only.
                 'azure_subscription_id': azure_subscription_id,
                 'resource_group': f'{cluster_name}-{region_name}',
@@ -640,6 +640,7 @@ def write_cluster_config(to_provision: 'resources.Resources',
     if dryrun:
         return config_dict
     _add_auth_to_cluster_config(cloud, yaml_path)
+    usage_lib.messages.usage.update_ray_yaml(yaml_path)
     # For TPU nodes. TPU VMs do not need TPU_NAME.
     if (resources_vars.get('tpu_type') is not None and
             resources_vars.get('tpu_vm') is None):
@@ -689,30 +690,7 @@ def _add_auth_to_cluster_config(cloud: clouds.Cloud, cluster_config_file: str):
         # in the local cluster config (in ~/.sky/local/...). There is no need
         # for Sky to generate authentication.
         pass
-    dump_yaml(cluster_config_file, config)
-
-
-def read_yaml(path):
-    with open(path, 'r') as f:
-        config = yaml.safe_load(f)
-    return config
-
-
-def dump_yaml(path, config):
-    # https://github.com/yaml/pyyaml/issues/127
-    class LineBreakDumper(yaml.SafeDumper):
-
-        def write_line_break(self, data=None):
-            super().write_line_break(data)
-            if len(self.indents) == 1:
-                super().write_line_break()
-
-    with open(path, 'w') as f:
-        yaml.dump(config,
-                  f,
-                  Dumper=LineBreakDumper,
-                  sort_keys=False,
-                  default_flow_style=False)
+    common_utils.dump_yaml(cluster_config_file, config)


 def get_run_timestamp() -> str:
@@ -836,7 +814,7 @@ def wait_until_ray_cluster_ready(

 def ssh_credential_from_yaml(cluster_yaml: str) -> Tuple[str, str, str]:
     """Returns ssh_user, ssh_private_key and ssh_control name."""
-    config = read_yaml(cluster_yaml)
+    config = common_utils.read_yaml(cluster_yaml)
     auth_section = config['auth']
     ssh_user = auth_section['ssh_user'].strip()
     ssh_private_key = auth_section.get('ssh_private_key')
@@ -934,35 +912,6 @@ def check_local_gpus() -> bool:
     return is_functional


-def user_and_hostname_hash() -> str:
-    """Returns a string containing <user>-<hostname hash last 4 chars>.
-
-    For uniquefying user clusters on shared-account cloud providers. Also used
-    for AWS security group.
-
-    Using uuid.getnode() instead of gethostname() is incorrect; observed to
-    collide on Macs.
-
-    NOTE: BACKWARD INCOMPATIBILITY NOTES
-
-    Changing this string will render AWS clusters shown in `sky status`
-    unreusable and potentially cause leakage:
-
-    - If a cluster is STOPPED, any command restarting it (`sky launch`, `sky
-      start`) will launch a NEW cluster.
-    - If a cluster is UP, a `sky launch` command reusing it will launch a NEW
-      cluster. The original cluster will be stopped and thus leaked from Sky's
-      perspective.
-    - `sky down/stop/exec` on these pre-change clusters still works, if no new
-      clusters with the same name have been launched.
-
-    The reason is AWS security group names are derived from this string, and
-    thus changing the SG name makes these clusters unrecognizable.
-    """
-    hostname_hash = hashlib.md5(socket.gethostname().encode()).hexdigest()[-4:]
-    return f'{getpass.getuser()}-{hostname_hash}'
-
-
 def generate_cluster_name():
     # TODO: change this ID formatting to something more pleasant.
     # User name is helpful in non-isolated accounts, e.g., GCP, Azure.
@@ -1309,7 +1258,7 @@ def _get_cluster_status_via_cloud_cli(
     """Returns the status of the cluster."""
     resources: sky.Resources = handle.launched_resources
     cloud = resources.cloud
-    ray_config = read_yaml(handle.cluster_yaml)
+    ray_config = common_utils.read_yaml(handle.cluster_yaml)
     return _QUERY_STATUS_FUNCS[str(cloud)](handle.cluster_name, ray_config)


@@ -1407,7 +1356,7 @@ def _update_cluster_status(

     The function will update the cached cluster status in the global state. For the
     design of the cluster status and transition, please refer to the
-    sky/design_docs/cluster_states.md
+    sky/design_docs/cluster_status.md

     Returns:
         If the cluster is terminated or does not exist, return None.
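
For reference, the `user_and_hostname_hash()` helper removed above is now called as `common_utils.user_and_hostname_hash()`. The diff for `sky/utils/common_utils.py` is not included in this section, so the sketch below assumes the body moved unchanged. The optional `user` and `hostname` parameters are a hypothetical addition, used here only to make the output deterministic for illustration.

```python
import getpass
import hashlib
import socket
from typing import Optional


def user_and_hostname_hash(user: Optional[str] = None,
                           hostname: Optional[str] = None) -> str:
    """Returns '<user>-<last 4 hex chars of md5(hostname)>'.

    With no arguments this mirrors the removed function body. The two
    parameters are hypothetical and exist only for this example.
    """
    user = getpass.getuser() if user is None else user
    hostname = socket.gethostname() if hostname is None else hostname
    hostname_hash = hashlib.md5(hostname.encode()).hexdigest()[-4:]
    return f'{user}-{hostname_hash}'


def security_group_name(user: Optional[str] = None,
                        hostname: Optional[str] = None) -> str:
    # Same 'sky-sg-' prefix as the 'security_group' entry in the diff above.
    return f'sky-sg-{user_and_hostname_hash(user, hostname)}'
```

Because AWS security group names are derived from this string, as the removed docstring warns, any change to the hashing scheme would make existing clusters unrecognizable.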
