
Commit f34557d

Merge branch 'master' into dcgan_fashiongen_example

2 parents 27f3a70 + eea64d0

98 files changed: +2812 / -357 lines


README.md

Lines changed: 5 additions & 2 deletions
@@ -40,7 +40,7 @@ TorchServe is a flexible and easy to use tool for serving PyTorch models.
  python ./ts_scripts/install_dependencies.py
  ```

- - For GPU with Cuda 10.2. Options are `cu92`, `cu101`, `cu102`, `cu111`
+ - For GPU with Cuda 10.2. Options are `cu92`, `cu101`, `cu102`, `cu111`, `cu113`

  ```bash
  python ./ts_scripts/install_dependencies.py --cuda=cu102
@@ -224,7 +224,10 @@ High level performance data like Throughput or Percentile Precision can be generated
  ### Concurrency And Number of Workers
  TorchServe exposes configurations that allow the user to configure the number of worker threads on CPU and GPUs. There is an important config property that can speed up the server depending on the workload.
  *Note: the following property has bigger impact under heavy workloads.*
- If TorchServe is hosted on a machine with GPUs, there is a config property called `number_of_gpu` that tells the server to use a specific number of GPUs per model. In cases where we register multiple models with the server, this will apply to all the models registered. If this is set to a low value (ex: 0 or 1), it will result in under-utilization of GPUs. On the contrary, setting it to a high value (>= max GPUs available on the system) results in as many workers getting spawned per model. Clearly, this will result in unnecessary contention for GPUs and can result in sub-optimal scheduling of threads to GPU.
+
+ **CPU**: there is a config property called `workers` which sets the number of worker threads for a model. A good starting value for `workers` is `num physical cores / 2`; increase it as much as possible after setting `torch.set_num_threads(1)` in your handler.
+
+ **GPU**: there is a config property called `number_of_gpu` that tells the server to use a specific number of GPUs per model. In cases where we register multiple models with the server, this will apply to all the models registered. If this is set to a low value (ex: 0 or 1), it will result in under-utilization of GPUs. On the contrary, setting it to a high value (>= max GPUs available on the system) results in as many workers getting spawned per model. Clearly, this will result in unnecessary contention for GPUs and can result in sub-optimal scheduling of threads to GPU.
  ```
  ValueToSet = (Number of Hardware GPUs) / (Number of Unique Models)
  ```
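The worker-sizing guidance added above can be turned into a quick calculation. Below is a minimal illustrative sketch (not part of this commit); the helper names and the assumption of two hardware threads per physical core are hypothetical:

```python
# Illustrative only: rough sizing helpers that mirror the README guidance above.
import multiprocessing

def suggested_cpu_workers(logical_cores=None, threads_per_core=2):
    # Start at roughly half the physical cores (assumes an SMT factor of 2),
    # with torch.set_num_threads(1) set in the handler.
    logical_cores = logical_cores or multiprocessing.cpu_count()
    physical_cores = max(1, logical_cores // threads_per_core)
    return max(1, physical_cores // 2)

def suggested_number_of_gpu(hardware_gpus, unique_models):
    # ValueToSet = (Number of Hardware GPUs) / (Number of Unique Models)
    return max(1, hardware_gpus // max(1, unique_models))

print(suggested_cpu_workers(logical_cores=16))   # -> 4
print(suggested_number_of_gpu(8, 2))             # -> 4
```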

ci/buildspec_cpu.yml

Lines changed: 2 additions & 3 deletions
@@ -7,9 +7,8 @@ phases:
     commands:
       - apt-get update
       - apt-get install sudo -y
-      - python ts_scripts/install_dependencies.py --environment=dev
+      - pip install -r ci/launcher/requirements.txt

   build:
     commands:
-      - python torchserve_sanity.py
-      - cd serving-sdk/ && mvn clean install -q && cd ../
+      - python ci/launcher/launch_test.py --instance-type c5.18xlarge

ci/buildspec_cpu_backup.yml

Lines changed: 15 additions & 0 deletions
```yaml
# Build Spec for AWS CodeBuild CI

version: 0.2

phases:
  install:
    commands:
      - apt-get update
      - apt-get install sudo -y
      - python ts_scripts/install_dependencies.py --environment=dev

  build:
    commands:
      - python torchserve_sanity.py
      - cd serving-sdk/ && mvn clean install -q && cd ../
```

ci/buildspec_gpu.yml

Lines changed: 2 additions & 3 deletions
@@ -7,9 +7,8 @@ phases:
     commands:
       - apt-get update
       - apt-get install sudo -y
-      - python ts_scripts/install_dependencies.py --cuda=cu102 --environment=dev
+      - pip install -r ci/launcher/requirements.txt

   build:
     commands:
-      - python torchserve_sanity.py
-      - cd serving-sdk/ && mvn clean install -q && cd ../
+      - python ci/launcher/launch_test.py --instance-type p3.2xlarge

ci/buildspec_gpu_backup.yml

Lines changed: 15 additions & 0 deletions
```yaml
# Build Spec for AWS CodeBuild CI

version: 0.2

phases:
  install:
    commands:
      - apt-get update
      - apt-get install sudo -y
      - python ts_scripts/install_dependencies.py --cuda=cu102 --environment=dev

  build:
    commands:
      - python torchserve_sanity.py
      - cd serving-sdk/ && mvn clean install -q && cd ../
```

ci/launcher/launch_test.py

Lines changed: 190 additions & 0 deletions
```python
import argparse
import boto3
import datetime
import random
import subprocess
import os
import time


from botocore.config import Config
from fabric2 import Connection
from invoke import run

from utils import LOGGER, GPU_INSTANCES
from utils import ec2 as ec2_utils

CPU_INSTANCE_COMMANDS_LIST = [
    "python3 ts_scripts/install_dependencies.py --environment=dev",
    "python3 torchserve_sanity.py",
    "cd serving-sdk/ && mvn clean install -q && cd ../",
]

GPU_INSTANCE_COMMANDS_LIST = [
    "python3 ts_scripts/install_dependencies.py --environment=dev --cuda=cu102",
    "python3 torchserve_sanity.py",
    "cd serving-sdk/ && mvn clean install -q && cd ../",
]


def run_commands_on_ec2_instance(ec2_connection, is_gpu):
    """
    This function assumes that the required 'serve' folder is already available on the ec2 instance in the home directory.
    Returns a map of the command executed and return value of that command.
    """

    command_result_map = {}

    virtual_env_name = "venv"

    with ec2_connection.cd(f"/home/ubuntu/serve"):
        ec2_connection.run(f"python3 -m venv {virtual_env_name}")
        with ec2_connection.prefix(f"source {virtual_env_name}/bin/activate"):
            commands_list = GPU_INSTANCE_COMMANDS_LIST if is_gpu else CPU_INSTANCE_COMMANDS_LIST

            for command in commands_list:
                LOGGER.info(f"*** Executing command on ec2 instance: {command}")
                ret_obj = ec2_connection.run(
                    command,
                    echo=True,
                    warn=True,
                    pty=True,
                    shell="/bin/bash",
                    env={"LC_CTYPE": "en_US.utf8", "JAVA_HOME": "/usr/lib/jvm/java-11-openjdk-amd64"},
                )

                if ret_obj.return_code != 0:
                    LOGGER.error(f"*** Failed command: {command}")
                    LOGGER.error(f"*** Failed command stdout: {ret_obj.stdout}")
                    LOGGER.error(f"*** Failed command stderr: {ret_obj.stderr}")

                command_result_map[command] = ret_obj.return_code

    return command_result_map


def launch_ec2_instance(region, instance_type, ami_id):
    """
    Note: This function relies on CODEBUILD environment variables. If this function is used outside of CODEBUILD,
    modify the function accordingly.
    Spins up an ec2 instance, clones the current Github Pull Request commit id on the instance, and runs sanity test on it.
    Prints the output of the command executed.
    """
    github_repo = os.environ.get("CODEBUILD_SOURCE_REPO_URL", "https://github.com/pytorch/serve.git").strip()
    github_pr_commit_id = os.environ.get("CODEBUILD_RESOLVED_SOURCE_VERSION", "HEAD").strip()
    github_hookshot = os.environ.get("CODEBUILD_SOURCE_VERSION", "job-local").strip()
    github_hookshot = github_hookshot.replace("/", "-")

    # Extract the PR number or use the last 6 characters of the commit id
    github_pull_request_number = github_hookshot.split("-")[1] if "-" in github_hookshot else github_hookshot[-6:]

    ec2_client = boto3.client("ec2", config=Config(retries={"max_attempts": 10}), region_name=region)
    random.seed(f"{datetime.datetime.now().strftime('%Y%m%d%H%M%S%f')}")
    ec2_key_name = f"{github_hookshot}-ec2-instance-{random.randint(1, 1000)}"

    # Spin up ec2 instance and run tests
    try:
        key_file = ec2_utils.generate_ssh_keypair(ec2_client, ec2_key_name)
        instance_details = ec2_utils.launch_instance(
            ami_id,
            instance_type,
            ec2_key_name=ec2_key_name,
            region=region,
            user_data=None,
            iam_instance_profile_name=ec2_utils.EC2_INSTANCE_ROLE_NAME,
            instance_name=ec2_key_name,
        )

        instance_id = instance_details["InstanceId"]
        ip_address = ec2_utils.get_public_ip(instance_id, region=region)

        LOGGER.info(f"*** Waiting on instance checks to complete...")
        ec2_utils.check_instance_state(instance_id, state="running", region=region)
        ec2_utils.check_system_state(instance_id, system_status="ok", instance_status="ok", region=region)
        LOGGER.info(f"*** Instance checks complete. Running commands on instance.")

        # Create a fabric connection to the ec2 instance.
        ec2_connection = ec2_utils.get_ec2_fabric_connection(instance_id, key_file, region)

        LOGGER.info(f"Running update command. This could take a while.")
        ec2_connection.run(f"sudo apt update")

        # Update command takes a while to run, and should ideally run uninterrupted
        time.sleep(300)

        with ec2_connection.cd("/home/ubuntu"):
            LOGGER.info(f"*** Cloning the PR related to {github_hookshot} on the ec2 instance.")
            ec2_connection.run(f"git clone {github_repo}")
            ec2_connection.run(
                f"cd serve && git fetch origin pull/{github_pull_request_number}/head:pull && git checkout pull"
            )

            ec2_connection.run(f"sudo apt-get install -y python3-venv")
            # Following is necessary on Base Ubuntu DLAMI because the default python is python2
            # This will NOT fail for other AMI where default python is python3
            ec2_connection.run(
                f"sudo cp /usr/local/bin/pip3 /usr/local/bin/pip && pip install --upgrade pip", warn=True
            )
            ec2_connection.run(
                f"sudo update-alternatives --install /usr/bin/python python /usr/bin/python3 1", warn=True
            )

        is_gpu = True if instance_type[:2] in GPU_INSTANCES else False

        command_return_value_map = run_commands_on_ec2_instance(ec2_connection, is_gpu)

        if any(command_return_value_map.values()):
            raise ValueError(f"*** One of the commands executed on ec2 returned a non-zero value.")
        else:
            LOGGER.info(f"*** All commands executed successfully on ec2. command:return_value map is as follows:")
            LOGGER.info(command_return_value_map)

    except ValueError as e:
        LOGGER.error(f"*** ValueError: {e}")
        LOGGER.error(f"*** Following commands had the corresponding return value:")
        LOGGER.error(command_return_value_map)
        raise e
    except Exception as e:
        LOGGER.error(f"*** Exception occurred. {e}")
        raise e
    finally:
        LOGGER.warning(f"*** Terminating instance-id: {instance_id} with name: {ec2_key_name}")
        ec2_utils.terminate_instance(instance_id, region)
        LOGGER.warning(f"*** Destroying ssh key_pair: {ec2_key_name}")
        ec2_utils.destroy_ssh_keypair(ec2_client, ec2_key_name)


def main():

    parser = argparse.ArgumentParser()

    parser.add_argument(
        "--instance-type",
        default="p3.2xlarge",
        help="Specify the instance type you want to run the test on. Default: p3.2xlarge",
    )

    parser.add_argument(
        "--region",
        default="us-west-2",
        help="Specify the aws region in which you want associated ec2 instance to be spawned",
    )

    parser.add_argument(
        "--ami-id",
        default="ami-032e40ca6b0973cf2",
        help="Specify an Ubuntu Base DLAMI only. This AMI type ships with nvidia drivers already setup. Using other AMIs might"
        "need non-trivial installations on the AMI. AMI-ids differ per aws region.",
    )

    arguments = parser.parse_args()

    instance_type = arguments.instance_type
    region = arguments.region
    ami_id = arguments.ami_id

    launch_ec2_instance(region, instance_type, ami_id)


if __name__ == "__main__":
    main()
```

ci/launcher/requirements.txt

Lines changed: 3 additions & 0 deletions
```text
fabric2==2.5.0
boto3
retrying
```

ci/launcher/utils/__init__.py

Lines changed: 10 additions & 0 deletions
```python
import logging
import sys

LOGGER = logging.getLogger(__name__)
LOGGER.setLevel(logging.INFO)
LOGGER.addHandler(logging.StreamHandler(sys.stderr))

DEFAULT_REGION = "us-west-2"

GPU_INSTANCES = ["p2", "p3", "p4", "g2", "g3", "g4"]
```
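For context, `GPU_INSTANCES` is the list the launcher checks an instance-type prefix against (see the `is_gpu` flag in `ci/launcher/launch_test.py`). A minimal illustrative sketch; the helper name is hypothetical and not part of this commit:

```python
# Illustrative only: classify an EC2 instance type the same way launch_test.py does.
from utils import GPU_INSTANCES

def is_gpu_instance(instance_type: str) -> bool:
    # "p3.2xlarge" -> "p3" (GPU family), "c5.18xlarge" -> "c5" (CPU family)
    return instance_type[:2] in GPU_INSTANCES

assert is_gpu_instance("p3.2xlarge") is True
assert is_gpu_instance("c5.18xlarge") is False
```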
