Commit 169e98e

[gpu] performance and functionality improvements (#1265)
* [gpu] performance and functionality improvements
* Capturing disk usage statistics to reduce excessive disk space
* created exit handler to clean up environment on completion or failure
* created prepare function to prepare for the installation
* when sufficient memory is available, configure a ramdisk
* reduce noise by turning off -x in utility functions
* added descriptive comments before the obscurely coded compare_versions_lte and compare_versions_lt functions
* removed some intermediate driver versions
* added cuda url for 12.6
* execute_with_retries now logs on failure, captures runtime and cleans before installing on debian
* saving OS installation and NV .run files and their temp files to ramdisk
* piping source .xz file directly to xz instead of saving to disk first
* new utility function "is_debuntu" checks for the frequently used condition of whether the running OS is either debian or ubuntu
* added support for specifying an http proxy (thank you प्रकाश)
* moving load of kernel module to later in the code and exercising modprobe of all modules to avoid regression
* fixed problem with attempting to fetch from incorrect vault directory when rocky kernel package is not found in primary repo
* using correct cran-r signing key for ubuntu18
* corrected file check condition for /etc/apt/trusted.gpg
* do not update all packages on rocky ; move preparation to prepare function
* increasing memory to make use of ramdisk
* using something a little smaller
* create mount_ramdisk function and call it ; fix up the version comparison functions ; create ge and le comparisons for OSs
* iterating better, caching results of system calls ; renamed to repair_old_backports
* comparing correct version numbers
* rocky uses a tmpfs on /tmp in the base image
* tested on rocky and ubuntu
* tested harder on rocky
* cuda 11 no longer available for debian 12
* cuda v11 no longer supported on debian12
* corrected use of ubuntu regex for rocky version
* re-enabling spark job tests
* correct a couple of edge cases
* added instructions for manually running tests
* open a monitor session by default
* cleaning up cuda and cudnn url generation
* condition better
* cleaned up generation of NVIDIA_CUDA_URL
* updated versions and GPU accelerators in the documentation
* ensure this test to be skipped based on cuda version rather than dataproc version alone
* fix for /usr/local/cuda-12.4/bin/nvcc: No such file or directory
* correcting path to run-bazel-tests.sh
* runing variable definition
* cleaned up skip conditions
* order of operations
* works with 2.0-rocky8
* remove redundant conditional check
* supported version limits are tightened up a bit ; clean up rocky vault install code
* corrected syntax errors
* failure to run dnf here should not fail the entire installer
* order matters here
* 2.2-ubuntu22 works with cuda 11, other 2.2 do not
* 2.2-ubuntu22 works with cuda 11, other 2.2 do not
* fixes ubuntu22 kernel version mismatch error
* disabling rocky9 builds due to out of date base dataproc image
* cuda 2.0 not supported in debian12
* some 2.0-rocky8 single instance tests fail
* intended to use <= and not >=
* simplify gpu resource script
* setting default discoveryScript ; testing pyspark in its own function
* remove spark: prefix from property names
* comment out quite a few tests
* new version numbers
* fixed a syntax error with documentation
* mustn't forget the commas
* half as many tasks with twice as much cpu and gpu each
* pause before first ssh ; correct variable name
1 parent da3d8c1 commit 169e98e
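
The commit message mentions version comparison helpers (`compare_versions_lte`, `compare_versions_lt`) and a retry wrapper (`execute_with_retries`) that logs failures and captures runtime, but their bodies are not part of the files shown here. The following is only a sketch of how such helpers are commonly written in bash, not the installer's actual code. The names `version_le`, `version_lt`, `retry_with_log`, and the variables `MAX_ATTEMPTS` and `RETRY_DELAY` are hypothetical, and the comparison assumes GNU `sort -V` is available.

```shell
#!/usr/bin/env bash
# Hypothetical helpers for illustration; the installer's own functions may differ.

# True when $1 <= $2 in version order (assumes GNU `sort -V`).
version_le() {
  [[ "$1" == "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n1)" ]]
}

# True when $1 < $2: the two differ and $1 sorts first.
version_lt() {
  [[ "$1" != "$2" ]] && version_le "$1" "$2"
}

# Run a command up to MAX_ATTEMPTS times (default 3), logging each failure
# and the elapsed seconds to stderr.
retry_with_log() {
  local max_attempts="${MAX_ATTEMPTS:-3}" delay="${RETRY_DELAY:-1}"
  local attempt start=$SECONDS
  for (( attempt = 1; attempt <= max_attempts; attempt++ )); do
    if "$@"; then
      echo "ok after ${attempt} attempt(s), $(( SECONDS - start ))s: $*" >&2
      return 0
    fi
    echo "attempt ${attempt}/${max_attempts} failed: $*" >&2
    (( attempt < max_attempts )) && sleep "${delay}"
  done
  return 1
}
```

A caller could then write something like `version_le "${cuda_version}" 12.4 && retry_with_log apt-get update`, where `cuda_version` is whatever variable holds the requested CUDA release.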

8 files changed: +735 -271 lines changed

gpu/Dockerfile

+40
@@ -0,0 +1,40 @@
+# This Dockerfile builds the container from which manual tests are run
+# This process needs to be executed manually from a git clone
+#
+# See manual-test-runner.sh for instructions
+
+FROM gcr.io/cloud-builders/gcloud
+
+RUN useradd -m -d /home/ia-tests -s /bin/bash ia-tests
+
+# Installed here are packages on which the tests depend
+RUN apt-get -qq update \
+  && apt-get -y -qq install \
+    apt-transport-https apt-utils \
+    ca-certificates libmime-base64-perl gnupg \
+    curl jq less screen > /dev/null 2>&1 && apt-get clean
+
+# Install bazel signing key, repo and package
+ENV bazel_kr_path=/usr/share/keyrings/bazel-release.pub.gpg
+ENV bazel_repo_data="http://storage.googleapis.com/bazel-apt stable jdk1.8"
+
+RUN /usr/bin/curl -s https://bazel.build/bazel-release.pub.gpg \
+  | gpg --dearmor -o "${bazel_kr_path}" \
+  && echo "deb [arch=amd64 signed-by=${bazel_kr_path}] ${bazel_repo_data}" \
+  | dd of=/etc/apt/sources.list.d/bazel.list status=none \
+  && apt-get update -qq
+
+RUN apt-get autoremove -y -qq && \
+  apt-get install -y -qq default-jdk python3-setuptools bazel > /dev/null 2>&1 && \
+  apt-get clean
+
+# Install here any utilities you find useful when troubleshooting
+RUN apt-get -y -qq install emacs-nox vim uuid-runtime > /dev/null 2>&1 && apt-get clean
+
+WORKDIR /init-actions
+
+USER ia-tests
+COPY --chown=ia-tests:ia-tests . ${WORKDIR}
+
+ENTRYPOINT ["/bin/bash"]
+#CMD ["/bin/bash"]
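
The Dockerfile header says the image is built manually from a git clone and defers to manual-test-runner.sh for the actual instructions, which are not shown in this diff. As a rough sketch only, assuming the build context is the repository root (so that `COPY .` picks up the whole clone) and using a hypothetical image tag `ia-tests:local`, the build and run steps might look like this. The `run` wrapper, the `DRY_RUN` switch and the function name `build_and_enter` are illustrative and do not come from the commit.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper; manual-test-runner.sh remains the authoritative reference.

IMAGE_TAG="${IMAGE_TAG:-ia-tests:local}"   # assumed tag, not from the commit

# Print the command when DRY_RUN=1 (the default here), otherwise execute it.
run() {
  if [[ "${DRY_RUN:-1}" == "1" ]]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Build from the repository root, then open the image's bash entrypoint.
build_and_enter() {
  run docker build -f gpu/Dockerfile -t "${IMAGE_TAG}" .
  run docker run -it --rm "${IMAGE_TAG}"
}
```

Calling `build_and_enter` prints the two docker commands, and `DRY_RUN=0 build_and_enter` runs them. Because the image sets `ENTRYPOINT ["/bin/bash"]`, `docker run -it` lands in an interactive shell as the `ia-tests` user.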

gpu/README.md

+14 -14
@@ -14,12 +14,14 @@ for CUDA, the nvidia kernel driver, cuDNN, and NCCL.
 Specifying a supported value for the `cuda-version` metadata variable
 will select the following values for Driver, CuDNN and NCCL. At the
 time of writing, the default value for cuda-version, if unspecified is
-12.4. In addition to 12.4, we have also tested with 11.8.
+12.4. In addition to 12.4, we have also tested with 11.8, 12.0 and 12.6.
 
-CUDA | Full Version | Driver | CuDNN | NCCL | Supported OSs
+CUDA | Full Version | Driver | CuDNN | NCCL | Tested Dataproc Image Versions
 -----| ------------ | --------- | --------- | ------- | -------------------
-11.8 | 11.8.0 | 525.147 | 8.6.0.163 | 2.15.5 | All
-12.4 | 12.4.1 | 550.90.07 | 9.1.0.70 | 2.21.5 | ALL
+11.8 | 11.8.0 | 560.35.03 | 8.6.0.163 | 2.15.5 | 2.0, 2.1, 2.2-ubuntu22
+12.0 | 12.0.0 | 550.90.07 | 8.8.1.3, | 2.16.5 | 2.0, 2.1, 2.2-rocky9, 2.2-ubuntu22
+12.4 | 12.4.1 | 550.90.07 | 9.1.0.70 | 2.23.4 | 2.1-ubuntu20, 2.1-rocky8, 2.2
+12.6 | 12.6.2 | 560.35.03 | 9.5.1.17 | 2.23.4 | 2.1-ubuntu20, 2.1-rocky8, 2.2
 
 All variants in the preceeding table have been manually tested to work
 with the installer. Supported OSs at the time of writing are:
@@ -28,7 +30,6 @@ with the installer. Supported OSs at the time of writing are:
 * Ubuntu 18.04, 20.04, and 22.04 LTS
 * Rocky 8 and 9
 
-
 ## Using this initialization action
 
 **:warning: NOTICE:** See
@@ -47,16 +48,15 @@ attached GPU adapters.
 CLUSTER_NAME=<cluster_name>
 gcloud dataproc clusters create ${CLUSTER_NAME} \
 --region ${REGION} \
---master-accelerator type=nvidia-tesla-v100 \
---worker-accelerator type=nvidia-tesla-v100,count=4 \
+--master-accelerator type=nvidia-tesla-t4 \
+--worker-accelerator type=nvidia-tesla-t4,count=4 \
 --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh
 ```
 
 1. Use the `gcloud` command to create a new cluster with NVIDIA GPU drivers
 and CUDA installed by initialization action as well as the GPU
 monitoring service. The monitoring service is supported on Dataproc 2.0+ Debian
-and Ubuntu images. Please create a Github issue if support is needed for other
-Dataproc images.
+and Ubuntu images.
 
 *Prerequisite:* Create GPU metrics in
 [Cloud Monitoring](https://cloud.google.com/monitoring/docs/) using Google
@@ -90,8 +90,8 @@ attached GPU adapters.
 CLUSTER_NAME=<cluster_name>
 gcloud dataproc clusters create ${CLUSTER_NAME} \
 --region ${REGION} \
---master-accelerator type=nvidia-tesla-v100 \
---worker-accelerator type=nvidia-tesla-v100,count=4 \
+--master-accelerator type=nvidia-tesla-t4 \
+--worker-accelerator type=nvidia-tesla-t4,count=4 \
 --initialization-actions gs://goog-dataproc-initialization-actions-${REGION}/gpu/install_gpu_driver.sh \
 --metadata install-gpu-agent=true \
 --scopes https://www.googleapis.com/auth/monitoring.write
@@ -136,12 +136,12 @@ attached GPU adapters.
 #### GPU Scheduling in YARN:
 
 YARN is the default Resource Manager for Dataproc. To use GPU scheduling feature
-in Spark, it requires YARN version >= 2.10 or >=3.1.1. If intended to use Spark
+in Spark, it requires YARN version >= 2.10 or >= 3.1.1. If intended to use Spark
 with Deep Learning use case, it recommended to use YARN >= 3.1.3 to get support
-for [nvidia-docker version 2](https://github.com/NVIDIA/nvidia-docker).
+for [nvidia-container-toolkit](https://github.com/NVIDIA/nvidia-container-toolkit).
 
 In current Dataproc set up, we enable GPU resource isolation by initialization
-script without NVIDIA Docker, you can find more information at
+script with NVIDIA container toolkit. You can find more information at
 [NVIDIA Spark RAPIDS getting started guide](https://nvidia.github.io/spark-rapids/).
 
 #### cuDNN
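
The README hunks above say that the `cuda-version` metadata variable selects the driver, cuDNN and NCCL versions, and that 12.4 is the default when it is unspecified. The sketch below shows one way to check a requested version against the CUDA column of the updated table and print a matching cluster-create command. It only prints the command and never calls `gcloud`. The array `TESTED_CUDA_VERSIONS` and both function names are hypothetical, and the sketch does not check the per-version image restrictions listed in the last table column.

```shell
#!/usr/bin/env bash
# Illustrative only: TESTED_CUDA_VERSIONS and both function names are hypothetical.

TESTED_CUDA_VERSIONS=(11.8 12.0 12.4 12.6)   # CUDA column of the README table

# True when $1 appears in the tested list.
is_tested_cuda() {
  local v
  for v in "${TESTED_CUDA_VERSIONS[@]}"; do
    [[ "$v" == "$1" ]] && return 0
  done
  return 1
}

# Print (not run) a cluster-create command that pins cuda-version.
print_create_command() {
  local cluster="$1" region="$2" cuda="${3:-12.4}"   # 12.4 is the documented default
  if ! is_tested_cuda "${cuda}"; then
    echo "untested cuda-version: ${cuda}" >&2
    return 1
  fi
  cat <<EOF
gcloud dataproc clusters create ${cluster} \\
  --region ${region} \\
  --master-accelerator type=nvidia-tesla-t4 \\
  --worker-accelerator type=nvidia-tesla-t4,count=4 \\
  --initialization-actions gs://goog-dataproc-initialization-actions-${region}/gpu/install_gpu_driver.sh \\
  --metadata cuda-version=${cuda}
EOF
}
```

For example, `print_create_command my-cluster us-central1 12.6` prints a command ending in `--metadata cuda-version=12.6`, while an unlisted version such as `10.2` is rejected with a message on stderr.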

gpu/bazel.screenrc

+11
@@ -0,0 +1,11 @@
+screen -L -t 2.0-debian10 1 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.0-debian10 ; exec /bin/bash'
+#screen -L -t 2.0-rocky8 2 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.0-rocky8 ; exec /bin/bash'
+#screen -L -t 2.0-ubuntu18 3 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.0-ubuntu18 ; exec /bin/bash'
+
+#screen -L -t 2.1-debian11 4 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.1-debian11 ; exec /bin/bash'
+#screen -L -t 2.1-rocky8 5 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.1-rocky8 ; exec /bin/bash'
+#screen -L -t 2.1-ubuntu20 6 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.1-ubuntu20 ; exec /bin/bash'
+
+#screen -L -t 2.2-debian12 7 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.2-debian12 ; exec /bin/bash'
+#screen -L -t 2.2-rocky9 8 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.2-rocky9 ; exec /bin/bash'
+#screen -L -t 2.2-ubuntu22 9 sh -c '/bin/bash -x gpu/run-bazel-tests.sh 2.2-ubuntu22 ; exec /bin/bash'
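
Each line of bazel.screenrc opens one logged `screen` window per Dataproc image version, and all but the first are commented out. The file in the commit is written by hand. As a sketch under that caveat, a small generator could emit the same line shape and leave uncommented only the versions being tested. The function name `gen_screenrc` and its argument convention are hypothetical, and the blank separator lines between image families are not reproduced.

```shell
#!/usr/bin/env bash
# Hypothetical generator; the committed bazel.screenrc is maintained by hand.

# Usage: gen_screenrc "<space-separated enabled versions>" <version>...
# Emits one screen line per version, commenting out those not enabled.
gen_screenrc() {
  local enabled="$1"; shift
  local n=0 v prefix
  for v in "$@"; do
    n=$(( n + 1 ))
    prefix="#"
    [[ " ${enabled} " == *" ${v} "* ]] && prefix=""
    printf "%sscreen -L -t %s %d sh -c '/bin/bash -x gpu/run-bazel-tests.sh %s ; exec /bin/bash'\n" \
      "${prefix}" "${v}" "${n}" "${v}"
  done
}
```

For instance, `gen_screenrc "2.0-debian10" 2.0-debian10 2.0-rocky8 2.0-ubuntu18` yields the first three lines of the file above, with only the `2.0-debian10` window active.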

gpu/env.json.sample

+7
@@ -0,0 +1,7 @@
+{
+  "PROJECT_ID":"example-yyyy-nn",
+  "PURPOSE":"cuda-pre-init",
+  "BUCKET":"my-bucket-name",
+  "IMAGE_VERSION":"2.2-debian12",
+  "ZONE":"us-west4-ñ"
+}
