Commit f34457a

Merge pull request #18 from EntropyOrg/doc-gpu-setup

Add more GPU-specific documentation

2 parents: 6b1af9e + b1058f2

File tree: 5 files changed, +83 −2 lines

README.pod (+3)

(Generated file; diff not rendered by default.)

lib/AI/TensorFlow/Libtensorflow.pm (+3)

@@ -73,6 +73,9 @@ __END__
 The C<libtensorflow> library provides low-level C bindings
 for TensorFlow with a stable ABI.
 
+For more detailed information about this library including how to get started,
+see L<AI::TensorFlow::Libtensorflow::Manual>.
+
 =cut
 
 =begin :badges

lib/AI/TensorFlow/Libtensorflow/Manual.pod (+3)

@@ -8,6 +8,9 @@
 = L<AI::TensorFlow::Libtensorflow::Manual::Quickstart>
 Start here to get an overview of the library.
 
+= L<AI::TensorFlow::Libtensorflow::Manual::GPU>
+GPU-specific installation and usage information.
+
 = L<AI::TensorFlow::Libtensorflow::Manual::CAPI>
 Appendix of all C API functions with their signatures. These are linked from
 the documentation of individual methods.
lib/AI/TensorFlow/Libtensorflow/Manual/GPU.pod (new file, +48)

@@ -0,0 +1,48 @@
+# ABSTRACT: GPU-specific installation and usage information.
+# PODNAME: AI::TensorFlow::Libtensorflow::Manual::GPU
+=pod
+
+=head1 DESCRIPTION
+
+This guide provides information about using the GPU version of
+C<libtensorflow>. It is currently specific to NVIDIA GPUs, as
+they provide the CUDA API that C<libtensorflow> targets for GPU devices.
+
+=head1 INSTALLATION
+
+In order to use a GPU with C<libtensorflow>, you will need to check that the
+L<hardware requirements|https://www.tensorflow.org/install/pip#hardware_requirements> and
+L<software requirements|https://www.tensorflow.org/install/pip#software_requirements> are
+met. Please refer to the official documentation for the specific
+hardware capabilities and software versions.
+
+An alternative to installing all the software listed on the "bare metal" host
+machine is to use C<libtensorflow> via a Docker container and the
+NVIDIA Container Toolkit. See L<AI::TensorFlow::Libtensorflow::Manual::Quickstart/DOCKER IMAGES>
+for more information.
+
+=head1 RUNTIME
+
+When running C<libtensorflow>, your program will attempt to acquire quite a bit
+of GPU VRAM. You can check whether you have enough free VRAM by using the
+C<nvidia-smi> command, which displays resource information as well as which
+processes are currently using the GPU. If C<libtensorflow> is not able to
+allocate enough memory, it will crash with an out-of-memory (OOM) error. This
+is typical when running multiple programs that all use the GPU.
+
+If you have multiple GPUs, you can control which GPUs your program can access
+by using the
+L<C<CUDA_VISIBLE_DEVICES> environment variable|https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#env-vars>
+provided by the underlying CUDA library. This is typically
+done by setting the variable in a C<BEGIN> block before loading
+L<AI::TensorFlow::Libtensorflow>:
+
+    BEGIN {
+        # Set the specific GPU device that is available
+        # to this program to GPU index 0, which is the
+        # first GPU as listed in the output of `nvidia-smi`.
+        $ENV{CUDA_VISIBLE_DEVICES} = '0';
+        require AI::TensorFlow::Libtensorflow;
+    }
+
+=cut
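The VRAM check that the RUNTIME section describes can be sketched as a small shell snippet. The `--query-gpu`/`--format` flags are part of `nvidia-smi`'s standard query interface; the `command -v` guard is an addition here so the snippet degrades gracefully on hosts without an NVIDIA driver.

```shell
# Report per-GPU free and used VRAM before starting a libtensorflow program.
# Falls back to a message on hosts where nvidia-smi is not installed.
if command -v nvidia-smi >/dev/null 2>&1; then
    nvidia-smi --query-gpu=index,name,memory.free,memory.used --format=csv
else
    echo "nvidia-smi not available on this host"
fi
```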
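As a complement to the C<BEGIN>-block approach, `CUDA_VISIBLE_DEVICES` can also be set in the shell that launches the program, since the CUDA runtime in the child process only sees the listed GPU indices. The Perl script name below is hypothetical; the last line merely demonstrates that a child process inherits the restricted value.

```shell
# Restrict a (hypothetical) program to the first GPU, or hide all GPUs:
#   CUDA_VISIBLE_DEVICES=0  perl my_gpu_program.pl   # GPU index 0 only
#   CUDA_VISIBLE_DEVICES="" perl my_gpu_program.pl   # no GPUs visible
# Show that the child process sees the restricted value:
CUDA_VISIBLE_DEVICES=0 sh -c 'echo "child sees CUDA_VISIBLE_DEVICES=$CUDA_VISIBLE_DEVICES"'
```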

lib/AI/TensorFlow/Libtensorflow/Manual/Quickstart.pod (+26 −2)

@@ -94,6 +94,8 @@ C<http://127.0.0.1:8888/> in order to connect to the Jupyter Notebook interface
 via the web browser. In the browser, click on the C<notebook> folder to access
 the notebooks.
 
+=head2 GPU Docker support
+
 If using the GPU Docker image for NVIDIA support, make sure that the
 L<TensorFlow Docker requirements|https://www.tensorflow.org/install/docker#tensorflow_docker_requirements>
 are met and that the correct flags are passed to C<docker run>, for example
@@ -102,8 +104,30 @@ C<<
   docker run --rm --gpus all [...]
 >>
 
-More information about NVIDIA Docker containers can be found in the user guide
-for the L<NVIDIA Container Toolkit|https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html>.
+More information about NVIDIA Docker containers can be found in the
+NVIDIA Container Toolkit
+L<Installation Guide|https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html>
+(specifically L<Setting up NVIDIA Container Toolkit|https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/install-guide.html#setting-up-nvidia-container-toolkit>)
+and
+L<User Guide|https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/user-guide.html>.
+
+=head3 Diagnostics
+
+When using the Docker GPU image, you may come across the error
+
+C<<
+nvidia-container-cli: initialization error: load library failed: libnvidia-ml.so.1: cannot open shared object file: no such file or directory: unknown.
+>>
+
+Make sure that you have installed the NVIDIA Container Toolkit correctly
+via the Installation Guide. Also make sure that you only have one Docker daemon
+installed. The recommended approach is to install via the official Docker
+releases at L<https://docs.docker.com/engine/install/>. Note that in some
+cases, you may have other unofficial Docker installations such as the
+C<docker.io> package or the Snap C<docker> package, which may conflict with
+the official vendor-provided NVIDIA Container Runtime.
+
+=head2 Docker Tags
 
 =begin :list
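The "only one Docker daemon" advice from the Diagnostics section can be checked with a short sketch. It assumes a Debian/Ubuntu-style host (`dpkg` for apt packages, `snap` for Snap packages) and is guarded so it exits cleanly on other systems; the package names checked are the ones the text mentions.

```shell
# List Docker installations that may conflict with the official Docker release
# and the NVIDIA Container Runtime. Guarded: harmless where dpkg/snap are absent.
for pkg in docker.io docker-ce; do
    if command -v dpkg >/dev/null 2>&1 && dpkg -s "$pkg" >/dev/null 2>&1; then
        echo "found package: $pkg"
    fi
done
if command -v snap >/dev/null 2>&1 && snap list docker >/dev/null 2>&1; then
    echo "found snap: docker"
fi
echo "check complete"
```

If more than one installation is reported, remove all but the official Docker release before reinstalling the NVIDIA Container Toolkit.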