gjbex
diff --git a/‎python_for_hpc.pptx
131 KB b/‎python_for_hpc.pptx
131 KB
diff --git a/‎source-code/README.md
+6 b/‎source-code/README.md
+6
diff --git a/‎source-code/gpu/README.md
+14 b/‎source-code/gpu/README.md
+14
diff --git a/‎source-code/gpu/curand.ipynb
+199 b/‎source-code/gpu/curand.ipynb
+199
@@ -3,6 +3,7 @@
 This is source code that is either used in the presentation, or was developed
 to create it.  There is some material not covered in the presentation as well.
 
+
 ## Requirements
 
 * Python version: at least 3.6
@@ -18,6 +19,10 @@ to create it.  There is some material not covered in the presentation as well.
   * jupyter
   * ipywidgets
 
+* For the GPU code:
+  * pycuda
+  * scikit-cuda
+
 
 ## What is it?
 
@@ -39,3 +44,4 @@ to create it.  There is some material not covered in the presentation as well.
 1. `pypy`: code to experiment with the Pypy interpreter.
 1. `file-formats`: influcence of file formats on performance.
 1. `gpu`: some examples of using GPUs.
+1. `performance`: general considerations about performance.
@@ -0,0 +1,14 @@
+# GPU
+
+Sample code for performing computations on a GPU.
+
+
+## What is it?
+
+1. `pycuda.ipynb`: jupyter notebook illustrating pyCUDA.
+1. `curand.ipynb`: jupyter notebook illustrating generating random
+   numbers on a GPU.
+1. `scikit_cuda.ipynb`: jupyter notebook illustrating linear algebra
+   on a GPU device.
+1. `numba.ipynb`: jupyter notebook illustrating using numba for
+   GPU computing.
@@ -0,0 +1,199 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "2a4181f6-f103-4025-a45c-5bd31bbc3d12",
+   "metadata": {},
+   "source": [
+    "# Requirements"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "eac06215-e0dd-430b-a157-a5aab4904b65",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import math\n",
+    "import numpy as np\n",
+    "import pycuda.autoinit\n",
+    "from pycuda import gpuarray\n",
+    "from pycuda.compiler import SourceModule"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5832d66d-0446-499c-a17e-ffcac9bdbd0a",
+   "metadata": {},
+   "source": [
+    "We determine $\\pi$ as the ratio between a circle of radius 1 and the square that circumscribes it. The area of the circle will be approximated by the number of randomly selected points that fall into it, compared to the (larger) number of points that fall into the square.\n",
+    "\n",
+    "If we choose $x$ and $y$ independently from a uniform distribution $[0, 1[$, then $(x, y)$ represents a point and lies in a circle with radius 1 if $x^2 + y^2 \\le 1$.  Since this is only one quarter of the circle and circumscribing square, we get $\\pi$ by dividing the number of points in the circle by the total number of points, and multiplying by 4."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "32ca5892-664f-4317-a41d-b4f7a4d3f156",
+   "metadata": {},
+   "source": [
+    "# Implementation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "2bc4dbd6-5018-427d-9ee9-347bfe6cd9d9",
+   "metadata": {},
+   "source": [
+    "To generate random numbers for each thread on the GPU, we use the curand library, which is a C++ library.  Since we use plain C for our CUDA code, we have to make sure that the header file is read ouside of an extermal C block.  The `SourceModule` add this automatically by default, so we make sure this isn't done by specifying the appropriate option.  In the source code, we implement our kernel in an external C blcok.\n",
+    "\n",
+    "The random number generator is intialized using the `curand_init` function that takes a seed as its first argument.  We ensure that it is unique for each thread by adding a thread-specific constant to the clock time.\n",
+    "\n",
+    "Random numbers are sampled from a uniform distribution using the `curand_uniform` function."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 48,
+   "id": "122a78d1-9d07-4484-8f30-a01ed58a9715",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "source_code = '''\n",
+    "    #include <curand_kernel.h>\n",
+    "    \n",
+    "    typedef unsigned long long cu_long;\n",
+    "    \n",
+    "    extern \"C\" {\n",
+    "        __global__ void estimate_pi(cu_long nr_tries, cu_long *nr_hits) {\n",
+    "            curandState rand_state;\n",
+    "            int thread_id = blockIdx.x*blockDim.x + threadIdx.x;\n",
+    "            curand_init((cu_long) clock() + (cu_long) thread_id,\n",
+    "                        (cu_long) 0, (cu_long) 0, &rand_state);\n",
+    "            float x, y;\n",
+    "            for (cu_long i = 0; i < nr_tries; ++i) {\n",
+    "                x = curand_uniform(&rand_state);\n",
+    "                y = curand_uniform(&rand_state);\n",
+    "                if (x*x + y*y < 1.0f) {\n",
+    "                    nr_hits[thread_id]++;\n",
+    "                }\n",
+    "            }\n",
+    "        }\n",
+    "    }\n",
+    "'''\n",
+    "\n",
+    "kernels = SourceModule(no_extern_c=True, source=source_code)\n",
+    "pi_kernel = kernels.get_function('estimate_pi')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "e54d0d19-9545-46b6-b79f-5a3a110e7a66",
+   "metadata": {},
+   "source": [
+    "Set the number of threads per block and the number of blocks per grid, and create an array of the appropriate size to store the counts for each thread.  Also specify the number of points to try"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 44,
+   "id": "71134a08-5195-4205-ae1c-51c7298704e7",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "threads_per_block = 32\n",
+    "blocks_per_grid = 512\n",
+    "total_threads = threads_per_block*blocks_per_grid\n",
+    "nr_hits = gpuarray.zeros((total_threads, ), dtype=np.uint64)\n",
+    "nr_tries = np.uint64(2**24)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "84c1a177-7a4f-4b27-9a11-4b318601ac4d",
+   "metadata": {},
+   "source": [
+    "Now we can execute the kernel and compute the value of $\\pi$."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 45,
+   "id": "a872e28c-9256-4d50-b46c-f8381131eda1",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pi_kernel(nr_tries, nr_hits, grid=(blocks_per_grid, 1, 1), block=(threads_per_block, 1, 1))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 46,
+   "id": "53ddbb43-0f22-427e-83a6-c6d0264e958e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pi_computed = 4.0*np.sum(nr_hits.get())/(nr_tries*total_threads)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "19bbb167-50b8-4bdc-8c92-1fb2f69d1e1b",
+   "metadata": {},
+   "source": [
+    "Checking the accuracy as comared with $\\pi$'s true value shows that it is correct upto a millionth."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "id": "f86f0258-64c8-44a0-b8b3-5a6f5d24e362",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "1.0e-01 True\n",
+      "1.0e-02 True\n",
+      "1.0e-03 True\n",
+      "1.0e-04 True\n",
+      "1.0e-05 True\n",
+      "1.0e-06 True\n",
+      "1.0e-07 False\n",
+      "1.0e-08 False\n",
+      "1.0e-09 False\n",
+      "1.0e-10 False\n",
+      "1.0e-11 False\n",
+      "1.0e-12 False\n"
+     ]
+    }
+   ],
+   "source": [
+    "for tol in np.logspace(-1, -12, num=12):\n",
+    "    print(f'{tol:.1e} {math.isclose(pi_computed, math.pi, rel_tol=tol)}')"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.7"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}