From 1abb082cb3b1662d777622cf5ecd2355b8e7c1c7 Mon Sep 17 00:00:00 2001
From: "Giordano Lipari @laptop4"
Date: Mon, 6 Nov 2023 17:06:16 +0100
Subject: [PATCH 01/17] Episode 2: Empty commit to mark the detachment of a branch

If all goes as planned, I intend to provide all remarks on Episode 2 (Using your GPU with CuPy) in this dedicated branch.
Additional sub-branches may concern (i) linguistic editing and (ii) pedagogical suggestions.
If a suggestion involves both aspects of feedback, I will choose based on judgement.

From 1d12ca31588038cc850651b962b8472da35c54e4 Mon Sep 17 00:00:00 2001
From: "Giordano Lipari @laptop4"
Date: Mon, 6 Nov 2023 18:04:11 +0100
Subject: [PATCH 02/17] Episode02: Apply semantic newlines

Semantic newlines mean that each sentence takes exactly one line, hence all lines that contain sentences are one-liners.
This makes it easy to isolate the changes the file undergoes.
A few broken lines have been merged into one.
More frequently, lines containing several sentences have been split at the full stop plus a whitespace.
A side effect of this rule occurs with abbreviations such as i.e., e.g., dev., std., vs., and suchlike.
Especially troublesome are numbers followed by a dot and whitespace, as in array outputs.
Sometimes a spurious blank line appeared.
Mind also when you break a comment, since the new line requires the # token at the start.
I have applied the changes with the text editor's search-and-replace dialog, using regular expressions and then git diff, repeatedly until there were no side effects.
No side effect means that the old version is one line and the new version is several lines, except when I blended broken lines; git diff shows the state of play neatly.
Sometimes I have changed a double into a single blank line (hopefully this is immaterial for styling).
I have done this ahead of working on linguistic and/or pedagogic editing.
The same order should then be followed when editing other lesson episodes (in other branches), for consistency of method.
Writing a sed or awk script to apply these changes correctly and consistently across different files is a TODO.
---
 episodes/cupy.Rmd | 94 +++++++++++++++++++++++++++++++----------------
 1 file changed, 62 insertions(+), 32 deletions(-)

diff --git a/episodes/cupy.Rmd b/episodes/cupy.Rmd
index 2be6c76..c646ead 100644
--- a/episodes/cupy.Rmd
+++ b/episodes/cupy.Rmd
@@ -24,7 +24,8 @@ This makes it a very convenient tool to use the compute power of GPUs for people

# Convolution in Python

We start by generating an artificial "image" on the host using Python and NumPy; the host is the CPU on the laptop, desktop, or cluster node you are using right now, and from now on we may use *host* to refer to the CPU and *device* to refer to the GPU.
-The image will be all zeros, except for isolated pixels with value one, on a regular grid. The plan is to convolve it with a Gaussian and inspect the result.
+The image will be all zeros, except for isolated pixels with value one, on a regular grid.
+The plan is to convolve it with a Gaussian and inspect the result.
We will also record the time it takes to execute this convolution on the host.

We can interactively write and executed the code in an iPython shell or a Jupyter notebook.

~~~python
import numpy as np

deltas = np.zeros((2048, 2048))
deltas[8::16,8::16] = 1
~~~

To get a feeling of how the whole image looks like, we can display the top-left corner of it.

~~~python
import pylab as pyl

# Necessary command to render a matplotlib image in a Jupyter notebook.
%matplotlib inline

# Display the image
# You can zoom in using the menu in the window that will appear
pyl.imshow(deltas[0:32, 0:32])
pyl.show()
~~~

After executing the code, you should see the following image.

![Deltas array](./fig/deltas.png){alt='Deltas array'}

### Background
The computation we want to perform on this image is a convolution, once on the host and once on the device so we can compare the results and execution times.
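To make the operation concrete before we reach for libraries, here is a minimal sketch of a direct 2D convolution in plain NumPy; this is an illustration only, not the lesson's code, and it assumes an odd-sized filter with zero padding (mimicking SciPy's `mode="same"`), so it is far too slow for real images:

~~~python
import numpy as np

def convolve2d_naive(image, kernel):
    # Direct stencil evaluation: each output pixel is the weighted sum
    # of its neighborhood in the input. Assumes odd kernel dimensions.
    ky, kx = kernel.shape
    pad_y, pad_x = ky // 2, kx // 2
    padded = np.pad(image, ((pad_y, pad_y), (pad_x, pad_x)))
    flipped = kernel[::-1, ::-1]  # convolution flips the kernel
    out = np.zeros_like(image)
    for i in range(image.shape[0]):
        for j in range(image.shape[1]):
            out[i, j] = np.sum(padded[i:i + ky, j:j + kx] * flipped)
    return out
~~~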
-In computer vision applications, convolutions are often used to filter images and if you want to know more about them, we encourage you to check out [this github repository](https://github.com/vdumoulin/conv_arithmetic) by Vincent Dumoulin and Francesco Visin with some great animations. We have already seen that we can think of an image as a matrix of color values, when we convolve that image with a particular filter, we generate a new matrix with different color values. An example of convolution can be seen in the figure below (illustration by Michael Plotke, CC BY-SA 3.0, via Wikimedia Commons). +In computer vision applications, convolutions are often used to filter images and if you want to know more about them, we encourage you to check out [this github repository](https://github.com/vdumoulin/conv_arithmetic) by Vincent Dumoulin and Francesco Visin with some great animations. +We have already seen that we can think of an image as a matrix of color values, when we convolve that image with a particular filter, we generate a new matrix with different color values. +An example of convolution can be seen in the figure below (illustration by Michael Plotke, CC BY-SA 3.0, via Wikimedia Commons). ![Example of animated convolution](./fig/2D_Convolution_Animation.gif){alt="Example of animated convolution"} @@ -64,14 +67,19 @@ In our example, we will convolve our image with a 2D Gaussian function shown bel $$G(x,y) = \frac{1}{2\pi \sigma^2} \exp\left(-\frac{x^2 + y^2}{2 \sigma^2}\right),$$ -where $x$ and $y$ are the "coordinates in our matrix, i.e. our row and columns, and $$\sigma$$ controls the width of the Gaussian distribution. Convolving images with 2D Gaussian functions will change the value of each pixel to be a weighted average of the pixels around it, thereby "smoothing" the image. Convolving images with a Gaussian function reduces the noise in the image, which is often required in [edge-detection](https://en.wikipedia.org/wiki/Gaussian_blur#Edge_detection) since most algorithms to do this are sensitive to noise. +where $x$ and $y$ are the "coordinates in our matrix, i.e. our row and columns, and $$\sigma$$ controls the width of the Gaussian distribution. +Convolving images with 2D Gaussian functions will change the value of each pixel to be a weighted average of the pixels around it, thereby "smoothing" the image. +Convolving images with a Gaussian function reduces the noise in the image, which is often required in [edge-detection](https://en.wikipedia.org/wiki/Gaussian_blur#Edge_detection) since most algorithms to do this are sensitive to noise. ::: callout -It is often useful to identify the dataflow inherent in a problem. Say, if we want to square a list of numbers, all the operations are independent. The dataflow of a one-to-one operation is called a *map*. +It is often useful to identify the dataflow inherent in a problem. +Say, if we want to square a list of numbers, all the operations are independent. +The dataflow of a one-to-one operation is called a *map*. ![Data flow of a map operation](./fig/mapping.svg){alt="Data flow of a map operation"} -A convolution is slightly more complicated. Here we have a many-to-one data flow, which is also known as a stencil. +A convolution is slightly more complicated. +Here we have a many-to-one data flow, which is also known as a stencil. 
![Data flow of a stencil operation](./fig/stencil.svg){alt="Data flow of a stencil operation"} @@ -79,7 +87,8 @@ GPU's are exceptionally well suited to compute algorithms that follow one of the ::: # Convolution on the CPU Using SciPy -Let us first construct the Gaussian, and then display it. Remember that at this point we are still doing everything with standard Python, and not using the GPU yet. +Let us first construct the Gaussian, and then display it. +Remember that at this point we are still doing everything with standard Python, and not using the GPU yet. ~~~python x, y = np.meshgrid(np.linspace(-2, 2, 15), np.linspace(-2, 2, 15)) @@ -95,7 +104,9 @@ This should show you a symmetrical two-dimensional Gaussian. ![Two-dimensional Gaussian](./fig/gauss.png){alt="Two-dimensional Gaussian"} -Now we are ready to do the convolution on the host. We do not have to write this convolution function ourselves, as it is very conveniently provided by SciPy. Let us also record the time it takes to perform this convolution and inspect the top left corner of the convolved image. +Now we are ready to do the convolution on the host. +We do not have to write this convolution function ourselves, as it is very conveniently provided by SciPy. +Let us also record the time it takes to perform this convolution and inspect the top left corner of the convolved image. ~~~python from scipy.signal import convolve2d as convolve2d_cpu @@ -193,7 +204,7 @@ We now have a more reasonable, but still impressive, speedup of 116 times over t :::::::::::::::::::::::::::::::::::::: challenge ## Challenge: convolution on the GPU without CuPy -Try to convolve the NumPy array `deltas` with the NumPy array `gauss` directly on the GPU, without using CuPy arrays. +Try to convolve the NumPy array `deltas` with the NumPy array `gauss` directly on the GPU, without using CuPy arrays. If this works, it should save us the time and effort of transferring `deltas` and `gauss` to the GPU. ::::::::::::::::::::::::::::::::::::: solution @@ -215,7 +226,6 @@ Random Access Memory (RAM) of the host and not in GPU memory. ::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::: - # Validation To check that we actually computed the same output on the host and the device we can compare the two output arrays `convolved_image_using_GPU` and `convolved_image_using_CPU`. @@ -280,7 +290,8 @@ This results in ...... ...... ...... -TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly. +TypeError: Implicit conversion to a NumPy array is not allowed. +Please use `.get()` to construct a NumPy array explicitly. ~~~ So SciPy routines cannot have CuPy arrays as input. @@ -319,7 +330,7 @@ print(f"{gpu_avg_time:.6f} s") You may be surprised that we can issue these commands without error. Contrary to SciPy routines, NumPy accepts CuPy arrays, i.e. arrays that exist in GPU memory, as input. -[In the CuPy documentation](https://docs.cupy.dev/en/stable/user_guide/interoperability.html#numpy) you can find some background on why NumPy routines can handle CuPy arrays. +[In the CuPy documentation](https://docs.cupy.dev/en/stable/user_guide/interoperability.html#numpy) you can find some background on why NumPy routines can handle CuPy arrays. Also, remember when we used `np.allclose` with a NumPy and a CuPy array as input? That worked for the same reason. 
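As a minimal sketch of this dispatch behavior (assuming a CUDA-capable GPU with CuPy installed; the variable names here are illustrative, not lesson code):

~~~python
import numpy as np
import cupy as cp

x_gpu = cp.arange(5, dtype=cp.float32)

# NumPy hands this call over to CuPy through the __array_function__
# protocol, so the work happens on the GPU and a CuPy array comes back.
y_gpu = np.convolve(x_gpu, x_gpu)
print(type(y_gpu))  # <class 'cupy.ndarray'>

# Mixing a NumPy and a CuPy array works for the same reason.
print(np.allclose(np.arange(5), cp.arange(5)))  # True
~~~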
@@ -330,11 +341,11 @@ The linear convolution is actually performed on the GPU, which also results in a 0.014529 s ~~~ -Without much effort, we obtained a 18 times speedup. +Without much effort, we obtained a 18 times speedup. # A real world example: image processing for radio astronomy -In this section, we will perform the four major steps in image processing for radio astronomy: determination of background characteristics, segmentation, connected component labelling and source measurements. +In this section, we will perform the four major steps in image processing for radio astronomy: determination of background characteristics, segmentation, connected component labelling and source measurements. ## Import the FITS image @@ -367,13 +378,27 @@ pyl.colorbar(im_plot, ax=ax) ![Image of the Galactic Center](./fig/improved_image_of_GC.png){alt="Image of the Galactic Center"} -The data has been switched left to right because Right Ascension increases to the left, so now it adheres to this astronomical convention. We can see a few dozen sources, especially in the lower left and upper right, which both have relatively good image quality. The band across the image from upper left to lower right has the worst image quality because there are many extended sources - especially the Galactic Center itself, in the middle - which are hard to deconvolve with the limited spacings from this observation. You can, however, see a couple of ring-like structures. These are actually supernova shells, i.e. the remnants from massive stars that exploded at the end of their lifes. +The data has been switched left to right because Right Ascension increases to the left, so now it adheres to this astronomical convention. +We can see a few dozen sources, especially in the lower left and upper right, which both have relatively good image quality. +The band across the image from upper left to lower right has the worst image quality because there are many extended sources - especially the Galactic Center itself, in the middle - which are hard to deconvolve with the limited spacings from this observation. +You can, however, see a couple of ring-like structures. +These are actually supernova shells, i.e. the remnants from massive stars that exploded at the end of their lifes. ## Determine the background characteristics of the image -We want to identify all the sources - meaning e.g. stars, supernova remnants and distant galaxies - in this image and measure their positions and fluxes. How do we separate source pixels from background pixels? When do we know if a pixel with a high value belongs to a source or is simply a noise peak? We assume the background noise, which is a reflection of the limited sensitivity of the radio telescope, has a normal distribution. The chance of having a background pixel with a value above 5 times the standard deviation is 2.9e-7. We have 2²² = 4.2e6 pixels in our image, so the chance of catching at least one random noise peak by setting a threshold of 5 times the standard deviation is less than 50%. We refer to the standard deviation as $\sigma$. +We want to identify all the sources - meaning e.g. stars, supernova remnants and distant galaxies - in this image and measure their positions and fluxes. +How do we separate source pixels from background pixels? When do we know if a pixel with a high value belongs to a source or is simply a noise peak? We assume the background noise, which is a reflection of the limited sensitivity of the radio telescope, has a normal distribution. 
+The chance of having a background pixel with a value above 5 times the standard deviation is 2.9e-7. +We have 2²² = 4.2e6 pixels in our image, so the chance of catching at least one random noise peak by setting a threshold of 5 times the standard deviation is less than 50%. +We refer to the standard deviation as $\sigma$. -How do we measure the standard deviation of the background pixels? First we need to separate them from the source pixels, based on their values, in the sense that high pixel values more likely come from sources. The technique we use is called $\kappa, \sigma$ clipping. First we take all pixels and compute the standard deviation ($\sigma$). Then we compute the median and clip all pixels larger than $median + 3 * \sigma$ and smaller than $median - 3 * \sigma$. From the clipped set, we compute the median and standard deviation again and clip again. Continue until no more pixels are clipped. The standard deviation from this final set of pixel values is the basis for the next step. +How do we measure the standard deviation of the background pixels? First we need to separate them from the source pixels, based on their values, in the sense that high pixel values more likely come from sources. +The technique we use is called $\kappa, \sigma$ clipping. +First we take all pixels and compute the standard deviation ($\sigma$). +Then we compute the median and clip all pixels larger than $median + 3 * \sigma$ and smaller than $median - 3 * \sigma$. +From the clipped set, we compute the median and standard deviation again and clip again. +Continue until no more pixels are clipped. +The standard deviation from this final set of pixel values is the basis for the next step. Before clipping, let us investigate some properties of our unclipped data. @@ -392,7 +417,8 @@ The maximum flux density is 2506 mJy/beam, coming from the Galactic Center itsel mean = 3.898e-04, median = 1.571e-05, sttdev = 1.993e-02, maximum = 2.506e+00 ~~~ -You might observe that $\kappa, \sigma$ clipping is a compute intense task, that is why we want to do it on a GPU. But let's first issue the algorithm on a CPU. +You might observe that $\kappa, \sigma$ clipping is a compute intense task, that is why we want to do it on a GPU. +But let's first issue the algorithm on a CPU. This is the NumPy code to do this: @@ -423,7 +449,8 @@ print(f"Fastest CPU ks clipping time = \ Fastest CPU ks clipping time = 7.777e+02 ms. ~~~ -So that is close to 1 second to perform these computations. Hopefully, we can speed this up using the GPU. +So that is close to 1 second to perform these computations. +Hopefully, we can speed this up using the GPU. How has $\kappa, \sigma$ clipping influenced our statistics? ~~~python @@ -445,7 +472,8 @@ mean of clipped = -1.945e-06, median of clipped = -9.796e-06 :::::::::::::::::::::::::::::::::::::: challenge ## Challenge: $\kappa, \sigma$ clipping on the GPU -Now that you understand how the $\kappa, \sigma$ clipping algorithm works, perform it on the GPU using CuPy and compute the speedup. Include the data transfer to and from the GPU in your calculation. +Now that you understand how the $\kappa, \sigma$ clipping algorithm works, perform it on the GPU using CuPy and compute the speedup. +Include the data transfer to and from the GPU in your calculation. 
::::::::::::::::::::::::::::::::::::: solution ## Solution @@ -479,10 +507,10 @@ The speedup factor for ks clipping is: 1.232e+01 ::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::: - ## Segment the image -We have seen that clipping at the $5 \sigma$ level of an image this size (2048² pixels) will yield a chance of less than 50% that from all the sources we detect at least one will be a noise peak. So let us set the threshold at $5 \sigma$ and segment it. +We have seen that clipping at the $5 \sigma$ level of an image this size (2048² pixels) will yield a chance of less than 50% that from all the sources we detect at least one will be a noise peak. +So let us set the threshold at $5 \sigma$ and segment it. First check that we find the same standard deviation from our clipper on the GPU: ~~~python @@ -512,7 +540,8 @@ Fastest CPU segmentation time = 6.294e+00 ms. ## Labelling of the segmented data -This is called connected component labelling (CCL). It will replace pixel values in the segmented image - just consisting of zeros and ones - of the first connected group of pixels with the value 1 - so nothing changed, but just for that first group - the pixel values in the second group of connected pixels will all be 2, the third connected group of pixels will all have the value 3 etc. +This is called connected component labelling (CCL). +It will replace pixel values in the segmented image - just consisting of zeros and ones - of the first connected group of pixels with the value 1 - so nothing changed, but just for that first group - the pixel values in the second group of connected pixels will all be 2, the third connected group of pixels will all have the value 3 etc. This is a CPU code for connected component labelling. @@ -537,7 +566,8 @@ The number of sources in the image at the 5σ level is 185. Fastest CPU CCL time = 2.546e+01 ms. ~~~ -Let us not just accept the answer, but also do a sanity check. What are the values in the labelled image? +Let us not just accept the answer, but also do a sanity check. +What are the values in the labelled image? ~~~python print(f"These are all the pixel values we can find in the labelled image: \ @@ -565,7 +595,9 @@ These are all the pixel values we can find in the labelled image: [ 0. 1. 2 # Source measurements -We are ready for the final step. We have been given observing time to make this beautiful image of the Galactic Center, we have determined its background statistics, we have separated actual cosmic sources from noise and now we want to measure these cosmic sources. What are their positions and what are their flux densities? +We are ready for the final step. +We have been given observing time to make this beautiful image of the Galactic Center, we have determined its background statistics, we have separated actual cosmic sources from noise and now we want to measure these cosmic sources. +What are their positions and what are their flux densities? Again, the algorithms from `scipy.ndimage` help us to determine these quantities. This is the CPU code for measuring our sources. @@ -625,7 +657,8 @@ Fastest CPU set of source measurements = 7.838e+02 ms. :::::::::::::::::::::::::::::::::::::: challenge ## Challenge: putting it all together Combine the first two steps of image processing for astronomy, i.e. determining background characteristics e.g. through $\kappa, \sigma$ clipping and segmentation into a single function, that works for both CPU and GPU. 
-Next, write a function for connected component labelling and source measurements on the GPU and calculate the overall speedup factor for the combined four steps of image processing in astronomy on the GPU relative to the CPU. Finally, verify your output by comparing with the previous output, using the CPU.
+Next, write a function for connected component labelling and source measurements on the GPU and calculate the overall speedup factor for the combined four steps of image processing in astronomy on the GPU relative to the CPU.
+Finally, verify your output by comparing with the previous output, using the CPU.

::::::::::::::::::::::::::::::::::::: solution
~~~python
@@ -674,7 +707,8 @@ def ccl_and_source_measurements_on_GPU(data_GPU, segmented_image_GPU):
     all_fluxes = sl_gpu(data_GPU, labelled_image_GPU,
 cp.arange(1, number_of_sources_in_image+1))
     # This seems redundant, but we want to return ndarrays (Numpy)
-    # and what we have are lists. These first have to be converted to
+    # and what we have are lists.
+    # These first have to be converted to
     # Cupy arrays before they can be converted to Numpy arrays.
     return cp.asnumpy(cp.asarray(all_positions)), \
     cp.asnumpy(cp.asarray(all_fluxes))
@@ -705,12 +739,9 @@ all_fluxes_agree = np.allclose(CPU_output[1], GPU_output[1])
 print(f"The CPU and GPU fluxes agree: {all_positions_agree}\n")
 ~~~

-
 ~~~output
-The four steps of image processing for astronomy take 1.060e+03 ms
- on our CPU.
-The four steps of image processing for astronomy take 5.770e+01 ms
- on our GPU.
+The four steps of image processing for astronomy take 1.060e+03 ms on our CPU.
+The four steps of image processing for astronomy take 5.770e+01 ms on our GPU.
 This means that the overall speedup factor GPU vs CPU equals: 1.838e+01

 The CPU and GPU positions agree: True
@@ -721,7 +752,6 @@ The CPU and GPU fluxes agree: True
 :::::::::::::::::::::::::::::::::::::
 ::::::::::::::::::::::::::::::::::::::

-
 :::::: keypoints
 - "CuPy provides GPU accelerated version of many NumPy and Scipy functions."
 - "Always have CPU and GPU versions of your code so that you can compare performance, as well as validate your code."

From ac6143ed77f788cf9c6dc371aa537cacdc3b2173 Mon Sep 17 00:00:00 2001
From: "Giordano Lipari @laptop4"
Date: Mon, 6 Nov 2023 18:45:24 +0100
Subject: [PATCH 03/17] Episode02: Apply semantic newlines (fix oversights)

---
 episodes/cupy.Rmd | 3 +--
 1 file changed, 1 insertion(+), 2 deletions(-)

diff --git a/episodes/cupy.Rmd b/episodes/cupy.Rmd
index c646ead..841acc7 100644
--- a/episodes/cupy.Rmd
+++ b/episodes/cupy.Rmd
@@ -221,8 +221,7 @@ However, this gives a long error message ending with:
 TypeError: Unsupported type <class 'numpy.ndarray'>
 ~~~

-It is unfortunately not possible to access NumPy arrays directly from the GPU because they exist in the
-Random Access Memory (RAM) of the host and not in GPU memory.
+It is unfortunately not possible to access NumPy arrays directly from the GPU because they exist in the Random Access Memory (RAM) of the host and not in GPU memory.

 :::::::::::::::::::::::::::::::::::::
 ::::::::::::::::::::::::::::::::::::::

From cdf743df3a9557229c4fad0305f0524ce6555cfc Mon Sep 17 00:00:00 2001
From: "Giordano Lipari @laptop4"
Date: Mon, 6 Nov 2023 19:45:28 +0100
Subject: [PATCH 04/17] Episode02: check spelling in US English

Since the developers are based at the Netherlands eScience CentER, I have applied a round of spellchecking in US English.
This affected the body of the text as well as the name of variables (labelled > labeled).
This very broad type of editing can be applied ahead of distinguishing linguistic and pedagogic issues. --- episodes/cupy.Rmd | 52 +++++++++++++++++++++++------------------------ 1 file changed, 26 insertions(+), 26 deletions(-) diff --git a/episodes/cupy.Rmd b/episodes/cupy.Rmd index 841acc7..af65a81 100644 --- a/episodes/cupy.Rmd +++ b/episodes/cupy.Rmd @@ -83,7 +83,7 @@ Here we have a many-to-one data flow, which is also known as a stencil. ![Data flow of a stencil operation](./fig/stencil.svg){alt="Data flow of a stencil operation"} -GPU's are exceptionally well suited to compute algorithms that follow one of these patterns. +GPUs are exceptionally well suited to compute algorithms that follow one of these patterns. ::: # Convolution on the CPU Using SciPy @@ -344,7 +344,7 @@ Without much effort, we obtained a 18 times speedup. # A real world example: image processing for radio astronomy -In this section, we will perform the four major steps in image processing for radio astronomy: determination of background characteristics, segmentation, connected component labelling and source measurements. +In this section, we will perform the four major steps in image processing for radio astronomy: determination of background characteristics, segmentation, connected component labeling and source measurements. ## Import the FITS image @@ -381,7 +381,7 @@ The data has been switched left to right because Right Ascension increases to th We can see a few dozen sources, especially in the lower left and upper right, which both have relatively good image quality. The band across the image from upper left to lower right has the worst image quality because there are many extended sources - especially the Galactic Center itself, in the middle - which are hard to deconvolve with the limited spacings from this observation. You can, however, see a couple of ring-like structures. -These are actually supernova shells, i.e. the remnants from massive stars that exploded at the end of their lifes. +These are actually supernova shells, i.e. the remnants from massive stars that exploded at the end of their lives. ## Determine the background characteristics of the image @@ -537,22 +537,22 @@ ms.") Fastest CPU segmentation time = 6.294e+00 ms. ~~~ -## Labelling of the segmented data +## Labeling of the segmented data -This is called connected component labelling (CCL). +This is called connected component labeling (CCL). It will replace pixel values in the segmented image - just consisting of zeros and ones - of the first connected group of pixels with the value 1 - so nothing changed, but just for that first group - the pixel values in the second group of connected pixels will all be 2, the third connected group of pixels will all have the value 3 etc. -This is a CPU code for connected component labelling. +This is a CPU code for connected component labeling. 
~~~python from scipy.ndimage import label as label_cpu -labelled_image = np.empty(data.shape) -number_of_sources_in_image = label_cpu(segmented_image, output = labelled_image) +labeled_image = np.empty(data.shape) +number_of_sources_in_image = label_cpu(segmented_image, output = labeled_image) sigma_unicode = "\u03C3" print(f"The number of sources in the image at the 5{sigma_unicode} level is \ {number_of_sources_in_image}.") -timing_CCL_CPU = %timeit -o label_cpu(segmented_image, output = labelled_image) +timing_CCL_CPU = %timeit -o label_cpu(segmented_image, output = labeled_image) fastest_CCL_CPU = timing_CCL_CPU.best print(f"Fastest CPU CCL time = {1000 * fastest_CCL_CPU:.3e} ms.") ~~~ @@ -566,17 +566,17 @@ Fastest CPU CCL time = 2.546e+01 ms. ~~~ Let us not just accept the answer, but also do a sanity check. -What are the values in the labelled image? +What are the values in the labeled image? ~~~python -print(f"These are all the pixel values we can find in the labelled image: \ -{np.unique(labelled_image)}") +print(f"These are all the pixel values we can find in the labeled image: \ +{np.unique(labeled_image)}") ~~~ This should show the following output: ~~~output -These are all the pixel values we can find in the labelled image: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. +These are all the pixel values we can find in the labeled image: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. @@ -604,9 +604,9 @@ This is the CPU code for measuring our sources. ~~~python from scipy.ndimage import center_of_mass as com_cpu from scipy.ndimage import sum_labels as sl_cpu -all_positions = com_cpu(data, labelled_image, \ +all_positions = com_cpu(data, labeled_image, \ range(1, number_of_sources_in_image+1)) -all_integrated_fluxes = sl_cpu(data, labelled_image, \ +all_integrated_fluxes = sl_cpu(data, labeled_image, \ range(1, number_of_sources_in_image+1)) print (f'These are the ten highest integrated fluxes of the sources in my \n\ @@ -624,9 +624,9 @@ Now we can try to measure the execution times for both algorithms, like this: ~~~python %%timeit -o -all_positions = com_cpu(data, labelled_image, \ +all_positions = com_cpu(data, labeled_image, \ range(1, number_of_sources_in_image+1)) -all_integrated_fluxes = sl_cpu(data, labelled_image, \ +all_integrated_fluxes = sl_cpu(data, labeled_image, \ range(1, number_of_sources_in_image+1)) ~~~ @@ -656,7 +656,7 @@ Fastest CPU set of source measurements = 7.838e+02 ms. :::::::::::::::::::::::::::::::::::::: challenge ## Challenge: putting it all together Combine the first two steps of image processing for astronomy, i.e. determining background characteristics e.g. through $\kappa, \sigma$ clipping and segmentation into a single function, that works for both CPU and GPU. -Next, write a function for connected component labelling and source measurements on the GPU and calculate the overall speedup factor for the combined four steps of image processing in astronomy on the GPU relative to the CPU. +Next, write a function for connected component labeling and source measurements on the GPU and calculate the overall speedup factor for the combined four steps of image processing in astronomy on the GPU relative to the CPU. Finally, verify your output by comparing with the previous output, using the CPU. 
::::::::::::::::::::::::::::::::::::: solution
~~~python
@@ -670,12 +670,12 @@ def first_two_steps_for_both_CPU_and_GPU(data):
     return segmented_image

 def ccl_and_source_measurements_on_CPU(data_CPU, segmented_image_CPU):
-    labelled_image_CPU = np.empty(data_CPU.shape)
+    labeled_image_CPU = np.empty(data_CPU.shape)
     number_of_sources_in_image = label_cpu(segmented_image_CPU,
-                                           output= labelled_image_CPU)
-    all_positions = com_cpu(data_CPU, labelled_image_CPU,
+                                           output= labeled_image_CPU)
+    all_positions = com_cpu(data_CPU, labeled_image_CPU,
         np.arange(1, number_of_sources_in_image+1))
-    all_fluxes = sl_cpu(data_CPU, labelled_image_CPU,
+    all_fluxes = sl_cpu(data_CPU, labeled_image_CPU,
         np.arange(1, number_of_sources_in_image+1))
     return np.array(all_positions), np.array(all_fluxes)

@@ -698,12 +698,12 @@ from cupyx.scipy.ndimage import center_of_mass as com_gpu
 from cupyx.scipy.ndimage import sum_labels as sl_gpu

 def ccl_and_source_measurements_on_GPU(data_GPU, segmented_image_GPU):
-    labelled_image_GPU = cp.empty(data_GPU.shape)
+    labeled_image_GPU = cp.empty(data_GPU.shape)
     number_of_sources_in_image = label_gpu(segmented_image_GPU,
-                                           output= labelled_image_GPU)
-    all_positions = com_gpu(data_GPU, labelled_image_GPU,
+                                           output= labeled_image_GPU)
+    all_positions = com_gpu(data_GPU, labeled_image_GPU,
         cp.arange(1, number_of_sources_in_image+1))
-    all_fluxes = sl_gpu(data_GPU, labelled_image_GPU,
+    all_fluxes = sl_gpu(data_GPU, labeled_image_GPU,
         cp.arange(1, number_of_sources_in_image+1))
     # This seems redundant, but we want to return ndarrays (Numpy)
     # and what we have are lists.

From db44928cfd1ae4e4b46bd80a6ec1a3c38c540f92 Mon Sep 17 00:00:00 2001
From: "Giordano Lipari @laptop4"
Date: Tue, 7 Nov 2023 08:39:24 +0100
Subject: [PATCH 05/17] E02-language: revise up to 'Convolution on the GPU Using CuPy'

This revision has aimed at improving the learning experience by means of:
(1) preserving content, barring a few sentences repeated within a short range, and some incidental observations that could still be inferred from the context;
(2) merging, splitting and relocating sentences in order to provide a reading experience in consistent blocks, mostly ordered from general to specific notions, so that a topic is opened and closed within a narrow span of text;
(3) changing the text at word level to reduce repetitions and/or convey more precision;
(4) only comments have been edited in the code lines; editing the code lines is to be handled in a separate branch named E02-code.
Although conciseness and brevity are different notions, the word count went from 4660 to 4568.

INFORMATION CHANGES
The revision deserves review as far as meanings are concerned.
Broadly speaking, editing was supposed to streamline the reading/learning experience and not to alter the content, which may have happened unintentionally.
However, the following changes are deliberate and should be revised all the more:
(1) the connection between CPU and GPU: rather than with a cable, I propose they are more likely connected with some interconnect.
I have added a Wikipedia link to interconnects and refrained from expanding on hardware specifications too widely, for example by mentioning NVLink and later developments.
(2) the status of cupyx as a namespace of CuPy.
I even read somewhere that, once upon a time, cupyx used to be referred to as cupy.cupyx.
At any rate, cupyx is documented as part of CuPy, and I have added a link to the official documentation.
My new wording is meant to convey the relatedness of CuPy and cupyx and may be inaccurate; see also (1) below.
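For the record, this is a minimal sketch of the relationship as I currently understand it; the snippet is my own check, not lesson content, and should be verified against the linked documentation:

~~~python
# cupyx ships inside the cupy distribution: installing CuPy provides
# both top-level packages, so no separate install or aliasing is involved.
import cupy as cp
import cupyx.scipy.signal

print(cp.__version__)
print(cupyx.scipy.signal.convolve2d)  # GPU counterpart of scipy.signal.convolve2d
~~~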
FUTURE
Remarks not implemented but worth considering in the future are:
(1) clarifying from the outset overlaps and distinctions between the terms API v libraries v packages v modules, and between methods v functions v routines.
Then, stick to a preferred usage, so as neither to confuse learners nor to reinforce confusion;
(2) if not done in a previous episode, mentioning that we are only coding on Nvidia graphics cards because of the CUDA lock-in;
(3) revising this episode's questions and objectives (at the top of the page) in a separate branch to be named E02-pedagogy.

ONE-OFF NOTE ON LINGUISTIC EDITING
As seen here, linguistic revision takes time and space.
In terms of the psychology of copy editing, it would be unsurprising if the authors of the original content felt uneasy about changes at first.
I hope nobody takes umbrage at that, although this is a ritual moment of copy editing.
Exposing the reasoning behind the changes is a handle to reframe the owner's view of a text and evaluate the new state of play with a dispassionate eye.
The risk of cutting corners is, naturally, to lose edge.
---
 episodes/cupy.Rmd | 116 ++++++++++++++++++++++++----------------------
 1 file changed, 60 insertions(+), 56 deletions(-)

diff --git a/episodes/cupy.Rmd b/episodes/cupy.Rmd
index af65a81..a6b1e2b 100644
--- a/episodes/cupy.Rmd
+++ b/episodes/cupy.Rmd
@@ -19,16 +19,17 @@ exercises: 60
 # Introduction to CuPy

 [CuPy](https://cupy.dev) is a GPU array library that implements a subset of the NumPy and SciPy interfaces.
-This makes it a very convenient tool to use the compute power of GPUs for people that have some experience with NumPy, without the need to write code in a GPU programming language such as CUDA, OpenCL, or HIP.
+Thanks to CuPy, people conversant with NumPy can very conveniently harness the compute power of GPUs without writing code in GPU programming languages such as CUDA, OpenCL, and HIP.

-# Convolution in Python
+From now on we can also use the word *host* to refer to the CPU of the laptop, desktop, or cluster node you are currently using, and *device* to refer to the GPU.

-We start by generating an artificial "image" on the host using Python and NumPy; the host is the CPU on the laptop, desktop, or cluster node you are using right now, and from now on we may use *host* to refer to the CPU and *device* to refer to the GPU.
-The image will be all zeros, except for isolated pixels with value one, on a regular grid.
-The plan is to convolve it with a Gaussian and inspect the result.
-We will also record the time it takes to execute this convolution on the host.
+# Convolutions in Python

-We can interactively write and executed the code in an iPython shell or a Jupyter notebook.
+We start by generating an image using Python and NumPy code.
+We want to compute a convolution on this input image, once on the host and once on the device, and then compare both the execution times and the results.
+
+In an iPython shell or a Jupyter notebook, we can write and execute the following code on the host.
+The pixel values will be zero everywhere except for a regular grid of single pixels having value one, named deltas.

 ~~~python
 import numpy as np
@@ -38,57 +39,59 @@ deltas = np.zeros((2048, 2048))
 deltas[8::16,8::16] = 1
 ~~~

-To get a feeling of how the whole image looks like, we can display the top-left corner of it.
+We can display the top-left corner of the input image to get a feeling of what it looks like, as follows:

 ~~~python
 import pylab as pyl

-# Necessary command to render a matplotlib image in a Jupyter notebook.
+# Jupyter 'magic' command to render a Matplotlib image in the notebook
 %matplotlib inline

 # Display the image
-# You can zoom in using the menu in the window that will appear
+# You can zoom in/out using the menu in the window that will appear
 pyl.imshow(deltas[0:32, 0:32])
 pyl.show()
 ~~~

-After executing the code, you should see the following image.
+and you should obtain the following image:
+
+![Grid of delta functions.](./fig/deltas.png){alt='Grid of delta functions'}

-![Deltas array](./fig/deltas.png){alt='Deltas array'}
+## Gaussian convolutions

-### Background
-The computation we want to perform on this image is a convolution, once on the host and once on the device so we can compare the results and execution times.
-In computer vision applications, convolutions are often used to filter images and if you want to know more about them, we encourage you to check out [this github repository](https://github.com/vdumoulin/conv_arithmetic) by Vincent Dumoulin and Francesco Visin with some great animations.
-We have already seen that we can think of an image as a matrix of color values, when we convolve that image with a particular filter, we generate a new matrix with different color values.
-An example of convolution can be seen in the figure below (illustration by Michael Plotke, CC BY-SA 3.0, via Wikimedia Commons).
+The illustration below shows an example of convolution (by Michael Plotke, CC BY-SA 3.0, via Wikimedia Commons).
+To know more about convolutions, we encourage you to check out [this GitHub repository](https://github.com/vdumoulin/conv_arithmetic) by Vincent Dumoulin and Francesco Visin with some great animations.

 ![Example of animated convolution](./fig/2D_Convolution_Animation.gif){alt="Example of animated convolution"}

-In our example, we will convolve our image with a 2D Gaussian function shown below:
+In this course section, we will convolve our image with a 2D Gaussian function, having the general form:

 $$G(x,y) = \frac{1}{2\pi \sigma^2} \exp\left(-\frac{x^2 + y^2}{2 \sigma^2}\right),$$

-where $x$ and $y$ are the "coordinates in our matrix, i.e. our row and columns, and $$\sigma$$ controls the width of the Gaussian distribution.
-Convolving images with 2D Gaussian functions will change the value of each pixel to be a weighted average of the pixels around it, thereby "smoothing" the image.
-Convolving images with a Gaussian function reduces the noise in the image, which is often required in [edge-detection](https://en.wikipedia.org/wiki/Gaussian_blur#Edge_detection) since most algorithms to do this are sensitive to noise.
+where $x$ and $y$ are distances from the origin, and $$\sigma$$ controls the width of the Gaussian curve.
+Since we can think of an image as a matrix of color values, the convolution of that image with a filter generates a new matrix with different color values.
+In particular, convolving images with 2D Gaussian functions changes the value of each pixel into a weighted average of the neighboring pixels, thereby smoothing out the features in the input image.
+
+Convolutions are frequently used in computer vision applications to filter images.
+For example, Gaussian convolution can be required before applying algorithms for [edge detection](https://en.wikipedia.org/wiki/Gaussian_blur#Edge_detection), which are sensitive to the noise in the original image.

 ::: callout
-It is often useful to identify the dataflow inherent in a problem.
-Say, if we want to square a list of numbers, all the operations are independent.
+Identifying the dataflow inherent in an algorithm is often useful.
+Say, if we want to square the numbers in a list, the operations on each item of the list are independent of one another.
 The dataflow of a one-to-one operation is called a *map*.

-![Data flow of a map operation](./fig/mapping.svg){alt="Data flow of a map operation"}
+![Dataflow of a map operation.](./fig/mapping.svg){alt="Data flow of a map operation"}

-A convolution is slightly more complicated.
-Here we have a many-to-one data flow, which is also known as a stencil.
+A convolution is slightly more complex because of a many-to-one dataflow, also known as a **stencil**.

-![Data flow of a stencil operation](./fig/stencil.svg){alt="Data flow of a stencil operation"}
+![Dataflow of a stencil operation.](./fig/stencil.svg){alt="Data flow of a stencil operation"}

-GPUs are exceptionally well suited to compute algorithms that follow one of these patterns.
+GPUs are exceptionally well suited to compute algorithms that follow either dataflow.
 :::

-# Convolution on the CPU Using SciPy
+## Convolution on the CPU Using SciPy

-Let us first construct the Gaussian, and then display it.
-Remember that at this point we are still doing everything with standard Python, and not using the GPU yet.
+Let us first construct the Gaussian filter and then display it.
+Remember that we are still coding everything in standard Python, without using the GPU.

 ~~~python
 x, y = np.meshgrid(np.linspace(-2, 2, 15), np.linspace(-2, 2, 15))
 dst = np.sqrt(x*x + y*y)
 sigma = 1
 muu = 0.000
 gauss = np.exp(-((dst-muu)**2/(2.0 * sigma**2)))
 pyl.imshow(gauss)
 pyl.show()
 ~~~

-This should show you a symmetrical two-dimensional Gaussian.
+The code above produces this image that shows you the symmetrical two-dimensional Gaussian:

-![Two-dimensional Gaussian](./fig/gauss.png){alt="Two-dimensional Gaussian"}
+![Two-dimensional Gaussian.](./fig/gauss.png){alt="Two-dimensional Gaussian"}

-Now we are ready to do the convolution on the host.
-We do not have to write this convolution function ourselves, as it is very conveniently provided by SciPy.
-Let us also record the time it takes to perform this convolution and inspect the top left corner of the convolved image.
+Now we are ready to compute the convolution on the host.
+Very conveniently, SciPy provides a method for convolutions.
+Let us also record the time to perform this convolution and inspect the top-left corner of the convolved image, as follows:

 ~~~python
 from scipy.signal import convolve2d as convolve2d_cpu

 convolved_image_using_CPU = convolve2d_cpu(deltas, gauss)
 pyl.imshow(convolved_image_using_CPU[0:32, 0:32])
 pyl.show()
 %timeit -n 1 -r 1 convolve2d_cpu(deltas, gauss)
 ~~~

-Obviously, the time to perform this convolution will depend very much on the power of your CPU, but I expect you to find that it takes a couple of seconds.
+Obviously, the power of your CPU influences the actual execution time very much on, and we expect that to be in the region of a couple of seconds.

 ~~~output
 2.4 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
 ~~~

-When you display the corner of the image, you can see that the "ones" surrounded by zeros have actually been blurred by a Gaussian, so we end up with a regular grid of Gaussians.
+Displaying just a corner of the image shows that the Gaussian has blurred the original pattern of ones surrounded by zeros so much that we end up with a regular pattern of Gaussians.

-![Regular grid of Gaussians](./fig/convolved_image.png){alt="Regular grid of Gaussians"}
+![Grid of Gaussians in the convolved image.](./fig/convolved_image.png){alt="Grid of Gaussians in the convolved image"}

-# Convolution on the GPU Using CuPy
+## Convolution on the GPU Using CuPy

-This is part of a lesson on GPU programming, so let us use the GPU.
-Although there is a physical connection - i.e. a cable - between the CPU and the GPU, they do not share the same memory space.
-This image depicts the different components of CPU and GPU and how they are connected:
+This is a lesson on GPU programming, so let us use the GPU.
+In spite of being physically connected -- typically with special [interconnects](https://en.wikipedia.org/wiki/Interconnect_(integrated_circuits)) -- the CPU and the GPU do not share the same memory space.
+This picture depicts the different components of CPU and GPU and how they are connected:

-![CPU and GPU are two separate entities, each with its own memory](./fig/CPU_and_GPU_separated.png){alt="CPU and GPU are two separate entities, each with its own memory"}
+![CPU and GPU are separate entities, each with its own memory.](./fig/CPU_and_GPU_separated.png){alt="CPU and GPU are separate entities, each with its own memory"}

-This means that an array created from e.g. an iPython shell using NumPy is physically located into the main memory of the host, and therefore available for the CPU but not the GPU.
-It is not yet present in GPU memory, which means that we need to copy our data, the input image and the convolving function to the GPU, before we can execute any code on it.
-In practice, we have the arrays `deltas` and `gauss` in the host's RAM, and we need to copy them to GPU memory using CuPy.
+This means that the array created with NumPy is physically stored in the host's main memory and, therefore, is only available to the CPU.
+Since our input image and convolving function are not yet present in the GPU memory, we need to copy both to the GPU before executing any code on it.
+In practice, we use CuPy to copy the arrays `deltas` and `gauss` from the host's RAM to the GPU memory as follows:

 ~~~python
 import cupy as cp

 deltas_gpu = cp.asarray(deltas)
 gauss_gpu = cp.asarray(gauss)
 ~~~

-Now it is time to do the convolution on the GPU.
-SciPy does not offer functions that can use the GPU, so we need to import the convolution function from another library, called `cupyx`; `cupyx.scipy` contains a subset of all SciPy routines.
-You will see that the GPU convolution function from the `cupyx` library looks very much like the convolution function from SciPy we used previously.
-In general, NumPy and CuPy look very similar, as well as the SciPy and `cupyx` libraries, and this is on purpose to facilitate the use of the GPU by programmers that are already familiar with NumPy and SciPy.
-Let us again record the time to execute the convolution, so that we can compare it with the time it took on the host.
+Now it is time to compute the convolution on our GPU.
+Inconveniently, SciPy does not offer methods running on GPUs.
+Hence, we import the convolution function from `cupyx`, a package shipped with CuPy whose subpackage [`cupyx.scipy`](https://docs.cupy.dev/en/stable/reference/scipy.html) provides a subset of the SciPy routines.
We will soon verify that the GPU convolution function in `cupyx` works like the convolution function from SciPy called on the CPU.
+In general, CuPy proper and NumPy look very similar, as do the `cupyx` methods and SciPy; this is intended to invite programmers already familiar with NumPy and SciPy to use the GPU.
+For now, let us again record the time to execute on the device the same convolution as on the host, so that we can compare the respective performances.

 ~~~python
 from cupyx.scipy.signal import convolve2d as convolve2d_gpu

 convolved_image_using_GPU = convolve2d_gpu(deltas_gpu, gauss_gpu)
 %timeit -n 7 -r 1 convolved_image_using_GPU = convolve2d_gpu(deltas_gpu, gauss_gpu)
 ~~~

-Similar to what we had previously on the host, the execution time of the GPU convolution will depend very much on the GPU used.
-These are the results using a NVIDIA Tesla T4 on Google Colab.
+Also the execution time of the GPU convolution will depend very much on the hardware used, as seen for the host.
+The timing using an NVIDIA Tesla T4 on Google Colab was:

 ~~~output
 98.2 µs ± 0 ns per loop (mean ± std. dev. of 1 run, 7 loops each)
 ~~~

-This is a lot faster than on the host, a performance improvement, or speedup, of 24439 times.
-Impressive, but is it true?
+This is way faster than the host: more than a 24000-fold performance improvement, or speedup.
+Impressive, but is that true?

 # Measuring performance

From 9640c8cac391fefff030299009cd6107e196c9de Mon Sep 17 00:00:00 2001
From: "Giordano Lipari @laptop4"
Date: Tue, 7 Nov 2023 12:41:26 +0100
Subject: [PATCH 06/17] E02-language: revise up to 'Convolution on the GPU Using CuPy' (2)

This continues commit db44928 with the following changes:
(1) explain that the variable `deltas` refers to Dirac delta functions;
(2) add a warning about the multiple meanings of the word kernel, and use filters for convolution kernels;
(3) clean up glitches that turn up when rendering the Rmd (R Markdown) file in RStudio;
(4) polish expressions that could be more precise.
---
 episodes/cupy.Rmd | 26 ++++++++++++++------------
 1 file changed, 14 insertions(+), 12 deletions(-)

diff --git a/episodes/cupy.Rmd b/episodes/cupy.Rmd
index a6b1e2b..5f533ed 100644
--- a/episodes/cupy.Rmd
+++ b/episodes/cupy.Rmd
@@ -29,7 +29,7 @@ We start by generating an image using Python and NumPy code.
 We want to compute a convolution on this input image, once on the host and once on the device, and then compare both the execution times and the results.

 In an iPython shell or a Jupyter notebook, we can write and execute the following code on the host.
-The pixel values will be zero everywhere except for a regular grid of single pixels having value one, named deltas.
+The pixel values will be zero everywhere except for a regular grid of single pixels having value one, very much like Dirac delta functions, hence here named `deltas`.

 ~~~python
 import numpy as np
@@ -58,21 +58,23 @@
 ## Gaussian convolutions

-The illustration below shows an example of convolution (by Michael Plotke, CC BY-SA 3.0, via Wikimedia Commons).
+The illustration below shows an example of convolution (courtesy of Michael Plotke, CC BY-SA 3.0, via Wikimedia Commons).
+Looking at the terminology in the illustration, be forewarned that the word *kernel* happens to have different meanings that, inconveniently, apply to both mathematical convolution and coding on a GPU device.
To know more about convolutions, we encourage you to check out [this GitHub repository](https://github.com/vdumoulin/conv_arithmetic) by Vincent Dumoulin and Francesco Visin with some great animations.

-![Example of animated convolution](./fig/2D_Convolution_Animation.gif){alt="Example of animated convolution"}
+![Example of animated convolution.](./fig/2D_Convolution_Animation.gif){alt="Example of animated convolution"}

 In this course section, we will convolve our image with a 2D Gaussian function, having the general form:

 $$G(x,y) = \frac{1}{2\pi \sigma^2} \exp\left(-\frac{x^2 + y^2}{2 \sigma^2}\right),$$

-where $x$ and $y$ are distances from the origin, and $$\sigma$$ controls the width of the Gaussian curve.
-Since we can think of an image as a matrix of color values, the convolution of that image with a filter generates a new matrix with different color values.
-In particular, convolving images with 2D Gaussian functions changes the value of each pixel into a weighted average of the neighboring pixels, thereby smoothing out the features in the input image.
+where $x$ and $y$ are distances from the origin, and $\sigma$ controls the width of the Gaussian curve.
+Since we can think of an image as a matrix of color values, the convolution of that image with a kernel generates a new matrix with different color values.
+In particular, convolving images with a 2D Gaussian kernel changes the value of each pixel into a weighted average of the neighboring pixels, thereby smoothing out the features in the input image.

-Convolutions are frequently used in computer vision applications to filter images.
+Convolutions are frequently used in computer vision to filter images.
 For example, Gaussian convolution can be required before applying algorithms for [edge detection](https://en.wikipedia.org/wiki/Gaussian_blur#Edge_detection), which are sensitive to the noise in the original image.
+To avoid conflicting vocabularies, in the remainder we refer to *convolution kernels* as *filters*.

 ::: callout
 Identifying the dataflow inherent in an algorithm is often useful.
@@ -81,7 +83,7 @@ The dataflow of a one-to-one operation is called a *map*.

 ![Dataflow of a map operation.](./fig/mapping.svg){alt="Data flow of a map operation"}

-A convolution is slightly more complex because of a many-to-one dataflow, also known as a **stencil**.
+A convolution is slightly more complex because of a many-to-one dataflow, also known as a *stencil*.

 ![Dataflow of a stencil operation.](./fig/stencil.svg){alt="Data flow of a stencil operation"}

@@ -90,7 +92,7 @@ GPUs are exceptionally well suited to compute algorithms that follow either data

 ## Convolution on the CPU Using SciPy

-Let us first construct the Gaussian filter and then display it.
+Let us first construct and then display the Gaussian filter.
 Remember that we are still coding everything in standard Python, without using the GPU.

 ~~~python
 x, y = np.meshgrid(np.linspace(-2, 2, 15), np.linspace(-2, 2, 15))
@@ -103,7 +105,7 @@ pyl.imshow(gauss)
 pyl.show()
 ~~~

-The code above produces this image that shows you the symmetrical two-dimensional Gaussian:
+The code above produces this image of a symmetrical two-dimensional Gaussian:

 ![Two-dimensional Gaussian.](./fig/gauss.png){alt="Two-dimensional Gaussian"}

 Now we are ready to compute the convolution on the host.
 Very conveniently, SciPy provides a method for convolutions.
 Let us also record the time to perform this convolution and inspect the top-left corner of the convolved image, as follows:

 ~~~python
 from scipy.signal import convolve2d as convolve2d_cpu
@@ -120,7 +122,7 @@ pyl.show()
 %timeit -n 1 -r 1 convolve2d_cpu(deltas, gauss)
 ~~~

-Obviously, the power of your CPU influences the actual execution time very much on, and we expect that to be in the region of a couple of seconds.
Obviously, the compute power of your CPU influences the actual execution time very much, and we expect that to be in the region of a couple of seconds.

 ~~~output
 2.4 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
 ~~~

@@ -139,7 +141,7 @@ This picture depicts the different components of CPU and GPU and how they are co

 ![CPU and GPU are separate entities, each with its own memory.](./fig/CPU_and_GPU_separated.png){alt="CPU and GPU are separate entities, each with its own memory"}

 This means that the array created with NumPy is physically stored in the host's main memory and, therefore, is only available to the CPU.
-Since our input image and convolving function are not yet present in the GPU memory, we need to copy both to the GPU before executing any code on it.
+Since our input image and convolution filter are not yet present in the device memory, we need to copy both arrays to the GPU before executing any code on it.
 In practice, we use CuPy to copy the arrays `deltas` and `gauss` from the host's RAM to the GPU memory as follows:

From 5760da568989b1a07270085bcd8379260fe1e05f Mon Sep 17 00:00:00 2001
From: "Giordano Lipari @laptop4"
Date: Tue, 7 Nov 2023 16:09:08 +0100
Subject: [PATCH 07/17] E02-code: first raw pass with PEP8 and other preparations

---
 episodes/cupy.Rmd | 105 ++++++++++++++++++++--------------------------
 1 file changed, 46 insertions(+), 59 deletions(-)

diff --git a/episodes/cupy.Rmd b/episodes/cupy.Rmd
index af65a81..8cdb76a 100644
--- a/episodes/cupy.Rmd
+++ b/episodes/cupy.Rmd
@@ -95,7 +95,7 @@ x, y = np.meshgrid(np.linspace(-2, 2, 15), np.linspace(-2, 2, 15))
 dst = np.sqrt(x*x + y*y)
 sigma = 1
 muu = 0.000
-gauss = np.exp(-((dst-muu)**2/(2.0 * sigma**2)))
+gauss = np.exp(-(dst - muu)**2 / (2.0 * sigma**2))
 pyl.imshow(gauss)
 pyl.show()
 ~~~
@@ -406,8 +406,7 @@ mean_ = data.mean()
 median_ = np.median(data)
 stddev_ = np.std(data)
 max_ = np.amax(data)
-print(f"mean = {mean_:.3e}, median = {median_:.3e}, sttdev = {stddev_:.3e}, \
-maximum = {max_:.3e}")
+print(f"mean = {mean_:.3e}, median = {median_:.3e}, stddev = {stddev_:.3e}, maximum = {max_:.3e}")
 ~~~

 The maximum flux density is 2506 mJy/beam, coming from the Galactic Center itself, so from the center of the image, while the overall standard deviation is 19.9 mJy/beam:

@@ -439,8 +438,7 @@ def kappa_sigma_clipper(data_flat):
 data_clipped = kappa_sigma_clipper(data_flat)
 timing_ks_clipping_cpu = %timeit -o kappa_sigma_clipper(data_flat)
 fastest_ks_clipping_cpu = timing_ks_clipping_cpu.best
-print(f"Fastest CPU ks clipping time = \
-    {1000 * fastest_ks_clipping_cpu:.3e} ms.")
+print(f"Fastest CPU ks clipping time = {1000 * fastest_ks_clipping_cpu:.3e} ms.")
 ~~~

 ~~~output
@@ -457,16 +455,17 @@ clipped_mean_ = data_clipped.mean()
 clipped_median_ = np.median(data_clipped)
 clipped_stddev_ = np.std(data_clipped)
 clipped_max_ = np.amax(data_clipped)
-print(f"mean of clipped = {clipped_mean_:.3e}, median of clipped = \
-{clipped_median_:.3e} \n standard deviation of clipped = \
-{clipped_stddev_:.3e}, maximum of clipped = {clipped_max_:.3e}")
+print(f"mean of clipped = {clipped_mean_:.3e}, "
+      f"median of clipped = {clipped_median_:.3e}\n"
+      f"standard deviation of clipped = {clipped_stddev_:.3e}, "
+      f"maximum of clipped = {clipped_max_:.3e}")
 ~~~

 All output statistics have become smaller which is reassuring; it seems data_clipped contains mostly background pixels:

 ~~~output
 mean of clipped = -1.945e-06, median of clipped = -9.796e-06
 standard deviation of
clipped = 1.334e-02, maximum of clipped = 4.000e-02 ~~~ :::::::::::::::::::::::::::::::::::::: challenge @@ -484,8 +483,9 @@ def ks_clipper_gpu(data_flat): return cp.asnumpy(data_gpu_clipped) data_clipped_on_GPU = ks_clipper_gpu(data_flat) -timing_ks_clipping_gpu = benchmark(ks_clipper_gpu, \ - (data_flat, ), n_repeat=10) +timing_ks_clipping_gpu = benchmark(ks_clipper_gpu, + (data_flat, ), + n_repeat=10) fastest_ks_clipping_gpu = np.amin(timing_ks_clipping_gpu.gpu_times) print(f"{1000 * fastest_ks_clipping_gpu:.3e} ms") ~~~ @@ -528,8 +528,7 @@ threshold = 5 * stddev_gpu_ segmented_image = np.where(data > threshold, 1, 0) timing_segmentation_CPU = %timeit -o np.where(data > threshold, 1, 0) fastest_segmentation_CPU = timing_segmentation_CPU.best -print(f"Fastest CPU segmentation time = {1000 * fastest_segmentation_CPU:.3e} \ -ms.") +print(f"Fastest CPU segmentation time = {1000 * fastest_segmentation_CPU:.3e} ms.") ~~~ ~~~output @@ -549,8 +548,7 @@ from scipy.ndimage import label as label_cpu labeled_image = np.empty(data.shape) number_of_sources_in_image = label_cpu(segmented_image, output = labeled_image) sigma_unicode = "\u03C3" -print(f"The number of sources in the image at the 5{sigma_unicode} level is \ -{number_of_sources_in_image}.") +print(f"The number of sources in the image at the 5{sigma_unicode} level is {number_of_sources_in_image}.") timing_CCL_CPU = %timeit -o label_cpu(segmented_image, output = labeled_image) fastest_CCL_CPU = timing_CCL_CPU.best @@ -569,14 +567,14 @@ Let us not just accept the answer, but also do a sanity check. What are the values in the labeled image? ~~~python -print(f"These are all the pixel values we can find in the labeled image: \ -{np.unique(labeled_image)}") +print(f"These are all the pixel values we can find in the labeled image: {np.unique(labeled_image)}") ~~~ This should show the following output: ~~~output -These are all the pixel values we can find in the labeled image: [ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. +These are all the pixel values we can find in the labeled image: +[ 0. 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14. 15. 16. 17. 18. 19. 20. 21. 22. 23. 24. 25. 26. 27. 28. 29. 30. 31. 32. 33. 34. 35. 36. 37. 38. 39. 40. 41. 42. 43. 44. 45. 46. 47. 48. 49. 50. 51. 52. 53. 54. 55. @@ -604,19 +602,19 @@ This is the CPU code for measuring our sources. ~~~python from scipy.ndimage import center_of_mass as com_cpu from scipy.ndimage import sum_labels as sl_cpu -all_positions = com_cpu(data, labeled_image, \ - range(1, number_of_sources_in_image+1)) -all_integrated_fluxes = sl_cpu(data, labeled_image, \ +all_positions = com_cpu(data, labeled_image, + range(1, number_of_sources_in_image+1)) +all_integrated_fluxes = sl_cpu(data, labeled_image, range(1, number_of_sources_in_image+1)) -print (f'These are the ten highest integrated fluxes of the sources in my \n\ -image: {np.sort(all_integrated_fluxes)[-10:]}') +print (f'These are the ten highest integrated fluxes of the sources in my \n image: {np.sort(all_integrated_fluxes)[-10:]}') ~~~ which gives the Galactic Center as the most luminous source, which makes sense when we look at our image. 
~~~output -These are the ten highest integrated fluxes of the sources in my image: [ 38.90615184 41.91485894 43.02203498 47.30590784 51.23707351 +These are the ten highest integrated fluxes of the sources in my image: +[ 38.90615184 41.91485894 43.02203498 47.30590784 51.23707351 58.07289425 68.85673917 70.31223921 95.16443585 363.58937774] ~~~ @@ -624,9 +622,9 @@ Now we can try to measure the execution times for both algorithms, like this: ~~~python %%timeit -o -all_positions = com_cpu(data, labeled_image, \ - range(1, number_of_sources_in_image+1)) -all_integrated_fluxes = sl_cpu(data, labeled_image, \ +all_positions = com_cpu(data, labeled_image, + range(1, number_of_sources_in_image+1)) +all_integrated_fluxes = sl_cpu(data, labeled_image, range(1, number_of_sources_in_image+1)) ~~~ @@ -643,8 +641,7 @@ To collect the result from that timing in our next cell block, we need a trick t ~~~python timing_source_measurements_CPU = _ fastest_source_measurements_CPU = timing_source_measurements_CPU.best -print(f"Fastest CPU set of source measurements = \ -{1000 * fastest_source_measurements_CPU:.3e} ms.") +print(f"Fastest CPU set of source measurements = {1000 * fastest_source_measurements_CPU:.3e} ms.") ~~~ Which yields @@ -672,26 +669,24 @@ def first_two_steps_for_both_CPU_and_GPU(data): def ccl_and_source_measurements_on_CPU(data_CPU, segmented_image_CPU): labeled_image_CPU = np.empty(data_CPU.shape) number_of_sources_in_image = label_cpu(segmented_image_CPU, - output= labeled_image_CPU) + output= labeled_image_CPU) all_positions = com_cpu(data_CPU, labeled_image_CPU, np.arange(1, number_of_sources_in_image+1)) all_fluxes = sl_cpu(data_CPU, labeled_image_CPU, - np.arange(1, number_of_sources_in_image+1)) + np.arange(1, number_of_sources_in_image+1)) return np.array(all_positions), np.array(all_fluxes) -CPU_output = ccl_and_source_measurements_on_CPU(data, \ - first_two_steps_for_both_CPU_and_GPU(data)) +CPU_output = ccl_and_source_measurements_on_CPU(data, + first_two_steps_for_both_CPU_and_GPU(data)) -timing_complete_processing_CPU = \ - benchmark(ccl_and_source_measurements_on_CPU, (data, \ - first_two_steps_for_both_CPU_and_GPU(data)), \ - n_repeat=10) +timing_complete_processing_CPU = benchmark( + ccl_and_source_measurements_on_CPU, + (data, + first_two_steps_for_both_CPU_and_GPU(data)), n_repeat=10) -fastest_complete_processing_CPU = \ - np.amin(timing_complete_processing_CPU.cpu_times) +fastest_complete_processing_CPU = np.amin(timing_complete_processing_CPU.cpu_times) -print(f"The four steps of image processing for astronomy take \ -{1000 * fastest_complete_processing_CPU:.3e} ms\n on our CPU.") +print(f"The four steps of image processing for astronomy take {1000 * fastest_complete_processing_CPU:.3e} ms\n on our CPU.") from cupyx.scipy.ndimage import label as label_gpu from cupyx.scipy.ndimage import center_of_mass as com_gpu @@ -704,32 +699,24 @@ def ccl_and_source_measurements_on_GPU(data_GPU, segmented_image_GPU): all_positions = com_gpu(data_GPU, labeled_image_GPU, cp.arange(1, number_of_sources_in_image+1)) all_fluxes = sl_gpu(data_GPU, labeled_image_GPU, - cp.arange(1, number_of_sources_in_image+1)) + cp.arange(1, number_of_sources_in_image+1)) # This seems redundant, but we want to return ndarrays (Numpy) # and what we have are lists. # These first have to be converted to # Cupy arrays before they can be converted to Numpy arrays. 
-    return cp.asnumpy(cp.asarray(all_positions)), \
-           cp.asnumpy(cp.asarray(all_fluxes))
+    return (cp.asnumpy(cp.asarray(all_positions)),
+            cp.asnumpy(cp.asarray(all_fluxes)))

-GPU_output = ccl_and_source_measurements_on_GPU(cp.asarray(data), \
-    first_two_steps_for_both_CPU_and_GPU(cp.asarray(data)))
+GPU_output = ccl_and_source_measurements_on_GPU(cp.asarray(data), first_two_steps_for_both_CPU_and_GPU(cp.asarray(data)))

-timing_complete_processing_GPU = \
-    benchmark(ccl_and_source_measurements_on_GPU, (cp.asarray(data), \
-    first_two_steps_for_both_CPU_and_GPU(cp.asarray(data))), \
-    n_repeat=10)
+timing_complete_processing_GPU = benchmark(ccl_and_source_measurements_on_GPU, (cp.asarray(data), first_two_steps_for_both_CPU_and_GPU(cp.asarray(data))), n_repeat=10)

-fastest_complete_processing_GPU = \
-    np.amin(timing_complete_processing_GPU.gpu_times)
+fastest_complete_processing_GPU = np.amin(timing_complete_processing_GPU.gpu_times)

-print(f"The four steps of image processing for astronomy take \
-{1000 * fastest_complete_processing_GPU:.3e} ms\n on our GPU.")
+print(f"The four steps of image processing for astronomy take {1000 * fastest_complete_processing_GPU:.3e} ms\n on our GPU.")

-overall_speedup_factor = fastest_complete_processing_CPU/ \
-    fastest_complete_processing_GPU
-print(f"This means that the overall speedup factor GPU vs CPU equals: \
-{overall_speedup_factor:.3e}\n")
+overall_speedup_factor = fastest_complete_processing_CPU / fastest_complete_processing_GPU
+print(f"This means that the overall speedup factor GPU vs CPU equals: {overall_speedup_factor:.3e}\n")

all_positions_agree = np.allclose(CPU_output[0], GPU_output[0])
print(f"The CPU and GPU positions agree: {all_positions_agree}\n")

@@ -741,7 +728,7 @@ print(f"The CPU and GPU fluxes agree: {all_positions_agree}\n")
~~~output
The four steps of image processing for astronomy take 1.060e+03 ms on our CPU.
The four steps of image processing for astronomy take 5.770e+01 ms on our GPU.
This means that the overall speedup factor GPU vs CPU equals: 1.838e+01

The CPU and GPU positions agree: True

From 591d2897e88818e6fc0bd15a51ecd222eb6a1381 Mon Sep 17 00:00:00 2001
From: "Giordano Lipari @laptop4" 
Date: Tue, 7 Nov 2023 21:52:09 +0100
Subject: [PATCH 08/17] E02-language: revise the first episode half
 'Convolution in Python'

The revision work of commit 9640c8c is extended to the section
'NumPy routines on the GPU' included.
The spirit is the same as in commit 9640c8c.
Some edits leaked into the rest of the document, either by programmatic
substitution or because of establishing an order of appearance (abbreviations).
Left to do is the second half of the episode with the astronomy example.
---
 episodes/cupy.Rmd | 108 ++++++++++++++++++++------------------------
 1 file changed, 51 insertions(+), 57 deletions(-)

diff --git a/episodes/cupy.Rmd b/episodes/cupy.Rmd
index 5f533ed..2ddf711 100644
--- a/episodes/cupy.Rmd
+++ b/episodes/cupy.Rmd
@@ -92,7 +92,7 @@

 ## Convolution on the CPU Using SciPy

-Let us first construct and then display the Gaussian filter.
+Let's first construct and then display the Gaussian filter.
 Remember that we are still coding everything in standard Python, without using the GPU.

 ~~~python
@@ -111,7 +111,7 @@

 The code above produces this image of a symmetrical two-dimensional Gaussian:

 ![Two-dimensional Gaussian.](./fig/gauss.png){alt="Two-dimensional Gaussian"}

 Now we are ready to compute the convolution on the host. 
Very conveniently, SciPy provides a method for convolutions. -Let us also record the time to perform this convolution and inspect the top-left corner of the convolved image, as follows: +Let's also record the time to perform this convolution and inspect the top-left corner of the convolved image, as follows: ~~~python from scipy.signal import convolve2d as convolve2d_cpu @@ -134,7 +134,7 @@ Displaying just a corner of the image shows that the Gaussian has so much blurre ## Convolution on the GPU Using CuPy -This is a lesson on GPU programming, so let us use the GPU. +This is a lesson on GPU programming, so let's use the GPU. In spite of being physically connected -- typically with special [interconnects](https://en.wikipedia.org/wiki/Interconnect_(integrated_circuits)) -- the CPU and the GPU do not share the same memory space. This picture depicts the different components of CPU and GPU and how they are connected: @@ -142,7 +142,7 @@ This picture depicts the different components of CPU and GPU and how they are co This means that the array created with NumPy is physically stored in the host's main memory and, therefore, is only available to the CPU. Since our input image and convolution filter are not yet present in the device memory, we need to copy both data to the GPU before executing any code on it. -In practice, we use CuPy to copy the arrays `deltas` and `gauss` from the host's RAM to the GPU memory as follows: +In practice, we use CuPy to copy the arrays `deltas` and `gauss` from the host's Random Access Memory (RAM) to the GPU memory as follows: ~~~python import cupy as cp @@ -156,7 +156,7 @@ Inconveniently, SciPy does not offer methods running on GPUs. Hence, we import the convolution function from a CuPy package aliased as `cupyx`, whose package [`cupyx.scipy`](https://docs.cupy.dev/en/stable/reference/scipy.html), in turn, performs a subset of the SciPy operations. We will soon verify that the GPU convolution function in `cupyx` works like the convolution function from SciPy called on the CPU. In general, CuPy proper and NumPy look so much similar as do the `cupyx` methods and SciPy; this is intended to invite programmers already familiar with NumPy and SciPy to use the GPU. -For now, let us again record the time to execute on the device the same convolution as the host, so that we can compare the respective performances. +For now, let's again record the execution time on the device the same convolution as the host, so that we can compare the respective performances. ~~~python from cupyx.scipy.signal import convolve2d as convolve2d_gpu @@ -175,13 +175,13 @@ The timing using a NVIDIA Tesla T4 on Google Colab was: This is way faster than the host: more than a 24000-fold performance improvement, or speedup. Impressive, but is that true? -# Measuring performance +## Measuring performance -So far we used `timeit` to measure the performance of our Python code, no matter if it was running on the CPU or was GPU accelerated. -However, execution on the GPU is asynchronous: the control is given back to the Python interpreter immediately, while the GPU is still executing the task. -Because of this, we cannot use `timeit` anymore: the timing would not be correct. +So far we used `timeit` to measure the performance of our Python code, no matter whether it was running on the CPU or was GPU-accelerated. +However, the execution on the GPU is *asynchronous*: the Python interpreter takes control of the program execution immediately, while the GPU is still executing the task. 
+Therefore, the timing of `timeit` is not reliable. -CuPy provides a function, `benchmark` that we can use to measure the time it takes the GPU to execute our kernels. +Conveniently, `cupyx` provides the function `benchmark` that measures the actual execution time in the GPU. ~~~python from cupyx.profiler import benchmark @@ -189,57 +189,56 @@ from cupyx.profiler import benchmark execution_gpu = benchmark(convolve2d_gpu, (deltas_gpu, gauss_gpu), n_repeat=10) ~~~ -The previous code executes `convolve2d_gpu` ten times, and stores the execution time of each run, in seconds, inside the `gpu_times` attribute of the variable `execution_gpu`. -We can then compute the average execution time and print it, as shown. +The previous code executes `convolve2d_gpu` with the appropriate arguments ten times, and stores inside the `gpu_times` attribute of the variable `execution_gpu` the execution time of each run in seconds. +These measurements are also more stable and representative because `benchmarks` excludes the compile time and the repetitions warm up the GPU. +We can then average the execution times, as follows: ~~~python gpu_avg_time = np.average(execution_gpu.gpu_times) print(f"{gpu_avg_time:.6f} s") ~~~ -Another advantage of the `benchmark` method is that it excludes the compile time, and warms up the GPU, so that measurements are more stable and representative. - -With the new measuring code in place, we can measure performance again. +whereby the performance revisited is: ~~~output 0.020642 s ~~~ -We now have a more reasonable, but still impressive, speedup of 116 times over the host code. +We now have a more reasonable, but still impressive, 116-fold speedup with respect to the host code. :::::::::::::::::::::::::::::::::::::: challenge ## Challenge: convolution on the GPU without CuPy Try to convolve the NumPy array `deltas` with the NumPy array `gauss` directly on the GPU, without using CuPy arrays. -If this works, it should save us the time and effort of transferring `deltas` and `gauss` to the GPU. +If this works, it will save us the time and effort of transferring the arrays `deltas` and `gauss` to the GPU. ::::::::::::::::::::::::::::::::::::: solution -We can directly try to use the GPU convolution function `convolve2d_gpu` with `deltas` and `gauss` as inputs. +We can call the GPU convolution function `convolve2d_gpu` directly with `deltas` and `gauss` as argument: ~~~python convolve2d_gpu(deltas, gauss) ~~~ -However, this gives a long error message ending with: +However, this throws a long error message ending with: ~~~output TypeError: Unsupported type ~~~ -It is unfortunately not possible to access NumPy arrays directly from the GPU because they exist in the Random Access Memory (RAM) of the host and not in GPU memory. +Unfortunately, it is impossible to directly access from the GPU the NumPy arrays that live in the host RAM. ::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::: -# Validation +## Validation -To check that we actually computed the same output on the host and the device we can compare the two output arrays `convolved_image_using_GPU` and `convolved_image_using_CPU`. 
+To check that we actually computed the same output on the host and on the device, we compare the two output arrays `convolved_image_using_GPU` and `convolved_image_using_CPU`, as follows: ~~~python np.allclose(convolved_image_using_GPU, convolved_image_using_CPU) ~~~ -As you may expect, the result of the comparison is positive, and in fact we computed the same results on the host and the device. +As you may have expected, the positive result of the comparison confirms that we computed the same results on the host and on the device: ~~~output array(True) @@ -248,13 +247,13 @@ array(True) :::::::::::::::::::::::::::::::::::::: challenge ## Challenge: fairer comparison of CPU vs. GPU -Compute again the speedup achieved using the GPU, but try to take also into account the time spent transferring the data to the GPU and back. +Compute again the speedup achieved using the GPU, taking into account also the time spent transferring the data from the CPU to the GPU and back. -Hint: to copy a CuPy array back to the host (CPU), use the `cp.asnumpy()` function. +Hint: use the `cp.asnumpy()` method to copy a CuPy array back to the host. :::::::::::::::::::::::::::::::::::::: solution -A convenient solution is to group both the transfers, to and from the GPU, and the convolution into a single Python function, and then time its execution, like in the following example. +A convenient strategy is to time the execution of a single Python function that groups the transfers to and from the GPU and the convolution, as follows: ~~~python def transfer_compute_transferback(): @@ -272,24 +271,24 @@ print(f"{gpu_avg_time:.6f} s") 0.035400 s ~~~ -The speedup taking into account the data transfers decreased from 116 to 67. -Taking into account the necessary data transfers when computing the speedup is a better, and more fair, way to compare performance. -As a note, because data transfers force the GPU to sync with the host, this could also be measured with `timeit` and still provide correct measurements. +The speedup taking into account the data transfer decreased from 116 to 67. +Accounting for the necessary data transfers is a better and fairer way to compute performance speedups. +As an side note, here `timeit` would still provide correct measurements, because the data transfers force the device to *sync* with the host. ::::::::::::::::::::::::::::::::::::: :::::::::::::::::::::::::::::::::::::: -# A shortcut: performing NumPy routines on the GPU +## A shortcut: performing NumPy routines on the GPU -We saw earlier that we cannot execute routines from the `cupyx` library directly on NumPy arrays. -In fact we need to first transfer the data from host to device memory. -Vice versa, if we try to execute a regular SciPy routine (i.e. designed to run the CPU) on a CuPy array, we will also encounter an error. -Try the following: +We saw in a previous challenge that we cannot launch the routines of the `cupyx` library directly on NumPy arrays. +In fact, we first need to transfer the data from the host to the device memory. +Vice versa, we will encounter an error also if we launch on a CuPy array a regular SciPy routine designed to run on CPUs. +Try out the following: ~~~python convolve2d_cpu(deltas_gpu, gauss_gpu) ~~~ -This results in +which results in ~~~output ...... @@ -299,13 +298,10 @@ TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly. ~~~ -So SciPy routines cannot have CuPy arrays as input. 
-We can, however, execute a simpler command that does not require SciPy. -Instead of 2D convolution, we can do 1D convolution. -For that we can use a NumPy routine instead of a SciPy routine. -The `convolve` routine from NumPy performs linear (1D) convolution. -To generate some input for a linear convolution, we can flatten our image from 2D to 1D (using `ravel()`), but we also need a 1D kernel. -For the latter we will take the diagonal elements of our 2D Gaussian kernel. +So, SciPy routines do not accept CuPy arrays as input. +However, instead of performing a 2D convolution, we can execute a simpler 1D (linear) convolution that uses the NumPy routine `convolve` instead of a SciPy routine. +To generate the appropriate input for a linear convolution, we flatten our input image from 2D to 1D using the method `ravel`. +To generate a 1D filter, we take the diagonal elements of our 2D Gaussian filter. Try the following three instructions for linear convolution on the CPU: ~~~python @@ -314,15 +310,14 @@ gauss_1d = gauss.diagonal() %timeit -n 1 -r 1 np.convolve(deltas_1d, gauss_1d) ~~~ -You could arrive at something similar to this timing result: +A realistic execution time is: ~~~output 270 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each) ~~~ -We have performed a regular linear convolution using our CPU. -Now let us try something bold. -We will transfer the 1D arrays to the GPU and use the NumPy routine to do the convolution. +After having performed a regular linear convolution on the host using NumPy, let's try something bold. +We transfer the 1D arrays to the device and do the convolution with a NumPy routine. ~~~python deltas_1d_gpu = cp.asarray(deltas_1d) @@ -333,20 +328,19 @@ gpu_avg_time = np.average(execution_gpu.gpu_times) print(f"{gpu_avg_time:.6f} s") ~~~ -You may be surprised that we can issue these commands without error. -Contrary to SciPy routines, NumPy accepts CuPy arrays, i.e. arrays that exist in GPU memory, as input. -[In the CuPy documentation](https://docs.cupy.dev/en/stable/user_guide/interoperability.html#numpy) you can find some background on why NumPy routines can handle CuPy arrays. - -Also, remember when we used `np.allclose` with a NumPy and a CuPy array as input? +You may be surprised that these commands do not throw any error. +Contrary to SciPy, NumPy routines accept CuPy arrays as input, even though the latter exist only in GPU memory. +Indeed, can you recall when we validated our codes using a NumPy and a CuPy array as input of `np.allclose`? That worked for the same reason. +[The CuPy documentation](https://docs.cupy.dev/en/stable/user_guide/interoperability.html#numpy) explains why NumPy routines can handle CuPy arrays. -The linear convolution is actually performed on the GPU, which also results in a lower execution time. +The linear convolution is actually performed on the GPU and results in a faster execution time: ~~~output 0.014529 s ~~~ -Without much effort, we obtained a 18 times speedup. +Without much coding effort, we obtained a 18-fold speedup. # A real world example: image processing for radio astronomy @@ -368,7 +362,7 @@ The latter two methods are needed to convert byte ordering from big endian to li ## Inspect the image -Let us have a look at part of this image. +Let's have a look at part of this image. ~~~python from matplotlib.colors import LogNorm @@ -405,7 +399,7 @@ From the clipped set, we compute the median and standard deviation again and cli Continue until no more pixels are clipped. 
The standard deviation from this final set of pixel values is the basis for the next step. -Before clipping, let us investigate some properties of our unclipped data. +Before clipping, let's investigate some properties of our unclipped data. ~~~python mean_ = data.mean() @@ -515,7 +509,7 @@ The speedup factor for ks clipping is: 1.232e+01 ## Segment the image We have seen that clipping at the $5 \sigma$ level of an image this size (2048² pixels) will yield a chance of less than 50% that from all the sources we detect at least one will be a noise peak. -So let us set the threshold at $5 \sigma$ and segment it. +So let's set the threshold at $5 \sigma$ and segment it. First check that we find the same standard deviation from our clipper on the GPU: ~~~python @@ -571,7 +565,7 @@ The number of sources in the image at the 5σ level is 185. Fastest CPU CCL time = 2.546e+01 ms. ~~~ -Let us not just accept the answer, but also do a sanity check. +Let's not just accept the answer, but also do a sanity check. What are the values in the labeled image? ~~~python From 2755175e6c0cb9e18abf431c3207b1033e5ed10b Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @laptop4" Date: Tue, 7 Nov 2023 22:02:07 +0100 Subject: [PATCH 09/17] E02-code: rename variable 'deltas' as 'diracs' It seems clear that the previous variable name 'deltas' did not express generic finite differences but an application of Dirac delta function to the image pixels. 'diracs' is therefore more precise insofar as it rules out any ambiguity. Since the convolution filter is called gauss, the pair diracs/gauss sounds also well balanced. Any other choice is of course possible, but without the drawbacks of the delta terminology --- episodes/cupy.Rmd | 44 ++++++++++++++++++++++---------------------- 1 file changed, 22 insertions(+), 22 deletions(-) diff --git a/episodes/cupy.Rmd b/episodes/cupy.Rmd index 8cdb76a..2818e5e 100644 --- a/episodes/cupy.Rmd +++ b/episodes/cupy.Rmd @@ -34,8 +34,8 @@ We can interactively write and executed the code in an iPython shell or a Jupyte import numpy as np # Construct an image with repeated delta functions -deltas = np.zeros((2048, 2048)) -deltas[8::16,8::16] = 1 +diracs = np.zeros((2048, 2048)) +diracs[8::16,8::16] = 1 ~~~ To get a feeling of how the whole image looks like, we can display the top-left corner of it. @@ -47,13 +47,13 @@ import pylab as pyl # Display the image # You can zoom in using the menu in the window that will appear -pyl.imshow(deltas[0:32, 0:32]) +pyl.imshow(diracs[0:32, 0:32]) pyl.show() ~~~ After executing the code, you should see the following image. -![Deltas array](./fig/deltas.png){alt='Deltas array'} +![Deltas array](./fig/diracs.png){alt='Deltas array'} ### Background The computation we want to perform on this image is a convolution, once on the host and once on the device so we can compare the results and execution times. @@ -111,10 +111,10 @@ Let us also record the time it takes to perform this convolution and inspect the ~~~python from scipy.signal import convolve2d as convolve2d_cpu -convolved_image_using_CPU = convolve2d_cpu(deltas, gauss) +convolved_image_using_CPU = convolve2d_cpu(diracs, gauss) pyl.imshow(convolved_image_using_CPU[0:32, 0:32]) pyl.show() -%timeit -n 1 -r 1 convolve2d_cpu(deltas, gauss) +%timeit -n 1 -r 1 convolve2d_cpu(diracs, gauss) ~~~ Obviously, the time to perform this convolution will depend very much on the power of your CPU, but I expect you to find that it takes a couple of seconds. 
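As a reading aid for the `%timeit -n 1 -r 1` magic in the hunk above: it executes the statement once, in a single round, and reports that one measurement. A rough plain-Python equivalent, sketched here with `time.perf_counter` and assuming the lesson's `diracs`, `gauss`, and `convolve2d_cpu` are in scope (this sketch is not part of the patch):

~~~python
import time

start = time.perf_counter()
convolved_once = convolve2d_cpu(diracs, gauss)  # the same convolution as above
elapsed = time.perf_counter() - start
print(f"{elapsed:.1f} s")  # one run, one repetition, like %timeit -n 1 -r 1
~~~
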
@@ -137,12 +137,12 @@ This image depicts the different components of CPU and GPU and how they are conn This means that an array created from e.g. an iPython shell using NumPy is physically located into the main memory of the host, and therefore available for the CPU but not the GPU. It is not yet present in GPU memory, which means that we need to copy our data, the input image and the convolving function to the GPU, before we can execute any code on it. -In practice, we have the arrays `deltas` and `gauss` in the host's RAM, and we need to copy them to GPU memory using CuPy. +In practice, we have the arrays `diracs` and `gauss` in the host's RAM, and we need to copy them to GPU memory using CuPy. ~~~python import cupy as cp -deltas_gpu = cp.asarray(deltas) +diracs_gpu = cp.asarray(diracs) gauss_gpu = cp.asarray(gauss) ~~~ @@ -155,8 +155,8 @@ Let us again record the time to execute the convolution, so that we can compare ~~~python from cupyx.scipy.signal import convolve2d as convolve2d_gpu -convolved_image_using_GPU = convolve2d_gpu(deltas_gpu, gauss_gpu) -%timeit -n 7 -r 1 convolved_image_using_GPU = convolve2d_gpu(deltas_gpu, gauss_gpu) +convolved_image_using_GPU = convolve2d_gpu(diracs_gpu, gauss_gpu) +%timeit -n 7 -r 1 convolved_image_using_GPU = convolve2d_gpu(diracs_gpu, gauss_gpu) ~~~ Similar to what we had previously on the host, the execution time of the GPU convolution will depend very much on the GPU used. @@ -180,7 +180,7 @@ CuPy provides a function, `benchmark` that we can use to measure the time it tak ~~~python from cupyx.profiler import benchmark -execution_gpu = benchmark(convolve2d_gpu, (deltas_gpu, gauss_gpu), n_repeat=10) +execution_gpu = benchmark(convolve2d_gpu, (diracs_gpu, gauss_gpu), n_repeat=10) ~~~ The previous code executes `convolve2d_gpu` ten times, and stores the execution time of each run, in seconds, inside the `gpu_times` attribute of the variable `execution_gpu`. @@ -204,15 +204,15 @@ We now have a more reasonable, but still impressive, speedup of 116 times over t :::::::::::::::::::::::::::::::::::::: challenge ## Challenge: convolution on the GPU without CuPy -Try to convolve the NumPy array `deltas` with the NumPy array `gauss` directly on the GPU, without using CuPy arrays. -If this works, it should save us the time and effort of transferring `deltas` and `gauss` to the GPU. +Try to convolve the NumPy array `diracs` with the NumPy array `gauss` directly on the GPU, without using CuPy arrays. +If this works, it should save us the time and effort of transferring `diracs` and `gauss` to the GPU. ::::::::::::::::::::::::::::::::::::: solution -We can directly try to use the GPU convolution function `convolve2d_gpu` with `deltas` and `gauss` as inputs. +We can directly try to use the GPU convolution function `convolve2d_gpu` with `diracs` and `gauss` as inputs. 
~~~python -convolve2d_gpu(deltas, gauss) +convolve2d_gpu(diracs, gauss) ~~~ However, this gives a long error message ending with: @@ -252,9 +252,9 @@ A convenient solution is to group both the transfers, to and from the GPU, and t ~~~python def transfer_compute_transferback(): - deltas_gpu = cp.asarray(deltas) + diracs_gpu = cp.asarray(diracs) gauss_gpu = cp.asarray(gauss) - convolved_image_using_GPU = convolve2d_gpu(deltas_gpu, gauss_gpu) + convolved_image_using_GPU = convolve2d_gpu(diracs_gpu, gauss_gpu) convolved_image_using_GPU_copied_to_host = cp.asnumpy(convolved_image_using_GPU) execution_gpu = benchmark(transfer_compute_transferback, (), n_repeat=10) @@ -280,7 +280,7 @@ Vice versa, if we try to execute a regular SciPy routine (i.e. designed to run t Try the following: ~~~python -convolve2d_cpu(deltas_gpu, gauss_gpu) +convolve2d_cpu(diracs_gpu, gauss_gpu) ~~~ This results in @@ -303,9 +303,9 @@ For the latter we will take the diagonal elements of our 2D Gaussian kernel. Try the following three instructions for linear convolution on the CPU: ~~~python -deltas_1d = deltas.ravel() +diracs_1d = diracs.ravel() gauss_1d = gauss.diagonal() -%timeit -n 1 -r 1 np.convolve(deltas_1d, gauss_1d) +%timeit -n 1 -r 1 np.convolve(diracs_1d, gauss_1d) ~~~ You could arrive at something similar to this timing result: @@ -319,10 +319,10 @@ Now let us try something bold. We will transfer the 1D arrays to the GPU and use the NumPy routine to do the convolution. ~~~python -deltas_1d_gpu = cp.asarray(deltas_1d) +diracs_1d_gpu = cp.asarray(diracs_1d) gauss_1d_gpu = cp.asarray(gauss_1d) -execution_gpu = benchmark(np.convolve, (deltas_1d_gpu, gauss_1d_gpu), n_repeat=10) +execution_gpu = benchmark(np.convolve, (diracs_1d_gpu, gauss_1d_gpu), n_repeat=10) gpu_avg_time = np.average(execution_gpu.gpu_times) print(f"{gpu_avg_time:.6f} s") ~~~ From 1d3013834ddcd6dd32b6ced511929d46a9e5182d Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @laptop4" Date: Tue, 7 Nov 2023 22:14:57 +0100 Subject: [PATCH 10/17] E02-code: change variable names in the definition of the convolution filter The mean 'muu' is now 'origin' like in the illustration in the same section. 'dst' is now 'dist' which links more easily to the distance it means. --- episodes/cupy.Rmd | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/episodes/cupy.Rmd b/episodes/cupy.Rmd index 2818e5e..41daa8f 100644 --- a/episodes/cupy.Rmd +++ b/episodes/cupy.Rmd @@ -92,10 +92,10 @@ Remember that at this point we are still doing everything with standard Python, ~~~python x, y = np.meshgrid(np.linspace(-2, 2, 15), np.linspace(-2, 2, 15)) -dst = np.sqrt(x*x + y*y) +dist = np.sqrt(x*x + y*y) sigma = 1 -muu = 0.000 -gauss = np.exp(-(dst - muu)**2 / (2.0 * sigma**2)) +origin = 0.000 +gauss = np.exp(-(dist - origin)**2 / (2.0 * sigma**2)) pyl.imshow(gauss) pyl.show() ~~~ From ffd79abeb24b185f14ab1c82800f67965e4f0c2f Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @laptop4" Date: Tue, 7 Nov 2023 22:29:41 +0100 Subject: [PATCH 11/17] E02-code: rename figure with deltas as diracs.png This makes the changes in commit 2755175 render correctly with R markdown. Otheriwse you would get a file not found. This change should have been a part of 2755175, in fairness. 
--- episodes/fig/{deltas.png => diracs.png} | Bin 1 file changed, 0 insertions(+), 0 deletions(-) rename episodes/fig/{deltas.png => diracs.png} (100%) diff --git a/episodes/fig/deltas.png b/episodes/fig/diracs.png similarity index 100% rename from episodes/fig/deltas.png rename to episodes/fig/diracs.png From 5c1dd571fca3b9e7ab95103bfdbe3a93694accd4 Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @laptop4" Date: Tue, 7 Nov 2023 22:48:03 +0100 Subject: [PATCH 12/17] E02-code/convolution: improve semantics and length of variable names These changes concern the first episode half on image convolution. The notation is more uniform as to the usage of lower/uppercase and suffixes, and is possibly more concise. This is the table of changes: convolved_image_using_CPU > convolved_image_cpu convolved_image_using_GPU > convolved_image_gpu convolved_image_using_GPU_copied_to_host > convolved_image_gpu_in_host diracs_1d > diracs_1d_cpu execution_gpu > benchmark_gpu gauss_1d > gauss_1d_cpu gpu_avg_time > gpu_execution_avg transfer_compute_transferback > push_compute_pull (Git-inspired!) --- episodes/cupy.Rmd | 54 +++++++++++++++++++++++------------------------ 1 file changed, 27 insertions(+), 27 deletions(-) diff --git a/episodes/cupy.Rmd b/episodes/cupy.Rmd index 41daa8f..16c0ddf 100644 --- a/episodes/cupy.Rmd +++ b/episodes/cupy.Rmd @@ -111,8 +111,8 @@ Let us also record the time it takes to perform this convolution and inspect the ~~~python from scipy.signal import convolve2d as convolve2d_cpu -convolved_image_using_CPU = convolve2d_cpu(diracs, gauss) -pyl.imshow(convolved_image_using_CPU[0:32, 0:32]) +convolved_image_cpu = convolve2d_cpu(diracs, gauss) +pyl.imshow(convolved_image_cpu[0:32, 0:32]) pyl.show() %timeit -n 1 -r 1 convolve2d_cpu(diracs, gauss) ~~~ @@ -155,8 +155,8 @@ Let us again record the time to execute the convolution, so that we can compare ~~~python from cupyx.scipy.signal import convolve2d as convolve2d_gpu -convolved_image_using_GPU = convolve2d_gpu(diracs_gpu, gauss_gpu) -%timeit -n 7 -r 1 convolved_image_using_GPU = convolve2d_gpu(diracs_gpu, gauss_gpu) +convolved_image_gpu = convolve2d_gpu(diracs_gpu, gauss_gpu) +%timeit -n 7 -r 1 convolved_image_gpu = convolve2d_gpu(diracs_gpu, gauss_gpu) ~~~ Similar to what we had previously on the host, the execution time of the GPU convolution will depend very much on the GPU used. @@ -180,15 +180,15 @@ CuPy provides a function, `benchmark` that we can use to measure the time it tak ~~~python from cupyx.profiler import benchmark -execution_gpu = benchmark(convolve2d_gpu, (diracs_gpu, gauss_gpu), n_repeat=10) +benchmark_gpu = benchmark(convolve2d_gpu, (diracs_gpu, gauss_gpu), n_repeat=10) ~~~ -The previous code executes `convolve2d_gpu` ten times, and stores the execution time of each run, in seconds, inside the `gpu_times` attribute of the variable `execution_gpu`. +The previous code executes `convolve2d_gpu` ten times, and stores the execution time of each run, in seconds, inside the `gpu_times` attribute of the variable `benchmark_gpu`. We can then compute the average execution time and print it, as shown. ~~~python -gpu_avg_time = np.average(execution_gpu.gpu_times) -print(f"{gpu_avg_time:.6f} s") +gpu_execution_avg = np.average(benchmark_gpu.gpu_times) +print(f"{gpu_execution_avg:.6f} s") ~~~ Another advantage of the `benchmark` method is that it excludes the compile time, and warms up the GPU, so that measurements are more stable and representative. 
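To see that stability for yourself, the individual repetitions behind the average are also available; a small sketch, assuming the `benchmark_gpu` variable introduced by this patch (not lesson code):

~~~python
# Inspect the ten individual GPU timings collected by benchmark();
# a small spread around the mean indicates a stable measurement.
print(benchmark_gpu.gpu_times.ravel())
print(f"spread (std) of the repetitions: {benchmark_gpu.gpu_times.std():.6f} s")
~~~
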
@@ -227,10 +227,10 @@ It is unfortunately not possible to access NumPy arrays directly from the GPU be # Validation -To check that we actually computed the same output on the host and the device we can compare the two output arrays `convolved_image_using_GPU` and `convolved_image_using_CPU`. +To check that we actually computed the same output on the host and the device we can compare the two output arrays `convolved_image_gpu` and `convolved_image_cpu`. ~~~python -np.allclose(convolved_image_using_GPU, convolved_image_using_CPU) +np.allclose(convolved_image_gpu, convolved_image_cpu) ~~~ As you may expect, the result of the comparison is positive, and in fact we computed the same results on the host and the device. @@ -251,15 +251,15 @@ Hint: to copy a CuPy array back to the host (CPU), use the `cp.asnumpy()` functi A convenient solution is to group both the transfers, to and from the GPU, and the convolution into a single Python function, and then time its execution, like in the following example. ~~~python -def transfer_compute_transferback(): +def push_compute_pull(): diracs_gpu = cp.asarray(diracs) gauss_gpu = cp.asarray(gauss) - convolved_image_using_GPU = convolve2d_gpu(diracs_gpu, gauss_gpu) - convolved_image_using_GPU_copied_to_host = cp.asnumpy(convolved_image_using_GPU) + convolved_image_gpu = convolve2d_gpu(diracs_gpu, gauss_gpu) + convolved_image_gpu_in_host = cp.asnumpy(convolved_image_gpu) -execution_gpu = benchmark(transfer_compute_transferback, (), n_repeat=10) -gpu_avg_time = np.average(execution_gpu.gpu_times) -print(f"{gpu_avg_time:.6f} s") +benchmark_gpu = benchmark(push_compute_pull, (), n_repeat=10) +gpu_execution_avg = np.average(benchmark_gpu.gpu_times) +print(f"{gpu_execution_avg:.6f} s") ~~~ ~~~output @@ -286,9 +286,9 @@ convolve2d_cpu(diracs_gpu, gauss_gpu) This results in ~~~output -...... -...... -...... +... +... +... TypeError: Implicit conversion to a NumPy array is not allowed. Please use `.get()` to construct a NumPy array explicitly. ~~~ @@ -303,9 +303,9 @@ For the latter we will take the diagonal elements of our 2D Gaussian kernel. Try the following three instructions for linear convolution on the CPU: ~~~python -diracs_1d = diracs.ravel() -gauss_1d = gauss.diagonal() -%timeit -n 1 -r 1 np.convolve(diracs_1d, gauss_1d) +diracs_1d_cpu = diracs.ravel() +gauss_1d_cpu = gauss.diagonal() +%timeit -n 1 -r 1 np.convolve(diracs_1d_cpu, gauss_1d_cpu) ~~~ You could arrive at something similar to this timing result: @@ -319,12 +319,12 @@ Now let us try something bold. We will transfer the 1D arrays to the GPU and use the NumPy routine to do the convolution. ~~~python -diracs_1d_gpu = cp.asarray(diracs_1d) -gauss_1d_gpu = cp.asarray(gauss_1d) +diracs_1d_gpu = cp.asarray(diracs_1d_cpu) +gauss_1d_gpu = cp.asarray(gauss_1d_cpu) -execution_gpu = benchmark(np.convolve, (diracs_1d_gpu, gauss_1d_gpu), n_repeat=10) -gpu_avg_time = np.average(execution_gpu.gpu_times) -print(f"{gpu_avg_time:.6f} s") +benchmark_gpu = benchmark(np.convolve, (diracs_1d_gpu, gauss_1d_gpu), n_repeat=10) +gpu_execution_avg = np.average(benchmark_gpu.gpu_times) +print(f"{gpu_execution_avg:.6f} s") ~~~ You may be surprised that we can issue these commands without error. 
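Why these commands work can be shown in isolation. Here is a minimal sketch of NumPy deferring to CuPy through the `__array_function__` protocol; this example is mine, not part of the patches, and assumes CuPy and a CUDA-capable GPU are available:

~~~python
import numpy as np
import cupy as cp

x_gpu = cp.arange(5)   # a small array living in GPU memory
total = np.sum(x_gpu)  # NumPy dispatches the call to CuPy
print(type(total))     # <class 'cupy.ndarray'>: the result stays on the device
~~~
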
From 0ec5ab9688e710a351b6c1fb1a0a771983e17d1c Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @laptop4" Date: Wed, 8 Nov 2023 18:34:24 +0100 Subject: [PATCH 13/17] E02-language/convolution: revise first episode half (second pass) I looked at the first half in its entirety with a fresh pair of eyes before moving to the second episode half on astronomy. I would call these edits either stylistic (reading flow) or pedagogic (announce what you explain). Although the work is not complete, there should be no unglued bits left. --- episodes/cupy.Rmd | 63 ++++++++++++++++++++++++----------------------- 1 file changed, 32 insertions(+), 31 deletions(-) diff --git a/episodes/cupy.Rmd b/episodes/cupy.Rmd index 2ddf711..a9189e1 100644 --- a/episodes/cupy.Rmd +++ b/episodes/cupy.Rmd @@ -21,15 +21,15 @@ exercises: 60 [CuPy](https://cupy.dev) is a GPU array library that implements a subset of the NumPy and SciPy interfaces. Thanks to CuPy, people conversant with NumPy can very conveniently harvest the compute power of GPUs without writing code in GPU programming languages such as CUDA, OpenCL, and HIP. -From now on we can also use the word *host* to refer to the CPU on the laptop, desktop, or cluster node you are using as usual, and *device* to refer to the GPU. +From now on we can also use the word *host* to refer to the CPU on the laptop, desktop, or cluster node you are using as usual, and *device* to refer to the graphics card and its GPU. # Convolutions in Python We start by generating an image using Python and NumPy code. -We want to compute a convolution on this input image, once on the host and once on the device, and then compare both the execution times and the results. +We want to compute a convolution on this input image once on the host and once on the device, and then compare both the execution times and the results. In an iPython shell or a Jupyter notebook, we can write and execute the following code on the host. -The pixel values will be zero everywhere except for a regular grid of single pixels having value one, very much like a Dirac's delta function, hence here named `deltas`. +The pixel values will be zero everywhere except for a regular grid of single pixels having value one, very much like a Dirac's delta function; hence the input image is named `deltas`. ~~~python import numpy as np @@ -66,11 +66,11 @@ To know more about convolutions, we encourage you to check out [this GitHub repo In this course section, we will convolve our image with a 2D Gaussian function, having the general form: -$$G(x,y) = \frac{1}{2\pi \sigma^2} \exp\left(-\frac{x^2 + y^2}{2 \sigma^2}\right),$$ +$$G(x,y) = \frac{1}{2\pi \sigma^2} \exp\left(-\frac{x^2 + y^2}{2 \sigma^2}\right)$$ where $x$ and $y$ are distances from the origin, and $\sigma$ controls the width of the Gaussian curve. Since we can think of an image as a matrix of color values, the convolution of that image with a kernel generates a new matrix with different color values. -In particular, convolving images with 2D Gaussian kernel changes the value of each pixel into a weighted average of the neighboring pixels, thereby smoothing out the features in the input image. +In particular, convolving images with a 2D Gaussian kernel changes the value of each pixel into a weighted average of the neighboring pixels, thereby smoothing out the features in the input image. Convolutions are frequently used in computer vision to filter images. 
For example, Gaussian convolution can be required before applying algorithms for [edge detection](https://en.wikipedia.org/wiki/Gaussian_blur#Edge_detection), which are sensitive to the noise in the original image.
@@ -122,7 +122,8 @@ pyl.show()
 %timeit -n 1 -r 1 convolve2d_cpu(deltas, gauss)
 ~~~

-Obviously, the compute power of your CPU influences the actual execution time very much, and we expect that to be in the region of a couple of seconds.
+Obviously, the compute power of your CPU influences the actual execution time very much.
+We expect it to be in the region of a couple of seconds, as shown in the timing report below:

 ~~~output
 2.4 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
@@ -140,7 +141,7 @@ This picture depicts the different components of CPU and GPU and how they are co

 ![CPU and GPU are separate entities with an own memory.](./fig/CPU_and_GPU_separated.png){alt="CPU and GPU are separate entities with an own memory"}

-This means that the array created with NumPy is physically stored in the host's main memory and, therefore, is only available to the CPU.
+This means that the array created with NumPy is physically stored in a memory of the host and, therefore, is only available to the CPU.
 Since our input image and convolution filter are not yet present in the device memory, we need to copy both data to the GPU before executing any code on it.
 In practice, we use CuPy to copy the arrays `deltas` and `gauss` from the host's Random Access Memory (RAM) to the GPU memory as follows:

@@ -153,10 +154,10 @@ gauss_gpu = cp.asarray(gauss)

 Now it is time to compute the convolution on our GPU.
 Inconveniently, SciPy does not offer methods running on GPUs.
-Hence, we import the convolution function from a CuPy package aliased as `cupyx`, whose package [`cupyx.scipy`](https://docs.cupy.dev/en/stable/reference/scipy.html), in turn, performs a subset of the SciPy operations.
-We will soon verify that the GPU convolution function in `cupyx` works like the convolution function from SciPy called on the CPU.
-In general, CuPy proper and NumPy look so much similar as do the `cupyx` methods and SciPy; this is intended to invite programmers already familiar with NumPy and SciPy to use the GPU.
-For now, let us again record the time to execute on the device the same convolution as the host, so that we can compare the respective performances.
+Hence, we import the convolution function from a CuPy package aliased as `cupyx`, whose sub-package [`cupyx.scipy`](https://docs.cupy.dev/en/stable/reference/scipy.html) performs a selection of the SciPy operations.
+We will soon verify that the GPU convolution function of `cupyx` works out the same calculations as the CPU convolution function of SciPy.
+In general, CuPy proper is as similar to NumPy as the `cupyx` methods are to SciPy; this is intended to invite programmers already familiar with NumPy and SciPy to use the GPU for computing.
+For now, let's again record the execution time on the device for the same convolution as on the host, so that we can compare the respective performances.

 ~~~python
 from cupyx.scipy.signal import convolve2d as convolve2d_gpu
@@ -175,13 +175,13 @@ The timing using a NVIDIA Tesla T4 on Google Colab was:

 This is way faster than the host: more than a 24000-fold performance improvement, or speedup.
 Impressive, but is that true?

 ## Measuring performance

 So far we used `timeit` to measure the performance of our Python code, no matter whether it was running on the CPU or was GPU-accelerated.
-However, the execution on the GPU is *asynchronous*: the Python interpreter takes control of the program execution immediately, while the GPU is still executing the task. 
+However, the execution on the GPU is *asynchronous*: the Python interpreter takes back control of the program execution immediately, while the GPU is still executing the task.
 Therefore, the timing of `timeit` is not reliable.

 Conveniently, `cupyx` provides the function `benchmark` that measures the actual execution time in the GPU.
+The following code executes `convolve2d_gpu` with the appropriate arguments ten times, and stores inside the `gpu_times` attribute of the variable `execution_gpu` the execution time of each run in seconds.

 ~~~python
 from cupyx.profiler import benchmark

 execution_gpu = benchmark(convolve2d_gpu, (deltas_gpu, gauss_gpu), n_repeat=10)
 ~~~

-The previous code executes `convolve2d_gpu` with the appropriate arguments ten times, and stores inside the `gpu_times` attribute of the variable `execution_gpu` the execution time of each run in seconds.
-These measurements are also more stable and representative because `benchmarks` excludes the compile time and the repetitions warm up the GPU.
+These measurements are also more stable and representative, because `benchmark` disregards the compile time and the repetitions warm up the GPU.
 We can then average the execution times, as follows:

 ~~~python
 gpu_avg_time = np.average(execution_gpu.gpu_times)
 print(f"{gpu_avg_time:.6f} s")
 ~~~

 whereby the performance revisited is:

 ~~~output
 0.020642 s
 ~~~

-We now have a more reasonable, but still impressive, 116-fold speedup with respect to the host code.
+We now have a more reasonable, but still impressive, 116-fold speedup with respect to the execution on the host.

 :::::::::::::::::::::::::::::::::::::: challenge
 ## Challenge: convolution on the GPU without CuPy

-Try to convolve the NumPy array `deltas` with the NumPy array `gauss` directly on the GPU, without using CuPy arrays.
+Try to convolve the NumPy array `deltas` with the NumPy array `gauss` directly on the GPU, that is, without CuPy arrays.
 If this works, it will save us the time and effort of transferring the arrays `deltas` and `gauss` to the GPU.

 ::::::::::::::::::::::::::::::::::::: solution

-We can directly try to use the GPU convolution function `convolve2d_gpu` with `deltas` and `gauss` as inputs.
+We can call the GPU convolution function `convolve2d_gpu` directly with `deltas` and `gauss` as arguments:

 ~~~python
 convolve2d_gpu(deltas, gauss)
 ~~~

 However, this throws a long error message ending with:

 ~~~output
 TypeError: Unsupported type
 ~~~

-It is unfortunately not possible to access NumPy arrays directly from the GPU because they exist in the Random Access Memory (RAM) of the host and not in GPU memory.
+Unfortunately, it is impossible to access directly from the GPU the NumPy arrays that live in the host RAM.
 :::::::::::::::::::::::::::::::::::::
 ::::::::::::::::::::::::::::::::::::::

 ## Validation

-To check that we actually computed the same output on the host and on the device, we compare the two output arrays `convolved_image_using_GPU` and `convolved_image_using_CPU`, as follows:
+To check that the host and the device actually produced the same output, we compare the two output arrays `convolved_image_using_GPU` and `convolved_image_using_CPU` as follows:

 ~~~python
 np.allclose(convolved_image_using_GPU, convolved_image_using_CPU)
 ~~~

-As you may have expected, the positive result of the comparison confirms that we computed the same results on the host and on the device:
+As you may have expected, the outcome of the comparison confirms that the results on the host and on the device are the same:

 ~~~output
 array(True)
 ~~~

 :::::::::::::::::::::::::::::::::::::: challenge
 ## Challenge: fairer comparison of CPU vs. GPU

 Compute again the speedup achieved using the GPU, taking into account also the time spent transferring the data from the CPU to the GPU and back.

-Hint: use the `cp.asnumpy()` method to copy a CuPy array back to the host. 
+Hint: use the `cp.asnumpy` method to copy a CuPy array back to the host.

 :::::::::::::::::::::::::::::::::::::: solution

-A convenient solution is to group both the transfers, to and from the GPU, and the convolution into a single Python function, and then time its execution, like in the following example.
+A convenient strategy is to time the execution of a single Python function that groups the transfers to and from the GPU and the convolution, as follows:

 ~~~python
 def transfer_compute_transferback():
@@ -272,24 +271,24 @@ print(f"{gpu_avg_time:.6f} s")
 0.035400 s
 ~~~

-The speedup taking into account the data transfers decreased from 116 to 67.
-Accounting for the necessary data transfers is a better and fairer way to compute performance speedups.
-As an side note, here `timeit` would still provide correct measurements, because the data transfers force the device to *sync* with the host.
+The speedup taking into account the data transfer decreased from 116 to 67.
+Nonetheless, accounting for the necessary data transfers is a better and fairer way to compute performance speedups.
+As an aside, here `timeit` would still provide correct measurements, because the data transfers force the device and host to *sync* with each other.
 :::::::::::::::::::::::::::::::::::::
 ::::::::::::::::::::::::::::::::::::::

 ## A shortcut: performing NumPy routines on the GPU

-We saw in a previous challenge that we cannot launch the routines of the `cupyx` library directly on NumPy arrays.
-In fact, we first need to transfer the data from the host to the device memory.
-Vice versa, we will encounter an error also if we launch on a CuPy array a regular SciPy routine designed to run on CPUs.
-Try out the following:
+We saw in a previous challenge that we cannot launch the routines of `cupyx` directly on NumPy arrays.
+In fact, we first needed to transfer the data from the host to the device memory.
+Conversely, we also encounter an error when we launch a regular SciPy routine, designed to run on CPUs, on a CuPy array.
+Try out the following:

 ~~~python
 convolve2d_cpu(deltas_gpu, gauss_gpu)
 ~~~

 which results in

 ~~~output
 ...
 ...
 ...
 TypeError: Implicit conversion to a NumPy array is not allowed.
 Please use `.get()` to construct a NumPy array explicitly.
 ~~~

 So, SciPy routines do not accept CuPy arrays as input.
-However, instead of performing a 2D convolution, we can execute a simpler 1D (linear) convolution that uses the NumPy routine `convolve` instead of a SciPy routine.
-To generate the appropriate input for a linear convolution, we flatten our input image from 2D to 1D using the method `ravel`.
-To generate a 1D filter, we take the diagonal elements of our 2D Gaussian filter.
-Try the following three instructions for linear convolution on the CPU:
+However, instead of performing a 2D convolution, we can execute a simpler 1D (linear) convolution that uses the NumPy routine `convolve` instead of a SciPy routine.
+To generate the input appropriate to a linear convolution, we flatten our input image from 2D to 1D using the method `ravel`.
+To generate a 1D filter, we take the diagonal elements of our 2D Gaussian filter.
+Let's launch a linear convolution on the CPU with the following three instructions:

 ~~~python
 deltas_1d = deltas.ravel()
 gauss_1d = gauss.diagonal()
 %timeit -n 1 -r 1 np.convolve(deltas_1d, gauss_1d)
 ~~~

-You could arrive at something similar to this timing result:
+A realistic execution time is:

 ~~~output
 270 ms ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)
 ~~~

-We have performed a regular linear convolution using our CPU.
-Now let us try something bold.
-We will transfer the 1D arrays to the GPU and use the NumPy routine to do the convolution.
+After having performed this regular linear convolution on the host with NumPy, let's try something bold.
+We transfer the 1D arrays to the device and do the convolution with the same NumPy routine, as follows:

 ~~~python
 deltas_1d_gpu = cp.asarray(deltas_1d)
 gauss_1d_gpu = cp.asarray(gauss_1d)

 execution_gpu = benchmark(np.convolve, (deltas_1d_gpu, gauss_1d_gpu), n_repeat=10)
 gpu_avg_time = np.average(execution_gpu.gpu_times)
 print(f"{gpu_avg_time:.6f} s")
 ~~~

-You may be surprised that we can issue these commands without error.
-Contrary to SciPy routines, NumPy accepts CuPy arrays, i.e. arrays that exist in GPU memory, as input.
-[In the CuPy documentation](https://docs.cupy.dev/en/stable/user_guide/interoperability.html#numpy) you can find some background on why NumPy routines can handle CuPy arrays.
-
-Also, remember when we used `np.allclose` with a NumPy and a CuPy array as input?
-That worked for the same reason.
+You may be surprised that these commands do not throw any error.
+Contrary to SciPy, NumPy routines accept CuPy arrays as input, even though the latter exist only in GPU memory.
+Indeed, can you recall when we validated our codes using a NumPy and a CuPy array as input of `np.allclose`?
+That worked for the same reason.
+[The CuPy documentation](https://docs.cupy.dev/en/stable/user_guide/interoperability.html#numpy) explains why NumPy routines can handle CuPy arrays.

-The linear convolution is actually performed on the GPU, which also results in a lower execution time.
+The last linear convolution has actually been performed on the GPU, and it ran faster than on the CPU:

 ~~~output
 0.014529 s
 ~~~

-Without much coding effort, we obtained a 18-fold speedup. 
+With this NumPy shortcut and without much coding effort, we obtained a good 18-fold speedup.

 # A real world example: image processing for radio astronomy

From 9dd5c8cdfe615bf16524b58cb39f96fe80c54521 Mon Sep 17 00:00:00 2001
From: "Giordano Lipari @laptop4" 
Date: Thu, 9 Nov 2023 13:43:30 +0100
Subject: [PATCH 14/17] E02-language/pedagogy: incomplete revision of the
 second episode half on Astronomy
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Here linguistic editing has been largely functional to pedagogy improvements.

UNTIL THE SECTION ON BACKGROUND CHARACTERIZATION (STEP 1)

The style is too conversational about specialist knowledge, and seems to imply
that just speaking of specialist knowledge is sufficient for learners to focus
on the applications in GPU programming.
In fact, the current implementation of a good idea engages uninitiated
learners with considerable extraneous load at the expense of the germane load.
This may prove quite tiresome during the course and while revising the material.

The editing work so far to limit the side effects of a specialist application
involved, among others:
(1) re-locate sentences so as to guide the readers from more- to less-known notions;
(2) add a few Wikipedia links for exotic notions;
(3) highlight/disclose assumptions, purposes and conclusions for the
uninitiated readers.

Hopefully, the uninitiated reader is better prepared to grasp the purpose of
this section. Please double check for correctness.

FROM SECTION ON SEGMENTING (STEP 2) ONWARD

The text and logic become too hard to unravel and follow, both linguistically
and pedagogically. The germane and extraneous loads for learners are
considerable and possibly excessive. Intervening on these sections requires
too much guesswork. Only a light linguistic pass has then been applied as a
foothold for suggested work.

OVERALL ON THE ASTRONOMY PART

After having read the text and code several times, I guess that the purpose of
this standard routine in radio astronomy is to find sources lost WITHIN the
background, which is then translated into finding the sources WITHIN the
clipped dataset. Realising this (or anything that applies instead) should not
be guesswork in the first place. The course should provide the necessary,
accompanying and lightweight information early on.

Also, the assisting commentary should be in tune with the overall statement of
purpose of the method. Unglued specialist claims (for example, those on the
probability of finding bright pixels at random) should fall into place without
raising questions (how did you compute that probability? why is it equally
relevant to the initial image of 2048² pixels and to the clipped image, which
is arguably smaller?). The course should stay ahead of these doubts.
Shifting attention between inconsistent/incomplete specialist claims muddles
the notions of GPU programming.

As a lodestar, the learner should recognize in the text -- as easily as
possible -- what to do, how to do that, and why to do that.

(This extended commit message may well become the explanation of a pull
request/issue.)
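To make the probability question above concrete: a short check along these lines could accompany the lesson text (my sketch, not lesson code; it assumes independent pixels and exactly Gaussian noise):

~~~python
from scipy.stats import norm

p_single = norm.sf(5)          # one-sided tail beyond 5 sigma, about 2.9e-7
n_pixels = 2048 ** 2           # about 4.2e6 pixels in the image
p_any = 1 - (1 - p_single) ** n_pixels
print(f"expected noise peaks: {p_single * n_pixels:.2f}")
print(f"chance of at least one: {p_any:.2f}")
~~~

Under these assumptions the chance of at least one spurious 5-sigma pixel
comes out near 70%, with about 1.2 such pixels expected, which supports the
doubt raised above and suggests stating the expected count rather than a
sub-50% chance.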
---
 episodes/cupy.Rmd | 101 ++++++++++++++++++++++++----------------------
 1 file changed, 52 insertions(+), 49 deletions(-)

diff --git a/episodes/cupy.Rmd b/episodes/cupy.Rmd
index a9189e1..034534e 100644
--- a/episodes/cupy.Rmd
+++ b/episodes/cupy.Rmd
@@ -343,14 +343,16 @@ The last linear convolution has actually been performed on the GPU, and faster t

With this NumPy shortcut and without much coding effort, we obtained a good 18-fold speedup.

-# A real world example: image processing for radio astronomy
+# A scientific application: image processing for radio astronomy

-In this section, we will perform the four major steps in image processing for radio astronomy: determination of background characteristics, segmentation, connected component labeling and source measurements.
+In this section, we will perform four classical steps in image processing for radio astronomy: determination of the background characteristics, segmentation, connected component labeling, and source measurements.

-## Import the FITS image
+## Import the FITS file

-Start by importing a 2048² pixels image of the Galactic Center, an image made from observations by the Indian Giant Metrewave Radio Telescope (GMRT) at 150 MHz.
-The image is stored [in this repository](./data/GMRT_image_of_Galactic_Center.fits) as a [FITS](https://en.wikipedia.org/wiki/FITS) file, and to read it we need the `astropy` Python package.
+We import a 2048²-pixel image of the electromagnetic radiation at 150 MHz around the Galactic Center, as observed by the Indian Giant Metrewave Radio Telescope (GMRT).
+
+The image is stored [in this lesson's repository](./data/GMRT_image_of_Galactic_Center.fits) as a [FITS](https://en.wikipedia.org/wiki/FITS) file.
+The `astropy` Python package enables us to read in this file and, for compatibility with Python, convert the [byte ordering](https://en.wikipedia.org/wiki/Endianness) from big endian to little endian, as follows:

~~~python
from astropy.io import fits
@@ -359,11 +361,10 @@ with fits.open("GMRT_image_of_Galactic_Center.fits") as hdul:
     data = hdul[0].data.byteswap().newbyteorder()
~~~

-The latter two methods are needed to convert byte ordering from big endian to little endian.
-
## Inspect the image

-Let's have a look at part of this image.
+Let's have a look at this image.
+Left and right in the data will be swapped with the method `fliplr` just to adhere to the astronomical convention of the [right ascension](https://en.wikipedia.org/wiki/Right_ascension) increasing leftwards.

~~~python
from matplotlib.colors import LogNorm
@@ -376,31 +377,20 @@ im_plot = ax.imshow(np.fliplr(data), cmap=pyl.cm.gray_r, norm=LogNorm(vmin = max
 pyl.colorbar(im_plot, ax=ax)
~~~

-![Image of the Galactic Center](./fig/improved_image_of_GC.png){alt="Image of the Galactic Center"}
-
-The data has been switched left to right because Right Ascension increases to the left, so now it adheres to this astronomical convention.
-We can see a few dozen sources, especially in the lower left and upper right, which both have relatively good image quality.
-The band across the image from upper left to lower right has the worst image quality because there are many extended sources - especially the Galactic Center itself, in the middle - which are hard to deconvolve with the limited spacings from this observation.
-You can, however, see a couple of ring-like structures.
-These are actually supernova shells, i.e. the remnants from massive stars that exploded at the end of their lives.
+![Image of the Galactic Center at the radio frequency of 150 MHz](./fig/improved_image_of_GC.png){alt="Image of the Galactic Center"}

-## Determine the background characteristics of the image
+Stars, remnants of supernovas (massive exploded stars), and distant galaxies are possible radiation sources observed in this image.
+We can spot a few dozen radiation sources in fairly good quality in the lower left and upper right.
+In contrast, the quality is worse along the diagonal from the upper left to the lower right, because that band contains many extended sources, which are hard to deconvolve with the limited antenna spacings of this observation.
+Nonetheless, you can notice the Galactic Center in the middle of the image and supernova shells as ring-like formations elsewhere.

-We want to identify all the sources - meaning e.g. stars, supernova remnants and distant galaxies - in this image and measure their positions and fluxes.
-How do we separate source pixels from background pixels? When do we know if a pixel with a high value belongs to a source or is simply a noise peak? We assume the background noise, which is a reflection of the limited sensitivity of the radio telescope, has a normal distribution.
-The chance of having a background pixel with a value above 5 times the standard deviation is 2.9e-7.
-We have 2²² = 4.2e6 pixels in our image, so the chance of catching at least one random noise peak by setting a threshold of 5 times the standard deviation is less than 50%.
-We refer to the standard deviation as $\sigma$.
+The limited sensitivity of the telescope has added background noise to the image.
+Yet, we want to identify all the radiation sources in this image, determine their positions, and measure their radiation fluxes.
+How can we then assert whether a high-intensity pixel is a peak of the noise or a genuine source?
+Assuming that the background noise is normally distributed with standard deviation $\sigma$, the chance of a noise pixel having an intensity larger than $5\sigma$ is 2.9e-7.
+With 2048² (about 4.2e6) pixels in our image, we thus expect only about one pixel (4.2e6 × 2.9e-7 ≈ 1.2) to be that bright purely because of noise.

-How do we measure the standard deviation of the background pixels? First we need to separate them from the source pixels, based on their values, in the sense that high pixel values more likely come from sources.
-The technique we use is called $\kappa, \sigma$ clipping.
-First we take all pixels and compute the standard deviation ($\sigma$).
-Then we compute the median and clip all pixels larger than $median + 3 * \sigma$ and smaller than $median - 3 * \sigma$.
-From the clipped set, we compute the median and standard deviation again and clip again.
-Continue until no more pixels are clipped.
-The standard deviation from this final set of pixel values is the basis for the next step.
-
-Before clipping, let's investigate some properties of our unclipped data.
+Before moving on, let's determine some summary statistics of the radiation intensity in the input image:

~~~python
mean_ = data.mean()
@@ -411,16 +401,25 @@ print(f"mean = {mean_:.3e}, median = {median_:.3e}, sttdev = {stddev_:.3e}, \
 maximum = {max_:.3e}")
~~~

-The maximum flux density is 2506 mJy/beam, coming from the Galactic Center itself, so from the center of the image, while the overall standard deviation is 19.9 mJy/beam:
+This gives (in Jy/beam):

~~~output
mean = 3.898e-04, median = 1.571e-05, sttdev = 1.993e-02, maximum = 2.506e+00
~~~

-You might observe that $\kappa, \sigma$ clipping is a compute intense task, that is why we want to do it on a GPU.
-But let's first issue the algorithm on a CPU.
+The maximum flux density is 2506 mJy/beam, coming from the Galactic Center; the overall standard deviation is 19.9 mJy/beam and the median is 0.016 mJy/beam.
+
+## Step 1: Determining the characteristics of the background

-This is the NumPy code to do this:
+First, we separate the background pixels from the source pixels using an iterative procedure called $\kappa$-$\sigma$ clipping, which assumes that high intensity is more likely to result from genuine sources.
+We start from the standard deviation ($\sigma$) and the median ($\mu_{1/2}$) of all pixel intensities, as computed above.
+We then discard (clip) the pixels whose intensity is larger than $\mu_{1/2} + 3\sigma$ or smaller than $\mu_{1/2} - 3\sigma$.
+In the next pass, we compute again the median and standard deviation of the clipped set, and clip once more.
+We repeat these operations until no more pixels are clipped, that is, until there are no outliers left in the designated tails.
+
+You may have noticed already that $\kappa$-$\sigma$ clipping is a compute-intensive task that could benefit from a GPU.
+For the moment, let's implement the algorithm in NumPy for the CPU, as follows:

~~~python
# Flattening our 2D data first makes subsequent steps easier.
@@ -444,14 +443,14 @@ print(f"Fastest CPU ks clipping time = \
{1000 * fastest_ks_clipping_cpu:.3e} ms.")
~~~

+The execution takes close to 1 s and, hopefully, can be sped up on the GPU:
+
~~~output
793 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Fastest CPU ks clipping time = 7.777e+02 ms.
~~~

-So that is close to 1 second to perform these computations.
-Hopefully, we can speed this up using the GPU.
-How has $\kappa, \sigma$ clipping influenced our statistics?
+Finally, let's see how the $\kappa$-$\sigma$ clipping has influenced the summary statistics:

~~~python
clipped_mean_ = data_clipped.mean()
@@ -463,17 +462,20 @@ print(f"mean of clipped = {clipped_mean_:.3e}, median of clipped = \
 {clipped_stddev_:.3e}, maximum of clipped = {clipped_max_:.3e}")
~~~

-All output statistics have become smaller which is reassuring; it seems data_clipped contains mostly background pixels:
+The first-order statistics have become smaller, which reassures us that `data_clipped` contains mostly background pixels:

~~~output
mean of clipped = -1.945e-06, median of clipped = -9.796e-06
standard deviation of clipped = 1.334e-02, maximum of clipped = 4.000e-02
~~~

+The standard deviation of the intensity in the background pixels will be the basis for the next step.
+
:::::::::::::::::::::::::::::::::::::: challenge
-## Challenge: $\kappa, \sigma$ clipping on the GPU
-Now that you understand how the $\kappa, \sigma$ clipping algorithm works, perform it on the GPU using CuPy and compute the speedup.
-Include the data transfer to and from the GPU in your calculation.
+## Challenge: $\kappa$-$\sigma$ clipping on the GPU
+
+Now that you know how the $\kappa$-$\sigma$ clipping algorithm works, perform it on the GPU using CuPy.
+Compute the speedup, including the data transfer to and from the GPU.
::::::::::::::::::::::::::::::::::::: solution
## Solution

@@ -507,11 +509,12 @@ The speedup factor for ks clipping is: 1.232e+01
:::::::::::::::::::::::::::::::::::::
::::::::::::::::::::::::::::::::::::::

-## Segment the image
+## Step 2: Segmenting the image
+
+We have already estimated that a threshold at the $5\sigma$ level lets through, on average, only about one noise peak from the 2048² pixels of our image.
+So let's set the threshold at $5\sigma$ and segment the image.

-We have seen that clipping at the $5 \sigma$ level of an image this size (2048² pixels) will yield a chance of less than 50% that from all the sources we detect at least one will be a noise peak.
-So let's set the threshold at $5 \sigma$ and segment it.
-First check that we find the same standard deviation from our clipper on the GPU:
+First let's check that the standard deviation from our clipper on the GPU is the same:

~~~python
stddev_gpu_ = np.std(data_clipped_on_GPU)
@@ -522,7 +525,7 @@ print(f"standard deviation of background_noise = {stddev_gpu_:.4f} Jy/beam")
standard deviation of background_noise = 0.0133 Jy/beam
~~~

-With the standard deviation computed we apply the $5 \sigma$ threshold to the image.
+We then apply the $5 \sigma$ threshold to the image with this standard deviation:

~~~python
threshold = 5 * stddev_gpu_
@@ -538,7 +541,7 @@ ms.")
Fastest CPU segmentation time = 6.294e+00 ms.
~~~

-## Labeling of the segmented data
+## Step 3: Labeling the segmented data

This is called connected component labeling (CCL).
-It will replace pixel values in the segmented image - just consisting of zeros and ones - of the first connected group of pixels with the value 1 - so nothing changed, but just for that first group - the pixel values in the second group of connected pixels will all be 2, the third connected group of pixels will all have the value 3 etc.
+CCL relabels the segmented image, which so far consists only of zeros and ones.
+The pixels of the first connected group keep the value 1, so nothing changes for them; the pixels of the second connected group all receive the value 2, those of the third group the value 3, and so on.

@@ -593,7 +596,7 @@ These are all the pixel values we can find in the labeled image:
[ 0. 1. 2. 182. 183. 184. 185.]
~~~

-# Source measurements
+## Step 4: Measuring the radiation sources

We are ready for the final step.
-We have been given observing time to make this beautiful image of the Galactic Center, we have determined its background statistics, we have separated actual cosmic sources from noise and now we want to measure these cosmic sources.
+We have been given observing time to make this beautiful image of the Galactic Center; we have determined its background statistics and separated actual cosmic sources from noise, and now we want to measure these cosmic sources.

@@ -656,7 +659,7 @@ Fastest CPU set of source measurements = 7.838e+02 ms.

:::::::::::::::::::::::::::::::::::::: challenge
## Challenge: putting it all together

-Combine the first two steps of image processing for astronomy, i.e. determining background characteristics e.g. through $\kappa, \sigma$ clipping and segmentation into a single function, that works for both CPU and GPU.
+Combine the first two steps of image processing for astronomy, i.e. determining the background characteristics through $\kappa$-$\sigma$ clipping and segmenting the image, into a single function that works for both CPU and GPU.
Next, write a function for connected component labeling and source measurements on the GPU and calculate the overall speedup factor for the combined four steps of image processing in astronomy on the GPU relative to the CPU.
Finally, verify your output by comparing with the previous output, using the CPU.
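+
+If you need a starting point, here is one possible skeleton: a sketch rather than a full solution, built around `cp.get_array_module()`, which returns the `numpy` module for NumPy arrays and the `cupy` module for CuPy arrays (the function name `clip_and_segment` is only illustrative):
+
+~~~python
+import numpy as np
+import cupy as cp
+
+def clip_and_segment(data):
+    # One code path serves host and device: xp is numpy or cupy,
+    # depending on where the input array lives
+    xp = cp.get_array_module(data)
+    data_flat = data.ravel()
+    # kappa-sigma clipping, as in the earlier challenge
+    while True:
+        med = xp.median(data_flat)
+        std = xp.std(data_flat)
+        clipped = data_flat[(data_flat > med - 3 * std)
+                            & (data_flat < med + 3 * std)]
+        if clipped.size == data_flat.size:
+            break
+        data_flat = clipped
+    # Segmentation: threshold the image at 5 sigma of the background
+    return xp.where(data > 5 * xp.std(data_flat), 1, 0)
+~~~
+
+Wrapping the transfers to and from the device around a call to this function, as done earlier in `ks_clipper_gpu()`, lets `benchmark()` time the GPU version with the transfers included.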
From 202080f33c778092eb5b90bab6e8e8200e4182d4 Mon Sep 17 00:00:00 2001
From: "Giordano Lipari @laptop4"
Date: Thu, 9 Nov 2023 14:19:13 +0100
Subject: [PATCH 15/17] E02-code/astronomy: make variable names shorter and patterned uniformly

---
 episodes/cupy.Rmd | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/episodes/cupy.Rmd b/episodes/cupy.Rmd
index 16c0ddf..d2505bc 100644
--- a/episodes/cupy.Rmd
+++ b/episodes/cupy.Rmd
@@ -421,29 +421,30 @@ But let's first issue the algorithm on a CPU.
This is the NumPy code to do this:

~~~python
-# Flattening our 2D data first makes subsequent steps easier.
+# Flattening our 2D data makes subsequent steps easier
data_flat = data.ravel()

-# Here is a kappa, sigma clipper for the CPU
-def kappa_sigma_clipper(data_flat):
+
+# Here is a kappa-sigma clipper for the CPU
+def ks_clipper_cpu(data_flat):
    while True:
        med = np.median(data_flat)
        std = np.std(data_flat)
-        clipped_lower = data_flat.compress(data_flat > med - 3 * std)
-        clipped_both = clipped_lower.compress(clipped_lower < med + 3 * std)
-        if len(clipped_both) == len(data_flat):
+        clipped_below = data_flat.compress(data_flat > med - 3 * std)
+        clipped_data = clipped_below.compress(clipped_below < med + 3 * std)
+        if len(clipped_data) == len(data_flat):
            break
-        data_flat = clipped_both
+        data_flat = clipped_data
    return data_flat

-data_clipped = kappa_sigma_clipper(data_flat)
-timing_ks_clipping_cpu = %timeit -o kappa_sigma_clipper(data_flat)
+data_clipped_cpu = ks_clipper_cpu(data_flat)
+timing_ks_clipping_cpu = %timeit -o ks_clipper_cpu(data_flat)
 fastest_ks_clipping_cpu = timing_ks_clipping_cpu.best
-print(f"Fastest CPU ks clipping time = {1000 * fastest_ks_clipping_cpu:.3e} ms.")
+print(f"Fastest ks clipping time on CPU = {1000 * fastest_ks_clipping_cpu:.3e} ms.")
~~~

~~~output
793 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
-Fastest CPU ks clipping time = 7.777e+02 ms.
+Fastest ks clipping time on CPU = 7.777e+02 ms.
~~~

So that is close to 1 second to perform these computations.
@@ -452,9 +453,9 @@ How has $\kappa, \sigma$ clipping influenced our statistics?

~~~python
-clipped_mean_ = data_clipped.mean()
-clipped_median_ = np.median(data_clipped)
-clipped_stddev_ = np.std(data_clipped)
-clipped_max_ = np.amax(data_clipped)
+clipped_mean_ = data_clipped_cpu.mean()
+clipped_median_ = np.median(data_clipped_cpu)
+clipped_stddev_ = np.std(data_clipped_cpu)
+clipped_max_ = np.amax(data_clipped_cpu)

print(f"mean of clipped = {clipped_mean_:.3e}, median of clipped =
{clipped_median_:.3e} \n standard deviation of clipped =
{clipped_stddev_:.3e},
@@ -479,10 +480,10 @@ Include the data transfer to and from the GPU in your calculation.
First check that we find the same standard deviation from our clipper on the GPU: ~~~python -stddev_gpu_ = np.std(data_clipped_on_GPU) +stddev_gpu_ = np.std(data_clipped_gpu) print(f"standard deviation of background_noise = {stddev_gpu_:.4f} Jy/beam") ~~~ @@ -660,7 +661,7 @@ Finally, verify your output by comparing with the previous output, using the CPU ~~~python def first_two_steps_for_both_CPU_and_GPU(data): data_flat = data.ravel() - data_clipped = kappa_sigma_clipper(data_flat) + data_clipped = ks_clipper_cpu(data_flat) stddev_ = np.std(data_clipped) threshold = 5 * stddev_ segmented_image = np.where(data > threshold, 1, 0) From ff8978fec1daeb318a1162307b2ee3c1d1941610 Mon Sep 17 00:00:00 2001 From: "Giordano Lipari @laptop4" Date: Thu, 9 Nov 2023 14:19:13 +0100 Subject: [PATCH 16/17] E02-code/astronomy: make variable names shorter and patterned uniformly --- episodes/cupy.Rmd | 37 +++++++++++++++++++------------------ 1 file changed, 19 insertions(+), 18 deletions(-) diff --git a/episodes/cupy.Rmd b/episodes/cupy.Rmd index 16c0ddf..d2505bc 100644 --- a/episodes/cupy.Rmd +++ b/episodes/cupy.Rmd @@ -421,29 +421,30 @@ But let's first issue the algorithm on a CPU. This is the NumPy code to do this: ~~~python -# Flattening our 2D data first makes subsequent steps easier. +# Flattening our 2D data makes subsequent steps easier data_flat = data.ravel() -# Here is a kappa, sigma clipper for the CPU -def kappa_sigma_clipper(data_flat): + +# Here is a kappa-sigma clipper for the CPU +def ks_clipper_cpu(data_flat): while True: med = np.median(data_flat) std = np.std(data_flat) - clipped_lower = data_flat.compress(data_flat > med - 3 * std) - clipped_both = clipped_lower.compress(clipped_lower < med + 3 * std) - if len(clipped_both) == len(data_flat): + clipped_below = data_flat.compress(data_flat > med - 3 * std) + clipped_data = clipped_below.compress(clipped_below < med + 3 * std) + if len(clipped_data) == len(data_flat): break - data_flat = clipped_both + data_flat = clipped_data return data_flat -data_clipped = kappa_sigma_clipper(data_flat) -timing_ks_clipping_cpu = %timeit -o kappa_sigma_clipper(data_flat) +data_clipped_cpu = ks_clipper_cpu(data_flat) +timing_ks_clipping_cpu = %timeit -o ks_clipper_cpu(data_flat) fastest_ks_clipping_cpu = timing_ks_clipping_cpu.best -print(f"Fastest CPU ks clipping time = {1000 * fastest_ks_clipping_cpu:.3e} ms.") +print(f"Fastest ks clipping time on CPU = {1000 * fastest_ks_clipping_cpu:.3e} ms.") ~~~ ~~~output 793 ms ± 17.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each) -Fastest CPU ks clipping time = 7.777e+02 ms. +Fastest ks clipping time on CPU = 7.777e+02 ms. ~~~ So that is close to 1 second to perform these computations. @@ -452,9 +453,9 @@ How has $\kappa, \sigma$ clipping influenced our statistics? ~~~python clipped_mean_ = data_clipped.mean() -clipped_median_ = np.median(data_clipped) -clipped_stddev_ = np.std(data_clipped) -clipped_max_ = np.amax(data_clipped) +clipped_median_ = np.median(data_clipped_cpu) +clipped_stddev_ = np.std(data_clipped_cpu) +clipped_max_ = np.amax(data_clipped_cpu) print(f"mean of clipped = {clipped_mean_:.3e}, median of clipped = {clipped_median_:.3e} \n standard deviation of clipped = {clipped_stddev_:.3e}, @@ -479,10 +480,10 @@ Include the data transfer to and from the GPU in your calculation. 
From 7d31afacf19127affa995e9cc19575e09cc804d0 Mon Sep 17 00:00:00 2001
From: "Giordano Lipari @laptop4"
Date: Fri, 10 Nov 2023 09:56:24 +0100
Subject: [PATCH 17/17] E02-language/code: typesetting convention for functions/methods/variables/attributes

Using this native error message as an example:
Please use `.get()` to construct a NumPy array explicitly.
I have applied the following usage:
(1) Functions are typeset as `function()`
(2) Methods as either `.method()` or `module.method()`
(3) Variables as `variable`
(4) Attributes as `.attribute`
No decision made yet as to the Jupyter magics.

This is an example of the implementation:
The previous code executes `convolve2d_gpu()` ten times,
and stores the execution time of each run, in seconds,
inside the `.gpu_times` attribute of the variable `benchmark_gpu`.
We can then compute the average execution time and print it, as shown.

For me: this commit shows the limit of the division into the branches E02-code and E02-language.
The edits in this commit are in-line code text, so editing this code directly in the language-dedicated branch saves me one merge with conflicts.
---
 episodes/cupy.Rmd | 18 +++++++++---------
 1 file changed, 9 insertions(+), 9 deletions(-)

diff --git a/episodes/cupy.Rmd b/episodes/cupy.Rmd
index f0934dd..0cebca7 100644
--- a/episodes/cupy.Rmd
+++ b/episodes/cupy.Rmd
@@ -182,8 +182,8 @@ So far we used `timeit` to measure the performance of our Python code, no matter
However, the execution on the GPU is *asynchronous*: the Python interpreter takes back control of the program execution immediately, while the GPU is still executing the task.
Therefore, the timing of `timeit` is not reliable.

-Conveniently, `cupyx` provides the function `benchmark` that measures the actual execution time in the GPU.
-The following code executes `convolve2d_gpu` with the appropriate arguments ten times, and stores inside the `gpu_times` attribute of the variable `execution_gpu` the execution time of each run in seconds.
+Conveniently, `cupyx` provides the function `benchmark()` that measures the actual execution time on the GPU.
+The following code executes `convolve2d_gpu()` with the appropriate arguments ten times, and stores inside the `.gpu_times` attribute of the variable `benchmark_gpu` the execution time of each run in seconds.
~~~python
from cupyx.profiler import benchmark

benchmark_gpu = benchmark(convolve2d_gpu, (diracs_gpu, gauss_gpu), n_repeat=10)
~~~

-These measurements are also more stable and representative, because `benchmarks` disregards the compile time and the repetitions warm up the GPU.
+These measurements are also more stable and representative, because `benchmark()` disregards the compile time and the repetitions warm up the GPU.
We can then average the execution times, as follows:

~~~python
@@ -215,7 +215,7 @@ If this works, it will save us the time and effort of transferring the arrays `d

::::::::::::::::::::::::::::::::::::: solution

-We can call the GPU convolution function `convolve2d_gpu` directly with `deltas` and `gauss` as argument:
+We can call the GPU convolution function `convolve2d_gpu()` directly with `diracs` and `gauss` as arguments:

~~~python
convolve2d_gpu(diracs, gauss)
@@ -250,7 +250,7 @@ array(True)

Compute again the speedup achieved using the GPU, taking into account also the time spent transferring the data from the CPU to the GPU and back.

-Hint: use the `cp.asnumpy` method to copy a CuPy array back to the host.
+Hint: use the `cp.asnumpy()` method to copy a CuPy array back to the host.

:::::::::::::::::::::::::::::::::::::: solution

@@ -300,8 +300,8 @@ Please use `.get()` to construct a NumPy array explicitly.
~~~

So, SciPy routines do not accept CuPy arrays as input.
-However, instead of performing a 2D convolution, we can execute a simpler 1D (linear) convolution that uses the NumPy routine `convolve` instead of a SciPy routine.
-To generate the input appropriate to a linear convolution, we flatten our input image from 2D to 1D using the method `ravel`.
+However, instead of performing a 2D convolution, we can execute a simpler 1D (linear) convolution that uses the NumPy routine `np.convolve()` instead of a SciPy routine.
+To generate the input appropriate to a linear convolution, we flatten our input image from 2D to 1D using the method `.ravel()`.
To generate a 1D filter, we take the diagonal elements of our 2D Gaussian filter.
Let's launch a linear convolution on the CPU with the following three instructions:

@@ -331,7 +331,7 @@ print(f"{gpu_execution_avg:.6f} s")

You may be surprised that these commands do not throw any error.
Contrary to SciPy, NumPy routines accept CuPy arrays as input, even though the latter exist only in GPU memory.
-Indeed, can you recall when we validated our codes using a NumPy and a CuPy array as input of `np.allclose`?
+Indeed, can you recall when we validated our codes using a NumPy and a CuPy array as input of `np.allclose()`?
That worked for the same reason.
[The CuPy documentation](https://docs.cupy.dev/en/stable/user_guide/interoperability.html#numpy) explains why NumPy routines can handle CuPy arrays.

@@ -364,7 +364,7 @@ with fits.open("GMRT_image_of_Galactic_Center.fits") as hdul:

## Inspect the image

Let's have a look at this image.
-Left and right in the data will be swapped with the method `fliplr` just to adhere to the astronomical convention of the [right ascension](https://en.wikipedia.org/wiki/Right_ascension) increasing leftwards.
+Left and right in the data will be swapped with the method `np.fliplr()` just to adhere to the astronomical convention of the [right ascension](https://en.wikipedia.org/wiki/Right_ascension) increasing leftwards.

~~~python
from matplotlib.colors import LogNorm