diff --git a/_extras/19-gpu.md b/_extras/19-gpu.md
new file mode 100644
index 0000000..1cabc5f
--- /dev/null
+++ b/_extras/19-gpu.md
@@ -0,0 +1,282 @@
+---
+title: "Running a parallel job on the GPU"
+teaching: 30
+exercises: 30
+questions:
+- "How do we execute a task in parallel on a GPU?"
+objectives:
+- "Understand what a GPU is."
+- "See how to run a parallel job on a GPU."
+- "Learn how the thread configuration affects GPU performance."
+keypoints:
+- "GPUs are becoming more commonplace in clusters and can provide higher performance than CPUs for certain applications."
+- "There are a number of frameworks targeting GPUs; PyCUDA is one of the more accessible ones for beginners."
+- "To get good performance on GPUs, the thread configuration must be considered."
+---
+
+We now turn our attention to running a job on a GPU. This is another
+important aspect of HPC systems, as GPU parallelism can bring added performance
+benefits for many applications.
+The example in this section reimplements the estimation of π introduced for MPI in
+the [Running a Parallel Job]({{ page.root }}/16-parallel/#a-serial-solution-to-the-problem) lesson.
+
+## What is a GPU?
+
+While most CPUs now have several processor cores inside, GPUs instead have several thousand
+lightweight cores which can provide high performance when given the same instructions.
+Below, a simplified view of the inside of a CPU is compared with the inside of a GPU.
+{% include figure.html url="" caption="" max-width="60%"
+   file="/fig/gpu-vs-cpu.png"
+   alt="Simplified View of a CPU and GPU" %}
+
+Originally, Graphics Processing Units were designed to accelerate graphics calculations,
+but increasingly they are targeted by other types of codes (this is known as GPGPU, or
+"General Purpose GPU", programming). They are a natural fit for parallel codes, as they
+were designed to apply the same instructions to many data elements at once.
+
+However, programming GPUs brings additional complexity: separate instructions are needed
+for what runs on the GPU (the device) and what runs on the host (usually CPU) side. As well
+as this, getting good performance on GPUs often requires tuning the threads targeting the GPU,
+which demands another level of expertise.
+
+## Example GPU Code
+
+As an example, we will reimplement the stochastic algorithm for estimating the value of
+π (the ratio of a circle's circumference to its diameter) on the GPU.
+To recap, the program generates a large number of random points on a 1×1 square
+centered on (½,½), and checks how many of these points fall
+inside the unit circle.
+On average, π/4 of the randomly-selected points should fall in the
+circle, so π can be estimated from 4*f*, where *f* is the observed
+fraction of points that fall in the circle.
+Each sample is independent, making this algorithm easy to implement in parallel.
+For more details about the algorithm, see the
+[Running a Parallel Job]({{ page.root }}/16-parallel/#a-serial-solution-to-the-problem)
+lesson.
+
+{% include figure.html url="" caption="" max-width="40%"
+   file="/fig/pi.png"
+   alt="Algorithm for computing pi through random sampling" %}
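+
+For reference, a condensed serial version of this estimator might look roughly like the sketch
+below (the variable names are illustrative; the full serial program is described in the
+[Running a Parallel Job]({{ page.root }}/16-parallel/#a-serial-solution-to-the-problem) lesson):
+
+```
+import numpy as np
+
+# Illustrative serial sketch: sample points in the unit square and count
+# how many fall inside the unit circle.
+n_samples = 16384
+x = np.random.uniform(size=n_samples)
+y = np.random.uniform(size=n_samples)
+count = np.sum(x * x + y * y < 1.0)
+my_pi = 4.0 * count / n_samples
+print(my_pi)
+```
+{: .language-python}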
+
+## Running the Example Code on the GPU
+
+In this example we will use PyCUDA for parallelism. PyCUDA makes the CUDA parallel computing
+API available from within Python. CUDA is a very popular framework for GPUs on HPC systems;
+however, it is limited to NVIDIA GPUs.
+
+> ## What is CUDA?
+> CUDA (Compute Unified Device Architecture) is a GPU programming interface developed by
+> NVIDIA. In the CUDA API, data must be wrapped in special buffers and explicitly transferred
+> to the device (GPU). In addition, the threads which perform the work on the GPU must be
+> explicitly configured, and choosing a good configuration is unfortunately not an exact science.
+> Finally, kernels must be written to perform the computations on the GPU.
+> Kernels are functions marked with special keywords that indicate the code will run on the GPU.
+{: .callout}
+
+> ## What is PyCUDA?
+> PyCUDA provides a relatively simple Python syntax for using CUDA, which on its own can be
+> difficult for HPC beginners to learn. While PyCUDA itself provides a syntax for
+> copying data to/from the GPU and running a kernel, complicated kernels must still be written in a C-like
+> syntax and included as a header file or wrapped in a string.
+>
+> Note that there are several ways of running PyCUDA, ranging from managing every CUDA call
+> explicitly to leaving almost everything to PyCUDA.
+> This tutorial aims to provide an example which lies somewhere in the middle. More detailed information and examples can be
+> found in the [PyCUDA documentation tutorial](https://documen.tician.de/pycuda/tutorial.html).
+{: .callout}
+
+> ## What Changes Are Needed for a CUDA Version of the π Calculator?
+>
+> Several modules must be imported to use PyCUDA. The `pycuda.driver` module must first be imported
+> (which in this example we rename to `cuda` for convenience).
+> `SourceModule` from the `pycuda.compiler` module must also be imported, which will
+> read in the GPU kernel.
+> Finally, the `pycuda.autoinit` module must be imported, which automatically initialises and cleans
+> up CUDA calls.
+>
+> Additional modifications to the serial script demonstrate three important concepts:
+>
+> * Defining the estimation kernel on the GPU
+> * Transferring data to the GPU
+> * Defining the thread configuration for running the estimation kernel
+>
+> The kernel is where all the calculations for estimating π are performed. In this example,
+> the number of samples corresponds to the number of threads launched, and each thread performs a single
+> check for its sample: if the sample is inside the circle, the thread records a count of 1,
+> and if it is outside the circle, it records 0.
+> After the kernel has run, the number of values inside the circle is counted up on
+> the host (CPU) side.
+>
+> The kernel is called from the host side, so data that is used inside the kernel needs to be defined
+> on the host side. In this example, the x and y values and an array for keeping track of the number of
+> points inside and outside the circle are defined on the host side and then passed to the kernel.
+> Special instructions are given to the data buffers to define whether they are read-only, write-only,
+> or both readable and writeable.
+>
+> In addition to passing these three array inputs, the thread setup must also be passed into the kernel.
+> This is done with two additional parameters:
+> * Number of thread blocks in a grid (can be 1D)
+> * Number of threads in a block (can be 1D, but must use the same number of dimensions as the grid)
+>
+> Below is a 2D example of how threads combine to form a block, and how thread blocks combine to form a grid.
+> More information about threads on the GPU can be found in [Selecting a CUDA Thread Configuration]({{ page.root }}/19-gpu/#selecting-a-cuda-thread-configuration).
+{: .discussion}
+{% include figure.html url="" caption="" max-width="60%"
+   file="/fig/cudathreadblock.png"
+   alt="Simplified View of Threads and Thread Blocks" %}
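+
+To make these launch parameters concrete, the small, purely illustrative calculation below
+(the numbers are made up) shows how a block size and a grid size combine to give the total
+number of threads:
+
+```
+# Illustrative only: block and grid sizes can each have up to three dimensions.
+block = (8, 8, 1)   # threads per block
+grid = (4, 4, 1)    # thread blocks per grid
+
+threads_per_block = block[0] * block[1] * block[2]                # 64
+total_threads = threads_per_block * grid[0] * grid[1] * grid[2]   # 1024
+print(threads_per_block, total_threads)
+```
+{: .language-python}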
+
+We will now step through these changes to the code in more detail.
+
+First, instead of allocating the x and y value arrays inside the estimation function, they are
+now created on the host side:
+```
+x_values = np.random.uniform(size=SZ)
+x_values = x_values.astype(np.float32)
+y_values = np.random.uniform(size=SZ)
+y_values = y_values.astype(np.float32)
+total_count = np.zeros((SZ,), dtype='int32')
+```
+{: .language-python}
+As before, `np` comes from `import numpy as np`.
+
+Then the GPU kernel (`calculatePi`) is read in as a `SourceModule` object
+in the variable `mod`.
+
+```
+mod = SourceModule("""
+    __global__ void calculatePi(float *x_values, float *y_values, int *total_count) {
+        int tid = threadIdx.x + blockIdx.x * blockDim.x;
+        if ((x_values[tid] * x_values[tid]) + (y_values[tid] * y_values[tid]) < 1.0f) {
+            total_count[tid] = 1;
+        }
+        else {
+            total_count[tid] = 0;
+        }
+    }
+    """)
+```
+{: .language-python}
+The syntax of this function is C-like, although it also contains the keyword `__global__`, indicating
+that it is a function which is called from the host to be run on the GPU.
+
+The `SourceModule` call compiles this code for the GPU, but we still need one extra step to be
+able to call the kernel from PyCUDA:
+```
+func = mod.get_function("calculatePi")
+```
+{: .language-python}
+
+`func` can now be called from the host, passing in three inputs to the GPU kernel `calculatePi`.
+The first two of these are `x_values` and `y_values`, which are used in the calculation and are read-only.
+Thus, when they are passed to `func`, they are wrapped in a `cuda.In` call, which means they
+are only read in as input on the GPU. The array `total_count`, however, is wrapped in a `cuda.InOut` call,
+indicating that this array is both readable and writeable.
+```
+func(cuda.In(x_values), cuda.In(y_values), cuda.InOut(total_count), grid=(NUM_BLOCK, 1, 1), block=(NUM_THREAD, 1, 1))
+```
+{: .language-python}
+
+Two additional parameters are passed into `func`:
+```
+grid = (NUM_BLOCK, 1, 1)
+block = (NUM_THREAD, 1, 1)
+```
+{: .language-python}
+
+These parameters correspond to the number of parallel threads which are designated for the GPU (behind the scenes
+in the hardware it is slightly more complicated than this, but this is a good approximation from the programmer's
+perspective). In our simple example,
+`NUM_BLOCK` and `NUM_THREAD` are set to the same value, which is given by the input parameter to the program
+(this value is squared to determine the number of samples).
+The grid that is passed in to the GPU kernel indicates the number of thread blocks to use,
+and each block then contains a number of threads. In our example only the first dimension is used, but up to
+three dimensions can be used for each.
+
+After the kernel has run, the number of values inside the circle is counted up into `count` back on the host side.
+```
+count = sum(total_count)
+```
+{: .language-python}
+
+Then the value of π can be calculated as before (in the full script, `my_pi = (4.0 * count) / SZ`).
+
+A fully commented version of the final GPU parallel Python code is available
+[here](/files/pi-pycuda.py).
+
+> ## Selecting a CUDA Thread Configuration
+>
+> In this simple example, the GPU performs exactly one check per sample, so the number of samples
+> is simply the number of threads launched (i.e. `NUM_BLOCK`\*`NUM_THREAD`).
+> However, there is a limit to how many threads can run on a GPU, and datasets are often much larger than this.
+> For larger problems it is therefore not possible to simply assign one data point to each thread.
+> Instead, each thread must handle several points (one common way of doing this is sketched below),
+> and achieving good performance then may require tuning the number of thread blocks as well as the
+> number of threads per block.
+> Unfortunately this is not an exact science: certain thread configurations (powers of two, as one example)
+> provide better performance than others, and the best choice varies between applications.
+> In particular, configurations where consecutive threads access memory from the same region (that is,
+> accesses are "coalesced") tend to perform well. There are some heuristics available; however,
+> optimal performance often requires manual experimentation. Alternatively, a variety of tuning
+> frameworks can perform this search automatically, and tools like
+> [NVIDIA NSight](https://docs.nvidia.com/nsight-compute/NsightCompute/index.html#occupancy-calculator)
+> can help determine the optimal occupancy (how well utilised the GPU is).
+{: .callout}
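+
+As a rough illustration of threads handling more than one sample each, the sketch below (not part of
+this lesson's example files; the sizes and kernel are illustrative) has each thread stride through the
+samples in steps of the total number of threads launched:
+
+```
+import numpy as np
+import pycuda.autoinit        # initialises CUDA
+import pycuda.driver as cuda
+from pycuda.compiler import SourceModule
+
+SZ = 1000000                  # many more samples than threads
+NUM_BLOCK = 64
+NUM_THREAD = 128              # 64 * 128 = 8192 threads in total
+
+x_values = np.random.uniform(size=SZ).astype(np.float32)
+y_values = np.random.uniform(size=SZ).astype(np.float32)
+total_count = np.zeros((SZ,), dtype='int32')
+
+# Each thread starts at its global thread ID and jumps ahead by the total
+# number of threads until all samples are covered (a "grid-stride" loop).
+mod = SourceModule("""
+    __global__ void calculatePi(float *x_values, float *y_values, int *total_count, int n) {
+        int stride = blockDim.x * gridDim.x;
+        for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += stride) {
+            if ((x_values[i] * x_values[i]) + (y_values[i] * y_values[i]) < 1.0f) {
+                total_count[i] = 1;
+            }
+            else {
+                total_count[i] = 0;
+            }
+        }
+    }
+    """)
+
+func = mod.get_function("calculatePi")
+func(cuda.In(x_values), cuda.In(y_values), cuda.InOut(total_count), np.int32(SZ),
+     grid=(NUM_BLOCK, 1, 1), block=(NUM_THREAD, 1, 1))
+
+print(4.0 * sum(total_count) / SZ)
+```
+{: .language-python}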
+
+Now we will create a submission file to run our code (saved here as `gpu-pi.py`) on a single GPU
+on a compute node:
+
+```
+{{ site.remote.prompt }} nano gpu-pi.sh
+{{ site.remote.prompt }} cat gpu-pi.sh
+```
+{: .language-bash}
+
+```
+#!/usr/bin/env bash
+#SBATCH -J gpu-pi
+#SBATCH --partition=gpu
+#SBATCH --qos=gpu
+#SBATCH --gres=gpu:1
+#SBATCH --nodes=1
+#SBATCH --exclusive
+
+# Load the computing environment we need
+module use /lustre/sw/modulefiles.miniconda3
+module load pytorch/1.10.1-gpu
+
+# Execute the task
+python gpu-pi.py 128
+```
+{: .language-bash}
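+
+As an aside, a quick way to check whether PyCUDA can see a GPU from a given environment (for
+example, in an interactive session on a GPU node) is a short script along these lines (a sketch,
+not part of the lesson's files):
+
+```
+# Illustrative sanity check: list the GPUs PyCUDA can see on this node.
+import pycuda.driver as cuda
+import pycuda.autoinit  # initialises the CUDA driver
+
+print("GPUs visible:", cuda.Device.count())
+print("Using:", cuda.Device(0).name())
+```
+{: .language-python}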
+
+Then submit your job. We will use the batch file to set the options,
+rather than the command line.
+
+```
+{{ site.remote.prompt }} {{ site.sched.submit.name }} gpu-pi.sh
+```
+{: .language-bash}
+
+As before, use the status commands to check when your job runs.
+Use `ls` to locate the output file, and examine it.
+Is it what you expected?
+
+* How good is the value for π?
+* How much faster was this run than the serial run with 16384 points?
+
+Modify the job script to increase the input value from 128 to 256, 512 and 1024
+(recall that the number of samples is the square of this value), and resubmit
+the job each time.
+
+* How good is the value for π?
+* How long did the job take to run?
+
+In an HPC environment, we try to reduce the execution time for all types of
+jobs, and using CUDA to offload work to GPUs is another popular way of doing so.
+To learn about parallelization more generally, see the
+[parallel novice lesson](http://www.hpc-carpentry.org/hpc-parallel-novice/).
+
+{% include links.md %}
diff --git a/cudathreadblock.png b/cudathreadblock.png
new file mode 100644
index 0000000..3b0e7db
Binary files /dev/null and b/cudathreadblock.png differ
diff --git a/files/pi-pycuda-minimal.py b/files/pi-pycuda-minimal.py
new file mode 100644
index 0000000..9ffd99c
--- /dev/null
+++ b/files/pi-pycuda-minimal.py
@@ -0,0 +1,72 @@
+#!/usr/bin/env python3
+
+import pycuda.driver as cuda
+from pycuda.compiler import SourceModule
+import pycuda.autoinit
+import sys
+import datetime
+import numpy as np
+
+def main():
+
+    if len(sys.argv) > 1:
+        NUM_BLOCK = int(sys.argv[1])
+    else:
+        NUM_BLOCK = 512
+
+    NUM_THREAD = NUM_BLOCK
+    SZ = NUM_BLOCK*NUM_THREAD
+
+    # Time how long it takes to estimate π.
+    start_time = datetime.datetime.now()
+
+    x_values = np.random.uniform(size=SZ)
+    x_values = x_values.astype(np.float32)
+    y_values = np.random.uniform(size=SZ)
+    y_values = y_values.astype(np.float32)
+    total_count = np.zeros((SZ,), dtype='int32')
+
+    mod = SourceModule("""
+        __global__ void calculatePi(float *x_values, float *y_values, int *total_count) {
+            int tid = threadIdx.x + blockIdx.x * blockDim.x;
+            if ((x_values[tid] * x_values[tid]) + (y_values[tid] * y_values[tid]) < 1.0f) {
+                total_count[tid] = 1;
+            }
+            else {
+                total_count[tid] = 0;
+            }
+        }
+        """)
+
+    func = mod.get_function("calculatePi")
+
+    func(cuda.In(x_values), cuda.In(y_values), cuda.InOut(total_count), grid=(NUM_BLOCK, 1, 1), block=(NUM_THREAD, 1, 1))
+
+    count = sum(total_count)
+
+    # final calculation
+    my_pi = (4.0 * count) / SZ
+
+    elapsed_time = (datetime.datetime.now() - start_time).total_seconds()
+
+    # Memory required is dominated by the x_values, y_values and total_count
+    # arrays (each with 4-byte elements), calculated in MiB
+    size_of_element = np.dtype(np.float32).itemsize
+    memory_required = 3 * SZ * size_of_element / (1024**2)
+
+    # accuracy is calculated as a percent difference from the known value
+    # of π.
+    pi_specific = np.pi
+    accuracy = 100*(1-my_pi/pi_specific)
+
+    # Uncomment either summary format for verbose or terse output
+    # summary = "{:d} threads, {:d} samples, {:f} MiB memory, {:f} seconds, {:f}% error"
+    summary = "{:d},{:d},{:f},{:f},{:f}"
+    print(summary.format(SZ, SZ, memory_required, elapsed_time, accuracy))
+
+
+if __name__ == '__main__':
+    main()
diff --git a/files/pi-pycuda.py b/files/pi-pycuda.py
new file mode 100644
index 0000000..0a07ade
--- /dev/null
+++ b/files/pi-pycuda.py
@@ -0,0 +1,113 @@
+#!/usr/bin/env python3
+
+"""Parallel example code for estimating the value of π.
+
+We can estimate the value of π by a stochastic algorithm. Consider a
+circle of radius 1, inside a square that bounds it, with vertices at
+(1,1), (1,-1), (-1,-1), and (-1,1). The area of the circle is just π,
+whereas the area of the square is 4. So, the fraction of the area of the
+square which is covered by the circle is π/4.
+
+A point selected at random uniformly from the square thus has a
+probability π/4 of being within the circle.
+
+We can estimate π by examining a large number of randomly-selected
+points from the square, and seeing what fraction of them lie within the
+circle. If this fraction is f, then our estimate for π is π ≈ 4f.
+
+Thanks to symmetry, we can compute points in one quadrant, rather
+than within the entire square, and arrive at identical results.
+
+This task lends itself naturally to parallelization -- the task of
+selecting a sample point and deciding whether or not it's inside the
+circle is independent of all the other samples, so they can be done
+simultaneously. We only need to aggregate the data at the end to compute
+our fraction f and our estimate for π.
+"""
+
+import pycuda.driver as cuda
+from pycuda.compiler import SourceModule
+import pycuda.autoinit
+import sys
+import datetime
+import numpy as np
+
+def main():
+
+    if len(sys.argv) > 1:
+        NUM_BLOCK = int(sys.argv[1])
+    else:
+        NUM_BLOCK = 512
+
+    # Set the grid size (number of thread blocks) to be the same as the
+    # thread block size
+    NUM_THREAD = NUM_BLOCK
+
+    # Set the number of samples to be the number of thread blocks
+    # multiplied by the number of threads per block
+    SZ = NUM_BLOCK*NUM_THREAD
+
+    # Time how long it takes to estimate π.
+    start_time = datetime.datetime.now()
+
+    # Initialise x and y values to be used in the estimation
+    x_values = np.random.uniform(size=SZ)
+    x_values = x_values.astype(np.float32)
+    y_values = np.random.uniform(size=SZ)
+    y_values = y_values.astype(np.float32)
+
+    # Initialise the per-sample array recording whether each point falls inside the circle
+    total_count = np.zeros((SZ,), dtype='int32')
+
+    # This is the GPU kernel, defined in C and read in as a string.
+    # The "__global__" keyword indicates the function is intended to be called from the
+    # host and run on the GPU.
+    # The variable "tid" holds the current (global) thread ID. Each thread performs the
+    # estimation for one sample.
+    mod = SourceModule("""
+        __global__ void calculatePi(float *x_values, float *y_values, int *total_count) {
+            int tid = threadIdx.x + blockIdx.x * blockDim.x;
+            if ((x_values[tid] * x_values[tid]) + (y_values[tid] * y_values[tid]) < 1.0f) {
+                total_count[tid] = 1;
+            }
+            else {
+                total_count[tid] = 0;
+            }
+        }
+        """)
+
+    # "get_function" enables the GPU kernel to be called from PyCUDA
+    func = mod.get_function("calculatePi")
+
+    # Now the GPU kernel is called with three array inputs and two further inputs determining
+    # the grid size and the thread block size
+    func(cuda.In(x_values), cuda.In(y_values), cuda.InOut(total_count), grid=(NUM_BLOCK, 1, 1), block=(NUM_THREAD, 1, 1))
+
+    # Sum up the total number of points found inside the circle
+    count = sum(total_count)
+
+    # final calculation of π
+    my_pi = (4.0 * count) / SZ
+
+    elapsed_time = (datetime.datetime.now() - start_time).total_seconds()
+
+    # Memory required is dominated by the x_values, y_values and total_count
+    # arrays (each with 4-byte elements), calculated in MiB
+    size_of_element = np.dtype(np.float32).itemsize
+    memory_required = 3 * SZ * size_of_element / (1024**2)
+
+    # accuracy is calculated as a percent difference from the known value
+    # of π.
+    pi_specific = np.pi
+    accuracy = 100*(1-my_pi/pi_specific)
+
+    # Uncomment either summary format for verbose or terse output
+    # summary = "{:d} threads, {:d} samples, {:f} MiB memory, {:f} seconds, {:f}% error"
+    summary = "{:d},{:d},{:f},{:f},{:f}"
+    print(summary.format(SZ, SZ, memory_required, elapsed_time, accuracy))
+
+
+if __name__ == '__main__':
+    main()
diff --git a/gpu-vs-cpu.png b/gpu-vs-cpu.png
new file mode 100644
index 0000000..0d3d324
Binary files /dev/null and b/gpu-vs-cpu.png differ