HW1_APP

High Performance Computing Profiling Project with Multithreading

A project that measures the performance of three parallel implementations (Pthreads, OpenMP, and MPI) of the K-Means clustering algorithm and compares them with the serial one.
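For context, the serial baseline follows the standard Lloyd iteration: assign every point to its nearest centroid, then recompute each centroid as the mean of its assigned points. The sketch below only illustrates that loop; names and data layout are hypothetical, not the exact code in this repository.

```c
#include <float.h>
#include <stdlib.h>

/* One K-Means (Lloyd) iteration over n points of dimension d with k centroids.
 * points:    n * d values, row-major
 * centroids: k * d values, updated in place
 * labels:    n entries, receives the index of the nearest centroid            */
static void kmeans_iteration(const double *points, double *centroids,
                             int *labels, int n, int d, int k)
{
    double *sums   = calloc((size_t)k * d, sizeof(double));
    int    *counts = calloc((size_t)k, sizeof(int));

    /* Assignment step: nearest centroid by squared Euclidean distance. */
    for (int i = 0; i < n; i++) {
        double best = DBL_MAX;
        int best_c = 0;
        for (int c = 0; c < k; c++) {
            double dist = 0.0;
            for (int j = 0; j < d; j++) {
                double diff = points[i * d + j] - centroids[c * d + j];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; best_c = c; }
        }
        labels[i] = best_c;
        counts[best_c]++;
        for (int j = 0; j < d; j++)
            sums[best_c * d + j] += points[i * d + j];
    }

    /* Update step: each new centroid is the mean of its assigned points. */
    for (int c = 0; c < k; c++)
        if (counts[c] > 0)
            for (int j = 0; j < d; j++)
                centroids[c * d + j] = sums[c * d + j] / counts[c];

    free(sums);
    free(counts);
}
```

The parallel versions measured below essentially distribute the assignment step (the n × k distance computations) across threads or processes.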

First Week:

A performance analysis was made for the serial code with k = 40 clusters and n = 21000 data points. The picture below shows the system components on which these tests were run.

[figure: system configuration used for the tests]

The execution time is approximately 45 s:

[figures: serial execution time measurements]

This code runs in the context of a single thread within the current process:

[figure: single-thread execution view]

Runtime on Haswell (serial): 9.55 s

[figure: Haswell serial runtime]

Runtime on Nehalem (serial): 14.23 s

[figure: Nehalem serial runtime]

Second Week:

The number of clusters and data points was increased (k = 1000 and n = 100000) to make the speedup visible:

[figures: execution results for the larger problem size]

The profiling for OpenMP:

[figures: OpenMP profiling results]

Third Week:

Cluster Analysis

All reported values are execution times in seconds.

Serial

| Input Size | Nehalem | Haswell |
| --- | --- | --- |
| 100_000 x 1000 | 143.125 | 87.167 |

Pthreads

Pthreads - Nehalem

| Input Size | 2 threads | 4 threads | 8 threads | 16 threads |
| --- | --- | --- | --- | --- |
| 100_000 x 1000 | 134.794 | 59.74 | 35.38 | 18.943 |
| 50_000 x 1000 | 67.277 | 29.942 | 16.563 | 10.344 |
| 50_000 x 500 | 30.185 | 17.178 | 9.197 | 5.545 |

Pthreads - Haswell

| Input Size | 2 threads | 4 threads | 8 threads | 16 threads | 32 threads |
| --- | --- | --- | --- | --- | --- |
| 100_000 x 1000 | 89.367 | 51.804 | 28.695 | 14.375 | 9.655 |
| 50_000 x 1000 | 44.689 | 25.043 | 14.043 | 11.878 | 5.128 |
| 50_000 x 500 | 23.754 | 12.893 | 7.601 | 3.73 | 2.653 |
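As a rough illustration of where the thread counts in these tables come in, the assignment step can be split into contiguous chunks of points, one chunk per Pthread. This is only a sketch under that assumption; the helper and structure names are hypothetical, not taken from the repository.

```c
#include <float.h>
#include <pthread.h>

/* Hypothetical helper: index of the centroid closest to a single point. */
static int nearest_centroid(const double *pt, const double *centroids, int d, int k)
{
    double best = DBL_MAX;
    int best_c = 0;
    for (int c = 0; c < k; c++) {
        double dist = 0.0;
        for (int j = 0; j < d; j++) {
            double diff = pt[j] - centroids[c * d + j];
            dist += diff * diff;
        }
        if (dist < best) { best = dist; best_c = c; }
    }
    return best_c;
}

typedef struct {
    const double *points, *centroids;
    int *labels;
    int start, end;   /* this thread handles points in [start, end) */
    int d, k;
} worker_arg_t;

/* Each worker assigns its own block of points to the nearest centroid. */
static void *assign_worker(void *p)
{
    worker_arg_t *a = p;
    for (int i = a->start; i < a->end; i++)
        a->labels[i] = nearest_centroid(&a->points[i * a->d], a->centroids, a->d, a->k);
    return NULL;
}

/* Launch nthreads workers, each covering a contiguous chunk of the points. */
static void assign_parallel(const double *points, const double *centroids,
                            int *labels, int n, int d, int k, int nthreads)
{
    pthread_t    tid[nthreads];
    worker_arg_t arg[nthreads];
    int chunk = (n + nthreads - 1) / nthreads;

    for (int t = 0; t < nthreads; t++) {
        int start = t * chunk;
        int end   = start + chunk < n ? start + chunk : n;
        arg[t] = (worker_arg_t){ points, centroids, labels, start, end, d, k };
        pthread_create(&tid[t], NULL, assign_worker, &arg[t]);
    }
    for (int t = 0; t < nthreads; t++)
        pthread_join(tid[t], NULL);
}
```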

The Pthreads execution results using the Nehalem cluster queue:

[figure: Pthreads execution times on the Nehalem queue]

The Pthreads execution results using the Haswell cluster queue:

[figure: Pthreads execution times on the Haswell queue]

OpenMP

OpenMP (schedule dynamic) - Nehalem

| Batch Size | 2 threads | 4 threads | 8 threads | 16 threads |
| --- | --- | --- | --- | --- |
| 10 | 148.257 | 87.084 | 41.73 | 24.026 |
| 100 | 143.948 | 76.816 | 43.726 | 26.017 |
| 1000 | 161.435 | 76.766 | 43.328 | 25.06 |

OpenMP (schedule dynamic) - Haswell

| Batch Size | 2 threads | 4 threads | 8 threads | 16 threads | 32 threads |
| --- | --- | --- | --- | --- | --- |
| 10 | 115.28 | 60.727 | 33.862 | 19.299 | 10.556 |
| 100 | 118.624 | 57.926 | 36.297 | 24.152 | 13.855 |
| 1000 | 107.127 | 60.922 | 38.177 | 20.975 | 14.449 |

OpenMP (schedule auto) - Nehalem

| Batch Size | 2 threads | 4 threads | 8 threads | 16 threads |
| --- | --- | --- | --- | --- |
| 1000 | 142.63 | 89.555 | 42.222 | 23.656 |

OpenMP (schedule auto) - Haswell

| Batch Size | 2 threads | 4 threads | 8 threads | 16 threads | 32 threads |
| --- | --- | --- | --- | --- | --- |
| 1000 | 102.299 | 61.44 | 34.019 | 19.407 | 10.101 |

OpenMP (schedule static) - Nehalem

| Batch Size | 2 threads | 4 threads | 8 threads | 16 threads |
| --- | --- | --- | --- | --- |
| 1000 | 141.916 | 90.011 | 46.101 | 22.542 |

OpenMP (schedule static) - Haswell

| Batch Size | 2 threads | 4 threads | 8 threads | 16 threads | 32 threads |
| --- | --- | --- | --- | --- | --- |
| 1000 | 104.413 | 61.315 | 32.389 | 19.29 | 10.117 |

On Haswell, the schedule(static), schedule(auto), and schedule(dynamic) (with BATCH_SIZE = 10) directives produce very similar execution times.
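As a minimal sketch of what this static/dynamic/auto comparison varies, the OpenMP version can hang the schedule clause on the assignment loop, with BATCH_SIZE acting as the chunk size. The loop body here is illustrative, not the repository's exact code.

```c
#include <float.h>

#define BATCH_SIZE 1000   /* chunk of iterations handed out per scheduling decision */

/* Assignment step parallelized with OpenMP.  Replacing
 * schedule(dynamic, BATCH_SIZE) with schedule(static) or schedule(auto)
 * reproduces the variants compared in the tables above.                 */
void assign_openmp(const double *points, const double *centroids,
                   int *labels, int n, int d, int k)
{
    #pragma omp parallel for schedule(dynamic, BATCH_SIZE)
    for (int i = 0; i < n; i++) {
        double best = DBL_MAX;
        int best_c = 0;
        for (int c = 0; c < k; c++) {
            double dist = 0.0;
            for (int j = 0; j < d; j++) {
                double diff = points[i * d + j] - centroids[c * d + j];
                dist += diff * diff;
            }
            if (dist < best) { best = dist; best_c = c; }
        }
        labels[i] = best_c;
    }
}
```

Since every iteration does roughly the same amount of work (k × d operations), there is little load imbalance for dynamic scheduling to fix, which may explain why the three schedules end up so close.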

The OpenMP execution results using the Nehalem cluster queue:

[figure: OpenMP execution times on the Nehalem queue]

Zooming in on the area of interest:

[figure: zoomed view]

The OpenMP execution results using the Haswell cluster queue:

[figure: OpenMP execution times on the Haswell queue]

Zooming in on the area of interest:

[figure: zoomed view]

MPI

MPI - Nehalem

| Input Size | 2 Processes | 4 Processes | 8 Processes | 16 Processes | 32 Processes | 64 Processes | 128 Processes | 256 Processes | 512 Processes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100_000 x 1000 | 77.469 | 39.392 | 20.708 | 16.726 | 17.046 | 18.957 | 20.937 | ? | 61.669 |
| 50_000 x 1000 | 39.148 | 20.440 | 11.173 | 9.261 | 10.114 | 11.059 | 13.234 | ? | 47.971 |
| 50_000 x 500 | 20.687 | 11.249 | 6.484 | 6.012 | 6.145 | 7.217 | 9.897 | ? | 38.113 |

MPI - Haswell

| Input Size | 2 Processes | 4 Processes | 8 Processes | 16 Processes | 32 Processes | 64 Processes | 128 Processes | 256 Processes | 512 Processes |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 100_000 x 1000 | 45.781 | 24.748 | 14.793 | 9.613 | 9.501 | 16.242 | 19.344 | ? | 49.467 |
| 50_000 x 1000 | 24.095 | 14.026 | 8.386 | 6.039 | 6.260 | 10.367 | 14.173 | ? | 44.060 |
| 50_000 x 500 | 13.410 | 8.011 | 5.443 | 4.501 | 4.631 | 8.326 | 11.119 | ? | 35.298 |
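For the MPI version, a common decomposition (assumed here; the repository's actual communication pattern may differ) is to give each rank a block of points, let it run the local assignment step, and then combine the per-rank centroid sums and counts with MPI_Allreduce so every rank updates identical centroids:

```c
#include <stdlib.h>
#include <mpi.h>

/* After the local assignment step each rank holds partial sums and counts
 * for its own points.  Allreduce combines them so every rank can compute
 * the same updated centroids for the next iteration.                      */
void reduce_centroids(const double *local_sums, const int *local_counts,
                      double *centroids, int d, int k)
{
    double *global_sums   = malloc((size_t)k * d * sizeof(double));
    int    *global_counts = malloc((size_t)k * sizeof(int));

    MPI_Allreduce(local_sums, global_sums, k * d, MPI_DOUBLE,
                  MPI_SUM, MPI_COMM_WORLD);
    MPI_Allreduce(local_counts, global_counts, k, MPI_INT,
                  MPI_SUM, MPI_COMM_WORLD);

    for (int c = 0; c < k; c++)
        if (global_counts[c] > 0)
            for (int j = 0; j < d; j++)
                centroids[c * d + j] = global_sums[c * d + j] / global_counts[c];

    free(global_sums);
    free(global_counts);
}
```

Under this kind of scheme the reduction cost grows with the number of processes while the per-rank compute shrinks, which is one plausible reason the times in the tables stop improving past roughly 16 processes and rise again at 512.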

Fourth Week:

The MPI implementation was added, and profiling was done on the cluster for the Pthreads and OpenMP versions.

The MPI execution results using the Nehalem cluster queue:

[figure: MPI execution times on the Nehalem queue]

The MPI execution results using the Haswell cluster queue:

[figure: MPI execution times on the Haswell queue]

Execution of MPI using a smaller number of processes:

[figures: MPI execution times for small process counts]

The VTune execution results for Pthreads:

[figure: VTune summary for the Pthreads version]

The hotspot analysis shows that the Euclidean distance function consumes the most CPU time:

[figure: Pthreads hotspots]
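This is the expected hotspot: the distance routine sits in the innermost loop and is evaluated roughly n × k times per iteration. A sketch of what such a kernel typically looks like (the exact name and signature in this repository may differ):

```c
#include <math.h>

/* Euclidean distance between two d-dimensional points. */
double euclidean_distance(const double *a, const double *b, int d)
{
    double sum = 0.0;
    for (int j = 0; j < d; j++) {
        double diff = a[j] - b[j];
        sum += diff * diff;
    }
    /* For nearest-centroid comparisons the sqrt can be dropped,
     * since it does not change which centroid is closest.        */
    return sqrt(sum);
}
```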

The VTune execution results for OpenMP:

[figure: VTune summary for the OpenMP version]

The hotspot analysis shows that the Euclidean distance function consumes the most CPU time for OpenMP as well:

[figure: OpenMP hotspots]

The differences between the OpenMP scheduling policies can also be observed, with auto and static performing best:

[figure: comparison of OpenMP scheduling policies]

Speedups:

[figures: speedup plots]

Efficiencies:

[figures: efficiency plots]
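For reference, the plotted values follow the usual definitions: speedup S(p) = T_serial / T_parallel(p) and efficiency E(p) = S(p) / p. For example, taking the serial Haswell time of 87.167 s and the 16-thread Pthreads Haswell time of 14.375 s from the tables above gives S(16) ≈ 6.06 and E(16) ≈ 0.38.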

Special thanks to Marin Eduard-Constantin and Coman Calin-Alexandru (my classmates) for their creative ideas and wonderful support while making this project.
