Going from a single-threaded program to a parallel version is often a major pain, yet it is frequently necessary in order to take full advantage of modern multi-core processors. Working with threads, locks, and shared data structures is difficult because of the non-deterministic behavior of the scheduler and the complexity of debugging the problems that multi-threading can introduce.
It should be noted that several algorithms in the C++ standard library already have a parallel implementation. If you are familiar with the algorithm library you have probably used std::sort, std::find, and so on; most of these algorithms have a parallel-aware counterpart.
In order to take advantage of the parallel versions of these functions you need OpenMP support, which is enabled simply by compiling with the -fopenmp flag. Another requirement is hardware support for atomic operations: since GCC does not enable atomic operations by default on some common architectures, you may need to add an explicit architecture flag such as -march=native or -march=i686 to activate compiler support for them.
Now that we have parallel support, we need a define to convert the standard (sequential) algorithms into their parallel equivalents. This is done by compiling with the flag -D_GLIBCXX_PARALLEL. There is no guarantee that everything will actually run in parallel, but this flag activates heuristics that decide whether a given algorithm call is executed with its parallel version.
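Putting the flags together, building a sequential and a parallel version of the same source file (called parallel.cpp here, as in the example below) only differs in the -fopenmp and -D_GLIBCXX_PARALLEL flags:

g++ -std=c++11 -march=native -O3 -o test_sequential parallel.cpp
g++ -std=c++11 -fopenmp -D_GLIBCXX_PARALLEL -march=native -O3 -o test_parallel parallel.cpp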
As a simple example, let's try sorting in parallel:
#include <algorithm>
#include <chrono>
#include <iostream>
#include <iterator>
#include <numeric>
#include <random>
#include <vector>

/*
  Compile with :
    g++ -std=c++11 -march=native -O3 -o test_sequential parallel.cpp
    g++ -std=c++11 -fopenmp -D_GLIBCXX_PARALLEL -march=native -O3 -o test_parallel parallel.cpp

  C++11 is required as a minimum standard since C++11 features are used.
  In order for the standard library (libstdc++) to use the parallel
  implementations we need the _GLIBCXX_PARALLEL define plus the OpenMP flag
  (this will make std::sort parallel). The architecture flag is used so that
  the parallel implementation can use architecture-specific atomic/mutex
  operations.
*/

#define __TEST_VECTOR_LENGTH__ 100000000
#define __RANDOM_SEED__ 0

int main()
{
    std::vector<unsigned int> test_vector(__TEST_VECTOR_LENGTH__, 0);
    std::iota(test_vector.begin(), test_vector.end(), 0); /* Assign with 0,1,2,3,... */

    std::srand(__RANDOM_SEED__); /* The seed needs to be fixed in order to compare the runs */
    std::random_shuffle(test_vector.begin(), test_vector.end()); /* Shuffle the values */

    /* Init time */
    std::chrono::time_point<std::chrono::system_clock> start, end;
    start = std::chrono::system_clock::now();

    /* Sorting */
    std::sort(test_vector.begin(), test_vector.end());

    /* Final time */
    end = std::chrono::system_clock::now();
    std::chrono::duration<double> elapsed_seconds = end - start;
    std::cout << "Elapsed time : " << elapsed_seconds.count() << " [s]" << std::endl;

    return 0;
}
When running the test program with big vectors we can clearly see the speed-up (here with a vector of 100’000’000 distinct elements). Running this on an Intel Core i7-4790 (4 physical cores, 8 logical) we get a speed-up of almost 5x!
./test_sequential
Elapsed time : 7.45588 [s]
./test_parallel
Elapsed time : 1.51954 [s]
When running with a billion elements (1’000’000’000) the speed-up ratio is very similar:
./test_sequential
Elapsed time : 83.324 [s]
./test_parallel
Elapsed time : 16.9412 [s]
When running with very few elements (100), the overhead of the parallel version was negligible in this simple program. It may have a bigger impact if the functions are called very often; if that becomes a problem it is possible to make only specific algorithms parallel-aware, as shown below.
./test_sequential
Elapsed time : 5.231e-06 [s]
./test_parallel
Elapsed time : 5.506e-06 [s]
The benchmark, while crude and simplified, shows the gains we can obtain from the parallel algorithms, and how easy it is to switch one algorithm, or all of them, to their parallel-aware counterparts for a speed boost that comes almost for free.
Note : When you cannot or do not want to recompile the entire application with -D_GLIBCXX_PARALLEL, it is possible to make only specific calls parallel by explicitly invoking the parallel versions from the __gnu_parallel namespace.
#include <vector>
#include <parallel/algorithm>

int main()
{
    std::vector<int> v(100);

    // Explicitly force a call to the parallel sort.
    __gnu_parallel::sort(v.begin(), v.end());

    return 0;
}
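The GCC parallel mode documentation also describes the opposite direction: when the whole program is compiled with -D_GLIBCXX_PARALLEL, a specific call can still be forced to run sequentially by appending a __gnu_parallel::sequential_tag() argument. A minimal sketch, assuming GCC's parallel mode headers are available:

#include <parallel/algorithm>
#include <vector>

int main()
{
    std::vector<int> v(100);

    // With -D_GLIBCXX_PARALLEL this call would normally dispatch to the
    // parallel implementation; the extra tag argument forces the
    // sequential version for this particular call only.
    std::sort(v.begin(), v.end(), __gnu_parallel::sequential_tag());

    return 0;
}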
Note : The fixed seed ensures the shuffled vector is always the same (since in my standard library implementation std::random_shuffle() uses std::rand()). If you want a truly portable solution, you may want to use std::shuffle(RandomIt first, RandomIt last, URBG&& g) and pass the random number generator g explicitly.
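As a sketch of that alternative (assuming a fixed-seed std::mt19937 is an acceptable generator), the shuffle no longer depends on std::rand():

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

int main()
{
    std::vector<unsigned int> v(100);
    std::iota(v.begin(), v.end(), 0);

    // Fixed seed: the shuffle is reproducible from run to run without
    // relying on std::srand()/std::rand().
    std::mt19937 gen(0);
    std::shuffle(v.begin(), v.end(), gen);

    return 0;
}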
References :
https://gcc.gnu.org/onlinedocs/libstdc++/manual/parallel_mode_using.html
https://gcc.gnu.org/onlinedocs/gcc-4.8.1/libstdc++/manual/manual/parallel_mode.html