Data Parallelism
Processes and threads are low-level primitives for achieving concurrency and parallelism. They place all the burden of starting up and synchronizing the flows on the developer. Some languages provide higher-level abstractions for processing large collections of data in parallel. We write seemingly serial code to process a collection, but behind the scenes, the task is broken up and distributed across parallel threads.
Haskell
Haskell's Parallel library provides a parallelized alternative to map named parMap. Like map, parMap transforms one list into another list by applying a transformation function to each element. Unlike map, it divides the list into chunks that are processed simultaneously by threads. The call to parMap is nearly identical to a call to regular map:
import Control.Parallel.Strategies

main = do
  -- ...
  let ys = parMap rseq transform xs
  -- ...
The parallel map function accepts three parameters instead of two. The first parameter is an evaluation strategy that controls the timing of the evaluation. Recall that Haskell normally stores expressions in an unevaluated form and doesn't force their evaluation until their value is needed, much like an iterator doesn't iterate until we call next. The rseq strategy forces each transformed element to be evaluated eagerly inside its parallel task, rather than being left as a lazy thunk whose evaluation would happen serially when the result is eventually demanded.
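Filling in the elided pieces gives a small self-contained sketch. The transform function and the list xs below are placeholders invented for illustration; any sufficiently expensive pure function would work just as well:

import Control.Parallel.Strategies (parMap, rseq)

-- A deliberately expensive placeholder transformation.
transform :: Integer -> Integer
transform n = sum [1 .. n]

main :: IO ()
main = do
  let xs = [200000 .. 200999] :: [Integer]
      ys = parMap rseq transform xs  -- each result forced in its own spark
  print (sum ys)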
No speedup will be observed unless we request multiple threads when compiling. This command compiles main.hs to run with eight threads:
ghc -with-rtsopts="-N8" -threaded main.hs
Rust
In Rust, recall that we don't map, filter, or fold over collections directly. Instead we create an iterator. All of the stock iterators walk through the collection in a serial manner, but the Rayon crate provides parallel iterators with threaded implementations of map, filter, fold, and many other operations. This Rust program is equivalent to the Haskell program above:
use rayon::prelude::*;

fn main() {
    // ...
    // collect() drives the lazy parallel iterator and gathers the results.
    let ys: Vec<_> = xs.par_iter().map(|x| transform(x)).collect();
    // ...
}
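Here is a complete, runnable sketch of the same idea. It assumes the rayon crate is declared as a dependency in Cargo.toml; the transform function and the input values are invented for illustration. Like standard iterators, Rayon's parallel iterators are lazy, so a consuming call such as collect is what actually triggers the threaded work:

use rayon::prelude::*;

// A deliberately expensive placeholder transformation.
fn transform(x: &u64) -> u64 {
    (1..=*x).sum()
}

fn main() {
    let xs: Vec<u64> = (200_000..201_000).collect();
    // par_iter splits the work across Rayon's thread pool; collect
    // gathers the results back into an ordered Vec.
    let ys: Vec<u64> = xs.par_iter().map(|x| transform(x)).collect();
    println!("{}", ys.iter().sum::<u64>());
}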
OpenMP
Automatic parallelism is also available in C and C++. Consider this function that sums an arbitrary number of vectors in serial fashion:
void add_vectors(int nvectors, int length, double *vectors[], double sum[]) {
  for (int i = 0; i < length; ++i) {
    sum[i] = 0.0;
    for (int v = 0; v < nvectors; ++v) {
      sum[i] += vectors[v][i];
    }
  }
}
This summation would complete in a fraction of the time if the work were distributed across threads. The OpenMP platform performs this distribution automatically. We tag the serial for-loop with a preprocessor directive, and it becomes a parallel for-loop at compile time:
void add_vectors(int nvectors, int length, double *vectors[], double sum[]) {
  #pragma omp parallel for
  for (int i = 0; i < length; ++i) {
    sum[i] = 0.0;
    for (int v = 0; v < nvectors; ++v) {
      sum[i] += vectors[v][i];
    }
  }
}
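Measuring the two versions requires a driver that allocates and fills the vectors. The original driver isn't shown here, but a hypothetical one might look like the following, with the vector count and length chosen arbitrarily so that the summation takes a measurable amount of time:

#include <stdlib.h>

void add_vectors(int nvectors, int length, double *vectors[], double sum[]);

int main(void) {
  /* Arbitrary sizes chosen for illustration. */
  int nvectors = 100;
  int length = 1000000;

  double **vectors = malloc(nvectors * sizeof(double *));
  double *sum = malloc(length * sizeof(double));
  for (int v = 0; v < nvectors; ++v) {
    vectors[v] = malloc(length * sizeof(double));
    for (int i = 0; i < length; ++i) {
      vectors[v][i] = v + i;
    }
  }

  add_vectors(nvectors, length, vectors, sum);

  for (int v = 0; v < nvectors; ++v) {
    free(vectors[v]);
  }
  free(vectors);
  free(sum);
  return 0;
}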
When compiled on a MacBook Pro with the default number of threads, the parallel version runs in about a quarter of the time of the serial version:
$ clang -oserial -O0 sum.c
$ clang -oparallel -Xpreprocessor -fopenmp -lomp -O0 sum.c
$ time ./serial; time ./parallel
./serial 1.48s user 0.65s system 91% cpu 2.333 total
./parallel 4.92s user 1.57s system 1051% cpu 0.617 total
OpenMP isn't part of the core C and C++ languages. It instead relies on the preprocessor's pragma directive, which is used to customize the compiler's behavior in non-standard ways.
Processes and threads provide what one might call code parallelism. They allow one arbitrary flow of code to run alongside another. Higher-level abstractions like parMap, par_iter, and parallel for-loops provide data parallelism. The parallelism is driven implicitly by data rather than explicitly by code.