Data Parallelism

Dear Computer

Chapter 13: Concurrency and Parallelism

Data Parallelism

Processes and threads are low-level primitives for achieving concurrency and parallelism. They place all the burden of starting up and synchronizing the flows on the developer. Some languages provide higher-level abstractions for processing large collections of data in parallel. We write seemingly serial code to process a collection, but behind the scenes, the task is broken up and distributed across parallel threads.

Haskell

Haskell's Parallel library provides a parallelized alternative to map named parMap. Like map, parMap transforms one list into another list by applying a transformation function to each element. Unlike map, it divides the list into chunks that are processed simultaneously by threads. The call to parMap is nearly identical to regular map:

Haskell
import Concurrent.Parallel.Strategies

main = do
  -- ...
  let ys = parMap rseq transform xs
  -- ...
import Concurrent.Parallel.Strategies

main = do
  -- ...
  let ys = parMap rseq transform xs
  -- ...

The parallel map function accepts three parameters instead of two. The first parameter controls the timing of the evaluation. Recall that Haskell normally stores expressions in an unevaluated form and doesn't force their evaluation until their value is needed, much like an iterator doesn't iterate until we call next. The value of rseq forces the list to be evaluated eagerly, before it is distributed out to the threads, rather than lazily.

No speedup will be observed unless we request multiple threads when compiling. This command compiles main.hs to run with eight threads:

Shell
ghc -with-rtsopts="-N8" -threaded main.hs
ghc -with-rtsopts="-N8" -threaded main.hs

Rust

In Rust, recall that we don't map, filter, or fold over collections directly. Instead we create an iterator. All of the stock iterators walk through the collection in a serial manner, but the Rayon crate provides a parallel iterator that provides threaded implementations of map, filter, fold, and many other operations. This Rust program is equivalent to the Haskell program above:

Rust
use rayon::prelude::*;

fn main() {
  // ...
  let ys = xs.par_iter().map(|x| transform(x));
  // ...
}
use rayon::prelude::*;

fn main() {
  // ...
  let ys = xs.par_iter().map(|x| transform(x));
  // ...
}

OpenMP

Automatic parallelism is also available in C and C++. Consider this function that sums an arbitrary number of vectors in serial fashion:

C
void add_vectors(int nvectors, int length, double *vectors[], double sum[]) {
  for (int i = 0; i < length; ++i) {
    sum[i] = 0.0;
    for (int v = 0; v < nvectors; ++v) {
      sum[i] += vectors[v][i];
    }
  }
}
void add_vectors(int nvectors, int length, double *vectors[], double sum[]) {
  for (int i = 0; i < length; ++i) {
    sum[i] = 0.0;
    for (int v = 0; v < nvectors; ++v) {
      sum[i] += vectors[v][i];
    }
  }
}

This summation would complete in a fraction of the time if the work was distributed across threads. The OpenMP platform performs this distribution automatically. We tag the serial for-loop with a preprocessor macro, and it becomes a parallel for-loop at compile time:

C
void add_vectors(int nvectors, int length, double *vectors[], double sum[]) {
  #pragma omp parallel for
  for (int i = 0; i < length; ++i) {
    sum[i] = 0.0;
    for (int v = 0; v < nvectors; ++v) {
      sum[i] += vectors[v][i];
    }
  }
}
void add_vectors(int nvectors, int length, double *vectors[], double sum[]) {
  #pragma omp parallel for
  for (int i = 0; i < length; ++i) {
    sum[i] = 0.0;
    for (int v = 0; v < nvectors; ++v) {
      sum[i] += vectors[v][i];
    }
  }
}

When compiled on a MacBook Pro with the default number of threads, the parallel version runs in about a quarter of the time of the serial version:

shell
$ clang -oserial -O0 sum.c
$ clang -oparallel -Xpreprocessor -fopenmp -lomp -O0 sum.c
$ time ./serial; time ./parallel
./serial    1.48s user 0.65s system   91% cpu 2.333 total
./parallel  4.92s user 1.57s system 1051% cpu 0.617 total
$ clang -oserial -O0 sum.c
$ clang -oparallel -Xpreprocessor -fopenmp -lomp -O0 sum.c
$ time ./serial; time ./parallel
./serial    1.48s user 0.65s system   91% cpu 2.333 total
./parallel  4.92s user 1.57s system 1051% cpu 0.617 total

OpenMP isn't part of the core C and C++ languages. It instead relies on the preprocessor's pragma directive, which is used to customize the compiler's behavior in non-standard ways.

Processes and threads provide what one might call code parallelism. They allow one arbitrary flow of code to run alongside another. Higher-level abstractions like parMap, par_iter, and parallel for-loops provide data parallelism. The parallelism is driven implicitly by data rather than explicitly by code.

← Message PassingCoroutines →