Dear Computer

Chapter 13: Concurrency and Parallelism

Threads

Threads are a less costly means of achieving concurrency and parallelism than processes. They run within the context of their parent process and share its memory. Because they are not insulated from other threads, do not consume much additional memory, and have low overhead, they are often the preferred vehicle for starting up additional flows of execution.

Ruby

In Ruby, each new flow is an instance of the Thread class, which expects a block of code to execute. In a real-world scenario, the block is some expensive or parallel computation. This script, which does not aspire to real-world significance, spawns four threads and has each print its serial index:

Ruby
threads = (0...4).map do |thread_index|
  Thread.new do
    puts thread_index
  end
end

threads.each(&:join)

The order of the output is indeterminate. Sometimes thread 0 will finish first. Sometimes thread 3. Language runtimes include a scheduler that decides when a thread starts and stops running. The join calls pause the main thread until each thread finishes. Without join, the main thread would terminate as soon as it finished spawning the other threads. When the main thread terminates, so does the larger process, thereby terminating any unfinished threads.
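Ruby also offers a convenient variant of join: Thread#value joins the thread and then returns the value produced by its block. A minimal sketch, collecting each thread's result in spawn order:

```ruby
# Each thread computes a value; Thread#value joins the thread
# and returns its block's result.
threads = (0...4).map { |i| Thread.new { i * i } }
results = threads.map(&:value)

puts results.inspect  # => [0, 1, 4, 9]
```

Because map preserves order, the results come back indexed by thread, even though the threads may finish in any order.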

Threads often use their thread index to stake their claim of the workload in parallel data processing. For example, suppose a graphics program has height rows of pixels to render in an image, and there are n threads sharing the workload. Each thread therefore takes on approximately height / n rows. The share given to thread i might start at row i * height / n.
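As a sketch of this arithmetic, with hypothetical values for the image height and thread count, integer division spreads any remainder across the shares while still covering every row exactly once:

```ruby
# Hypothetical workload: 10 rows of pixels split across 4 threads.
height = 10
nthreads = 4

shares = (0...nthreads).map do |i|
  first = i * height / nthreads        # first row owned by thread i
  last = (i + 1) * height / nthreads   # one past thread i's final row
  (first...last)
end

shares.each_with_index do |rows, i|
  puts "thread #{i}: rows #{rows}"
end
```

No row is dropped or duplicated: thread i's share ends exactly where thread i + 1's begins.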

Java

Java provides a similar Thread abstraction, but it expects the secondary flow to be expressed as a Runnable instead of a block of code. Because the Runnable interface has only a single abstract method, Java's lambda syntax may be used to build the flow. This program is equivalent to the Ruby script above:

Java
Thread[] threads = new Thread[4];

for (int i = 0; i < 4; i++) {
  int threadIndex = i;
  threads[i] = new Thread(() -> {
    System.out.println(threadIndex);
  });
  threads[i].start();
}

for (int i = 0; i < 4; i++) {
  // Retry the join if the main thread is interrupted while waiting.
  while (true) {
    try {
      threads[i].join();
      break;
    } catch (InterruptedException e) {
    }
  }
}

Because Java lambdas may only capture local variables that are final or effectively final, the loop variable i, which is reassigned on each iteration, must be copied into threadIndex.

Rust

In Rust, secondary flows are spawned and joined using functions from the std::thread module. This Rust program emulates the Ruby and Java examples, with the spawn function expecting a parameterless closure for the thread body:

Rust
use std::thread;

fn main() {
  // Collect eagerly; map alone is lazy, and the threads would
  // otherwise not spawn until the join loop consumed the iterator.
  let threads: Vec<_> = (0..4)
    .map(|_| {
      thread::spawn(|| {
        println!("{:?}", thread::current().id());
      })
    })
    .collect();

  for thread in threads {
    thread.join().expect("join failed");
  }
}

Each thread prints its integer ID, which is assigned by the thread manager. IDs are guaranteed to be unique but are not guaranteed to be sequential or stable across separate runs of the program. They are not a suitable serial index. If we want to give a thread an index, we'll need to deal with Rust's ownership model of memory. We'll see how shortly.

Non-parallel Threads

Some languages, including Python and Ruby, give the appearance of having multiple threads, but the runtime ensures that only a single thread has the CPU at any given time. A global interpreter lock (GIL) enforces this one-at-a-time execution: the thread holding the GIL has the CPU. These languages use a GIL to avoid the dangers and costs of truly parallel threading.

What impact does the GIL have on performance? Consider first this non-threaded Ruby code, which sums 300 vectors of 100,000 elements each in serial fashion:

Ruby
nvectors = 300
length = 100000

vectors = Array.new(nvectors) { Array.new(length) {0} }
sum = Array.new(length)

for i in 0.step(sum.size - 1, 1)
  sum[i] = 0
  for v in 0...vectors.size
    sum[i] += vectors[v][i]
  end
end

But why burden just one thread with all the work? If there are \(n\) threads, then thread \(i\) can sum columns \(i\), \(i + n\), \(i + 2n\), and so on. This script divides up the work:

Ruby
nthreads = ARGV[0].to_i
nvectors = 300
length = 100000

vectors = Array.new(nvectors) { Array.new(length) {0} }
sum = Array.new(length)

threads = (0...nthreads).map do |ithread|
  Thread.new do
    for i in ithread.step(sum.size - 1, nthreads)
      sum[i] = 0
      for v in 0...vectors.size
        sum[i] += vectors[v][i]
      end
    end
  end
end

threads.each { |thread| thread.join }

When we time both the serial and parallel versions of this program, we find that the threaded version is slower than the serial version:

Shell
$ time ./sum_serial.rb; time ./sum_parallel.rb 4
./sum_serial.rb      3.55s user 0.07s system 99% cpu 3.628 total
./sum_parallel.rb 4  4.31s user 0.07s system 99% cpu 4.378 total

The threads do not execute in parallel because of the GIL, and creating and cleaning up after them adds overhead. Many systems programmers dismiss these languages because they lack true parallel threads.

Python's lack of parallel threading seems especially surprising given its high rate of adoption by developers who run expensive machine learning tasks, like training a neural network. To achieve true parallelism, machine learning libraries are usually implemented in a lower-level language that isn't bound by the GIL. The Python functions are just wrappers that call down into low-level routines that run as native machine code or on a highly parallel graphics card.
