Threads
Threads are a less costly means of achieving concurrency and parallelism than processes. They run within the context of their parent process and share its memory. Because they are not insulated from other threads, do not consume much additional memory, and have low overhead, they are often the preferred vehicle for starting up additional flows of execution.
Ruby
In Ruby, each new flow is an instance of the Thread
class, which expects a block of code to execute. In a real world scenario, the block is some expensive or parallel computation. This script, which does not aspire to real world significance, spawns four threads and has each print its serial index:
threads = (0...4).map do |threadIndex|
Thread.new do
puts threadIndex
end
end
threads.each { |thread| thread.join }
threads = (0...4).map do |threadIndex| Thread.new do puts threadIndex end end threads.each { |thread| thread.join }
The order of the output is indeterminate. Sometimes thread 0 will finish first. Sometimes thread 3. Language runtimes include a scheduler that decides when a thread starts and stops running. The join
calls pause the main thread until each thread finishes. Without join
, the main thread would terminate as soon as it finished spawning the other threads. When the main thread terminates, so does the larger process, thereby terminating any unfinished threads.
Threads often use their thread index to stake their claim of the workload in parallel data processing. For example, suppose a graphics program has height
rows of pixels to render in an image, and there are n
threads sharing the workload. Each thread therefore takes on approximately height / n
rows. The share given to thread i
might start at row i * height / n
.
Java
Java provides a similar Thread
abstraction, but expects the secondary flow to be expressed as a Runnable
instead of a block of code. Because the Runnable
interface has only a single method, Java's lambda syntax may be used to build the flow. This program is equivalent to the Ruby script above:
Thread[] threads = new Thread[4];
for (int i = 0; i < 4; i++) {
int threadIndex = i;
threads[i] = new Thread(() -> {
System.out.println(threadIndex);
});
threads[i].start();
}
for (int i = 0; i < 4; ++i) {
while (true) {
try {
threads[i].join();
break;
} catch (InterruptedException e) {
}
}
}
Thread[] threads = new Thread[4]; for (int i = 0; i < 4; i++) { int threadIndex = i; threads[i] = new Thread(() -> { System.out.println(threadIndex); }); threads[i].start(); } for (int i = 0; i < 4; ++i) { while (true) { try { threads[i].join(); break; } catch (InterruptedException e) { } } }
Because Java lambdas are not allowed to close over variables that get reassigned, i
must be copied over to threadIndex
.
Rust
In Rust, secondary flows are spawned and joined using methods from the std::thread
module. This Rust program emulates the Ruby and Java examples, with the spawn
method expecting a parameterless closure for the thread body:
use std::thread;
fn main() {
let threads = (0..4).map(|_| {
thread::spawn(|| {
println!("{:?}", thread::current().id());
})
});
for thread in threads {
thread.join().expect("join failed");
}
}
use std::thread; fn main() { let threads = (0..4).map(|_| { thread::spawn(|| { println!("{:?}", thread::current().id()); }) }); for thread in threads { thread.join().expect("join failed"); } }
Each thread prints its integer ID, which is assigned by the thread manager. IDs are guaranteed to be unique but are not guaranteed to be sequential or stable across separate runs of the program. They are not a suitable serial index. If we want to give a thread an index, we'll need to deal with Rust's ownership model of memory. We'll see how shortly.
Non-parallel Threads
Some languages, including Python and Ruby, give the appearance of having multiple threads, but the runtime ensures that only a single thread has the CPU at a given time. A global interpreter lock (GIL) is used as to enforce one-at-a-time execution. The thread with the GIL has the CPU. These languages use a GIL to avoid the dangers and costs of truly parallel threading.
What impact does the GIL have on performance? Consider first this non-threaded Ruby code which sums up 300 100,000-element vectors in serial fashion:
nvectors = 300
length = 100000
vectors = Array.new(nvectors) { Array.new(length) {0} }
sum = Array.new(length)
for i in 0.step(sum.size - 1, 1)
sum[i] = 0
for v in 0...vectors.size
sum[i] += vectors[v][i]
end
end
nvectors = 300 length = 100000 vectors = Array.new(nvectors) { Array.new(length) {0} } sum = Array.new(length) for i in 0.step(sum.size - 1, 1) sum[i] = 0 for v in 0...vectors.size sum[i] += vectors[v][i] end end
But why burden just one thread with all the work. If there are \(n\) threads, then thread \(i\) can sum columns \(i\), \(i + n\), \(i + 2n\), and so on. This script divides up the work:
nthreads = ARGV[0].to_i
nvectors = 300
length = 100000
vectors = Array.new(nvectors) { Array.new(length) {0} }
sum = Array.new(length)
threads = (0...nthreads).map do |ithread|
Thread.new do
for i in ithread.step(sum.size - 1, nthreads)
sum[i] = 0
for v in 0...vectors.size
sum[i] += vectors[v][i]
end
end
end
end
threads.each { |thread| thread.join }
nthreads = ARGV[0].to_i nvectors = 300 length = 100000 vectors = Array.new(nvectors) { Array.new(length) {0} } sum = Array.new(length) threads = (0...nthreads).map do |ithread| Thread.new do for i in ithread.step(sum.size - 1, nthreads) sum[i] = 0 for v in 0...vectors.size sum[i] += vectors[v][i] end end end end threads.each { |thread| thread.join }
When we time both the serial and parallel versions of these program, we find that the threaded version is slower than the serial version:
$ time ./sum_serial.rb; time ./sum_parallel.rb 4
./sum_serial.rb 3.55s user 0.07s system 99% cpu 3.628 total
./sum_parallel.rb 4 4.31s user 0.07s system 99% cpu 4.378 total
$ time ./sum_serial.rb; time ./sum_parallel.rb 4 ./sum_serial.rb 3.55s user 0.07s system 99% cpu 3.628 total ./sum_parallel.rb 4 4.31s user 0.07s system 99% cpu 4.378 total
The threads do not execute in parallel because of the GIL, and creating and cleaning up after them adds overhead. Many systems programmers dismiss these languages because they lack true parallel threads.
Python's lack of parallel threading seems especially surprising given its high rate of adoption by developers who run expensive machine learning tasks, like training a neural network. To achieve true parallelism, machine learning libraries are usually implemented in a lower-level language that isn't bound by the GIL. The Python functions are just wrappers that call down into the low-level functions that run in native machine code or on a highly-parallel graphics card.