Day 12: Parallel Processing

The previous post looked at metaprogramming in Julia, considering how to write code that will generate or modify other code. Today’s post considers a somewhat less esoteric, yet powerful topic: Parallel processing.

As opposed to many other languages, where parallel computing is bolted on as an afterthought, Julia was designed from the start with parallel computing in mind. It has a number of native features which lend themselves to efficient implementation of parallel algorithms. It also has packages which facilitate cluster computing (using MPI, for example). We won’t be looking at those, but focusing instead on coroutines, generic parallel processing and parallel loops.

Coroutines

Coroutines are not strictly parallel processing (in the sense of “many tasks running at the same time”) but they provide a lightweight mechanism for having multiple tasks defined (if not active) at once. According to Donald Knuth, coroutines are generalised subroutines (with which we are probably all familiar).

Under these conditions each module may be made into a _coroutine_; that is, it may be coded as an autonomous program which communicates with adjacent modules as if they were input or output subroutines. Thus, coroutines are subroutines all at the same level, each acting as if it were the master program when in fact there is no master program. This definition does not place a bound on the number of inputs and outputs a coroutine may have.

Conway, Design of a Separable Transition-Diagram Compiler, 1963.

Coroutines are implemented using produce() and consume(). In a moment you’ll see why those names are appropriate. To illustrate we’ll define a function which generates elements from the Lucas sequence. For reference, the first few terms in the sequence are 2, 1, 3, 4, 7, … If you know about Python’s generators then you’ll find the code below rather familiar.

function lucas_producer(n)
  a, b = (2, 1)
  for i = 1:n
    produce(a)
    a, b = (b, a + b)
  end
end
lucas_producer (generic function with 1 method)

This function is then wrapped in a Task, which has state :runnable.

lucas_task = Task(() -> lucas_producer(10))
Task (runnable) @0x0000000005b5ee60
lucas_task.state
:runnable

Now we’re ready to start consuming data from the Task. Data elements can be retrieved individually or via a loop (in which case the Task acts like an iterable object and no consume() is required).

consume(lucas_task)
2
consume(lucas_task)
1
consume(lucas_task)
3
for n in lucas_task
  println(n)
end
4
7
11
18
29
47
76

Between invocations the Task is effectively asleep. The task temporarily springs to life every time data is requested, before becoming dormant once more.

It’s possible to simultaneously set up an arbitrary number of coroutine tasks.

Parallel Processing

Coroutines don’t really feel like “parallel” processing because they are not working simultaneously. However it’s rather straightforward to get Julia to metaphorically juggle many balls at once. The first thing that you’ll need to do is launch the interpreter with multiple worker processes.

julia -p 4

There’s always one more process than specified on the command line (we specified the number of worker processes; add one for the master process).

nprocs()
5
workers() # Identifiers for the worker processes.
4-element Array{Int64,1}:
 2
 3
 4
 5

We can launch a job on one of the workers using remotecall().

W1 = workers()[1];
P1 = remotecall(W1, x -> factorial(x), 20)
RemoteRef(2,1,6)
fetch(P1)
2432902008176640000

@spawn and @spawnat are macros which launch jobs on individual workers. The @everywhere macro executes code across all processes (including the master).

@everywhere p = 5
@everywhere println(@sprintf("ID %d: %f %d", myid(), rand(), p))
ID 1: 0.686332 5
        From worker 4: ID 4: 0.107924 5
        From worker 5: ID 5: 0.136019 5
        From worker 2: ID 2: 0.145561 5
        From worker 3: ID 3: 0.670885 5

Parallel Loop and Map

To illustrate how easy it is to set up parallel loops, let’s first consider a simple serial implementation of a Monte Carlo technique to estimate π.

function findpi(n)
  inside = 0
    for i = 1:n
      x, y = rand(2)
      if (x^2 + y^2 <= 1)
        inside +=1
      end
    end
  4 * inside / n
end
findpi (generic function with 1 method)

The quality of the result as well as the execution time (and memory consumption!) depend directly on the number of samples.

@time findpi(10000)
elapsed time: 0.051982841 seconds (1690648 bytes allocated, 81.54% gc time)
3.14
@time findpi(100000000)
elapsed time: 9.533291187 seconds (8800000096 bytes allocated, 42.97% gc time)
3.1416662
@time findpi(1000000000)
elapsed time: 95.436185105 seconds (88000002112 bytes allocated, 43.14% gc time)
3.141605352

The parallel version is implemented using the @parallel macro, which takes a reduction operator (in this case +) as its first argument.

function parallel_findpi(n)
  inside = @parallel (+) for i = 1:n
    x, y = rand(2)
    x^2 + y^2 <= 1 ? 1 : 0
  end
  4 * inside / n
end
parallel_findpi (generic function with 1 method)

There is some significant overhead associated with setting up the parallel jobs, so that the parallel version actually performs worse for a small number of samples. But when you run sufficient samples the speedup becomes readily apparent.

@time parallel_findpi(10000)
elapsed time: 0.45212316 seconds (9731736 bytes allocated)
3.1724
@time parallel_findpi(100000000)
elapsed time: 3.870065625 seconds (154696 bytes allocated)
3.14154744
@time parallel_findpi(1000000000)
elapsed time: 39.029650365 seconds (151080 bytes allocated)
3.141653704

For reference, these results were achieved with 4 worker processes on a DELL laptop with the following CPU:

lshw | grep product | head -n 1
product: Intel(R) Core(TM) i7-4600M CPU @ 2.90GHz

More information on parallel computing facilities in Julia can be found in the documentation. As usual the code for today’s Julia journey can be found on GitHub.