Possible to use numba.guvectorize to emulate parallel forall / prange?


As a user of Python for data analysis and numerical calculations, rather than a real "coder", I had been missing a really low-overhead way of distributing embarrassingly parallel loop calculations across several cores. As I learned, there used to be the prange construct in Numba, but it was abandoned because of "instability and performance issues".

Playing with the newly open-sourced @guvectorize decorator, I found a way to use it as a virtually no-overhead emulation of the late prange's functionality.
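In a nutshell, here is a minimal sketch of the trick (with illustrative names, not the full recipe from my answer below): pass the loop index as an extra scalar argument to the gufunc; broadcasting an array of indices against it makes NumPy dispatch one call of the body per index, and target='parallel' spreads those calls over threads.

import numpy as np
from numba import guvectorize, float64, int64

# The loop index becomes an extra scalar input; broadcasting over an array
# of indices yields one gufunc call per index, run in parallel.
@guvectorize([(float64[:], int64[:], float64[:])], '(n),()->()',
             nopython=True, target='parallel')
def loop_body(x, loop_index, out):
    i = loop_index[0]
    out[0] = 2.0 * x[i]        # stand-in for a real per-index computation

x = np.random.rand(8)
out = np.zeros(8)
loop_body(x, np.arange(8), out)   # one "iteration" per index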

I am very happy to have this tool at hand now, thanks to the guys at Continuum Analytics, and did not find anything on the web explicitly mentioning this use of @guvectorize. Although it may be trivial to people who used NumbaPro earlier, I'm posting this for all those fellow non-coders out there (see my answer to this "question").

Answer:
Consider the example below: a two-level nested for loop, whose core performs a numerical calculation involving two input arrays and a function of the loop indices, is executed in four different ways. Each variant is timed with IPython's %timeit magic:

  1. naive for loop, compiled using numba.jit
  2. forall-like construct using numba.guvectorize, executed in a single thread (target='cpu')
  3. forall-like construct using numba.guvectorize, executed in as many threads as there are CPU "cores" (in my case hyperthreads) (target='parallel')
  4. same as 3., but calling the "guvectorized" forall with the sequence of "parallel" loop indices randomly permuted

The last variant is included because (in this particular example) the inner loop's range depends on the value of the outer loop's index. I don't know exactly how the dispatch of gufunc calls is organized inside NumPy, but it appears as if randomizing the "parallel" loop indices achieves slightly better load balancing: since the inner loop gets shorter as the outer index grows, consecutive indices carry systematically shrinking amounts of work, and a random permutation spreads heavy and light iterations more evenly across threads (see the rough illustration below).
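As a back-of-the-envelope check of that intuition (assuming, and this is only an assumption about Numba's internals, that the parallel target hands each thread one contiguous, equal-length chunk of the gufunc's outer loop), one can compare the total inner-loop work per chunk for in-order versus permuted indices:

import numpy as np

n, threads = 20000, 4
work = n - np.arange(n)          # inner-loop length for outer index i

# in-order indices: contiguous chunks carry very unequal total work
print([int(c.sum()) for c in np.array_split(work, threads)])

# randomly permuted indices: chunks carry roughly equal total work
perm = np.random.permutation(n)
print([int(c.sum()) for c in np.array_split(work[perm], threads)])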

On my (slow) machine (1st gen core i5, 2 cores, 4 hyperthreads) I get the timings:

1. naive @jit loop:                       1 loop, best of 3: 8.19 s per loop
2. guvectorize, target='cpu':             1 loop, best of 3: 8.27 s per loop
3. guvectorize, target='parallel':        1 loop, best of 3: 4.6 s per loop
4. target='parallel', permuted indices:   1 loop, best of 3: 3.46 s per loop

Note: I'd be interested to hear whether this recipe readily applies to target="gpu" (it should, but I don't have access to a suitable graphics card right now), and what speedup it yields. Please post!
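For whoever wants to try, here is a minimal, untested sketch of what a GPU variant might look like. Note that in open-source Numba the GPU target is spelled target='cuda' (the "gpu" spelling was NumbaPro's); I've also swapped np.sin for math.sin, since math-module functions are what CUDA device code supports, and I zero the output inside the body because I'm not sure the device output memory arrives zero-initialized. Treat every detail here as an assumption until someone runs it:

import math
from numba import guvectorize, float64, int64

@guvectorize([(float64[:], float64[:], int64[:], float64[:])],
             '(n),(n),()->()', target='cuda')
def forall_loop_body_gpu(some_input_array, another_input_array, loop_index, result):
    i = loop_index[0]
    result[0] = 0.0   # device output memory may not be zero-initialized
    for k in range(some_input_array.shape[0] - i):
        result[0] += some_input_array[k+i] * another_input_array[k] * math.sin(0.001 * (k+i))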

And here's the example:

import numpy as np
from numba import jit, guvectorize, float64, int64

@jit(nopython=True)
def naive_for_loop(some_input_array, another_input_array, result):
    for i in range(result.shape[0]):
        for k in range(some_input_array.shape[0] - i):
            result[i] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i)) 

@guvectorize([(float64[:], float64[:], int64[:], float64[:])], '(n),(n),()->()', nopython=True, target='parallel')
def forall_loop_body_parallel(some_input_array, another_input_array, loop_index, result):
    i = loop_index[0]       # just a shorthand
    # do some nontrivial calculation involving elements from the input arrays and the loop index
    for k in range(some_input_array.shape[0] - i):
        result[0] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i)) 

@guvectorize([(float64[:], float64[:], int64[:], float64[:])], '(n),(n),()->()', nopython=True, target='cpu')
def forall_loop_body_cpu(some_input_array, another_input_array, loop_index, result):
    i = loop_index[0]       # just a shorthand
    # do some nontrivial calculation involving elements from the input arrays and the loop index
    for k in range(some_input_array.shape[0] - i):
        result[0] += some_input_array[k+i] * another_input_array[k] * np.sin(0.001 * (k+i)) 

arg_size = 20000

input_array_1 = np.random.rand(arg_size)
input_array_2 = np.random.rand(arg_size)
result_array = np.zeros_like(input_array_1)

# do single-threaded naive nested for loop
# reset result_array inside %timeit call 
%timeit -r 3 result_array[:] = 0.0; naive_for_loop(input_array_1, input_array_2, result_array)
result_1 = result_array.copy()

# do single-threaded forall loop (loop indices in-order)
# reset result_array inside %timeit call 
loop_indices = np.arange(arg_size)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_cpu(input_array_1, input_array_2, loop_indices, result_array)
result_2 = result_array.copy()

# do multi-threaded forall loop (loop indices in-order)
# reset result_array inside %timeit call 
loop_indices = np.arange(arg_size)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_parallel(input_array_1, input_array_2, loop_indices, result_array)
result_3 = result_array.copy()

# do forall loop (loop indices scrambled for better load balancing)
# reset result_array inside %timeit call 
loop_indices_scrambled = np.random.permutation(arg_size)
loop_indices_unscrambled = np.argsort(loop_indices_scrambled)
%timeit -r 3 result_array[:] = 0.0; forall_loop_body_parallel(input_array_1, input_array_2, loop_indices_scrambled, result_array)
result_4 = result_array[loop_indices_unscrambled].copy()


# check validity: each element should be computed by the same sequence of
# floating-point operations in every variant, so exact equality is expected
print(np.all(result_1 == result_2))
print(np.all(result_1 == result_3))
print(np.all(result_1 == result_4))