I'm trying to optimise numpy.packbits:
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def _numba_pack(arr, div, su):
    for i in prange(div):
        s = 0
        for j in range(i*8, i*8 + 8):
            s = 2*s + arr[j]
        su[i] = s

def numba_packbits(arr):
    div, mod = np.divmod(arr.size, 8)
    su = np.zeros(div + (mod > 0), dtype=np.uint8)
    _numba_pack(arr[:div*8], div, su)
    if mod > 0:
        su[-1] = sum(x*y for x, y in zip(arr[div*8:], (128, 64, 32, 16, 8, 4, 2, 1)))
    return su
>>> X = np.random.randint(2, size=99, dtype=bool)
>>> print(numba_packbits(X))
[ 75 24 79 61 209 189 203 187 47 226 170 61 0]
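For reference, the weight tuple (128, 64, ..., 1) used for the trailing bits matches np.packbits' default MSB-first bit order, which can be checked on a single byte (a small sanity check, not part of the benchmark):

```python
import numpy as np

# np.packbits maps the first boolean to the most significant bit (bit 7).
bits = np.array([1, 0, 0, 1, 0, 1, 1, 0], dtype=bool)
packed = np.packbits(bits)[0]
manual = sum(int(b) << (7 - i) for i, b in enumerate(bits))
print(packed, manual)  # both 150 (0b10010110)
```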
It appears 2 - 2.5 times slower than np.packbits(X). How is this implemented internally in numpy? Could this be improved in numba?
I work with numpy == 1.21.2 and numba == 0.53.1, installed via conda install. My platform is:
Results:
import benchit
from numpy import packbits
%matplotlib inline
benchit.setparams(rep=5)
sizes = [100000, 300000, 1000000, 3000000, 10000000, 30000000]
N = sizes[-1]
arr = np.random.randint(2, size=N, dtype=bool)
fns = [numba_packbits, packbits]
in_ = {s/1000000: (arr[:s], ) for s in sizes}
t = benchit.timings(fns, in_, multivar=True, input_name='Millions of bits')
t.plot(logx=True, figsize=(12, 6), fontsize=14)
Update
With the response of Jérôme:
@njit('void(bool_[::1], uint8[::1], int_)', inline='never')
def _numba_pack_x64_byJérôme(arr, su, pos):
    for i in range(64):
        j = i * 8
        su[i] = (arr[j]<<7)|(arr[j+1]<<6)|(arr[j+2]<<5)|(arr[j+3]<<4)|(arr[j+4]<<3)|(arr[j+5]<<2)|(arr[j+6]<<1)|arr[j+7]

@njit(parallel=True)
def _numba_pack_byJérôme(arr, div, su):
    # each parallel iteration packs one 512-bit chunk into 64 output bytes
    for i in prange(div//64):
        _numba_pack_x64_byJérôme(arr[i*512:(i+1)*512], su[i*64:(i+1)*64], i)
    for i in range(div//64*64, div):
        j = i * 8
        su[i] = (arr[j]<<7)|(arr[j+1]<<6)|(arr[j+2]<<5)|(arr[j+3]<<4)|(arr[j+4]<<3)|(arr[j+5]<<2)|(arr[j+6]<<1)|arr[j+7]
def numba_packbits_byJérôme(arr):
    div, mod = np.divmod(arr.size, 8)
    su = np.zeros(div + (mod > 0), dtype=np.uint8)
    _numba_pack_byJérôme(arr[:div*8], div, su)
    if mod > 0:
        su[-1] = sum(x*y for x, y in zip(arr[div*8:], (128, 64, 32, 16, 8, 4, 2, 1)))
    return su
Usage:
>>> print(numba_packbits_byJérôme(X))
[ 75 24 79 61 209 189 203 187 47 226 170 61 0]
Results:
There are several issues with the Numba implementation. One of them is that parallel loops break the constant-propagation optimization in llvmlite (the JIT-compiler back-end used by Numba). This causes critical information, like array strides, not to be propagated, resulting in a slow scalar implementation instead of a SIMD one, plus additional unneeded instructions to compute the offsets. Such an issue can also be seen in C code. Numpy added specific macros to help compilers automatically vectorize the code (i.e. use SIMD instructions) when the stride of the working dimension is exactly 1.
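To make the stride point concrete (a small illustration of my own): a contiguous boolean array has a unit stride of one byte, while a strided view does not, and only the unit-stride case qualifies for this kind of vectorization:

```python
import numpy as np

a = np.zeros(16, dtype=bool)
print(a.strides)       # (1,)  - contiguous: one byte between elements
print(a[::2].strides)  # (2,)  - non-unit stride: no SIMD fast path
```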
A solution to overcome the constant-propagation issue is to call another Numba function. This function must not be inlined. Its signature should be provided manually, so the compiler knows at compilation time that the stride of the array is 1 and can generate faster code. Finally, the function should work on fixed-size chunks, because function calls are expensive and the compiler can then vectorize the code. Unrolling the loop with shifts also produces faster code (although it is uglier). Here is an example:
Benchmark
Here are performance results on my 6-core machine (i5-9600KF) with a billion random items as input:
This new implementation is 4 times faster than the initial parallel implementation and about 3 times faster than Numpy.
Delving into the generated assembly code
When parallel=False is set and prange is replaced with range, the following assembly code is generated on my Intel processor supporting AVX2:

The code is not very good because it uses many unneeded instructions (like SIMD comparison instructions, probably due to implicit casts from boolean types), a lot of registers are temporarily stored (register spilling), and it uses 128-bit AVX vectors instead of the 256-bit ones supported on my machine. That being said, the code is vectorized, and each loop iteration writes 16 bytes at once without any conditional branches (except the loop's own), so the resulting performance is not so bad.
In fact, the Numpy code is much smaller and more efficient. This is why it is about 2 times faster than the sequential Numba code on my machine with big inputs. Here is the hot assembly loop:
It reads values in chunks of 8 bytes and computes them partially using 128-bit SSE instructions. 2 bytes are written per iteration. That being said, it is not optimal either, because 256-bit SIMD instructions are not used, and I think the code can be optimized further.
When the initial parallel code is used, here is the assembly code of the hot loop:
The above code is mostly not vectorized and is quite inefficient. It uses a lot of instructions (including quite slow ones like setne/cmovlq/cmpb to do many conditional stores) per iteration, just to write 1 byte at a time. Numpy executes about 8 times fewer instructions for the same amount of written bytes. The inefficiency of this code is mitigated by the use of multiple threads. In the end, the parallel version can be a bit faster on machines with many cores (e.g. >= 6).

The improved implementation provided at the beginning of this answer generates code similar to that of the sequential implementation above, but using multiple threads (so still far from optimal, but much better).