Nonzero for integers

256 Views Asked by At

My problem is as follows. I am generating a random bitstring of size n, and need to iterate over the indices for which the random bit is 1. For example, if my random bitstring ends up being 00101, I want to retrieve [2, 4] (on which I will iterate over). The goal is to do so in the fastest way possible with Python/NumPy.

One of the fast methods is to use NumPy and do

bitstring = np.random.randint(2, size=(n,))
l = np.nonzero(bitstring)[0]

The advantage with np.non_zero is that it finds indices of bits set to 1 much faster than if one iterates (with a for loop) over each bit and checks if it is set to 1.

Now, NumPy can generate a random bitstring faster via np.random.bit_generator.randbits(n). The problem is that it returns it as an integer, on which I cannot use np.nonzero anymore. I saw that for integers one can get the count of bits set to 1 in an integer x by using x.bit_count(), however there is no function to get the indices where bits are set to 1. So currently, I have to resort to a slow for loop, hence losing the initial speedup given by np.random.bit_generator.randbits(n).

How would you do something similar to (and as fast as) np.non_zero, but on integers instead?

Thank you in advance for your suggestions!

4

There are 4 best solutions below

2
On

A minor optimisation to your code would be to use the new style random interface and generate bools rather than 64bit integers

rng = np.random.default_rng()

def original(n):
    bitstring = rng.integers(2, size=n, dtype=bool)
    return np.nonzero(bitstring)[0]

this causes it to take ~24 µs on my laptop, tested n upto 128.

I've previously noticed that getting a Numpy to generate a permutation is particularly fast, hence my comment above. Leading to:

def perm(n):
    a = rng.permutation(n)
    return a[:rng.binomial(n, 0.5)]

which takes between ~7 µs and ~10 µs depending on n. It also returns the indicies out of order, not sure if that's an issue for you. If your n isn't changing much, you could also swap to using rng.shuffle on an pre-allocated array, something like:

n = 32
a = np.arange(n)

def shuffle():
    rng.shuffle(a)
    return a[:rng.binomial(n, 0.5)]

which saves a couple of microseconds.

0
On

you could convert the number you get with randbits(n) to a numpy.ndarray. depending on the size of n the compute time of the conversion should be faster than the loop.

n = 10
l = np.random.bit_generator.randbits(n) # gives you the int 616
l_string = f'{l:0{n}b}' # gives you a string representation of the int in length n 1001101000
l_nparray = np.array(list(l_string), dtype=int) # gives you the numpy.ndarray like np.random.randint [1 0 0 1 1 0 1 0 0 0]
3
On

After some interesting proposals, I decided to do some benchmarking to understand how the running times grow as a function of n. The functions tested are the following:

def func1(n):
    bit_array = np.random.randint(2, size=n)
    return np.nonzero(bit_array)[0]

def func2(n):
    bit_int = np.random.bit_generator.randbits(n)
    a = np.zeros(bit_int.bit_count())
    i = 0
    for j in range(n):
        if 1 & (bit_int >> j):
            a[i] = j
            i += 1
    return a

def func3(n):
    bit_string = format(np.random.bit_generator.randbits(n), f'0{n}b')
    bit_array = np.array(list(bit_string), dtype=int)
    return np.nonzero(bit_array)[0]

def func4(n):
    rng = np.random.default_rng()
    a = rng.permutation(n)
    return a[:rng.binomial(n, 0.5)]

def func5(n):
    a = np.arange(n)
    rng.shuffle(a)
    return a[:rng.binomial(n, 0.5)]

I used timeit to do the benchmark, looping 1000 over a statement each time and averaging over 10 runs. The value of n ranges from 2 to 65536, growing as powers of 2. The average running time is plotted and error bars correspondond to the standard deviation.

Error plot with average running times

For solutions generating a bitstring, the simple func1 actually performs best among them whenever n is large enough (n>32). We can see that for low values of n (n< 16), using the randbits solution with the for loop (func2) is fastest, because the loop is not costly yet. However as n becomes larger, this becomes the worst solution, because all the time is spent in the for loop. This is why having a nonzero for integers would bring the best of both world and hopefully give a faster solution. We can observe that func3, which does a conversion in order to use nonzero after using randbits spends too long doing the conversion.

For implementations which exploit the binomial distribution (see Sam Mason's answer), we see that the use of shuffle (func5) instead of permutation (func4) can reduce the time by a bit, but overall they have similar performance.

Considering all values of n (that were tested), the solution given by Sam Mason which employs a binomial distribution together with shuffling (func5) is so far the most performant in terms of running time. Let's see if this can be improved!

2
On

I had a play with Cython to see how much difference it would make. I ended up with quite a lot of code and only ~5x better runtime performance:

from cpython.pycapsule cimport PyCapsule_IsValid, PyCapsule_GetPointer

import numpy as np
cimport numpy as np
cimport cython

from numpy.random cimport bitgen_t

np.import_array()

DTYPE = np.uint32
ctypedef np.uint32_t DTYPE_t

cdef extern int __builtin_popcountl(unsigned long) nogil
cdef extern int __builtin_ffsl(unsigned long) nogil

cdef const char *bgen_capsule_name = "BitGenerator"

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
cdef size_t generate_bits(object bitgen, np.uint64_t *state, Py_ssize_t state_len, np.uint64_t last_mask):
    cdef Py_ssize_t i
    cdef size_t nset
    cdef bitgen_t *rng

    capsule = bitgen.capsule
    if not PyCapsule_IsValid(capsule, bgen_capsule_name):
        raise ValueError("Expecting Numpy BitGenerator Capsule")
    rng = <bitgen_t *> PyCapsule_GetPointer(capsule, bgen_capsule_name)

    with bitgen.lock:
        nset = 0
        for i in range(state_len-1):
            state[i] = rng.next_uint64(rng.state)
            nset += __builtin_popcountl(state[i])

        i = state_len-1
        state[i] = rng.next_uint64(rng.state) & last_mask
        nset += __builtin_popcountl(state[i])
    
    return nset

cdef size_t write_setbits(DTYPE_t *result, DTYPE_t off, np.uint64_t state) nogil:
    cdef size_t j
    cdef int k
    j = 0
    while state:
        # find first set bit returns zero when nothing is set
        k = __builtin_ffsl(state) - 1
        # clear out bit k
        state &= ~(1ul<<k)
        # record in output
        result[j] = off + k
        j += 1
    return j

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
def rint(bitgen, unsigned int n):
    cdef Py_ssize_t i, j, nset
    cdef np.uint64_t[::1] state
    cdef DTYPE_t[::1] result

    state = np.empty((n + 63) // 64, dtype=np.uint64)

    nset = generate_bits(bitgen, &state[0], len(state), (1ul << (n & 63)) - 1)

    pyresult = np.empty(nset, dtype=DTYPE)
    result = pyresult

    j = 0
    for i in range(len(state)):
        j += write_setbits(&result[j], i * 64, state[i])

    return pyresult

The above code is easy to use via the Cython Jupyter extension.

Comparing this to slightly tidied up versions of the OP's code can be done via:

import random
import timeit

import numpy as np
import matplotlib.pyplot as plt

bitgen = np.random.PCG64()

def func1(n):
    # bool type is a bit faster
    bit_array = np.random.randint(2, size=n, dtype=bool)
    return np.nonzero(bit_array)[0]

def func2(n):
    # OPs variant ends up using a CSPRNG which is slower
    bit_int = random.getrandbits(n)
    # this is much easier than using numpy arrays
    return [i for i in range(n) if 1 & (bit_int >> i)]

def func3(n):
    bit_string = format(random.getrandbits(n), f'0{n}b')
    bit_array = np.array(list(bit_string), dtype='int8')
    return np.nonzero(bit_array)[0]

def func4(n):
    # shuffle variant is mostly the same
    # plot already busy enough
    a = np.random.permutation(n)
    return a[:np.random.binomial(n, 0.5)]

def func_cython(n):
    return rint(bitgen, n)

result = {}
niter = [2**i for i in range(1, 17)]
for name in 'func1 func2 func3 func4 func_cython'.split():
    result[name] = res = []
    for n in niter:
        t = timeit.Timer(f"fn({n})", f"fn = {name}", globals=globals())
        nit, dt = t.autorange()
        res.append(dt / nit)

plt.loglog()
for name, times in result.items():
    plt.plot(niter, np.array(times) * 1000, '.-', label=name)
plt.legend()

Which might produce output like:

tidied up matplotlib output

Note that in order to reduce variance it's helpful to turn off CPU frequency scaling and turn off turbo modes. The Arch wiki has useful info on how to do this under Linux.