removing loops with numpy.einsum

340 Views Asked by At

I have a some nested loops (three total) where I'm trying to use numpy.einsum to speed up the calculations, but I'm struggling to get the notation correct. I managed to get rid of one loop, but I can't figure out the other two. Here's what I've got so far:

import numpy as np
import time

def myfunc(r, q, f):
    nr = r.shape[0]
    nq = q.shape[0]
    y = np.zeros(nq)
    for ri in range(nr):
        for qi in range(nq):
            y[qi] += np.einsum('i,i',f[ri,qi]*f[:,qi],np.sinc(q[qi]*r[ri,:]/np.pi))
    return y

r = np.random.random(size=(1000,1000))
q = np.linspace(0,1,1001)
f = np.random.random(size=(r.shape[0],q.shape[0]))

start = time.time()
y = myfunc(r, q, f)
end = time.time()

print(end-start)

While this was much faster than the original, this is still too slow, and takes about 30 seconds. Note the original without the einsum call was the following (which looks like it will take ~2.5 hours, didn't wait to find out for sure):

def myfunc(r, q, f):
    nr = r.shape[0]
    nq = q.shape[0]
    y = np.zeros(nq)
    for ri in range(nr):
        for rj in range(nr):
            for qi in range(nq):
                y[qi] += f[ri,qi]*f[rj,qi]*np.sinc(q[qi]*r[ri,rj]/np.pi))
    return y

Does anyone know how to get rid of these loops with an einsum, or any other tool for that matter?

4

There are 4 best solutions below

2
On BEST ANSWER

Your function seems to be equivalent to the following:

# this is so called broadcasting
s = np.sinc(q * r[...,None]/np.pi)

np.einsum('iq,jq,ijq->q',f,f,s)

Which took about 20 seconds on my system, with most of the time to allocate s.

Let's test it for a small sample:

np.random.seed(1)
r = np.random.random(size=(10,10))
q = np.linspace(0,1,1001)
f = np.random.random(size=(r.shape[0],q.shape[0]))
(np.abs(np.einsum('iq,jq,ijq->q',f,f,s) - myfunc(r,q,f)) < 1e-6).all()
# True

Since np.sinc is not a linear operator, I'm not quite sure how we can further reduce the run time.

6
On

The simplest way (and likely the most performant) is to use an compiler, for example Numba. Since this function depends on the sinc function, also make sure that you have Intel SVML installed.

Example

import numpy as np
import numba as nb

@nb.njit(fastmath=True,parallel=False,error_model="numpy",cache=True)
def myfunc(r, q, f):
    nr = r.shape[0]
    nq = q.shape[0]
    y = np.zeros(nq)
    for ri in range(nr):
        for rj in range(nr):
            for qi in range(nq):
                y[qi] += f[ri,qi]*f[rj,qi]*np.sinc(q[qi]*r[ri,rj]/np.pi)
    return y

@nb.njit(fastmath=True,parallel=True,error_model="numpy",cache=True)
def myfunc_opt(r, q, f):
    nr = r.shape[0]
    nq = q.shape[0]
    y = np.empty(nq)

    #for contiguous memory access in the loop
    f_T=np.ascontiguousarray(f.T)
    for qi in nb.prange(nq):
        acc=0
        for ri in range(nr):
            for rj in range(nr):
                acc += f_T[qi,ri]*f_T[qi,rj]*np.sinc(q[qi]*r[ri,rj]/np.pi)
        y[qi]=acc
    return y

@nb.njit(fastmath=True,parallel=True,error_model="numpy",cache=True)
def myfunc_opt_2(r, q, f):
    nr = r.shape[0]
    nq = q.shape[0]
    y = np.empty(nq)


    f_T=np.ascontiguousarray(f.T)
    for qi in nb.prange(nq):
        acc=0
        for ri in range(nr):
            for rj in range(nr):
                #Test carefully!
                if q[qi]*r[ri,rj]!=0.:
                    acc += f_T[qi,ri]*f_T[qi,rj]*np.sin(q[qi]*r[ri,rj])/(q[qi]*r[ri,rj])
                else:
                    acc += f_T[qi,ri]*f_T[qi,rj]
        y[qi]=acc
    return y

def numpy_func(r, q, f):
    s = np.sinc(q * r[...,None]/np.pi)
    return np.einsum('iq,jq,ijq->q',f,f,s)

Timings with small arrays

r = np.random.random(size=(500,500))
q = np.linspace(0,1,501)
f = np.random.random(size=(r.shape[0],q.shape[0]))
%timeit y = myfunc(r, q, f)
#765 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit y = myfunc_opt(r, q, f)
#158 ms ± 2.59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit y = myfunc_opt_2(r, q, f)
#51.5 ms ± 1.17 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit y = numpy_func(r, q, f)
#3.81 s ± 61.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
print(np.allclose(numpy_func(r, q, f),myfunc(r, q, f)))
#True
print(np.allclose(numpy_func(r, q, f),myfunc_opt(r, q, f)))
#True
print(np.allclose(numpy_func(r, q, f),myfunc_opt_2(r, q, f)))

Timings with larger arrays

r = np.random.random(size=(1000,1000))
q = np.linspace(0,1,1001)
f = np.random.random(size=(r.shape[0],q.shape[0]))
%timeit y = myfunc(r, q, f)
#6.1 s ± 4 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit y = myfunc_opt(r, q, f)
#1.26 s ± 18.6 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit y = myfunc_opt_2(r, q, f)
#397 ms ± 2.69 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
2
On

That sinc is the actual bottleneck, as also mentioned in @Quang Hoang's post. We will make use of the einsum expression from there to end up with one way like so -

Now, from docs, numpy.sinc(x) is : \sin(\pi x)/(\pi x). We will make use of it -

v = q*r[...,None]
p = np.sin(v)/v
mask = (q==0) | (r==0)[...,None]
p[mask] = 1
out = np.einsum('iq,jq,ijq->q',f,f,p)

Also, for large data, we can leverage multi-cores with numexpr, like so -

import numexpr as ne

p = ne.evaluate('sin(q*r3D)/(q*r3D)', {'r3D':r[...,None]})
mask = (q==0) | (r==0)[...,None]
p[mask] = 1
out = np.einsum('iq,jq,ijq->q',f,f,p)

Timings with 500 length arrays -

In [12]: r = np.random.random(size=(500,500))
    ...: q = np.linspace(0,1,501)
    ...: f = np.random.random(size=(r.shape[0],q.shape[0]))

# Original soln with einsum
In [15]: %%timeit
    ...: nr = r.shape[0]
    ...: nq = q.shape[0]
    ...: y = np.zeros(nq)
    ...: for ri in range(nr):
    ...:     for qi in range(nq):
    ...:         y[qi] += np.einsum('i,i',f[ri,qi]*f[:,qi],np.sinc(q[qi]*r[ri,:]/np.pi))
9.75 s ± 977 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# @Quang Hoang's soln
In [16]: %%timeit
    ...: s = np.sinc(q * r[...,None]/np.pi)
    ...: np.einsum('iq,jq,ijq->q',f,f,s)
2.75 s ± 7.82 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [17]: %%timeit
    ...: p = ne.evaluate('sin(q3D*r)/(q3D*r)', {'q3D':q[:,None,None]})
    ...: mask = (q==0)[:,None,None] | (r==0)
    ...: p[mask] = 1
    ...: out = np.einsum('iq,jq,qij->q',f,f,p)
1.39 s ± 23.5 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [18]: %%timeit
    ...: v = q*r[...,None]
    ...: p = np.sin(v)/v
    ...: mask = (q==0) | (r==0)[...,None]
    ...: p[mask] = 1
    ...: out = np.einsum('iq,jq,ijq->q',f,f,p)
2.11 s ± 7.42 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

With larger data, we expect numexpr one to perform better, as long as we don't run into out-of-memory cases.

0
On

I'm writing this answer because I really learnt a lot about einsum from @Quang Hoang's post, and it will strengthen my understanding if I share how I think in the process of solving this problem.

The problem is to design a proper einsum operation for

y[qi] += np.einsum('i,i',f[ri,qi]*f[:,qi],np.sinc(q[qi]*r[ri,:]/np.pi))

  1. Look at the shapes of the relevant arrays. Inputs: r --> (a,a),q --> (c,), and f --> (a,c); Output: y --> (c,). From these shapes, for the part q[qi]*r[ri,:], a new array of shape (a,a,c) must be created by something like r[...,None]*q.

  2. Since np.sinc won't change an array's shape, and for a fixed pair of (ri,qi), f[ri,qi] is only a number, we should first think what einsum operation will reproduce f[:,qi],np.sinc(q[qi]*r[ri,:]/np.pi), i.e., how to obtain an array of shape (a,c) from two arrays of shapes (a,c) and (a,a,c). Intuitively, it is kl,ikl->il.

  3. For a fixed pair of (ri,qi), since f[:,qi],np.sinc(q[qi]*r[ri,:]/np.pi) gives an array of shape (a,c) and the shape of f is (a,c), the final operation is simply il,il->l

Based on the analysis above, we have the solution:

s = q*r[...,None]/np.pi
res = np.einsum('kl,ikl->il',f,s)
res = np.einsum('il,il->l',f,res)