'Remapping' a Python numpy array in a 'vectorized' way?

93 Views Asked by At

It has recently become apparent to me that when working with at least marginally large data sets Python does not particularly play well with 'for loops' in terms of speed of operation.

I have a task where I am performing a transform operation and thus I have to 'remap' the data from one array to a new array, but where the data ends up in the new array depends on the result of the calculation.

So let's say I'm now considering 3 numpy arrays:

A (has original data), B (remap of points, no data), C (resultant array).

Is there some way to perform this operation 'quickly'?

I mean I know if I were writing this in 'C' it probably wouldn't even be a question as it runs fast enough anyways. And if I really needed speed, I could probably put together B and C from A with an equation to directly transfers out of the pointer locations straight from memory.

I am not looking to get that complicated, just yet. But am wondering if there is a fast, efficent way to do this in Python.

Here is an example: Let's say I am doing a triple shear image rotation. 'A' in this case is my original image of shape (x, y, 3) (3 obviously being the RGB values), 'B' is my 'transform' of where the old pixels from 'A' should now end up. And then 'C' is when I have to actually shift everything out (i.e. of x, y 3, move all the data in the third dimension to their new location).

As mentioned, this is not that hard with a for loop, but for loops in Python are painfully slow.

import numpy as np
import math as m

# Define theta in degrees
angle = 30
theta = m.radians(angle)

# Triple shear transformation

# first = np.array([[1, -np.tan(theta/2)], [0, 1]])
# second = np.array([[1, 0], [np.sin(theta), 1]])
# third = np.array([[1, -np.tan(theta/2)], [0, 1]])

side = np.array([[1, -np.tan(theta/2)], [0, 1]])
middle = np.array([[1, 0], [np.sin(theta), 1]])

# Define vector test set - X, Y
pixels = np.array([[10, 5], [20, 3], [4, 5]])

# Perform the matrix multiplication

# result = first @ second @ third @ pixels.T
result = side @ middle @ side @ pixels.T

# Print the result
print(np.round(result.T))

Given the output of this 'toy code' you'll get a matrix of [[ 6. 9.], [16. 13.], [ 1. 6.]].

Is there direct way I can tell Python to send the data from the original locations ([[10, 5], [20, 3], [4, 5]]) straight to the new ones, without performing an iteration?

i.e. 'like' a copy command, but I present it just with a list for 'what should go where'? Perhaps I might be able to do what I was looking to with np.roll()?

1

There are 1 best solutions below

2
BitsAreNumbersToo On

The operation you are performing is highly optimized already, numpy is amazing at this type of thing, and you are using it exactly as it was intended to, so it is performing expectedly quite well, and getting much better performance will be pretty challenging. During testing, I found your original algorithm to have a run time of less than 10 microseconds, even with vector lengths of up to 1000, which is quite fast already.

For some vector lengths, you can squeeze a bit better performance out of it using numba, but has diminishing benefits as vector length increases, which surprised me, but that's why we test these things. Here is some code that shows how it could be optimized, improving your length 3 vector from 7.6 to 2.2 us/run, and offering meaningful improvements up to length 1000 vectors where the improvement is from 10.5 to 4.2 us/run. If you are trying to process large vectors, such as length >1000, this probably isn't worth the extra work and dependency. If you are trying to process many small vectors, then it will likely be more effective to optimize the loop rather than each individual operation. If you are always performing the same rotation, that part can be pulled out of the function to speed it up even more.

To optimized this, I converted everything to floats so that numba could compile it, swapped math.radians for numpy.radians, removed print since that dominated the run time. I left your original code, lightly modified to facilitate testing, in the test function for comparison.

import numpy as np
import math as m
import numba
# May require scipy


def test(angle, pixels):
    # Define theta in degrees
    # angle = 30
    theta = m.radians(angle)

    # Triple shear transformation
    side = np.array([[1, -np.tan(theta/2)], [0, 1]])
    middle = np.array([[1, 0], [np.sin(theta), 1]])

    # Define vector test set - X, Y
    # pixels = np.array([[10, 5], [20, 3], [4, 5]])

    # Perform the matrix multiplication

    result = side @ middle @ side @ pixels.T

    # Print the result
    # print(np.round(result.T))
    return result.T


@numba.njit
def test_optimized(angle, pixels):
    # Define theta in degrees
    theta = np.radians(angle)

    # Triple shear transformation
    side = np.array([[1.0, -np.tan(theta/2.0)], [0.0, 1.0]])
    middle = np.array([[1.0, 0.0], [np.sin(theta), 1.0]])

    # Perform the matrix multiplication
    result = side @ middle @ side @ pixels.T

    return result.T


def _main():
    import time

    def _gt(s=0.0) -> float:
        return time.perf_counter() - s

    N = int(1e3)

    angle = 30.0
    pixels = np.array([[10, 5], [20, 3], [4, 5]])

    # Get any compilation things out of the way
    test(angle, pixels)
    test_optimized(angle, pixels.astype(float))

    for L in (3, 10, 100, 1_000, 10_000, 100_000):
        pixels2 = np.random.random(size=(L, 2))
        for name, fun in (('orig', test), ('optm', test_optimized)):
            s = _gt()
            for i in range(N):
                fun(angle, pixels2)
            print(f'Run time {name}: {_gt(s) * 1e6 / N:6.1f} us / run  -  L = {L}')
        print()

    print(f'Valid: {np.allclose(test(angle, pixels), test_optimized(angle, pixels.astype(float)))}')


if __name__ == '__main__':
    _main()

Running that I get the following output:

Run time orig:    7.6 us / run  -  L = 3
Run time optm:    2.2 us / run  -  L = 3

Run time orig:    7.7 us / run  -  L = 10
Run time optm:    2.3 us / run  -  L = 10

Run time orig:    8.0 us / run  -  L = 100
Run time optm:    2.6 us / run  -  L = 100

Run time orig:   10.5 us / run  -  L = 1000
Run time optm:    4.2 us / run  -  L = 1000

Run time orig:   26.5 us / run  -  L = 10000
Run time optm:   20.7 us / run  -  L = 10000

Run time orig:  442.1 us / run  -  L = 100000
Run time optm:  421.4 us / run  -  L = 100000

Valid: True

Tested on Windows 10, Python 3.11.8, i9-10900K