I want to apply the cupy.linalg.pinv function to 100k arrays, but I am seeing a drop in performance compared to the NumPy equivalent.
My 100k arrays are two-dimensional; the main array's shape is (100000, 1397, 2).
import numpy as np

# generating the data
arr = np.random.uniform(low=0.5, high=1500.20, size=(1397, 2))
main_arr = np.tile(arr, (100000, 1, 1))
With NumPy, the function runs in 11s:
%%time
np.linalg.pinv(main_arr)
CPU times: user 22.5 s, sys: 27.4 s, total: 49.9 s
Wall time: 11 s
The exact equivalent on the GPU using CuPy gives an error:
import cupy as cp

main_arr_gpu = cp.array(main_arr)  # Copy the array to the GPU
cp.linalg.pinv(main_arr_gpu)
LinAlgError: 3-dimensional array given. Array must be two-dimensional
CuPy's pinv (at least in the version I am using) only accepts a single 2-D matrix, whereas NumPy's broadcasts over the leading dimensions, so I use a list comprehension to iterate through the arrays:
%%time
[cp.linalg.pinv(arr_gpu) for arr_gpu in main_arr_gpu]
CPU times: user 22.3 s, sys: 0 ns, total: 22.3 s
Wall time: 22.3 s
It takes 22.3 s, twice the CPU time, and that is without counting the data transfer. nvidia-smi confirms that the GPU is being used.
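One caveat on the measurement itself: CuPy launches GPU kernels asynchronously, so to be sure the wall time covers the actual compute and not just the kernel launches, the timed cell should end with an explicit synchronization, something like:
%%time
result = [cp.linalg.pinv(arr_gpu) for arr_gpu in main_arr_gpu]
cp.cuda.Device().synchronize()  # wait for all queued GPU work to finish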
So why is the CPU performance so much better?
Note: the CPU is a 24-core Intel Core i9-13900K and the GPU is an NVIDIA RTX 4090.
The performance you are seeing is not too surprising. Inverses are not as easily parallelizable as matrix multiplication, so you often see no performance gain at all when switching to GPUs.
Your experience matches what others have reported when benchmarking GPU linear algebra.
This is partly why "traditional" compute clusters used for scientific computing often favor high core counts over GPUs.
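That said, if you still want to run this on the GPU, note that the list comprehension launches 100,000 tiny kernels from Python, which is mostly overhead. Since your random 1397x2 matrices almost surely have full column rank, one workaround (a rough sketch, not benchmarked) is to batch the pseudo-inverse yourself through the normal equations, pinv(A) = inv(A^T A) @ A^T, using only batched matmul and a batched inverse of the tiny 2x2 Gram matrices:
import cupy as cp

def batched_pinv(a):
    # Moore-Penrose pseudo-inverse via the normal equations:
    # pinv(A) = inv(A^T @ A) @ A^T, valid when A has full column rank.
    at = a.transpose(0, 2, 1)        # (N, 2, 1397)
    gram = at @ a                    # (N, 2, 2) batch of Gram matrices
    return cp.linalg.inv(gram) @ at  # (N, 2, 1397)

result = batched_pinv(main_arr_gpu)
Recent CuPy versions accept stacked matrices in cp.linalg.inv; if yours does not, a 2x2 inverse is easy to write out elementwise. Keep in mind that the normal-equations route is less numerically stable than the SVD-based pinv for ill-conditioned matrices.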