CUDA Library for Computing Kronecker Product


I have an application that requires me to calculate some large Kronecker products of 2D matrices and multiply the result by large 2D matrices. I would like to implement this on a GPU in CUDA, and would prefer a tuned library implementation rather than writing my own (certainly suboptimal) Kronecker product. I have experience with CUDA, BLAS, LAPACK, etc., but unfortunately there is no kron(A,B) function in the common GPU libraries (MAGMA, cuBLAS, CULA, etc.).

I've searched for some solutions, but can't find a library that suits my needs. (The closest question on SO is parallel Kronecker tensor product on gpu using CUDA, but that looks like a custom solution to a special case, which won't suit my needs. I'm looking for a Kronecker product that will work in the most general case.)

I have read that DGEMM in BLAS can be used to implement a Kronecker product. Is there a standard algorithm to implement a Kronecker product using DGEMM (or its single/complex variants)? It seems to me that the only way would be to call DGEMM in a loop and tile the results into a larger matrix, which does not seem very efficient. Or, does anyone know another implementation or paper that might provide what I'm looking for?
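For concreteness, here is a minimal NumPy sketch of the loop-and-tile idea described above (assumed for illustration only, not a GPU implementation): block (i, j) of kron(A, B) is just A[i, j] * B, so on the GPU each block would become one scaled copy or GEMM call, one launch per block.

```python
import numpy as np

def kron_tiled(A, B):
    """Build kron(A, B) block by block: block (i, j) equals A[i, j] * B.
    On a GPU, each block write would be one scaled-copy/GEMM-style call,
    which is why a looped approach feels inefficient for large A."""
    m, n = A.shape
    p, q = B.shape
    K = np.empty((m * p, n * q), dtype=np.result_type(A, B))
    for i in range(m):
        for j in range(n):
            K[i*p:(i+1)*p, j*q:(j+1)*q] = A[i, j] * B
    return K

A = np.random.rand(3, 4)
B = np.random.rand(2, 5)
assert np.allclose(kron_tiled(A, B), np.kron(A, B))
```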

1 Answer
The paper you have linked to exploits the following identity

(Bᵀ ⊗ A) vec(X) = vec(AXB)

to eliminate the need for explicitly calculating the Kronecker product, replacing it with a level 3 BLAS gemm call instead. If your problem is a matrix equation of that form, then you can use gemm this way; otherwise it is of no use to you.
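A quick NumPy check of that identity (using column-major vec, as in the BLAS convention) makes the saving explicit: one ordinary gemm on the small factors replaces ever forming the large Kronecker matrix.

```python
import numpy as np

# Column-major vectorisation, matching the BLAS/Fortran convention.
vec = lambda M: M.flatten(order="F")

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 4))
X = rng.standard_normal((4, 5))
B = rng.standard_normal((5, 2))

# (B^T kron A) vec(X) == vec(A X B): the right-hand side is just two
# gemm calls on the small matrices, with no Kronecker product formed.
lhs = np.kron(B.T, A) @ vec(X)
rhs = vec(A @ X @ B)
assert np.allclose(lhs, rhs)
```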

The other identity which could potentially be useful is to calculate the Kronecker product via an outer product (a rank-1 update, in level 2 BLAS IIRC):

vec(A) vec(B)ᵀ  (the entries of A ⊗ B, up to a reordering)

Note again that the ordering of the resulting matrix will not be the same as the Kronecker product of the matrices A and B.
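As a sketch of that reordering (in NumPy, for illustration): the rank-1 outer product of the flattened matrices contains every product aᵢⱼ·bₖₗ, and one index transposition recovers the Kronecker layout.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((3, 4))
B = rng.standard_normal((2, 5))
m, n = A.shape
p, q = B.shape

# Rank-1 outer product of the flattened matrices: O[i*n+j, k*q+l] = a_ij * b_kl.
# Same entries as kron(A, B), but in a different order.
O = np.outer(A.ravel(), B.ravel())

# Reindexing (i, j, k, l) -> (i, k, j, l) restores the Kronecker layout.
K = O.reshape(m, n, p, q).transpose(0, 2, 1, 3).reshape(m * p, n * q)
assert np.allclose(K, np.kron(A, B))
```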

I am not aware of a CUDA library for computing the true Kronecker product of a pair of arbitrarily sized matrices. It should be a memory-bound problem, so even a relatively naïve approach which coalesces loads and reuses as much data as possible should get fairly close to peak bandwidth.
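To illustrate why it is memory-bound, here is the whole computation as a single broadcast multiply in NumPy (a sketch of the access pattern, not a CUDA kernel): each output element reads exactly one element of A and one of B, so a simple coalesced kernel doing the same indexing would be dominated by memory traffic.

```python
import numpy as np

def kron_broadcast(A, B):
    """Naive one-multiply-per-output-element Kronecker product.
    Output element (i*p + k, j*q + l) = A[i, j] * B[k, l]: one read from
    each input per output write, i.e. a memory-bound access pattern."""
    m, n = A.shape
    p, q = B.shape
    return (A[:, None, :, None] * B[None, :, None, :]).reshape(m * p, n * q)

A = np.random.rand(4, 3)
B = np.random.rand(2, 6)
assert np.allclose(kron_broadcast(A, B), np.kron(A, B))
```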