I have n
separate GPUs, each storing its own data. I would like to have each of them perform a set of calculations simultaneously. The CUDArt documentation here describes the use of streams to asynchronously call custom C kernels in order to achieve parallelization (see also this other example here). With custom kernels, this can be accomplished through the use of the stream
argument in CUDArt's implementation of the launch()
function. As far as I can tell, however, the CUSPARSE (or CUBLAS) functions don't have a similar option for stream specification.
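To illustrate the pattern I mean, here is a rough sketch of launching one kernel per GPU with CUDArt streams. The PTX module name, the kernel name, and some of the helper functions (`Stream`, `device_synchronize`, etc.) are assumptions based on my reading of the CUDArt docs linked above, not tested code:

```julia
using CUDArt

# "vadd.ptx" / "vadd" are placeholders for whatever custom kernel you have
# compiled; treat the CUDArt calls below as a sketch of the documented API.
results = devices(dev -> true) do devlist
    n = 1024
    d_out = Dict{Int,Any}()
    # Launch one kernel per GPU; launch() returns immediately, so the GPUs
    # run concurrently.
    for dev in devlist
        device(dev)                          # make this GPU current
        md   = CuModule("vadd.ptx")          # placeholder PTX module
        vadd = CuFunction(md, "vadd")        # placeholder kernel name
        strm = Stream()                      # per-device stream
        d_a = CudaArray(rand(Float32, n))
        d_b = CudaArray(rand(Float32, n))
        d_c = CudaArray(Float32, n)
        launch(vadd, div(n, 256), 256, (d_a, d_b, d_c); stream=strm)
        d_out[dev] = d_c
    end
    # Wait for every GPU to finish, then copy results back to the host.
    Dict(dev => (device(dev); device_synchronize(); to_host(d_out[dev]))
         for dev in devlist)
end
```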
Is this possible with CUSPARSE, or do I just need to drop down to C if I want to use multiple GPUs?
REVISED Bounty Update
Ok, so, I now have a relatively decent solution working, finally. But, I'm sure it could be improved in a million ways - it's quite hacky right now. In particular, I'd love suggestions for solutions along the lines of what I tried and wrote about in this SO question (which I never got to work properly). Thus, I'd be delighted to award the bounty to anyone with further ideas here.
Ok, so, I think I've finally come upon something that works at least relatively well. I'd still be absolutely delighted to offer the bounty to anyone who has further improvements. In particular, improvements based on the design that I attempted (but failed) to implement, as described in this SO question, would be great - but any other improvements or suggestions would be welcome too.
The key breakthrough I discovered for getting things like CUSPARSE and CUBLAS to parallelize over multiple GPUs is that you need to create a separate handle for each GPU. For example, the CUBLAS API documentation explains that a handle (context) is bound to whichever device is current when it is created, so to use a different GPU you must first call cudaSetDevice() and then create another handle for that device.
See here and here for some additional helpful docs.
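To make that concrete, here is a minimal sketch (in current Julia, and not taken from the packages themselves) of creating one cuBLAS handle per device by calling the C libraries directly with ccall. The library names "libcudart" and "libcublas" assume the CUDA libraries are on the loader path:

```julia
# Minimal sketch: one cuBLAS handle per GPU, created while that GPU is the
# current device. Assumes libcudart/libcublas are on the loader path.
const libcudart = "libcudart"
const libcublas = "libcublas"

function set_device!(dev::Integer)
    err = ccall((:cudaSetDevice, libcudart), Cint, (Cint,), dev)
    err == 0 || error("cudaSetDevice($dev) failed with error code $err")
end

function cublas_handle()
    h = Ref{Ptr{Cvoid}}(C_NULL)
    status = ccall((:cublasCreate_v2, libcublas), Cint, (Ptr{Ptr{Cvoid}},), h)
    status == 0 || error("cublasCreate_v2 failed with status $status")
    return h[]
end

# A handle is bound to whichever device was current when it was created,
# so set the device first and keep one handle per GPU.
ndev = Ref{Cint}(0)
ccall((:cudaGetDeviceCount, libcudart), Cint, (Ptr{Cint},), ndev)
handles = [(set_device!(dev); cublas_handle()) for dev in 0:ndev[]-1]
```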
Now, in order to actually move forward on this, I had to do a bunch of rather messy hacking. In the future, I'm hoping to get in touch with the folks who developed the CUSPARSE and CUBLAS packages to see about incorporating this into their packages. For the time being though, this is what I did:
First, the CUSPARSE and CUBLAS packages come with functions to create handles. But I had to modify the packages a bit to export those functions (along with the other functions and object types they depend on) so that I could actually access them myself.
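For illustration, the handle-creation step itself is just a single C call, so one alternative to patching the package is to make the ccall yourself. The sketch below assumes libcusparse is on the loader path and is not the code I actually added to the package:

```julia
# Sketch: create and destroy a cuSPARSE handle directly, bypassing the
# package's unexported wrapper. Assumes libcusparse is on the loader path.
const libcusparse = "libcusparse"

function cusparse_handle()
    h = Ref{Ptr{Cvoid}}(C_NULL)
    status = ccall((:cusparseCreate, libcusparse), Cint, (Ptr{Ptr{Cvoid}},), h)
    status == 0 || error("cusparseCreate failed with status $status")
    return h[]
end

function destroy_cusparse_handle!(h::Ptr{Cvoid})
    ccall((:cusparseDestroy, libcusparse), Cint, (Ptr{Cvoid},), h)
end

# As with cuBLAS, each handle is tied to the device that is current when
# cusparseCreate is called, so call cudaSetDevice before creating each one.
```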
Specifically, I made small additions (exports and supporting definitions) to four files in the CUSPARSE package: CUSPARSE.jl, libcusparse_types.jl, libcusparse.jl, and sparse.jl.

Through all of these, I was able to get functional access to the cusparseCreate() function, which can be used to create new handles (I couldn't just use CUSPARSE.cusparseCreate() because that function depended on a bunch of other functions and data types). From there, I defined a new version of the matrix multiplication operation I wanted that took an additional argument, the handle, to feed into the underlying ccall(). Below is the full code: