I'm trying to use a cuda DevicePtr (which is called a CUdeviceptr in CUDA-land) returned from foreign code as an accelerate Array with accelerate-llvm-ptx.
The code I've written below somewhat works:
import Data.Array.Accelerate
       (Acc, Array, DIM1, Z(Z), (:.)((:.)), use)
import qualified Data.Array.Accelerate as Acc
import Data.Array.Accelerate.Array.Data
       (GArrayData(AD_Float), unsafeIndexArrayData)
import Data.Array.Accelerate.Array.Sugar
       (Array(Array), fromElt, toElt)
import Data.Array.Accelerate.Array.Unique
       (UniqueArray, newUniqueArray)
import Data.Array.Accelerate.LLVM.PTX (run)
import Foreign.C.Types (CULLong(CULLong))
import Foreign.CUDA.Driver (DevicePtr(DevicePtr))
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Ptr (intPtrToPtr)
-- A foreign function that uses cuMemAlloc() and cuMemcpyHtoD() to
-- create data on the GPU. The CUdeviceptr (initialized by cuMemAlloc)
-- is returned from this function. It is a CULLong in Haskell.
--
-- The data on the GPU is just a list of the 10 floats
-- [0.0, 1.0, 2.0, ..., 8.0, 9.0]
foreign import ccall "mytest.h mytestcuda"
  cmyTestCuda :: IO CULLong
-- | Convert a 'CULLong' to a 'DevicePtr'.
--
-- A 'CULLong' is the type of a CUDA @CUdeviceptr@. This function
-- converts a raw 'CULLong' into a proper 'DevicePtr' that can be
-- used with the cuda Haskell package.
cullongToDevicePtr :: CULLong -> DevicePtr a
cullongToDevicePtr = DevicePtr . intPtrToPtr . fromIntegral
-- | This function calls 'cmyTestCuda' to get the 'DevicePtr', and
-- wraps that up in an accelerate 'Array'. It then uses this 'Array'
-- in an accelerate computation.
accelerateWithDataFromC :: IO ()
accelerateWithDataFromC = do
  res <- cmyTestCuda
  let DevicePtr ptrToXs = cullongToDevicePtr res
  foreignPtrToXs <- newForeignPtr_ ptrToXs
  uniqueArrayXs <- newUniqueArray foreignPtrToXs :: IO (UniqueArray Float)
  let arrayDataXs = AD_Float uniqueArrayXs :: GArrayData UniqueArray Float
  let shape = Z :. 10 :: DIM1
      xs = Array (fromElt shape) arrayDataXs :: Array DIM1 Float
      ys = Acc.fromList shape [0,2..18] :: Array DIM1 Float
      usedXs = use xs :: Acc (Array DIM1 Float)
      usedYs = use ys :: Acc (Array DIM1 Float)
      computation = Acc.zipWith (+) usedXs usedYs
      zs = run computation
  putStrLn $ "zs: " <> show zs
When compiling and running this program, it correctly prints out the result:
zs: Vector (Z :. 10) [0.0,3.0,6.0,9.0,12.0,15.0,18.0,21.0,24.0,27.0]
However, from reading through the accelerate and accelerate-llvm-ptx source code, it doesn't seem like this should work.
In most cases, it seems like an accelerate Array carries around a pointer to array data in HOST memory, and a Unique value to uniquely identify the Array. When performing Acc computations, accelerate will load the array data from HOST memory into GPU memory as needed, and keep track of it with a HashMap indexed by the Unique.
In the code above, I am creating an Array directly with a pointer to GPU data. This doesn't seem like it should work, but it appears to work in the above code.
However, some things don't work. For instance, trying to print out xs (my Array with a pointer to GPU data) fails with a segfault. This makes sense, since the Show instance for Array just tries to peek the data from the HOST pointer. This fails because it is not a HOST pointer, but a GPU pointer:
-- Trying to print xs causes a segfault.
putStrLn $ "xs: " <> show xs
Is there a proper way to take a CUDA DevicePtr and use it directly as an accelerate Array?
Actually, I am surprised that the above worked as well as it did; I couldn't replicate that.
One of the problems here is that device memory is implicitly associated with an execution context; pointers in one context are not valid in a different context, even on the same GPU (unless you explicitly enable peer memory access between those contexts).
So, there are actually two components to this problem: getting the raw device pointer wrapped up as an accelerate Array, and making sure that pointer belongs to the context in which the Accelerate computation executes.
Solution
Here is the C code we'll use to generate data on the GPU:
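The original C listing is not reproduced here; the following is a minimal sketch of what mytest.c could look like, based only on the description in the question (a cuMemAlloc() allocation plus a cuMemcpyHtoD() copy of the ten floats, returning the raw CUdeviceptr as an unsigned long long). The names mytestcuda and mytest.h come from the foreign import in the question; everything else is an assumption, and error checking is omitted for brevity.

```c
/* mytest.c -- hypothetical sketch, not the answer's original listing.
 * Allocates 10 floats on the GPU and copies [0.0 .. 9.0] into them,
 * returning the raw CUdeviceptr as an unsigned long long (CULLong on
 * the Haskell side).
 */
#include <cuda.h>

unsigned long long mytestcuda(void)
{
    /* NOTE: this assumes a CUDA context is already current when the
     * function is called. Creating a fresh context here would
     * reproduce exactly the cross-context problem described above:
     * the returned pointer would not be valid in the context that
     * accelerate-llvm-ptx executes in. */
    float xs[10];
    for (int i = 0; i < 10; i++)
        xs[i] = (float)i;

    CUdeviceptr d_xs;
    cuMemAlloc(&d_xs, sizeof xs);       /* device allocation   */
    cuMemcpyHtoD(d_xs, xs, sizeof xs);  /* host -> device copy */
    return (unsigned long long)d_xs;
}
```

Making the caller responsible for the context is what lets the Haskell side guarantee that the allocation and the Accelerate computation share one context.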
And the Haskell/Accelerate code which uses it:
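The answer's original Haskell listing is likewise not shown; below is a hedged sketch of how the two components could fit together, assuming the createTargetFromContext and runWith functions exported by Data.Array.Accelerate.LLVM.PTX: create the CUDA context yourself with the cuda package, build the execution target from that same context, and only then call the foreign function, so the returned CUdeviceptr and the Accelerate computation share one context. The pointer-wrapping part is taken directly from the question's code.

```haskell
{-# LANGUAGE ForeignFunctionInterface #-}

import Data.Array.Accelerate
       (Acc, Array, DIM1, Z(Z), (:.)((:.)), use)
import qualified Data.Array.Accelerate as Acc
import Data.Array.Accelerate.Array.Data (GArrayData(AD_Float))
import Data.Array.Accelerate.Array.Sugar (Array(Array), fromElt)
import Data.Array.Accelerate.Array.Unique (UniqueArray, newUniqueArray)
import Data.Array.Accelerate.LLVM.PTX (createTargetFromContext, runWith)
import Foreign.C.Types (CULLong)
import qualified Foreign.CUDA.Driver as CUDA
import qualified Foreign.CUDA.Driver.Context as Context
import Foreign.ForeignPtr (newForeignPtr_)
import Foreign.Ptr (intPtrToPtr)

foreign import ccall "mytest.h mytestcuda"
  cmyTestCuda :: IO CULLong

main :: IO ()
main = do
  -- One context, created up front and shared by everything below.
  CUDA.initialise []
  dev    <- CUDA.device 0
  ctx    <- Context.create dev []
  target <- createTargetFromContext ctx

  -- The C side is assumed to allocate in the (now current) context
  -- rather than creating its own.
  res <- cmyTestCuda
  let ptrToXs = intPtrToPtr (fromIntegral res)
  foreignPtrToXs <- newForeignPtr_ ptrToXs
  uniqueArrayXs  <- newUniqueArray foreignPtrToXs :: IO (UniqueArray Float)

  let arrayDataXs = AD_Float uniqueArrayXs
      shape       = Z :. 10 :: DIM1
      xs          = Array (fromElt shape) arrayDataXs :: Array DIM1 Float
      ys          = Acc.fromList shape [0,2..18]      :: Array DIM1 Float
      computation = Acc.zipWith (+) (use xs) (use ys)

  -- runWith executes on the target built from our context, so the
  -- wrapped device pointer is (by the assumption above) valid there.
  putStrLn ("zs: " ++ show (runWith target computation))
```

This is a sketch against the accelerate-1.2-era internal modules used in the question; those internals have since been reorganised, so the imports may need adjusting for newer releases.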