// N (element count) and BLOCKSIZE (threads per block) are assumed to be defined elsewhere.
__global__ void sum(const float * __restrict__ indata, float * __restrict__ outdata) {
    unsigned int tid = blockIdx.x * blockDim.x + threadIdx.x;
    // --- Specialize BlockReduce for type float.
    typedef cub::BlockReduce<float, BLOCKSIZE> BlockReduceT;
    // --- Allocate temporary storage in shared memory
    __shared__ typename BlockReduceT::TempStorage temp_storage;
    // --- Out-of-range threads contribute 0 so that every thread in the block calls Sum()
    float val = (tid < N) ? indata[tid] : 0.0f;
    float result = BlockReduceT(temp_storage).Sum(val);
    // --- Update block reduction value (only thread 0 holds the valid sum)
    if (threadIdx.x == 0) outdata[blockIdx.x] = result;
}
I have successfully tested the reduction sum (shown in the code snippet above) with CUDA cub. I want to compute the inner product of two vectors based on this code, but I am confused about two points:
The inner product needs two input vectors. Do I need to multiply the two input vectors component-wise first and then run the reduction sum on the resulting vector?
In the cub code examples, the length of the input vector equals blocknumber*threadnumber. What if the vector is very large?
Yes, with cub, and assuming your vectors were stored separately (i.e. not interleaved), you would need to do an element-wise multiplication first. On the other hand, thrust transform_reduce could handle it in a single function call.
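As a rough illustration of the single-call thrust route, here is a minimal sketch. It assumes both vectors already live in thrust::device_vector<float> and have the same length; the functor name multiply_pair and the helper dot are hypothetical names chosen for this example. thrust::inner_product would be an even shorter way to get the same result.

```cuda
#include <thrust/device_vector.h>
#include <thrust/transform_reduce.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/tuple.h>
#include <thrust/functional.h>

// Hypothetical functor: multiplies the two components of a zipped element.
struct multiply_pair {
    template <typename Tuple>
    __host__ __device__
    float operator()(const Tuple& t) const {
        return thrust::get<0>(t) * thrust::get<1>(t);
    }
};

// Hypothetical helper: dot product of two device vectors of equal length.
float dot(const thrust::device_vector<float>& a,
          const thrust::device_vector<float>& b)
{
    // Zip the two ranges so each element is the pair (a[i], b[i]).
    auto first = thrust::make_zip_iterator(thrust::make_tuple(a.begin(), b.begin()));
    auto last  = thrust::make_zip_iterator(thrust::make_tuple(a.end(),   b.end()));

    // Transform each pair to its product, then sum the products.
    return thrust::transform_reduce(first, last, multiply_pair(),
                                    0.0f, thrust::plus<float>());
}
```

The multiplication happens on the fly inside the reduction, so no temporary product vector is allocated.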
blocknumber*threadnumber should give you all the range you need. On a cc3.0 or higher GPU, blocknumber (i.e. gridDim.x) can range up to 2^31-1 and threadnumber (i.e. blockDim.x) can range up to 1024. This gives you the possibility to handle about 2^41 elements. If each element is 4 bytes, this would require about 2^43 bytes. That is about 8TB (or double that if you are considering 2 input vectors), which is much larger than any GPU memory currently. So you will run out of GPU memory space before you run out of grid dimension.
Note that what you are showing is cub::BlockReduce. However, if you are doing a vector dot product of two large vectors, you might want to use cub::DeviceReduce instead.
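One possible way to combine the two pieces is sketched below: a kernel multiplies the vectors element by element in registers and reduces per block with cub::BlockReduce, then cub::DeviceReduce::Sum adds up the per-block partial sums. This is only a sketch under stated assumptions: the names dot_partial and dot, the block size of 256, and the fixed grid of 64 blocks are choices made for this example, and error checking is omitted for brevity.

```cuda
#include <cub/cub.cuh>
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCKSIZE = 256;  // assumed block size; must match the launch below

// Each block writes one partial dot product to partial[blockIdx.x].
__global__ void dot_partial(const float* __restrict__ a,
                            const float* __restrict__ b,
                            float* __restrict__ partial,
                            size_t n)
{
    typedef cub::BlockReduce<float, BLOCKSIZE> BlockReduceT;
    __shared__ typename BlockReduceT::TempStorage temp_storage;

    // Grid-stride loop: multiply element-wise and accumulate per thread,
    // so the vector length does not have to match the grid size.
    float thread_sum = 0.0f;
    size_t stride = (size_t)gridDim.x * blockDim.x;
    for (size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x; i < n; i += stride)
        thread_sum += a[i] * b[i];

    // Every thread of the block calls Sum(); only thread 0 gets the valid result.
    float block_sum = BlockReduceT(temp_storage).Sum(thread_sum);
    if (threadIdx.x == 0) partial[blockIdx.x] = block_sum;
}

// Hypothetical host helper: dot product of two host vectors of equal length.
float dot(const std::vector<float>& h_a, const std::vector<float>& h_b)
{
    const size_t n = h_a.size();
    const int blocks = 64;  // assumed grid size; the grid-stride loop covers the rest

    float *d_a, *d_b, *d_partial, *d_out;
    cudaMalloc(&d_a, n * sizeof(float));
    cudaMalloc(&d_b, n * sizeof(float));
    cudaMalloc(&d_partial, blocks * sizeof(float));
    cudaMalloc(&d_out, sizeof(float));
    cudaMemcpy(d_a, h_a.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, h_b.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    dot_partial<<<blocks, BLOCKSIZE>>>(d_a, d_b, d_partial, n);

    // cub two-call pattern: the first call only reports the temp storage size,
    // the second call performs the reduction over the per-block partial sums.
    void*  d_temp = nullptr;
    size_t temp_bytes = 0;
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_partial, d_out, blocks);
    cudaMalloc(&d_temp, temp_bytes);
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_partial, d_out, blocks);

    float result = 0.0f;
    cudaMemcpy(&result, d_out, sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d_temp); cudaFree(d_out); cudaFree(d_partial);
    cudaFree(d_b);    cudaFree(d_a);
    return result;
}
```

Because of the grid-stride loop, the same launch configuration works for any vector length that fits in GPU memory, so the grid dimension limit never comes into play.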