(Using the NVRTC runtime compiler)
I have a CUDA kernel stored as a string:
R"(
extern "C" __global__ void test1(float *a, float *b, float *c)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    c[id] = a[id] + b[id];
}
)"
which NVRTC compiles to PTX without trouble; the PTX is then loaded through the driver API and used in my program to compute c = a + b.
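For context, my host-side compile step looks roughly like this (a sketch with error checking omitted; the architecture option is only an example):

```cpp
#include <nvrtc.h>
#include <vector>

const char *kernelSrc = R"(
extern "C" __global__ void test1(float *a, float *b, float *c)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    c[id] = a[id] + b[id];
}
)";

int main()
{
    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, kernelSrc, "test1.cu", 0, nullptr, nullptr);

    // --gpu-architecture is just an example option here.
    const char *opts[] = { "--gpu-architecture=compute_30" };
    nvrtcCompileProgram(prog, 1, opts);

    // Fetch the generated PTX; it is then loaded with cuModuleLoadData()
    // and launched through the driver API.
    size_t ptxSize;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::vector<char> ptx(ptxSize);
    nvrtcGetPTX(prog, ptx.data());

    nvrtcDestroyProgram(&prog);
    return 0;
}
```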
But when I add a header include and try to use a class from it:
R"(
#include <climits>
extern "C" __global__ void test1(float *a, float *b, float *c, int *gpuOffset)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    device_vector<int> dv;
    c[id] = a[id] + b[id];
}
)"
compilation fails with errors such as
test1.cu(23): catastrophic error: cannot open source file "climits"
1 catastrophic error detected in the compilation of "test1.cu".
Compilation terminated.
or
test1.cu(28): error: identifier "device_vector" is undefined
depending on whether the failure is the include itself or a class from the header (such as device_vector).
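For what it's worth, NVRTC does not appear to search the host's include directories on its own; as far as I can tell, a header must be handed to it explicitly, either preloaded as a string through nvrtcCreateProgram() or located via an -I search path passed to nvrtcCompileProgram(). A sketch of both (the header name and contents here are made up):

```cpp
#include <nvrtc.h>

const char *headerSrc    = "#define BLOCK_SIZE 256\n";  // hypothetical header body
const char *headerName[] = { "my_header.h" };           // name used in the #include
const char *headerBody[] = { headerSrc };

const char *src = R"(
#include "my_header.h"
extern "C" __global__ void test1(float *a, float *b, float *c)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    if (id < BLOCK_SIZE) c[id] = a[id] + b[id];
}
)";

int main()
{
    nvrtcProgram prog;
    // Preload one header as an in-memory string...
    nvrtcCreateProgram(&prog, src, "test1.cu", 1, headerBody, headerName);

    // ...or let NVRTC find headers on disk via an include path option.
    const char *opts[] = { "-I/usr/local/cuda/include" };
    nvrtcCompileProgram(prog, 1, opts);

    nvrtcDestroyProgram(&prog);
    return 0;
}
```

This makes plain headers reachable, but it doesn't solve the Thrust/cuFFT part of my problem, which is what the rest of the question is about.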
Also, the documentation shows that both cuFFT and Thrust are usable only on the host side, so it seems I can't use any "partial" algorithm that I want to run on each thread block independently.
Is there a list of headers for CUDA-supported algorithms that can be used per block, e.g.:
R"(
#include "driver_api_fft.h"
#include "driver_api_ifft.h"
extern "C" __global__ void test1(float *a, float *b, float *c)
{
    int id = blockIdx.x * blockDim.x + threadIdx.x;
    fft(a, id, 1024);
    ifft(b, id, 1024);
    c[id] = a[id] + b[id];
}
)"
so that it compiles and runs on any target machine? Alternatively, is it possible to link those algorithm libraries (e.g. Thrust, for device_vector) into the PTX with the host-side linker so that the compiled kernel can somehow call them? If neither is possible, do I need to write a Fourier transform myself and make it "fast" by implementing the algorithms on my own?