NVIDIA __constant memory: how to populate constant memory from host in both OpenCL and CUDA?


I have a buffer (array) on the host that should reside in the constant memory region of the device (in this case, an NVIDIA GPU).

So, I have two questions:

  1. How can I allocate a chunk of constant memory? I am tracking the available constant memory on the device, and I know for a fact that we have that amount available to us (at this time).

  2. How can I initialize (populate) those arrays with values that are computed at run time on the host?

I searched the web but found no concise documentation of this. I would appreciate it if the provided examples were in both OpenCL and CUDA; the OpenCL example is more important to me than the CUDA one.


There are 3 best solutions below

BEST ANSWER

For CUDA, I use the driver API and NVRTC, and create the kernel string with a global constant array like this:

auto kernel = R"(
..
__constant__ @@Type@@ buffer[@@SIZE@@]={
   @@elm@@
};
..
__global__ void test(int * input)
{   }

)";   

then replace the @@-pattern placeholders with size and element values at run time and compile, so the source contains something like this:

__constant__ int buffer[16384]={ 1,2,3,4, ....., 16384 };

So it is run time for the host but compile time for the device. The downside is that the kernel string gets large, readability suffers, and connecting other compilation units requires explicit linking (as if you were compiling a separate C++ project). But for simple calculations that use only your own implementations (no host definitions used directly), it behaves the same as the runtime API.

Since large strings take extra parsing time, you can cache the intermediate PTX and also the binary generated from the PTX; then you only need to check whether the kernel string has changed and recompile when it has.
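For illustration, here is a minimal sketch of that replace-and-compile step with NVRTC (the replaceAll helper and the variable names are my own, not from the answer above):

#include <nvrtc.h>
#include <string>

// hypothetical helper: replace every occurrence of `from` in `s` with `to`
static void replaceAll(std::string& s, const std::string& from, const std::string& to)
{
    for (size_t p = 0; (p = s.find(from, p)) != std::string::npos; p += to.size())
        s.replace(p, from.size(), to);
}

std::string buildPtx(const char* kernelTemplate, const std::string& elementsCsv)
{
    std::string src = kernelTemplate;             // the @@-pattern template above
    replaceAll(src, "@@Type@@", "int");
    replaceAll(src, "@@SIZE@@", "16384");
    replaceAll(src, "@@elm@@",  elementsCsv);     // e.g. "1,2,3,...,16384" built at run time

    nvrtcProgram prog;
    nvrtcCreateProgram(&prog, src.c_str(), "test.cu", 0, nullptr, nullptr);
    nvrtcCompileProgram(prog, 0, nullptr);        // check the nvrtcResult in real code

    size_t ptxSize = 0;
    nvrtcGetPTXSize(prog, &ptxSize);
    std::string ptx(ptxSize, '\0');
    nvrtcGetPTX(prog, &ptx[0]);                   // this PTX is what you would cache
    nvrtcDestroyProgram(&prog);
    return ptx;
}

The returned PTX can then be loaded with cuModuleLoadData (and cached to disk to skip recompilation on later runs).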

Are you sure __constant__ alone is worth the effort? Do you have benchmark results showing that it actually improves performance? (Premature optimization is the root of all evil.) Perhaps your algorithm works with register tiling and the source of the data does not matter?

ANSWER
  1. How can I allocate a chunk of constant memory? I am tracking the available constant memory on the device, and I know for a fact that we have that amount available to us (at this time).

In CUDA, you can't. There is no runtime allocation of constant memory, only static definition via the __constant__ specifier, which gets mapped to constant memory pages at assembly. You could generate code containing such a static declaration at runtime and compile it via nvrtc, but that seems like a lot of effort for something you know can only be sized up to 64KB. It seems much simpler (to me at least) to statically declare a 64KB constant buffer and use it at runtime as you see fit.

  2. How can I initialize (populate) those arrays with values that are computed at run time on the host?

As noted in the comments, the cudaMemcpyToSymbol API was created for this purpose, and it works just like a standard memcpy.
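A minimal sketch of that pattern with the runtime API (the array name and size are placeholders, not from the answer):

#include <cuda_runtime.h>

__constant__ float coeffs[256];   // statically declared; lives in the 64KB constant bank

__global__ void scale(float* data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= coeffs[i % 256];
}

// host side: values computed at run time, then copied to the symbol
void uploadCoeffs(const float* host_values)
{
    cudaMemcpyToSymbol(coeffs, host_values, 256 * sizeof(float));
}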

Functionally, there is no difference between __constant in OpenCL and __constant__ in CUDA. The same limitations apply: static definition at compile time (which is runtime in the standard OpenCL execution model) and a 64KB size limit.

ANSWER

Disclaimer: I cannot help you with CUDA.

For OpenCL, constant memory is effectively either treated as read-only global memory from the programmer/API point of view, or defined inline in the kernel source. You have two options:

  1. Define constant variables, arrays, etc. in your kernel code, like constant float DCT_C4 = 0.707106781f;. Note that you can dynamically generate the kernel code on the host at runtime to produce derived constant data if you wish.
  2. Pass constant memory from host to kernel via a buffer object, just as you would for global memory. Simply declare a pointer parameter in the constant address space in your kernel function's prototype, and set the buffer on the host side with clSetKernelArg(). For example, given this kernel:

kernel void mykernel(
    constant float* fixed_parameters,
    global const uint* dynamic_input_data,
    global uint* restrict output_data)
{
    /* ... */
}

the host side looks like:

cl_mem fixed_parameter_buffer = clCreateBuffer(
    context,
    CL_MEM_READ_ONLY | CL_MEM_HOST_NO_ACCESS | CL_MEM_COPY_HOST_PTR,
    sizeof(cl_float) * num_fixed_parameters, fixed_parameter_data,
    NULL);
clSetKernelArg(mykernel, 0, sizeof(cl_mem), &fixed_parameter_buffer);
Make sure to take into account the value reported for CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE for the device being used!

It usually doesn't help to use constant memory buffers for streaming input data; such data is better stored in global buffers, even if they are marked read-only for the kernel. Constant memory is most useful for data that is used by a large proportion of work-items. There is typically a fairly tight size limit on it, such as 64KiB; some implementations may "spill" to global memory if you try to exceed it, which will lose you any performance advantage you would have gained from using constant memory.
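Querying that limit is straightforward; a minimal sketch, assuming a valid cl_device_id named device:

#include <CL/cl.h>

cl_ulong max_constant_buffer = 0;
clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                sizeof(max_constant_buffer), &max_constant_buffer, NULL);
// the OpenCL spec guarantees at least 64KiB on full-profile devices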