CUDA dynamic parallelism: invalid global write when using texture memory


I seem to run into trouble when a kernel launched from within another kernel (even a recursive call) reads a value from texture memory.

If the child kernel (say, a different one) doesn't use texture memory, everything is fine. If I don't launch a kernel from within a kernel, the results are as expected. But as soon as I use texture memory, which in my case is very useful for its spatial locality and fast filtering, cuda-memcheck reports "Invalid __global__ write of size 4".

I've seen in the dynamic parallelism section of the programming guide that one must be careful when using texture memory, since the child may see inconsistent data, but here the child kernel doesn't even launch.

I've tried __syncthreads() and cudaDeviceSynchronize() placed before and after the texture fetch, but nothing changes.

Are there any already-reported cases like this, am I doing something wrong, or is it simply that you can't use texture memory this way?

System: GTX Titan Black (sm_35), CUDA 6.0.

EDIT: some example code to illustrate the problem.

Obviously, EField is declared and filled beforehand. HANDLE_ERROR comes from the book.h header that ships with CUDA by Example.

Here is a compilable code:

#include "cuda.h"
#include "/common/book.h"

#define DIM 2048

texture<float4, 2, cudaReadModeElementType> texEField;

__device__ int oneChild = 0;


__global__ void test_cdp( float x0, float y0 ){
    int x = threadIdx.x + blockIdx.x * blockDim.x;
    int y = threadIdx.y + blockIdx.y * blockDim.y;
    int idx = x + y * blockDim.x * gridDim.x;

    printf("Propa started from thread %d\n", idx);      
    float4 E = tex2D( texEField, x0, y0 );

    printf("E field %f -- %f\n", E.z, E.w);     
    if( oneChild < 1 ){
        test_cdp<<<1, 1>>>(x0, y0);
        oneChild++;
    }
}

int main( void ){   

    //Start of texture allocation

    float4 *EField = new float4 [DIM*DIM];
    for( int u = 0; u < DIM*DIM; u++ ){
        EField[u].x = 1.0f;
        EField[u].y = 1.0f;
        EField[u].z = 1.0f;
        EField[u].w = 1.0f;
    }   


    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float4>();

    float4 *dev_EField;
    HANDLE_ERROR( cudaMalloc( (void**)&dev_EField, DIM * DIM * sizeof(float4) ) );

    HANDLE_ERROR( cudaMemcpy( dev_EField, EField, DIM * DIM * sizeof(float4), cudaMemcpyHostToDevice ) );

    // Set the texture attributes before binding, so they take effect.
    texEField.addressMode[0] = cudaAddressModeWrap;
    texEField.addressMode[1] = cudaAddressModeWrap;
    texEField.filterMode = cudaFilterModeLinear;
    texEField.normalized = true;

    HANDLE_ERROR( cudaBindTexture2D( NULL, texEField, dev_EField, desc, DIM, DIM, sizeof(float4) * DIM ) );

    test_cdp<<<1, 1>>>(0.5f, 0.5f);
    HANDLE_ERROR( cudaDeviceSynchronize() );  // let the kernel finish and surface errors

    HANDLE_ERROR( cudaUnbindTexture( texEField ) );  // unbind before freeing the memory
    HANDLE_ERROR( cudaFree( dev_EField ) );
    delete [] EField;
    return 0;
}

1 answer:

BEST ANSWER

In the future, please provide a complete, compilable code. SO expects this. As one example of the uncertainty, your kernel is defined as test_cdp, but the kernel called from host code is test2_cdp. Please don't make others guess at your intentions or play 20 questions to clarify your code. Post a complete, compilable code that requires no additions or changes and that demonstrates the issue. This is the reason for the close votes on your question.

I can see 2 problems.

  1. If you were to fix the above issue, this code as written could lead to an endless chain of child kernels being launched. It appears that you may think the oneChild variable is somehow shared between parent and child kernels. It is not. Therefore every launched child kernel will see that oneChild is zero, and it will launch its own child kernel. I don't know where this sequence would end, but it's not a sensible use of CDP.

  2. CDP does not support module-scope texture references in device-launched kernels. Use texture objects instead.
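As a rough sketch of both suggestions (not your original code): the kernel can take a cudaTextureObject_t and an explicit depth argument, which replaces both the module-scope texture reference and the __device__ counter. The depth limit of 1 below is an illustrative choice, and the filling of dev_EField is omitted; compile with `nvcc -arch=sm_35 -rdc=true -lcudadevrt`:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define DIM 2048

// Child-launchable kernel: the texture object and the recursion depth are
// passed as arguments, so no global texture reference or shared counter
// is needed, and the recursion is bounded per launch chain.
__global__ void test_cdp( cudaTextureObject_t tex, float x0, float y0, int depth ){
    float4 E = tex2D<float4>( tex, x0, y0 );
    printf("depth %d: E field %f -- %f\n", depth, E.z, E.w);
    if( depth < 1 )                        // explicit recursion bound
        test_cdp<<<1, 1>>>( tex, x0, y0, depth + 1 );
}

int main( void ){
    // Allocate device memory as in the question (filling it is omitted here).
    float4 *dev_EField;
    cudaMalloc( (void**)&dev_EField, DIM * DIM * sizeof(float4) );

    // Describe the underlying pitched-linear memory...
    cudaResourceDesc resDesc = {};
    resDesc.resType = cudaResourceTypePitch2D;
    resDesc.res.pitch2D.devPtr = dev_EField;
    resDesc.res.pitch2D.desc = cudaCreateChannelDesc<float4>();
    resDesc.res.pitch2D.width = DIM;
    resDesc.res.pitch2D.height = DIM;
    resDesc.res.pitch2D.pitchInBytes = DIM * sizeof(float4);

    // ...and the sampling behaviour (same settings as the texture reference).
    cudaTextureDesc texDesc = {};
    texDesc.addressMode[0] = cudaAddressModeWrap;
    texDesc.addressMode[1] = cudaAddressModeWrap;
    texDesc.filterMode = cudaFilterModeLinear;
    texDesc.readMode = cudaReadModeElementType;
    texDesc.normalizedCoords = 1;

    cudaTextureObject_t texObj = 0;
    cudaCreateTextureObject( &texObj, &resDesc, &texDesc, NULL );

    test_cdp<<<1, 1>>>( texObj, 0.5f, 0.5f, 0 );
    cudaDeviceSynchronize();

    cudaDestroyTextureObject( texObj );
    cudaFree( dev_EField );
    return 0;
}
```

Note that with texture objects the fetch is the templated tex2D&lt;float4&gt;(), and all sampling state travels with the object, so it is valid to pass it into a device-launched child kernel.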