CUDA - How to make a thread in a kernel wait for its children


I'm trying to implement a really simple merge sort using CUDA dynamic parallelism (recursive kernel launches, available on devices of compute capability 3.5 and higher), but I cannot find a way to make the parent thread launch its children concurrently and then wait for their computation, since cudaEventSynchronize() and cudaStreamSynchronize() are host-only. __syncthreads() would not achieve the desired effect either, since the parent's next line should only be executed after its children have completed all of their computation.

__global__ void simple_mergesort(int* data, int *dataAux, int begin, int end, int depth){
    int middle = (end+begin)/2;
    int i0 = begin;
    int i1 = middle;
    int index;
    int n = end-begin;

    cudaStream_t s,s1;

    // If we're too deep or there are few elements left, fall back to a selection sort...
    if( depth >= MAX_DEPTH || end-begin <= INSERTION_SORT ){
        selection_sort( data, begin, end );
        return;
    }

    if(n < 2){
        return;
    }

    // Launches a new block to sort the left part.
    cudaStreamCreateWithFlags(&s,cudaDeviceScheduleBlockingSync);
    simple_mergesort<<< 1, 1, 0, s >>>(data,dataAux, begin, middle, depth+1);
    cudaStreamDestroy(s);

    // Launches a new block to sort the right part.
    cudaStreamCreateWithFlags(&s1,cudaDeviceScheduleBlockingSync);
    simple_mergesort<<< 1, 1, 0, s1 >>>(data,dataAux, middle, end, depth+1);
    cudaStreamDestroy(s1);

    // Waits until children have returned, does not compile.
    cudaStreamSynchronize(s);
    cudaStreamSynchronize(s1);


    for (index = begin; index < end; index++) {
        if (i0 < middle && (i1 >= end || data[i0] <= data[i1])){
            dataAux[index] = data[i0];
            i0++;
        }else{
            dataAux[index] = data[i1];
            i1++;
        }
    }

    for(index = begin; index < end; index ++){
        data[index] = dataAux[index];
    }
}

What adaptation should I make to my code to achieve the desired effect?

Thanks for reading.

BEST ANSWER

The typical barrier used to force kernels to complete is cudaDeviceSynchronize(), and it works inside parent kernels as well, forcing the parent to wait until its child kernels have completed.

As indicated in the documentation:

As cudaStreamSynchronize() and cudaStreamQuery() are unsupported by the device runtime, cudaDeviceSynchronize() should be used instead when the application needs to know that stream-launched child kernels have completed.
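
Applied to the kernel in the question, that means replacing the two cudaStreamSynchronize() calls with a single cudaDeviceSynchronize() placed after both child launches. Below is a minimal sketch of the adapted kernel, keeping the question's MAX_DEPTH, INSERTION_SORT and selection_sort helpers; note that streams created in device code must be created with the cudaStreamNonBlocking flag, so the sketch also swaps that in for cudaDeviceScheduleBlockingSync.

__global__ void simple_mergesort(int* data, int *dataAux, int begin, int end, int depth){
    int middle = (end+begin)/2;
    int i0 = begin;
    int i1 = middle;
    int index;
    int n = end-begin;

    // If we're too deep or there are few elements left, fall back to a selection sort...
    if( depth >= MAX_DEPTH || n <= INSERTION_SORT ){
        selection_sort( data, begin, end );
        return;
    }

    if(n < 2){
        return;
    }

    // Launch each half in its own non-blocking stream so the children can run concurrently.
    cudaStream_t s, s1;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    simple_mergesort<<< 1, 1, 0, s >>>(data, dataAux, begin, middle, depth+1);

    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    simple_mergesort<<< 1, 1, 0, s1 >>>(data, dataAux, middle, end, depth+1);

    // Device-side barrier: waits for all child kernels launched by this block to finish.
    cudaDeviceSynchronize();

    cudaStreamDestroy(s);
    cudaStreamDestroy(s1);

    // Merge the two sorted halves into dataAux, then copy back into data.
    for (index = begin; index < end; index++) {
        if (i0 < middle && (i1 >= end || data[i0] <= data[i1])){
            dataAux[index] = data[i0++];
        }else{
            dataAux[index] = data[i1++];
        }
    }

    for(index = begin; index < end; index++){
        data[index] = dataAux[index];
    }
}

Remember that dynamic parallelism also requires compiling with relocatable device code for a compute capability 3.5 or higher target, e.g. nvcc -arch=sm_35 -rdc=true mergesort.cu -o mergesort -lcudadevrt.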