CUDA - How to make a thread in a kernel wait for its children


I'm trying to implement a really simple merge sort using CUDA dynamic parallelism (recursive kernel launches, available on devices of compute capability 3.5 and higher), but I cannot find a way to make the parent thread launch its children concurrently and then wait for their computation, since cudaEventSynchronize() and cudaStreamSynchronize() are host-only. __syncthreads() would not achieve the desired effect either, since the parent's next line should only be executed after its children have completed all of their computation.

__global__ void simple_mergesort(int* data, int *dataAux, int begin, int end, int depth){
    int middle = (end+begin)/2;
    int i0 = begin;
    int i1 = middle;
    int index;
    int n = end-begin;

    cudaStream_t s,s1;

    // If we're too deep or there are few elements left, fall back to a selection sort...
    if( depth >= MAX_DEPTH || end-begin <= INSERTION_SORT ){
        selection_sort( data, begin, end );
        return;
    }

    if(n < 2){
        return;
    }

    // Launches a new block to sort the left part.
    cudaStreamCreateWithFlags(&s,cudaDeviceScheduleBlockingSync);
    simple_mergesort<<< 1, 1, 0, s >>>(data,dataAux, begin, middle, depth+1);
    cudaStreamDestroy(s);

    // Launches a new block to sort the right part.
    cudaStreamCreateWithFlags(&s1,cudaDeviceScheduleBlockingSync);
    simple_mergesort<<< 1, 1, 0, s1 >>>(data,dataAux, middle, end, depth+1);
    cudaStreamDestroy(s1);

    // Waits until children have returned, does not compile.
    cudaStreamSynchronize(s);
    cudaStreamSynchronize(s1);


    for (index = begin; index < end; index++) {
        if (i0 < middle && (i1 >= end || data[i0] <= data[i1])){
            dataAux[index] = data[i0];
            i0++;
        }else{
            dataAux[index] = data[i1];
            i1++;
        }
    }

    for(index = begin; index < end; index ++){
        data[index] = dataAux[index];
    }
}

What adaptation should I make to my code to achieve the desired effect?

Thanks for reading.

BEST ANSWER

The typical barrier used to force kernels to complete is cudaDeviceSynchronize(), and it works inside parent kernels as well, forcing the parent to wait until its child kernels have completed.

As indicated in the documentation:

As cudaStreamSynchronize() and cudaStreamQuery() are unsupported by the device runtime, cudaDeviceSynchronize() should be used instead when the application needs to know that stream-launched child kernels have completed.
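
Applied to the kernel in the question, that means replacing the two cudaStreamSynchronize() calls with a single cudaDeviceSynchronize() placed after both child launches. Below is a minimal sketch of the adapted kernel, keeping the question's MAX_DEPTH, INSERTION_SORT and selection_sort helpers; note that streams created in device code must be created with the cudaStreamNonBlocking flag, so the sketch also swaps that in for cudaDeviceScheduleBlockingSync.

__global__ void simple_mergesort(int* data, int *dataAux, int begin, int end, int depth){
    int middle = (end+begin)/2;
    int i0 = begin;
    int i1 = middle;
    int index;
    int n = end-begin;

    // If we're too deep or there are few elements left, fall back to a selection sort...
    if( depth >= MAX_DEPTH || n <= INSERTION_SORT ){
        selection_sort( data, begin, end );
        return;
    }

    if(n < 2){
        return;
    }

    // Launch each half in its own non-blocking stream so the children can run concurrently.
    cudaStream_t s, s1;
    cudaStreamCreateWithFlags(&s, cudaStreamNonBlocking);
    simple_mergesort<<< 1, 1, 0, s >>>(data, dataAux, begin, middle, depth+1);

    cudaStreamCreateWithFlags(&s1, cudaStreamNonBlocking);
    simple_mergesort<<< 1, 1, 0, s1 >>>(data, dataAux, middle, end, depth+1);

    // Device-side barrier: waits for all child kernels launched by this block to finish.
    cudaDeviceSynchronize();

    cudaStreamDestroy(s);
    cudaStreamDestroy(s1);

    // Merge the two sorted halves into dataAux, then copy back into data.
    for (index = begin; index < end; index++) {
        if (i0 < middle && (i1 >= end || data[i0] <= data[i1])){
            dataAux[index] = data[i0++];
        }else{
            dataAux[index] = data[i1++];
        }
    }

    for(index = begin; index < end; index++){
        data[index] = dataAux[index];
    }
}

Remember that dynamic parallelism also requires compiling with relocatable device code for a compute capability 3.5 or higher target, e.g. nvcc -arch=sm_35 -rdc=true mergesort.cu -o mergesort -lcudadevrt.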