CUDA dynamic parallelism kernels are executing sequentially


I need to write an application that computes some matrices from other matrices. In general, it sums outer products of rows of an initial matrix E, each scaled by some numbers calculated from v and t, for every t in a given range. I am a newbie to CUDA, so there might be some very wrong ideas in the implementation. Here is my code, with some explanation in the comments:
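For reference, writing out what the code below computes as a formula: for each t (with N = size and ρ = rhoind), the result should be

result_t[r, c] = \sum_{m=0}^{N-1} \sum_{n=0}^{N-1} e^{-i (v_m - v_n) t / \hbar} \; E[m, \rho] \, E[n, \rho] \; E[m, r] \, E[n, c]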

#include <cupy/complex.cuh>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/sequence.h>
#include <thrust/transform.h>
const int BLOCK_SIZE = 16;
const int DIM_SIZE = 16;
const double d_hbar=1.0545718176461565e-34;

extern "C"
struct sum_functor { //sum functor for thrust::transform, sums an array of matrices
    int N;
    complex<float> *arr;
    complex<float> *result;
    __host__ __device__ sum_functor(int _N, complex<float>* _arr, complex<float>* _result) : N(_N), arr(_arr), result(_result) {};

    __host__ __device__ complex<float> operator()(int n){
        complex<float> sum = result[n];
        for (int i = 0; i < BLOCK_SIZE; i++) {
            sum += arr[N * N * i + n];
        }
        return sum;
    }
};


extern "C" //outer product multiplied by given exp and rho
__global__ void outer(const complex<float>* E, int size, complex<float>* blockResult, 
                        int m, int n, complex<float> exp, complex<float> rho) {
    int col = blockIdx.y*blockDim.y+threadIdx.y;
    int row = blockIdx.x*blockDim.x+threadIdx.x;
    if (row < size && col < size) {
        blockResult[row * size + col] = exp * rho * E[m * size + row] * E[n * size + col];
    }
}

//compute constants and launch outer product kernels
//outer products are stored in blockResult, i.e. 16 matrices one after another
extern "C"
__global__ void calcBlock(const complex<float>* v, const complex<float>* E, int size, double t,
                    int rhoind, complex<float>* blockResult, int blockInd) {
    int i = threadIdx.x;
    int j = i + blockInd;
    int m = j / size;
    int n = j % size;
    if (m < size && n < size) {
        const complex<float> hbar(d_hbar);
        complex<float> exp = thrust::exp(complex<float>(0, -1)*(v[m] - v[n]) * complex<float>(t)/hbar);
        complex<float> rho = E[m * size + rhoind] * E[n * size + rhoind];
        dim3 dimGrid((size - 1)/DIM_SIZE + 1, (size - 1) / DIM_SIZE + 1, 1);
        dim3 dimBlock(DIM_SIZE, DIM_SIZE, 1);
        outer<<<dimGrid, dimBlock>>>(E, size, blockResult + i * size * size, m, n, exp, rho);
    }
}

//launch the block calculation, then sum all the matrices in the block and add them to the result
//repeat block by block until all size*size matrices in total are summed
extern "C" 
__global__ void calcSum(const complex<float>* v, const complex<float>* E, int size, double t, int ind,
                    int rhoind,  complex<float>* blockResult, complex<float>* result, int* resultIndexes) {
    for (int i = 0; i < size * size; i += BLOCK_SIZE) {
        calcBlock<<<1, BLOCK_SIZE>>>(v, E, size, t, rhoind, blockResult, i);
        cudaDeviceSynchronize();
        thrust::transform(thrust::device, resultIndexes, resultIndexes + size * size,
            result + ind * size * size,
            sum_functor(size, blockResult, result + ind * size * size));
    }
}

//launch calcSum in parallel for every t in range 
extern "C" 
__global__ void eigenMethod(const complex<float>* v, const complex<float>* E, int size, const double* t, int t_size,
                    int rhoind,  complex<float>* blockResult, complex<float>* result, int* resultIndexes) {
    int i = threadIdx.x;
    if (i < t_size) {
        calcSum<<<1, 1>>>(v, E, size, t[i], i, rhoind, blockResult + i * BLOCK_SIZE * size * size, result, resultIndexes);
    }
}

//main is simplified because I am using CuPy
int main() {
    // Calculate E (size * size) and v (size)
    // t is a vector of length t_size
    // Initialize blockResult (t_size * BLOCK_SIZE * size * size)
    // resultIndexes (size * size) holds 0, 1, ..., size * size - 1
    // result (t_size * size * size) is filled with zeros
    eigenMethod<<<1, t_size>>>(v, E, size, t, t_size, 0, blockResult, result, resultIndexes);
}

The overall idea might be strange and stupid, but it is working. However, the problem I've encountered is that all the calcSum kernels launched from eigenMethod are scheduled one after another.

The calcSum function and everything above it works fast enough for the purposes it was created for. The main problem appears when I try to call several of these in the eigenMethod function. Benchmarking shows a linear dependence between runtime and the number of calls: for example, eigenMethod with t_size = 32 runs almost twice as fast as with t_size = 64. I have also tried profiling, but did not get the information I wanted, since Nsight Systems does not support CDP (CUDA Dynamic Parallelism) according to the topics I saw. I thought that accessing the same part of global memory (arrays E and v are the same pointers for all the kernels I launch) might be the problem, so as a hotfix I created individual copies for every calcSum call, but it did not help. Is there a way to run multiple calcSum kernels in parallel? The benchmark results are listed below (matrix size is 128x128):

t_size    time, s
1         0.32
4         1.036
8         1.9
16        3.6

Best answer:

By not specifying a stream, you are using the default stream, which is shared among the threads of the same block. Therefore all launches of calcSum go into the same stream and have to be executed one after another. This can be fixed by using explicit streams instead.
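As a minimal, untested sketch (assuming the rest of the code stays as posted), eigenMethod could give each thread its own device-side stream; note that streams created in device code must use the cudaStreamNonBlocking flag:

//launch calcSum for every t in range, each into its own device-side stream
extern "C"
__global__ void eigenMethod(const complex<float>* v, const complex<float>* E, int size, const double* t, int t_size,
                    int rhoind, complex<float>* blockResult, complex<float>* result, int* resultIndexes) {
    int i = threadIdx.x;
    if (i < t_size) {
        cudaStream_t stream;
        cudaStreamCreateWithFlags(&stream, cudaStreamNonBlocking); // required flag in device code
        calcSum<<<1, 1, 0, stream>>>(v, E, size, t[i], i, rhoind,
            blockResult + i * BLOCK_SIZE * size * size, result, resultIndexes);
        cudaStreamDestroy(stream); // already-enqueued work still completes
    }
}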

"[A]ccessing the same part of global memory" from multiple kernels has nothing to do with this. As long as the kernels are only reading from the same locations and not writing to them, this is not problematic at all. Writing to the same locations would cause race conditions and therefore potentially non-deterministic output, but the CUDA runtime can not detect this and wont "sequentialize" your kernels to get around it.

As discussed in the comments, I don't think CDP is needed here at all; it is potentially expensive, and in this form it is not future-proof (device-side cudaDeviceSynchronize is deprecated in recent CUDA versions). So performance will most probably not be ideal.
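For illustration only, here is a rough, untested sketch of how the two outer levels (eigenMethod and calcSum) could be moved to the host, with one stream per t so the per-t work can overlap. sumBlock is a hypothetical helper that replaces the device-side thrust::transform; calcBlock is kept exactly as in the question, so it still launches outer from the device and relies on the parent grid waiting for its children, just like the original code:

#include <vector>

// hypothetical helper replacing the device-side thrust::transform:
// adds the BLOCK_SIZE partial matrices produced by one calcBlock call to the per-t result
extern "C"
__global__ void sumBlock(int size, const complex<float>* blockResult, complex<float>* result) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < size * size) {
        complex<float> sum = result[idx];
        for (int i = 0; i < BLOCK_SIZE; i++) {
            sum += blockResult[i * size * size + idx];
        }
        result[idx] = sum;
    }
}

// host-side replacement for eigenMethod + calcSum: one stream per t value
// (the t values are needed on the host here, since calcBlock takes t by value)
void eigenMethodHost(const complex<float>* v, const complex<float>* E, int size,
                     const std::vector<double>& t, int rhoind,
                     complex<float>* blockResult, complex<float>* result) {
    int t_size = (int)t.size();
    std::vector<cudaStream_t> streams(t_size);
    for (int i = 0; i < t_size; i++) {
        cudaStreamCreateWithFlags(&streams[i], cudaStreamNonBlocking);
    }

    int sumThreads = 256;
    int sumBlocks = (size * size + sumThreads - 1) / sumThreads;

    for (int i = 0; i < t_size; i++) {
        complex<float>* myBlock  = blockResult + i * BLOCK_SIZE * size * size;
        complex<float>* myResult = result + i * size * size;
        for (int blockInd = 0; blockInd < size * size; blockInd += BLOCK_SIZE) {
            // calcBlock is unchanged: it still launches outer as child kernels and only
            // finishes once they have finished, so sumBlock sees complete data
            calcBlock<<<1, BLOCK_SIZE, 0, streams[i]>>>(v, E, size, t[i], rhoind, myBlock, blockInd);
            sumBlock<<<sumBlocks, sumThreads, 0, streams[i]>>>(size, myBlock, myResult);
        }
    }

    for (int i = 0; i < t_size; i++) {
        cudaStreamSynchronize(streams[i]);
        cudaStreamDestroy(streams[i]);
    }
}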