Optimizing specific memory usage for CUDA


I have a data-processing task which can be stylized in the following way. I have data (~1-10 GB) and a function which generates a summary (~1 MB) of this data for some (double) input x. I need to obtain this summary for ~1000 values of x, which looked like a perfect task for a GPU. To repeat: the input data is the same for all threads and is read in a linear fashion, but each thread has to produce its own summary, and the function is executed independently for different values of x.

However, brute-force single-threaded cycling through all values of x on the CPU yields only 3x worse performance than the K520. I do understand that this is a memory-intensive task (each thread has to access and write to random parts of its summary), but I still struggle to understand how the GPU can lose its initial 1000x advantage. I've tried feeding the data in chunks through __constant__ memory (since it is the same input for all threads), with no visible improvement. The typical block run time, as reported by nvprof, is 10-30 seconds.
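For reference, the __constant__-memory attempt was roughly along these lines (a minimal sketch of the idea, not the code I actually ran; CHUNK_SIZE, ProcessChunk and the trivialised per-thread count are purely illustrative):

// Stage the shared input through constant memory in fixed-size chunks;
// each thread keeps its own partial result for its value of x.
#define CHUNK_SIZE 4096   // 32 KB of doubles, within the 64 KB constant-memory limit

__constant__ double c_chunk[CHUNK_SIZE];

__global__ void ProcessChunk(int chunk_len, const double * d_x, int * d_counts)
{
    double x = d_x[threadIdx.x];             // one value of x per thread
    int below = 0;
    for (int i = 0; i < chunk_len; i++)      // every thread reads the same chunk
        if (c_chunk[i] < x)
            below++;
    d_counts[threadIdx.x] += below;          // per-thread partial summary (zero-initialised beforehand)
}

// Host side: copy one chunk into constant memory, launch, repeat.
void ProcessAll(const double * h_data, size_t n, const double * d_x, int * d_counts, int n_x)
{
    for (size_t off = 0; off < n; off += CHUNK_SIZE)
    {
        int len = (int)((n - off < CHUNK_SIZE) ? (n - off) : CHUNK_SIZE);
        cudaMemcpyToSymbol(c_chunk, h_data + off, len * sizeof(double));
        ProcessChunk<<<1, n_x>>>(len, d_x, d_counts);
    }
    cudaDeviceSynchronize();
}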

I would appreciate any insight into the optimizations suitable for this task.

EDIT: Below is sample code that replicates the problem. It can be compiled with both g++ (reporting a run time of 5 s) and nvcc (reporting a run time of 7 s). The profiling results are as follows:

==23844== Profiling result:
Time(%)      Time     Calls       Avg       Min       Max  Name
 98.86%  4.68899s         1  4.68899s  4.68899s  4.68899s  Kernel(Observation*, int*, Info**)
  1.09%  51.480ms         4  12.870ms  1.9200us  50.426ms  [CUDA memcpy HtoD]
  0.06%  2.6634ms       800  3.3290us  3.2950us  5.1200us  [CUDA memcpy DtoD]
  0.00%  4.3200us         1  4.3200us  4.3200us  4.3200us  [CUDA memcpy DtoH]

#include <iostream>
#include <fstream>
#include <cstdlib>
#include <ctime>
#include <cstring>

#define MAX_OBS 1000000
#define MAX_BUCKETS 1000

using namespace std;

// Cross-arch defines
#ifndef __CUDACC__

#define GPU_FUNCTION

#define cudaSuccess 0

typedef int cudaError_t;

struct dim3
{
    int x;
    int y;
    int z;
} blockIdx, threadIdx;

enum cudaMemcpyKind
{
    cudaMemcpyHostToDevice = 0,
    cudaMemcpyDeviceToHost = 1, 
    cudaMemcpyDeviceToDevice = 2
};

cudaError_t cudaMalloc(void ** Dst, size_t bytes)
{
    return !(*Dst = malloc(bytes));
}

cudaError_t cudaMemcpy(void * Dst, const void * Src, size_t bytes, cudaMemcpyKind kind)
{
    return !memcpy(Dst, Src, bytes);
}

#else
#define GPU_FUNCTION __global__
#endif

// Basic observation structure as stored on disk
struct Observation
{
    double core[20];
};

struct Info
{
    int left;
    int right;
};

GPU_FUNCTION void Kernel(Observation * d_obs, 
                         int * d_bucket,
                         Info ** d_summaries)
{
    Info * summary = d_summaries[threadIdx.x * 40 + threadIdx.y];

    for (int i = 0; i < MAX_OBS; i++)
    {
        if (d_obs[i].core[threadIdx.x] < (threadIdx.x + 1) * threadIdx.y)
            summary[d_bucket[i]].left++;
        else
            summary[d_bucket[i]].right++;
    }
}

int main()
{
    srand((unsigned int)time(NULL));

    // Generate dummy observations
    Observation * obs = new Observation [MAX_OBS];
    for (int i = 0; i < MAX_OBS; i++)
        for (int j = 0; j < 20; j++)
            obs[i].core[j] = (double)rand() / RAND_MAX;

    // Attribute observations to one of the buckets
    int * bucket = new int [MAX_OBS];
    for (int i = 0; i < MAX_OBS; i++)
        bucket[i] = rand() % MAX_BUCKETS;

    Info summary[MAX_BUCKETS];
    for (int i = 0; i < MAX_BUCKETS; i++)
        summary[i].left = summary[i].right = 0;

    time_t start;
    time(&start);

    // Init device objects
    Observation * d_obs;                    
    int * d_bucket; 
    Info * d_summary;
    Info ** d_summaries;

    cudaMalloc((void**)&d_obs, MAX_OBS * sizeof(Observation));
    cudaMemcpy(d_obs, obs, MAX_OBS * sizeof(Observation), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_bucket, MAX_OBS * sizeof(int));
    cudaMemcpy(d_bucket, bucket, MAX_OBS * sizeof(int), cudaMemcpyHostToDevice);
    cudaMalloc((void**)&d_summary, MAX_BUCKETS * sizeof(Info));
    cudaMemcpy(d_summary, summary, MAX_BUCKETS * sizeof(Info), cudaMemcpyHostToDevice);

    Info ** tmp_summaries = new Info * [20 * 40];
    for (int k = 0; k < 20 * 40; k++)           
        cudaMalloc((void**)&tmp_summaries[k], MAX_BUCKETS * sizeof(Info));
    cudaMalloc((void**)&d_summaries, 20 * 40 * sizeof(Info*));
    cudaMemcpy(d_summaries, tmp_summaries, 20 * 40 * sizeof(Info*), cudaMemcpyHostToDevice);
    for (int k = 0; k < 20 * 40; k++)
        cudaMemcpy(tmp_summaries[k], d_summary, MAX_BUCKETS * sizeof(Info), cudaMemcpyDeviceToDevice);

#ifdef __CUDACC__
    Kernel<<<1, dim3(20, 40, 1)>>>(d_obs, d_bucket, d_summaries);
#else
    for (int k = 0; k < 20 * 40; k++)
    {
        threadIdx.x = k / 40;
        threadIdx.y = k % 40;
        Kernel(d_obs, d_bucket, d_summaries);
    }
#endif      

    cudaMemcpy(summary, d_summary, MAX_BUCKETS * sizeof(Info), cudaMemcpyDeviceToHost);

    time_t end;
    time(&end);
    cout << "Finished calculations in " << difftime(end, start) << "s" << endl;
    cin.get();
    return 0;
}

EDIT 2: I've tried reworking the code by parallelizing the troublesome scattered memory access. In brief, my new kernel looks like this:

__global__ void Kernel(Observation * d_obs, 
                         int * d_bucket,
                         double * values,
                         Info ** d_summaries)
{
    Info * summary = d_summaries[blockIdx.x * 40 + blockIdx.y];

    __shared__ Info working_summary[1024];
    working_summary[threadIdx.x] = summary[threadIdx.x];
    __syncthreads();

    for (int i = 0; i < MAX_OBS; i++)
    {
        if (d_bucket[i] != threadIdx.x) continue;
        if (d_obs[i].core[blockIdx.x] < values[blockIdx.y])
            working_summary[threadIdx.x].left++;
        else
            working_summary[threadIdx.x].right++;
    }
    __syncthreads();

    summary[threadIdx.x] = working_summary[threadIdx.x];
} 

This takes 18 s for <<<dim3(20, 40, 1), 1000>>> and 172 s for <<<dim3(20, 40, 10), 1000>>>, which is worse than a single CPU thread and increases linearly with the number of parallel tasks.
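For completeness, the host side for this version looks roughly like this (a sketch; the threshold values and the d_values name are illustrative, and d_summaries is laid out exactly as in the code above):

// One threshold per blockIdx.y; d_summaries still holds one Info[MAX_BUCKETS]
// array per (column, threshold) pair.
double values[40];
for (int j = 0; j < 40; j++)
    values[j] = (double)(j + 1) / 41;   // illustrative thresholds in (0, 1)

double * d_values;
cudaMalloc((void**)&d_values, 40 * sizeof(double));
cudaMemcpy(d_values, values, 40 * sizeof(double), cudaMemcpyHostToDevice);

// 20 x 40 blocks, 1000 threads each (one thread per bucket).
Kernel<<<dim3(20, 40, 1), 1000>>>(d_obs, d_bucket, d_values, d_summaries);
cudaDeviceSynchronize();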

1 Answer

The K520 board you're using has two GPUs, each with 8 streaming multiprocessors and, I believe, a peak bandwidth of ~160 GB/s per GPU. With the code above you should be limited by this bandwidth, and you should be looking at getting at least 100 GB/s per GPU (though I'd target a single GPU to start). Maybe you won't be able to hit it, maybe you'll beat it, but it's a good target to aim for.

Number of blocks

The first thing to do is to fix your launch parameters. This line:

Kernel<<<1, dim3(20, 40, 1)>>>(d_obs, d_bucket, d_summaries);

means you're launching 1 CUDA block of 800 threads. This is nowhere near enough parallelism for the GPU. You need at least as many blocks as streaming multiprocessors (i.e. 8), and preferably significantly more (i.e. 100+). This will give you a large performance improvement; 800-way parallelism just isn't enough for a GPU.
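With your problem shape, one way to get there (a sketch, not your exact code; the 256-thread block size is arbitrary) is to give each (column, threshold) pair its own block rather than its own thread:

// 20 columns x 40 thresholds = 800 blocks: plenty of work for 8 SMs.
// The kernel itself has to change to match (see the sketch in the next section).
dim3 grid(20, 40, 1);   // one block per summary
dim3 block(256);        // threads in a block cooperate on the 0..MAX_OBS loop
Kernel<<<grid, block>>>(d_obs, d_bucket, d_summaries);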

Scattered Writes

GPUs can be rather sensitive to access patterns. The following code:

summary[d_bucket[i]].left++;

does a scattered 4-byte write into summary. Scattered memory transactions are expensive on the GPU, and for reasonable performance on memory-bound codes they should be avoided. What can we do about it in this case? The solution, in my opinion, is to add more parallelism. Instead of having a summary per thread, have a summary per block. Each thread can work on a subset of the range 0...MAX_OBS and increment the block-wide summary array, which should be located in shared memory. At the end of the kernel you write the result back to global memory. Happily, this also solves the lack-of-parallelism problem noted above!
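Putting both points together, a sketch of what such a kernel could look like (untested; it keeps your original signature and threshold rule, uses one block per summary, and the MAX_BUCKETS Info structs fit comfortably in about 8 KB of shared memory):

__global__ void Kernel(Observation * d_obs,
                       int * d_bucket,
                       Info ** d_summaries)
{
    // One summary per block: blockIdx.x picks the column, blockIdx.y the threshold.
    Info * summary = d_summaries[blockIdx.x * gridDim.y + blockIdx.y];

    __shared__ Info working_summary[MAX_BUCKETS];
    for (int b = threadIdx.x; b < MAX_BUCKETS; b += blockDim.x)
        working_summary[b].left = working_summary[b].right = 0;
    __syncthreads();

    // Each thread handles a strided subset of 0..MAX_OBS; the scattered
    // increments now hit shared memory rather than global memory.
    for (int i = threadIdx.x; i < MAX_OBS; i += blockDim.x)
    {
        int b = d_bucket[i];
        if (d_obs[i].core[blockIdx.x] < (blockIdx.x + 1) * blockIdx.y)
            atomicAdd(&working_summary[b].left, 1);
        else
            atomicAdd(&working_summary[b].right, 1);
    }
    __syncthreads();

    // One pass to write the finished summary back to global memory.
    for (int b = threadIdx.x; b < MAX_BUCKETS; b += blockDim.x)
        summary[b] = working_summary[b];
}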

What next?

At this point you should figure out a way to measure how much room for improvement there is. You'll want to work out how close to peak bandwidth you are (I find it's best to consider both the data you have to move and the data you are actually moving), and if you're still significantly off, look at reducing memory accesses and optimising the access patterns further, where possible.
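For the measurement itself, a minimal way to estimate the useful bandwidth is to divide the bytes the algorithm has to touch by the kernel time nvprof reports. The sketch below counts one 8-byte column value and one 4-byte bucket index per observation for each of the 800 summaries (i.e. it assumes no reuse between summaries) plus the summary write-back; adjust the byte count to match your actual access pattern.

#include <cstdio>

int main()
{
    const double kernel_time_s = 4.68899;   // Kernel time from the nvprof output above
    const double n_obs         = 1e6;       // MAX_OBS
    const double n_summaries   = 20 * 40;
    const double n_buckets     = 1000;      // MAX_BUCKETS

    // Reads per summary plus writing the summaries back (sizeof(Info) == 8).
    double bytes_needed = n_obs * n_summaries * (8.0 + 4.0)
                        + n_summaries * n_buckets * 8.0;

    printf("Useful bandwidth: %.2f GB/s\n", bytes_needed / kernel_time_s / 1e9);
    return 0;
}

With the 4.69 s kernel time above this works out to roughly 2 GB/s of useful traffic against a ~160 GB/s peak, which gives you a sense of how much headroom the restructured kernel has to work with.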