Best strategy for grid search with CUDA

1.2k Views Asked by At

Recently I started working with CUDA and I read an introductory book on the computing language. To see if I understood it well, I considered the following problem.

Consider a function minimize f(x,y) on the grid [-1,1] X [-1,1]. This provided me with a few practical questions and I would like to have your look on things.

  1. Do I explicitly calculate the grid? If I create the grid on the CPU, then I'll have to transfer the information to the GPU. I can then use a 2D block layout and access data efficiently using texture memory. Is it then best to use square blocks or perhaps blocks of different shapes?

  2. Suppose I don't explicitly make a grid. I can assign discretise the X and Y direction with constant float arrays (which provides fast memory access) and then use 1 list of blocks.

Thanks!

1

There are 1 best solutions below

6
On

This was an interesting question for me because it represents a type of problem that I think is rare:

  1. potentially high compute load
  2. little to no data that needs to be communicated host->device
  3. very low volume of results that need to be communicated device->host

In other words, pretty much all compute, with not much dependence on data transfer, or even global memory usage/bandwidth.

Having said that, the question seems to be looking for a brute-force search approach to functional optimization/minimization, which is not an efficient technique for functions that are amenable to other optimization methods. But as a learning exercise, it's interesting (to me, anyway). It may also be useful for functions that are otherwise difficult to handle such as functions with discontinuities or other irregularities.

To answer your questions:

Do I explicitly calculate the grid? If I create the grid on the CPU, then I'll have to transfer the information to the GPU. I can then use a 2D block layout and access data efficiently using texture memory. Is it then best to use square blocks or perhaps blocks of different shapes?

I wouldn't bother calculating the grid on the CPU. (I assume by "grid" you mean the functional value of f at each point on the grid.) First of all, this is a fairly computationally intensive task - which GPUs are good at, and secondly, it is potentially a large data set, so transferring it to the GPU (so the GPU can then do the search) will take time. I propose to let the GPU do this (compute the functional value at each grid point.) Since we won't be using global access to data for this, texture memory is not an issue.

Suppose I don't explicitly make a grid. I can assign discretise the X and Y direction with constant float arrays (which provides fast memory access) and then use 1 list of blocks.

Yes, you could use a 1D array of blocks (list) or a 2D array. I don't think this significantly impacts the problem either way, and I think the 2D grid approach fits the problem better (and I think allows for slightly cleaner code) so I would suggest starting with a 2D array of blocks.

Here's a sample code that might be interesting to play with or crystallize ideas. Each thread has the responsibility to compute its respective value of x and y, and then the functional value f at that point. Then a reduction followed by a block-draining reduction is used to search over all computed values for the minimum value (in this case).

$ cat t811.cu
#include <stdio.h>
#include <math.h>
#include <assert.h>

// grid dimensions and divisions

#define XNR -1.0f
#define XPR  1.0f
#define YNR -1.0f
#define YPR  1.0f
#define DX   0.0001f
#define DY   0.0001f

// threadblock dimensions - product must be a power of 2
#define BLK_X 16
#define BLK_Y 16

// optimization functions - these are currently set for minimization

#define TST(X1,X2) ((X1)>(X2))
#define OPT(X1,X2) (X2)

// error check macro

#define cudaCheckErrors(msg) \
    do { \
        cudaError_t __err = cudaGetLastError(); \
        if (__err != cudaSuccess) { \
            fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
                msg, cudaGetErrorString(__err), \
                __FILE__, __LINE__); \
            fprintf(stderr, "*** FAILED - ABORTING\n"); \
            exit(1); \
        } \
    } while (0)

// for timing
#include <time.h>
#include <sys/time.h>
#define USECPSEC 1000000ULL

long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}

// the function f that will be "optimized"
__host__ __device__ float f(float x, float y){
  return (x+0.5)*(x+0.5) + (y+0.5)*(y+0.5) +0.1f;
}


// variable for block-draining reduction block counter
__device__ int blkcnt = 0;

// GPU optimization kernel
__global__ void opt_kernel(float * __restrict__ bf, float * __restrict__ bx, float * __restrict__ by, const float scx, const float scy){

  __shared__ float sh_f[BLK_X*BLK_Y];
  __shared__ float sh_x[BLK_X*BLK_Y];
  __shared__ float sh_y[BLK_X*BLK_Y];
  __shared__ int lblock;

// compute x,y coordinates for this thread
  float x = ((threadIdx.x+blockDim.x*blockIdx.x) * (XPR-XNR))*scx + XNR;
  float y = ((threadIdx.y+blockDim.y*blockIdx.y) * (YPR-YNR))*scy + YNR;

  int thid = (threadIdx.y*BLK_X)+threadIdx.x;
  lblock = 0;
  sh_x[thid] = x;
  sh_y[thid] = y;
  sh_f[thid] = f(x,y);  // compute functional value of f(x,y)
  __syncthreads();

// perform block-level shared memory reduction
  // assume block size is a power of 2
  for (int i = (blockDim.x*blockDim.y)>>1; i > 16; i>>=1){
    if (thid < i)
      if (TST(sh_f[thid],sh_f[thid+i])){
        sh_f[thid] = OPT(sh_f[thid],sh_f[thid+i]);
        sh_x[thid] = OPT(sh_x[thid],sh_x[thid+i]);
        sh_y[thid] = OPT(sh_y[thid],sh_y[thid+i]);}
    __syncthreads();}
  volatile float *vf = sh_f;
  volatile float *vx = sh_x;
  volatile float *vy = sh_y;
  for (int i = 16; i > 0; i>>=1)
    if (thid < i)
      if (TST(vf[thid],vf[thid+i])){
        vf[thid] = OPT(vf[thid],vf[thid+i]);
        vx[thid] = OPT(vx[thid],vx[thid+i]);
        vy[thid] = OPT(vy[thid],vy[thid+i]);}
// save block reduction result, and check if last block
  if (!thid){
    bf[blockIdx.y*gridDim.x+blockIdx.x] = sh_f[0];
    bx[blockIdx.y*gridDim.x+blockIdx.x] = sh_x[0];
    by[blockIdx.y*gridDim.x+blockIdx.x] = sh_y[0];
    int myblock = atomicAdd(&blkcnt, 1);
    if (myblock == (gridDim.x*gridDim.y-1)) lblock = 1;}
  __syncthreads();
  if (lblock){
    // do last-block reduction
    float my_x, my_y, my_f;
    int myid = thid;
    if (myid < gridDim.x * gridDim.y){
      my_x = bx[myid];
      my_y = by[myid];
      my_f = bf[myid];}
    else { assert(0);} // does not work correctly if block dims are greater than grid dims
    myid += blockDim.x*blockDim.y;
    while (myid < gridDim.x*gridDim.y){
      if TST(my_f,bf[myid]){
        my_x = OPT(my_x,bx[myid]);
        my_y = OPT(my_y,by[myid]);
        my_f = OPT(my_f,bf[myid]);}
      myid += blockDim.x*blockDim.y;}
    sh_f[thid] = my_f;
    sh_x[thid] = my_x;
    sh_y[thid] = my_y;
    __syncthreads();
    for (int i = (blockDim.x*blockDim.y)>>1; i > 0; i>>=1){
      if (thid < i)
        if (TST(sh_f[thid],sh_f[thid+i])){
          sh_f[thid] = OPT(sh_f[thid],sh_f[thid+i]);
          sh_x[thid] = OPT(sh_x[thid],sh_x[thid+i]);
          sh_y[thid] = OPT(sh_y[thid],sh_y[thid+i]);}
      __syncthreads();}
    if (!thid){
      bf[0] = sh_f[0];
      bx[0] = sh_x[0];
      by[0] = sh_y[0];
      }
    }
}

// cpu (naive,serial) function for comparison

float3 opt_cpu(){
  float optx = XNR;
  float opty = YNR;
  float optf = f(optx,opty);
  for (float x = XNR; x < XPR; x += DX)
    for (float y = YNR; y < YPR; y += DY){
      float test = f(x,y);
      if (TST(optf,test)){
        optf = OPT(optf,test);
        optx = OPT(optx,x);
        opty = OPT(opty,y);}}
  return make_float3(optf, optx, opty);
}

int main(){

// compute threadblock and grid dimensions
  int nx = ceil(XPR-XNR)/DX;
  int ny = ceil(YPR-YNR)/DY;
  int bx = ceil(nx/(float)BLK_X);
  int by = ceil(ny/(float)BLK_Y);
  dim3 threads(BLK_X, BLK_Y);
  dim3 blocks(bx, by);
  float *d_bx, *d_by, *d_bf;
  cudaFree(0);
// run GPU test case
  unsigned long gtime = dtime_usec(0);
  cudaMalloc(&d_bx, bx*by*sizeof(float));
  cudaMalloc(&d_by, bx*by*sizeof(float));
  cudaMalloc(&d_bf, bx*by*sizeof(float));
  opt_kernel<<<blocks, threads>>>(d_bf, d_bx, d_by, 1.0f/(blocks.x*threads.x), 1.0f/(blocks.y*threads.y));
  float rf, rx, ry;
  cudaMemcpy(&rf, d_bf, sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(&rx, d_bx, sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(&ry, d_by, sizeof(float), cudaMemcpyDeviceToHost);
  cudaCheckErrors("some error");
  gtime = dtime_usec(gtime);
  printf("gpu val: %f, x: %f, y: %f, time: %fs\n", rf, rx, ry, gtime/(float)USECPSEC);
//run CPU test case
  unsigned long ctime = dtime_usec(0);
  float3 cpu_res = opt_cpu();
  ctime = dtime_usec(ctime);
  printf("cpu val: %f, x: %f, y: %f, time: %fs\n", cpu_res.x, cpu_res.y, cpu_res.z, ctime/(float)USECPSEC);

  return 0;
}
$ nvcc -O3 -o t811 t811.cu
$ ./t811
gpu val: 0.100000, x: -0.500000, y: -0.500000, time: 0.193248s
cpu val: 0.100000, x: -0.500017, y: -0.500017, time: 2.810862s
$

Notes:

  1. This problem is set up to find the minimum value of f(x,y) = (x+0.5)^2 + (y+0.5)^2 + 0.1 over the domain: x(-1,1), y(-1,1)
  2. The test was run on Fedora 20, CUDA 7, Quadro5000 GPU (cc2.0) and a Xeon X5560 2.8GHz CPU. Different CPU or GPU will obviously affect the comparison.
  3. The observed speedup here is about 14x. The CPU code is a naive, single threaded code.
  4. It should be possible, for example, via modification of the OPT and TST macros, to perform a different kind of optimization - such as maximum instead of minimum.
  5. The domain (and grid) dimensions and granularity to search over can be modified by the compile time constants such as XNR, XPR, etc.