Cuda reduce kernel result off by 2


I'm following a previously answered question here about how to implement an all-reduce in CUDA, which links to a slide deck from NVIDIA. What I have works the majority of the time (when the input size is a nice power of 2), but some input sizes produce answers that seem to be off by 2.

In the reduction, once there are <= 32 elements left to reduce, the last warp is simply unrolled. The final reduction sums partial_sum[0] + partial_sum[1]. In the example below, I print the stored values before and after the final partial sum to see where the calculation deviates. Here is the output of the program:

threadIdx 0, partial_sum[0] 14710790.000000
threadIdx 1, partial_sum[1] 18872320.000000
threadIdx 0, partial_sum[0] 33583112.000000
threadIdx 1, partial_sum[1] 33059334.000000
expected 33583110, result 33583112

The first two lines are the stored partial sums. The correct result is 14710790 + 18872320 = 33583110, which should then be stored in partial_sum[0] on line 3 of the output. The stored values on the first two lines are correct, but the result of the final addition on line 3 is not.

Here is the full example, compiled with nvcc test.cu. Any guidance would be appreciated!

#include <iostream>
#include <numeric>
#include <vector>

constexpr unsigned int THREADS_PER_DIM = 512;

__global__ void reduce(float *v, float *v_r, std::size_t N, bool print) {
    // Allocate shared memory
    volatile __shared__ float partial_sum[THREADS_PER_DIM * 4];

    // Load elements AND do the first add of the reduction
    // We halve the number of blocks: each thread reduces one element from this
    // block's range and one from what would have been the next block's range
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;

    // Store first partial result instead of just the elements
    // Set to 0 first to be safe
    partial_sum[threadIdx.x + blockDim.x] = 0;
    partial_sum[threadIdx.x] = 0;
    partial_sum[threadIdx.x] = (i < N ? v[i] : 0) + ((i + blockDim.x) < N ? v[i + blockDim.x] : 0);
    __syncthreads();

    // Start at 1/2 block stride and divide by two each iteration
    // Stop early (call device function instead)
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        // Each thread does work unless it is further than the stride
        if (threadIdx.x < s) {
            partial_sum[threadIdx.x] += partial_sum[threadIdx.x + s];
        }
        __syncthreads();
    }

    if (threadIdx.x < 32) {
        partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 32];
        partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 16];
        partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 8];
        partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 4];
        partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 2];
        if (threadIdx.x < 2 && blockIdx.x == 0 && print) {
            printf("threadIdx %d, partial_sum[%d] %f\n", threadIdx.x, threadIdx.x, partial_sum[threadIdx.x]);
        }

        partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 1];

        if (threadIdx.x < 2 && blockIdx.x == 0 && print) {
            printf("threadIdx %d, partial_sum[%d] %f\n", threadIdx.x, threadIdx.x, partial_sum[threadIdx.x]);
        }
    }

    // Let thread 0 of this block write its result to main memory
    // Result is indexed by this block
    if (threadIdx.x == 0) {
        v_r[blockIdx.x] = partial_sum[0];
    }
}

constexpr std::size_t N = (1 << 13) + 4;
int main() {
    // Init and fill vec
    std::vector<float> a(N);
    for (std::size_t i = 0; i < N; ++i) {
        a[i] = static_cast<float>(i);
    }

    // allocate and copy on device
    std::size_t bytes = N * sizeof(float);
    float *d_A;
    float *d_A_reduction;
    cudaMalloc((void **)&d_A, bytes);
    cudaMalloc((void **)&d_A_reduction, bytes);
    cudaMemset(d_A_reduction, 0, bytes);
    cudaMemcpy(d_A, a.data(), bytes, cudaMemcpyHostToDevice);

    // 8192 + 4 elements with 512 threads = 16 + 1 blocks
    // Each block actually considered elements in adjacent block on first add in kernel, so we need 9 blocks
    auto num_blocks = 9;

    reduce<<<num_blocks, THREADS_PER_DIM>>>(d_A, d_A_reduction, N, false);
    cudaMemcpy(d_A, d_A_reduction, bytes, cudaMemcpyDeviceToDevice);
    reduce<<<1, THREADS_PER_DIM>>>(d_A, d_A_reduction, num_blocks, true);

    float result = 0;
    cudaMemcpy(&result, d_A_reduction, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_A);
    cudaFree(d_A_reduction);

    auto correct_result = std::accumulate(a.begin(), a.end(), 0.0);
    std::cout << "expected " << static_cast<int>(correct_result) << ", result " << static_cast<int>(result)
              << std::endl;
}
Answer by JakeTuero:

As noted in the comments, there were two issues. First, CUDA 9 changed the guarantees about threads in a warp remaining in lockstep, so the unrolled last-warp section has a race condition; this was previously answered here: Cuda min warp reduction produces race condition
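A sketch of how the unrolled last-warp section can be made safe (my adaptation, not verbatim from the linked answer): guard each step so no thread reads a slot that another thread is writing in the same step, and place `__syncwarp()` between steps, since post-CUDA-9 warps may diverge:

```cuda
// Replacement for the unrolled tail of the reduce kernel (a sketch).
// Each step halves the number of active threads, so a reader never
// overlaps a writer within a step; __syncwarp() reconverges the warp
// and makes the preceding shared-memory writes visible.
if (threadIdx.x < 32) partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 32];
__syncwarp();
if (threadIdx.x < 16) partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 16];
__syncwarp();
if (threadIdx.x < 8)  partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 8];
__syncwarp();
if (threadIdx.x < 4)  partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 4];
__syncwarp();
if (threadIdx.x < 2)  partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 2];
__syncwarp();
if (threadIdx.x < 1)  partial_sum[threadIdx.x] += partial_sum[threadIdx.x + 1];
```

With this pattern the `volatile` qualifier on the shared array is no longer load-bearing, since `__syncwarp()` provides the necessary memory ordering within the warp.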

The second issue, visible after fixing the first, was floating-point imprecision: the exact sum 32774 + 33550336 = 33583110 is not representable as a float (above 2^25, consecutive floats are 4 apart), so the addition rounds to the nearest representable value, 33583112, which matches the off-by-2.