CUDA unified memory and Windows 10


While using cudaMallocManaged() to allocate an array of structs with arrays inside, I'm getting an "out of memory" error even though I have enough free memory. Here's some code that reproduces my problem:

#include <iostream>
#include <cstdio>
#include <cuda.h>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess) 
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

#define N 100000
#define ARR_SZ 100

struct Struct
{
    float* arr;
};

int main()
{
    Struct* struct_arr;

    gpuErrchk( cudaMallocManaged((void**)&struct_arr, sizeof(Struct)*N) );
    for(int i = 0; i < N; ++i)
        gpuErrchk( cudaMallocManaged((void**)&(struct_arr[i].arr), sizeof(float)*ARR_SZ) ); //out of memory...

    for(int i = 0; i < N; ++i)
        cudaFree(struct_arr[i].arr);
    cudaFree(struct_arr);

    /*float* f;
    gpuErrchk( cudaMallocManaged((void**)&f, sizeof(float)*N*ARR_SZ) ); //this works ok
    cudaFree(f);*/

    return 0;
}

There doesn't seem to be a problem when I call cudaMallocManaged() once to allocate a single chunk of memory, as shown in the last commented-out block. I have a GeForce GTX 1070 Ti and I'm using Windows 10. A friend compiled the same code on a Linux PC and it worked correctly, while another Windows 10 PC had the same issue. WDDM TDR is deactivated. Any help would be appreciated. Thanks.

Accepted answer:

There is an allocation granularity.

This means that if you ask for 1 byte, or 400 bytes, what is actually used up is something like 65536 bytes (my original estimate here was 4096 bytes; the measurement below shows it is higher). So a bunch of very small allocations will use up memory at a much faster rate than what you would predict based on the requested allocation sizes. The solution is to not make very small allocations, but instead to allocate in larger chunks.
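To put rough numbers on it (using the ~65536-byte granularity measured later in this answer; the exact value may differ on your setup):

    requested: 100000 allocations x 400 bytes   =   40 MB
    consumed:  100000 allocations x 65536 bytes = ~6.55 GB

so the loop of small allocations can exhaust an 8GB card even though the total requested is tiny.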

An alternative strategy here would also be to flatten your allocation into one large buffer and carve out a piece of it for each of your arrays:

#include <iostream>
#include <cstdio>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line, bool abort=true)
{
   if (code != cudaSuccess)
   {
      fprintf(stderr,"GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
      if (abort) exit(code);
   }
}

#define N 100000
#define ARR_SZ 100

struct Struct
{
    float* arr;
};

int main()
{
    Struct* struct_arr;
    float* f;

    gpuErrchk( cudaMallocManaged((void**)&struct_arr, sizeof(Struct)*N) );
    gpuErrchk( cudaMallocManaged((void**)&f, sizeof(float)*N*ARR_SZ) );
    for(int i = 0; i < N; ++i)
        struct_arr[i].arr = f+i*ARR_SZ;
    cudaFree(struct_arr);
    cudaFree(f);

    return 0;
}

Since ARR_SZ is divisible by 4, the carved-out pointers can also be cast up to larger vector types such as float2 or float4, if you had any intention of accessing the data that way.
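As an illustration only (this kernel and its launch configuration are my own sketch, not part of the original code), a kernel could reinterpret each carved-out array as float4, assuming the Struct and ARR_SZ definitions from the code above:

__global__ void scale_all(Struct* s, int n, float factor)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // s[i].arr points into the flat buffer at a 16-byte-aligned offset
        // (400 bytes per struct), so it can be reinterpreted as float4
        float4* v = reinterpret_cast<float4*>(s[i].arr);
        for (int j = 0; j < ARR_SZ / 4; ++j) {
            float4 val = v[j];
            val.x *= factor; val.y *= factor; val.z *= factor; val.w *= factor;
            v[j] = val;
        }
    }
}

It could be launched as, e.g., scale_all<<<(N+255)/256, 256>>>(struct_arr, N, 2.0f); followed by gpuErrchk(cudaDeviceSynchronize()); before the buffers are freed.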

A possible reason the original code works on Linux is that managed memory on Linux, in a proper setup, can oversubscribe the GPU's physical memory, so the actual allocation limit is much higher than the on-board GPU memory would suggest. It might also be that the Linux machine had a bit more free memory, or that the allocation granularity on Linux is different (smaller).
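If you want to check whether your setup supports that kind of oversubscription, one indicator is the concurrentManagedAccess device attribute, which is typically 1 on Linux with Pascal or newer GPUs and 0 on Windows (this check is my addition, not part of the original answer):

int dev = 0, concurrent = 0;
gpuErrchk(cudaGetDevice(&dev));
gpuErrchk(cudaDeviceGetAttribute(&concurrent, cudaDevAttrConcurrentManagedAccess, dev));
// 1: managed memory can be demand-paged and oversubscribe GPU memory (typical on Linux, Pascal+)
// 0: managed allocations must fit in GPU memory when a kernel runs (typical on Windows/WDDM)
std::cout << "concurrentManagedAccess: " << concurrent << std::endl;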

Based on a question in the comments, I decided to estimate the allocation granularity, using this code:

#include <iostream>
#include <cstdio>

#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char* file, int line, bool abort = true)
{
    if (code != cudaSuccess)
    {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        if (abort) exit(code);
    }
}

#define N 100000
#define ARR_SZ 100

struct Struct
{
    float* arr;
};

int main()
{
    Struct* struct_arr;
    //float* f;

    gpuErrchk(cudaMallocManaged((void**)& struct_arr, sizeof(Struct) * N));
#if 0
    gpuErrchk(cudaMallocManaged((void**)& f, sizeof(float) * N * ARR_SZ));
    for (int i = 0; i < N; ++i)
        struct_arr[i].arr = f + i * ARR_SZ;
#else
    size_t fre, tot;
    gpuErrchk(cudaMemGetInfo(&fre, &tot));
    std::cout << "Free: " << fre << " total: " << tot << std::endl;
    for (int i = 0; i < N; ++i)
        gpuErrchk(cudaMallocManaged((void**) & (struct_arr[i].arr), sizeof(float) * ARR_SZ)); 
    gpuErrchk(cudaMemGetInfo(&fre, &tot));
    std::cout << "Free: " << fre << " total: " << tot << std::endl;

    for (int i = 0; i < N; ++i)
        cudaFree(struct_arr[i].arr);
#endif
    cudaFree(struct_arr);
    //cudaFree(f);

    return 0;
}

When I compile a debug project with that code and run it on a Windows 10 desktop with an RTX 2070 GPU (8GB of memory, the same as the GTX 1070 Ti), I get the following output:

Microsoft Windows [Version 10.0.17763.973]
(c) 2018 Microsoft Corporation. All rights reserved.

C:\Users\Robert Crovella>cd C:\Users\Robert Crovella\source\repos\test12\x64\Debug

C:\Users\Robert Crovella\source\repos\test12\x64\Debug>test12
Free: 7069866393 total: 8589934592
Free: 516266393 total: 8589934592

C:\Users\Robert Crovella\source\repos\test12\x64\Debug>test12
Free: 7069866393 total: 8589934592
Free: 516266393 total: 8589934592

C:\Users\Robert Crovella\source\repos\test12\x64\Debug>
  1. Note that on my machine there is only 0.5GB of reported free memory left after the 100,000 allocations. So if for any reason your 8GB GPU starts out with less free memory (entirely possible) you may run into an out-of-memory error, even though I did not.

  2. The calculation of the allocation granularity is as follows:

    (7069866393 - 516266393) / 100000 = 65536 bytes per allocation(!)
    

    So my previous estimate of 4096 bytes per allocation was way off, by at least 1 order of magnitude, on my machine/test setup.

  3. The allocation granularity may vary based on:

    • windows or linux
    • WDDM or TCC
    • x86 or Power9
    • managed vs ordinary cudaMalloc
    • possibly other factors (e.g. CUDA version)

    so my advice to future readers would be not to assume that it is always 65536 bytes per allocation at minimum, but to measure it on their own setup; a sketch for doing that follows.
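If you want to measure it on your own platform, here is a minimal sketch along the lines of the test above (the helper name is mine, error checking is omitted for brevity, and it should be compiled with nvcc as a .cu file):

#include <cstdio>
#include <vector>

// Estimate the bytes actually consumed per cudaMallocManaged allocation,
// based on the drop in reported free GPU memory across many small allocations.
size_t estimate_managed_granularity(int num_allocs, size_t request_bytes)
{
    std::vector<void*> ptrs(num_allocs);
    size_t free_before, free_after, total;
    cudaMemGetInfo(&free_before, &total);
    for (int i = 0; i < num_allocs; ++i)
        cudaMallocManaged(&ptrs[i], request_bytes);
    cudaMemGetInfo(&free_after, &total);
    for (int i = 0; i < num_allocs; ++i)
        cudaFree(ptrs[i]);
    return (free_before - free_after) / num_allocs;
}

int main()
{
    // same parameters as the test above: 100000 allocations of 400 bytes each
    printf("~%zu bytes consumed per allocation\n", estimate_managed_granularity(100000, 400));
    return 0;
}

Note that the result is only a rough estimate: the reported free memory can change for other reasons, and with these parameters the test itself temporarily consumes several GB of GPU memory.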