How can I debug code 700 "illegal memory access" aka `CUDA_EXCEPTION_14, Warp Illegal Address`?


My code is showing

CUDA error calling "cudaStreamSynchronize(0)", code is 700 an illegal memory access was encountered on 489
 0# my_func(signed char const*, unsigned char const*, int*, int*, int, int) in libthing.so

How can I debug this?


Accepted answer

While using cuda-gdb to debug kernels is generally a good skill to develop (see Richard's answer), there is an easier, less interactive way of debugging this kind of memory access error. NVIDIA has shipped a memory-checking tool with the CUDA Toolkit for a long time: first as cuda-memcheck (no longer available as of CUDA 12) and nowadays as part of the compute-sanitizer tool (available since CUDA 11).

Both tools are used by passing them the program to check, which is then run immediately with no further user input required. That makes them suitable for, e.g., automated CI jobs that guard against correctness regressions. Different checkers can be applied, but memcheck is the default checker and needs no extra command-line arguments.
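For example, checking a program for memory errors is a one-liner (my_program.exe is a placeholder binary name):

compute-sanitizer ./my_program.exe
compute-sanitizer --tool memcheck ./my_program.exe   # equivalent, with the tool named explicitly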

If you compile a debug build (-G for device code) or just compile with -lineinfo, these tools can not only tell you which thread in which kernel performed an illegal access of which kind and size, but also pinpoint the line of code where that access happens.
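A sketch of both options (file names are placeholders):

nvcc -lineinfo my.cu -o my_program.exe   # release build with line info for the sanitizer
nvcc -g -G my.cu -o my_program.exe       # full debug build (host and device)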

The Compute Sanitizer documentation also gives an example of what the output looks like:

========= Invalid __global__ write of size 4 bytes
=========     at unaligned_kernel():0x160 in memcheck_demo.cu:6
=========     by thread (0,0,0) in block (0,0,0)
=========     Address 0x7f6510c00001 is misaligned

So memcheck_demo.cu:6 points to line 6 of the demo source code.
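For reference, the offending line in that demo is a deliberately misaligned write, along these lines (paraphrased from the documentation's memcheck_demo.cu):

__device__ int x;

__global__ void unaligned_kernel(void) {
    *(int*) ((char*) &x + 1) = 42;   // 4-byte write to an address that is not 4-byte aligned
}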

Further demonstrations can be found in the NVIDIA/compute-sanitizer-samples repository.

Another answer

Debugging illegal memory access / Warp Illegal Address

Here's a comprehensive guide to discovering what will probably be a stupid mistake. The first two steps are somewhat superfluous because we'll ultimately use a debugger, but they're very good ideas for helping isolate a variety of problems.

1. Make sure you're checking all CUDA API calls for errors on the host, including kernel launches

My output looked like this:

CUDA error calling "cudaStreamSynchronize(0)", code is 700 an illegal memory access was encountered on 489

where Line 489 is the first time I checked a CUDA API call for errors. This has almost nothing to do with where the error happened.
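To see why, consider a minimal (hypothetical) repro. Kernel execution is asynchronous, so the launch itself returns cudaSuccess and the illegal access typically only surfaces at the next synchronizing call:

#include <cstdio>
#include <cuda_runtime.h>

__global__ void oob_kernel(int* p) {
  p[1 << 20] = 42;  // out-of-bounds write: only 4 ints were allocated
}

int main() {
  int* d = nullptr;
  cudaMalloc(&d, 4 * sizeof(int));
  oob_kernel<<<1, 1>>>(d);                     // the launch itself returns cudaSuccess
  cudaError_t err = cudaStreamSynchronize(0);  // error 700 is reported here instead
  printf("%d: %s\n", err, cudaGetErrorString(err));
  return 0;
}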

Since CUDA errors raised on the GPU only become visible when we check for them, we need to make sure our code checks any time we interact with the GPU.

To do so, I use the following header file to define error-checking functions:

#pragma once

#include <boost/stacktrace.hpp>
#include <cuda_runtime.h>
#include <iostream>
#include <stdexcept>

// Helpers for turning tokens such as __LINE__ into string literals if needed.
#define STRINGIZE_DETAIL(x) #x
#define STRINGIZE(x) STRINGIZE_DETAIL(x)

// Wraps any CUDA runtime call that returns a cudaError_t. The result is stored
// directly so the failing call's own error code is reported; cudaGetLastError()
// could return a different error and also clears the error state.
#define CUDA_CHECK(call)                                                                                               \
  do {                                                                                                                 \
    const cudaError_t err = (call);                                                                                    \
    if (err != cudaSuccess) {                                                                                          \
      std::cerr << "CUDA error calling \"" #call "\", code is " << err << " " << cudaGetErrorString(err) << " on "     \
                << __LINE__ << "\n"                                                                                    \
                << boost::stacktrace::stacktrace() << std::endl;                                                       \
      throw std::runtime_error("Problem.");                                                                            \
    }                                                                                                                  \
  } while (0)

// Kernel launches don't return an error code, so query the last error instead.
#define CUDA_KERNEL_LAUNCH_CHECK()                                                                                     \
  do {                                                                                                                 \
    const auto cuda_err = cudaGetLastError();                                                                          \
    if (cuda_err != cudaSuccess) {                                                                                     \
      throw std::runtime_error(std::string("CUDA kernel launch failed! ") + cudaGetErrorString(cuda_err));             \
    }                                                                                                                  \
  } while (0)

If your code looked like this beforehand:

cudaMemcpy(dst, src, 20*sizeof(float), cudaMemcpyHostToDevice);
my_kernel<<<blocks, threads, shared_mem, stream>>>(dst, result);

after adding error checking it will look like this:

CUDA_CHECK(cudaMemcpy(dst, src, 20*sizeof(float), cudaMemcpyHostToDevice));
my_kernel<<<blocks, threads, shared_mem, stream>>>(dst, result);
CUDA_KERNEL_LAUNCH_CHECK();

Note that you want to have a CUDA_KERNEL_LAUNCH_CHECK() immediately following every kernel launch.

Now that your code is checking for errors you'll see problems closer to where they happened.

2. Use CUDA_LAUNCH_BLOCKING

Run your program with

CUDA_LAUNCH_BLOCKING=1 ./my_program.exe

CUDA_LAUNCH_BLOCKING=1 makes every kernel launch synchronous: the host waits for each kernel to finish before moving on to the next line. Since every kernel launch is now followed by CUDA_KERNEL_LAUNCH_CHECK(), this tells you exactly which kernel caused the problem.
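If setting environment variables is awkward (e.g. the failure only reproduces inside a larger harness), you can get the same effect for a suspect kernel by synchronizing manually right after the launch, so execution errors surface immediately:

my_kernel<<<blocks, threads, shared_mem, stream>>>(dst, result);
CUDA_KERNEL_LAUNCH_CHECK();            // catches launch-configuration errors
CUDA_CHECK(cudaDeviceSynchronize());   // forces asynchronous execution errors to surface here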

3. Compile in debug mode

Compile your code like

nvcc -g -G my.cu source.cu files.cu

Here -g adds host-side debug symbols and -G adds device-side debug symbols (disabling most device optimizations). This ensures that you'll have access to the CUDA source code inside the debugger and be able to step through it.

4. Use cuda-gdb

Run your code like so.

cuda-gdb --silent --ex run --args ./my_program.exe

Whereas running the program without the debugger only pointed at a line in the C++ host source:

CUDA error calling "cudaStreamSynchronize(0)", code is 700 an illegal memory access was encountered on 489

running the program in cuda-gdb gives:

CUDA Exception: Warp Illegal Address
The exception was triggered at PC 0x10000320778 (cuda.cu:414)

Thread 1 "benchmark.exe" received signal CUDA_EXCEPTION_14, Warp Illegal Address.
[Switching focus to CUDA kernel 0, grid 15, block (0,0,0), thread (0,0,0), device 0, sm 0, warp 0, lane 0]
0x000001000031f450 in my_kernel<<<(1,1,1),(128,1,1)>>> (rec=0x7fffc7c00000 '\377' <repeats 200 times>, arg1=0x7fffce800000, arg2=0x7fffc7de8a00, 
    arg3=0x555555a7a9a0, arg4=0x7fffc7cf4400 "", arg5=1000, arg6=1000) at my_source_file.cu:382
382       const auto ci = order[i];

The debugger has identified that Line 414 is the problem - and it is!

That line looks like this:

array[i] = val;

We suspect the assignment is causing the problem because it involves writing to an address. But why?

Note that all of the pointer arguments begin with 0x7fffc except for arg3, which is 0x555555a7a9a0. This isn't hexadecimal garbage: it's the source of the problem. CUDA hasn't allocated a single block of memory at 0x555555a7a9a0, far distant from all the other allocations. Instead, a host pointer has been passed to the device, and dereferencing it there causes the failure. That is why its value looks so different from the others.
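The usual fix is to allocate device memory, copy the host data over, and pass the device pointer to the kernel. A minimal sketch, assuming arg3 originally pointed at n ints on the host (h_arg3, n, grid, and block are hypothetical names):

int* d_arg3 = nullptr;
CUDA_CHECK(cudaMalloc(&d_arg3, n * sizeof(int)));
CUDA_CHECK(cudaMemcpy(d_arg3, h_arg3, n * sizeof(int), cudaMemcpyHostToDevice));
// pass d_arg3 (not h_arg3) to the kernel
my_kernel<<<grid, block>>>(rec, arg1, arg2, d_arg3, arg4, arg5, arg6);
CUDA_KERNEL_LAUNCH_CHECK();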

Note that although the debugger appears to have stopped at

382       const auto ci = order[i];

this line has nothing to do with the problem. I'm not sure why the debugger stopped here; the exception PC (cuda.cu:414) is the location to trust.
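Once cuda-gdb has stopped like this, the usual gdb commands work for confirming the diagnosis, along with cuda-gdb's CUDA-specific ones. A few I find useful (standard cuda-gdb commands; the exact output will vary):

(cuda-gdb) bt                    # backtrace of the faulting thread
(cuda-gdb) info cuda kernels     # which kernels are resident on the device
(cuda-gdb) info cuda threads     # where each device thread is stopped
(cuda-gdb) print array           # inspect variables in the faulting frame
(cuda-gdb) print &array[i]       # the address the failing access would touch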