Multiple CUDA streams crashing GPU

This is a continuation of this post.

It seems that adding volatile solved one special case, but now something else has broken. If I add anything between the two kernel calls, the program reverts to the old behavior: it freezes and then prints everything at once. Adding sleep(2); between set_flag and read_flag reproduces this. Also, when the same code is put into another program, it causes the GPU to lock up. What am I doing wrong now?

Thanks again.


This is caused by an interaction between X and the display driver, and between the standard output queue and the graphical display driver.

A few experiments you can try (with the sleep(2); added between the set_flag and read_flag kernels):

  1. Log into your machine over the network via ssh from another machine. I think your program will work. (X is not involved in the display in this case.)
  2. Comment out the line that prints "Starting...". I think your program will then work. (This avoids the display driver/print queue deadlock, see below.)
  3. Add a sleep(2); between the "Starting..." print line and the first kernel. I think your program will then work. (This allows the display driver to fully service the first printout before the first kernel is launched, so there is no CPU thread stall.)
  4. Stop X and run from a console. I think your program will work.
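
As a rough illustration of experiment 3, here is a minimal sketch of where the extra sleep(2); would go. The kernels first_kernel and second_kernel are hypothetical stand-ins for your set_flag and read_flag (their bodies are made up and terminate immediately), and the fflush call is my own addition, not something the experiment requires.

```cuda
#include <cstdio>
#include <unistd.h>        // sleep(); assumes a POSIX system
#include <cuda_runtime.h>

// Hypothetical stand-ins for the question's set_flag and read_flag kernels;
// the real kernel bodies are not reproduced here.
__global__ void first_kernel(volatile int *flag) { *flag = 1; }
__global__ void second_kernel(volatile int *flag, int *out) { *out = *flag; }

int main()
{
    int *flag = nullptr, *out = nullptr;
    cudaMalloc(&flag, sizeof(int));
    cudaMalloc(&out, sizeof(int));
    cudaMemset(flag, 0, sizeof(int));
    cudaMemset(out, 0, sizeof(int));

    cudaStream_t s1, s2;
    cudaStreamCreate(&s1);
    cudaStreamCreate(&s2);

    printf("Starting...\n");
    fflush(stdout);  // assumption: hand the text to the OS right away
    sleep(2);        // experiment 3: let the display catch up before any kernel runs

    first_kernel<<<1, 1, 0, s1>>>(flag);
    sleep(2);        // the delay from the question
    second_kernel<<<1, 1, 0, s2>>>(flag, out);

    // Wait for both streams and report any launch or execution error.
    cudaError_t err = cudaDeviceSynchronize();
    printf("Finished: %s\n", cudaGetErrorString(err));

    cudaStreamDestroy(s1);
    cudaStreamDestroy(s2);
    cudaFree(flag);
    cudaFree(out);
    return err == cudaSuccess ? 0 : 1;
}
```

Note that fflush only pushes the text out of the C library buffer; it does not guarantee that the terminal has drawn it, which is why the sleep is still there.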

When the GPU is both hosting an X display and also running CUDA tasks, it has to switch between the two. For the duration of the CUDA task, ordinary display processing is suspended. You can read more about this here.
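
If you are unsure whether a given GPU is subject to this, one thing you could check is the kernel run time limit attribute, which is the same value the deviceQuery sample reports as "Run time limit on kernels". A minimal sketch follows; that this attribute is set when the GPU is also driving a display is my assumption about typical setups, not a guarantee.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaError_t err = cudaGetDeviceCount(&count);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceCount failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int dev = 0; dev < count; ++dev) {
        int timeout = 0;
        // Nonzero means kernels on this device are subject to a run time limit.
        // Assumption: this usually indicates the GPU is shared with a display.
        err = cudaDeviceGetAttribute(&timeout, cudaDevAttrKernelExecTimeout, dev);
        if (err != cudaSuccess) {
            fprintf(stderr, "device %d: %s\n", dev, cudaGetErrorString(err));
            continue;
        }
        printf("Device %d: run time limit on kernels: %s\n",
               dev, timeout ? "yes" : "no");
    }
    return 0;
}
```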

The problem here is that when running X, the first printout is sent to the print queue but not actually displayed before the first kernel is launched. This is evident because you don't see the printout before the display freezes. After that, the CPU thread stalls waiting for the text to be displayed, so the second kernel does not start. The intervening sleep(2); and its interaction with the OS is enough for this stall to occur. Meanwhile, the executing first kernel has the display driver "stopped" for ordinary display tasks, so the OS never gets past its stall and the second kernel never gets launched, leading to the apparent hang.

Note that options 1, 2, or 3 in the linked custhelp article would be effective in your case. Option 4 would not.