ctrl+c not killing a process


I have a process that responds perfectly well to CTRL+C on my local machine, and it otherwise appears to work correctly.

But on an EC2 instance it freezes on CTRL+C and becomes a defunct (zombie) process.

kill -9 <PID> doesn't remove it and I have to reboot the EC2 instance to clean it up properly.

When it runs, it also loads a shared library developed in-house that I have no influence over, and I have no access to its source code to see what it's doing. This library also uses CUDA and appears to start multiple threads.

I tried installing a signal handler on the main thread, and it does get installed, but calling _exit from it doesn't shut the whole process down. It seems to still be waiting on something.

What might be happening here that prevents CTRL+C from exiting the process cleanly? Can I override or examine what the other threads are doing?
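One way to examine what the threads are doing on Linux is to read their per-thread state from /proc. The sketch below is only illustrative: thread_states is a made-up helper name, and it assumes a Linux /proc filesystem. A state of Z means the task has already exited and is waiting to be reaped, and D means uninterruptible sleep inside the kernel. In both cases SIGKILL has no visible effect.

```shell
# Hypothetical helper: print "TID<TAB>STATE<TAB>NAME" for every thread of a
# process, using the Linux /proc filesystem.
# State codes include R (running), S (sleeping), D (uninterruptible sleep),
# Z (zombie) and T (stopped).
thread_states() {
    pid="$1"
    for task in /proc/"$pid"/task/*; do
        # Skip if the process has no such directory or the thread just exited.
        [ -r "$task/status" ] || continue
        tid="${task##*/}"
        state="$(sed -n 's/^State:[[:space:]]*//p' "$task/status")"
        name="$(cat "$task/comm" 2>/dev/null)"
        printf '%s\t%s\t%s\n' "$tid" "$state" "$name"
    done
}
```

Calling thread_states <PID> on the stuck process lists one line per thread. If some threads sit in D state, the process is blocked inside a kernel call (for example in a driver) rather than in user code, which would explain why neither a signal handler nor kill -9 gets it to exit.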

Answered by hookenz

Ah, I found the problem. I'll leave the question as it stands in case it helps someone else.

It turns out that on my PC I have a GTX 680, and the driver that gets installed along with CUDA works with it. On EC2 the card is a GRID K520, and the driver installed by CUDA doesn't work with it. I downloaded and installed the latest stable card-specific driver, and it then worked.

I discovered this by running nvidia-smi: instead of printing any details about the card, it just showed "Killed". Running nvidia-smi a second time locked up the console.

Unfortunately, I hadn't tested that CUDA apps were actually working. I saw the driver print a message in the log saying it was loaded and assumed it was working.
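To check that the driver actually responds, rather than trusting the log message, something like the following can be used. It is only a sketch: probe is a made-up helper name and the 10 second limit in the usage line is an arbitrary assumption. It runs the command in the background and stops waiting after the limit, so a hung nvidia-smi can't lock up the console.

```shell
# Hypothetical helper: run a command in the background and stop waiting for
# it after LIMIT seconds. Prints "ok", "failed" or "hung".
# Usage: probe LIMIT COMMAND [ARGS...]
probe() {
    limit="$1"; shift
    status_file="$(mktemp)"
    # The subshell records the exit status when (and if) the command returns.
    # Its own output goes to /dev/null so a hung command holds nothing open.
    ( "$@"; echo "$?" >"$status_file" ) >/dev/null 2>&1 &
    waited=0
    while [ ! -s "$status_file" ]; do
        if [ "$waited" -ge "$limit" ]; then
            rm -f "$status_file"
            echo "hung"
            return 2
        fi
        sleep 1
        waited=$((waited + 1))
    done
    status="$(cat "$status_file")"
    rm -f "$status_file"
    if [ "$status" -eq 0 ]; then
        echo "ok"
    else
        echo "failed"
        return 1
    fi
}
```

For example, probe 10 nvidia-smi should print ok on a healthy driver. If it prints hung, the nvidia-smi process is left behind in the background (it may not be killable), but the terminal stays usable.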

Updating the driver consisted of downloading the latest driver from NVIDIA (use the .run version) and then unloading the existing kernel modules:

sudo modprobe -r nvidia_uvm
sudo modprobe -r nvidia
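modprobe -r refuses to remove a module that is still in use, so it can be worth confirming that both modules are really gone before running the installer. A minimal sketch, assuming Linux's /proc/modules; check_unloaded and the MODULES_FILE override are made-up names used only for illustration:

```shell
# Hypothetical helper: report any of the named kernel modules that are still
# loaded. Reads /proc/modules by default; MODULES_FILE can point at another
# file with the same format (useful for trying the function out).
check_unloaded() {
    list="${MODULES_FILE:-/proc/modules}"
    rc=0
    for mod in "$@"; do
        # Each line of /proc/modules starts with the module name and a space,
        # so "nvidia " does not match "nvidia_uvm ".
        if grep -q "^$mod " "$list"; then
            echo "$mod is still loaded"
            rc=1
        fi
    done
    return "$rc"
}
```

Running check_unloaded nvidia_uvm nvidia prints nothing and returns success when both modules have been removed.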

Finally install it with a command like:

sudo ./NVIDIA-Linux-x86_64-3xx.xx.xx.run

I then rebooted the instance and verified that the driver was working with nvidia-smi.

This link was insightful: "CUDA 7.5 unstable on EC2".