I have a process that responds perfectly well to CTRL+C on my local machine, and it otherwise appears to work correctly there.
But on an EC2 instance it freezes and becomes a defunct or zombie process.
kill -9 <PID> doesn't remove it and I have to reboot the EC2 instance to clean it up properly.
When it runs, it also loads a shared library that was developed in-house. I have no influence over it and no access to its source code, so I can't see what it's doing. This library also uses CUDA and appears to start multiple threads.
I tried installing a signal handler on the main thread, and it does get installed, but calling _exit doesn't shut the whole process down; it seems to still be waiting on something.
What might be happening here that is preventing CTRL+C from exiting the process cleanly? Can I override or examine what the other threads are doing?
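For anyone who wants to examine the other threads without access to the library's source, one option on Linux is to read each thread's state from /proc. This is only a sketch of that idea, not something taken from the original investigation, and thread_states is a hypothetical helper name. A thread reported in state D (uninterruptible sleep) is blocked inside the kernel, which is typically why even kill -9 has no visible effect until that call returns.

```python
import os

def thread_states(pid):
    """Return {tid: state} for every thread of `pid`, read from /proc (Linux only)."""
    states = {}
    task_dir = f"/proc/{pid}/task"
    for tid in os.listdir(task_dir):
        try:
            with open(f"{task_dir}/{tid}/status") as fh:
                for line in fh:
                    # The line looks like "State:\tS (sleeping)".
                    if line.startswith("State:"):
                        states[int(tid)] = line.split(":", 1)[1].strip()
                        break
        except (FileNotFoundError, ProcessLookupError):
            # The thread exited between listing the directory and reading it.
            continue
    return states
```

Calling thread_states with the PID of the stuck process and printing the result shows one entry per thread; the main thread's ID is the PID itself.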
Ah, I found the problem. I'll leave the question as it stands in case it helps someone else.
It turns out that my PC has a GTX 680, and the driver that gets installed along with CUDA works for it. On EC2 the card is a GRID K520, and the driver installed by CUDA doesn't work with it. I downloaded and installed the latest stable card-specific driver and it then worked.
The discovery was made after running nvidia-smi: it wouldn't print any details about the card but would just show Killed. Running nvidia-smi again would lock up the console.
Unfortunately, I hadn't tested that CUDA apps were working. I had relied on the driver printing a message in the log saying it was loaded and assumed it was working.
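Since a log message alone turned out to be misleading, a quick health check that cannot itself lock up is worth having. The following is a minimal sketch under my own assumptions: probe is a hypothetical helper, and the 10 second default timeout is arbitrary. It starts a command, waits a bounded time, and reports whether it exited normally, was killed by a signal, or hung.

```python
import subprocess

def probe(cmd, timeout=10.0):
    """Run `cmd` and classify the outcome without blocking past `timeout` seconds."""
    try:
        proc = subprocess.Popen(
            cmd,
            stdin=subprocess.DEVNULL,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except FileNotFoundError:
        return "missing"
    try:
        rc = proc.wait(timeout=timeout)
    except subprocess.TimeoutExpired:
        # Deliberately no kill-and-wait here: a child stuck inside the driver
        # may never be reaped, and waiting on it would hang this script too.
        return "hung"
    if rc == 0:
        return "ok"
    if rc < 0:
        # A negative return code means the child was terminated by that signal.
        return f"signal {-rc}"
    return f"exit {rc}"
```

Calling probe(["nvidia-smi"]) on a healthy instance should return "ok". A result of "signal 9" would correspond to the shell printing Killed, and "hung" to the console lock-up described above.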
Updating the driver consisted of downloading the latest driver from NVIDIA (use the .run version). Then:
Finally install it with a command like:
I then rebooted the instance and verified the driver with nvidia-smi.
This link was insightful - CUDA 7.5 unstable on EC2