I want to intercept OpenCL programs at the PTX level on an NVIDIA GPU.
I imagine the workflow would look something like this.
First, I write an OpenCL program (both host and device code) and use the NVIDIA compiler to produce the corresponding PTX code. Then I make my changes by editing the PTX directly (please don't ask why I don't do this in the device C code; I have my reasons for it). The problem is: once the PTX has been modified, how do I compile it to binary code?
You can use ptxas, the PTX assembler included in the CUDA Toolkit. It compiles a .ptx file into a .cubin file, which can then be loaded with the CUDA driver API.
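As a rough sketch of the ptxas step, the following Python helper builds and runs the command line. The file names (kernel.ptx, kernel.cubin), the helper names, and the default architecture sm_70 are assumptions for illustration only; pick the -arch value that matches your GPU.

```python
import shutil
import subprocess


def build_ptxas_command(ptx_path, cubin_path, arch="sm_70"):
    """Return the argv list for compiling one PTX file to a cubin with ptxas.

    The default arch is only a placeholder; use the value for your GPU.
    """
    return ["ptxas", f"-arch={arch}", ptx_path, "-o", cubin_path]


def compile_ptx(ptx_path, cubin_path, arch="sm_70"):
    """Run ptxas if it is on PATH; return True on success, False otherwise."""
    if shutil.which("ptxas") is None:
        # CUDA Toolkit not installed, or its bin directory is not on PATH.
        return False
    result = subprocess.run(build_ptxas_command(ptx_path, cubin_path, arch))
    return result.returncode == 0


if __name__ == "__main__":
    # Hypothetical file names; replace with your modified PTX file.
    ok = compile_ptx("kernel.ptx", "kernel.cubin")
    print("ptxas succeeded" if ok else "ptxas missing or compilation failed")
```

On the host side, the resulting .cubin would then typically be loaded with cuModuleLoad, the kernel looked up with cuModuleGetFunction, and launched with cuLaunchKernel.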