I am adding usage of a small library to a large existing piece of software and would like to analyze (in finder detail than just in&out rdtsc() or gettimeofday calls) the overhead and it's attribution of the small library. Using things like rdtsc() I can get a sense of the latency that calling my libraries functions have, but I cannot do latency attribution unless I am also able to see whether branches are not being predicted well, caching isnt working properly, etc..I looked into PAPI as I imagined looking at a certain hardware events going into and out of a routine in my library within the context of the bigger binary but it seems I would need a specific kernel module for PAPI to work for me (Linux 2.6.18 && Intel Xeon 5570)...there is Vtune which is specifically geared for intel processors but it seems like it's something which would profile the entire binary for performance and not specific code snippets (the 3-4 calls into my library).
Is there a way for me to use Vtune for my goal, or possibly something which can give me access to such counters without having to patch my kernel?
It's not possible to define an entry point in vtune to start recording.
However, what you can do, is to start the trace without recording and then when you expect to hit your library, you start the trace and let it record the calls. After the calls, you may stop it again and can now look up the library call using the top-bottom tab in vtune.
With it, you should be able to see all the information regarding the calls, and the time spent in each.
If you want to be sure that you only trace while the calls are active, you may start the application under gdb and insert breakpoints when accessing and leaving the functions you wish to examine and then start and stop the profiler appropriately.