Callgrind performance analysis with cycle detection


I'm trying to use Callgrind/KCachegrind for the first time to profile my C++ application, and I noticed that the two functions that take the most time are:

  1. < cycle 1 > (50% self)
  2. do_lookup_x (15% self)

Now, from my understanding, cycle 1 is related to estimating the time taken by recursively called functions, but it is not clear to me how I should interpret such a high amount of time spent there. If there are cycles, I would like to see which function is called most often and which takes the most CPU time in the end. If I disable Cycle Detection (View->Cycle Detection), then cycle 1 disappears, but the "Self" times sum up to roughly 60%, and I'm not sure this is the best thing to do. Regarding do_lookup_x, I'm totally clueless...

Can you clarify a bit how I should interpret these results?

Thanks in advance.

2 Answers


Cycles may be detected incorrectly in KCachegrind; see the Callgrind manual (http://valgrind.org/docs/manual/cl-manual.html#cl-manual.cycles):

6.2.4. Avoiding cycles

Informally speaking, a cycle is a group of functions which call each other in a recursive way. ...

Cycles are not bad in themselves, but they tend to make performance analysis of your code harder. This is because inclusive costs for calls inside a cycle are meaningless. The definition of inclusive cost, i.e. the self cost of a function plus the inclusive cost of its callees, needs a topological order among functions. For cycles, this does not hold: the callees of a function in a cycle include the function itself. Therefore, KCachegrind does cycle detection and skips visualization of any inclusive cost for calls inside cycles. Further, all functions in a cycle are collapsed into artificial functions with names like Cycle 1.

Now, when a program exposes really big cycles (as is true for some GUI code, or in general for code using an event- or callback-based programming style), you lose the nice property of being able to pinpoint bottlenecks by following call chains from main, guided by inclusive cost. In addition, KCachegrind loses its ability to show interesting parts of the call graph, as it uses inclusive costs to cut off uninteresting areas.

Despite the meaninglessness of inclusive costs inside cycles, this big drawback for visualization motivates the option to temporarily switch off cycle detection in KCachegrind, even though doing so can lead to misleading visualizations. However, cycles often appear because of an unlucky superposition of independent call chains, such that the profile result appears to contain a cycle. Neglecting uninteresting calls with very small measured inclusive cost would break these cycles. In such cases, handling cycles "incorrectly" by not detecting them still gives a meaningful profiling visualization.
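
To see where such a cycle comes from, here is a minimal, self-contained C++ sketch (the is_even/is_odd names are made up purely for illustration) of the kind of mutual recursion that Callgrind would collapse into an artificial entry such as < cycle 1 > when cycle detection is on:

    #include <cstdio>

    // Two functions that call each other recursively: a textbook cycle.
    // KCachegrind cannot assign a meaningful inclusive cost to either one
    // on its own, so with cycle detection enabled both are collapsed into
    // a single artificial "< cycle N >" function.
    bool is_odd(unsigned n);

    bool is_even(unsigned n) {
        return n == 0 ? true : is_odd(n - 1);    // calls is_odd ...
    }

    bool is_odd(unsigned n) {
        return n == 0 ? false : is_even(n - 1);  // ... which calls is_even again
    }

    int main() {
        unsigned evens = 0;
        for (unsigned i = 0; i < 10000; ++i)
            evens += is_even(i);                 // enough work for the profiler to see
        std::printf("%u\n", evens);
        return 0;
    }

Profiling this with something like valgrind --tool=callgrind ./your_app and opening the output in KCachegrind should show the two functions merged into one cycle; toggling View->Cycle Detection splits them apart again, with only their "Self" costs remaining trustworthy.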

Try turning off Cycle Detection in KCachegrind's View menu and look at the "Self" time column, since the "Incl" (inclusive) column will be incorrect.

You can also try another profiler that saves exact, full call stacks. Many of the profilers supported by the https://github.com/jrfonseca/gprof2dot script save full stacks, not only the caller-callee pairs stored in the callgrind/cachegrind format.


I agree with @osgx that you need a different profiler, one that captures entire call stacks.

Then the inclusive time percent of a function is a very simple number: it is just the fraction of stack samples in which that function appears, regardless of how many times it appears within a single sample (a small sketch of this counting rule follows the list below).

Here's a way to think of it.
- Suppose samples are taken every 10ms, for a total of 100 seconds, or 10,000 samples.
- Suppose function Foo appears on 30% of those samples, either once or more than once.
- That means if you could change Foo so it takes almost no time, such as by passing it off to a very fast sub-processor, then no samples would see it, because it would never be on the stack long enough for a sample to hit it.
- So those 30% of samples would simply disappear, and the program would take 70 seconds instead of 100.
- That means Foo is personally responsible for 30% of the time (regardless of recursion).
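
As a tiny illustration of that counting rule (hypothetical samples; Bar, Baz and Qux are made-up names), a sample is counted for Foo if Foo appears anywhere on its stack, once or many times:

    #include <algorithm>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main() {
        // Hypothetical stack samples: each sample lists the functions on the
        // stack at one moment (outermost first). Recursion can make the same
        // function show up in several frames of a single sample.
        std::vector<std::vector<std::string>> samples = {
            {"main", "Foo", "Bar"},
            {"main", "Baz"},
            {"main", "Foo", "Foo", "Qux"},   // Foo appears twice, counted once
            {"main", "Baz", "Foo"},
        };

        int hits = 0;
        for (const auto& s : samples)
            if (std::find(s.begin(), s.end(), "Foo") != s.end())
                ++hits;                      // Foo is somewhere on this stack

        // Inclusive percentage = fraction of samples that contain the function.
        std::printf("Foo inclusive: %.0f%%\n", 100.0 * hits / samples.size());  // 75%
        return 0;
    }

With 10,000 real samples instead of four, that percentage is exactly the inclusive time described in the list above.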

Actually, I prefer this method, because I'm more interested in finding out what the problem is, rather than whether it takes 29% or 31%. It takes whatever it takes, and what it takes will not be affected by how precisely it gets measured.