Need help understanding kcachegrind

1.6k Views Asked by At

I'm trying to understand kcachegrind, there doesn't seem to be much information out there, for example, on the left window, what is "Self", What is "incl."? (see 1 core ).

I've done some weak scaling tests, there is no communication, so my guess is it's something to do with cache misses. But from what I can see, there is the same number of data misses for both 1 core and 16 cores, see: 16 cores.

The only difference which I can see between 1 core and 16 core is that there is significantly less calls to memcpy on 16 cores (which I can explain). But I still can't work out why on one core, the execution time is 0.62 secs, whilst on 16 cores, the execution time is closer to 1 second. Each processor is doing the same amount of work. If someone could tell me what to look for in kcachegrind, that would be awesome, this is my first time using kcachegrind and valgrind.

Edit: My code concatenate matrices in compressed row format. It involves looping over the entries of the sub-matrices and using memcpy to copy the values into a result matrix. Here is the code: - I can't post more than 2 links... so I'll post it in a comment.

I've only initiated valgrind on the loop itself, the loop is also what's making the difference between 0.62 sec execution time and 1 sec execution time. The part which takes the most time is the call to memcpy (line 37 in the github gist below), when I comment that out, my code executes in less than 0.2 secs, although there is still an increase between 1 and 16 cores (about 30% increase).

I'm running my code on a haswell node, which consists of 24 cores, (two Intel® Xeon® Processor E5-2690 v3)

Each core is has 5GB of memory.

0

There are 0 best solutions below