Why does "setne %al" use "a lot of cycles" in perf annotation?


I was very confused when I saw this perf report. I have tried it several times, and this setne instruction always accounts for the largest share of samples in the function. The function is large; only a small piece of it is shown below.

The report is produced with:

perf record ./test

I then inspect the result with:

perf report --showcpuutilization

I opened the annotation for one of my most expensive functions, which is very large; a small piece of it is shown in the figure: [figure: perf annotate output]

From it, we can see that the setne instruction (about line 10 from the top, shown in red) accounts for about 9% of the cycles.

Could anyone help me understand why this "simple" instruction costs so much time? Maybe it is related to pipeline ordering, with dependencies on other instructions? Thanks in advance!

BTW: the program was compiled with the command below on the x86_64 architecture:

gcc -g -pg -m32 -o test test.c

Below is the CPU information:

processor       : 0
vendor_id       : GenuineIntel
cpu family      : 6
model           : 63
model name      : Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
stepping        : 2
microcode       : 0x1
cpu MHz         : 2494.222
cache size      : 16384 KB
physical id     : 0
siblings        : 1
core id         : 0
cpu cores       : 1
apicid          : 0
initial apicid  : 0
fpu             : yes
fpu_exception   : yes
cpuid level     : 13
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ss syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon rep_good nopl xtopology eagerfpu pni pclmulqdq vmx ssse3 fma cx16 pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand hypervisor lahf_lm abm invpcid_single ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid xsaveopt arat md_clear spec_ctrl intel_stibp
bogomips        : 4988.44
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

There is 1 best solution below


This is only a rough answer:

  • "perf" works by sampling. At each sample, it reads the current EIP value and records it.
  • The percentage shown for an instruction is the number of samples in which EIP pointed at that address, divided by the total number of samples in the scope. When a previous instruction is slow, EIP stays at the instruction waiting on it, so that instruction collects the samples.
  • On some modern CPUs, the reported hot spot can be a few instructions after the real "blocking point" (an effect known as "skid"). So it is usually worth looking back at the preceding instructions to see whether one of them could be causing the delay.
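
To make the points above concrete, here is a toy model in Python. It is only a sketch of the idea, not how perf or the CPU actually works: the instruction sequence, the latencies, and the names TRACE and sample_profile are all made up for illustration, since the real code around the setne is only visible in the screenshot. It takes one sample every fixed number of cycles and blames whichever instruction is "current" at that moment, optionally shifted forward by a skid of a few instructions.

```python
# Toy model of sample-based profiling. This is NOT how perf is implemented;
# the instruction sequence and latencies below are hypothetical.
# Each entry: (instruction text, cycles before execution moves past it).
TRACE = [
    ("mov 0x8(%ebp),%edx", 1),
    ("cmpl $0x0,(%edx)", 40),  # assumed slow, e.g. its load misses the cache
    ("setne %al", 1),
    ("movzbl %al,%eax", 1),
]


def sample_profile(trace, iterations, period, skid=0):
    """Return the percentage of samples attributed to each instruction.

    One sample is taken every `period` cycles. `skid` shifts the blamed
    instruction forward by that many positions, mimicking the delay
    between the sampling event and the address that gets recorded.
    """
    counts = {name: 0 for name, _ in trace}
    cycle = 0
    next_sample = period
    for _ in range(iterations):
        for index, (name, latency) in enumerate(trace):
            cycle += latency
            # Every sample that fell inside this instruction's time window
            # is charged to it (or to a later instruction when skid > 0).
            while next_sample <= cycle:
                blamed = trace[(index + skid) % len(trace)][0]
                counts[blamed] += 1
                next_sample += period
    total = sum(counts.values())
    return {name: 100.0 * n / total for name, n in counts.items()}


if __name__ == "__main__":
    for skid in (0, 1):
        print(f"skid={skid}")
        for name, percent in sample_profile(TRACE, 1000, 7, skid).items():
            print(f"  {percent:5.1f}%  {name}")
```

In this model the cheap setne collects almost all of the samples as soon as skid is 1, even though the time is really spent on the instruction before it. That is the pattern to look for in the annotation: check what the instructions just before the red line are waiting on.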

References: https://perf.wiki.kernel.org/index.php/Tutorial#Sampling_with_perf_record