I am trying to use Linux perf to profile the L3 cache bandwidth for a Python script. I see that there are no commands available to measure that directly, but I know how to get the LLC performance counters using the command below. Can anyone tell me how to calculate the L3 cache bandwidth from the perf counters, or point me to any tools that can measure L3 cache bandwidth? Thanks in advance for the help.
perf stat -e LLC-loads,LLC-load-misses,LLC-stores,LLC-prefetches python hello.py
Update: perf has changed. Now you want perf stat with -M tma_info_memory_core_l3_cache_access_bw for L3 bandwidth, or -M tma_info_memory_core_l3_cache_fill_bw for DRAM bandwidth (L3 fill = misses, I think?). Or better, -M tma_info_system_dram_bw_use should be more accurate, but it only works system-wide:

perf stat -a -M tma_info_system_dram_bw_use -e task-clock,page-faults,cycles,instructions

It seems they measure total read+write bandwidth, and I think "access" bandwidth might be counting reads+writes from the cores plus dirty write-back to DRAM.

I tested with the code from "There is a huge speed difference between reading and writing in DRAM, is this normal?" (with write before read to avoid CoW mapping to the same physical page of zeros), with EPP = performance to avoid downclocking. Actually I commented out read so the process would spend its whole time in the write test, allowing easy use of perf. I measured 22.84 tma_info_memory_core_l3_cache_fill_bw during the write test, while intel_gpu_top showed peaks of 14+ GB/s read + 14+ GB/s write, with a lower average once startup is included. And 37.36 tma_info_memory_core_l3_cache_access_bw during the same test (both metric-groups active in the same perf run). 29.11 tma_info_system_dram_bw_use seems more like the sum of DRAM read+write bandwidths, so I'd trust that. (All the numbers in this paragraph came from the same run, and run-to-run is fairly consistent, within +- 0.5 GB/s.)

There should be negligible L3 hits during that test, and the rest of my system was idle, like 200 MiB/s read, 25 MiB/s write according to intel_gpu_top, which measures at the DRAM controllers.

According to perf list on my Skylake, those metrics report average per-core data access or fill bandwidth in GB/s. (So not counting instruction fetch, and maybe only reads?) I'm not 100% sure exactly what these counters measure, but the metric-groups described in my old answer below don't exist anymore. I have perf 6.5 at the moment.

perf stat has some named "metrics" that it knows how to calculate from other things. According to perf list on my system, those include L3_Cache_Access_BW and L3_Cache_Fill_BW.

This is from my system with a Skylake (i7-6700k). Other CPUs (especially from other vendors and architectures) might have different support for it, or might not support these metrics at all.
I tried it out on a simplistic sieve of Eratosthenes (using a bool array, not a bitmap) from a recent codereview question, since I had a benchmarkable version of that (with a repeat loop) lying around. It measured 52 GB/s total bandwidth (read+write, I think). The n=4000000 problem size I used thus consumes 4 MB total, which is larger than the 256 KiB L2 size but smaller than the 8 MiB L3 size.
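The benchmark itself isn't shown here, so the following is only a stand-in sketch in Python (matching the language of the script in the question): a byte-per-element sieve with a repeat loop, plus a check that its array falls between the L2 and L3 sizes quoted above. The names `sieve_bool_array` and `REPEATS` and the repeat count are illustrative assumptions, and interpreter overhead means a Python version will not reach the bandwidth figures measured with the original code.

```python
import time

L2_BYTES = 256 * 1024        # L2 size quoted in the text above
L3_BYTES = 8 * 1024 * 1024   # L3 size quoted in the text above
REPEATS = 10                 # hypothetical repeat count so the run lasts long enough to profile


def sieve_bool_array(n: int) -> bytearray:
    """Sieve of Eratosthenes using one byte per candidate (not a bitmap)."""
    is_prime = bytearray([1]) * (n + 1)
    is_prime[0:2] = b"\x00\x00"          # 0 and 1 are not prime
    p = 2
    while p * p <= n:
        if is_prime[p]:
            # Clear every multiple of p starting at p*p.
            count = len(range(p * p, n + 1, p))
            is_prime[p * p :: p] = bytes(count)
        p += 1
    return is_prime


if __name__ == "__main__":
    n = 4_000_000
    start = time.perf_counter()
    for _ in range(REPEATS):
        flags = sieve_bool_array(n)
    elapsed = time.perf_counter() - start
    fits = L2_BYTES < len(flags) < L3_BYTES
    print(f"array: {len(flags)} bytes, between L2 and L3 sizes: {fits}")
    print(f"{REPEATS} repeats in {elapsed:.3f} s, {sum(flags)} primes <= {n}")
```

Saved under some file name of your choosing, a script like this can be passed to perf stat in place of hello.py.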
Or with just -M L3_Cache_Access_BW and no -e events, it just shows offcore_requests.all_requests # 54.52 L3_Cache_Access_BW and duration_time. So it overrides the default and doesn't count cycles, instructions and so on.

I think it's just counting all off-core requests by this core, assuming (correctly) that each one involves a 64-byte transfer. It's counted whether it hits or misses in L3 cache. Getting mostly L3 hits will obviously enable a higher bandwidth than if the uncore bottlenecks on the DRAM controllers instead.
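If that reading is right, the metric is just arithmetic on two values: the off-core request count and the elapsed time. A minimal sketch of the calculation follows, assuming one 64-byte line per request as described above. The function name and the example inputs are made up for illustration, not taken from a real perf run.

```python
CACHE_LINE_BYTES = 64  # assumption from the text: each off-core request moves one 64-byte line


def offcore_bandwidth_gb_per_s(requests: int, seconds: float,
                               line_bytes: int = CACHE_LINE_BYTES) -> float:
    """Estimate bandwidth in GB/s (1e9 bytes/s) from a request count and elapsed time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return requests * line_bytes / seconds / 1e9


if __name__ == "__main__":
    # Illustrative inputs only: 1e9 requests over 2 seconds.
    print(offcore_bandwidth_gb_per_s(1_000_000_000, 2.0))  # 32.0
```

The same arithmetic can be applied by hand to a raw offcore_requests.all_requests count and the elapsed time that perf stat prints, on CPUs that expose that event.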