I am trying to measure the latency of multiple memory accesses that are executing in parallel in an out-of-order processor.
The problem is that any attempt to measure the latency of a load serializes it with respect to other loads.
Take, for example, naively written code that measures the latency of two loads:
1. rdtscp
2. load-1
3. rdtscp
4. rdtscp
5. load-2
6. rdtscp
In the above code, the ordering property of rdtscp on Intel x86 serializes the execution of load-1 and load-2, as per my testing (i.e., load-2 is issued to the memory system only after load-1 completes execution). As a result, the above code does not utilize the available memory bandwidth. Ideally, I would like to ensure maximum throughput for the loads while measuring the latency of each load independently.
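For concreteness, here is a minimal C sketch of that sequence, assuming GCC or Clang on x86-64; the pointers `p1` and `p2` are placeholders for the two addresses being loaded:

```c
#include <x86intrin.h> /* __rdtscp */
#include <stdint.h>

/* Naive per-load timing: each __rdtscp waits for all earlier
 * instructions to execute before it reads the TSC. */
void measure_two_loads(volatile uint64_t *p1, volatile uint64_t *p2,
                       uint64_t *lat1, uint64_t *lat2)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    uint64_t v1 = *p1;             /* load-1 */
    uint64_t t1 = __rdtscp(&aux);  /* waits until load-1 has executed */
    uint64_t t2 = __rdtscp(&aux);
    uint64_t v2 = *p2;             /* load-2: per the testing described above,
                                      it issues only after load-1 completes */
    uint64_t t3 = __rdtscp(&aux);
    *lat1 = t1 - t0;
    *lat2 = t3 - t2;
    (void)v1; (void)v2;            /* silence unused-variable warnings */
}
```

(In practice you would also want compiler barriers so the compiler itself cannot move the loads across the intrinsics, but the serialization at issue here happens in hardware.)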
Is there a way to measure the latency of load-1 and load-2 while allowing them to execute in parallel?

Ideally, what I need is a form of rdtscp that is ordered with respect to the load whose latency is being measured, and not ordered with respect to any other instruction. I was wondering if there is a way to obtain this with either rdtscp or rdtsc.
I don't think there's any way to sample a time with an input dependency on a specific register, or any other way to let loads complete out of order but still time each one individually, or even just to let them overlap.
There are perf events for `mem_trans_retired.load_latency_gt_32` and so on, for powers of 2 from 4 to 512. You could program counters and use `rdpmc` for that, but it wouldn't tell you which load triggered which event.

Given your overall goal, you could use those counters with `perf stat` or `perf record` to get an average for a whole loop in the case when (single-core) memory bandwidth is maxed out. Note that they count latency from first dispatch (to a load port), not from issue into the back-end.
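As a rough sketch of the counter-based approach, one of those events can be opened with `perf_event_open(2)` and counted around the region of interest, using a plain `read()` instead of `rdpmc` for brevity. The raw encoding below (event `0xCD`, umask `0x01`, `ldlat` threshold in `config1`) is an assumption based on Skylake-era documentation, and since these are PEBS events some kernels may only accept them in sampling mode; check what `perf list` reports for your CPU:

```c
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size    = sizeof attr;
    attr.type    = PERF_TYPE_RAW;
    attr.config  = 0x01cd;  /* MEM_TRANS_RETIRED.LOAD_LATENCY (assumed encoding) */
    attr.config1 = 32;      /* ldlat: count loads taking more than 32 cycles */
    attr.exclude_kernel = 1;
    attr.disabled = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... the loop whose loads you want to characterize ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    read(fd, &count, sizeof count);
    printf("loads with latency > 32 cycles: %llu\n",
           (unsigned long long)count);
    return 0;
}
```

Running the whole benchmark under `perf stat -e mem_trans_retired.load_latency_gt_32` should give the same count with no code changes.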