I am trying to measure the latency of multiple memory accesses that are executing in parallel in an out-of-order processor.
The problem is that any attempt to measure the latency of a load serializes it with respect to other loads.
Take, for example, naively written code that measures the latency of two loads:
1. rdtscp
2. load-1
3. rdtscp
4. rdtscp
5. load-2
6. rdtscp
In the above code, the ordering property of rdtscp on Intel x86 serializes the execution of load-1 and load-2, as per my testing (i.e., load-2 is issued to the memory system only after load-1 completes execution). As a result, the above code does not utilize the available memory bandwidth. Ideally, I would like to ensure maximum throughput for the loads while measuring the latency of each load independently.
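For concreteness, here is a minimal C sketch of that sequence, assuming GCC or Clang on x86-64; the pointers `p1` and `p2` are placeholders for the two addresses being loaded:

```c
#include <x86intrin.h> /* __rdtscp */
#include <stdint.h>

/* Naive per-load timing: each __rdtscp waits for all earlier
 * instructions to execute before it reads the TSC. */
void measure_two_loads(volatile uint64_t *p1, volatile uint64_t *p2,
                       uint64_t *lat1, uint64_t *lat2)
{
    unsigned aux;
    uint64_t t0 = __rdtscp(&aux);
    uint64_t v1 = *p1;             /* load-1 */
    uint64_t t1 = __rdtscp(&aux);  /* waits until load-1 has executed */
    uint64_t t2 = __rdtscp(&aux);
    uint64_t v2 = *p2;             /* load-2: per the testing described above,
                                      it issues only after load-1 completes */
    uint64_t t3 = __rdtscp(&aux);
    *lat1 = t1 - t0;
    *lat2 = t3 - t2;
    (void)v1; (void)v2;            /* silence unused-variable warnings */
}
```

(In practice you would also want compiler barriers so the compiler itself cannot move the loads across the intrinsics, but the serialization at issue here happens in hardware.)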
Is there a way to measure the latency of load-1 and load-2 while allowing them to execute in parallel?

Ideally, what I need is a form of rdtscp that is ordered with respect to the load whose latency is being measured, and not ordered with respect to any other instruction. I was wondering if there is a way to obtain this with either rdtscp or rdtsc.
I don't think there's any way to sample a time with an input dependency on a specific register, or any other way to let loads complete out of order but still time each one individually, or even just to let them overlap.
There are perf events for `mem_trans_retired.load_latency_gt_32` and so on, for powers of 2 from 4 to 512. You could program counters and use `rdpmc` for that, but it wouldn't tell you which load triggered which event.

Given your overall goal, you could use those counters with `perf stat` or `perf record` to get an average for a whole loop in the case when (single-core) memory bandwidth is maxed out. Note that they count latency from first dispatch (to a load port), not from issue into the back-end.
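As a rough sketch of the counter-based approach, one of those events can be opened with `perf_event_open(2)` and counted around the region of interest, using a plain `read()` instead of `rdpmc` for brevity. The raw encoding below (event `0xCD`, umask `0x01`, `ldlat` threshold in `config1`) is an assumption based on Skylake-era documentation, and since these are PEBS events some kernels may only accept them in sampling mode; check what `perf list` reports for your CPU:

```c
#include <linux/perf_event.h>
#include <sys/ioctl.h>
#include <sys/syscall.h>
#include <unistd.h>
#include <string.h>
#include <stdio.h>
#include <stdint.h>

int main(void)
{
    struct perf_event_attr attr;
    memset(&attr, 0, sizeof attr);
    attr.size    = sizeof attr;
    attr.type    = PERF_TYPE_RAW;
    attr.config  = 0x01cd;  /* MEM_TRANS_RETIRED.LOAD_LATENCY (assumed encoding) */
    attr.config1 = 32;      /* ldlat: count loads taking more than 32 cycles */
    attr.exclude_kernel = 1;
    attr.disabled = 1;

    int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
    if (fd < 0) { perror("perf_event_open"); return 1; }

    ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);
    /* ... the loop whose loads you want to characterize ... */
    ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);

    uint64_t count;
    read(fd, &count, sizeof count);
    printf("loads with latency > 32 cycles: %llu\n",
           (unsigned long long)count);
    return 0;
}
```

Running the whole benchmark under `perf stat -e mem_trans_retired.load_latency_gt_32` should give the same count with no code changes.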