Seemingly overhead-free monitoring code on an AMD CPU significantly increases the total synchronization time


I am running a test to measure the message synchronization latency between different cores of a CPU. Specifically, I am measuring how many clock cycles it takes CPU2 to detect changes that CPU1 makes to shared data. Both CPU1 and CPU2 use the rdtsc instruction to record timing. I have observed inconsistent behavior between the Intel and AMD platforms. Any thoughts or suggestions regarding this issue are welcome.

The code is as follows. The program contains two threads, "ping" and "pong", running on CPU1 and CPU2 respectively. As their names suggest, they take turns incrementing their own slots in a shared array (shd_data) to implement a ping-pong loop, and finally measure the total running time (the minimum of ts_end - ts_start and tb_end - tb_start). To also measure the average time of the "ping" operation, I added rdtsc(tsc_start[loop]); and rdtsc(tsc_end[loop]); inside the while loop. However, these two lines of monitoring code significantly affect the total running time, increasing the average ping-pong iteration from about 180 cycles to 290 cycles. I cannot find a reasonable explanation for this.
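
READ_ONCE, WRITE_ONCE, and rdtsc are not shown below; their exact definitions are in the linked full program. They are roughly the usual Linux-kernel-style macros, sketched here under the assumption of x86-64 and GCC (the snippet also needs <string.h> for memset):

#include <string.h>

#define READ_ONCE(x)     (*(const volatile typeof(x) *)&(x))
#define WRITE_ONCE(x, v) (*(volatile typeof(x) *)&(x) = (v))

/* Read the 64-bit time-stamp counter into t. */
#define rdtsc(t) do {                                           \
        unsigned int lo_, hi_;                                  \
        __asm__ __volatile__ ("rdtsc" : "=a"(lo_), "=d"(hi_));  \
        (t) = ((unsigned long)hi_ << 32) | lo_;                 \
    } while (0)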

#define MAX_LOOPS  10
#define BIDX 0
#define SIDX 15

static volatile int start_flag = 0;
static int shd_data[16];
unsigned long tb_start, tb_end, ts_start, ts_end;
unsigned long gap[12];
unsigned long tsc_start[MAX_LOOPS];
unsigned long tsc_end[MAX_LOOPS];

void* ping(void *args)
{
    int loop=0;
    int old = 0;
    int cur = 0;

    memset(tsc_start, 0, MAX_LOOPS*sizeof(unsigned long));
    memset(shd_data, 0, 64);    // preheat
    while (start_flag == 0) ;   // sync. start

    rdtsc(tb_start);
    while (++loop < MAX_LOOPS) {
        rdtsc(tsc_start[loop]);             // monitor: timestamp the start of this iteration's ping
        WRITE_ONCE(shd_data[BIDX], shd_data[BIDX]+1);
        do {
            cur = READ_ONCE(shd_data[SIDX]);
        } while (cur <= old);
        old = cur;
    }
    WRITE_ONCE(shd_data[BIDX], shd_data[BIDX]+1);
    rdtsc(tb_end);

    return NULL;
}


void* pong(void *args)
{
    int loop=0;
    int old = 0;
    int cur = 0;
    
    memset(tsc_end, 0, MAX_LOOPS*sizeof(unsigned long));
    memset(shd_data, 0, 64);    // preheat
    while (start_flag == 0) ;   // sync. start

    rdtsc(ts_start);
    while (++loop < MAX_LOOPS) {
        do {
            cur = READ_ONCE(shd_data[BIDX]);
            rdtsc(tsc_end[loop]);           // monitor: timestamp every poll; the last value is the detection time
        } while (cur <= old);
        old = cur;
        WRITE_ONCE(shd_data[SIDX], shd_data[SIDX]+1);
    }
    WRITE_ONCE(shd_data[SIDX], shd_data[SIDX]+1);
    rdtsc(ts_end);

    return NULL;
}
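
The driver is omitted above. A minimal sketch of how the two threads could be created, pinned to separate cores, and released (assuming Linux and pthreads; the actual setup is in the linked code):

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>

static void pin_to_cpu(pthread_t t, int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(t, sizeof(set), &set);
}

int main(void)
{
    pthread_t t1, t2;

    pthread_create(&t1, NULL, ping, NULL);
    pthread_create(&t2, NULL, pong, NULL);
    pin_to_cpu(t1, 1);          /* "ping" runs on CPU1 */
    pin_to_cpu(t2, 2);          /* "pong" runs on CPU2 */

    start_flag = 1;             /* release both spin-wait loops */

    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("ping: %lu cycles, pong: %lu cycles\n",
           tb_end - tb_start, ts_end - ts_start);
    return 0;
}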

The experimental machine is an AMD 3970X with gcc 9.4.0. I also ran the same code on an Intel i9-9900K. Interestingly, on that machine the rdtsc monitoring of the "ping" operation did not affect the overall ping-pong time, which remained around 400 cycles. I am not sure whether this phenomenon can be reproduced on other AMD or Intel CPUs, as these are the only two machines I have access to at the moment.

Update:

By adjusting CPU frequency scaling, the CPU frequency on both machines is fixed at the base frequency of 3.6 GHz.

You can find the complete runnable code at the link below:

https://onlinegdb.com/HoCs3AChhp

1 Answer

Brief answer: Cache line contention

The added monitoring code (here, the rdtsc() lines mentioned in the question) changes how the variables are laid out in the data segment. In the modified build, shd_data ends up sharing a cache line with other globals that are written in the hot loop, so the two cores contend for that line (false sharing) at runtime. Changing the value of MAX_LOOPS, which resizes the tsc_start and tsc_end arrays, has the same effect. Note that cache-line contention can also be triggered by speculative execution.
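
One way to verify this is to print the addresses of the globals in both builds and see which 64-byte lines they fall on. The dump_layout helper below is hypothetical, added only for illustration, and assumes 64-byte cache lines:

#include <stdio.h>

static void dump_layout(void)
{
    /* Two variables can contend if they map to the same 64-byte line. */
    printf("start_flag %p line %lu\n", (void *)&start_flag,
           (unsigned long)&start_flag / 64);
    printf("shd_data   %p line %lu\n", (void *)shd_data,
           (unsigned long)shd_data / 64);
    printf("tsc_start  %p line %lu\n", (void *)tsc_start,
           (unsigned long)tsc_start / 64);
    printf("tsc_end    %p line %lu\n", (void *)tsc_end,
           (unsigned long)tsc_end / 64);
}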

How to fix

Adding __attribute__((aligned(64))) to the definition of shd_data ensures that the array starts on a cache-line boundary. Since the 16-int array is exactly 64 bytes, the alignment also keeps every other global off that line.
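
Concretely, the fixed definition would look like this (a sketch, assuming 64-byte cache lines):

/* 16 ints = 64 bytes = exactly one cache line; the alignment
 * guarantees no other global shares this line with shd_data. */
static int shd_data[16] __attribute__((aligned(64)));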

Constraints

The behavior described above is specific to the AMD 3970X and may also apply to other Zen 2 or AMD CPUs.

I also ran this experiment on an Intel i9-9900K, where the data-segment layouts of the two builds were identical. However, I did not observe cache thrashing on the Intel CPU, which is something I have not fully understood yet.