Compiler produces a slower program even though I gave it more information


As far as I know, giving the compiler more information (for example restrict, static on functions, or __builtin_expect()) should make the program faster, or at least no slower. In my case, however, the opposite happened.

This function changes the storage order of a matrix (a packing step for matrix multiplication). The src matrix has size m * n, and the dst matrix has size MAX_M * MAX_N. The line marked case 2) is commented out for now.

// pack.c

#define MAX_M 5000
#define MAX_N 5000
#define EPC 8  // number of Elements Per Cache line
               // also AVX-512 SIMD register can hold up to 8 double-precision floating points.

void pack(int m, int n, const double *restrict src, double *restrict dst) {
    int upper_n = (n + EPC - 1) / EPC;
    int remainder_n = n % EPC;
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < upper_n; ++j) {
            int len = j < upper_n - 1 || remainder_n == 0 ? EPC : remainder_n; // case 1)
            // int len = EPC;                                                  // case 2)
            for (int k = 0; k < len; ++k) {
                dst[i * EPC + j * EPC * MAX_M + k] = src[i * n + j * EPC + k];
            }
        }
    }
}
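
To make the packed layout concrete, here is a minimal sketch of the same index mapping on a tiny matrix. SMALL_MAX_M, packed_index and pack_small are hypothetical names introduced only for this illustration; SMALL_MAX_M stands in for MAX_M so that dst stays small.

```c
#include <assert.h>
#include <stddef.h>

#define EPC 8          /* same value as in pack.c */
#define SMALL_MAX_M 4  /* hypothetical stand-in for MAX_M, illustration only */

/* Index in dst where element (row i, column c) of src ends up.
 * Column c belongs to panel j = c / EPC, at offset k = c % EPC. */
size_t packed_index(int i, int c) {
    assert(i >= 0 && i < SMALL_MAX_M && c >= 0);
    int j = c / EPC;
    int k = c % EPC;
    return (size_t)i * EPC + (size_t)j * EPC * SMALL_MAX_M + (size_t)k;
}

/* Same loop structure as pack(), with SMALL_MAX_M in place of MAX_M. */
void pack_small(int m, int n, const double *restrict src, double *restrict dst) {
    int upper_n = (n + EPC - 1) / EPC;
    int remainder_n = n % EPC;
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < upper_n; ++j) {
            int len = (j < upper_n - 1 || remainder_n == 0) ? EPC : remainder_n;
            for (int k = 0; k < len; ++k) {
                dst[i * EPC + j * EPC * SMALL_MAX_M + k] = src[i * n + j * EPC + k];
            }
        }
    }
}
```

Each panel of EPC columns is stored contiguously, row after row, so one row of a panel occupies exactly one run of EPC doubles in dst.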

I used the code below to measure the performance of the pack function. It runs pack(5000, 5000, A, B) 50 times and reports the average execution time. A and B are 64-byte aligned, and each holds 5000 * 5000 doubles.

// main.c

#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define MAX_M 5000
#define MAX_N 5000

#define ITERATION 50

void pack(int m, int n, const double *restrict src, double *restrict dst);

int main(int argc, char **argv) {
    int m = 5000;
    int n = 5000;

    double *A;
    double *B;

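    // posix_memalign returns 0 on success; stop if either allocation fails
    if (posix_memalign((void **)&A, 64, sizeof(double) * m * n) != 0 ||
        posix_memalign((void **)&B, 64, sizeof(double) * MAX_M * MAX_N) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }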

    for (int i = 0; i < m * n; ++i) A[i] = i;

    double total_duration = 0;
    for (int i = 0; i < ITERATION; ++i) {
        double start_time = omp_get_wtime();
        pack(m, n, A, B);
        double end_time = omp_get_wtime();
        double duration = end_time - start_time;

        total_duration += duration;
    }
    printf("avg duration: %.8lf s\n", total_duration / ITERATION);

    free(A);
    free(B);

    return 0;
}

main only ever calls pack with n = 5000, so remainder_n in pack is always 0 and len is always 8. I therefore switched the pack function from case 1) to case 2).
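
The claim that len is always 8 for n = 5000 can be checked with a small sketch. case1_len and len_is_always_epc are hypothetical helper names, not part of the original code; they only reproduce the arithmetic of case 1).

```c
#include <assert.h>

#define EPC 8  /* same value as in pack.c */

/* len as computed by case 1) for panel j of a matrix with n columns */
int case1_len(int n, int j) {
    assert(n > 0 && j >= 0);
    int upper_n = (n + EPC - 1) / EPC;
    int remainder_n = n % EPC;
    return (j < upper_n - 1 || remainder_n == 0) ? EPC : remainder_n;
}

/* Returns 1 if case 1) yields len == EPC for every panel, otherwise 0. */
int len_is_always_epc(int n) {
    int upper_n = (n + EPC - 1) / EPC;
    for (int j = 0; j < upper_n; ++j) {
        if (case1_len(n, j) != EPC) return 0;
    }
    return 1;
}
```

For n = 5000 the check passes because 5000 is a multiple of 8; for a value such as n = 5001 the last panel would have len 1, and case 2) would no longer be equivalent to case 1).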

Then something strange happened: performance got worse. Case 2) is slower than case 1). I told the compiler that len is always 8, yet it produced slower code.

avg duration: 0.05746786 s    <- case 1)
avg duration: 0.06110375 s    <- case 2)

Is it possible for extra information to make the compiled program slower? Or is this just an issue with this particular compiler?

The target machine is an Intel Xeon Phi 7250 (Intel Knights Landing). The compile command is icc -o perf_test main.c pack.c -qopenmp -march=knl -O3. The assembly of the pack function is like this, except that mine uses movslq where the linked listing uses movsxd.


After testing several modifications, I found that 'case 1) is faster than case 2)' only holds in a special case.

Case 2) becomes faster than case 1) if I do any one of the following:

  • change the compiler from icc to gcc
  • move the pack function into main.c
  • remove the restrict keyword from the pack function
  • remove the -march=knl flag

Case 1) becomes as slow as case 2) if I

  • replace the final remainder_n in case 1) with any integer literal
    int len = j < upper_n - 1 || remainder_n == 0 ? EPC : 0;
    
    or
    
    int len = j < upper_n - 1 || remainder_n == 0 ? EPC : EPC;
    

In other words, case 2) is slower than case 1) only when none of the above changes is applied. I don't understand why the compiler generates slower code in exactly this configuration.
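
One way to compare these variants without editing the source each time is a compile-time switch. This is only a sketch: LEN_VARIANT and panel_len are hypothetical names that do not exist in the original code, and moving the expression into a helper function may itself change what the compiler generates.

```c
#include <assert.h>

#define EPC 8  /* same value as in pack.c */

#ifndef LEN_VARIANT    /* hypothetical macro; select with -DLEN_VARIANT=2 etc. */
#define LEN_VARIANT 1
#endif

/* Inner-loop length for panel j, chosen at compile time. */
int panel_len(int j, int upper_n, int remainder_n) {
    assert(upper_n > 0);
#if LEN_VARIANT == 1
    /* case 1): the last panel may be shorter */
    return (j < upper_n - 1 || remainder_n == 0) ? EPC : remainder_n;
#elif LEN_VARIANT == 2
    /* case 2): len is a compile-time constant */
    (void)j; (void)remainder_n;
    return EPC;
#else
    /* case 1) with the final remainder_n replaced by the literal 0 */
    return (j < upper_n - 1 || remainder_n == 0) ? EPC : 0;
#endif
}
```

With n = 5000 all three variants return 8 for every panel, so any timing difference between builds comes from code generation rather than from the amount of work done.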
