As far as I know, giving the compiler more information (for example restrict, static on functions, or __builtin_expect()) should make a program as fast or faster. In this case, however, it has the opposite effect.
This is a function that changes the storage order of a matrix (a packing step for matrix multiplication). The src matrix is m * n, and the dst matrix is MAX_M * MAX_N. The case 2) line is commented out for now.
// pack.c
#define MAX_M 5000
#define MAX_N 5000
#define EPC 8 // number of Elements Per Cache line (64 bytes / 8-byte double)
// An AVX-512 SIMD register also holds 8 double-precision values.
void pack(int m, int n, const double *restrict src, double *restrict dst) {
    int upper_n = (n + EPC - 1) / EPC;
    int remainder_n = n % EPC;
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < upper_n; ++j) {
            int len = j < upper_n - 1 || remainder_n == 0 ? EPC : remainder_n; // case 1)
            // int len = EPC; // case 2)
            for (int k = 0; k < len; ++k) {
                dst[i * EPC + j * EPC * MAX_M + k] = src[i * n + j * EPC + k];
            }
        }
    }
}
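To make the packed layout concrete, here is a minimal sketch of the index mapping that the inner assignment in pack performs. The helper names packed_index and source_index are hypothetical (they are not part of my code), and max_m and epc are passed as parameters instead of using the MAX_M and EPC macros only so that other values can be tried.

```c
#include <stddef.h>

/* Hypothetical helper: destination offset of element (row, col), matching
 * dst[i * EPC + j * EPC * MAX_M + k] with j = col / epc and k = col % epc. */
size_t packed_index(size_t row, size_t col, size_t max_m, size_t epc) {
    size_t panel = col / epc; /* j: which panel of epc columns */
    size_t lane = col % epc;  /* k: position inside that panel's row */
    return row * epc + panel * epc * max_m + lane;
}

/* Hypothetical helper: row-major source offset, matching
 * src[i * n + j * EPC + k]. */
size_t source_index(size_t row, size_t col, size_t n) {
    return row * n + col;
}
```

With max_m = 5000 and epc = 8, element (1, 0) lands at offset 8 and element (0, 8) at offset 40000, so each panel of 8 columns is stored contiguously, one row after another.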
I used the code below to measure the performance of the pack function. It runs pack(5000, 5000, A, B) 50 times and reports the average execution time. A and B are 64-byte aligned, and each holds 5000 * 5000 doubles.
// main.c
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>
#define MAX_M 5000
#define MAX_N 5000
#define ITERATION 50
void pack(int m, int n, const double *restrict src, double *restrict dst);
int main(int argc, char **argv) {
    int m = 5000;
    int n = 5000;
    double *A;
    double *B;
    // 64-byte aligned buffers; stop if either allocation fails.
    if (posix_memalign((void **)&A, 64, sizeof(double) * m * n) != 0 ||
        posix_memalign((void **)&B, 64, sizeof(double) * MAX_M * MAX_N) != 0) {
        fprintf(stderr, "posix_memalign failed\n");
        return 1;
    }
    for (int i = 0; i < m * n; ++i) A[i] = i;
    double total_duration = 0;
    for (int i = 0; i < ITERATION; ++i) {
        double start_time = omp_get_wtime();
        pack(m, n, A, B);
        double end_time = omp_get_wtime();
        double duration = end_time - start_time;
        total_duration += duration;
    }
    printf("avg duration: %.8lf s\n", total_duration / ITERATION);
    free(A);
    free(B);
    return 0;
}
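Since the difference I am asking about is only a few percent, it may also be worth looking at the fastest run and not just the average. The sketch below is not what I measured with; summarize_durations is a hypothetical helper, and it assumes each per-iteration duration is stored in an array instead of only being added to total_duration.

```c
#include <stddef.h>

/* Hypothetical helper: summarize per-iteration durations (in seconds).
 * Writes the mean and the minimum; returns 0 on success, -1 if there
 * are no samples. */
int summarize_durations(const double *samples, size_t count,
                        double *mean_out, double *min_out) {
    if (samples == NULL || count == 0) return -1;
    double sum = 0.0;
    double min = samples[0];
    for (size_t i = 0; i < count; ++i) {
        sum += samples[i];
        if (samples[i] < min) min = samples[i];
    }
    *mean_out = sum / (double)count;
    *min_out = min;
    return 0;
}
```

In main.c this would mean keeping an array of ITERATION doubles and calling the helper once after the timing loop.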
main only calls pack with n = 5000, which means remainder_n in pack is always 0 and len is always 8. So I switched the pack function from case 1) to case 2).
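The arithmetic behind that claim can be checked with a small sketch. The function names case1_len and all_blocks_full are hypothetical; the expression itself is copied from case 1) in pack.

```c
#define EPC 8 /* same value as in pack.c */

/* Length of block j as computed by the case 1) expression in pack. */
int case1_len(int n, int j) {
    int upper_n = (n + EPC - 1) / EPC;
    int remainder_n = n % EPC;
    return j < upper_n - 1 || remainder_n == 0 ? EPC : remainder_n;
}

/* Returns 1 if every block has length EPC for this n, otherwise 0. */
int all_blocks_full(int n) {
    int upper_n = (n + EPC - 1) / EPC;
    for (int j = 0; j < upper_n; ++j) {
        if (case1_len(n, j) != EPC) return 0;
    }
    return 1;
}
```

For n = 5000, upper_n is 625 and remainder_n is 0, so all_blocks_full(5000) returns 1. For n = 5001 the last block (j = 625) has length 1, so the two cases would no longer be equivalent.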
Then something strange happens: performance gets worse. Case 2) is slower than case 1). I gave the compiler more information (len is always 8), yet it produced slower code.
avg duration: 0.05746786 s <- case 1)
avg duration: 0.06110375 s <- case 2)
Is it possible that giving the compiler more information makes a program slower? Or is this just an issue with this particular compiler?
The target machine is an Intel Xeon Phi 7250 (Intel Knights Landing). The compile command is icc -o perf_test main.c pack.c -qopenmp -march=knl -O3. The assembly of the pack function is like this, except that mine uses movslq where the link uses movsxd.
I experimented by modifying the code and found that 'case 1) is faster than case 2)' is a special case.
Case 2) becomes faster than case 1) if I
- change the compiler to gcc from icc
- move the pack function to the main.c file
- remove the restrict keyword from the pack function
- remove the -march=knl flag
Case 1) becomes as slow as case 2) if I
- change case 1)'s remainder_n to any integer literal, for example int len = j < upper_n - 1 || remainder_n == 0 ? EPC : 0; or int len = j < upper_n - 1 || remainder_n == 0 ? EPC : EPC;
In other words, case 2) is slower than case 1) only when none of the changes above is applied. I don't understand why the compiler generates slower code under exactly these conditions.
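For anyone reproducing this, it may help to first confirm that the two cases write identical output, so that only the timing differs. The sketch below is an assumption-laden check, not my benchmark: pack_variant and variants_agree are hypothetical names, max_m is passed as a parameter so small buffers can be used, and the case is chosen by a runtime flag. That flag is unlike the source-level edit in pack.c, so this says nothing about speed; it only checks results.

```c
#include <string.h>

#define EPC 8 /* same value as in pack.c */

/* Same loop nest as pack, with max_m passed in (assumption) so that small
 * buffers work. use_case2 != 0 selects "len = EPC" instead of case 1). */
static void pack_variant(int m, int n, int max_m, int use_case2,
                         const double *restrict src, double *restrict dst) {
    int upper_n = (n + EPC - 1) / EPC;
    int remainder_n = n % EPC;
    for (int i = 0; i < m; ++i) {
        for (int j = 0; j < upper_n; ++j) {
            int len = use_case2
                          ? EPC
                          : (j < upper_n - 1 || remainder_n == 0 ? EPC : remainder_n);
            for (int k = 0; k < len; ++k) {
                dst[i * EPC + j * EPC * max_m + k] = src[i * n + j * EPC + k];
            }
        }
    }
}

/* Returns 1 if both variants fill dst identically for a 4 x 16 input
 * (16 is a multiple of EPC, like n = 5000 in the question). */
int variants_agree(void) {
    enum { M = 4, N = 16, SMALL_MAX_M = 4, SMALL_MAX_N = 16 };
    double src[M * N];
    double dst1[SMALL_MAX_M * SMALL_MAX_N];
    double dst2[SMALL_MAX_M * SMALL_MAX_N];
    for (int i = 0; i < M * N; ++i) src[i] = (double)i;
    memset(dst1, 0, sizeof dst1);
    memset(dst2, 0, sizeof dst2);
    pack_variant(M, N, SMALL_MAX_M, 0, src, dst1); /* case 1) */
    pack_variant(M, N, SMALL_MAX_M, 1, src, dst2); /* case 2) */
    return memcmp(dst1, dst2, sizeof dst1) == 0;
}
```

The check only holds when n is a multiple of EPC; for other n, case 2) would read and write past the last partial block.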