Edit I remove some part of the question because it might be confusing > and thanks of answers I should be able to found out by myself.
I am trying to write a program that resample a signal using a spectral function. I am writing the code in C but I discover the problem coding in C++.
Here is a portion of the code I am using. I have a function test() that call another function test3(tmp) passing an array as pointer. As long as I combine the += operator and I am trying to fill the array of the first function, my code is slowing down.
I initialize and calculate using zeros but all arrays must have double values.
#include <stdio.h>
#include <stdlib.h>
void test3(double *tmp);
void test() {
double *tmp;
int k;
tmp = (double *) calloc(250 * 6, sizeof(double));
for (k = 0; k < 1000; ++k)
test3(tmp);
free(tmp);
}
void test3(double *tmp) {
int k;
double *fx;
double *srf;
double *ptr;
int p;
int l;
int c;
double a;
fx = (double *) calloc(2100 * 6, sizeof(double));
srf = (double *) calloc(2100 * 250, sizeof(double));
for (int k = 0; k < 2100 * 6; ++k) fx[k] = 0;
for (int k = 0; k < 2100 * 250; ++k) srf[k] = 0;
ptr = (double *) calloc(250 * 6, sizeof(double));
for (p = 0; p < 6; ++p) {
for (l = 0; l < 250; ++l) {
ptr[l + p*250] = 0.0;
for (c = 0; c < 2100; ++c)
ptr[l+p*250] += fx[c+p*2100] * srf[c+l*2100];
}
}
for (k = 0; k < 250 * 6; ++k) tmp[k] = ptr[k];
free(ptr);
free(fx);
free(srf);
}
int main(int argc, char *argv[]) {
test();
return 0;
}
I compile using:
gcc -O2 -o test test.c -lm
When I run the code time ./test I get
./test_lut 3.84s user 0.11s system 98% cpu 4.008 total
The compiler does a good job at optimizing out useless code, but not always:
if you remove the line
for (k = 0; k < 250*6; ++k) tp[k] = ptr[k];you essentially remove all side effects of the functiontest3(): you initialize arrays pointed to byfxandsrfand compute the entries in the one pointed to byptr, then free these arrays. The compiler seems able to determine that no object is impacted that survives after returning from the function and removes the code altogether, resulting in almost no time spent in the function.if you change
+=to=, the loopkeep modifying the same location, hence only the last iteration is useful so the code is reduced to
ptr[l+p*250] += fx[2099+p*2100] * srf[2099+l*2100];resulting in much faster operation.the third attempt using a temporary variable
ahas no impact as the compiler probably already optimizes the constant destination out of the loop.You can investigate the compiler work using Godbolt's Compiler Explorer and play with the code and compiler flags.
As can be seen analyzing the code generated, neither gcc not clang generate any code for the initialization loops (setting memory allocated with
calloc()to all bit zero value0.0has no effect) and convert the finalforloop to a call tomemcpy. Commenting this finalforloop causes both compilers to generate no code and changing the+=to=simplifies the inner loop code.Interestingly, albeit both compilers correctly determine that
fxandsrfpoint to arrays with all elements equal to0.0, they do not assume that the contents of these arrays stay null for the whole computation and thus all results in the destination array should be null too asptr[l+p*250] += fx[c+p*2100] * srf[c+l*2100];simplifies toptr[l+p*250] += 0.0;If you compile the code as C (not C++) and define
fx,srfandptrasBoth compilers should determine that the arrays
fxandsrfcannot be modified as a side effect of writing viaptr, hence remain null. The code generated would be drastically simplified and should execute almost instantly, but this optimisation does not seem to be effective.