Direct inclusion of template slower than separate instantiation


I have a simple template header containing 3 templated functions (no separate declarations, just definitions, all marked static inline). Two of these functions are 5000 lines long. The long functions are very simple; they are long only because they are written as straight-line programs with no loops. In my main program file, where I use an instantiation of the template, including the template header directly makes the program run about 10x slower than if I build a separate C++ file that includes the template and instantiates it, and then link to it as a static library (compiled with -fPIC). Why?

Is the compiler too slow, the instruction cache is getting messed up, the compiler suddenly inlined the long functions when it shouldn’t, or something else?

The code is meant to be highly optimized: it is compiled with the flags -O3 -ffast-math -march=native -std=gnu++11, using GCC 5.5.0 on Mac OS 10.14.3.


There are 2 best solutions below

Andrey Mishchenko

If you declare the function templates static, doesn't that cause a separate copy of each instantiation to be generated in every translation unit (compiled object file) that uses it? That could leave several copies of these long functions in the final binary and, yes, lead to instruction-cache issues.

Does getting rid of the static keyword resolve the performance problems?
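
To make the linkage point concrete, here is a minimal sketch. The names eval_static and eval_shared are hypothetical stand-ins for the long functions in the question, and the bodies are shortened to three lines. It assumes the usual GCC/Clang behaviour, where instantiations of a non-static template are typically emitted so that the linker can merge duplicates, while a static template stays private to each translation unit.

```cpp
// Hypothetical stand-ins for the long straight-line functions in the question.

// 'static' gives internal linkage: every translation unit that instantiates
// this template keeps its own private copy of the generated code.
template <typename T>
static inline T eval_static(T x) {
    T y = x * x;   // straight-line code, no loops
    y = y + x;
    return y + T(1);
}

// Without 'static' the template has external linkage; identical instantiations
// from different translation units can typically be merged at link time.
template <typename T>
T eval_shared(T x) {
    T y = x * x;
    y = y + x;
    return y + T(1);
}

// Explicit instantiation definitions: these force the code for the chosen
// types to be emitted in this translation unit, which is what the
// "separate .cpp file linked as a static library" setup relies on.
template int eval_shared<int>(int);
template double eval_shared<double>(double);
```

Both versions compute the same result; only where and how often the code is emitted differs.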

rfabbri

The optimization flags were being left out when compiling the main program, perhaps because of a CMake bug. When the template instantiation was compiled separately as a library, the optimization flags were applied, which is why that build was fast. I forced the optimization flags to be used for the main program with direct template inclusion, and it now runs just as fast.
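
One way to catch this kind of problem early is to have each translation unit report whether it was actually built with optimization. The sketch below is only an illustration, not what I originally used: the function names are made up, and it relies on the GCC/Clang predefined macros __OPTIMIZE__ (defined in optimizing compilations) and __FAST_MATH__ (defined under -ffast-math).

```cpp
#include <string>

// Hypothetical helpers: report how THIS translation unit was compiled.
// __OPTIMIZE__ and __FAST_MATH__ are predefined by GCC and Clang; other
// compilers may not define them, in which case both functions return false.
inline bool built_with_optimization() {
#ifdef __OPTIMIZE__
    return true;
#else
    return false;
#endif
}

inline bool built_with_fast_math() {
#ifdef __FAST_MATH__
    return true;
#else
    return false;
#endif
}

// Build a short summary string suitable for printing at program start.
inline std::string build_flags_summary() {
    std::string s = "optimize=";
    s += built_with_optimization() ? "on" : "off";
    s += " fast-math=";
    s += built_with_fast_math() ? "on" : "off";
    return s;
}
```

Printing build_flags_summary() from the main program's source file would show "optimize=off" if the flags were silently dropped there. With CMake's Makefile generator, running make VERBOSE=1 also shows the exact compiler command line for each file.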

For the sake of curiosity: the inline and static keywords were harmless; removing them didn't alter the speed. In fact, the compiler does not inline the long functions despite my hint, as it knows when it shouldn't. Forcing inlining with __attribute__((always_inline)) makes compilation very slow and also makes the program about 2x slower at run time.
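
For reference, this is roughly how the GCC/Clang inlining attributes are spelled on a function template. The functions here (long_kernel, small_helper) are hypothetical and far shorter than the real ones; treat it as a sketch of the syntax rather than a recommendation.

```cpp
// GCC/Clang-specific attributes; other compilers use different spellings.

// noinline: never inline this function, whatever the optimizer would prefer.
// A reasonable choice for very long straight-line functions.
template <typename T>
__attribute__((noinline)) T long_kernel(T x) {
    x = x * T(3) + T(1);   // stand-in for thousands of straight-line statements
    x = x * x - T(2);
    return x;
}

// always_inline: force inlining at every call site. GCC expects the function
// to also be declared 'inline', hence the keyword here.
template <typename T>
__attribute__((always_inline)) inline T small_helper(T x) {
    return x * x + T(1);
}
```

Forcing always_inline on a 5000-line function duplicates its body at every call site, which is consistent with the slow compilation and slower run time described above.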