For the following code, I see some strange inlining/optimisation artefacts, and I'm curious to know whether they are "necessary" in some hypothetical scenario I'm not appreciating.
Godbolt: https://godbolt.org/z/M8PW1obE7
#include <cstdint>
#include <stdio.h>
struct ThreadStateLogger
{
    static thread_local struct Instance
    {
        char TLS_byteLoc[3] {' ', 0, 0};
        uint8_t TLS_numBytes {1};
        // 4 wasted bytes here...
        char* TLS_byteLocPtr {TLS_byteLoc};

        void Log(char v)
        {
            TLS_byteLocPtr[0] = v;
        }
        void Log(char v1, char v2)
        {
            TLS_byteLocPtr[0] = v1;
            if (TLS_numBytes>1)
                TLS_byteLocPtr[1] = v2;
        }
    } instance;

    static void Log(char v1, char v2)
    {
        instance.Log(v1, v2);
    }
    // static void Log(char v1, char v2)
    // {
    //     instance.TLS_byteLocPtr[0] = v1;
    //     if (instance.TLS_numBytes>1)
    //         instance.TLS_byteLocPtr[1] = v2;
    // }
};

extern ThreadStateLogger theThreadStateLogger;

int main()
{
    ThreadStateLogger::Log('a', 'b');
    // printf("Hello world");
    ThreadStateLogger::Log('c', 'd');
    return 0;
}
The whole primary implementation gets inlined with -O3, which is what I want :-)
So it appears that the first Log() call correctly checks whether this TLS object needs initialising, and then gets the resolved address using __tls_get_addr@PLT, which is all good.
The second Log() call also apparently checks whether the object needs initialising, but then uses the cached address from the first call (in rbx)! So if it did initialise at that point, couldn't that be wrong?
Below is the output from clang 16 on Godbolt, which is comparable to GCC: the same repeated initialisation test and cached address, though it does somewhat better than my current clang 10 with -fPIC. https://godbolt.org/z/M8PW1obE7
main: # @main
        push rbx
        cmp qword ptr [rip + _ZTHN17ThreadStateLogger8instanceE@GOTPCREL], 0
        je .LBB0_2
        call TLS init function for ThreadStateLogger::instance@PLT
.LBB0_2:
        data16
        lea rdi, [rip + ThreadStateLogger::instance@TLSGD]
        data16
        data16
        rex64
        call __tls_get_addr@PLT
        mov rbx, rax
        mov rax, qword ptr [rax + 8]
        mov byte ptr [rax], 97
        cmp byte ptr [rbx + 3], 2
        jae .LBB0_3
        cmp qword ptr [rip + _ZTHN17ThreadStateLogger8instanceE@GOTPCREL], 0
        jne .LBB0_5
.LBB0_6:
        mov rax, qword ptr [rbx + 8]
        mov byte ptr [rax], 99
        cmp byte ptr [rbx + 3], 2
        jae .LBB0_7
.LBB0_8:
        xor eax, eax
        pop rbx
        ret
.LBB0_3:
        mov rax, qword ptr [rbx + 8]
        mov byte ptr [rax + 1], 98
        cmp qword ptr [rip + _ZTHN17ThreadStateLogger8instanceE@GOTPCREL], 0
        je .LBB0_6
.LBB0_5:
        call TLS init function for ThreadStateLogger::instance@PLT
        mov rax, qword ptr [rbx + 8]
        mov byte ptr [rax], 99
        cmp byte ptr [rbx + 3], 2
        jb .LBB0_8
.LBB0_7:
        mov rax, qword ptr [rbx + 8]
        mov byte ptr [rax + 1], 100
        xor eax, eax
        pop rbx
        ret
Edits

Removed a printf: if anything, more init checks have been added, but the __tls_get_addr result is still cached in rbx.
- Hmm, it looks like an attempt to optimise the compares against TLS_numBytes in at least one code path. I guess that is because there is no complexity between the two calls, so it seems like a valid optimisation, but it wouldn't be useful in practice.

So the runtime cost is only a repeated check of the TLS-singleton-initialised flag, which we know is already set. There is the code bloat as well, which is annoying.
tl;dr

So, a reminder of the question: why is the init check/call repeated? If it is necessary, then why does the address not need to be regenerated? Or is this just an optimisation nobody has thought of doing? Is there any way to get a better result by juggling the code pattern? I have had a few goes to get to this point, as you can see.
Of course, I need the functionality of a logging class with this behaviour, and using a class to encapsulate the three TLS variables means they are all constructed together instead of one at a time.
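For reference, one more pattern-juggling variation I have in mind (a sketch only, I have not verified it across compilers) is to bind a reference to the thread_local instance once and log through that, so the init guard and address lookup should only be paid where the reference is taken:

    // Sketch only: main() rewritten to fetch the instance once. The guard check
    // and __tls_get_addr should happen where the reference is bound, and the two
    // Log() calls then go through the already-resolved address.
    int main()
    {
        auto& log = ThreadStateLogger::instance;
        log.Log('a', 'b');
        log.Log('c', 'd');
        return 0;
    }

That pushes the problem to every call site, though, which is exactly what the static Log() wrapper was meant to avoid.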
LLVM, and likely GCC too, are unable to optimize this. Devirtualization, thread-local storage initialization and similar language features can be optimized locally within a function, but once that code has been inlined into another function there is no further potential.
The problem becomes clear when we look at the LLVM IR that clang outputs for this. After all optimization passes, the important thing is that the function being inlined into main still contains

    call void @_ZTHN17ThreadStateLogger8instanceE()

(the mangled name, which demangles to "TLS init function for ThreadStateLogger::instance"). It's not a magic function, and the compiler has no way of knowing that calling it once makes future calls unnecessary. The call is located in a branch which reads from mutable global memory, and there is no IR-level information that this memory must be set to != 0 by our call to @_ZTHN17ThreadStateLogger8instanceE().
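A rough C++ analogue of that argument, using purely illustrative names in place of the compiler-generated guard memory and init function:

    // Illustrative analogue only: tls_needs_init stands in for the mutable global
    // memory the guard branch reads, and tls_init_instance() stands in for
    // _ZTHN17ThreadStateLogger8instanceE(); neither name exists in the real program.
    extern bool tls_needs_init;
    void tls_init_instance();

    void main_after_inlining()          // roughly what main() looks like once Log() is inlined twice
    {
        if (tls_needs_init)             // first inlined guard check
            tls_init_instance();
        // ...first store through the TLS address...

        if (tls_needs_init)             // second inlined guard check survives: nothing tells the
            tls_init_instance();        // optimizer that tls_init_instance() cleared the flag
        // ...second store through the TLS address...
    }

From the optimizer's point of view, the second check is just another load of memory that the intervening call (and the stores) might have modified.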
If you subsequently inline ThreadStateLogger::Log(char, char) into main, the two branches which contain this init call do not make each other redundant, and you see two instances of call TLS init function for ThreadStateLogger::instance@PLT.

Why does __tls_get_addr get optimized, then? At the IR level __tls_get_addr is @llvm.threadlocal.address.p0, so it is likely subject to optimizations that a "random" call to a function such as _ZTHN17ThreadStateLogger8instanceE wouldn't get. The call gets eliminated during an EarlyCSE pass in main; see Compiler Explorer with the LLVM Optimization Pipeline view.
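As a loose C++ analogue of that distinction (hypothetical declarations; the const attribute merely mimics "the result depends only on the argument", it is not how the real intrinsic is declared):

    // Loose analogue with hypothetical functions: a call the compiler may treat
    // as a pure function of its argument can be de-duplicated by CSE, while an
    // opaque call with unknown side effects cannot.
    extern "C" char* lookup_slot(const void* key) __attribute__((const)); // ~ the address lookup
    extern "C" void maybe_init();                                         // ~ the TLS init call

    void two_uses(const void* key)
    {
        maybe_init();                // kept
        *lookup_slot(key) = 'a';     // first lookup
        maybe_init();                // also kept: unknown side effects
        *lookup_slot(key) = 'c';     // second lookup can reuse the first call's result
    }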