We have been struggling with a very weird issue when upgrading from C++14 to C++17 (Ubuntu 18.04, GCC 7.5.0). On the Jetson TX2 the toolchain is Linaro's, which is the default one.
Background:
We have a C++ application A that uses algorithms from a library L, also developed by us, running on Ubuntu 18.04. Builds and extensive system tests have been running for two years on Intel and on Jetson TX2.
Now we decided to upgrade to C++17 (-std=c++1z with GCC). We first built L with C++17 enabled, and at first everything seemed to work fine, but then we noticed that some test runs started to act weirdly, on ARM only: about 2 tests out of 30, and the failures were deterministic(!).
We then started to investigate and noticed that one constructor in the library that accepts a const std::pair<float, float> & somehow received corrupted data. Inside the constructor, .first seemed to hold the value of .second, and .second was always 0. Something weird like that.
So this happens when A is still built as C++14 and L as C++17. OK. Then we tried it the other way around: L on C++14 and the application A on C++17. The results were similar: some tests started to fail (not the same ones, though), again deterministically. The root cause was also the same: somehow the std::pair<float, float> in the API gets messed up.
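To make the shape of the boundary concrete (all names below are hypothetical, since the real API cannot be shared), it is essentially a constructor call like this crossing the library boundary:

// pair_boundary.cpp -- hypothetical sketch; in the real setup the
// constructor body lives in library L and main() in application A,
// each compiled with a different -std flag and then linked together.
#include <iostream>
#include <utility>

class Estimator {                        // hypothetical library class
public:
    explicit Estimator(const std::pair<float, float>& range)
        : lo_(range.first), hi_(range.second) {
        // On the failing combinations, range.first appears to hold the
        // value of .second here, and range.second reads as 0.
        std::cout << "lo=" << lo_ << " hi=" << hi_ << '\n';
    }

private:
    float lo_;
    float hi_;
};

int main() {
    Estimator e({3.14f, 666.0f});        // the application side of the call
}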
So the combinations so far are like this:
A: C++14, L: C++14, Intel => OK
A: C++14, L: C++17, Intel => OK
A: C++17, L: C++14, Intel => OK
A: C++17, L: C++17, Intel => OK
A: C++14, L: C++14, ARM => OK
A: C++14, L: C++17, ARM => FAIL
A: C++17, L: C++14, ARM => FAIL
A: C++17, L: C++17, ARM => OK
Unfortunately this is a big commercial application, so I cannot just copy-paste code here. I first suspected a compiler bug (which it still might be), but that just seems too obvious!
And there's more:
We also recently noticed that if we just replace the const std::pair<float, float> & with plain float arguments, the tests pass again.
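In other words, a change along these lines (same hypothetical names as above) makes the failures disappear:

#include <utility>

class Estimator {                        // hypothetical library class
public:
    // Before: pair reference crossing the library boundary; fails on ARM
    // when A and L are built with different -std flags.
    explicit Estimator(const std::pair<float, float>& range);

    // After: two plain float arguments; with this the tests pass again.
    Estimator(float lo, float hi);
};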
Any guesses as to what the hell is going on? A compiler bug? How could the switch to C++17 even in theory cause anything like this (the compiler is exactly the same)? And especially a failure like this, where it doesn't matter which of the two components is upgraded?
We simply cannot find anything wrong with the API. It has been working for almost two years without any issues on Intel and ARM with C++14.
EDIT: Managed to make a working example project: https://drive.google.com/open?id=1B5SceFB1mKkCnE8iE59Mq0lScK2F0iOl
Instructions and example outputs in README.md
Outputs from this example on Intel and on Jetson TX2:
On Intel (Ubuntu 18.04, GCC 7.5.0) this app prints:
$ ./app/App
S: 42
L: 3.14
R: 666
In Foo::update(): s: 42
In Foo::update(): l: 3.14
In Foo::update(): r: 666
On Jetson TX2 (Ubuntu 18.04, GCC 7.5.0 / Linaro) this app prints:
$ ./app/App
S: 42
L: 0
R: 2.39152e+29
In Foo::update(): s: 42
In Foo::update(): l: 0
In Foo::update(): r: 2.39152e+29
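For reference, the failing configuration boils down to compiling the two halves with different standard flags and linking them together, roughly like this (the file names are guesses based on the output above, not necessarily the linked project's actual layout):

$ g++ -std=c++17 -c Foo.cpp -o Foo.o        # the library half
$ g++ -std=c++14 App.cpp Foo.o -o app/App   # the application half
$ ./app/App                                 # prints the corrupted values on the TX2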
I don't know anything for sure since I haven't looked, but this sounds like a case of the binary interface, the ABI, changing. This could happen because of a structure layout change, maybe as part of the effort to unify pairs and tuples. It could also be a change in padding rules, or in alignment rules. On reflection, that last one seems the most likely: if one side allocated using float alignment and the other using double alignment, or if one side decided to use 64-bit alignment for everything, that would definitely cause weird things.
Passing by reference usually passes a pointer under the hood, so if the structure's definition changes between C++ versions, the two sides can disagree on the byte layout.
This may be an accident in the ARM compiler, because if the ABI had changed on purpose there would have been some effort to put the new std::pair into a new namespace, as was done for the C++11 std::string in GNU libstdc++.
I would test some of this by making structs and arrays of std::pair in each language-standard mode and dumping them to disk files, or examining them in a debugger, to see which bytes change.
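A quick sketch of that kind of check: build this once with -std=c++14 and once with -std=c++17 (same machine, same compiler) and diff the output.

// pair_layout.cpp -- compile twice with different -std flags and compare.
#include <cstdio>
#include <cstring>
#include <type_traits>
#include <utility>

int main() {
    using P = std::pair<float, float>;

    std::printf("sizeof:  %zu\n", sizeof(P));
    std::printf("alignof: %zu\n", alignof(P));
    std::printf("trivially copyable: %d\n",
                static_cast<int>(std::is_trivially_copyable<P>::value));
    std::printf("standard layout:    %d\n",
                static_cast<int>(std::is_standard_layout<P>::value));

    // Dump the raw bytes of a known value so any layout or padding
    // difference between the two builds shows up directly.
    P p(3.14f, 666.0f);
    unsigned char bytes[sizeof(P)];
    std::memcpy(bytes, &p, sizeof(P));
    for (std::size_t i = 0; i < sizeof(P); ++i)
        std::printf("%02x ", bytes[i]);
    std::printf("\n");
}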