API with std::pair<float, float> breaks when switching from C++14 to C++17 on ARM?


We have been struggling with a very weird issue when upgrading from C++14 to C++17 (Ubuntu 18.04, GCC 7.5.0). On the Jetson TX2 we use the default toolchain, which is Linaro's.

Background:

We have a C++ application A that uses algorithms from library L, also developed by us, running on Ubuntu 18.04. Builds and extensive system tests have been running for two years on Intel and on Jetson TX2.

Now we decided to upgrade to C++17 (-std=c++1z with GCC). We first built L with C++17 enabled and everything seemed to work fine at first, but then we noticed that some test runs started to act weirdly on ARM only. About 2 tests out of 30 failed, and the failures were deterministic(!).

We then started to investigate and noticed that one constructor in the library that accepted const std::pair<float, float> & got somehow corrupted data. Inside constructor .first seemed to be .second and .second was always 0. Something weird like this.

So this happens if A is still on C++14 and L is on C++17.

Ok.

Then we tried this the other way around: L on C++14 and the application A on C++17. The results were similar. Some tests started to fail (not the same ones, though) and the failures were deterministic. The root cause was again the same: somehow std::pair<float, float> in the API gets messed up.

So the combinations so far are like this:

A: C++14, L: C++14, Intel => OK

A: C++14, L: C++17, Intel => OK

A: C++17, L: C++14, Intel => OK

A: C++17, L: C++17, Intel => OK

A: C++14, L: C++14, ARM => OK

A: C++14, L: C++17, ARM => FAIL

A: C++17, L: C++14, ARM => FAIL

A: C++17, L: C++17, ARM => OK

Unfortunately this is a big commercial application, so I cannot just copy-paste code here. I first suspected a compiler bug (which it still might be), but that would seem too obvious!

And there's more:

We also recently noticed that if we just replace the const std::pair<float, float> & with just plain float arguments the tests are passing again.
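A rough sketch of that workaround, with made-up names (Limits and make_limits are hypothetical stand-ins, not our real API): only plain floats cross the library boundary, and a pair is unpacked on the caller's side before the call.

```cpp
#include <utility>

// Hypothetical stand-in for the library class; the real API is not shown here.
class Limits {
public:
    // Scalars cross the library boundary instead of std::pair<float, float>.
    // Defined inline only to keep this sketch self-contained; in a real
    // library this constructor would live in the library's own source file.
    Limits(float lower, float upper) : lower_(lower), upper_(upper) {}

    float lower() const { return lower_; }
    float upper() const { return upper_; }

private:
    float lower_;
    float upper_;
};

// Header-only helper for callers that already hold a pair. It is compiled
// with the caller's own settings and unpacks the pair before the boundary.
inline Limits make_limits(const std::pair<float, float>& p) {
    return Limits(p.first, p.second);
}
```

With this shape, something like make_limits(std::make_pair(3.14f, 666.0f)) would still be available to callers, while the exported constructor only ever sees two floats.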

Any guesses what the hell is going on? A compiler bug? How the switch to C++17 would even in theory cause anything like this (the compiler is exactly the same)? And especially like this (doesn't matter which component is upgraded).

We just fail to find anything wrong with the API. It has been working almost two years without any issues on Intel and ARM with C++14.

EDIT: Managed to make a working example project: https://drive.google.com/open?id=1B5SceFB1mKkCnE8iE59Mq0lScK2F0iOl

Instructions and example outputs in README.md

Outputs from this example on Intel and on Jetson TX2:

On Intel (Ubuntu 18.04, GCC 7.5.0) this app prints:

$ ./app/App 
S: 42
L: 3.14
R: 666
In Foo::update(): s: 42
In Foo::update(): l: 3.14
In Foo::update(): r: 666

On Jetson TX2 (Ubuntu 18.04, GCC 7.5.0 / Linaro) this app prints:

$ ./app/App 
S: 42
L: 0
R: 2.39152e+29
In Foo::update(): s: 42
In Foo::update(): l: 0
In Foo::update(): r: 2.39152e+29

There are 2 best solutions below

2
On

I don't know anything for sure since I haven't looked, but this sounds like a case of the binary interface (the ABI) changing. This could happen because of a structure layout change, maybe as part of the effort to unify pairs and tuples. It could also be a change in padding rules or alignment rules. On reflection, alignment seems the most likely: if one side allocated using float alignment and the other double alignment, or one side decided to use 64-bit alignment for everything, that would definitely cause weird things.

Passing by reference usually passes a pointer in the implementation. So if the structure's byte layout differs between C++ versions, the callee reads the caller's object through that pointer using the wrong layout.

This may be an accident in the ARM compilers, because if the ABI had changed on purpose there would have been some effort to put the new types into a new namespace, as was done for the C++11 std::string in GNU libstdc++.
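For illustration, the namespace technique can be sketched as below. The library name lib, the tag v2 and the Point type are all hypothetical; the GNU standard library does the equivalent by placing the C++11 string in the inline namespace std::__cxx11.

```cpp
#include <type_traits>

namespace lib {
// Hypothetical library: the inline namespace name becomes part of every
// mangled symbol, so code built against an older "v1" would fail to link
// instead of silently calling these definitions.
inline namespace v2 {

struct Point {
    float x;
    float y;
};

inline float sum(const Point& p) { return p.x + p.y; }

}  // namespace v2
}  // namespace lib

// Callers still write lib::Point; the version tag is invisible in source code.
static_assert(std::is_same<lib::Point, lib::v2::Point>::value,
              "lib::Point names the type inside the inline namespace");
```

The point of the design is that an intentional ABI break turns into a link error rather than corrupted data at run time, which is not what the question describes.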

I would test some of this by making structs and arrays of std::pairs in each compiler version and dump them to disk files or examine them in a debugger. See what bytes change.
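A minimal sketch of such a check, assuming you can build the same file once with -std=c++14 and once with -std=c++17 (dump_bytes and report_pair_layout are names made up for this example):

```cpp
#include <cstddef>
#include <cstdio>
#include <string>
#include <utility>

// Returns the object representation of `value` as lowercase hex, in memory order.
template <typename T>
std::string dump_bytes(const T& value) {
    const unsigned char* bytes = reinterpret_cast<const unsigned char*>(&value);
    static const char digits[] = "0123456789abcdef";
    std::string out;
    for (std::size_t i = 0; i < sizeof(T); ++i) {
        out += digits[bytes[i] >> 4];
        out += digits[bytes[i] & 0x0F];
    }
    return out;
}

// Prints the facts worth comparing between the two builds.
inline void report_pair_layout() {
    const std::pair<float, float> p(3.14f, 666.0f);
    std::printf("__cplusplus=%ld sizeof=%zu alignof=%zu bytes=%s\n",
                static_cast<long>(__cplusplus), sizeof(p),
                alignof(std::pair<float, float>), dump_bytes(p).c_str());
}
```

Call report_pair_layout() from main in each build and diff the two lines. If size, alignment and bytes all match, the layout is probably not the culprit and the difference would have to be somewhere else, for example in how arguments are passed.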

7
On

How the switch to C++17 would even in theory cause anything like this (the compiler is exactly the same)?

There are LOADS of ways it could change something in theory.

The most straightforward is that the standard library headers have lots of conditional compilation with things like:

#if __cplusplus <= 201402L
/* code for C++14 ... */
#else
/* code for C++17 ... */
#endif

All it takes is for the two bits of code to be incompatible. We try pretty hard to ensure that doesn't happen. But in theory it can happen.
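As a deliberately contrived sketch of how that can go wrong (Widget is a hypothetical type, not something from the standard library), a header like this gives the type a different size depending on the -std option used by whoever includes it:

```cpp
#include <cstddef>

// Hypothetical header shared by two components. The extra member exists only
// when the header is compiled as C++17 or later, so a C++14 build and a C++17
// build disagree about the size of Widget.
struct Widget {
    float value;
#if __cplusplus > 201402L
    float cached;  // present only in C++17 and later builds
#endif
};

constexpr long language_version() { return __cplusplus; }
constexpr std::size_t widget_size() { return sizeof(Widget); }
```

If an application and a library pass a Widget between them while built with different settings, one side reads or writes a member that the other side never allocated.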

We then started to investigate and noticed that one constructor in the library that accepted const std::pair<float, float> & got somehow corrupted data. Inside constructor .first seemed to be .second and .second was always 0. Something weird like this.

I'm unable to reproduce anything like this. When I examine the assembly generated by GCC 7.3 for AArch64, the results are identical for C++14 and C++17. So you'll need to provide more information about your code. It shouldn't be hard to show the constructor signature and the data members of the class, without needing to show big chunks of proprietary code.

Edit: I've reduced the working example to this live example, which shows that the generated code for a class with an empty base is different for C++14 and C++17, which is a compiler bug: https://godbolt.org/z/E46NFc
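For readers who cannot open the link, the kind of type involved can be sketched as follows. This is a hypothetical reduction written for illustration, not the code behind the link, and the guess that the by-value call is the interesting part is an assumption.

```cpp
#include <type_traits>

// Hypothetical reduction, not the code from the live example.
struct Empty {};

struct PairLike : Empty {
    float first;
    float second;
    PairLike(float a, float b) : first(a), second(b) {}
};

// The empty base takes no storage, so the layout is the same in both modes.
static_assert(std::is_empty<Empty>::value, "base has no data members");
static_assert(sizeof(PairLike) == 2 * sizeof(float), "empty base adds no size");
static_assert(std::is_standard_layout<PairLike>::value, "members sit at fixed offsets");

// Assumption: a by-value parameter is one place where caller and callee must
// agree on registers, so the assembly of this function is worth comparing
// between -std=c++14 and -std=c++17.
float sum_by_value(PairLike p) { return p.first + p.second; }
```

Since the static assertions hold in either mode, a size or offset dump would not reveal anything for a type like this; only a comparison of the generated code would.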

I've reported it as https://gcc.gnu.org/bugzilla/show_bug.cgi?id=94383