In GCC 13.2 and Clang 18.0.1 using -O3, the generated assembly for operator < of a tuple of four std::uint16_t objects is less than impressive (pun intended). Two other implementations using reinterpret_cast and type punning both produce optimized assembly.
EDIT: Since this question is already getting downvotes, I'd like to point out that the default assembly uses 3 conditionals, which is quite impactful when this code is being run in a high-performance loop. Both other implementations have zero branches.
Question 1: Are these implementations correct?
Question 2: If not, what conditions are required to guarantee correctness?
Question 3: Assuming there is a test for correctness that can be evaluated at compile time, why doesn't the compiler simply output the optimized assembly?
#include <cstdint>
#include <tuple>
#include <bit>
using TupleU16 = std::tuple<std::uint16_t, std::uint16_t, std::uint16_t, std::uint16_t>;
bool less(TupleU16 lhs, TupleU16 rhs) {
return lhs < rhs;
}
constexpr bool canSafelyUseTypePunning() {
TupleU16 t;
std::uint16_t* val0 = &std::get<0>(t);
std::uint16_t* val1 = &std::get<1>(t);
std::uint16_t* val2 = &std::get<2>(t);
std::uint16_t* val3 = &std::get<3>(t);
if constexpr (std::endian::native == std::endian::little) {
return sizeof(TupleU16) == 8 &&
(val0 - val3) == 3 && (val0 - val2) == 2 && (val0 - val1) == 1;
}
else if constexpr (std::endian::native == std::endian::big) {
return sizeof(TupleU16) == 8 &&
(val3 - val0) == 3 && (val2 - val0) == 2 && (val1 - val0) == 1;
}
else {
return false;
}
}
bool less2(TupleU16 lhs, TupleU16 rhs) { // cannot be constexpr
static_assert(canSafelyUseTypePunning());
auto* lhsp = reinterpret_cast<const std::uint64_t*>(&lhs);
auto* rhsp = reinterpret_cast<const std::uint64_t*>(&rhs);
return *lhsp < *rhsp;
}
bool less3(TupleU16 lhs, TupleU16 rhs) { // can be constexpr
static_assert(canSafelyUseTypePunning());
union U {
std::uint64_t u64;
TupleU16 tu16;
};
return U{.tu16 = lhs}.u64 < U{.tu16 = rhs}.u64;
}
No, that's
std::tupleobject through a pointer of a different type and many compilers actually do make use of the aliasing rule for optimization (although there are often options to disable these optimizations)std::tuples memory layout at all and there are in factstd::tupleimplementations which order elements in different orders. They are not even guaranteed to be contiguous in memory.The reason that
std::tuple'soperator<is suboptimal for your use case is probably that it guarantees that the lexicographically next element is only accessed if the previous element's comparison hasn't been conclusive yet. That means that it gives extra guarantees in the case that another thread modifies other elements of the tuple.It seems that you don't need that guarantee and would prefer a more performant implementation. If you need that, implement a
std::tuple-equivalent with these requirements and layout assumptions you need in mind yourself and for the comparison usestd::bit_cast<uint64_t>instead ofreinterpret_castto get defined behavior (or equivalentlystd::memcpyto a localuint64_tvariable). Also, if all element types of the tuple are equal, thenstd::arraywould be enough (but it has the sameoperator<performance issue).