I really like std::byte as a distinct type that implements the concept of byte as specified in the C++ language definition. What I don't like is the fact that modern C++ compilers will produce less optimized code using the standard algorithms.
Here I'm playing with a function that checks the first 4 bytes in the header, you can follow my snippets on Godbolt
bool func_bytes(const std::array<std::byte, 1024>& buf) {
constexpr std::array<std::byte, 4> header {
std::byte{0xDE}, std::byte{0xAD}, std::byte{0xBE}, std::byte{0xAF}
};
return std::equal(header.begin(), header.end(), buf.begin());
}
This will produce the following assembly on x86-64 gcc trunk
func_bytes(std::array<std::byte, 1024ul> const&):
cmp BYTE PTR [rdi], -34
jne .L5
cmp BYTE PTR [rdi+1], -83
jne .L5
cmp BYTE PTR [rdi+2], -66
jne .L5
cmp BYTE PTR [rdi+3], -81
sete al
ret
.L5:
xor eax, eax
ret
If I replace the std::byte with unsigned char, then compiler will optimize to just an dword comparison.
bool func_chars(const std::array<unsigned char, 1024>& buf) {
constexpr std::array<unsigned char, 4> header {0xDE, 0xAD, 0xBE, 0xAF};
return std::equal(header.begin(), header.end(), buf.begin());
}
Here is the assembly produced
.LC0:
.string "\336\255\276\257"
func_chars(std::array<unsigned char, 1024ul> const&):
mov eax, DWORD PTR [rdi]
cmp DWORD PTR .LC0[rip], eax
sete al
ret
My solution to optimized std::byte version is to use old friends memcmp() and memcpy(), which is translated into the compiler's builtin.
bool func_bytes_memcmp(const std::array<std::byte, 1024>& buf) {
constexpr std::array<std::byte, 4> header {
std::byte{0xDE}, std::byte{0xAD}, std::byte{0xBE}, std::byte{0xAF}
};
return 0==std::memcmp(header.data(), buf.data(), header.size());
}
which produces the smallest code!
func_bytes_memcmp(std::array<std::byte, 1024ul> const&):
cmp DWORD PTR [rdi], -1346458146
sete al
ret
Is this the modern C++ approach?
That's a quality issue with the implementation of
std::equalin the standard library, not with the compiler itself (I think).Note that on compiler explorer by default both GCC and Clang use libstdc++ as the standard library implementation.
It has a generic implementation of
std::equalwhich uses a loop iterating linearly through the range and short-circuits if comparison of an element fails. When the compiler compiles this loop, it can't replace the individual byte access with single comparison because it would break the short-circuit, potentially performing a load on a byte it wasn't supposed to.I am not entirely sure whether there is actually an architectural justification to not invent loads though or whether this is simply not done because the performance benefit is ambiguous. As far as I can see there isn't any issue with inventing a load given that we know the address to be accessible (since it is part of a C++ object in its lifetime). But that high-level information might not make it until the point that these optimizations are made.
There is also a specialization of
std::equalwhich forwards tostd::memcmpin case the value type is "simple", but that currently considers only integers and pointers. It should includestd::byte.This has already been reported on GCC's bug tracker here but doesn't seem to have received any response. Nonetheless I think this should be easy to fix and I don't think there is any problem with also considering
std::bytesimple for that purpose.Note that other standard library implementations may not have the
memcmpoptimization for any types. E.g. with libc++ Clang produces the non-optimal code with bothstd::byteandunsigned char.Btw. you can further improve the compiler output by making the
headerarraystatic. It will then produce identical output to thememcmpversion. I don't actually see any particular reason that it loads the string instead of using an immediate withoutstatic, but withoutstaticeach call to the function ought to have a differentheaderinstance with a different address. At least before inlining that requires making the copy, since a pointer into the array is formed and the compiler wouldn't be sure whetherstd::equalwouldn't e.g. compare pointers to different instances ofheader.As a workaround I would write my own implementation of
std::equalthat forwards tomemcmpin more situations than the actual standard library implementation does. And then I would also make my constantsstatic constexprto help out the compiler a bit. But if this is a one-off situation I don't see anything wrong with your approach of usingmemcmpdirectly.