I really like std::byte
as a distinct type that implements the concept of byte as specified in the C++ language definition. What I don't like is the fact that modern C++ compilers will produce less optimized code using the standard algorithms.
Here I'm playing with a function that checks the first 4 bytes in the header, you can follow my snippets on Godbolt
bool func_bytes(const std::array<std::byte, 1024>& buf) {
constexpr std::array<std::byte, 4> header {
std::byte{0xDE}, std::byte{0xAD}, std::byte{0xBE}, std::byte{0xAF}
};
return std::equal(header.begin(), header.end(), buf.begin());
}
This will produce the following assembly on x86-64 gcc trunk
func_bytes(std::array<std::byte, 1024ul> const&):
cmp BYTE PTR [rdi], -34
jne .L5
cmp BYTE PTR [rdi+1], -83
jne .L5
cmp BYTE PTR [rdi+2], -66
jne .L5
cmp BYTE PTR [rdi+3], -81
sete al
ret
.L5:
xor eax, eax
ret
If I replace the std::byte
with unsigned char
, then compiler will optimize to just an dword
comparison.
bool func_chars(const std::array<unsigned char, 1024>& buf) {
constexpr std::array<unsigned char, 4> header {0xDE, 0xAD, 0xBE, 0xAF};
return std::equal(header.begin(), header.end(), buf.begin());
}
Here is the assembly produced
.LC0:
.string "\336\255\276\257"
func_chars(std::array<unsigned char, 1024ul> const&):
mov eax, DWORD PTR [rdi]
cmp DWORD PTR .LC0[rip], eax
sete al
ret
My solution to optimized std::byte
version is to use old friends memcmp()
and memcpy()
, which is translated into the compiler's builtin.
bool func_bytes_memcmp(const std::array<std::byte, 1024>& buf) {
constexpr std::array<std::byte, 4> header {
std::byte{0xDE}, std::byte{0xAD}, std::byte{0xBE}, std::byte{0xAF}
};
return 0==std::memcmp(header.data(), buf.data(), header.size());
}
which produces the smallest code!
func_bytes_memcmp(std::array<std::byte, 1024ul> const&):
cmp DWORD PTR [rdi], -1346458146
sete al
ret
Is this the modern C++ approach?
That's a quality issue with the implementation of
std::equal
in the standard library, not with the compiler itself (I think).Note that on compiler explorer by default both GCC and Clang use libstdc++ as the standard library implementation.
It has a generic implementation of
std::equal
which uses a loop iterating linearly through the range and short-circuits if comparison of an element fails. When the compiler compiles this loop, it can't replace the individual byte access with single comparison because it would break the short-circuit, potentially performing a load on a byte it wasn't supposed to.I am not entirely sure whether there is actually an architectural justification to not invent loads though or whether this is simply not done because the performance benefit is ambiguous. As far as I can see there isn't any issue with inventing a load given that we know the address to be accessible (since it is part of a C++ object in its lifetime). But that high-level information might not make it until the point that these optimizations are made.
There is also a specialization of
std::equal
which forwards tostd::memcmp
in case the value type is "simple", but that currently considers only integers and pointers. It should includestd::byte
.This has already been reported on GCC's bug tracker here but doesn't seem to have received any response. Nonetheless I think this should be easy to fix and I don't think there is any problem with also considering
std::byte
simple for that purpose.Note that other standard library implementations may not have the
memcmp
optimization for any types. E.g. with libc++ Clang produces the non-optimal code with bothstd::byte
andunsigned char
.Btw. you can further improve the compiler output by making the
header
arraystatic
. It will then produce identical output to thememcmp
version. I don't actually see any particular reason that it loads the string instead of using an immediate withoutstatic
, but withoutstatic
each call to the function ought to have a differentheader
instance with a different address. At least before inlining that requires making the copy, since a pointer into the array is formed and the compiler wouldn't be sure whetherstd::equal
wouldn't e.g. compare pointers to different instances ofheader
.As a workaround I would write my own implementation of
std::equal
that forwards tomemcmp
in more situations than the actual standard library implementation does. And then I would also make my constantsstatic constexpr
to help out the compiler a bit. But if this is a one-off situation I don't see anything wrong with your approach of usingmemcmp
directly.