C++17 std::byte produces less optimized code with the standard algorithms in GCC

294 Views Asked by At

I really like std::byte as a distinct type that implements the concept of byte as specified in the C++ language definition. What I don't like is the fact that modern C++ compilers will produce less optimized code using the standard algorithms.

Here I'm playing with a function that checks the first 4 bytes in the header, you can follow my snippets on Godbolt

bool func_bytes(const std::array<std::byte, 1024>& buf) {
    constexpr std::array<std::byte, 4> header {
        std::byte{0xDE}, std::byte{0xAD}, std::byte{0xBE}, std::byte{0xAF}
    };
    return std::equal(header.begin(), header.end(), buf.begin());
}

This will produce the following assembly on x86-64 gcc trunk

func_bytes(std::array<std::byte, 1024ul> const&):
        cmp     BYTE PTR [rdi], -34
        jne     .L5
        cmp     BYTE PTR [rdi+1], -83
        jne     .L5
        cmp     BYTE PTR [rdi+2], -66
        jne     .L5
        cmp     BYTE PTR [rdi+3], -81
        sete    al
        ret
.L5:
        xor     eax, eax
        ret

If I replace the std::byte with unsigned char, then compiler will optimize to just an dword comparison.

bool func_chars(const std::array<unsigned char, 1024>& buf) {
    constexpr std::array<unsigned char, 4> header {0xDE, 0xAD, 0xBE, 0xAF};
    return std::equal(header.begin(), header.end(), buf.begin());
}

Here is the assembly produced

.LC0:
        .string "\336\255\276\257"
func_chars(std::array<unsigned char, 1024ul> const&):
        mov     eax, DWORD PTR [rdi]
        cmp     DWORD PTR .LC0[rip], eax
        sete    al
        ret

My solution to optimized std::byte version is to use old friends memcmp() and memcpy(), which is translated into the compiler's builtin.

bool func_bytes_memcmp(const std::array<std::byte, 1024>& buf) {
    constexpr std::array<std::byte, 4> header {
        std::byte{0xDE}, std::byte{0xAD}, std::byte{0xBE}, std::byte{0xAF}
    };
    return 0==std::memcmp(header.data(), buf.data(), header.size());
}

which produces the smallest code!

func_bytes_memcmp(std::array<std::byte, 1024ul> const&):
        cmp     DWORD PTR [rdi], -1346458146
        sete    al
        ret

Is this the modern C++ approach?

1

There are 1 best solutions below

1
On

That's a quality issue with the implementation of std::equal in the standard library, not with the compiler itself (I think).

Note that on compiler explorer by default both GCC and Clang use libstdc++ as the standard library implementation.

It has a generic implementation of std::equal which uses a loop iterating linearly through the range and short-circuits if comparison of an element fails. When the compiler compiles this loop, it can't replace the individual byte access with single comparison because it would break the short-circuit, potentially performing a load on a byte it wasn't supposed to.

I am not entirely sure whether there is actually an architectural justification to not invent loads though or whether this is simply not done because the performance benefit is ambiguous. As far as I can see there isn't any issue with inventing a load given that we know the address to be accessible (since it is part of a C++ object in its lifetime). But that high-level information might not make it until the point that these optimizations are made.

There is also a specialization of std::equal which forwards to std::memcmp in case the value type is "simple", but that currently considers only integers and pointers. It should include std::byte.

This has already been reported on GCC's bug tracker here but doesn't seem to have received any response. Nonetheless I think this should be easy to fix and I don't think there is any problem with also considering std::byte simple for that purpose.

Note that other standard library implementations may not have the memcmp optimization for any types. E.g. with libc++ Clang produces the non-optimal code with both std::byte and unsigned char.

Btw. you can further improve the compiler output by making the header array static. It will then produce identical output to the memcmp version. I don't actually see any particular reason that it loads the string instead of using an immediate without static, but without static each call to the function ought to have a different header instance with a different address. At least before inlining that requires making the copy, since a pointer into the array is formed and the compiler wouldn't be sure whether std::equal wouldn't e.g. compare pointers to different instances of header.


As a workaround I would write my own implementation of std::equal that forwards to memcmp in more situations than the actual standard library implementation does. And then I would also make my constants static constexpr to help out the compiler a bit. But if this is a one-off situation I don't see anything wrong with your approach of using memcmp directly.