I'm working on embedded code, and I was at first delighted to find that I could use std::ranges and views to simplify even performance-intensive loops: the compiler optimizes away all the iterators and produces the same assembly as if I had written the old-school loop with all the indexing done by hand.
Now C++23 introduces views::adjacent, views::stride, etc., which would let me simplify even more. However, the optimizer appears to hit a wall there. A simplified toy model, summing the even-indexed and the odd-indexed elements of an array separately:
#include <array>
#include <ranges>
#include <tuple>

// Old-school: manual indexing, two elements per iteration
std::tuple<int, int> process(const std::array<int, 16> &in)
{
    int sumL = 0;
    int sumR = 0;
    for (unsigned i = 0; i < in.size(); )
    {
        sumL += in[i++];
        sumR += in[i++];
    }
    return {sumL, sumR};
}
// Ranges: adjacent pairs, keeping every second pair
std::tuple<int, int> processRanges(const std::array<int, 16> &in)
{
    int sumL = 0;
    int sumR = 0;
    for (auto && [l, r] : in | std::views::adjacent<2> | std::views::stride(2))
    {
        sumL += l;
        sumR += r;
    }
    return {sumL, sumR};
}
// Ranges, using std::views::chunk
std::tuple<int, int> processRangesChunked(const std::array<int, 16> &in)
{
    int sumL = 0;
    int sumR = 0;
    for (auto && inner : in | std::views::chunk(2))
    {
        sumL += inner[0];
        sumR += inner[1];
    }
    return {sumL, sumR};
}
Using -O3, the old-school version compiles to assembly that I couldn't improve on by hand: the loop is entirely unrolled, etc. The ranges version using adjacent and stride not only misses the unrolling but also produces odd code that looks like a nested loop. Using chunk is a bit better, but it still produces more instructions and has a slightly less pleasant interface anyway. Godbolt: https://godbolt.org/z/r99seWEMz
In this toy case it's a micro-optimization, but in my actual use case, which has a similar structure of processing every second element differently, the compiler misses obvious and very necessary inlining opportunities, completely destroying the performance.
My question(s): is it just a fact of life at the moment that more complicated loop indexing cannot be written with std::ranges where performance matters? Or am I writing an unnecessarily complicated view with adjacent and stride, and is there some formulation that optimizes better? Perhaps a custom view, like chunk but returning tuples?