tl;dr
In checking if two strings are case-insensitively equal, why is this code
bool rangeV3WithTransformViews(std::string_view str1, std::string_view str2) {
return equal(str1 | transform(tolowercase),
str2 | transform(tolowercase));
}
that much slower than this?
bool rangeV3WithProj(std::string_view str1, std::string_view str2) {
return equal(str1, str2, std::equal_to{}, tolowercase, tolowercase);
}
From the latter code, GCC 13.2 with -O3 manages to generate the same assembly code as the low level C-style solution:
#define to_lower_ascii(c) (((c) >= 'A' && (c) <= 'Z') ? (c) + 32 : (c))
int cstyle(const char* str1, const char* str2) {
while (*str1 != '\0' && *str2 != '\0' && to_lower_ascii(*str1) == to_lower_ascii(*str2)) {
str1++;
str2++;
}
return to_lower_ascii(*str1) - to_lower_ascii(*str2);
}
More datails
The origin of my question is my sense of frustration when I realized such a verbose, low-lever, and unreadable¹ code is written to make such a simple thing as a case-insensitive string equality.
So I started writing out a range-based solution. Initially I ended up with the first solution above, thinking it was the best I could write, but quick-bench.com proved me wrong. Yes, maybe I haven't written the test carefully and haven't read the documentation of Google Benchmark, but two things are clear
- GCC (at least the version on QuickBench, and with those options) manages to compile
rangeV3WithProjdown to the same assembly ascstyle, which means that the compiler could 100% see through the abstractions; rangeV3WithProjis simpler thanrangeV3WithTransformViewsand uses less abstractions, so I don't see any reason to think that the former can be slower than the latter.
I would expect that the compiler could see through rangeV3WithTransformViews too, though, and generate the same assembly code, but it doesn't.
So my questions are:
- What, if anything, makes the first snippet inherently slower than the second?
- What, if anything, makes harder for the compiler to work out the same assembly for the first snippet as for the second and third?
I think the problem is with transform, which is the only difference between rangeV3WithTransformViews and rangeV3WithProj, but why does it have the effect it has (different assembly, and maybe worse performance)? After all, the size of str1 and str2 is known in the function in O(1), because they're std::string_views, hence they satisfy sized_range, and transform gives back a sized_range for sized_range inputs, so it's not like there's a loop of uknown length or anything like that, no?
Caveat
I don't really know what QuickBench does, but one thing that makes me nervous is that Compiler Explorer doesn't generate correspondigly identical assembly codes for rangeV3WithProj and for cstyle, even though I pass -O3 -std=c++20 just like on QuickBench.
(¹) Would have you noticed, if I had bugged the code by changing some != to ==, 32 to 34, the last str2 should be str1, or something similar?
This is because your benchmark is wrong.
If you enable the "no-op" bar, you'll find that
rangeV3WithProjandcstyleare ~1× as fast as doing literally nothing: They are optimized out.You actually want to make sure
bis not optimized out, not a function pointer:benchmark::DoNotOptimize(b);instead ofbenchmark::DoNotOptimize(<function name>);.Even then, the timing is still unfair because
rangeV3WithProjandrangeV3WithTransformViewshave to callstd::tolower, which has to look at the current locale. To be fair, usestd::tolowerwith all three or use your macro for all of them withconstexpr auto tolowercase = [](char c)->char{ return to_lower_ascii(c); };(to ensure that all functions have the same result).Also, your third function is slightly different: It requires a nul-terminated C-string instead of a sized contiguous range of characters. You also have to use
!cstyle(str1, str2)it to get the same result. You can use ranges algorithms with nul-terminated C-strings, but not out of the box (https://godbolt.org/z/TKfM5Ph4o)With
std::ranges, the two methods with ranges compile to the exact same assembly: https://godbolt.org/z/GbTcKTP4xThe same with Range-v3: https://godbolt.org/z/Th4oT8f7c (though it is slightly different from the
std::rangesassembly)Given that the exact same code is generated, there should not be a difference in runtime.