tl;dr

In checking if two strings are case-insensitively equal, why is this code

bool rangeV3WithTransformViews(std::string_view str1, std::string_view str2) {
    return equal(str1 | transform(tolowercase),
                 str2 | transform(tolowercase));
}

that much slower than this?

bool rangeV3WithProj(std::string_view str1, std::string_view str2) {
    return equal(str1, str2, std::equal_to{}, tolowercase, tolowercase);
}

From the latter code, GCC 13.2 with -O3 manages to generate the same assembly code as the low level C-style solution:

#define to_lower_ascii(c) (((c) >= 'A' && (c) <= 'Z') ? (c) + 32 : (c))

int cstyle(const char* str1, const char* str2) {
    while (*str1 != '\0' && *str2 != '\0' && to_lower_ascii(*str1) == to_lower_ascii(*str2)) {
        str1++;
        str2++;
    }
    return to_lower_ascii(*str1) - to_lower_ascii(*str2);
}

More datails

The origin of my question is my sense of frustration when I realized such a verbose, low-lever, and unreadable¹ code is written to make such a simple thing as a case-insensitive string equality.

So I started writing out a range-based solution. Initially I ended up with the first solution above, thinking it was the best I could write, but quick-bench.com proved me wrong. Yes, maybe I haven't written the test carefully and haven't read the documentation of Google Benchmark, but two things are clear

  1. GCC (at least the version on QuickBench, and with those options) manages to compile rangeV3WithProj down to the same assembly as cstyle, which means that the compiler could 100% see through the abstractions;
  2. rangeV3WithProj is simpler than rangeV3WithTransformViews and uses less abstractions, so I don't see any reason to think that the former can be slower than the latter.

I would expect that the compiler could see through rangeV3WithTransformViews too, though, and generate the same assembly code, but it doesn't.

So my questions are:

  • What, if anything, makes the first snippet inherently slower than the second?
  • What, if anything, makes harder for the compiler to work out the same assembly for the first snippet as for the second and third?

I think the problem is with transform, which is the only difference between rangeV3WithTransformViews and rangeV3WithProj, but why does it have the effect it has (different assembly, and maybe worse performance)? After all, the size of str1 and str2 is known in the function in O(1), because they're std::string_views, hence they satisfy sized_range, and transform gives back a sized_range for sized_range inputs, so it's not like there's a loop of uknown length or anything like that, no?

Caveat

I don't really know what QuickBench does, but one thing that makes me nervous is that Compiler Explorer doesn't generate correspondigly identical assembly codes for rangeV3WithProj and for cstyle, even though I pass -O3 -std=c++20 just like on QuickBench.


(¹) Would have you noticed, if I had bugged the code by changing some != to ==, 32 to 34, the last str2 should be str1, or something similar?

1

There are 1 best solutions below

1
Artyer On

This is because your benchmark is wrong.

If you enable the "no-op" bar, you'll find that rangeV3WithProj and cstyle are ~1× as fast as doing literally nothing: They are optimized out.

You actually want to make sure b is not optimized out, not a function pointer: benchmark::DoNotOptimize(b); instead of benchmark::DoNotOptimize(<function name>);.

Even then, the timing is still unfair because rangeV3WithProj and rangeV3WithTransformViews have to call std::tolower, which has to look at the current locale. To be fair, use std::tolower with all three or use your macro for all of them with constexpr auto tolowercase = [](char c)->char{ return to_lower_ascii(c); }; (to ensure that all functions have the same result).

Also, your third function is slightly different: It requires a nul-terminated C-string instead of a sized contiguous range of characters. You also have to use !cstyle(str1, str2) it to get the same result. You can use ranges algorithms with nul-terminated C-strings, but not out of the box (https://godbolt.org/z/TKfM5Ph4o)


With std::ranges, the two methods with ranges compile to the exact same assembly: https://godbolt.org/z/GbTcKTP4x

The same with Range-v3: https://godbolt.org/z/Th4oT8f7c (though it is slightly different from the std::ranges assembly)

Given that the exact same code is generated, there should not be a difference in runtime.