Forcing loop unrolling in MSVC C++


Imagine the following code:

for (int i = 0; i < 8; ++i) {
    // ... some code
}

I want this loop to be unrolled in MSVC. In Clang I can add #pragma unroll before the loop. How do I do the same in MSVC?

I understand that compilers will often unroll this loop for me even without any pragmas. But I want to be sure about this; I want it to be unrolled always.

Of course, one way to force unrolling is to recursively call a templated unrolling function with a passed-in functor, as in the following code:


template <int N, int I = 0, typename F>
inline void Unroll(F const & f) {
    if constexpr(I < N) {
        f.template operator() <I> ();
        Unroll<N, I + 1>(f);
    }
}

void f_maybe_not_unrolled() {
    int volatile x = 0;
    for (int i = 0; i < 8; ++i)
        x = x + i;
}

void f_forced_unrolled() {
    int volatile x = 0;
    Unroll<8>([&]<int I>{ x = x + I; });
}

But is it possible to force unrolling in MSVC without more complicated code like the above?

Also, is it possible to really force unrolling in Clang? I think #pragma unroll just gives Clang a hint (or am I wrong?). Is there something like #pragma force_unroll?

Also, I want to unroll just this single loop; I don't need a solution like passing command-line arguments that force unrolling of ALL possible loops.

Note: It is not crucial for me that the loop is unrolled in 100% of cases. I just need it to happen in most cases. Basically, I want to find the MSVC equivalent of Clang's #pragma unroll, which on average makes the compiler more likely to unroll a loop than it would be without the pragma.


There are 2 best solutions below

BEST ANSWER

You can't directly. The closest #pragma is #pragma loop(...), and that doesn't have an unroll option. The big hammer here is Profile-Guided Optimization (PGO): profile your program, and MSVC will know how often this loop runs.
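
For comparison, here is a minimal sketch of a per-loop unroll request on the compilers that do offer one. The function name sum_first_8 is made up for illustration. The sketch assumes that Clang accepts #pragma unroll and that GCC 8 or later accepts #pragma GCC unroll N; both are requests to the optimizer, not guarantees. On MSVC no pragma is emitted and the loop is left to the optimizer, as described above.

```cpp
// Hypothetical example: request unrolling where the compiler has a pragma for it.
int sum_first_8() {
    int volatile x = 0;  // volatile keeps the loop body from being folded away
#if defined(__clang__)
    // Clang also defines __GNUC__, so it has to be tested first.
#pragma unroll
#elif defined(__GNUC__) && __GNUC__ >= 8
#pragma GCC unroll 8
#endif
    // On MSVC none of the branches above is taken and no pragma is emitted.
    for (int i = 0; i < 8; ++i)
        x = x + i;
    return x;
}
```

Whether the loop was actually unrolled can only be confirmed by inspecting the generated assembly for each compiler and optimization level.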

ANSWER

This is much simpler with fold expressions:

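#include <cstddef>   // std::size_t
#include <utility>   // std::index_sequence, std::make_index_sequence
#if defined(__cpp_concepts)
#include <concepts>  // std::convertible_to
#endif

// Calls fn.operator()<0>() ... fn.operator()<N - 1>() in order
// and stops at the first call that returns false.
template<std::size_t N, typename Fn>
#if defined(__cpp_concepts)
    requires (N >= 1) && requires( Fn fn ) { { fn.template operator ()<N - 1>() } -> std::convertible_to<bool>; }
#endif
inline bool unroll( Fn fn )
{
    auto unroll_n = [&]<std::size_t ... Indices>( std::index_sequence<Indices ...> ) -> bool
    {
        return (fn.template operator ()<Indices>() && ...);
    };
    return unroll_n( std::make_index_sequence<N>() );
}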
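
As a usage illustration, the following sketch repeats the helper without the constraint so that it compiles on its own (C++20 is assumed for the templated lambdas). The functions sum_indices and count_until_stop are hypothetical names introduced only for this example; the second one shows that the && fold short-circuits once the lambda returns false.

```cpp
#include <cstddef>
#include <utility>

// Fold-expression helper, repeated here so the sketch is self-contained.
template<std::size_t N, typename Fn>
inline bool unroll(Fn fn) {
    auto unroll_n = [&]<std::size_t... Indices>(std::index_sequence<Indices...>) -> bool {
        return (fn.template operator()<Indices>() && ...);
    };
    return unroll_n(std::make_index_sequence<N>());
}

// Hypothetical example: adds the indices 0..7 with no runtime loop counter.
int sum_indices() {
    int x = 0;
    unroll<8>([&]<std::size_t I>() { x += static_cast<int>(I); return true; });
    return x;
}

// Hypothetical example: the lambda returns false at I == 3, so the
// && fold stops there and the remaining indices are never visited.
int count_until_stop() {
    int calls = 0;
    unroll<8>([&]<std::size_t I>() { ++calls; return I < 3; });
    return calls;
}
```

Because every index is a template argument, each call is a separate instantiation, so the expansion happens at compile time regardless of the optimizer's unrolling heuristics. Whether the calls are then inlined is still up to the compiler.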

This becomes really powerful if you want to do loop-unrolling with that:

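#include <cstddef>   // std::size_t, std::ptrdiff_t
#include <iterator>  // std::random_access_iterator, std::iter_value_t
#if defined(__cpp_concepts)
#include <concepts>  // std::same_as
#endif

// Applies fn to [begin, end) in blocks of N elements, then to the remaining tail.
// Stops early when fn returns false and returns the iterator where processing
// stopped (the start of the block if the stop happened inside an unrolled block).
template<std::size_t N, typename RandomIt, typename UnaryFunction>
#if defined(__cpp_concepts)
    requires std::random_access_iterator<RandomIt>
    && requires( UnaryFunction fn, std::iter_value_t<RandomIt> elem ) { { fn( elem ) } -> std::same_as<bool>; }
#endif
inline RandomIt unroll_for_each( RandomIt begin, RandomIt end, UnaryFunction fn )
{
    RandomIt it = begin;
    if constexpr( N > 1 )
        // end - it >= N avoids forming an iterator past the end, unlike it + N <= end.
        for( ; end - it >= (std::ptrdiff_t)N; it += N )
            if( !unroll<N>( [&]<std::size_t I>() { return fn( it[I] ); } ) )
                return it;
    for( ; it < end; ++it )
        if( !fn( *it ) )
            return it;
    return it;
}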

The peculiarity here is that the range check at the top of the unrolled loop is done once per block of N elements rather than once per element. The check of unroll's return value might get optimized away if the lambda always returns true for each element.
I optimized Fletcher's hash with this and got a speedup of 60%, resulting in about 18 GB/s, with an unrolling factor of five on my Zen 1 CPU.
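
To make the idea concrete, here is a rough sketch of a Fletcher-16 checksum built on this pattern with an unrolling factor of five. It is not the code measured above: fletcher16 is a hypothetical function written for illustration, the helpers are simplified variants without constraints, and a modulo on every byte is the textbook form rather than a tuned one. C++20 is assumed.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>

// Fold-expression helper, repeated so the sketch compiles on its own.
template<std::size_t N, typename Fn>
inline bool unroll(Fn fn) {
    auto unroll_n = [&]<std::size_t... Indices>(std::index_sequence<Indices...>) -> bool {
        return (fn.template operator()<Indices>() && ...);
    };
    return unroll_n(std::make_index_sequence<N>());
}

// Simplified variant of unroll_for_each: one range check per block of N
// elements, then an element-by-element tail. Stops when fn returns false.
template<std::size_t N, typename RandomIt, typename UnaryFunction>
inline RandomIt unroll_for_each(RandomIt begin, RandomIt end, UnaryFunction fn) {
    RandomIt it = begin;
    if constexpr (N > 1)
        for (; end - it >= static_cast<std::ptrdiff_t>(N); it += N)
            if (!unroll<N>([&]<std::size_t I>() { return fn(it[I]); }))
                return it;
    for (; it < end; ++it)
        if (!fn(*it))
            return it;
    return it;
}

// Hypothetical example: textbook Fletcher-16, five bytes per unrolled block.
inline std::uint16_t fletcher16(const std::uint8_t* data, std::size_t len) {
    std::uint32_t sum1 = 0, sum2 = 0;
    unroll_for_each<5>(data, data + len, [&](std::uint8_t byte) {
        sum1 = (sum1 + byte) % 255;  // running sum of the bytes
        sum2 = (sum2 + sum1) % 255;  // running sum of the running sums
        return true;                 // never stop early
    });
    return static_cast<std::uint16_t>((sum2 << 8) | sum1);
}
```

Since the lambda always returns true, the compiler is free to drop the early-exit branches, which is the elimination mentioned above. Any speedup depends on the compiler, the target CPU, and the chosen factor, so it needs to be measured rather than assumed.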