Why is the auto-vectorizer failing to find "vectorizable type information"?

I'm trying to get some of my code to vectorize, but I keep running into info C5002: loop not vectorized due to reason '1305'. According to this page:

// Code 1305 is emitted when the compiler can't discern proper vectorizable type information for this loop.

(I'm using Visual Studio Community 2022)

I decided to experiment with some non-functional code to better understand why this was happening, but this error seems to pop up in code that should be obviously typed, and easy to vectorize. This is my code:

int vecTest() {
    int v0[128] alignas(16);
    int v1[128] alignas(16);
    int v2[128] alignas(16);
    int sum = 0;

    for (int i = 0; i < 128; i++) {
        v0[i] = i-1;
        v1[i] = i*2;
    }
    
    for (int i = 0; i < 128; i++) {
        v2[i] = v0[i] + v2[i];
    }

    #ifdef CASE_TWO
    int* pv0 = &v0[0];
    int* pv1 = &v1[0];
    int* pv2 = &v2[0];

    for (int i = 0; i < 128; i++) {
        pv2[i] = pv0[i] + pv2[i];
    }
    #endif

    sum += v2[0];
    return sum;
}

int main(int argc, char* argv[])
{
    int sum = vecTest();
    sum = sum + 1;
}

If CASE_TWO is absent, the first (initialization) loop vectorizes, but the second is reported as not vectorized with reason code 1305. However, adding the contents of CASE_TWO causes all three loops to vectorize properly! Additionally, including the CASE_TWO code while removing the second loop causes the CASE_TWO loop to report 1305.

It seems to me that none of these loops should have trouble being vectorized, and that they shouldn't affect each other. What am I missing?

What is the actual meaning of code 1305 and "proper vectorizable type information", and does the compiler actually behave in the manner suggested by the documentation?

I'm using default compiler settings, except for /O2 and /Qvec-report:2.

There is 1 best solution below

If you look at the asm (on Godbolt), you can see that MSVC fused the two loops together, so there is no separate init loop. It computes v0[i] on the fly and adds it into the uninitialized v2[i] (a vector load and store on stack space it allocated but never wrote).

It reports the first loop as vectorized and the second as not, but really it fuses them into one asm loop. All the work in those loops gets vectorized, so this is arguably a bug in its reporting. (The one exception is the v1[i] = i*2; store, which nothing ever reads and which gets optimized away entirely.)

; x64 MSVC 19.37 -O2
v2$ = 0
int vecTest(void) PROC                                    ; vecTest, COMDAT
... function prologue
        movdqa  xmm2, XMMWORD PTR __xmm@00000003000000020000000100000000  ; _mm_setr_epi32(0,1,2,3)
        xor     eax, eax     ; i = 0
        movdqa  xmm3, XMMWORD PTR __xmm@00000001000000010000000100000001  ; _mm_set1_epi32(1)
        mov     ecx, eax     ; byte_offset = 0, could have just used a scaled-index addr mode
        npad    3
$LL4@vecTest:
        movdqu  xmm0, XMMWORD PTR v2$[rsp+rcx]  ; load uninitialized v2[i]
        lea     rcx, QWORD PTR [rcx+16]         ; byte_offset += 16
        movd    xmm1, eax
        add     eax, 4
        pshufd  xmm1, xmm1, 0     ; _mm_set1_epi32(i) = movd+pshufd
        paddd   xmm1, xmm2        ; add [3,2,1,0] to get [i+3, i+2, i+1, i+0]
        psubd   xmm1, xmm3        ; v0[i] = i - 1  for i+0..3
        paddd   xmm1, xmm0        ; v2[i] += v0[i]
        movdqu  XMMWORD PTR v2$[rsp+rcx-16], xmm1
        cmp     eax, 128                      ; 00000080H
        jl      SHORT $LL4@vecTest

        mov     eax, DWORD PTR v2$[rsp]   ; retval = v2[0]
... epilogue
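
If intrinsics are easier to read than asm, the loop above corresponds roughly to the sketch below. This is my reading of the asm, not anything MSVC emits or documents. The names fused_add and fused_add_mismatches are made up for illustration, the array is initialized by the caller instead of being read uninitialized, and the scalar fallback is only there so the sketch builds on non-x86 targets.

```cpp
#include <cassert>
#if defined(__SSE2__) || defined(_M_X64) || defined(_M_AMD64)
#include <emmintrin.h>   // SSE2 intrinsics
#define SKETCH_HAVE_SSE2 1
#else
#define SKETCH_HAVE_SSE2 0
#endif

// Hypothetical helper: v2[i] += (i - 1) for i in [0, 128), without ever
// storing v0[] to memory. Mirrors the structure of the fused asm loop.
void fused_add(int* v2) {
    assert(v2 != nullptr);
#if SKETCH_HAVE_SSE2
    const __m128i lanes = _mm_setr_epi32(0, 1, 2, 3);  // per-element offsets
    const __m128i ones  = _mm_set1_epi32(1);
    for (int i = 0; i < 128; i += 4) {
        // Load the old v2[i..i+3] (unaligned load, like movdqu)
        __m128i old = _mm_loadu_si128(reinterpret_cast<const __m128i*>(v2 + i));
        // Broadcast i (movd + pshufd), add offsets, subtract 1: v0[i..i+3]
        __m128i v0 = _mm_sub_epi32(_mm_add_epi32(_mm_set1_epi32(i), lanes), ones);
        // v2[i..i+3] = v0[i..i+3] + old
        _mm_storeu_si128(reinterpret_cast<__m128i*>(v2 + i), _mm_add_epi32(v0, old));
    }
#else
    for (int i = 0; i < 128; ++i) v2[i] += i - 1;  // scalar fallback
#endif
}

// Hypothetical check: compare against the plain scalar formula.
int fused_add_mismatches() {
    int v2[128];
    for (int i = 0; i < 128; ++i) v2[i] = 100 + i;   // arbitrary known contents
    fused_add(v2);
    int bad = 0;
    for (int i = 0; i < 128; ++i)
        if (v2[i] != (100 + i) + (i - 1)) ++bad;
    return bad;
}
```

Calling fused_add_mismatches() should give 0 if the sketch matches the scalar loop.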

By comparison, GCC isn't that clever and does allocate space for both v2 and v0 (sub rsp, 928 plus the 128-byte red zone is just over 1024 = 2 x 128 * sizeof(int)). MSVC allocated space for v2 but not v0 (sub rsp, 536 is just over 128 * sizeof(int) = 512). Neither compiler allocated space for the unused v1; I don't know why that's cluttering up your example.

Clang optimizes away everything, including the return value: reading uninitialized v2[] is UB in C++, or at least yields an indeterminate value, so it can leave whatever garbage it wants in EAX as the return value. With alignas(16) int v2[128] = {};, clang still optimizes away the arrays and just returns -1: https://godbolt.org/z/E9v1evE94. Note that clang requires the standard alignas(16) int v0[128]; syntax and does not allow the alignas to go after the declarator. GCC and MSVC allow that.
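
As a sketch of that variant, here is the function with the alignas specifier in the standard position and v2 zero-initialized. The name vecTestInit is made up, and the unused v1 and the CASE_TWO block are dropped to keep it short. With v2 starting at zero, v2[0] ends up as v0[0] + 0 = -1, which is the constant clang folds the whole function down to.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical variant of vecTest(): standard alignas placement,
// and v2 zero-initialized so reading it is well-defined.
int vecTestInit() {
    alignas(16) int v0[128];        // specifier goes before the type
    alignas(16) int v2[128] = {};   // all elements start at 0

    // Sanity check that the requested alignment was honored.
    assert(reinterpret_cast<std::uintptr_t>(v2) % 16 == 0);

    for (int i = 0; i < 128; i++) {
        v0[i] = i - 1;
    }
    for (int i = 0; i < 128; i++) {
        v2[i] = v0[i] + v2[i];
    }
    return v2[0];   // v0[0] + 0 == -1
}
```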

With v2 initialized, MSVC does call memset for it, but it still generates the same single loop that materializes v0[i] on the fly and adds it into v2[i].
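
In source terms, the fusion amounts to something like the following. This is only an illustration of the transformation, assuming v2 is initialized by the caller. The names two_loops, fused and same_result are hypothetical, and MSVC of course does this internally rather than on the C++ source.

```cpp
#include <algorithm>
#include <cassert>

// As written in the question: fill v0, then add it into v2.
void two_loops(int (&v2)[128]) {
    int v0[128];
    for (int i = 0; i < 128; i++) v0[i] = i - 1;
    for (int i = 0; i < 128; i++) v2[i] = v0[i] + v2[i];
}

// After fusion: v0[i] is never stored, just computed where it is needed.
void fused(int (&v2)[128]) {
    for (int i = 0; i < 128; i++) v2[i] = (i - 1) + v2[i];
}

// Both versions should leave identical contents in v2.
bool same_result() {
    int a[128] = {};
    int b[128] = {};
    two_loops(a);
    fused(b);
    assert(b[0] == -1);   // (0 - 1) + 0
    return std::equal(a, a + 128, b);
}
```

Since v0 is only read in the second loop at the same index it was written in the first, the compiler is free to skip the array entirely, which is why MSVC reserves stack space for v2 only.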