Why doesn't MSVC's auto-vectorization use AVX2?

I am trying to use auto-vectorization in my compiler (Microsoft Visual Studio 2013). One of the problems I am facing is that it won't use AVX2. While investigating this problem, I constructed the following example, which adds two arrays of 16 numbers element by element, each number being 16 bits wide.

int16_t input1[16] = {0};
int16_t input2[16] = {0};
... // fill the arrays with some data

// Calculate the sum using a loop
int16_t output1[16] = {0};
for (int x = 0; x < 16; x++){
    output1[x] = input1[x] + input2[x];
}

The compiler vectorizes this code, but only at SSE width (128-bit xmm registers):

vmovdqu  xmm1, xmmword ptr [rbp+rax]
lea      rax, [rax+10h]
vpaddw   xmm1, xmm1, xmmword ptr [rbp+rax+10h]
vmovdqu  xmmword ptr [rbp+rax+30h], xmm1
dec      rcx
jne      main+0b0h

To make sure the compiler is able to generate AVX2 code at all, I wrote the same calculation with intrinsics:

// Calculate the sum using one AVX2 instruction
int16_t output2[16] = {0};
__m256i in1 = _mm256_loadu_si256((__m256i*)input1);
__m256i in2 = _mm256_loadu_si256((__m256i*)input2);
__m256i out2 = _mm256_add_epi16(in1, in2);
_mm256_storeu_si256((__m256i*)output2, out2);

I checked that the two parts of the code are equivalent (that is, output1 is equal to output2 after they are executed).

The compiler does emit AVX2 instructions for the second part of the code:

vmovdqu  ymm1, ymmword ptr [input2]
vpaddw   ymm1, ymm1, ymmword ptr [rbp]
vmovdqu  ymmword ptr [output2], ymm1
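
For reference, the equivalence check mentioned above could look roughly like the sketch below. The function names (add_loop, add_wide, outputs_match) are made up for illustration and are not part of my real code. It assumes the compiler defines __AVX2__ when AVX2 code generation is enabled (/arch:AVX2 on MSVC, -mavx2 on gcc); without that, the "wide" path simply falls back to the loop, so the comparison is only meaningful in an AVX2 build.

```cpp
#include <cstdint>
#include <cstring>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

// Plain loop version, same shape as the loop in the question.
void add_loop(const std::int16_t* a, const std::int16_t* b, std::int16_t* out)
{
    for (int x = 0; x < 16; x++) {
        out[x] = static_cast<std::int16_t>(a[x] + b[x]);
    }
}

// Intrinsics version: one 256-bit add covers all 16 values.
// Falls back to the loop when the build does not target AVX2 (assumption:
// __AVX2__ is only defined when AVX2 code generation is switched on).
void add_wide(const std::int16_t* a, const std::int16_t* b, std::int16_t* out)
{
#if defined(__AVX2__)
    __m256i in1 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(a));
    __m256i in2 = _mm256_loadu_si256(reinterpret_cast<const __m256i*>(b));
    __m256i sum = _mm256_add_epi16(in1, in2);
    _mm256_storeu_si256(reinterpret_cast<__m256i*>(out), sum);
#else
    add_loop(a, b, out);
#endif
}

// Runs both versions on the same inputs and compares the raw bytes.
bool outputs_match(const std::int16_t* a, const std::int16_t* b)
{
    std::int16_t out1[16] = {0};
    std::int16_t out2[16] = {0};
    add_loop(a, b, out1);
    add_wide(a, b, out2);
    return std::memcmp(out1, out2, sizeof(out1)) == 0;
}
```

Calling outputs_match(input1, input2) after filling the arrays should return true if both versions agree.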

I don't want to rewrite my code to use intrinsics, however: having it written as a loop is much more natural, is compatible with old (SSE-only) processors, and has other advantages.

So how can I tweak my example so that the compiler vectorizes it with AVX2?


Visual Studio easily produces AVX2 code when doing floating point arithmetic. I guess this is enough to declare that "VS2013 supports AVX2".

However, no matter what I did, VS2013 didn't produce AVX2 code for integer calculations (neither int16_t nor int32_t worked), so I guess integer auto-vectorization with AVX2 is not supported at all. (gcc 4.8.2 does produce AVX2 code for my example; I'm not sure about earlier versions.)

If I had to do calculations on int32_t, I could consider converting them to float and back. However, since I use int16_t, that doesn't help.
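
A minimal sketch of the int32_t-via-float idea, for illustration only: the helper name add_via_float is hypothetical, and I have not verified that VS2013 actually emits AVX2 for this loop. Note that a float has a 24-bit significand, so the round trip is exact only while the inputs and their sum stay within ±16,777,216 (2^24).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical helper: adds two int32_t arrays by converting to float,
// adding in floating point, and converting back.
// Exact only while every input and every sum has magnitude <= 2^24,
// because larger integers are not all representable as float.
void add_via_float(const std::int32_t* a, const std::int32_t* b,
                   std::int32_t* out, std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i) {
        float fa = static_cast<float>(a[i]);
        float fb = static_cast<float>(b[i]);
        out[i] = static_cast<std::int32_t>(fa + fb);
    }
}
```

The point of writing it this way is that the loop body is floating-point arithmetic, which is the case the vectorizer was seen to handle with AVX2; the conversions are the price paid for that.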