Debug GCC warning during LTO constant propagation

351 Views Asked by At

we are in the process of making our project GCC compatible. With LTO enabled, linking takes remarkably long and these warnings show:

../src/xenia/base/memory.h: In function ‘copy_and_swap.constprop’:
../src/xenia/base/memory.cc:105: warning: iteration 4611686018427387903 
invokes undefined behavior [-Waggressive-loop-optimizations]
  105 |     dest[i] = byte_swap(src[i]);
      |
../src/xenia/base/memory.cc:104: note: within this loop
  104 |   for (; i < count; ++i) {  // handle residual elements
      |
../src/xenia/base/memory.cc:124: warning: iteration 4611686018427387903 
invokes undefined behavior [-Waggressive-loop-optimizations]
  124 |     dest[i] = byte_swap(src[i]);
      |
../src/xenia/base/memory.cc:123: note: within this loop
  123 |   for (; i < count; ++i) {  // handle residual elements
      |

It's the first time we see an issue with these functions (normally using MSVC/Clang). They include vector intrinsics.

How can I debug this issue? How to get a compile-time stack trace of the call GCC is trying to optimize?

Edit:

This is the code in question

inline uint32_t byte_swap(uint32_t value) { return __builtin_bswap32(value); }

void copy_and_swap_32_aligned(void* dest_ptr, const void* src_ptr,
                              size_t count) {
  assert_zero(reinterpret_cast<uintptr_t>(dest_ptr) & 0xF);
  assert_zero(reinterpret_cast<uintptr_t>(src_ptr) & 0xF);

  auto dest = reinterpret_cast<uint32_t*>(dest_ptr);
  auto src = reinterpret_cast<const uint32_t*>(src_ptr);
  __m128i shufmask =
      _mm_set_epi8(0x0C, 0x0D, 0x0E, 0x0F, 0x08, 0x09, 0x0A, 0x0B, 0x04, 0x05,
                   0x06, 0x07, 0x00, 0x01, 0x02, 0x03);

  size_t i;
  for (i = 0; i + 4 <= count; i += 4) {
    __m128i input = _mm_load_si128(reinterpret_cast<const __m128i*>(&src[i]));
    __m128i output = _mm_shuffle_epi8(input, shufmask);
    _mm_store_si128(reinterpret_cast<__m128i*>(&dest[i]), output);
  }
  for (; i < count; ++i) {  // handle residual elements
    dest[i] = byte_swap(src[i]);
  }
}

A platform invariant version of the function without intrinsics (which just loops over the complete array and byte swaps individually) does not throw the gcc warning.

1

There are 1 best solutions below

4
On

How are the arrays in question declared?

The gcc compiler will often make inferences about the number of times a loop might execute based upon the notion that if the Standard doesn't impose any requirements upon how a construct behaves, even if that's because the Committee expected that commonplace implementations would all process it identically, no possible action would be better or worse than any other.

Consider, for example, the code snippet:

unsigned foo[32770];
unsigned mul_mod_65536(unsigned short x, unsigned short y)
{
    return (x*y) & 0xFFFFu;
}
void test(unsigned short n)
{
    unsigned sum = 0;
    for (unsigned short i=32768; i<n; i++)
        sum += mul_mod_65536(i, 65535);
    if (n < 32770)
        foo[n] = sum & 32767;
}

GCC will process the test() function into code equivalent to:

void test(unsigned short n)
{
    foo[n] = 0;
}

Such a transformation is conforming: the authors of the Standard have said they expected that commonplace compilers would extend the semantics of the language to treat signed and unsigned integer math identically in more cases than mandated by the Standard, but they didn't require such behavior. The Standard allows compilers to deviate arbitrarily from the commonplace behavior either because such deviation would make sense on the target platform, or for any other reason the compiler's writer might see fit.

The warnings you cite are usually a consequence of gcc replacing useful code with more "efficient" but useless code. Consider yourself lucky that you're getting a warning about such "optimization". In some cases [like the example I posted above], gcc will silently change code in such a way as to throw ordinary laws of causality out the window. On most platforms, there would be no particular reason why attempting to multiply 32769 by 65535 should have any side effects beyond yielding a possibly-meaningless value, and in the above code any value that might be produced by such a multiplication would end up being ignored if the multiplication didn't have any weird side effects. GCC, however, will notice that any value of n greater than 32769 would cause an integer overflow; even though the overflow occurs in circumstances where the authors of the Standard would have expected commonplace implementations to behave meaningfully, gcc instead infers that it can regard as meaningless any situation where n would exceed 32769.

Addendum:

Perhaps the following code snippet might be helpful to play with in godbolt:

struct foo { int a[2][2]; };
int bump(struct foo *p, unsigned short n)
{
    int total=0;
    int i=0;
    for (i=0; i < n+2; i++)
    {
        p->a[0][i] += 1;
    }
    return i;
}
struct foo foo[4];
int test2(unsigned char n)
{
    return bump(foo, n+1);
}

Experiment with changing the array dimension, and with replacing the second argument in the call to bump with different constants, n, or n+1, and you'll notice that gcc will sometimes:

  1. produce code that would be able to index through the entire array
  2. sometimes produce silently code that will behave as though the loop range is confined to one dimension
  3. sometimes produce a warning and generate code that simply falls off the end of the function into whatever code happens to follow it in memory, and
  4. sometimes silently produce code that falls off the end of the function into whatever follows in memory.

I'm not familiar enough with C++ to know exactly how your example would compare with this kind of thing, but gcc is very aggressive with inner array dimensions. At least in C, it seems to extend the semantics language so as to allow the pointer-arithmetic operation + to operate beyond the bounds of an inner array (something the authors of the Standard almost certainly intended to be possible, but failed to mandate, since an access to arr[x][y] when y isn't less than the inner array's length is explicitly characterized as UB, but it's also explicitly characterized as being equivalent to *((&arr[x][0]) + y), implying that the latter construct would be UB in the same circumstances).