Is there a way to force Clang to use unaligned load/store x86 instructions?


I'm trying to use Clang in a large Visual Studio project. There's a lot of MS-specific code, including C++/CLI and MSTest, that can't be compiled with Clang, so it's a mix of libraries compiled by the Microsoft compiler (version 17.2 / VS 2022) and clang-cl (13.0.2).

The existing code uses AVX to optimize performance-critical bottlenecks, so there are several classes that store aligned data, like:

struct tx
{
  alignas(32) double m_data[12];
};

The problem is that the Microsoft compiler does not always honor alignment requirements. Most of the time it aligns the data properly, but sometimes (usually for temporary variables) it allocates misaligned structs. For example,

struct edge_object
{
  ...
  tx m_pos;
};

int c = sizeof(edge_object);  // 256
int a = alignof(edge_object); // 32
int b = offsetof(edge_object, m_pos); // 160

std::vector<edge_object> edges;
for (int i = 0; i < n - 1; ++i)
{
    edges.push_back(edge_object( (edge_id_t)i, test_cost_0, lower_v[i], lower_v[i + 1], tx ));
    edges.push_back(edge_object( (edge_id_t)(n + i), test_cost_0, upper_v[i], upper_v[i + 1], tx ));
}

In this code snippet, the MS compiler aligns the first temporary edge_object properly (e.g. it will move it by 32 bytes if I allocate a few additional variables on the stack), but it places the second temporary edge_object at an odd location (shifted 78h bytes from the position of the first temporary for some reason, and 78h = 120 is not a multiple of 32). MS gets away with this because it always issues unaligned load/store instructions (even when explicitly told to use aligned load/store), so the generated code still works even if the object is not aligned. Clang, on the other hand, issues aligned load instructions. I started by replacing all intrinsics like _mm256_load_ps with _mm256_loadu_ps in my own vectorized code, but sadly Clang is smart enough to issue its own aligned loads when it sees that alignas(32).
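To pin down which objects actually end up misaligned, a small runtime check can help. The following is only a minimal sketch: the helper names is_aligned, is_properly_aligned and assert_aligned are made up for illustration and are not part of the project's code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Same declaration as the struct shown earlier in the question.
struct tx
{
    alignas(32) double m_data[12];
};

// Returns true when the address p is a multiple of `alignment` bytes.
inline bool is_aligned(const void* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

// Hypothetical helper: true when obj sits at an address that satisfies alignof(T).
template <typename T>
bool is_properly_aligned(const T& obj)
{
    return is_aligned(&obj, alignof(T));
}

// Hypothetical debug-build check: fails the assertion if obj is misaligned.
template <typename T>
void assert_aligned(const T& obj)
{
    assert(is_properly_aligned(obj));
}
```

Calling something like assert_aligned(*this) from the edge_object constructor should flag the misplaced temporary in a debug build before any aligned load touches it.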

So I'm wondering - is there a way to force Clang to issue only unaligned loads/stores like the MSVC and ICC compilers do? As a potential workaround I can force Clang to do so by changing the alignment to 8 instead of 32, but this will hurt performance. The MS approach, on the other hand, is almost as fast when it manages to align the data properly (VMOVUPS and VMOVAPS have almost the same performance on modern CPUs for properly aligned addresses) but does not crash when the alignment is wrong due to a compiler bug. Any suggestions?
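For reference, the alignment-8 workaround mentioned above could look roughly like this. It is a sketch under assumptions, not the project's code: tx_relaxed and sum_tx are hypothetical names, and the AVX path is compiled only when the compiler defines __AVX__ (otherwise a scalar fallback is used).

```cpp
#include <cassert>
#include <cstddef>
#if defined(__AVX__)
#include <immintrin.h>
#endif

// Hypothetical variant of tx with the alignment requirement relaxed to 8 bytes,
// so the compiler can no longer assume 32-byte alignment for m_data.
struct tx_relaxed
{
    alignas(8) double m_data[12];
};

// Hypothetical helper: sums the 12 doubles using unaligned loads only.
inline double sum_tx(const tx_relaxed& t)
{
#if defined(__AVX__)
    __m256d acc = _mm256_setzero_pd();
    for (std::size_t i = 0; i < 12; i += 4)
        acc = _mm256_add_pd(acc, _mm256_loadu_pd(t.m_data + i)); // unaligned load
    double lanes[4];
    _mm256_storeu_pd(lanes, acc); // unaligned store
    return lanes[0] + lanes[1] + lanes[2] + lanes[3];
#else
    // Scalar fallback when AVX is not enabled at compile time.
    double s = 0.0;
    for (std::size_t i = 0; i < 12; ++i)
        s += t.m_data[i];
    return s;
#endif
}
```

With alignof(tx_relaxed) equal to 8, an object placed at an address that is only 8-byte aligned is valid, and the explicit loadu/storeu intrinsics do not require more than that. The cost is that correctly aligned objects lose the 32-byte guarantee as well.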
