Speed Test for Buffer Alignment: IBM's PowerPC results vs. my CPU


I was researching why aligning data on specific byte boundaries (4-byte, 8-byte, etc., depending on the hardware) affects performance. I came across this example by IBM: https://developer.ibm.com/articles/pa-dalign/

The test cases were not included, so I wrote a small C++ program to run the test for the 8-byte access "granularity" case (IBM's term). The function is Listing 4 on the IBM page linked above:

#include <iostream>
#include <chrono>
#include <cstdint> // uint8_t, uint32_t, uint64_t

void Munge64( void *data, uint32_t size ) {
    double *data64 = (double*) data;
    double *data64End = data64 + (size >> 3); /* Divide size by 8. */
    uint8_t *data8 = (uint8_t*) data64End;
    uint8_t *data8End = data8 + (size & 0x00000007); /* Strip upper 29 bits. */
    
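    /* The increment is a separate statement: reading and modifying the
       pointer in one expression is unsequenced before C++17. */
    while( data64 != data64End ) {
        *data64 = -*data64;
        ++data64;
    }
    while( data8 != data8End ) {
        *data8 = -*data8;
        ++data8;
    }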
}

int main() {
    const uint32_t bufferSize = 125000 ; // 125000*(8-bytes) = 1 MB

    uint64_t Buffer[bufferSize];
    auto start_time_aligned = std::chrono::high_resolution_clock::now();
    Munge64(Buffer, bufferSize*8);
    auto end_time_aligned = std::chrono::high_resolution_clock::now();

    // Calculate the duration for aligned case
    auto duration_aligned = std::chrono::duration_cast<std::chrono::microseconds>(end_time_aligned - start_time_aligned);
    std::cout << "Aligned buffer execution time: " << duration_aligned.count() << " microseconds" << std::endl;


   
    auto start_time_unaligned = std::chrono::high_resolution_clock::now();
    Munge64(Buffer+4, bufferSize*8); // +4 to make the buffer access unaligned
    auto end_time_unaligned = std::chrono::high_resolution_clock::now();

    // Calculate the duration for unaligned case
    auto duration_unaligned = std::chrono::duration_cast<std::chrono::microseconds>(end_time_unaligned - start_time_unaligned);
    std::cout << "Unaligned buffer execution time: " << duration_unaligned.count() << " microseconds" << std::endl;

    return 0;
}

As can be seen in the code above, the first while loop handles the 64-bit chunks, and the second while loop handles the remaining 8-bit chunks (in case size is not a multiple of 8 bytes).

IBM's page says the test was run with a 10 MB buffer and that the unaligned case is approximately 4,610% (!) slower. In my program, however, I cannot go beyond 1 MB because it crashes with a segmentation fault (there may be a problem with how the while loops advance the pointers). In my test I used a 1 MB buffer and Buffer+4 as the first access address. My results are nowhere near IBM's: aligned and unaligned access take about the same time, averaging ~155 ms for aligned and ~147 ms for unaligned.

An important side note: the IBM test was run on a PowerBook G4, which is rather old (production stopped in 2006).

My question is: is my test simply wrong, or have processors changed so much that unaligned and aligned accesses now perform similarly? For instance, is using Buffer+4 as the starting address a valid way to make the access unaligned? Also, why do I get a segmentation fault when I try to use a 10 MB buffer?
