Neon on Android limited by memory access?

I have written a routine that processes single-precision float arrays using Neon on Android, specifically on the Samsung S4, and I find that my Neon routines are limited by access to the array data. For interest's sake, the snippet is below:

Neon

m1 = vmulq_f32(*(float32x4_t *)&ey[i][j], *(float32x4_t *)&caey[i][j]);
m2 = vsubq_f32(*(float32x4_t *)&hz[i-1][j], *(float32x4_t *)&hz[i][j]);
m3 = vmulq_f32(*(float32x4_t *)&cbey[i][j], m2);
m4 = vaddq_f32(m1, m3);
vst1q_f32(&ey[i][j], m4);

Serial

ey[i][j] = caey[i][j] * ey[i][j] + cbey[i][j] * ( hz[i-1][j] - hz[i][j] ); 

I built this on the phone itself using C4droid (gcc) and also AIDE-JNI. The Neon intrinsics code above takes slightly longer to run than the serial equivalent. When I replace the array data with dummy const floats, the Neon code runs nearly 4 times as fast as the serial code does with array data. It produces nonsense results, of course, but it does confirm that the performance problem lies with the data access. My equivalent SSE and AVX code on other platforms produces good speedups.

I have tried equivalent 1D arrays and prefetching data with __builtin_prefetch, but I cannot speed up the data access for the Neon intrinsics.
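For reference, the 1D variant can be sketched roughly as follows. This is a minimal sketch rather than the exact code from the project: the helper name update_ey_row, the restrict-qualified row pointers, and the scalar fallback for non-Neon builds are all assumptions made for illustration. It uses explicit vld1q_f32 loads in place of the pointer casts and finishes the row with a scalar tail loop.

```c
#include <stddef.h>
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
#include <arm_neon.h>
#endif

/* Hypothetical helper (not from the original project): updates one row of
 * ey in place. hz_prev points at row i-1 of hz, hz_cur at row i, and n is
 * the row length. The restrict qualifiers assume the rows do not overlap. */
static void update_ey_row(float *restrict ey,
                          const float *restrict caey,
                          const float *restrict cbey,
                          const float *restrict hz_prev,
                          const float *restrict hz_cur,
                          size_t n)
{
    size_t j = 0;
#if defined(__ARM_NEON) || defined(__ARM_NEON__)
    /* Four floats per iteration, using explicit loads and one store. */
    for (; j + 4 <= n; j += 4) {
        float32x4_t e  = vld1q_f32(ey + j);
        float32x4_t ca = vld1q_f32(caey + j);
        float32x4_t cb = vld1q_f32(cbey + j);
        float32x4_t d  = vsubq_f32(vld1q_f32(hz_prev + j),
                                   vld1q_f32(hz_cur + j));
        float32x4_t r  = vaddq_f32(vmulq_f32(ca, e), vmulq_f32(cb, d));
        vst1q_f32(ey + j, r);
    }
#endif
    /* Scalar tail; this is also the whole loop when Neon is unavailable. */
    for (; j < n; ++j) {
        ey[j] = caey[j] * ey[j] + cbey[j] * (hz_prev[j] - hz_cur[j]);
    }
}
```

Each output element here needs five float loads and one store (24 bytes of traffic) for only four arithmetic operations, which is consistent with the loop being bound by memory access rather than by the arithmetic.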

Is there anything else I can try to improve data access performance on the Android phone?
