Assembly / Neon code crashing


I'm using the following code:

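#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>   /* read, write, close */

/* Implemented in __uyvy_luma_extract.S; parameter types are assumed from the call below. */
void __uyvy_luma_extract(unsigned width, unsigned height,
                         void *uyvy, int srcStride,
                         void *y8, int dstStride);

int main(int argc, char **argv) {
    char *auyvy = malloc(640 * 480 * 2);
    char *ay8 = malloc(640 * 480);

    int fd = open("input.uyvy", O_RDONLY);
    if (fd >= 0) {
        read(fd, auyvy, 640 * 480 * 2);
        close(fd);
    }

    __uyvy_luma_extract(640, 480, auyvy, 640 * 2, ay8, 640);

    /* O_CREAT requires a mode argument */
    fd = open("output.y8", O_RDWR | O_CREAT, 0644);
    if (fd >= 0) {
        write(fd, ay8, 640 * 480);
        close(fd);
    }
    return 0;
}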

with the two additional files: https://github.com/emrainey/DVP/blob/master/libraries/public/yuv/__uyvy_luma_extract.S https://github.com/emrainey/DVP/blob/master/libraries/public/yuv/yuv.inc

I compile with "gcc -g convert.c __uyvy_luma_extract.S -mfpu=neon"

Strangely, the program crashes during the conversion. Any idea what I'm doing wrong?

* FIRST EDIT * I have uploaded a zip file with all the files so that the crash is easily reproducible on an ARM platform: http://www.gentil.com/tmp/convert.zip

* SECOND EDIT * I have updated the link to the assembly file, which was not correct.

* THIRD EDIT * gdb gives the following:

Starting program: /home/ai/convert/convert                                      

Program received signal SIGSEGV, Segmentation fault.
0x00008036 in ?? ()
(gdb) bt
#0  0x00008036 in ?? ()
#1  0x000084f2 in __uyvy_luma_extract () at __uyvy_luma_extract.S:38
#2  0x000084f2 in __uyvy_luma_extract () at __uyvy_luma_extract.S:38
Backtrace stopped: previous frame identical to this frame (corrupt stack?)

There are 2 best solutions below

Best answer:

Oh, this is a good one.

It works fine if built with -marm but breaks with -mthumb. Ubuntu and Android probably have different defaults for this.

The reason it breaks in Thumb mode is that the assembly function (which is always non-Thumb) is missing a type specification for its symbol, so the linker doesn't know it needs to use a BLX instruction to call it from Thumb code. When the program is executed, the assembly function is thus erroneously called in Thumb state. The first half-word of this function, 0x47ff, when interpreted as a Thumb instruction, is BLX pc, which is invalid and has unpredictable behaviour. Apparently, the Cortex cores simply execute it in the obvious way: they switch to ARM state, branch to the PC value (current instruction + 4 in Thumb state), and store the next (Thumb) instruction address in LR, thus giving the appearance of having simply ignored the STM instruction.
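The decoding of that half-word can be checked with a few lines of C. This is only a sketch: it assumes the Thumb BLX (register) encoding 0100 0111 1 Rm 000 from the ARM Architecture Reference Manual, and the helper names are made up for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* True if the half-word matches the Thumb "BLX Rm" pattern:
 * bits 15..7 are 0100 0111 1. */
static bool thumb_is_blx_reg(uint16_t hw)
{
    return (hw & 0xff80u) == 0x4780u;
}

/* Register number encoded in bits 6..3 (15 means pc). */
static unsigned thumb_blx_rm(uint16_t hw)
{
    return (hw >> 3) & 0xfu;
}

/* Bits 2..0 are "should be zero"; if any is set the encoding
 * is unpredictable. */
static bool thumb_blx_sbz_clear(uint16_t hw)
{
    return (hw & 0x7u) == 0;
}
```

For 0x47ff, thumb_is_blx_reg returns true, thumb_blx_rm returns 15 (pc), and thumb_blx_sbz_clear returns false, which matches the "BLX pc, unpredictable" reading.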

The fix is to add this line to the assembly file:

.type __uyvy_luma_extract, STT_FUNC
Another answer:

You gave the wrong link. The correct one would be: https://github.com/emrainey/DVP/blob/master/libraries/public/yuv/__uyvy_luma_extract.S

The library also looks buggy, or the author assumes an Empty Descending stack while you are most probably using a Full Descending stack. Change lines 39 and 40 to:

ldr     pY,         [sp, #(11 * 4)]
ldr     dstStride,  [sp, #(12 * 4)]

In any case, the library is rather lackluster when it comes to performance. It is written quite amateurishly and would run at about half of NEON's potential speed.

=======================================================================

EDIT: Looking at the PROLOG macro reveals that the library also pushes lr, so the part above isn't a bug after all.

The function should work correctly, albeit not optimally. Check the following:

  • Memory allocations (unlikely)
  • Exception handling (undefined instructions exception)

What exception does your code crash with?
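To rule out the C side, one option is to swap the assembly routine for a plain C reference and see whether the crash goes away. A minimal sketch, assuming the argument order seen in the question's call (width, height, source, source stride, destination, destination stride) and the usual UYVY byte order U0 Y0 V0 Y1; the name uyvy_luma_extract_ref is hypothetical and is not part of DVP.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar reference: copy the luma (Y) byte of every UYVY pixel.
 * UYVY packs two pixels into four bytes (U0 Y0 V0 Y1), so the
 * luma samples sit at odd byte offsets. Strides are in bytes. */
static void uyvy_luma_extract_ref(uint32_t width, uint32_t height,
                                  const uint8_t *uyvy, int32_t src_stride,
                                  uint8_t *y8, int32_t dst_stride)
{
    assert(uyvy != NULL && y8 != NULL);

    for (uint32_t row = 0; row < height; row++) {
        const uint8_t *src = uyvy + (ptrdiff_t)row * src_stride;
        uint8_t *dst = y8 + (ptrdiff_t)row * dst_stride;

        for (uint32_t col = 0; col < width; col++)
            dst[col] = src[2 * col + 1];
    }
}
```

Calling uyvy_luma_extract_ref(640, 480, auyvy, 640 * 2, ay8, 640) in place of the assembly function should produce the same output.y8 if the rest of the program is sound.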