Grayscale filter in assembly doesn't work on smaller images

354 Views Asked by At

I have a problem with grayscale filter that i wrote in assembly - the results on the bigger images are great, but when i try to test it on smaller images, or for example 5x1 bitmap, instead of the expected result there are colors showing and the average is not being calculated properly.

The bitmap is in RGB format, and populates the values array in the order of BGR

Bitmap to filter (2px black 3px white):
base bitmap

Expected result:
expected result

My result:
my result

My result on bigger image:
my result on bigger image

The code:

.data
    three_values dd 0.333333, 0.333333, 0.333333, 0.333333 ; creating a vector filled with values 3.0
.code
asmFilter proc ; procedure
    mov rsi, rcx ; load our raw bitmap
    mov rdi, [rsp+85] ; load a pointer to the beginning of the processed image into rdi

    mov rcx, rdx ; load the width of the image into register rcx
    imul rcx, r9 ; multiply it by the pixel size to get the stride

    mov r10, rcx ; load the value of stride into register r10
    imul rcx, r8 ; calculate the size of the image by multiplying the stride by the height

    mov rax, 0 ; initialize the counter
    movups xmm2, [three_values] ; load the pointer to the vector of threes into xmm2

    processLoop:

        cmp rax, rcx ; compare the counter with the size of the image
        jae doneCopy ; if they are equal, end the loop

        ; processing the middle value [i]
        mov r11, rsi ; load the pointer to the first pixel of our bitmap into r11

        pmovzxbd xmm0, [r11 + rax] ; load the middle value into xmm0 (shift by the counter)

        ; processing the value to the left [i-1]
        pmovzxbd xmm1, [r11 + rax - 1]
        paddd xmm0, xmm1

        ; processing the value to the right [i+1]
        pmovzxbd xmm1, [r11 + rax + 1]
        paddd xmm0, xmm1 

        ; converting int to float
        cvtdq2ps xmm3, xmm0

        ; (R + G + B) * 0.3
        mulps xmm3, xmm2 

        ; converting float to int
        cvtps2dq xmm0, xmm3

        ; writing the processed pixel to the result
        movups [rdi + rax], xmm0

        add rax, 1 ; increment the counter by 1

        jmp processLoop ; repeat the loop
    doneCopy:
        ret
asmFilter endp
end

asmFilter signature:

private static extern void asmFilter(byte[] values, int width, int height, int pixelSize, byte[] converted_values);
public void executeAsmFilter(byte[] values, int width, int height, int pixelSize, byte[] converted_values)
{
    asmFilter(values, width, height, pixelSize, converted_values);
}

Currently i mostly tried to figure out how is it possible that the colors are showing, and from what i found out is that the yellow and blue colors are due to the values being partly taken from black pixel, and because of that the average changes a little bit and instead of just white it displays yellow, then white normally (no black pixels around) and then light blue.

What i cannot grasp even with extensive debugging is why the bigger images work perfectly, and smaller not.

1

There are 1 best solutions below

2
fuz On

There is an algorithmic problem in your code which is exacerbated by the gradient example but less noticeable in images with very subtle gradients.

As you describe, your pixel format is BGR. So at the start of our image, the input looks like this:

R11 + 0:  B0
R11 + 1:  G0
R11 + 2:  R0
R11 + 3:  B1
R11 + 4:  G1
R11 + 5:  R1
R11 + 6:  B2
R11 + 7:  G2
R11 + 8:  R2

To convert to grayscale correctly, you want to end up with

RDI + 0:  B0/3 + G0/3 + R0/3
RDI + 1:  B0/3 + G0/3 + R0/3
RDI + 2:  B0/3 + G0/3 + R0/3
RDI + 3:  B1/3 + G1/3 + R1/3
RDI + 4:  B1/3 + G1/3 + R1/3
RDI + 5:  B1/3 + G1/3 + R1/3
RDI + 6:  B2/3 + G2/3 + R2/3
RDI + 7:  B2/3 + G2/3 + R2/3
RDI + 8:  B2/3 + G2/3 + R2/3

Here's what your algorithm does right at the beginning:

pmovzxbd xmm0, [r11 + rax]

This loads xmm0 with [B0, G0, R0, B1].

pmovzxbd xmm1, [r11 + rax - 1]

This loads xmm1 with [??, B0, G0, R0] where ?? is a byte before the beginning of the array and of undefined value. I'll go ahead and assume that byte has value 0 for now.

paddd xmm0, xmm1

This sets xmm0 to [B0, G0 + B0, R0 + G0, B1 + R0].

pmovzxbd xmm1, [r11 + rax + 1]

This sets xmm1 to [G0, R0, B1, G1].

paddd xmm0, xmm1

This sets xmm0 to [B0 + G0, R0 + G0 + B0, R0 + G0 + B1, B1 + R0 + G1].

This vector is then converted to float and divided by 3 and converted back to integer. Lastly you do this:

movups [rdi + rax], xmm0

writing the vector of dwords to the output. But... your output is supposed to be a bitmap of bytes, not of dwords! Fortunately, your code is also broken in that it continues with an offset of 1 byte the next iteration, overwriting everything you just wrote except the first byte, which happens to be the least significant byte of the first pixel value. After a bunch of iterations, you end up with output like this:

RDI + 0:         B0/3 + G0/3
RDI + 1:  B0/3 + G0/3 + R0/3
RDI + 2:  G0/3 + R0/3 + B1/3
RDI + 3:  R0/3 + B1/3 + G1/3
RDI + 4:  B1/3 + G1/3 + R1/3
RDI + 5:  G1/3 + R1/3 + B2/3
RDI + 6:  R1/3 + B2/3 + G2/3
RDI + 7:  B2/3 + G2/3 + R2/3
RDI + 8:  G2/3 + R2/3 + B3/3

See how instead of the channels of each pixel being the average of that pixel's channels, it's kind of smeared across pixels? This is what causes the wrong output. To fix the code, rewrite it so it actually sums B/G/R of each pixel individually instead of doing the sort of running sum you have.

This is slightly tricky to achieve. You can for example first deinterleave the pixels so you have one vector of all B, one vector of all R, and one of all G. If you have that, you can upconvert to 16 bits, sum, then divide by three (I recommend using vpmulhw to multiply with 65536/3 = 21845 as that's faster than the floating point stuff). Finally interleave and store.

Make sure to not overread the input or write past the end of the output!