I have a problem with grayscale filter that i wrote in assembly - the results on the bigger images are great, but when i try to test it on smaller images, or for example 5x1 bitmap, instead of the expected result there are colors showing and the average is not being calculated properly.
The bitmap is in RGB format, and populates the values array in the order of BGR
Bitmap to filter (2px black 3px white):

Expected result:

My result:

My result on bigger image:

The code:
.data
three_values dd 0.333333, 0.333333, 0.333333, 0.333333 ; creating a vector filled with values 3.0
.code
asmFilter proc ; procedure
mov rsi, rcx ; load our raw bitmap
mov rdi, [rsp+85] ; load a pointer to the beginning of the processed image into rdi
mov rcx, rdx ; load the width of the image into register rcx
imul rcx, r9 ; multiply it by the pixel size to get the stride
mov r10, rcx ; load the value of stride into register r10
imul rcx, r8 ; calculate the size of the image by multiplying the stride by the height
mov rax, 0 ; initialize the counter
movups xmm2, [three_values] ; load the pointer to the vector of threes into xmm2
processLoop:
cmp rax, rcx ; compare the counter with the size of the image
jae doneCopy ; if they are equal, end the loop
; processing the middle value [i]
mov r11, rsi ; load the pointer to the first pixel of our bitmap into r11
pmovzxbd xmm0, [r11 + rax] ; load the middle value into xmm0 (shift by the counter)
; processing the value to the left [i-1]
pmovzxbd xmm1, [r11 + rax - 1]
paddd xmm0, xmm1
; processing the value to the right [i+1]
pmovzxbd xmm1, [r11 + rax + 1]
paddd xmm0, xmm1
; converting int to float
cvtdq2ps xmm3, xmm0
; (R + G + B) * 0.3
mulps xmm3, xmm2
; converting float to int
cvtps2dq xmm0, xmm3
; writing the processed pixel to the result
movups [rdi + rax], xmm0
add rax, 1 ; increment the counter by 1
jmp processLoop ; repeat the loop
doneCopy:
ret
asmFilter endp
end
asmFilter signature:
private static extern void asmFilter(byte[] values, int width, int height, int pixelSize, byte[] converted_values);
public void executeAsmFilter(byte[] values, int width, int height, int pixelSize, byte[] converted_values)
{
asmFilter(values, width, height, pixelSize, converted_values);
}
Currently i mostly tried to figure out how is it possible that the colors are showing, and from what i found out is that the yellow and blue colors are due to the values being partly taken from black pixel, and because of that the average changes a little bit and instead of just white it displays yellow, then white normally (no black pixels around) and then light blue.
What i cannot grasp even with extensive debugging is why the bigger images work perfectly, and smaller not.
There is an algorithmic problem in your code which is exacerbated by the gradient example but less noticeable in images with very subtle gradients.
As you describe, your pixel format is BGR. So at the start of our image, the input looks like this:
To convert to grayscale correctly, you want to end up with
Here's what your algorithm does right at the beginning:
This loads
xmm0with [B0, G0, R0, B1].This loads
xmm1with [??, B0, G0, R0] where ?? is a byte before the beginning of the array and of undefined value. I'll go ahead and assume that byte has value 0 for now.This sets
xmm0to [B0, G0 + B0, R0 + G0, B1 + R0].This sets
xmm1to [G0, R0, B1, G1].This sets
xmm0to [B0 + G0, R0 + G0 + B0, R0 + G0 + B1, B1 + R0 + G1].This vector is then converted to float and divided by 3 and converted back to integer. Lastly you do this:
writing the vector of dwords to the output. But... your output is supposed to be a bitmap of bytes, not of dwords! Fortunately, your code is also broken in that it continues with an offset of 1 byte the next iteration, overwriting everything you just wrote except the first byte, which happens to be the least significant byte of the first pixel value. After a bunch of iterations, you end up with output like this:
See how instead of the channels of each pixel being the average of that pixel's channels, it's kind of smeared across pixels? This is what causes the wrong output. To fix the code, rewrite it so it actually sums B/G/R of each pixel individually instead of doing the sort of running sum you have.
This is slightly tricky to achieve. You can for example first deinterleave the pixels so you have one vector of all B, one vector of all R, and one of all G. If you have that, you can upconvert to 16 bits, sum, then divide by three (I recommend using
vpmulhwto multiply with 65536/3 = 21845 as that's faster than the floating point stuff). Finally interleave and store.Make sure to not overread the input or write past the end of the output!