I have frames extracted from a 10-bit movie using ffmpeg, written out as raw P010LE frames. I'm loading each frame into a std::vector<uint8_t>, and I've verified the frame data by viewing it with rawpixels.net. I'm trying to write a CUDA kernel that converts these P010LE frames to ARGB. Apparently this is beyond my mental horsepower.
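For context, this is roughly how each frame gets into memory (LoadP010Frame is just a test helper of mine; the path and dimensions come from elsewhere). A P010 frame is width * height * 3 bytes: a 16-bit luma plane followed by an interleaved, half-resolution 16-bit U/V plane.

#include <cstdint>
#include <fstream>
#include <stdexcept>
#include <string>
#include <vector>

// Read one raw P010LE frame: 2 bytes per luma sample plus an interleaved
// U/V plane at half resolution, i.e. width * height * 3 bytes in total.
std::vector<uint8_t> LoadP010Frame(const std::string & path, int width, int height)
{
    const size_t expectedBytes = static_cast<size_t>(width) * height * 3;
    std::vector<uint8_t> frame(expectedBytes);

    std::ifstream file(path, std::ios::binary);
    file.read(reinterpret_cast<char *>(frame.data()), frame.size());
    if (static_cast<size_t>(file.gcount()) != expectedBytes)
        throw std::runtime_error("Short read on " + path);

    return frame;
}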
Anyway, to simplify things I thought I'd start by just showing the luma, and ask a separate question about the chroma once I've failed at that too. My kernel looks like this:
__global__ void P010leToArgbKernel(const uint8_t * P010le, uint8_t * argb, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
    {
        int yIndexP010 = (y * width + x) * 2; // 2 bytes per luma sample
        uint16_t Y = *reinterpret_cast<const uint16_t*>(P010le + yIndexP010) & 0x3FF; // take the low 10 bits
        uint8_t Y8 = static_cast<uint8_t>((Y * 255 + 511) / 1023); // Proper rounding
        int yIndex = (y * width + x) * 4;
        argb[yIndex + 0] = Y8;
        argb[yIndex + 1] = Y8;
        argb[yIndex + 2] = Y8;
        argb[yIndex + 3] = 255; // Alpha
    }
}
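In case the launch matters, I'm invoking it roughly like this (ConvertFrame is just a throwaway test wrapper; the 16x16 block size is arbitrary and error checking is omitted):

#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

// Copy one P010LE frame to the device, run the kernel, copy the 32-bit output back.
void ConvertFrame(const std::vector<uint8_t> & frame, std::vector<uint8_t> & argb, int width, int height)
{
    argb.resize(static_cast<size_t>(width) * height * 4);

    uint8_t * d_src = nullptr;
    uint8_t * d_dst = nullptr;
    cudaMalloc(reinterpret_cast<void **>(&d_src), frame.size());
    cudaMalloc(reinterpret_cast<void **>(&d_dst), argb.size());
    cudaMemcpy(d_src, frame.data(), frame.size(), cudaMemcpyHostToDevice);

    dim3 block(16, 16); // arbitrary block size
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    P010leToArgbKernel<<<grid, block>>>(d_src, d_dst, width, height);

    cudaMemcpy(argb.data(), d_dst, argb.size(), cudaMemcpyDeviceToHost);
    cudaFree(d_src);
    cudaFree(d_dst);
}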
The result is as follows, which I do not believe to be correct (original image left, converted image right). I'm expecting a smooth, greyscale image. What mistake did I make?

Following a hint from Christoph, I realised that P010LE actually stores its 10 bits in the high 10 bits of each 16-bit word, not the low 10. Shifting right by 6 brings the data down into the low 10 bits. Apparently it's good practice to mask off the bits you want in any case, so I've done that as well.
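So the luma read becomes the following (the mask is arguably redundant after the shift, since the low 6 bits should be zero padding, but it makes the intent explicit):

// P010LE: the 10 significant bits live in bits 15..6 of each little-endian 16-bit word.
uint16_t raw = *reinterpret_cast<const uint16_t*>(P010le + yIndexP010);
uint16_t Y = (raw >> 6) & 0x3FF; // shift down, then mask to 10 bits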
The final kernel (probably not the fastest possible, but it works) is below. When I used it with GDI+ to get a very quick-to-code visualisation, I also realised that PixelFormat32bppARGB is actually laid out as BGRA in the underlying bitmap memory, so I've renamed the kernel from P010le_To_Argb_Kernel to P010le_To_Bgra_Kernel.
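In outline it shifts each 16-bit sample right by 6 and masks to 10 bits, pulls U/V from the interleaved half-resolution plane that follows the luma plane, converts to RGB, and writes the bytes in B, G, R, A order. The conversion constants here assume BT.709 limited-range video; if your source is BT.601 or full range, the offsets and coefficients will need adjusting.

// Clamp a float to 0..255 and round to the nearest byte.
__device__ inline uint8_t ClampToByte(float v)
{
    return static_cast<uint8_t>(fminf(fmaxf(v, 0.0f), 255.0f) + 0.5f);
}

__global__ void P010le_To_Bgra_Kernel(const uint8_t * P010le, uint8_t * bgra, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height)
        return;

    // Luma plane: one little-endian 16-bit word per pixel, significant bits in 15..6.
    const uint16_t * lumaPlane = reinterpret_cast<const uint16_t *>(P010le);
    int Y = (lumaPlane[y * width + x] >> 6) & 0x3FF;

    // Interleaved U/V plane at half resolution, immediately after the luma plane.
    const uint16_t * chromaPlane = lumaPlane + width * height;
    int chromaIndex = (y / 2) * width + (x / 2) * 2;
    int U = (chromaPlane[chromaIndex + 0] >> 6) & 0x3FF;
    int V = (chromaPlane[chromaIndex + 1] >> 6) & 0x3FF;

    // BT.709 limited range (assumed): Y spans 64..940, U/V span 64..960 centred on 512.
    float Yf = (Y - 64) * (255.0f / 876.0f);
    float Uf = (U - 512) * (255.0f / 896.0f);
    float Vf = (V - 512) * (255.0f / 896.0f);

    float R = Yf + 1.5748f * Vf;
    float G = Yf - 0.1873f * Uf - 0.4681f * Vf;
    float B = Yf + 1.8556f * Uf;

    int outIndex = (y * width + x) * 4;
    bgra[outIndex + 0] = ClampToByte(B); // PixelFormat32bppARGB is B, G, R, A in memory
    bgra[outIndex + 1] = ClampToByte(G);
    bgra[outIndex + 2] = ClampToByte(R);
    bgra[outIndex + 3] = 255;            // Alpha
}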