I'm trying to use a uint64_t as if it were 8 lanes of uint8_t; my goal is to implement a lane-by-lane less-than. Given x and y, this operation should produce 0xFF in each lane where the value in x is less than the value in the corresponding lane of y, and 0x00 otherwise. A lane-by-lane less-than-or-equal would also work.
Based on what I've seen, I'm guessing I would need a lanewise difference-or-zero operation (defined as doz(x, y) = if (x < y) then 0 else (x - y)), and then to use that to construct a selection mask. However, all the lane-wise subtraction approaches I've seen are signed, and I'm not sure how I would use them to do this kind of task.
Is there a way I could do this, using difference-or-zero or some other way?
Here's an architecture-independent approach. I'm sure it could use refinement, but it seems to be working fine. With x86 gcc/clang, it compiles to 20/19 instructions.
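The idea is to first solve the problem for lanes where both bytes are on the same side of 128 (both less than 128, or both at least 128), recording the result in bit 7 of each byte. Then patch up the remaining lanes, where exactly one byte has its top bit set. Finally, smear each bit 7 downward so that it fills its whole lane.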
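A minimal sketch of those three steps, written here for illustration and not necessarily the code the instruction counts above refer to. The function name lt_u8x8 and the constant name H are assumptions.

```c
#include <stdint.h>

/* Hypothetical name: returns 0xFF in every lane where x < y (unsigned
   8-bit compare), and 0x00 in every other lane. */
uint64_t lt_u8x8(uint64_t x, uint64_t y)
{
    const uint64_t H = 0x8080808080808080ULL; /* bit 7 of every lane */

    /* Step 1: compare the low 7 bits of each lane. Forcing bit 7 on in x
       and off in y keeps every lane's difference in 1..255, so no borrow
       crosses a lane boundary. Bit 7 of the difference is clear exactly
       when x's low 7 bits are less than y's, so invert and keep bit 7. */
    uint64_t low_lt = ~((x | H) - (y & ~H)) & H;

    /* Step 2: patch with the top bits. If the top bits match, the low-bit
       comparison is the answer. If they differ, x < y exactly when x's
       top bit is 0 and y's is 1. */
    uint64_t same_top = ~(x ^ y);
    uint64_t lt7 = ((~x & y) | (same_top & low_lt)) & H;

    /* Step 3: smear bit 7 down through its lane. Each lane of (lt7 >> 7)
       is 0x00 or 0x01, so multiplying by 0xFF cannot carry between lanes. */
    return (lt7 >> 7) * 0xFF;
}
```

For example, lt_u8x8(0, 1) gives 0x00000000000000FF: only the lowest lane has x < y. A less-than-or-equal mask follows from the same function as ~lt_u8x8(y, x).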
One test case:
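A sketch of one way to check a result, with the function name and input values chosen here for illustration (they are not the original test case): a byte-by-byte reference that any SWAR version can be compared against.

```c
#include <stdint.h>

/* Hypothetical reference: compares the lanes one byte at a time. */
uint64_t lt_u8x8_ref(uint64_t x, uint64_t y)
{
    uint64_t mask = 0;
    for (int lane = 0; lane < 8; lane++) {
        uint8_t xb = (uint8_t)(x >> (8 * lane)); /* lane of x */
        uint8_t yb = (uint8_t)(y >> (8 * lane)); /* lane of y */
        if (xb < yb)
            mask |= (uint64_t)0xFF << (8 * lane);
    }
    return mask;
}
```

With x = 0x7F80FF00017F80FF and y = 0x807FFF0100807F00, the reference returns 0xFF0000FF00FF0000. These inputs cover equal bytes, bytes on the same side of 128, and bytes that straddle 128 in both directions.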