I am very new in assembly and I want to find all pythagorean triples in a range from 1 to 100. I'm generating all the numbers in C and all the other calculations should be done in assembly SSE. I was trying to do this by using sqrt command (I've tried all of them) but I couldn't make it work.. Can someone tell me how it should be done?
That's what I've got so far:
int main(){
for (int i = 1; i <= 100; i++)
{
a++;
if (a > 100)
a = 0;
for (int j = 1; j <= 100; j++)
{
b++;
if (b > 100)
b = a;
_asm //tricky part begins here:
{
movups xmm0, a
movups xmm1, b
pmuludq xmm0, xmm0
pmuludq xmm1, xmm1
//movups xmm2, 0
//paddd xmm2, xmm0
//paddd xmm2, xmm1
movups z, xmm0
}
printf("%d\n", z);
}
}
}
The basic problem with your approach is that you need to be looking at 4
b
values in parallel, so you can't just load from a C scalar variable. You need to keep stuff in vector registers across loop iterations, since you aren't just loading vectors from memory or something. You should write the whole loop in asm because MSVC inline asm sucks for wrapping short sequences, due to unavoidable overhead of getting results in / out.Of course, the best way to vectorize this loop would be with C intrinsics, not with inline asm. Then you can hand-hold the compiler into making better asm if necessary (and if possible) by checking its asm output for inefficiencies. (See Why is this C++ code faster than my hand-written assembly for testing the Collatz conjecture?)
Of course, if you really just wanted to create efficient code for generating Pythagorean triples, your algorithm is bogus, too:
The Wikipedia article has a generating a triple section which describes Euclid's formula. Iterating that would be a different problem than checking for hits in a brute-force search of the whole
a=[1..100] b=[1..100]
search space, since checking if a number is a perfect square is fairly slow.Also, detecting which vector elements match a condition is clumsy. A packed-compare instruction and then PMOVMSKB (or MOVMSKPS) will give you a bitmap, but this works best when hits are rare, e.g. implementing
memchr
where your loop stops after the first hit.