I'm trying to count how many number 1, are in the numbers of an array.
First I have a code in C lenguaje(work ok):
int popcount2(int* array, int len){
int i;
unsigned x;
int result=0;
for (i=0; i<len; i++){
x = array[i];
do{
result+= x & 0x1;
x>>= 1;
} while(x);
}
return result;
}
Now I need to translate the do-while loop into Assembly using 3-6 lines of code. I have write some code but the result is not correct.( I'm new in the assembly world )
int popcount3(int* array, int len){
int i;
unsigned x;
int result=0;
for (i=0; i<len; i++){
x = array[i];
asm(
"ini3: \n"
"adc $0,%[r] \n"
"shr %[x] \n"
"jnz ini3 \n"
: [r]"+r" (result)
: [x] "r" (x) );
}
}
I'm using GCC ( on linux) with an Intel processor.
The nicest think you can do is using built in
popcount
function as suggested by Paul R, but since you need to write it in assembly, this worked for me:At first two lines you just check the contents of
x
(and go to end if it's zeroJump Zero
). Than you shiftx
one bit to right and:If there's no
CF
set, just go to start (Jump Not Carry
) else increment result and then go to start.And the beautiful think about assembly is that you can do things in so many ways...
Here you again use
SHift Right
instruction, if noCF
is present just go to loop condition.Otherwise again increment result and call binary
AND
(INC
does modifyZF
).Using
LOOP
andECX
I was curious how to do this in 3 instruction (I figured your teacher wouldn't give you bottom limit of 3 if it wasn't possible) and I realized x86 also has
LOOP
instruction:And you can add input argument using GCCs input constrain:
Just to make sure it compiles to proper assembly:
I think the first one should have a bit better performance, especially if there's just 1 bit set, this approach always does
k*8
iterations.SSE4 and single instruction
I know you have to use a loop, but just for fun... With SSE4 extension you could do this by just one instruction
POPCNT
:I think (I have a quite old CPU on my notebook, so I can't test this for you) you should be able to do this with just one simple instruction:
(If you try this and you do have SSE4 extension, please let me know if it works)
Performance
I've measured times required to do 100,000,000 popcounts comparing my first and second methods with David Wohlferd's. [Raw data]
If anyone can compare these 3 with SSE4's
POPCNT
instruction I'd be glad.