Divide by variable power of 2 using assembly

173 Views Asked by At

I've got this task:

dividePower2

Compute x/2n, for 0 ≤ n ≤ 30. Round toward zero.

  • Argument 1: x
  • Argument 2: n

Examples:

  • dividePower2(15,1) = 7
  • dividePower2(-33,4) = -2

This is what I've got so far but I don't know if I'm heading in the right direction (AT&T syntax required):

.global dividePower2
   dividePower2:
   sar $2, %esi
   ret
2

There are 2 best solutions below

8
Root On

You need to use the IEEE standard of floating numbers in Computing.It will take up more memory but I dont see any other way.

Remember that any floating number can be written as :

s|exp bits|fraction bits

Let n1=x = s_x|exp bits_x|fraction bits_x

and n2=2^n 0|b(n+127)|0

n1/n2 will be simply equal to s_x|exp bits_x-b(n+127)|fraction bits_x

You will need 2 32-bit registers to store the divident and the divisor and 1 32-bit register for the result.

0
Peter Cordes On

Yes, an arithmetic right shift is a useful part of this if you're going to do it efficiently.
But no, n >>= 2; is not a useful thing to do. Any shifting you do needs to be variable-count (or by 31 if you're broadcasting the sign bit to make a 0 / -1 mask).

Look at how compilers handle a constant n like return x / (1<<11); for example. (Godbolt).

# x86-64 GCC -O3
bar(int):
        testl   %edi, %edi        # set FLAGS according to x
        leal    2047(%rdi), %eax
        cmovns  %edi, %eax        # x < 0  ?  x+(1<<11)-1  :  x
        sarl    $11, %eax         # arithmetic right shift that value
        ret                       # return value in EAX

A simple x >> n (sar %cl, %edi) rounds towards -Infinity, rather than truncating toward 0. Difference in x/2 and x>>1 or x*2 and x << 1 where x is an integer goes into the details of how compilers implement C division semantics for x/2 or x/8 or whatever.

Adding 2047 can wrap small negative numbers into small positive numbers, but only numbers less than 1<<11. So the later sar will shift all the set bits out and leave 0, like we want for division with truncation toward zero for the original small negative number with magnitude smaller than 2048.

For larger-magnitude negative numbers, adding a positive number decreases the magnitude, decreasing the quotient unless it was an exact multiple to start with. (Since we're adding 1 less than the divisor). This lets us use sar (arithmetic right shift) which rounds towards -Infinity, accounting for the off-by-1 difference between it and what we want for truncating division.


For a variable count, return x / (1<<n), compilers naively just materialize the 1<<n and use idiv. But we can do better by following the same pattern as for signed division by constant powers of 2.

(The range limit of n in [0,30] means that 1<<31 overflowing to INT_MIN is a non-issue. I think this algorithm would work correctly for n=31, though. The result is either 0 or -1, which are the two results possible for 32-bit sar by 31. Adding 0x7fffffff to any negative number other than INT_MIN = 0x80000000 will produce a non-negative number, so sar produces 0. For x = INT_MIN, we're doing -1 >> 31 giving -1.)

We can do x + (1<<n)-1 and use the same test/cmovns gets us x or x + (1<<n)-1. The -1 part can be done as part of an LEA.

The most efficient way to get 1<<n on Intel CPUs is with bts into a zeroed register. (Variable-count shl is 3 uops, unless we use BMI2 sarx. And materializing a 0 with xor is cheaper than mov $1, %eax.) On AMD Zen where bts is 2 uops but shl %cl, %eax is only 1 (https://uops.info), it might be cheaper to move a 1 into EAX and shl that, since we need the shift count in CL anyway for the later sar. (Unless we have BMI2 for sarx which allows the shift count to come from any register.)

Another option is to get (1<<n) - 1 by applying BMI2 bzhi to 0xffffffff; I think that works directly for n in the range [0,32] inclusive (although the sar still only works for n=[0,31]). It's single-uop on Intel and AMD (https://uops.info/). See below; that's what clang does for a C version of this.

# untested
.global dividePower2         # int x in EDI,  int n in ESI
                             # returns int in EAX
dividePower2:
   xor   %eax, %eax
   bts   %esi, %eax             # 2^n = 1<<n
   lea   -1(%rax, %rdi), %eax   # x + 2^n - 1

   test  %edi, %edi
   cmovns %edi, %eax     # x for non-negative,  x + 2^n - 1 for negative

   mov   %esi, %ecx      # or  sarx %esi, %eax, %eax
   sar   %cl, %eax       # >>= n  on the adjusted value is equivalent to /= (1>>n)
   ret

LEA with 3 components ([-1 + rax + rdi]) will have 3 cycle latency on Skylake and earlier, but Ice Lake runs it with 1 cycle latency. "Slow" LEAs on Ice Lake and later can't run on as many ports, but don't have a latency penalty. (On Alder Lake P-cores, scaled index costs 2 cycle latency like on Zen family, even when it's just a 2 component LEA, but we aren't scaling the index here.) On Zen-family, this 3-component LEA has 2 cycle latency. Same on Alder Lake E-cores. So there'd be nothing to save by doing it with separate add and dec instructions.

Yet another way to get (1<<n) - 1 is what older clang does: 0xffffffff >> (-n&31). It actually generates 0xffffffff or 0 from sar $31, reg to broadcast the sign bit. But this strategy is less convenient for variable n because it would require a negation. Especially inconvenient on x86 without BMI2 since shift counts need to be in CL.


A SIMD version of this with SSE2 pslld and psrad should be possible, or AVX2 vpsllvd with per-element variable counts.

Blending with SSE4.1 pblendvps is possible as a replacement for scalar cmov, but it seems like something more efficient should be possible. Perhaps more bithacky using psrad to get 0 or -1 and do something with that? Or just use that as a mask with pandn to zero out a 1<<n - 1 vector in elements where x was non-negative. The psrad result is directly useful as a -1 in (1<<n) - 1.


In C or C++

// well-defined behaviour for n=0..30
int divp2_bitshift(int x, int n){
    int bias = x < 0 ? (1U<<n)-1 : 0U;
    return (x+bias) >> n;
// To handle n=31, use  unsigned bias  and cast x+bias  back to signed int to avoid signed-overflow UB
// That might depend on int being 2's complement, so C++20 or C23
}

int divp2_reference(int x, int n){
    return x / (1LL<<n);  // 64-bit shift and division avoids overflow to INT_MIN
}

Clang -march=skylake compiles to optimized version in a clever way that uses bzhi:

# clang -O3 -march=skylake (BM1 and BMI2)
bitshift(int, int):
        movl    %edi, %eax
        sarl    $31, %eax            # 00000000 or FFFFFFFF
        bzhil   %esi, %eax, %eax     # 0 or (1<<n)-1 

        addl    %edi, %eax
        sarxl   %esi, %eax, %eax
        retq

Godbolt, including a main to exhaustively test the reference version against the optimized version, for every int32_t bit-pattern and every n from 0 to 31. It passes for all cases.

With the reference implementation being return x / (1<<n), it passes for all n=0..30, but fails for -2147483648 / (1<<31) = 1 (because 1<<31 in the reference implementation overflows to INT_MIN, but the bit-shift version gives -1 because it's dividing by a positive power of 2.) I compiled with gcc -O3 -fwrapv -march=native to make signed integer overflow well-defined for ease of testing the asm.

So this optimization wouldn't be usable for x/(1<<n) with gcc -fwrapv, because the n=31 behaviour is well-defined as having a negative divisor.

But without -fwrapv where signed overflow is undefined behaviour in C, this optimization would be valid. TODO: report this missed-optimization to GCC and clang.

Also valid for x/(1LL<<n) for n=0..31, but not for n=32 or higher because that's still well-defined in C. (I'm assuming x is a 32-bit int, so long long is wider; n=32..63 are valid and make the quotient 0.)