I have the following task: count how many numbers between 1 and N have exactly K zero non-leading bits (e.g. 7 = 111 in binary has 0 of them, 4 = 100 has 2).
N and K satisfy the condition 0 ≤ K, N ≤ 1000000000.
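For reference, the check being implemented can be written in C with the same GCC builtins my code comments below mention (a sketch, not the submitted code):

```c
#include <stdint.h>

// zero non-leading bits of x (x != 0): the bit length of x (32 - clz)
// minus the number of set bits.
static inline unsigned zero_nonleading_bits(uint32_t x) {
    return 32 - __builtin_clz(x) - __builtin_popcount(x);
}
// e.g. zero_nonleading_bits(7) == 0 (111 in binary),
//      zero_nonleading_bits(4) == 2 (100 in binary)
```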
This version uses POPCNT and is fast enough on my machine:
```
%include "io.inc"

section .bss
n   resd 1
k   resd 1
ans resd 1

section .text
global CMAIN
CMAIN:
    GET_DEC 4,n
    GET_DEC 4,k
    mov ecx,1           ; ecx is the counter from 1 to n
    mov edx,0
loop_:
    mov eax, ecx
    popcnt eax,eax      ; eax = number of set bits
    mov edx, 32
    sub edx, eax        ; edx = 32 - bits set = bits not set
    mov eax, ecx        ; count leading zero bits
    bsr eax, eax
    xor eax, 0x1f       ; eax = 31 - bsr(x) = clz(x)
    sub edx, eax
    mov eax, edx
    ; all these lines together do something like (gcc):
    ; eax = 32 - __builtin_clz(x) - _mm_popcnt_u32(x);
    cmp eax,[k]         ; does ecx have exactly k non-leading zero bits?
    jnz notk
    ; if so, increment ans
    mov edx,[ans]
    add edx,1
    mov [ans],edx
notk:
    ; increment the counter, compare to n, and loop
    inc ecx
    cmp ecx,dword[n]
    jna loop_
    ; print ans
    PRINT_DEC 4,ans
    xor eax, eax
    ret
```
It should be okay in terms of speed (~0.8 sec), but it wasn't accepted because (I guess) the CPU used on the testing server is too old, so it reports that a runtime error happened.
I tried a precounting trick with a 64K * 4-byte lookup table, but it wasn't fast enough:
```
%include "io.inc"

section .bss
n        resd 1
k        resd 1
ans      resd 1
wordbits resd 65536 ; bits set in numbers from 0 to 65535

section .text
global CMAIN
CMAIN:
    mov ebp, esp    ; for correct debugging
    mov ecx,0
    ; fill in wordbits; ecx is the wordbits array index
precount_:
    mov eax,ecx
    xor ebx,ebx
    ; c is ebx, v is eax:
    ; for (c = 0; v; c++)
    ;     v &= v - 1;  // clear the least significant bit set
lloop_:
    mov edx,eax
    dec edx
    and eax,edx
    inc ebx
    test eax,eax
    jnz lloop_
    ; store the computed popcount
    mov dword[wordbits+4*ecx],ebx
    inc ecx
    cmp ecx,65536
    jb precount_    ; jb, not jna: jna would write one dword past the table
    ; the loop above computes 1 for input 0, so fix up element 0
    mov dword[wordbits],0
    GET_DEC 4,edi   ; n
    GET_DEC 4,esi   ; k
    mov ecx,1
    xor edx,edx
    xor ebp,ebp
loop_:
    mov eax, ecx
    mov edx,ecx
    and eax,0xFFFF
    shr edx,16
    mov eax,dword[wordbits+4*eax]
    add eax,dword[wordbits+4*edx]
    ; the previous lines emulate the missing popcnt instruction:
    ; eax = wordbits[x & 0xFFFF] + wordbits[x >> 16]
    mov edx, 32
    sub edx, eax
    ; and the same as before:
    ; non-leading zero bits = 32 - bits set - __builtin_clz(x)
    mov eax, ecx
    bsr eax, eax
    xor eax, 0x1f
    sub edx, eax
    mov eax, edx
    ; compare to k again to see if this number has exactly k
    ; non-leading zero bits
    cmp edx, esi
    jnz notk
    ; increment ebp (the answer) if so
    mov edx, ebp
    add edx, 1
    mov ebp, edx
notk:
    ; and go to the next iteration
    inc ecx
    cmp ecx, edi
    jna loop_
    ; print the answer, which is in ebp
    PRINT_DEC 4, ebp
    xor eax, eax
    ret
```
(>1 sec)
Should I speed up the second program (and if so, how?), or somehow replace POPCNT with other instructions (which ones? I guess SSE2 and older should be available)?
First of all, a server too old to have `popcnt` will be significantly slower in other ways and have different bottlenecks. Given that it has `pshufb` but not `popcnt`, it's a first- or second-gen Core 2 (Conroe or Penryn). See Agner Fog's microarch PDF (on https://agner.org/optimize/). It also has lower clock speeds, so the best you can do on that CPU might not be enough to let brute force work.

There are probably algorithmic improvements that could save huge amounts of time, like noting that every 4 increments cycle the low 2 bits through a 00, 01, 10, 11 pattern: 2 zeros happens once per four increments, 1 zero happens twice, and no zeros happens once. For every number >= 4, these 2 bits are below the leading bit and thus part of the count. Generalizing this into a combinatorics formula for each MSB position between 1 and log2(N) might be a way to do vastly less work (see the sketch below). Handling the numbers between 2^M and N is less obvious.
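To make the combinatoric idea concrete, here's a hedged C sketch (my own formulation, untested against the grader): a number with its MSB at position m has m non-leading bits, so exactly C(m, K) of the 2^m numbers in [2^m, 2^(m+1)) have K zeros among them. Only the partial range from N's MSB up to N still needs a scan.

```c
#include <stdint.h>
#include <stdio.h>

// binomial coefficient via the multiplicative formula (each partial
// product is itself a binomial coefficient, so the division is exact)
static uint64_t binom(unsigned n, unsigned k) {
    if (k > n) return 0;
    uint64_t r = 1;
    for (unsigned i = 1; i <= k; i++)
        r = r * (n - k + i) / i;
    return r;
}

int main(void) {
    unsigned N = 1000000000, K = 8;
    unsigned msb = 31 - __builtin_clz(N);   // MSB position of N

    // full ranges [2^m, 2^(m+1)) for m < msb: C(m, K) matches each
    uint64_t total = 0;
    for (unsigned m = 0; m < msb; m++)
        total += binom(m, K);

    // partial range [2^msb, N]: still brute force, the less obvious part
    for (uint32_t x = 1u << msb; x <= N; x++)
        if (32 - __builtin_clz(x) - __builtin_popcount(x) == K)
            total++;

    printf("%llu\n", (unsigned long long)total);
}
```

For N=1000000000, K=8 this should agree with the 12509316 figure quoted further down.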
Versions here:

- `pcmpeqb`/`psubb` (with `psadbw` in an outer loop, like How to count character occurrences using SIMD shows): the inner loop reduces to counting byte elements in a fixed-size array that match a value calculated in the outer loop, just like the scalar version. 18ms on Skylake, ~0.036s on Core 2. Those times probably now include a considerable amount of startup overhead, but as expected/hoped, about 16x faster on both.
- Build per-popcount counts from the `wordbits` table once (perhaps as you generate it). Instead of searching 64KiB to find matching bytes, just look up the answer for every outer-loop iteration! That should let you go thousands of times faster for large N. (Although you still need to handle the low 1..64K and the partial range when N isn't a multiple of 64K.)

To usefully measure the faster versions, you could slap a repeat loop around the whole thing so the whole process still takes some measurable amount of time, like half a second. (Since it's asm, no compiler will optimize away the work of doing the same N, k repeatedly.) Or you could do the timing inside the program with `rdtsc`, if you know the TSC frequency. But being able to use `perf stat` on the whole process easily is nice, so I'd keep doing that (take out the printf and make a static executable to further minimize startup overhead).

You seem to be asking about micro-optimizing the brute-force approach that still checks every number separately. (There are significant optimizations possible to how you implement the `32 - clz - popcnt == k` check, though.)

There are other ways to do popcount that are generally faster, e.g. bithacks like in How to count the number of set bits in a 32-bit integer?. But when you have a lot of popcounting to do in a tight loop (enough to keep a lookup table hot in cache), the LUT can be good.
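For reference, the bithack popcount from that Q&A looks like this in C (the standard SWAR technique, not code from this answer):

```c
#include <stdint.h>

// SWAR popcount: sum adjacent bits into 2-bit fields, then 4-bit fields,
// then bytes; the multiply adds the four byte sums into the top byte.
static inline unsigned popcount32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555);
    x = (x & 0x33333333) + ((x >> 2) & 0x33333333);
    x = (x + (x >> 4)) & 0x0F0F0F0F;
    return (x * 0x01010101) >> 24;
}
```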
If you have fast SSSE3 `pshufb`, it could be worth using it to do a SIMD popcount for four dwords in parallel in an XMM register (auto-vectorizing the loop), or even better in a YMM register with AVX2. (First-gen Core 2 has `pshufb`, but it's not single-uop until 2nd-gen Core 2. Still possibly worth it.)

Or much better, use SIMD to count LUT elements that match what we're looking for, for a given high half of a number.
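A hedged sketch of that `pshufb` popcount, using SSSE3 intrinsics (the standard nibble-LUT technique; function names are mine):

```c
#include <stdint.h>
#include <tmmintrin.h>  // SSSE3

// Per-byte popcount of 16 bytes at once: look up each nibble in a
// 16-entry LUT of bit counts via pshufb, then add the two halves.
static inline __m128i popcount_bytes(__m128i v) {
    const __m128i lut  = _mm_setr_epi8(0,1,1,2, 1,2,2,3, 1,2,2,3, 2,3,3,4);
    const __m128i mask = _mm_set1_epi8(0x0F);
    __m128i lo = _mm_and_si128(v, mask);
    __m128i hi = _mm_and_si128(_mm_srli_epi16(v, 4), mask);
    return _mm_add_epi8(_mm_shuffle_epi8(lut, lo),
                        _mm_shuffle_epi8(lut, hi));
}
// Summing these byte counts (e.g. _mm_sad_epu8 against zero) gives qword
// totals; per-dword counts for four numbers in parallel need a bit more
// shuffling.
```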
Brute-forcing contiguous ranges of numbers opens up a major optimization for the LUT strategy: the upper n bits of the number only change once per 2^n increments, so you can hoist the count of those bits out of the inner-most loop. This can also make it worth using a smaller table (one that fits in L1d cache). A sketch of the resulting structure follows.
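Combining that hoisting with the "look up the answer" idea from the bullet list above gives something like this hedged C sketch (my own formulation; the low 64K and a partial last block still need cleanup): precompute how many 16-bit values have each popcount, and every full 64K block with a non-zero high half becomes O(1).

```c
#include <stdint.h>

static uint32_t hist[17];  // hist[p] = how many 16-bit values have popcount p

static void build_hist(void) {
    for (uint32_t i = 0; i < 65536; i++)
        hist[__builtin_popcount(i)]++;
}

// Matches in the full block [hi<<16, (hi<<16)+65535], for hi != 0.
// clz and the high-half popcount are constant across the block, so
//   32 - clz(x) - popcount(x) == k
// rearranges to popcount_low == msb + 1 - k - popcount(hi): one lookup.
static uint32_t count_block(uint32_t hi, uint32_t k) {
    uint32_t msb  = (31 - __builtin_clz(hi)) + 16;   // MSB position of x
    uint32_t need = msb + 1 - k - __builtin_popcount(hi);
    return need <= 16 ? hist[need] : 0;  // unsigned wrap rejects negatives
}
```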
Speaking of which, your 64k * 4 table is 256KiB, the size of your L2 cache. This means it probably has to come in from L3 every time you loop through it. Your desktop CPU should have enough L3 bandwidth for that (and the access pattern is contiguous thanks to the increments), and modern servers have larger L2, but there's very little reason not to use a byte LUT (popcnt(-1) is only 32). Modern Intel CPUs (since about Haswell) don't rename AL separately from the rest of EAX/RAX, and a `movzx` byte load is just as cheap as a `mov` dword load.

On an Intel CPU so old that it doesn't support `popcnt`, a byte load into AL followed by reading EAX will cause a partial-register stall. Do the next compare with `cmp al, dl` instead. (Use `lea` or `add` or `sub` on the `bsr` result, instead of on the popcount LUT load, so you can avoid a partial-register stall.)
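A byte LUT is also trivial to generate with a recurrence instead of the question's bit-clearing loop (a hedged sketch, standard technique):

```c
#include <stdint.h>

static uint8_t wordbits[65536];   // 64KiB instead of 256KiB

static void build_lut(void) {
    wordbits[0] = 0;
    for (uint32_t i = 1; i < 65536; i++)
        wordbits[i] = wordbits[i >> 1] + (i & 1);  // i>>1's bits plus the low bit
}
```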
Optimized for Skylake (i7-6700k): even with 2 LUT accesses per iteration, 0.600 seconds at 3.9GHz vs. 0.536 seconds with `popcnt`. Hoisting the high-half LUT lookup (and maybe the `32` constant) might even let that version be faster.

Note: a CPU so old that it doesn't have `popcnt` will be significantly different from Skylake. Optimizing this for Skylake is a bit silly, unless you take it further and wind up beating the `popcnt` version on Skylake, which is possible if we can hoist the BSR work by having nested loops, with an inner loop that uses the same BSR result for the whole range of numbers from `2^m .. 2^(m+1)-1` (clamped to a 64k range so you can also hoist the high-half popcnt LUT lookup). The inner-loop check then becomes `popcnt_low(i) == some constant` calculated from `k`, `popcnt_high(i)`, and `clz(i)`.

3 major things were quite important for Skylake (some of them relevant for older CPUs, including avoiding taken branches for front-end reasons):
- Avoid having a cmp/jcc touching a 32-byte boundary on Intel Skylake-derived CPUs with up-to-date microcode, because Intel mitigated the JCC erratum by disabling the uop cache for such lines: 32-byte aligned routine does not fit the uops cache. This meant looking at the disassembly and deciding whether to make instructions longer (e.g. with `lea eax, [dword -1 + edx]` to force a 4-byte displacement instead of the smaller disp8) and whether to use `align 32` at the top of a loop.

- No-increment is much more common than increment, and Intel CPUs can only run taken branches at 1/clock. (CPUs since Haswell do have a 2nd execution unit on another port that can run predicted-not-taken branches.) Change `jne notk` to a `je yesk` that targets a block below the function, which jumps back. Tail-duplicating the `dec ecx` / `jnz .loop` into that block (else falling through to a `jmp print_and_exit`) helped a tiny amount vs. just jumping back to after the `je yesk`. The branch is taken so rarely (and has a consistent enough pattern) that it doesn't mispredict often, so branchless `setnz al` / `add ebx, eax` would probably be worse.

- Optimize the `32 - clz - popcnt == k` check (see the sketch below), taking advantage of the fact that `bsr` gives you `31-clz`: `31-clz - (popcnt-1) = 32-clz-popcnt`. Since we're comparing that for `== k`, it can be further rearranged to `popcnt-1 + k == 31-clz`.

  When we're using a LUT for popcount instead of a `popcnt` instruction that has to run on port 1, we can afford to use a slow 3-component LEA like `lea edx, [edx + esi - 1]` to do the `popcnt-1+k`. Since it has 3 components (2 registers and a displacement, i.e. 2 `+` signs in the addressing mode), it can only run on port 1 (with 3-cycle latency), competing with `bsr` (and with `popcnt`, if we were using it).
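In C terms, the rearranged check from that last bullet looks like this (a hedged sketch; `wordbits` is a byte popcount LUT as above, and `31 - __builtin_clz(x)` stands in for the `bsr` result):

```c
#include <stdint.h>

extern uint8_t wordbits[65536];   // byte popcount LUT for 16-bit halves

// 32 - clz(x) - popcount(x) == k, rearranged so the bsr result is
// compared directly: popcount(x) - 1 + k == 31 - clz(x), for x != 0.
static inline int has_k_zero_nonleading_bits(uint32_t x, uint32_t k) {
    uint32_t pop   = wordbits[x & 0xFFFF] + wordbits[x >> 16];
    uint32_t bsr_x = 31 - __builtin_clz(x);   // what bsr computes
    return pop - 1 + k == bsr_x;              // lea edx,[edx+esi-1] does pop-1+k
}
```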
Taking advantage of `lea` saved instructions in general, even in the popcnt version. So did counting the loop down towards 0, with a macro-fused 1-uop `dec`/`jnz` instead of `inc` + `cmp`/`jne`. (I haven't tried counting up to see if L1d HW prefetch works better in that direction; the popcnt version won't care, but the LUT version might.)

Ported to work without io.inc, just using hard-coded N and k with printf for output. This is not "clean" code, e.g. it has nasty hacks like `%define wordbits edi`, which I put in to test changing the alignment of branches by using indexed addressing modes instead of `[reg + disp32]` for every access to the array. That happened to do the trick, getting almost all of the uops to come from the DSB (the uop cache) instead of MITE, i.e. it avoided the JCC-erratum slowdown. The other way to do it would be making instructions longer, to push the `cmp/je` and `dec/jnz` past a 32-byte boundary. (Or to change the alignment of the start of the loop.) Uop-cache fetch happens in lines of up to 6 uops and can be a bottleneck if you end up with a line holding only a couple of uops. (Skylake's loop buffer, aka the LSD, is also disabled by microcode to fix an earlier erratum; Intel had more big bugs with Skylake than most designs.)

This is version 3, updated to avoid partial-register penalties on Core 2 (Conroe). It runs in 1.69s (down from 1.78s) vs. 3.18s for your version. It's still sometimes as fast on Skylake, but more often 610ms instead of 594ms. I don't have perf-counter access on my Core 2; it's too old for perf to fully support, and I don't have perf for the kernel it last booted.

(Disassembly and perf results for version 1 on Godbolt: https://godbolt.org/z/ox7e8G)
On my Linux desktop, i7-6700k at 3.9GHz. (EPP = balance_performance, not full performance, so apparently it doesn't want to turbo to 4.2GHz.) I don't need `sudo` to use perf because I set `/proc/sys/kernel/perf_event_paranoid` = 0. I use `taskset -c 3` just to avoid CPU migrations for single-threaded workloads.

This is about 3.93 fused-domain (front-end) uops/clock, so we're pretty close to the 4/clock front-end width.
With popcnt:
Your original (with `GET_DEC` replaced by loading a constant) ran in 1.3 sec on my desktop, for k=8, N=1000000000. This version runs in about 0.54 sec. My build of your original wasn't even the worst possible case for alignment of branches (another version was about 1.6 sec), although since I did have to change things, it could differ from your machine's build.
(Unfinished) Inner loop with invariant clz(x) and high-half popcount
This version runs in only 0.58 sec on my 2.4GHz Core 2 Duo E6600 (Conroe), same microarchitecture as your Xeon 3050 2.13GHz.
And in 210ms on my Skylake.
It does most of the work, only missing the cleanup for N < 65536 (or the low 65536 of a larger N, where the MSB is in the low half), and maybe missing handling of a couple of other corner cases in the outer loop. But the inner loop totally dominates the run-time, and it wouldn't have to run any more in a complete version, so these times should be realistic.
It still brute-force checks every single number, but most of the per-number work that depends on the high half is loop-invariant and hoisted out. This assumes non-zero high halves, but only 2^16 numbers have their MSB in the low 16 bits. And narrowing the inner loop to only the low 12 or 14 bits means less cleanup, as well as a smaller part of the LUT to loop over, which can stay hot in L1d. A sketch of the structure is below.
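A hedged C sketch of that structure (my reconstruction of the idea, not the answer's asm; the inner loop here covers the low 14 bits):

```c
#include <stdint.h>

extern uint8_t wordbits[1 << 14];   // popcount LUT for the low 14 bits

// Matches in [base, base + 2^14), where base != 0 is 2^14-aligned:
// clz(x) and the popcount of the bits above the low 14 are invariant
// across the block, so the per-number check collapses to comparing
// each LUT byte against one precomputed target.
static uint32_t count_block_scan(uint32_t base, uint32_t k) {
    uint32_t hi_pop = __builtin_popcount(base);   // bits above the low 14
    uint32_t msb    = 31 - __builtin_clz(base);
    uint32_t target = msb + 1 - k - hi_pop;       // wraps if negative
    uint32_t count  = 0;
    if (target <= 14)
        for (uint32_t lo = 0; lo < (1u << 14); lo++)
            count += (wordbits[lo] == target);    // the hot inner loop
    return count;
}
```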
Skylake perf results:
Note that `uops_issued.any` is about the same as `idq.DSB_uops` + `idq.MITE_uops`. If we'd used EDI as a pointer to save code size, `uops_issued.any` would be much higher because of un-lamination of the indexed addressing modes from the micro- and macro-fused cmp+jcc.
It's also interesting that branch misses are even lower; perhaps the unrolling helped distribute the history better in the IT-TAGE predictor table.
SSE2 SIMD
Also unfinished, not handling corner cases or cleanup, but I think it does approximately the right amount of work.
Unlike in How to count character occurrences using SIMD, the array we're matching against has known limits on how often matches can occur, so it happens to be (mostly?) safe to not do nested loops, just a 2^14 (16384) iteration loop unrolled by 2 before widening the byte counters out to dword. At least for k=8.
This gets a total count of 12507677, just slightly lower than 12509316 (correct for N=1000000000, k=8), but I haven't checked if that's all due to not doing 1..16384, or if I'm losing any counts anywhere.
You could unroll over outer-loop iterations to make use of each XMM vector twice or 4 times for each load. (With sequential access to an array in L1d cache, that could possibly let us go slightly faster by doing more ALU work per load, but not much faster.) By setting up 2 or 4 vectors to match against for 2 or 4 different high halves, you can spend longer in the inner loop. That might run into (cold) register-read bottlenecks on Core 2, though.
The version below just does a standard unroll.
Probably can just PADDD into a vector accumulator and only hsum to scalar outside the loop, but we might want more free vector regs?
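Since the asm itself isn't shown here, an intrinsics sketch of that inner loop may help (hedged and simplified; the real version unrolls differently): `pcmpeqb` produces 0 or -1 per byte, `psubb` accumulates those into byte counters, and `psadbw` widens them before they could overflow.

```c
#include <stdint.h>
#include <emmintrin.h>  // SSE2

extern uint8_t wordbits[1 << 14];   // popcount LUT for the low 14 bits

// Count bytes of wordbits[] equal to target.  Byte counters are widened
// every 128 vectors, staying safely below 256.
static uint32_t count_matches(uint8_t target) {
    const __m128i vtarget = _mm_set1_epi8((char)target);
    const __m128i zero    = _mm_setzero_si128();
    const __m128i *p = (const __m128i *)wordbits;
    __m128i total = zero;                          // qword totals
    for (int chunk = 0; chunk < (1 << 14) / 16; chunk += 128) {
        __m128i bytecnt = zero;                    // per-byte match counts
        for (int i = 0; i < 128; i++) {
            __m128i eq = _mm_cmpeq_epi8(_mm_loadu_si128(p + chunk + i), vtarget);
            bytecnt = _mm_sub_epi8(bytecnt, eq);   // subtracting -1 adds 1
        }
        total = _mm_add_epi64(total, _mm_sad_epu8(bytecnt, zero));
    }
    uint64_t sums[2];
    _mm_storeu_si128((__m128i *)sums, total);      // hsum the two qwords
    return (uint32_t)(sums[0] + sums[1]);
}
```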