The asm function strlen
receives the link to a string as a char - Array. To to so, the function may use SWAR on general purpose register, but without using xmm register or SSE instructions.
The function checks with the bit manipulation: (v - 0x01010101) & ~(v & 0x80808080)
in 8 byte steps,
if the word contains a zero byte, which marks the end of the string. If it does, it iterates byte per byte until the zero to avoid a page fault.
The alignment works like in this GNU Libc implementation:
for (char_ptr = str; ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0; ++char_ptr){
if (*char_ptr == '\0'){
return char_ptr - str;
}
}
Is there any way to make it faster?
; rdi <- const *char
; rax <- counter + return value
; r10 <- current array for computation
; rcx,r8 <- Bitmask
; rsi <- Arr for calculation
strlen:
PUSH rbp
SUB rsp,8
XOR rax,rax
MOV r8,31
alignment:
CMP byte [rdi+rax],0
JE end
MOV rsi,rdi
ADD rsi,rax
AND rsi,r8
CMP rsi,0
JE while_parallel
INC rax
JMP alignment
while_parallel:
MOV rcx,0x01010101
MOV r8,0x80808080
while_parallel_loop:
MOV r10,[rdi+rax]
MOV rsi,r10
NOT r10
AND r10,r8
SUB rsi,rcx
AND rsi,r10
CMP rsi,0
JNE while_single
ADD rax,8
JMP while_parallel_loop
while_single:
CMP byte [rdi+rax],0
JE end
INC rax
JMP while_single
end:
ADD rsp,8
POP rbp
RET
Note that I do not intend to use any SSE instructions nor xmm register.
The code from the question seems to have a defect: Mycroft's magic constants have not been extended to 64 bits. As for efficiency: the various x86-64 calling conventions are primarily register-based, so it is not necessary to maintain a stack frame for simple functions. The main loop in the posted code appears a bit too complex; note that most x86 instructions set flags that can be used for branching. Instead of processing byte-wise until reaching the next
QWORD
boundary from the start of the string, simply read the entireQWORD
it falls in, and overwrite the bytes preceding start of the string with0xff
. The resulting "masked' data can then be processed by the regularQWORD
processing loop.Below is x86-64 code written for Windows that incorporates these suggestions. Adjusting it to the System V ABI used by Linux should require nothing more than a cyclical exchange of registers. Note that on CPUs with BMI instruction set extension the 64-bit processing loop can be reduced by one instructions by merging the
not rcx
with the followingand rcx, r9
intoandn rcx, rcx, r9
.I used the ISO-C99 test scaffolding below to check the correctness of the
strglen
implementation above. I built with the Microsoft Macro Assembler and the Intel C/C++ compiler, as follows:ml64 -c strglen.asm
,icl /Qstd=c99 /Ox /W4 strlen.c strglen.obj
. I also inspected the generated machine code withdumpbin /disasm strglen.obj
, mostly to make sure the alignment directives work as desired and that loop unrolling via theREPT
macro works correctly, as I had not used that in the past twenty years.