The asm function strlen receives the link to a string as a char - Array. To to so, the function may use SWAR on general purpose register, but without using xmm register or SSE instructions.
The function checks with the bit manipulation: (v - 0x01010101) & ~(v & 0x80808080)in 8 byte steps,
if the word contains a zero byte, which marks the end of the string. If it does, it iterates byte per byte until the zero to avoid a page fault.
The alignment works like in this GNU Libc implementation:
for (char_ptr = str; ((unsigned long int) char_ptr & (sizeof (longword) - 1)) != 0; ++char_ptr){
    if (*char_ptr == '\0'){
        return char_ptr - str;
    }
}
Is there any way to make it faster?
; rdi <- const *char
; rax <- counter + return value
; r10 <- current array for computation
; rcx,r8 <- Bitmask
; rsi <- Arr for calculation 
            
strlen:
    PUSH rbp
    SUB rsp,8
 
    XOR rax,rax    
    MOV r8,31
alignment:
    CMP byte [rdi+rax],0
    JE end
    
    MOV rsi,rdi
    ADD rsi,rax
    AND rsi,r8
    CMP rsi,0
    JE while_parallel
    INC rax
    JMP alignment       
          
while_parallel:  
    MOV rcx,0x01010101
    MOV r8,0x80808080
while_parallel_loop:  
    MOV r10,[rdi+rax]
    MOV rsi,r10
    
    NOT r10
    AND r10,r8
    SUB rsi,rcx
    AND rsi,r10
    
    CMP rsi,0
    JNE while_single
    ADD rax,8
    JMP while_parallel_loop
    
while_single:
    CMP byte [rdi+rax],0
    JE  end
    INC rax
    JMP while_single    
end:
    ADD rsp,8
    POP rbp
    RET
Note that I do not intend to use any SSE instructions nor xmm register.
 
                        
The code from the question seems to have a defect: Mycroft's magic constants have not been extended to 64 bits. As for efficiency: the various x86-64 calling conventions are primarily register-based, so it is not necessary to maintain a stack frame for simple functions. The main loop in the posted code appears a bit too complex; note that most x86 instructions set flags that can be used for branching. Instead of processing byte-wise until reaching the next
QWORDboundary from the start of the string, simply read the entireQWORDit falls in, and overwrite the bytes preceding start of the string with0xff. The resulting "masked' data can then be processed by the regularQWORDprocessing loop.Below is x86-64 code written for Windows that incorporates these suggestions. Adjusting it to the System V ABI used by Linux should require nothing more than a cyclical exchange of registers. Note that on CPUs with BMI instruction set extension the 64-bit processing loop can be reduced by one instructions by merging the
not rcxwith the followingand rcx, r9intoandn rcx, rcx, r9.I used the ISO-C99 test scaffolding below to check the correctness of the
strglenimplementation above. I built with the Microsoft Macro Assembler and the Intel C/C++ compiler, as follows:ml64 -c strglen.asm,icl /Qstd=c99 /Ox /W4 strlen.c strglen.obj. I also inspected the generated machine code withdumpbin /disasm strglen.obj, mostly to make sure the alignment directives work as desired and that loop unrolling via theREPTmacro works correctly, as I had not used that in the past twenty years.