scanf a string and print strlen in assembly gas 64-bit


I am trying to write a strlen function in assembly using 64-bit GAS. I need to get an input string from the user, and print its length. This is my code:

.lcomm d2, 255
.data
pstring1:  .ascii "%s\0\n"

.text
.globl main
main:
    movq %rsp, %rbp 

    subq $8, %rsp   
    movq  $d2, %rsi
    movq  %rsi,%rbx          
    movq  $pstring1, %rdi
    movq  $0,%rax
    call scanf

    movq   $1, %rax
    movq   $d2, %rsi
    movq   $pstring1, %rdi
    call  printf #print to check if scanf worked write

    add   $8, %rsp

    movq 8(%rsp), %rcx
    movq %rcx, d2
    call pstrlen
    popq %rbx   
    ret

    ##########
pstrlen:  

    movq %rsp, %rbx
    movq 16(%rbp),%rdx
    xor %rax, %rax        
    jmp if

then:
    incq %rax
    movq $length,%rax
if:
    movq %rdx, %rcx
    cmp 0, %rcx
    jne then
end:
    pop %rbp
    ret

It would be ideal if someone could explain, with an example, how to work with strings and pass parameters to functions in 64-bit GAS assembly, since I can't find anything suitable online.

Best answer

In principle, .lcomm d2, 255 allocates 255 bytes for the string data. One byte is 8 bits, and each bit is either 0 or 1, so the maximum value of one byte is 2^8-1 = 255 when treated as an unsigned binary value. That is how I most commonly think about bytes (as a number 0..255), but those 8 bits can also represent other values: sometimes they are treated as a signed 8-bit value (-128..+127), and sometimes individual bits are given a specific meaning by the code accessing them. (This part of your code is good.)

Then you use scanf with the format definition "%s\0\n" (it assembles to the bytes '%', 's', 0, 10; I'm not sure what the 10 is good for after the null terminator). I would use .asciz "%254s" instead, to prevent a malicious user from entering more than the reserved d2 space can hold: scanf then stores at most 254 characters plus the terminating zero, 255 bytes in total. (Note it's .asciz, with z at the end, so it adds the zero byte on its own.)

Then you use printf. It is better to provide a separate format string for the output, for example formatOut: .asciz "%s\n".
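As a rough sketch of the two points above (a bounded input format and a separate output format), the scanf/printf part could look like the following. The label formatIn is a name made up for this example, the leaq ...(%rip) and @PLT forms are my choice so that it also links as a position-independent executable, and I am assuming Linux with the program built through gcc (something like gcc prog.s -o prog, where prog.s is a hypothetical file name).

```asm
        .lcomm  d2, 255                 # 255-byte zero-filled buffer in .bss

        .section .rodata
formatIn:   .asciz  "%254s"             # hypothetical label: at most 254 chars + zero byte
formatOut:  .asciz  "%s\n"              # separate format string for the output

        .text
        .globl  main
main:
        subq    $8, %rsp                # keep the stack 16-byte aligned for the calls

        leaq    formatIn(%rip), %rdi    # 1st argument: format string
        leaq    d2(%rip), %rsi          # 2nd argument: destination buffer
        xorl    %eax, %eax              # al = 0: no vector registers used by this variadic call
        call    scanf@PLT

        leaq    formatOut(%rip), %rdi   # 1st argument: format string
        leaq    d2(%rip), %rsi          # 2nd argument: the string to print
        xorl    %eax, %eax              # again no vector register arguments
        call    printf@PLT

        xorl    %eax, %eax              # return 0 from main
        addq    $8, %rsp
        ret
```

Keep in mind that %254s stops at the first whitespace character, so only the first word of the input line ends up in d2.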

Finally you want strlen.

Which brings me back to the input. If you are running a normal 64-bit OS (Linux), your input string is very likely UTF-8 encoded (unless your OS is set to some other specific locale, in which case I'm not sure which locale scanf will pick up).

UTF-8 is a variable-length encoding, so you should decide whether your strlen will return the number of characters or the number of bytes occupied.

For simplicity I will assume the number of bytes (not chars) is enough for you. If your input strings consist only of basic 7-bit ASCII characters ([0-9A-Za-z !@#$%^&*,.;'\<>?:"|{}] etc.; check any ASCII table; no accented chars like á are allowed, as those produce multi-byte UTF-8 codes), then the number of bytes is also equal to the number of characters (UTF-8 is compatible with 7-bit ASCII).

That means, for example, that for the string "Hell 1234" the memory at address d2 will contain these values (hexadecimal): 48 65 6C 6C 20 31 32 33 34 00. (Note that scanf's %s stops at the first whitespace, so with that format only "Hell", 48 65 6C 6C 00, would actually be stored from such an input line.) Once again, if you check an ASCII table, you will see that for example byte 0x20 is the space character, etc. And the string is "nul terminated": the last value, zero, is part of the string in memory, but it is not displayed; instead it is used by various C functions as the "end of string marker".

So what you want to do in strlen is load some register with the d2 address, let's say rdi, and then scan byte by byte (byte, because ASCII encoding works in a "1 char = 1 byte" way, and we will ignore UTF-8 variable-length codes) until you reach a zero value in memory, meanwhile counting how many bytes it took to reach it. If you ponder this idea a bit to make it "short" for the CPU, and use SCASB for the scanning (you can also write it "manually" with ordinary mov/cmp/inc/jne/jnz if you wish), you may end up with this:

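rdi = d2 address
rdx = rdi  ; (copy of d2 address)
ecx = 255  ; maximum length of string
al  = 0    ; value to test against
repne scasb  ; repeat SCASB instruction until zero is found (or rcx reaches 0)
; SCASB advances rdi after each compare, so here rdi points one byte
; past the zero byte (or it's d2+255 if the zero terminator is missing)
rdi -= rdx ; rdi = number of bytes scanned, including the zero byte
rdi -= 1   ; rdi = length of string (when the terminator was found)
; return result as you wish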
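Written out as real AT&T syntax, the idea could look roughly like this. It is only a sketch: pstrlen takes the string address in rdi and returns the length in rax, the sample string and the trick of returning the length as the exit status are just my test harness, and the 255-byte limit is carried over from the buffer size. SCASB increments rdi after every compare, so after a successful scan rdi is one byte past the terminator and has to be stepped back.

```asm
        .section .rodata
sample: .asciz  "Hell 1234"             # hypothetical test string: 9 bytes + zero byte

        .text
# pstrlen: rdi = address of a zero-terminated string, result in rax.
# Scans at most 255 bytes; relies on the direction flag being clear,
# which the ABI guarantees on function entry.
pstrlen:
        movq    %rdi, %rdx              # keep a copy of the start address
        movl    $255, %ecx              # scan limit (writing ecx zeroes the upper half of rcx)
        xorl    %eax, %eax              # al = 0, the byte value to search for
        repne scasb                     # compare al with (%rdi), advance rdi, until match or rcx = 0
        jne     1f                      # no zero byte within the limit: rdi = start + 255
        decq    %rdi                    # zero found: step back onto the terminator
1:
        movq    %rdi, %rax
        subq    %rdx, %rax              # length = end address - start address
        ret

        .globl  main
main:
        subq    $8, %rsp                # keep the stack 16-byte aligned
        leaq    sample(%rip), %rdi      # 1st argument: address of the string
        call    pstrlen                 # rax = 9 for the sample string
        addq    $8, %rsp
        ret                             # the length becomes the exit status of the program
```

After running it, echo $? in the shell shows the returned length.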

So you first need to correctly understand what values you are manipulating, where they are, what their bit/byte size is, and what structure they have.

Then you can write instructions which produce any reasonable calculation based on those data.

In your case the calculation is "length_of_string = number of non-zero bytes in 7b ASCII encoded string stored in memory at address d2" (I mean after successful scanf part of code).

Judging by your source, it looks to me like you don't understand what the x86 CPU instructions do, and you just copy them from some examples. That will get you into trouble soon.

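For example, cmp 0, %rcx does not even compare rcx with zero: in AT&T syntax a 0 without the $ prefix is a memory address, so this compares rcx with 8 bytes read from address 0 and will most likely crash. You probably meant cmp $0, %rcx, which checks whether rcx (an 8 byte "wide" value) is equal to zero. But you loaded rcx with the value from rdx, which was something from the stack (maybe the d2 address), so rcx will never be zero.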

And even if you actually loaded the character values from memory into rcx, you would load 8 of them at the same time, so you would miss the 0 value, as it would be only a single byte inside some garbage, like 0xCCCCCCCC00343332 (I'm using 0xCC for the undefined memory after the d2 buffer just as an example; there may be any value).

So that code doesn't make any sense. If you at least partially understand what CPU registers are and what instructions like mov/inc/cmp/... do, then you have some chance to produce working code by simply using a debugger a lot: verify after almost every 1-2 new instructions added to the source whether they manipulate the correct values, and fix them until you get it right.

Which requires you to have a clear idea of what the "correct behaviour" is first (in this case: "fetch byte-by-byte values from the d2 address, one after another, incrementing a "length" counter, and look for the zero byte"), so you can tell whether the code does what you need or not.
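That "correct behaviour" translates almost word for word into a manual loop. The following is a minimal sketch, assuming the string address arrives in rdi and the length is returned in rax; the sample string and returning the length as the exit status are only there to make it testable. The important detail is cmpb, which compares a single byte, not 8 bytes at once.

```asm
        .section .rodata
sample: .asciz  "Hell 1234"             # hypothetical test string: 9 bytes + zero byte

        .text
# pstrlen: rdi = address of a zero-terminated string, result in rax.
pstrlen:
        xorl    %eax, %eax              # length counter = 0 (also clears the upper half of rax)
1:
        cmpb    $0, (%rdi,%rax)         # is the single byte at address rdi+rax the terminator?
        je      2f                      # yes: rax already holds the length
        incq    %rax                    # no: count this byte and look at the next one
        jmp     1b
2:
        ret

        .globl  main
main:
        subq    $8, %rsp                # keep the stack 16-byte aligned
        leaq    sample(%rip), %rdi      # 1st argument: address of the string
        call    pstrlen                 # rax = 9 for the sample string
        addq    $8, %rsp
        ret                             # the length becomes the exit status of the program
```

Unlike the SCASB variant, this loop has no upper limit, so it depends on the terminator actually being there.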


What I wanted to point out with this answer is that the instructions themselves, while important, are less important than your vision of the data/structures/algorithm used. Your question sounds like you have no idea what a "C string" is in x86 assembly, or which algorithm to use. That makes it impossible for you to just "guess" some instructions into the source and then verify whether you guessed right, because you can't tell what you want the code to do. That's why I told you to also check non-gas x86 assembly resources for the very basics (what a bit/byte/computer memory/etc. is) until you somewhat understand what numeric values are manipulated, for example, to create "strings".

Once you have a good idea of what it should do, it will be easy for you to catch things in the debugger like swapped arguments (for example: movq %rcx, d2; why do you put 8 bytes from rcx into memory at address d2? That will overwrite the input string) and similar. So you don't actually need to understand the instructions and gas syntax 100% well, just enough to produce something and then "fix" it over several iterations: checking the register+memory view, realizing that rcx didn't change but the string data were damaged instead => try it the other way...


Oh, and I completely forgot... you need to find the documentation for your 64-bit platform ABI, so you know the correct way to pass arguments to C functions.
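On 64-bit Linux that ABI is the System V AMD64 one: the first integer/pointer arguments go in rdi, rsi, rdx, rcx, r8, r9, the result comes back in rax, al holds the number of vector registers used when calling a variadic function, and rsp must be 16-byte aligned at every call. Putting it together, a whole program could look roughly like this sketch (formatIn and formatLen are labels invented for the example, and I am assuming a build through gcc on Linux):

```asm
        .lcomm  d2, 255                 # 255-byte zero-filled input buffer

        .section .rodata
formatIn:   .asciz  "%254s"             # hypothetical label: bounded input format
formatLen:  .asciz  "%ld\n"             # hypothetical label: prints the length

        .text
# pstrlen: rdi = address of a zero-terminated string, result in rax.
pstrlen:
        xorl    %eax, %eax              # length counter = 0
1:
        cmpb    $0, (%rdi,%rax)         # terminator reached?
        je      2f
        incq    %rax
        jmp     1b
2:
        ret

        .globl  main
main:
        subq    $8, %rsp                # keep the stack 16-byte aligned for the calls

        leaq    formatIn(%rip), %rdi    # scanf 1st argument: format string
        leaq    d2(%rip), %rsi          # scanf 2nd argument: destination buffer
        xorl    %eax, %eax              # no vector register arguments
        call    scanf@PLT

        leaq    d2(%rip), %rdi          # pstrlen 1st argument: address of the string
        call    pstrlen                 # length is returned in rax

        leaq    formatLen(%rip), %rdi   # printf 1st argument: format string
        movq    %rax, %rsi              # printf 2nd argument: the length
        xorl    %eax, %eax              # no vector register arguments
        call    printf@PLT

        xorl    %eax, %eax              # return 0 from main
        addq    $8, %rsp
        ret
```

Typing Hello as the input should print 5. Note that nothing is passed on the stack here, which is why reading arguments from 16(%rbp) or 8(%rsp) as in the question does not find them.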

For example, on Linux these tutorials may help: http://cs.lmu.edu/~ray/notes/gasexamples/

And search here for word "ABI" for further resources: https://stackoverflow.com/tags/x86/info