I am trying to write a strlen
function in assembly using 64-bit GAS.
I need to get an input string from the user, and print
its length. This is my code:
.lcomm d2, 255
.data
pstring1: .ascii "%s\0\n"
.text
.globl main
main:
movq %rsp, %rbp
subq $8, %rsp
movq $d2, %rsi
movq %rsi,%rbx
movq $pstring1, %rdi
movq $0,%rax
call scanf
movq $1, %rax
movq $d2, %rsi
movq $pstring1, %rdi
call printf # print to check whether scanf worked right
add $8, %rsp
movq 8(%rsp), %rcx
movq %rcx, d2
call pstrlen
popq %rbx
ret
##########
pstrlen:
movq %rsp, %rbx
movq 16(%rbp),%rdx
xor %rax, %rax
jmp if
then:
incq %rax
movq $length,%rax
if:
movq %rdx, %rcx
cmp 0, %rcx
jne then
end:
pop %rbp
ret
Ideally someone could explain, with an example, how to work with strings and pass parameters to functions in 64-bit GAS assembly, since I can't find anything suitable online.
On the level of principle, you are using .lcomm d2, 255 to allocate 255 bytes for the string data. One byte is 8 bits, and one bit is either 0 or 1, so the maximum value of one byte is 2^8-1 = 255 when it is treated as an unsigned binary value. That is the most common way I think about bytes (as a number 0..255), but those 8 bits can also represent other values: sometimes a signed 8-bit interpretation is used (-128..+127), or particular bits are given a specific meaning by the code accessing them. (This part of your code is good.)

Then you use scanf with the "%s\0\n" definition (it will assemble as the bytes '%', 's', 0, 10 ... I am not sure what the 10 is good for there, after the null terminator). I would use .asciz "%254s" instead, to prevent a malicious user from entering more input than fits into the reserved d2 space. (Note that it is .asciz, with z at the end, so it adds the zero byte on its own.)

Then you use printf. Rather provide a separate format string for output, this time like
formatOut: .asciz "%s\n"
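To make the format-string advice above concrete, here is a minimal sketch of a program that only reads a word and echoes it back. It assumes Linux x86-64 with the System V AMD64 calling convention (arguments in rdi and rsi, al holding the number of vector registers for variadic calls) and that it is assembled and linked with gcc. The label formatIn is a name made up for this example; formatOut and d2 come from the text above.

```asm
# Sketch only: assumes Linux x86-64, System V AMD64 ABI, built with gcc.
    .lcomm  d2, 255                 # 255-byte input buffer, as in the question

    .data
formatIn:  .asciz "%254s"           # hypothetical label; at most 254 chars + zero byte
formatOut: .asciz "%s\n"            # separate format string for output

    .text
    .globl main
main:
    subq    $8, %rsp                # keep rsp 16-byte aligned for the calls below

    leaq    formatIn(%rip), %rdi    # 1st argument: format string
    leaq    d2(%rip), %rsi          # 2nd argument: destination buffer
    xorl    %eax, %eax              # variadic call: no vector registers used
    call    scanf@PLT

    leaq    formatOut(%rip), %rdi   # 1st argument: output format
    leaq    d2(%rip), %rsi          # 2nd argument: the string just read
    xorl    %eax, %eax
    call    printf@PLT

    xorl    %eax, %eax              # return 0 from main
    addq    $8, %rsp
    ret

    .section .note.GNU-stack,"",@progbits
```

Keeping two format strings means the input limit ("%254s") and the output layout ("%s\n") can change independently of each other.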

Finally you want strlen, which means I will return to the input. If you are running a normal 64b OS (linux), your input string is very likely UTF-8 encoded (unless your OS is set to some other specific locale; then I'm not sure which locale scanf will pick up).

UTF-8 is a variable-length encoding, so you should decide whether your strlen will return the number of characters or the number of bytes occupied.

For simplicity I will assume the number of bytes (not chars) is enough for you. If your input strings consist only of basic 7b ASCII characters ([0-9A-Za-z !@#$%^&*,.;'\<>?:"|{}] etc... check any ASCII table; no accented chars like á allowed, as those would produce multi-byte UTF-8 codes), then the number of bytes will also be equal to the number of characters (UTF-8 encoding is sort of compatible with 7b ASCII).

That means, for example, that if the buffer at address d2 holds the string "Hell 1234", the memory will contain these values (hexadecimal): 48 65 6C 6C 20 31 32 33 34 00. (Take this purely as an illustration of the memory contents: scanf with %s stops reading at the first whitespace, so it would store only "Hell".) Once again, if you check an ASCII table, you will see that for example the byte 0x20 is the space character, etc... And the string is "nul terminated": the last value, zero, is part of the string but is not displayed; instead it is used by various C functions as the "end of string marker".

So what you want to do in strlen is load some register with the d2 address, let's say rdi, and then scan byte by byte (byte, because ASCII encoding works in a "1 char = 1 byte" way, and we will ignore UTF-8 variable-length codes) until you reach a zero value in memory, meanwhile counting how many bytes it took to reach it. If you ponder this idea a bit to make it "short" for the CPU, and you use SCASB for the scanning (you can also write it "manually" with ordinary mov/cmp/inc/jne/jnz if you wish), you may end up with a routine of only a few instructions.

So first you need to understand correctly what values you are manipulating, where they are, what their bit/byte size is, and what structure they have.
Then you can write instructions which produce any reasonable calculation based on those data.
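The byte scan with SCASB described earlier could be sketched as follows. This is only one possible form, assuming Linux x86-64 with the System V AMD64 ABI and a build with gcc; the label sample and the choice of returning the length as the process exit status are made up for this illustration.

```asm
# Sketch only: assumes Linux x86-64, System V AMD64 ABI, built with gcc.
    .data
sample: .asciz "Hell 1234"          # hypothetical test string: 9 bytes + zero byte

    .text
    .globl main
main:
    leaq    sample(%rip), %rdi      # argument: address of the string
    call    pstrlen                 # rax = length in bytes
    ret                             # length becomes the exit status

# pstrlen: rdi = address of zero-terminated string, returns byte count in rax
pstrlen:
    xorl    %eax, %eax              # al = 0, the byte value SCASB looks for
    movq    $-1, %rcx               # count limit: effectively "no limit"
    cld                             # scan upwards (the ABI already guarantees DF = 0)
    repne scasb                     # compare al with (%rdi); rdi++, rcx--; stop on match
    notq    %rcx                    # rcx = bytes scanned, terminator included
    leaq    -1(%rcx), %rax          # exclude the terminator from the length
    ret

    .section .note.GNU-stack,"",@progbits
```

After running the program, echo $? in the shell shows the exit status, which for this sample string is 9.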
In your case the calculation is "length_of_string = number of bytes before the first zero byte in the 7b ASCII encoded string stored in memory at address d2" (I mean after the successful scanf part of the code).

Considering how your source looks, it seems to me that you don't understand what x86 CPU instructions do, and that you just copy them from some examples. That will get you into trouble soon.
For example cmp 0, %rcx is meant to check whether rcx (an 8-byte "wide" value) is equal to zero. (In AT&T syntax an immediate needs the $ prefix, i.e. cmp $0, %rcx; without it the 0 is treated as a memory address.) And you loaded rcx with the value from rdx, which was something from the stack (maybe the d2 address), so rcx will never be zero.

And even if you actually loaded the character values from memory into rcx, you would load 8 of them at the same time, so you would miss the 0 value, as it would be only a single byte inside some garbage, like 0xCCCCCCCC00343332 (I'm using 0xCC for the undefined memory after the string just as an example; there may be any value).

So that code doesn't make any sense. If you at least partially understand what CPU registers are and what instructions like mov/inc/cmp/... do, then you have some chance of producing working code by simply using a debugger a lot: verify after almost every 1-2 new instructions added to the source that they manipulate the correct values, and fix them until you get it right.

This requires you to have a clear idea of what the "correct behaviour" is first (in this case: fetching values byte by byte from the d2 address, one after another, incrementing a "length" counter, and looking for the zero byte), so you can tell whether the code does what you need or not.

What I wanted to point out with this answer is that the instructions themselves, while important, are less important than your vision of the data/structures/algorithm used. Your question sounds like you have no idea what a "C string" is in x86 assembly, or which algorithm to use. That makes it impossible for you to just "guess" some instructions into the source and then verify whether you guessed right, because you can't tell what you want the code to do. That's why I told you that you should also check non-gas x86 assembly resources for the very basics (what a bit/byte/computer memory/etc. is) until you somewhat understand what numeric values are manipulated, for example to create "strings".
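As an illustration of the "correct behaviour" described above (fetch one byte, compare it with zero, increment the counter), a manual loop with ordinary instructions might look like the sketch below. It assumes Linux x86-64 with the System V AMD64 ABI and a build with gcc; the label sample and the use of the exit status to report the result are made up for this example.

```asm
# Sketch only: assumes Linux x86-64, System V AMD64 ABI, built with gcc.
    .data
sample: .asciz "Hell 1234"          # hypothetical test string

    .text
    .globl main
main:
    leaq    sample(%rip), %rdi      # argument: address of the string
    call    pstrlen
    ret                             # length is returned as the exit status

# pstrlen: rdi = address of zero-terminated string, returns byte count in rax
pstrlen:
    xorl    %eax, %eax              # length counter = 0
1:
    movb    (%rdi,%rax), %cl        # fetch ONE byte (not 8) from string + length
    cmpb    $0, %cl                 # immediate zero needs the $ prefix
    je      2f                      # zero byte found: done
    incq    %rax                    # count this byte
    jmp     1b                      # and look at the next one
2:
    ret

    .section .note.GNU-stack,"",@progbits
```

The important difference from the code in the question is the operand size: movb and cmpb work on a single byte, so the zero terminator cannot hide inside a wider 8-byte value.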
Once you have a good idea of what the code should do, it will be easy for you to catch things like swapped arguments in the debugger (for example movq %rcx, d2: why do you put 8 bytes from rcx into memory at address d2? That will overwrite the input string), and similar. So you actually don't need to understand the instructions and gas syntax 100% well, just well enough to produce something and then "fix" it over several iterations, like checking the register+memory view, realizing that rcx didn't change but the string data were damaged instead => try it the other way...

Oh, and I completely forgot... you need to find the documentation for your 64b platform ABI, so you know the correct way to pass arguments to C functions.
For example in linux these tutorials may help: http://cs.lmu.edu/~ray/notes/gasexamples/
And search here for word "ABI" for further resources: https://stackoverflow.com/tags/x86/info
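To tie the ABI remark to the original task, here is a hedged sketch of a complete program that reads a word, passes its address to a function of its own, and prints the returned length. It assumes the System V AMD64 ABI used on 64-bit Linux (integer and pointer arguments in rdi, rsi, rdx, rcx, r8, r9; return value in rax; al set to the number of vector registers for variadic functions; rsp 16-byte aligned at each call) and a build with gcc. The labels formatIn and lengthOut are names invented for this example.

```asm
# Sketch only: assumes Linux x86-64, System V AMD64 ABI, built with gcc.
    .lcomm  d2, 255                 # input buffer from the question

    .data
formatIn:  .asciz "%254s"           # hypothetical label; leaves room for the zero byte
lengthOut: .asciz "%zu\n"           # hypothetical label; prints a size_t value

    .text
    .globl main
main:
    subq    $8, %rsp                # rsp must be 16-byte aligned at each call

    leaq    formatIn(%rip), %rdi    # 1st argument of scanf
    leaq    d2(%rip), %rsi          # 2nd argument of scanf
    xorl    %eax, %eax              # variadic call: no vector registers used
    call    scanf@PLT

    leaq    d2(%rip), %rdi          # 1st argument of our own function
    call    pstrlen                 # result comes back in rax

    leaq    lengthOut(%rip), %rdi   # 1st argument of printf
    movq    %rax, %rsi              # 2nd argument: the computed length
    xorl    %eax, %eax
    call    printf@PLT

    xorl    %eax, %eax              # main returns 0
    addq    $8, %rsp
    ret

# pstrlen: rdi = address of zero-terminated string, returns byte count in rax
pstrlen:
    xorl    %eax, %eax              # length counter = 0
1:
    cmpb    $0, (%rdi,%rax)         # is the byte at string + length zero?
    je      2f
    incq    %rax
    jmp     1b
2:
    ret

    .section .note.GNU-stack,"",@progbits
```

Note that nothing is pushed on the stack to pass the argument: under this calling convention the string address simply travels in rdi, which is why reading 16(%rbp) inside the function, as the question's code does, does not find it.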