Note: The examples are simplified for brevity, so they do not justify my intentions. If I were just writing to a memory location exactly as in the example, C would be the best approach. However, I'm doing things where I can't use C, even though in general it would be best to stay in C.
I'm trying to load registers with values, but I'm stuck with 8-bit immediates.
My code:
https://godbolt.org/z/8EE45Gerd
#include <cstdint>

void a(uint32_t value) {
    *(volatile uint32_t *)(0x21014) = value;
}

void b(uint32_t value) {
    asm (
        "push ip \n\t"
        "mov ip, %[gpio_out_addr_high] \n\t"
        "lsl ip, ip, #8 \n\t"
        "add ip, %[gpio_out_addr_low] \n\t"
        "lsl ip, ip, #2 \n\t"
        "str %[value], [ip] \n\t"
        "pop ip \n\t"
        :
        : [gpio_out_addr_low] "I"((0x21014 >> 2) & 0xff),
          [gpio_out_addr_high] "I"((0x21014 >> (2+8)) & 0xff),
          [value] "r"(value)
    );
}

// adding -march=ARMv7E-M will not allow 16-bit immediate
// void c(uint32_t value) {
//     asm (
//         "mov ip, %[gpio_out_addr] \n\t"
//         "str %[value], [ip] \n\t"
//         :
//         : [gpio_out_addr] "I"(0x1014),
//           [value] "r"(value)
//     );
// }

int main() {
    a(20);
    b(20);
    return 0;
}
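The shift-and-mask arithmetic that b() relies on can be checked at compile time. The following is only a sketch of that arithmetic in portable C++; the names kGpioOutAddr, kLow, kHigh and rebuild are made up for illustration and do not appear in the linked code.

```cpp
#include <cstdint>

// Target address from the example above.
constexpr uint32_t kGpioOutAddr = 0x21014;

// The two 8-bit chunks that b() passes through the "I" constraints.
constexpr uint32_t kLow  = (kGpioOutAddr >> 2) & 0xff;        // bits 9..2
constexpr uint32_t kHigh = (kGpioOutAddr >> (2 + 8)) & 0xff;  // bits 17..10

// Mirrors the register arithmetic in b():
//   mov ip, high ; lsl ip, ip, #8 ; add ip, low ; lsl ip, ip, #2
constexpr uint32_t rebuild(uint32_t high, uint32_t low) {
    return ((high << 8) + low) << 2;
}

// Holds only because the two lowest bits of the address are zero
// and the address fits in 18 bits.
static_assert(rebuild(kHigh, kLow) == kGpioOutAddr,
              "two 8-bit chunks must reproduce the address");
```

An address with either of its two lowest bits set, or one wider than 18 bits, would not survive this round trip.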
When I write C code (see a()), it gets assembled in Godbolt to:
a(unsigned char):
    mov r3, #135168
    str r0, [r3, #20]
    bx lr
I think it uses MOV as a pseudo-instruction. When I want to do the same in assembly, I could put the value into some memory location and load it with LDR. I think that's how the C code gets assembled when I use -march=ARMv7E-M (the MOV gets replaced with LDR); however, in many cases this will not be practical for me as I will be doing other things with it.
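One way to test whether a constant such as 135168 (0x21000) fits a single classic ARM-state MOV is to check it against the A32 data-processing immediate rule: an 8-bit value rotated right by an even number of bits. The sketch below assumes that A32 rule only (Thumb-2 uses a different encoding scheme), and the helper names rotl32 and is_a32_immediate are hypothetical.

```cpp
#include <cstdint>

// Rotate a 32-bit value left by n bits (n in 0..31).
constexpr uint32_t rotl32(uint32_t v, unsigned n) {
    return n == 0 ? v : (v << n) | (v >> (32 - n));
}

// True if v equals some 8-bit value rotated right by an even amount,
// which is the A32 (ARM state) data-processing immediate form.
// Undoing a right rotation means rotating left by the same amount,
// so every even left rotation is tried in turn.
constexpr bool is_a32_immediate(uint32_t v, unsigned rot = 0) {
    return rot >= 32 ? false
         : rotl32(v, rot) <= 0xffu ? true
         : is_a32_immediate(v, rot + 2);
}

static_assert(is_a32_immediate(135168), "0x21000 is 0x21 rotated");
static_assert(!is_a32_immediate(0x21014), "set bits span more than 8 bits");
```

Under that assumption, 0x21000 plus the #20 offset on the STR would cover 0x21014 without a literal load, while 0x21014 itself is not encodable as one such immediate.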
In the case of the 0x21014 address, the lowest 2 bits are zero, so I can treat this 18-bit number as 16-bit when I shift it correctly. That's what I'm doing in b(), but I still have to pass it with 8-bit immediates. However, in the Keil documentation I noticed a mention of 16-bit immediates:
https://www.keil.com/support/man/docs/armasm/armasm_dom1359731146992.htm
https://www.keil.com/support/man/docs/armasm/armasm_dom1361289878994.htm
In ARMv6T2 and later, both ARM and Thumb instruction sets include:
A MOV instruction that can load any value in the range 0x00000000 to 0x0000FFFF into a register.
A MOVT instruction that can load any value in the range 0x0000 to 0xFFFF into the most significant half of a register, without altering the contents of the least significant half.
I think my Cortex-M4 should be ARMv7E-M, should meet this "ARMv6T2 and later" requirement, and should be able to use 16-bit immediates.
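The MOV/MOVT pairing quoted above amounts to splitting a 32-bit constant into two 16-bit halves. The following is a portable model of that split, not actual inline assembly; lower16, upper16 and mov_then_movt are illustrative names, not instructions or assembler operators.

```cpp
#include <cstdint>

// Low half: the value a 16-bit-immediate MOV would place in the register
// (the upper 16 bits end up zero).
constexpr uint32_t lower16(uint32_t v) { return v & 0xffffu; }

// High half: the value MOVT would write into bits 31..16.
constexpr uint32_t upper16(uint32_t v) { return v >> 16; }

// Model of MOV (16-bit immediate) followed by MOVT on the same register:
// MOVT replaces the top half and leaves the bottom half untouched.
constexpr uint32_t mov_then_movt(uint32_t lo, uint32_t hi) {
    return (hi << 16) | (lo & 0xffffu);
}

static_assert(mov_then_movt(lower16(0x21014), upper16(0x21014)) == 0x21014,
              "two 16-bit halves rebuild the full address");
```

For 0x21014 the halves are 0x1014 and 0x2, and 0x1014 is the immediate used in the commented-out c().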
However, in the GCC inline assembly documentation I do not see any such mention:
https://gcc.gnu.org/onlinedocs/gcc/Machine-Constraints.html
And when I enable the ARMv7E-M arch and uncomment c(), where I use the regular "I" immediate, I get a compilation error:
<source>: In function 'void c(uint8_t)':
<source>:29:6: warning: asm operand 0 probably doesn't match constraints
29 | );
| ^
<source>:29:6: error: impossible constraint in 'asm'
So I wonder: is there a way to use 16-bit immediates with GCC inline assembly, or am I missing something (that would make my question irrelevant)?
Side question: is it possible to disable these pseudo-instructions in Godbolt? I have seen them used with RISC-V assembly as well, but I would prefer to see the disassembled real machine code, to see exactly which instructions these pseudo/macro assembly instructions resulted in.
@Jester in the comments recommended either using the i constraint to pass larger immediates, or using a real C variable, initializing it with the desired value and letting the inline assembly take it. This sounds like the best solution: the less time spent in inline assembly, the better. People wanting better performance often underestimate how powerful the C/C++ toolchain can be at optimizing when given correct code, and for many, rewriting the C/C++ code is the answer instead of redoing everything in assembly. @Peter Cordes mentioned not using inline assembly, and I concur. However, in this case the exact timing of some instructions was critical, and I couldn't risk the toolchain optimizing them slightly differently.
Bit-banging protocols is not ideal, and in most cases the answer is to avoid bit-banging; however, in my case it's not that simple and other approaches didn't work.
Long story short, bit-banging is bad and there are usually better ways around it, and unnecessarily using inline assembly might produce worse results without you knowing, but in my case I needed it. In my previous code I was trying to focus on a simple question about the immediates and not go off on tangents or into an X-Y problem discussion.
So, back to the topic of 'passing bigger immediates to the assembly'. Here is the implementation of a much more real-world example:
https://godbolt.org/z/5vbb7PPP5
@David Wohlferd's comment was that writing less assembly gives the toolchain more chances to optimize the 'load of addresses into the registers': with inlining it shouldn't load the addresses again (so they are loaded only once even though there are multiple invocations of reads/writes). Here it is with inlining enabled:
https://godbolt.org/z/K8GYYqrbq
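As a rough illustration of the 'real C variable' approach, the address can be passed through an "r" constraint so the compiler decides how to get it into a register, leaving only the timing-critical store in assembly. This is a minimal sketch and not the code behind the links above: store_word and demo_store are hypothetical names, and the non-ARM branch exists only so the snippet also builds on a desktop compiler.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: the compiler materialises `addr` in a register
// however it sees fit; the asm contains only the store itself.
static inline void store_word(volatile uint32_t *addr, uint32_t value) {
#if defined(__arm__)
    asm volatile(
        "str %[value], [%[addr]] \n\t"
        :
        : [addr] "r"(addr), [value] "r"(value)
        : "memory");  // the asm writes memory the compiler cannot see
#else
    *addr = value;    // portable fallback so the sketch runs anywhere
#endif
}

// Usage on a plain variable standing in for a GPIO register.
inline uint32_t demo_store() {
    volatile uint32_t fake_reg = 0;
    store_word(&fake_reg, 20);
    assert(fake_reg == 20);
    return fake_reg;
}
```

When such a helper is inlined at several call sites with the same constant address, the compiler is free to load that address once and reuse the register.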
And the question: was it worth it? I think yes. My TCK is dead on 8 MHz and my duty cycle is close to 50%, and I have more confidence that the duty cycle will stay as it is. The sampling is done when I expect it to be done, and I don't have to worry about it getting optimized differently with different toolchain settings.