I am developing a C++ application for ARM using GCC. I have run into an issue where, if no optimizations are enabled, I am unable to create a binary (ELF) for my code because it will not fit in the available space. However, if I simply enable optimization for debugging (`-Og`), which is the lowest level of optimization available to my knowledge, the code easily fits.
In both cases, `-ffunction-sections`, `-fdata-sections`, `-fno-exceptions`, and `-Wl,--gc-sections` are enabled.
- Flash size: 512 kB
- Without optimizations: `.text` overflows by ~200 kB
- With `-Og`: `.text` is ~290 kB
This is a huge difference in binary size, even with minimal optimizations.
I took a look at *3.11 Options That Control Optimization* in the GCC manual for details on what optimizations are performed with the `-Og` flag, to see if that would give me any insight.
What optimization flags affect binary size the most? Is there anything I should be looking for to explain this massive difference?
Most of the extra code-size for an un-optimized build is the fact that the default `-O0` also means a debug build, not keeping anything in registers across statements for consistent debugging even if you use a GDB `jump` command to jump to a different source line in the same function. `-O0` means a huge amount of store/reload vs. even the lightest level of optimization, especially disastrous for code-size on a non-CISC ISA that can't use memory source operands. *Why does clang produce inefficient asm with -O0 (for this simple floating point sum)?* applies to GCC equally.

Especially for modern C++, a debug build is disastrous because simple template wrapper functions that normally inline and optimize away to nothing in simple cases (or maybe one instruction) instead compile to actual function calls that have to set up args and run a `call` instruction. e.g. for a `std::vector`, the `operator[]` member function can normally inline to a single `ldr` instruction, assuming the compiler has the `.data()` pointer in a register. But without inlining, every call-site takes multiple instructions¹.

**Options that affect code-size in the actual `.text` section² the most:** alignment of branch-targets in general, or just loops, costs some code-size. Other than that:

`-ftree-vectorize` - makes SIMD versions of loops, also necessitating scalar cleanup if the compiler can't prove that the iteration count will be a multiple of the vector width. (Or that pointed-to arrays are non-overlapping if you don't use `restrict`; that may also need a scalar fallback.) Enabled at `-O3` in GCC11 and earlier. Enabled at `-O2` in GCC12 and later, like clang.

`-funroll-loops` / `-funroll-all-loops` - not enabled by default even at `-O3` in modern GCC. Enabled with profile-guided optimization (`-fprofile-use`), when it has profiling data from a `-fprofile-generate` build to know which loops are actually hot and worth spending code-size on. (And which are cold and thus should be optimized for size so you get fewer I-cache misses when they do run, and less eviction of other code.) PGO also influences vectorization decisions.

Related to loop unrolling are heuristics (tuning knobs) that control loop peeling (fully unrolling) and how much to unroll. The normal way to set these is with `-march=native`, implying `-mtune=`whatever as well. `-mtune=znver3` may favour big unroll factors (at least clang does), compared to `-mtune=sandybridge` or `-mtune=haswell`. But there are GCC options to manually adjust individual things, as discussed in comments on *gcc: strange asm generated for simple loop* and in *How to ask GCC to completely unroll this loop (i.e., peel this loop)?*

There are options to override the weights and thresholds for other decision heuristics like inlining, too, but it's very rare you'd want to fine-tune that much unless you're working on refining the defaults, or finding good defaults for a new CPU.
`-Os` - optimize for size and speed, trying not to sacrifice too much speed. A good tradeoff if your code has a lot of I-cache misses, otherwise `-O3` is normally faster, or at least that's the design goal for GCC. It can be worth trying different options to see if `-O2` or `-Os` make your code faster than `-O3` across some CPUs you care about; sometimes missed-optimizations or quirks of certain microarchitectures make a difference, as in *Why does GCC generate 15-20% faster code if I optimize for size instead of speed?* which has actual benchmarks from GCC4.6 to 4.8 (current at the time) for a specific small loop in a test program, on quite a few different x86 and ARM CPUs, with and without `-march=native` to actually tune for them. There's zero reason to expect that to be representative of other code, though, so you need to test yourself for your own codebase. (And for any given loop, small code changes could make a different compile option better on any given CPU.)

And obviously `-Os` is very useful if you need your static code-size smaller to fit in some size limit.

`-Oz` - optimize for size only, even at a large cost in speed. GCC only very recently added this to current trunk, so expect it in GCC12 or 13. Presumably what I wrote below about clang's implementation of `-Oz` being quite aggressive also applies to GCC, but I haven't yet tested it.

Clang has similar options, including `-Os`. It also has a `clang -Oz` option to optimize only for size, without caring about speed. It's very aggressive, e.g. on x86 using code-golf tricks like `push 1; pop rax` (3 bytes total) instead of `mov eax, 1` (5 bytes).

GCC's `-Os` unfortunately chooses to use `div` instead of a multiplicative inverse for division by a constant, costing lots of speed but not saving much if any size (https://godbolt.org/z/x9h4vx1YG for x86-64). For ARM, GCC `-Os` still uses an inverse if you don't use a `-mcpu=` that implies `udiv` is even available, otherwise it uses `udiv`: https://godbolt.org/z/f4sa9Wqcj.

Clang's `-Os` still uses a multiplicative inverse with `umull`, only using `udiv` with `-Oz` (or a call to the `__aeabi_uidiv` helper function without any `-mcpu` option). So in that respect, `clang -Os` makes a better tradeoff than GCC, still spending a little bit of code-size to avoid slow integer division.

**Footnote 1: inlining or not for `std::vector`**

Godbolt with GCC, with the default `-O0` vs. `-Os` for `-mcpu=cortex-m7` just to randomly pick something. IDK if it's normal to use dynamic containers like `std::vector` on an actual microcontroller; probably not.

vs. a debug build (with name-demangling enabled for the asm).
As you can see, un-optimized GCC cares more about fast compile-times than even the most simple things like avoiding useless `mov reg,reg` instructions, even within the code for evaluating one expression.

**Footnote 2: metadata**
If you count a whole ELF executable with metadata, not just the `.text` + `.rodata` + `.data` you'd need to burn to flash, then of course `-g` debug info is very significant for the size of the file, but basically irrelevant because it's not mixed in with the parts that are needed while running, so it just sits there on disk.

Symbol names and debug info can be stripped with `gcc -s` or `strip`.
Stack-unwind info is an interesting tradeoff between code-size and metadata. `-fno-omit-frame-pointer` wastes extra instructions and a register as a frame pointer, leading to larger machine-code size, but smaller `.eh_frame` stack-unwind metadata. (`strip` does not consider that "debug" info by default, even for C programs, not just C++ where exception-handling might need it in non-debugging contexts.)

*How to remove "noise" from GCC/clang assembly output?* mentions how to get the compiler to omit some of that:
`-fno-asynchronous-unwind-tables` omits `.cfi` directives in the asm output, and thus the metadata that goes into the `.eh_frame` section. Also `-fno-exceptions -fno-rtti` with C++ can reduce metadata. (Run-Time Type Information for reflection takes space.)

Linker options that control alignment of sections / ELF segments can also take extra space; that's relevant for tiny executables, but it's basically a constant amount of space, not scaling with the size of the program. See also *Minimal executable size now 10x larger after linking than 2 years ago, for tiny programs?*