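Did you know that it is possible (for gcc, not clang) to compile 32-bit x86 binaries with most of the benefits of x86_64 without using the problematic x32 ABI?
Note that any libraries you link against must have been compiled with the same ABI-changing flags (-mregparm, -msseregparm, -mrtd, -mlong-double-64, -freg-struct-return), since these alter the calling convention and type sizes.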
-Os -m32 -mlong-double-64 -mfpmath=sse -mregparm=3 -msseregparm -fomit-frame-pointer -mrtd -freg-struct-return -mpush-args -mno-accumulate-outgoing-args
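Because -mlong-double-64 changes the format of long double, it can be handy to check which format a given build actually ended up with before mixing it with other objects. A minimal sketch, assuming only a C99 compiler and <float.h>; the helper names are hypothetical and not part of any library:

```c
#include <float.h>

/* Hypothetical helper: returns 1 when long double has the same mantissa
   width as double, which is what -mlong-double-64 selects on x86. */
int long_double_is_64bit(void)
{
    return LDBL_MANT_DIG == DBL_MANT_DIG;
}

/* Hypothetical helper: storage size of long double in bytes for this build. */
unsigned long_double_size(void)
{
    return (unsigned)sizeof(long double);
}
```

Calling both helpers from a library and from the program that links it, and comparing the answers, is one cheap way to catch a mismatch.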
Here is a simple example of the floating point (and argument passing) improvements, from gcc.godbolt.org:
float squaref(float n){return n * n;}
double square(double n){return n * n;}
long double squarel(long double n){return n*n;}
int squarei(int a){return a*a;}
long long squareil(long x){return (long long)x*x;}
Output with the flags above:
squaref:
mulss %xmm0, %xmm0
ret
square:
mulsd %xmm0, %xmm0
ret
squarel:
mulsd %xmm0, %xmm0
ret
squarei:
imull %eax, %eax
ret
squareil:
imull %eax
ret
Output without them (arguments on the stack, x87 math), for comparison:
squaref:
pushl %ebp
movl %esp, %ebp
pushl %eax
flds 8(%ebp)
fmul %st(0), %st
fstps -4(%ebp)
flds -4(%ebp)
leave
ret
square:
pushl %ebp
movl %esp, %ebp
subl $8, %esp
fldl 8(%ebp)
fmul %st(0), %st
fstpl -8(%ebp)
fldl -8(%ebp)
leave
ret
squarel:
pushl %ebp
movl %esp, %ebp
fldt 8(%ebp)
popl %ebp
fmul %st(0), %st
ret
squarei:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
popl %ebp
imull %eax, %eax
ret
squareil:
pushl %ebp
movl %esp, %ebp
movl 8(%ebp), %eax
popl %ebp
imull %eax
ret
-Os -m32 -march=pentium-m -mtune=generic :
compile for size on x86 with instructions available to Pentium M but tune it for a generic CPU
Replace -march=pentium-m with -march=pentium3m to avoid SSE2 instructions. Any code using a double or long double instead of a float will then be suboptimal, because SSE1 only handles single precision and that math falls back to the x87 (you may want to use -mfpmath=both or add a nasty hack like -Ddouble=float).
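A somewhat less nasty alternative to -Ddouble=float is a project-wide typedef, so that only your own code changes type and system headers are left alone. This is only a sketch: real_t, REAL_C, USE_DOUBLE_REAL and mix_real are made-up names for illustration.

```c
/* Hypothetical project-wide floating point type.
   Build with -DUSE_DOUBLE_REAL to switch back to double. */
#ifdef USE_DOUBLE_REAL
typedef double real_t;
#define REAL_C(x) x        /* plain constant stays a double */
#else
typedef float real_t;
#define REAL_C(x) x##f     /* pastes an f suffix: REAL_C(0.5) becomes 0.5f */
#endif

/* Linear blend between a and b; all arithmetic stays in real_t. */
real_t mix_real(real_t a, real_t b, real_t t)
{
    return a + (b - a) * t;
}
```

Writing constants through REAL_C keeps an unsuffixed 0.5 from silently promoting the whole expression to double.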
-mregparm=3 -msseregparm -mno-fp-ret-in-387 -mfpmath=sse -freg-struct-return :
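pass up to 3 integer arguments in registers (eax, edx, ecx) as well as up to 3 floating point arguments in SSE registers (xmm*), return floating point values in SSE registers instead of on the x87 stack, use SSE instructions for floating point math, and return small structs in registers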
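-mpush-args -mno-accumulate-outgoing-args -mpreferred-stack-boundary=2 -fomit-frame-pointer -mrtd -mskip-rax-setup :
avoids leftover prologue/epilogue code and (usually unnecessary) stack manipulation, thereby decreasing code size
(Note: all called functions must have a prototype in scope. With -mrtd the callee pops its own arguments, so the compiler has to know about variadic functions such as printf, and calling a function with the wrong number of arguments produces broken code.)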
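If you cannot rebuild every library with these flags, gcc also lets you pick the calling convention per function with the regparm and stdcall attributes, which roughly correspond to -mregparm and -mrtd. A minimal sketch, assuming gcc on i386; FASTABI, add3 and scale_sum are hypothetical names, and the macro expands to nothing on other targets so the code still compiles there:

```c
/* Hypothetical macro: register arguments plus callee-pops on 32-bit x86,
   nothing elsewhere (the attributes are not meaningful on x86_64). */
#if defined(__GNUC__) && defined(__i386__)
#define FASTABI __attribute__((regparm(3), stdcall))
#else
#define FASTABI
#endif

/* Prototypes come first: every caller must see the calling convention
   before the first call, just as with the global flags. */
FASTABI int add3(int a, int b, int c);
FASTABI int scale_sum(int a, int b, int c, int k);

/* On i386 all three arguments arrive in eax, edx and ecx. */
FASTABI int add3(int a, int b, int c)
{
    return a + b + c;
}

/* The fourth argument goes on the stack and is popped by the callee. */
FASTABI int scale_sum(int a, int b, int c, int k)
{
    return add3(a, b, c) * k;
}
```

Functions exported to code built with the default ABI simply omit the macro.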
You may also want to use these in your CFLAGS:
-flto OR -ffunction-sections -fdata-sections (with -Wl,--gc-sections in LDFLAGS) :
These let the compiler and linker discard unused functions and data.
Unless doing a debug build I also use:
-g0 -fno-unwind-tables -fno-asynchronous-unwind-tables -feliminate-dwarf2-dups -fno-dwarf2-cfi-asm :
don't emit DWARF debugging and unwind information that a non-debug build doesn't need
-fno-ident :
don't emit compiler info
-fmerge-all-constants :
duplicate constants are stored in 1 place (IIRC can be problematic with loadable modules though)
-fweb :
sometimes helps with optimization of larger programs
-ffast-math -fshort-double -fsingle-precision-constant :
like -mlong-double-64, these reduce code size at the cost of floating point precision and standards compliance. Probably OK for an mp3 decoder, but not for calculating ballistic missile trajectories.
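To get a feel for what that precision cost means in practice, here is a small sketch (compiled without the flags above) of a case where float silently drops a result that double keeps. The function names are made up for illustration; volatile is only there to force the value to be stored at its declared width.

```c
/* float has a 24-bit mantissa, so 2^24 + 1 is not representable
   and rounds back down to 2^24. Returns 1 when the increment is lost. */
int float_drops_increment(void)
{
    volatile float f = 16777216.0f;   /* 2^24 */
    f = f + 1.0f;
    return f == 16777216.0f;
}

/* double has a 53-bit mantissa, so the same sum is exact.
   Returns 1 when the increment is kept. */
int double_keeps_increment(void)
{
    volatile double d = 16777216.0;   /* 2^24 */
    d = d + 1.0;
    return d == 16777217.0;
}
```

Code that counts or accumulates in floating point past this range is the kind of thing to check before narrowing double to float.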
For C++ (CXXFLAGS):
-fno-exceptions -fno-rtti -fvtable-gc :
disable exceptions and run-time type information, and remove unused virtual method tables
Note: with only -msse (no SSE2), you should avoid math operations on doubles and use float instead, since SSE1 only handles single precision.
However, gcc seems to be generating SSE2 instructions anyhow.
See also:
https://gcc.gnu.org/onlinedocs/gcc/Opti ... tions.html
https://gcc.gnu.org/onlinedocs/gcc/Code ... tions.html
https://gcc.gnu.org/onlinedocs/gcc/Debu ... tions.html
AND
https://gcc.gnu.org/onlinedocs/gcc/x86-Options.html
OR one of these
https://gcc.gnu.org/onlinedocs/gcc/Subm ... tions.html
Random footnote: -mlong-double-64 was added for bionic libc in gcc-4.8, in case anyone would like to backport it to the gcc 4.7 series (the last one implemented in C rather than C++):
https://gcc.gnu.org/ml/gcc-patches/2012 ... 01512.html