gcc link time optimization and the Linux kernel
Andi Kleen
Intel OTC
Apr 2013
[email protected]

Acknowledgments
● Lots of people helped/contributed
● Ralf Baechle, Richard Biener, Tim Bird, Honza Hubicka, H.J. Lu, Joe Mario, Markus Trippelsdorf, Changlong Xie, others

Why LTO?
● Optimize over the whole binary
  – Not just function (3.0) or file (<4.5)
● Avoid inline dependency hell in header files
● Without changing Makefiles significantly

gcc 4.7+ LTO WHOPR crash course
● Compiler parses files, writes GIMPLE to object files (LGEN)
  – Function optimization summaries are computed
● Linker calls lto1 with all files for sequential whole program analysis (WPA)
  – Merges types, generates global callgraph, IPA optimization summaries, writes partitions
● Generate code per partition in parallel (LTRANS)
  – Inline inside partition, run IPA optimizations for real, run per function optimizations, generate object code

LTO IPA optimizations (4.8)
● Inlining between files
● Function cloning for specific arguments
● Constant propagation
● Scalar replacement of aggregates
● Constructor / destructor merging
● De-virtualization
● Change ABI (SSE)
● Remove unused code (-fwhole-program)
● Increase alignment
● Discover pure/const
● Keep globals/statics alive over calls
green: does not benefit kernel today

Build time
[Figure: four bar charts for the small config, each comparing Gcc 4.7, Gcc 4.8, 4.7 -lto and 4.8 -lto. Panels: build time; user time (s); faults (M); parallelism (user time / real time)]
● LTO is slow and not parallel enough

Parallelism small build 4.7
[Figure: runnable processes over time (s), LTO kernel build, small config. Annotated phases: object file generation (parsing / LGEN); modules without job server; LTRANS code generation; WPA + real linker, type merging (4 times)]

Small config parallelism
[Figure: bar chart of user time / runtime for Gcc 4.7, Gcc 4.7 no lto, Gcc 4.8 and Gcc 4.8 no lto]
● WHOPR still has poor parallelism

Multilink
● vmlinux links 2-4x: runs LTO that often
● Generates integrated symbol table (kallsyms)
● So far not fixed
  – KALLSYMS can be disabled
● One of those unexpected quirks of real build systems
[Figure: bar chart comparing Gcc 4.8 w/o kallsyms, Gcc 4.8 no lto and Gcc 4.7 with kallsyms]

Memory usage 4.7 small build
[Figure: active memory (GB) over time (s), kernel LTO build, small config]
[Figure: bar chart of faults (M), small config, for Gcc 4.7, Gcc 4.8, 4.7 -lto and 4.8 -lto]
● Even small build swaps in 4GB system
● Memory peaks all in WPA

Memory consumption
● Temporary data can be a problem, together with WPA
  – Early on, many swap storms with /tmp = tmpfs
  – Partitioning algorithm was improved
  – Use TMPDIR=objdir
  – With modules, need to avoid too large -j* for parallel WPA
  – Jobserver has to be disabled, which makes it worse
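
The two workarounds from the memory consumption slide (point TMPDIR at the object directory, and keep -j small) can be sketched as a small build wrapper. This is only an illustration, not part of the original talk: the helper names `lto_build_invocation` and `run_lto_build` are hypothetical, the use of an out-of-tree build via `make O=<objdir>` is an assumption, and the slides do not say what -j value is safe, so the job count is left to the caller.

```python
import os
import subprocess


def lto_build_invocation(objdir, jobs, base_env=None):
    """Return (argv, env) for a build whose temporaries live in objdir.

    Hypothetical helper: it only prepares the command and environment.
    """
    env = dict(os.environ if base_env is None else base_env)
    objdir = os.path.abspath(objdir)
    # Keep compiler temporaries on disk in the object directory instead of
    # a tmpfs-backed /tmp, where they would compete with WPA for RAM.
    env["TMPDIR"] = objdir
    # Cap the job count (never below 1) so parallel link steps do not all
    # hit their memory peak at once. The right value is machine dependent.
    argv = ["make", f"O={objdir}", f"-j{max(1, int(jobs))}"]
    return argv, env


def run_lto_build(objdir, jobs):
    """Create objdir if needed, run the build, return make's exit status."""
    argv, env = lto_build_invocation(objdir, jobs)
    os.makedirs(env["TMPDIR"], exist_ok=True)
    return subprocess.run(argv, env=env, check=False).returncode
```

For example, `run_lto_build("build-lto", 4)` would run the build from the current source tree with TMPDIR set to the absolute path of `build-lto`. Splitting command preparation from execution keeps the environment handling easy to inspect without starting a build.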