
gcc link time optimization and the Linux kernel Andi Kleen Intel OTC PDF

29 Pages·2013·0.22 MB·English


gcc link time optimization and the Linux kernel
Andi Kleen, Intel OTC, Apr 2013
[email protected]

Acknowledgments
● Lots of people helped/contributed
● Ralf Baechle, Richard Biener, Tim Bird, Honza Hubicka, H.J. Lu, Joe Mario, Markus Trippelsdorf, Changlong Xie, others

Why LTO?
● Optimize over the whole binary
  – Not just function (3.0) or file (<4.5)
● Avoid inline dependency hell in header files
● Without changing Makefiles significantly

gcc 4.7+ LTO WHOPR crash course
● Compiler parses files, writes GIMPLE to object files (LGEN)
  – Function optimization summaries are computed
● Linker calls lto1 with all files for sequential whole program analysis (WPA)
  – Merges types, generates global callgraph, IPA optimization summaries, writes partitions
● Generate code per partition in parallel (LTRANS)
  – Inline inside partition, run IPA optimizations for real, run per function optimizations, generate object code

LTO IPA optimizations (4.8)
● Inlining between files
● Function cloning for specific arguments
● Constant propagation
● Scalar replacement of aggregates
● Constructor / destructor merging
● De-virtualization
● Change ABI (SSE)
● Remove unused code (-fwhole-program)
● Increase alignment
● Discover pure/const
● Keep globals/statics alive over calls
green: does not benefit kernel today

Build time
[Figure: four bar charts for the small config, each comparing Gcc 4.7, Gcc 4.8, 4.7 -lto and 4.8 -lto. Panels: "Build time small config", "User time small config" (user time in s), "Faults small config" (faults in M), "Parallelism" (user time / real time).]
LTO is slow and not parallel enough

Parallelism small build 4.7
[Figure: "Runnable processes, LTO kernel build small config". Y axis: runnable processes; x axis: time (s). Annotated phases: "Object file generation, Parsing / LGEN", "Modules without job server", "LTRANS code generation", "WPA + real linker, Type merging", "4 times".]

Small config parallelism
[Figure: bar chart "User time / Runtime" for Gcc 4.7, Gcc 4.7 no lto, gcc 4.8, Gcc 4.8 no lto.]
WHOPR still has poor parallelism

Multilink
● vmlinux links 2-4x: runs LTO that often
● Generates integrated symbol table (kallsyms)
● So far not fixed
  – KALLSYMS can be disabled
● One of those unexpected quirks of real build systems
[Figure: bar chart comparing Gcc 4.8 w/o kallsyms, Gcc 4.8 no lto and Gcc 4.7 with kallsyms.]

Memory usage 4.7 small build
[Figure: "Active memory, Kernel LTO build small config". Y axis: active mem (GB); x axis: time (s).]
[Figure: "Faults small config". Y axis: faults (M); bars for Gcc 4.7, Gcc 4.8, 4.7 -lto, 4.8 -lto.]
Even small build swaps in 4GB system
Memory peaks all in WPA

Memory consumption
● Temporary data can be a problem, together with WPA
  – Early many swap storms with /tmp = tmpfs
  – Partitioning algorithm was improved
  – Use TMPDIR=objdir
  – With modules need to avoid too large -j* for parallel WPA
  – Jobserver has to be disabled, makes it worse
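
The three WHOPR phases from the crash-course slide (LGEN, WPA, LTRANS) can be pictured as a toy pipeline. The sketch below is not how gcc implements them. It is a minimal Python model, with hypothetical file names, a made-up call graph and illustrative helper names (`lgen`, `wpa`, `ltrans`, `build`), meant only to show why the middle phase is a serial bottleneck between two parallel ones.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical "source files": each maps a function name to the functions it calls.
SOURCES = {
    "a.c": {"main": ["helper", "log"]},
    "b.c": {"helper": ["log"]},
    "c.c": {"log": [], "unused": []},
}


def lgen(item):
    """LGEN stand-in: per-file step that emits a summary (here, the local call edges)."""
    name, funcs = item
    return {"file": name, "edges": funcs}


def wpa(summaries, n_partitions):
    """WPA stand-in: sequential step that merges all summaries into one call graph,
    drops functions unreachable from main, and splits the rest into partitions."""
    graph = {}
    for summary in summaries:
        graph.update(summary["edges"])
    reachable, stack = set(), ["main"]
    while stack:
        func = stack.pop()
        if func not in reachable:
            reachable.add(func)
            stack.extend(graph[func])
    ordered = sorted(reachable)
    # Round-robin split; real partitioning is far more careful than this.
    return [ordered[i::n_partitions] for i in range(n_partitions)]


def ltrans(partition):
    """LTRANS stand-in: per-partition step that represents code generation."""
    return [f"code({func})" for func in partition]


def build(sources, n_partitions=2, jobs=4):
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        summaries = list(pool.map(lgen, sources.items()))  # parallel, one task per file
        partitions = wpa(summaries, n_partitions)          # serial, needs every summary
        objects = list(pool.map(ltrans, partitions))       # parallel, one task per partition
    return partitions, objects


if __name__ == "__main__":
    parts, objs = build(SOURCES)
    print(parts)
    print(objs)
```

In this model the function `unused` is dropped during the whole-program step because nothing reachable from `main` calls it, and no LTRANS task can start until `wpa` has seen every summary, which mirrors the parallelism dip described in the slides.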
