Implicit Vectorisation
Stephen Blair-Chappell, Intel Compiler Labs

Copyright © 2012, Intel Corporation. All rights reserved. *Other brands and names are the property of their respective owners.

This training relies on you owning a copy of the following:

Parallel Programming with Parallel Studio XE
Stephen Blair-Chappell & Andrew Stokes
Wiley, ISBN: 9780470891650

Part I: Introduction
1: Parallelism Today
2: An Overview of Parallel Studio XE
3: Parallel Studio XE for the Impatient

Part II: Using Parallel Studio XE
4: Producing Optimized Code
5: Writing Secure Code
6: Where to Parallelize
7: Implementing Parallelism
8: Checking for Errors
9: Tuning Parallelism
10: Advisor-Driven Design
11: Debugging Parallel Applications
12: Event-Based Analysis with VTune Amplifier XE

Part III: Case Studies
13: The World's First Sudoku 'Thirty-Niner'
14: Nine Tips to Parallel Heaven
15: Parallel Track-Fitting in the CERN Collider
16: Parallelizing Legacy Code

What's in this section?
• A seven-step optimization process
• Using different compiler options to optimize your code
• Using auto-vectorization to tune your application to different CPUs

The Sample Application
• Initialises two matrices with a numeric sequence
• Does a matrix multiplication

The main loop (without timing & printf):

    // repeat the experiment six times
    for (l = 0; l < 6; l++) {
        // initialize matrix a
        sum = Work(&total, a);

        // initialize matrix b
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                for (k = 0; k < DENOM_LOOP; k++) {
                    sum += m / denominator;
                }
                b[N*i + j] = sum;
            }
        }

        // do the matrix manipulation
        MatrixMul((double (*)[N])a, (double (*)[N])b, (double (*)[N])c);
    }

The Matrix Multiply:

    void MatrixMul(double a[N][N], double b[N][N], double c[N][N])
    {
        int i, j, k;
        for (i = 0; i < N; i++) {
            for (j = 0; j < N; j++) {
                for (k = 0; k < N; k++) {
                    c[i][j] += a[i][k] * b[k][j];
                }
            }
        }
    }

The Seven Optimisation Steps
Example options are shown as Windows (Linux).

Step 1: Build with optimization disabled: /Od (-O0)
Step 2: Use general optimizations: /O1, /O2, /O3 (-O1, -O2, -O3)
Step 3: Use processor-specific options: /QxSSE4.2 (-xsse4.2), /QxHOST (-xhost)
Step 4: Add inter-procedural optimization: /Qipo (-ipo)
Step 5: Use profile-guided optimization: /Qprof-gen (-prof-gen), /Qprof-use (-prof-use)
Step 6: Tune automatic vectorization: /Qguide (-guide)
Step 7: Implement parallelism using the Intel family of parallel models, or use automatic parallelism: /Qparallel (-parallel)
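As a minimal sketch of the first three steps applied to the sample application: the file name matmul.c and the icc/icl driver invocations are assumptions for illustration only; the option spellings are the ones listed in the steps above. Timing the matrix multiplication after each build shows what each step contributes.

    # Step 1: optimization disabled, to establish a timing baseline
    icc -O0 matmul.c -o matmul            # Windows: icl /Od matmul.c

    # Step 2: general optimizations
    icc -O2 matmul.c -o matmul            # Windows: icl /O2 matmul.c

    # Step 3: processor-specific code generation (here SSE4.2)
    icc -O2 -xsse4.2 matmul.c -o matmul   # Windows: icl /O2 /QxSSE4.2 matmul.c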
Intel® Compiler Architecture (Step 2)

C++ Front End | FORTRAN Front End | Profiler
• Disambiguation: types, array, pointer, structure, directives
• Interprocedural analysis and optimizations: inlining, constant prop, whole program detect, mod/ref, points-to
• Loop optimizations: data deps, prefetch, vectorizer, unroll/interchange/fusion/dist, auto-parallel/OpenMP
• Global scalar optimizations: partial redundancy elim, dead store elim, strength reduction, dead code elim
• Code generation: vectorization, software pipelining, global scheduling, register allocation

Getting Visibility: Compiler Optimization Report (Step 2)

Compiler switch (Linux): -opt-report-phase[=phase], where phase can be:
• ipo – Interprocedural Optimization
• ilo – Intermediate Language Scalar Optimization
• hpo – High Performance Optimization
• hlo – High-level Optimization
• all – All optimizations (not recommended; the output is too verbose)

Control the level of detail in the report:
(Windows) /Qopt-report[0|1|2|3]
(Linux, Mac OS X) -opt-report[0|1|2|3]
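A usage sketch combining the two switches, with the file name matmul.c again assumed and the switch spellings taken from this slide; the exact report text varies between compiler versions.

    # request a medium-detail report (level 2) from the high-level
    # optimizer (hlo), the phase that performs the loop optimizations
    # listed in the architecture overview above
    icc -c -O2 -opt-report2 -opt-report-phase=hlo matmul.c

    # Windows spelling of the detail-level switch
    # icl /c /O2 /Qopt-report2 matmul.c

For the MatrixMul loop nest, the hlo report is the place to check whether loop transformations such as interchange or unrolling were applied.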