PathScale™ Compiler Suite User Guide
Version 3.1


Information furnished in this manual is believed to be accurate and reliable. However, PathScale LLC assumes no responsibility for its use, nor for any infringements of patents or other rights of third parties which may result from its use. PathScale LLC reserves the right to change product specifications at any time without notice. Applications described in this document for any of these products are for illustrative purposes only. PathScale LLC makes no representation nor warranty that such applications are suitable for the specified use without further testing or modification. PathScale LLC assumes no responsibility for any errors that may appear in this document. No part of this document may be copied nor reproduced by any means, nor translated nor transmitted to any magnetic medium without the express written consent of PathScale LLC. In accordance with the terms of their valid PathScale agreements, customers are permitted to make electronic and paper copies of this document for their own exclusive use. Linux is a registered trademark of Linus Torvalds. PathScale, the PathScale logo, and EKOPath are registered trademarks of PathScale, LLC. Red Hat and all Red Hat-based trademarks are trademarks or registered trademarks of Red Hat, Inc. SuSE is a registered trademark of SuSE Linux AG. All other brand and product names are trademarks or registered trademarks of their respective owners.

© 2007 PathScale, LLC. All rights reserved. © 2006, 2007 QLogic Corporation. All rights reserved worldwide. © 2004, 2005, 2006 PathScale. All rights reserved. First Published: April 2004 Printed in U.S.A. PathScale LLC, 2071 Stierlin Ct., Suite 200, Mountain View, CA 94043

Table of Contents

Section 1  Introduction
1.1 Conventions Used in This Document .... 1-1
1.2 Documentation Suite .... 1-2

Section 2  Compiler Quick Reference
2.1 What You Installed .... 2-1
2.2 How To Invoke the PathScale Compilers .... 2-1
2.2.1 Accessing the GCC 4.x Front-ends for C and C++ .... 2-2
2.3 Compiling for Different Platforms .... 2-3
2.3.1 Target Options for This Release .... 2-4
2.3.2 Defaults Flag .... 2-5
2.3.3 Compiling for an Alternate Platform .... 2-5
2.3.4 Compiling Option Tool: pathhow-compiled .... 2-6
2.4 Input File Types .... 2-6
2.5 Other Input Files .... 2-7
2.6 Common Compiler Options .... 2-8
2.7 Shared Libraries .... 2-8
2.8 Large File Support .... 2-9
2.9 Memory Model Support .... 2-9
2.9.1 Support for "Large" Memory Model .... 2-10
2.10 Debugging .... 2-11
2.11 Profiling: Locate Your Program's Hot Spots .... 2-11
2.12 taskset: Assigning a Process to a Specific CPU .... 2-12

Section 3  The PathScale Fortran Compiler
3.1 Using the Fortran Compiler .... 3-1
3.1.1 Fixed-form and Free-form Files .... 3-2
3.2 Modules .... 3-3
3.2.1 Order of Appearance .... 3-3
3.2.2 Linking Object Files to the Rest of the Program .... 3-4
3.3 Linking When the Main Program Is In a Library .... 3-4
3.3.1 Module-related Error Messages .... 3-4
3.4 Extensions .... 3-5
3.4.1 Promotion of REAL and INTEGER Types .... 3-5
3.4.2 Cray Pointers .... 3-6
3.4.3 Directives .... 3-6
3.4.3.1 F77 or F90 Prefetch Directives .... 3-7
3.4.3.2 Changing Optimization Using Directives .... 3-8
3.5 Compiler and Runtime Features .... 3-9
3.5.1 Preprocessing Source Files with -cpp .... 3-9
3.5.2 Preprocessing Source Files with -ftpp .... 3-9
3.5.3 Support for Varying Length Character Strings .... 3-9
3.5.4 Preprocessing Source Files with -fcoco .... 3-9
3.5.4.1 Pre-defined Macros .... 3-10
3.5.5 Error Numbers: The explain Command .... 3-11
3.5.6 Fortran 90 Dope Vector .... 3-13
3.5.7 Bounds Checking .... 3-13
3.5.8 Pseudo-random Numbers .... 3-13
3.6 Mixed Code .... 3-13
3.6.1 Calls between C and Fortran .... 3-14
3.6.1.1 Example: Calls between C and Fortran .... 3-15
3.6.1.2 Example: Accessing Common Blocks from C .... 3-18
3.7 Runtime I/O Compatibility .... 3-19
3.7.1 Performing Endian Conversions .... 3-19
3.7.1.1 The assign Command .... 3-19
3.7.1.2 Using the Wildcard Option .... 3-20
3.7.1.3 Converting Data and Record Headers .... 3-20
3.7.1.4 The ASSIGN( ) Procedure .... 3-20
3.7.1.5 I/O Compilation Flags .... 3-20
3.7.2 Reserved File Units .... 3-21
3.8 Source Code Compatibility .... 3-21
3.8.1 Fortran KINDs .... 3-21
3.9 Library Compatibility .... 3-22
3.9.1 Name Mangling .... 3-22
3.9.2 ABI Compatibility .... 3-23
3.9.3 Linking with g77-compiled Libraries .... 3-23
3.9.3.1 AMD Core Math Library (ACML) .... 3-24
3.9.4 List Directed I/O and Repeat Factors .... 3-24
3.9.4.1 Environment Variable .... 3-25
3.9.4.2 assign Command .... 3-25
3.10 Porting Fortran Code .... 3-26
3.11 Debugging and Troubleshooting Fortran .... 3-26
3.11.1 Writing to Constants Can Cause Crashes .... 3-27
3.11.2 Runtime Errors Caused by Aliasing Among Fortran Dummy Arguments .... 3-27
3.11.3 Fortran malloc Debugging .... 3-28
3.11.4 Arguments Copied to Temporary Variables .... 3-28
3.12 Fortran Compiler Stack Size .... 3-30

Section 4  The PathScale C/C++ Compiler
4.1 Using the C/C++ Compilers .... 4-2
4.1.1 Accessing the GCC 4.x Front-ends for C and C++ .... 4-2
4.2 Compiler and Runtime Features .... 4-3
4.2.1 Preprocessing Source Files .... 4-3
4.2.1.1 Pre-defined Macros .... 4-4
4.2.2 Pragmas .... 4-6
4.2.2.1 Pragma pack .... 4-6
4.2.2.2 Changing Optimization Using Pragmas .... 4-6
4.2.2.3 Code Layout Optimization Using Pragmas .... 4-6
4.2.3 Mixing Code .... 4-7
4.2.4 Linking .... 4-7
4.3 Debugging and Troubleshooting C/C++ .... 4-7
4.4 Unsupported GCC Extensions .... 4-8

Section 5  Porting and Compatibility
5.1 Getting Started .... 5-1
5.2 GNU Compatibility .... 5-1
5.3 Compatibility with Other Fortran Compilers .... 5-1
5.4 Porting Fortran .... 5-3
5.4.1 Intrinsics .... 5-3
5.4.1.1 An Example .... 5-4
5.4.2 Name-mangling .... 5-4
5.4.3 Static Data .... 5-4
5.5 Porting to x86_64 .... 5-4
5.6 Migrating from Other Compilers .... 5-5
5.7 Compatibility .... 5-5
5.7.1 gcc Compatibility Wrapper Script .... 5-5

Section 6  Tuning Quick Reference
6.1 Basic Optimization .... 6-1
6.2 IPA .... 6-1
6.3 Feedback Directed Optimization (FDO) .... 6-2
6.4 Aggressive Optimization .... 6-2
6.5 Compiler Flag Recommendations .... 6-3
6.6 Performance Analysis .... 6-4
6.7 Optimize Your Hardware .... 6-4

Section 7  Tuning Options
7.1 Basic Optimizations: The -O flag .... 7-1
7.2 Syntax for Complex Optimizations (-CG, -IPA, -LNO, -OPT, -WOPT) .... 7-2
7.3 Inter-Procedural Analysis (IPA) .... 7-3
7.3.1 The IPA Compilation Model .... 7-3
7.3.2 Inter-procedural Analysis and Optimization .... 7-4
7.3.2.1 Analysis .... 7-4
7.3.3 Optimization .... 7-5
7.3.4 Controlling IPA .... 7-7
7.3.4.1 Inlining .... 7-7
7.3.5 Cloning .... 7-9
7.3.6 Other IPA Tuning Options .... 7-9
7.3.6.1 Disabling Options .... 7-10
7.3.7 Case Study on SPEC CPU2000 .... 7-10
7.3.8 Invoking IPA .... 7-12
7.3.9 Size and Correctness Limitations to IPA .... 7-14
7.4 Loop Nest Optimization (LNO) .... 7-14
7.4.1 Loop Fusion and Fission .... 7-14
7.4.2 Cache Size Specification .... 7-15
7.4.3 Cache Blocking, Loop Unrolling, Interchange Transformations .... 7-16
7.4.4 Prefetch .... 7-16
7.4.5 Vectorization .... 7-17
7.5 Code Generation (-CG:) .... 7-17
7.6 Feedback Directed Optimization (FDO) .... 7-18
7.7 Aggressive Optimizations .... 7-19
7.7.1 Alias Analysis .... 7-19
7.7.2 Numerically Unsafe Optimizations .... 7-20
7.7.3 Fast-math Functions .... 7-21
7.7.4 IEEE 754 Compliance .... 7-21
7.7.4.1 Arithmetic .... 7-21
7.7.4.2 Roundoff .... 7-22
7.7.5 Other Unsafe Optimizations .... 7-23
7.7.6 Assumptions About Numerical Accuracy .... 7-23
7.7.6.1 Flush-to-Zero Behavior .... 7-24
7.8 Hardware Performance .... 7-24
7.8.1 Hardware Setup .... 7-24
7.8.2 BIOS Setup .... 7-25
7.8.3 Multiprocessor Memory .... 7-25
7.8.4 Kernel and System Effects .... 7-25
7.8.5 Tools and APIs .... 7-26
7.8.6 Testing Memory Latency and Bandwidth .... 7-26
7.9 The pathopt2 Tool .... 7-27
7.9.1 A Simple Example .... 7-28
7.9.2 pathopt2 Usage .... 7-29
7.9.3 Option Configuration File .... 7-32
7.9.4 Testing Methodology .... 7-35
7.9.5 Using an External Configuration File to Modify pathopt2.xml .... 7-35
7.9.6 PSC_GENFLAGS Environment Variable .... 7-36
7.9.7 Using Build and Test Scripts .... 7-36
7.9.8 The NAS Parallel Benchmark Suite .... 7-37
7.9.8.1 Set Up the Workarea .... 7-37
7.9.8.2 Example 1: Run with Makefile .... 7-37
7.9.8.3 Example 2: Use Build/Run Scripts and a Timing File .... 7-38
7.9.8.4 Example 3: Using a Single Script with the rate-file .... 7-41
7.10 How Did the Compiler Optimize My Code? .... 7-43
7.10.1 Using the -S flag .... 7-43
7.10.2 Using -CLIST or -FLIST .... 7-44
7.10.3 Verbose Flags .... 7-44

Section 8  Using OpenMP and Autoparallelization
8.1 OpenMP .... 8-1
8.2 Autoparallelization .... 8-2
8.3 Getting Started With OpenMP .... 8-3
8.4 OpenMP Compiler Directives (Fortran) .... 8-3
8.5 OpenMP Compiler Directives (C/C++) .... 8-6
8.6 OpenMP Runtime Library Calls (Fortran) .... 8-7
8.7 OpenMP Runtime Library Calls (C/C++) .... 8-9
8.8 Runtime Libraries .... 8-10
8.9 Environment Variables .... 8-11
8.9.1 Standard OpenMP Environment Variables .... 8-12
8.9.2 PathScale OpenMP Environment Variables .... 8-12
8.10 OpenMP Stack Size .... 8-21
8.10.1 Stack Size for Fortran .... 8-21
8.10.2 Stack Size for C/C++ .... 8-22
8.11 Stack Size Algorithm .... 8-22
8.12 Example OpenMP Code in Fortran .... 8-24
8.13 Example OpenMP Code in C/C++ .... 8-25
8.14 Tuning for OpenMP Application Performance .... 8-27
8.14.1 Reduced Datasets .... 8-27
8.14.2 Enable OpenMP .... 8-28
8.14.3 Optimizations for OpenMP .... 8-28
8.14.3.1 Libraries .... 8-28
8.14.3.2 Memory System Performance .... 8-28
8.14.3.3 Load Balancing .... 8-29
8.14.3.4 Tuning the Application Code .... 8-30
8.14.3.5 Using Feedback Data .... 8-30
8.15 Other Resources for OpenMP .... 8-31

Section 9  Examples
9.1 Compiler Flag Tuning and Profiling With pathprof .... 9-1
9.2 Using the -profile Option .... 9-4

Section 10  Debugging and Troubleshooting
10.1 Subscription Manager Problems .... 10-1
10.2 Debugging .... 10-1
10.3 Dealing with Uninitialized Variables .... 10-1
10.4 Trapping IEEE Exceptions .... 10-2
10.5 Large Object Support .... 10-3
10.6 More Inputs Than Registers .... 10-4
10.7 Linking With libg2c .... 10-4
10.8 Linking Large Object Files .... 10-4
10.9 Using -ipa and -Ofast .... 10-4
10.10 Tuning .... 10-5
10.11 Troubleshooting OpenMP .... 10-5
10.11.1 Compiling and Linking with -mp .... 10-5

Appendix A  Environment Variables
A.1 Environment Variables for Use with C .... A-1
A.2 Environment Variables for Use with C++ .... A-1
A.3 Environment Variables for Use with Fortran .... A-1
A.4 Language-independent Environment Variables .... A-2
A.5 Environment Variables for OpenMP .... A-2
A.5.1 Standard OpenMP Runtime Environment Variables .... A-3
A.5.2 PathScale OpenMP Environment Variables .... A-3

Appendix B  Implementation Dependent Behavior for OpenMP Fortran

Appendix C  Supported Fortran Intrinsics
C.1 How to Use the Intrinsics Table .... C-1
C.2 Intrinsic Options .... C-1
C.3 Table of Supported Intrinsics .... C-2
C.4 Fortran Intrinsic Extensions .... C-41

Appendix D  Fortran 90 Dope Vector
Appendix E  Summary of Compiler Options
Appendix F  eko man Page
Appendix G  Glossary

Figures
Figure 7-1  IPA Compilation Model .... 7-6

Tables
Table 4-1  Pre-defined Macros .... 4-4
Table 7-1  Effects of IPA on SPEC CPU2000 Performance .... 7-10
Table 7-2  Effects of IPA tuning on some SPEC CPU2000 benchmarks .... 7-12
Table 7-3  Numerical Accuracy with Options .... 7-23
Table 7-4  pathopt2 Options .... 7-30
Table 7-5  Tags for Option Configuration File .... 7-34
Table 8-1  Fortran Compiler Directives .... 8-4
Table 8-2  C/C++ Compiler Directives .... 8-6
Table 8-3  Fortran OpenMP Runtime Library Routines .... 8-8
Table 8-4  C/C++ OpenMP Runtime Library Routines .... 8-9
Table 8-5  Standard OpenMP Environment Variables .... 8-12
Table C-1  Fortran Intrinsics Supported in 3.1 .... C-3
Table E-1  Summary of Compiler Options by Function .... E-1

Section 1
Introduction

This User Guide covers how to use the PathScale™ Compiler Suite compilers: how to configure them, how to use them to optimize your code, and how to get the best performance from them. This guide also covers the language extensions and the differences from other commonly available language compilers. In the rest of this document, the PathScale Compiler Suite is referred to as the PathScale compilers or the PathScale compiler. The PathScale Compiler Suite generates both 32-bit and 64-bit code, with 64-bit code as the default. See the eko man page for details.

The information in this guide is organized into these sections:

    Section 2 is a quick reference to using the PathScale compilers
    Section 3 covers the PathScale Fortran compiler
    Section 4 covers the PathScale C/C++ compilers
    Section 5 provides suggestions for porting and compatibility
    Section 6 is a Tuning Quick Reference, with tips for getting faster code
    Section 7 discusses tuning options in more detail
    Section 8 covers using autoparallelization and OpenMP in Fortran and C/C++
    Section 9 provides an example of optimizing code
    Section 10 covers debugging and troubleshooting code
    Appendix A lists environment variables used with the compilers
    Appendix B discusses implementation dependent behavior for OpenMP Fortran
    Appendix C is a list of the supported Fortran intrinsics
    Appendix D provides a simplified data structure from a Fortran 90 dope vector
    Appendix E is a summary of the compiler options, grouped by function
    Appendix F is a reference copy of the eko man page
    Appendix G contains a glossary of terms associated with the compilers
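For instance, the default 64-bit code generation can be overridden per compilation. This is a hypothetical sketch, assuming the GCC-style -m32/-m64 flags described in the eko man page and a source file named hello.c:

```shell
pathcc hello.c -o hello64        # 64-bit x86_64 code (the default)
pathcc -m32 hello.c -o hello32   # 32-bit code from the same source
```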
1.1 Conventions Used in This Document

These conventions are used throughout this document.

Convention    Meaning
command       Fixed-space font is used for literal items such as commands, files, routines, and pathnames.
variable      Italic typeface is used for variable names or concepts being defined.
user input    Bold, fixed-space font is used for literal items the user types in. Output is shown in non-bold, fixed-space font.


Convention    Meaning
$             Indicates a command line prompt.
[ ]           Brackets enclose optional portions of a command or directive line.
...           Ellipses indicate that a preceding element can be repeated.
NOTE          Indicates important information.

1.2 Documentation Suite
The PathScale Compiler Suite product documentation set includes:

    The PathScale Compiler Suite and Subscription Manager Install Guide
    The PathScale Compiler Suite User Guide
    The PathScale Compiler Suite Support Guide
    The PathScale Debugger User Guide

There are also online manual pages ("man pages") available describing the flags and options for the PathScale Compiler Suite. These man pages are a subset of the pages that are shipped with the Compiler Suite: eko, pathf95, pathf90, pathcc, pathCC. The pathscale-intro man page gives a complete list of all the man pages included with the Compiler Suite.

Please see the PathScale website for further information about current releases and developer support: http://www.pathscale.com/support.html

In addition, you may want to refer to language reference books for more information on compilers and language usage. Programming and language reference books are often a matter of personal taste. Everyone has personal preferences in reference books, and this list reflects the variety of opinions found within the PathScale engineering team.

Fortran Language:

Fortran 95 Handbook: Complete ISO/ANSI Reference by Jeanne C. Adams, et al., MIT Press, 1997, ISBN 0-262-51096-0

Fortran 95 Explained by Metcalf, M. and Reid, J., Oxford University Press, 1996, ISBN 0-19-851888-8


C Language:

C Programming Language by Brian W. Kernighan and Dennis M. Ritchie, Prentice Hall, 2nd edition, 1988, ISBN 0-13-110362-8

C: A Reference Manual by Samuel P. Harbison and Guy L. Steele, Prentice Hall, 5th edition, 2002, ISBN 0-130-89592-X

C: How to Program by H.M. Deitel and P.J. Deitel, Prentice Hall, 4th edition, 2004, ISBN 0-131-42644-3

C++ Language:

The C++ Standard Library: A Tutorial and Reference by Nicolai M. Josuttis, Addison-Wesley, 1999, ISBN 0-201-37926-0

Effective C++: 55 Specific Ways to Improve Your Programs and Designs by Scott Meyers, Addison-Wesley Professional, 3rd edition, 2005, ISBN 0-321-33487-6

More Effective C++: 35 New Ways to Improve Your Programs and Designs by Scott Meyers, Addison-Wesley Professional, 1995, ISBN 0-201-63371-X

Thinking in C++, Volume 1: Introduction to Standard C++ by Bruce Eckel, Prentice Hall, 2nd edition, 2000, ISBN 0-139-79809-9 (NOTE: A later version, 2002, is available online as a free download.)

Thinking in C++, Volume 2: Practical Programming by Bruce Eckel, Prentice Hall, 2nd edition, 2003, ISBN 0-130-35313-2

C++ Inside & Out by Bruce Eckel, Osborne/McGraw-Hill, 1993, ISBN 0-07-881809-5

C++: How to Program by H.M. Deitel and P.J. Deitel, Prentice Hall, 5th edition, 2005, ISBN 0-131-85757-6

Other Topics:

Effective STL: 50 Specific Ways to Improve Your Use of the Standard Template Library by Scott Meyers, Addison-Wesley Professional, 2001, ISBN 0-201-74962-9




Section 2: Compiler Quick Reference
This section describes how to get started using the PathScale Compiler Suite. The compilers follow the standard conventions of Unix and Linux compilers, produce code that follows the Linux x86_64 ABI, and run on both the AMD64 and Intel EM64T families of chips. AMD64 is the AMD 64-bit extension to the Intel IA32 architecture, often referred to as "x86". EM64T is the Intel® Extended Memory 64 Technology chip family. This means that object files produced by the PathScale compilers can link with object files produced by other Linux x86_64-compliant compilers such as the Red Hat and SuSE GNU gcc, g++, and g77.
2.1 What You Installed
For details on installing the PathScale compilers, see the PathScale Compiler Suite Install Guide. The PathScale Compiler Suite includes optimizing compilers and runtime support for C, C++, and Fortran. Depending on the type of subscription you purchased, you enabled some or all of the following:

PathScale C compiler for x86_64 and EM64T architectures
PathScale C++ compiler for x86_64 and EM64T architectures
PathScale Fortran compiler for x86_64 and EM64T architectures
Documentation
Libraries
Subscription Manager client. You must have a valid subscription and associated subscription file in order to run the compiler.
Subscription Manager server. The PathScale Subscription Manager server is only required for floating subscriptions.
PathScale debugger (pathdb)
GNU binutils
2.2 How To Invoke the PathScale Compilers
The PathScale Compiler Suite has three different front-ends to handle programs written in C, C++, and Fortran, and it has common optimization and code generation


components that interface with all the language front-ends. The language your program uses determines which command (driver) name to use:
Language           Command Name   Compiler Name
C                  pathcc         PathScale C compiler
C++                pathCC         PathScale C++ compiler
Fortran 77/90/95   pathf95        PathScale Fortran compiler

You can create a common example program called world.c:
#include <stdio.h>

int main(void)
{
    printf("Hello World!\n");
    return 0;
}

Then you can compile it from your shell prompt very simply:
$ pathcc world.c

The default output file for the pathcc-generated executable is named a.out. You can execute it and see the output:
$ ./a.out
Hello World!

As with most compilers, you can use the -o option to give the executable file a name of your choice. If invoked with the flag -v (or -version), the compilers emit text that identifies the version. For example:
$ pathcc -v
PathScale(TM) Compiler Suite: Version 3.1
Built on: 2007-10-21 07:03:08 -0800
Thread model: posix
GNU gcc version 4.0.2 (PathScale 3.1 driver)

There are online manual pages ("man pages") with descriptions of the large number of command line options that are available. Type man pathscale_intro at the command line to see the pathscale-intro man page and its overview of the various man pages included with the Compiler Suite.
2.2.1 Accessing the GCC 4.x Front-ends for C and C++
This release supports GCC 3.x and GCC 4.x. The compiler defaults to gnu3 or gnu4 depending on whether the system-installed gcc/g++ is a 3.x or 4.x compiler. It is possible to override this choice using -gnu3 or -gnu4 to get the compiler to use the alternate front-end instead of the default one. A sample command for C is:


$ pathcc -gnu4 world.c

This default can be changed in your compiler.defaults file by adding this line:

-gnu4

See section 2.3 for an example compiler.defaults file. The option has no effect on pathf90 or pathf95. There are currently some limitations when using this option; please see the Release Notes for more information.
2.3 Compiling for Different Platforms
The PathScale Compiler Suite currently compiles and optimizes your code for the Opteron processor, independent of where the compilation is happening. (This may change in the future.) To select the 32-bit or 64-bit ABI, the compiler queries the machine where the compilation is happening and compiles for the best ABI that machine supports. These defaults (for the target processor and the ABI) can be overridden by command-line flags or the compiler.defaults file.

You can set or change the default platform for compilation using the compiler.defaults file, found in /opt/pathscale/etc. If you installed in a non-default location, the path will be <install_dir>/pathscale/etc. You can use the defaults file to provide a set of additional include or library directories to search, or to specify default compiler optimization flags. The compiler refers to the compiler.defaults file for options to be used during compilation.

The syntax in the compiler.defaults file is the same as for options specified on the compiler command line. Options are added to the command line in the order in which they appear in the defaults file. Every option is included unconditionally.

For exclusive options, the command line takes precedence over the defaults file. For example, if the defaults file contains the -O3 option, but the compiler is invoked with -O2 on the command line, it will behave as if invoked with -O2 alone, because -O2 and -O3 are exclusive options. For additive options, the command line is used before the defaults file. For example, if the compiler.defaults file contains -I/usr/foo and the command line contains -I/usr/bar, the compiler will behave as if invoked with -I/usr/bar -I/usr/foo.

The format of the compiler.defaults file is simple. Each line can contain compiler options, separated by white space, followed by an optional comment. A comment begins with the # character and ends at the end of the line. Empty lines and lines containing only comments are skipped.
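To make the precedence rules concrete, suppose the compiler.defaults file contains exactly the two options used in the examples above:

```
-O3          # exclusive: any -O flag on the command line overrides this
-I/usr/foo   # additive: appended after -I paths from the command line
```

Invoking pathcc -O2 -I/usr/bar hello.c then behaves as if you had typed pathcc -O2 -I/usr/bar -I/usr/foo hello.c.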


Here is an example defaults file:
# PathScale compiler defaults file.
#
# Set default CPU type to optimize for, since all of our
# systems use the same CPUs.
-march=opteron
# We have a recent Opteron CPU stepping, so it's safe to
# always use SSE3.
-msse3
# Ensure that the FFTW library is available to users, so
# they don't need to remember where it's installed.
-L/share/fftw3/lib
-I/share/fftw3/include
# Use the GCC 4.x front-end by default
-gnu4

The environment variable PSC_COMPILER_DEFAULTS_PATH, if set, specifies a PATH or a colon-separated list of PATHs, designating where the compiler is to look for the compiler.defaults file. If the environment variable is set, the PATH /opt/pathscale/etc will not be used. If the file cannot be found, then no defaults file will be used, even if one is present in /opt/pathscale/etc. For more details, see the compiler.defaults man page.
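For example, a site could point the compilers at a shared defaults directory followed by a per-user one (both paths below are illustrative, not locations shipped with the product):

```shell
# Search a site-wide directory first, then a per-user directory, for
# compiler.defaults. Once this variable is set, /opt/pathscale/etc is
# no longer consulted.
export PSC_COMPILER_DEFAULTS_PATH=/opt/site/etc:$HOME/pathscale/etc
echo "$PSC_COMPILER_DEFAULTS_PATH"
```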
2.3.1 Target Options for This Release
These options, related to the ABI, ISA, and processor target, are supported in this release:

-m32
-m64
-march= (same as -mcpu= and -mtune=)
-mcpu= (same as -march= and -mtune=)
-mtune= (same as -march= and -mcpu=)
-msse2
-msse3
-msse4a
-m3dnow

There are also -mno- versions of -msse2, -msse3, -msse4a, and -m3dnow; for example, -mno-sse3. The architectures supported by the -march= flag in this release are:

-march=(opteron|athlon64|athlon64fx)
-march=barcelona
-march=pentium4
-march=xeon


-march=em64t
-march=core

We have also added two special options, -march=anyx86 and -march=auto. If you want to compile the program so that it can run on any x86 machine, specify anyx86 as the value of the -march, -mcpu, or -mtune option:

-march=anyx86

If the value for the -march, -mcpu, or -mtune option is auto, the compiler will automatically choose the target processor based on the machine on which the compilation takes place:

-march=auto

The compiler defaults to -march=auto. Here is a sample of how options are specified in the compiler.defaults file:
# Compile for Athlon64 and turn on 3DNow extensions. One
# option per line.
-march=athlon64 # anything after '#' is ignored
-m3dnow

These options can also be used on the command line. See the eko man page for details.
2.3.2 Defaults Flag
This release includes a flag, -show-defaults, which directs the compiler to print the defaults it uses for the ABI, ISA, and processor target. When this flag is specified, the compiler just prints the defaults and quits; no compilation is performed.
$ pathcc -show-defaults
2.3.3 Compiling for an Alternate Platform
You will need to compile with the -march=anyx86 flag if you want to run your compiled executables on both AMD and Intel platforms. See the eko man page for more information about the -march= flag. To run code generated with the PathScale Compiler Suite on a different host machine, you will need to install the runtime libraries on that machine, or link your programs statically when you compile. See section 2.7 for information on static linking, and the PathScale Compiler Suite Install Guide for information on installing runtime libraries.


2.3.4 Compiling Option Tool: pathhow-compiled
The PathScale Compiler Suite includes a tool that displays the compilation options and compiler version currently being used. The tool is called pathhow-compiled and can be found after installation in /opt/pathscale/bin (or <install_dir>/bin if you installed to a non-default location). When a .o file, an archive, or an executable is passed to pathhow-compiled, it displays the compilation options for each .o file constituting the argument file, including any linked archives. For example, compile the file myfile.c with pathcc and then use the pathhow-compiled tool:
$ pathcc myfile.c -o myfile
$ pathhow-compiled myfile

The output would look something like this:
PathScale Compiler Version 3.1 compiled myfile.c with options:
-O2 -march=opteron -msse2 -mno-sse3 -mno-3dnow -m64
2.4 Input File Types
The name of a source file usually has the form filename.ext, where ext is a one- to three-character extension that tells the driver how to handle the file:
Extension             Implication to the driver
.c                    C source file that will be preprocessed
.C, .cc, .cpp, .cxx   C++ source file that will be preprocessed
.f                    Fortran source, fixed format, no preprocessor
.f90                  Fortran source, freeform format, no preprocessor
.f95                  Fortran source, freeform format, no preprocessor
.F                    Fortran source, fixed format, invokes preprocessor
.F90                  Fortran source, freeform format, invokes preprocessor
.F95                  Fortran source, freeform format, invokes preprocessor
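As a sketch of the mapping above, a shell case statement reproduces how a driver might classify inputs by suffix (illustrative only; the real logic is internal to pathcc and pathf95):

```shell
# Map a file name to the driver's treatment of it, per the table above.
# Shell glob patterns are case-sensitive, so .f90 and .F90 differ.
kind_of() {
  case "$1" in
    *.c)                  echo "C, preprocessed" ;;
    *.C|*.cc|*.cpp|*.cxx) echo "C++, preprocessed" ;;
    *.f)                  echo "Fortran fixed form, no preprocessor" ;;
    *.f90|*.f95)          echo "Fortran free form, no preprocessor" ;;
    *.F)                  echo "Fortran fixed form, preprocessed" ;;
    *.F90|*.F95)          echo "Fortran free form, preprocessed" ;;
    *)                    echo "unknown" ;;
  esac
}
kind_of hello.c     # C, preprocessed
kind_of bench.F90   # Fortran free form, preprocessed
```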

For Fortran files with the extensions .f, .f90, or .f95 you can use -ftpp (to invoke the Fortran preprocessor) or -cpp (to invoke the C preprocessor) on the


pathf95 command line. The default preprocessor for files with .F, .F90, or .F95 extensions is -cpp. See section 3.5.1 for more information on preprocessing. The compiler drivers use the extension to determine which language front-end to invoke. For example, some mixed-language programs can be compiled with a single command:
$ pathf95 stream_d.f second_wall.c -o stream

The pathf95 driver will use the .c extension to know that it should automatically invoke the C front-end on the second_wall.c module and link the generated object files into the stream executable.

NOTE: GNU make does not contain a rule for generating object files from Fortran .f90 files. You can add the following rules to your project Makefiles to achieve this:
%.o: %.f90
	$(FC) $(FFLAGS) -c $<
%.o: %.F90
	$(FC) $(FFLAGS) -c $<

You may need to modify this for your project, but in general the rules should follow this form. For more information on compatibility and porting existing code, see section 5. Information on GCC compatibility and a wrapper script that you can use for your build packages can be found in section 5.7.1.
2.5 Other Input Files
Other possible input files, common to both C/C++ and Fortran, are assembly-language files, object files, and libraries. These can be used as inputs on the command line.
Extension   Implication to the driver
.i          preprocessed C source file
.ii         preprocessed C++ source file
.s          assembly language file
.o          object file
.a          static library of object files
.so         library of shared (dynamic) object files


2.6 Common Compiler Options
The PathScale Compiler Suite has command line options that are similar to many other Linux or Unix compilers:
Option      What it does
-c          Generates an intermediate object file for each source file, but does not link
-g          Produces debugging information to allow full symbolic debugging
-I          Adds a directory to those searched by the preprocessor for include-file resolution
-l          Searches the specified library during the linking phase for unresolved symbols
-L          Adds a directory to those searched during the linking phase for libraries
-lm         Links with the libm math library; typically required in C programs that use functions such as exp(), log(), sin(), and cos()
-o          Names the generated executable (binary) file
-O or -O2   Generates an optimized executable that is numerically safe (the default if no -O flag is used)
-O3         Generates a highly optimized executable, generally numerically safe
-pg         Generates profile information suitable for the analysis program pathprof

Many more options are available and described in the man pages (pathscale_intro, pathcc, pathf95, pathCC, eko) and section 7 in this document.
2.7 Shared Libraries
The PathScale Compiler Suite includes shared versions of the runtime libraries that the compilers use. The shared libraries are packaged in the pathscale-compilers-libs package. The compiler will use these shared libraries by default when linking executables and shared objects. Therefore, if you link a program with these shared libraries, you must install them on systems where that program will run. You should continue to use the static versions of the runtime libraries if you wish to obtain maximum portability or peak performance. The latter is the case because the compiler cannot optimize shared libraries as aggressively as static libraries. Shared libraries are compiled using position-independent code, which limits some opportunities for optimization, while our static libraries are not compiled this way.


To link with static libraries instead of shared libraries, use the -static option. For example, the following program is linked using the shared libraries:
$ pathcc -o hello hello.c
$ ldd hello
        libpscrt.so.1 => /opt/pathscale/lib/2.3.99/libpscrt.so.1 (0x0000002a9566d000)
        libmpath.so.1 => /opt/pathscale/lib/2.3.99/libmpath.so.1 (0x0000002a9576e000)
        libc.so.6 => /lib64/libc.so.6 (0x0000002a9588b000)
        libm.so.6 => /lib64/libm.so.6 (0x0000002a95acd000)
        /lib64/ld-linux-x86-64.so.2 => /lib64/ld-linux-x86-64.so.2 (0x0000002a95556000)
$

If you use the -static option, notice that the shared libraries are no longer required.
$ pathcc -o hello hello.c -static
$ ldd hello
        not a dynamic executable
$
2.8 Large File Support
The Fortran runtime libraries are compiled with large file support. PathScale does not provide any runtime libraries for C or C++ that do I/O, so large file support is provided by the libraries in the Linux distribution being used.
2.9 Memory Model Support
The PathScale compilers currently support two memory models: small and medium. The default memory model on x86_64 systems, and the default for the compilers, is small (equivalent to GCC's -mcmodel=small). This means that offsets of code and data within binaries are represented as signed 32-bit quantities. In this model, all code in an executable must total less than 2GB, and all the data must also be less than 2GB. Note that by data, we mean the static data and uninitialized static data (BSS) that are compiled into an executable, not data allocated dynamically on the stack or from the heap. Pointers are 64 bits, however, so dynamically allocated memory may exceed 2GB. Programs can be statically or dynamically linked.

Additionally, the compilers support the medium memory model through the option -mcmodel=medium, given on all of the compilation and link commands. This means


that offsets of code within binaries are represented as signed 32-bit quantities, while offsets for data within the binaries are represented as signed 64-bit quantities. In this model, all code in an executable must come to less than 2GB in total size, but the data, both static and BSS, may exceed 2GB in size. As with the small memory model, pointers are 64-bit quantities, so dynamically allocated memory may exceed 2GB.

NOTE: The PathScale compilers do not support the use of the -fPIC option flag in combination with the -mcmodel=medium option. The medium code model is not supported in PIC mode.

The PathScale compilers support -mcmodel=medium and -fPIC in the same way that GCC does. When building shared libraries, only -fPIC should be used; use -mcmodel=medium, but not -fPIC, when compiling and linking the main program. The reasoning is that because a shared library is self-contained, it does not know about the fixed addresses of the data in the program it is linked with. The library only accesses program data through pointers, and such pointer accesses are not affected by the value of the mcmodel option. The mcmodel value only affects the addressing of data with fixed addresses. When these addresses are larger than 2GB, the compiler has to generate longer sequences of instructions, so it avoids doing that unless the -mcmodel=medium flag is given. See section 10.4 for more information on using large objects, and your GCC 3.3.1 documentation for more information on this topic.
2.9.1 Support for "Large" Memory Model
At this time the PathScale compilers do not support the large memory model. This means that code offsets must fit within a signed 32-bit range. To determine whether you are close to this limit, use the Linux size command.
$ size bench
   text    data     bss     dec    hex filename
 910219    1448    3192  914859  df5ab bench

If the total value of the text segment is close to 2GB, then the size of the memory model may be an issue for you. We believe that codes that are this large are extremely rare and would like to know if you are using such an application. The size of the bss and data segments are addressed by using the medium memory model.
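To relate the size output to the limit, a quick bit of shell arithmetic (using the text figure from the sample output above) shows how little of the signed 32-bit code budget that binary uses:

```shell
# Compare the text segment from `size bench` against the 2GB code limit.
text=910219                         # bytes, from the sample output above
limit=$(( 2 * 1024 * 1024 * 1024 )) # 2GB signed 32-bit offset limit
echo "$(( text * 100 / limit ))% of the code limit used"
```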


2.10 Debugging
The flag -g tells the PathScale compilers to produce data in the DWARF 2.0 format used by modern debuggers such as GDB and PathScale's debugger, pathdb. This format is incorporated directly into the object files. The -g option automatically sets the optimization level to -O0 unless an explicit optimization level is provided on the command line. Debugging at higher levels of optimization is possible, but the code transformations performed by the optimizations may make it more difficult. See the individual sections on the PathScale Fortran and C/C++ compilers for more language-specific debugging information, and section 10 for debugging and troubleshooting tips. See the PathScale Debugger User Guide for more information on pathdb.
2.11 Profiling: Locate Your Program's Hot Spots
Often a program has "hot spots," a few routines or loops that are responsible for most of the execution time. Profilers are a common tool for finding these hot spots in a program. To figure out where and how to tune your code, use the time tool to get a rough estimate and determine whether the issue is system load, application load, or a system resource that is slowing down your program. Then use the pathprof tool to find the program's hot spots. Once you find the hot spots, you can improve your code for better performance, or use the information to help choose which compiler flags are likely to lead to better performance.

The time tool reports the elapsed (or wall) time, user time, and system time of your program; its typical usage is time followed by the command that runs your program. Elapsed time is usually the measurement of interest, especially for parallel programs, but if your system is busy with other loads, then user time might be a more accurate estimate of performance than elapsed time. If substantial system time is being used and you don't expect to be using substantial non-compute resources of the system, use a kernel profiling tool to see what is causing it.

The pathprof and pathcov programs included with the compilers are symbolic links to your system's gprof and gcov executables, respectively. There are more details and an example using pathprof later in section 9, but the following steps are all that are needed to get started in profiling:

1. Add the -pg flag to both the compile and link steps with the PathScale compilers. This generates an instrumented binary.

2. Run the program executable with the input data of interest. This creates a gmon.out file with the profile data.

3. Run pathprof to generate the profiles. The standard output of pathprof includes two tables:


a. A flat profile with the time consumed in each routine and the number of times it was called, and b. A call-graph profile that shows, for each routine, which routines it called and which other routines called it. There is also an estimate of the inclusive time spent in a routine and all of the routines called by that routine. NOTE: The pathprof tool will generate a segmentation fault when used with OpenMP applications that are run with more than one thread. There is no current workaround for pathprof (or gprof).

See section 9 for a more detailed example of profiling.
2.12 taskset: Assigning a Process to a Specific CPU
To improve the performance of your application on multiprocessor machines, it is useful to assign the process to a specific CPU. The tool used to do this is taskset, which can be used to retrieve or set a process' affinity. This command is part of the schedutils package/RPM. NOTE: Some of the Linux distributions supported by the PathScale compilers do not contain the schedutils package/RPM.

The CPU affinity is represented as a bitmask, typically given in hexadecimal. Assigning a process to a specific CPU prevents the Linux scheduler from moving or splitting the process. Example:
$ taskset 0x00000001

This would assign the process to processor #0. If an invalid mask is given, an error is returned, so when taskset returns, it is guaranteed that the program has been scheduled on a valid and legal CPU. See the taskset(1) man page for more information.
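Since the mask is a bitfield with one bit per CPU, the mask for CPU N is 1 shifted left by N. A small helper (hypothetical, for illustration) computes it:

```shell
# Print the taskset affinity mask that selects a single CPU; bit N of
# the mask set means the process may run on CPU N.
mask_for_cpu() {
  printf '0x%08x\n' $(( 1 << $1 ))
}
mask_for_cpu 0   # 0x00000001, processor #0 as in the example above
mask_for_cpu 3   # 0x00000008, processor #3
```

The resulting mask can be passed to taskset in place of the literal 0x00000001 in the example above.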



Section 3: The PathScale Fortran Compiler
The PathScale Fortran compiler supports Fortran 77, Fortran 90, and Fortran 95. The PathScale Fortran compiler:

Conforms to ISO/IEC 1539:1991 Programming languages - Fortran (Fortran 90)
Conforms to the more recent ISO/IEC 1539-1:1997 Programming languages - Fortran (Fortran 95)
Conforms to ISO/IEC TR 15580: Fortran: Floating point exception handling. See also section 14 of ISO/IEC 1539-1:2004, the Fortran 2003 standard, for a complete description.
Conforms to ISO/IEC TR 15581: Fortran: Enhanced data type facilities
Conforms to ISO/IEC 1539-2: Varying length character strings (section 3.5.3)
Conforms to ISO/IEC 1539-3: Conditional compilation (section 3.5.4)
Supports legacy FORTRAN 77 (ANSI X3.9-1978) programs
Provides support for common extensions to the above language definitions
Links binaries generated with the GNU Fortran 77 compiler
Generates code that complies with the x86_64 ABI and the 32-bit x86 ABI
3.1 Using the Fortran Compiler
To invoke the PathScale Fortran compiler, use this command:
$ pathf95

By default, the compiler treats input files with an .F or .f suffix as fixed-form files. Files with an .F90, .f90, .F95, or .f95 suffix are treated as free-form files. This behavior can be overridden using the -fixedform and -freeform switches. See section 3.1.1 for more information on fixed-form and free-form files. By default, all files ending in .F, .F90, or .F95 are first preprocessed using the C preprocessor (-cpp). If you specify the -ftpp option, all files are preprocessed using the Fortran preprocessor instead, regardless of suffix. See section 3.5.1 for more information on preprocessing.


Invoking the compiler without any options instructs the compiler to use optimization level -O2. These three commands are equivalent:
$ pathf95 test.f90
$ pathf95 -O test.f90
$ pathf95 -O2 test.f90

Using optimization level -O0 instructs the compiler to do no optimization. Optimization level -O1 performs only local optimization. Level -O2, the default, performs extensive optimizations that will always shorten execution time, but may cause compile time to be lengthened. Level -O3 performs aggressive optimization that may or may not improve execution time. See section 7.1 for more information about the -O flag. Use the -ipa switch to enable inter-procedural analysis:
$ pathf95 -c -ipa matrix.f90
$ pathf95 -c -ipa prog.f90
$ pathf95 -ipa matrix.o prog.o -o prog

Note that the link line also specifies the -ipa option. This is required to perform the IPA link properly. See section 7.3 for more information on IPA. NOTE: The compiler typically allocates data for Fortran programs on the stack for best performance. Some major Linux distributions impose a relatively low limit on the amount of stack space a program can use. When you attempt to run a Fortran program that uses a large amount of data on such a system, it will print an informative error message and abort. You can use your shell's "ulimit" (bash) or "limit" (tcsh) command to increase the stack size limit to a point where the program no longer crashes, or remove the limit entirely. See section 3.12 for more information on Fortran compiler stack size.
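The stack limit mentioned in the note can be inspected, and raised toward the hard limit, from the shell; a sketch using ulimit (bash syntax):

```shell
# Show the current soft stack limit (in kilobytes, or "unlimited") and
# the hard limit it may be raised to without special privilege.
ulimit -s
ulimit -Hs
# To remove the soft limit entirely (up to the hard limit):
#   ulimit -s unlimited
```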

3.1.1 Fixed-form and Free-form Files
Fixed-form files follow the obsolete Fortran standard of assigning special meaning to the first 6 character positions of each line in a source file. If a C, ! or * character is present in the first character position on a line, that specifies that the remainder of the line is to be treated as a comment. If a ! is present at any character position on a line except for the 6th character position, then the remainder of that line is treated as a comment. Lines containing only blank characters or empty lines are also treated as comments. If any character other than a blank character is present in the 6th character position on a line, that specifies that the line is a continuation from the previous line. The Fortran standard specifies that no more than 19 continuation lines can follow a line, but the PathScale compiler supports up to 499 continuation lines.


Source code appears between the 7th character position and the 72nd character position in the line, inclusive. Semicolons are used to separate multiple statements on a line. A semicolon cannot be the first non-blank character between the 7th character position and the 72nd character position. Character positions 1 through 5 are for statement labels. Since statement labels cannot appear on continuation lines, the first five entries of a continuation line must be blank. Free-form files have fewer limitations on line layout. Lines can be arbitrarily long, and continuation is indicated by placing an ampersand (&) at the end of the line before the continuation line. Statement labels can be placed at any character position in a line, as long as it is preceded by blank characters only. Comments start with a ! character anywhere on the line.
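The two layouts above can be written out with shell heredocs so the column rules are explicit (file names are illustrative):

```shell
# Fixed form: comment flag in column 1, label in columns 1-5,
# continuation marker in column 6, statements in columns 7-72.
cat > fixed.f <<'EOF'
C     Fixed form: 'C' in column 1 makes this line a comment
   10 X = 1.0 +
     &    2.0
EOF
# Free form: '!' starts a comment anywhere; a trailing '&' continues.
cat > free.f90 <<'EOF'
! Free form comment
x = 1.0 + &
    2.0
EOF
awk 'NR==3 { print substr($0, 6, 1) }' fixed.f   # the column-6 continuation marker
```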
3.2 Modules
When a Fortran module is compiled, information about the module is placed into a file called MODULENAME.mod. The default location for this file is the directory where the command is executed; this location can be changed using the -module option. The MODULENAME.mod file allows other Fortran files to use procedures, functions, variables, and any other entities defined in the module. Module files can be considered similar to C header files, and as with C header files, you can use the -I option to point to the location of module files:
$ pathf95 -I/work/project/include -c foo.f90

This instructs the compiler to look for .mod files in the /work/project/include directory. If foo.f90 contains a 'use arith' statement, the following locations would be searched:
/work/project/include/ARITH.mod
./ARITH.mod
3.2.1 Order of Appearance
If a module and the "use" statements referring to that module appear in the same source file, the module must appear first. If a module appears in one source file and the "use" statements referring to that module appear in other source files, the file containing the module must be compiled first. If a single command compiles all the files, the file containing the module must appear on the command line before the files containing the "use" statements:
pathf95 mymodule.f95 myprogram.f95


3.2.2 Linking Object Files to the Rest of the Program
A source file containing a module generates an object (.o) file as well as a module-information (.mod) file, even if the source file contains nothing other than the module. That object file must be linked with the rest of the program. If a single command compiles and links the entire program, this will happen automatically, but if you use a separate command to link objects together, you must be careful not to omit object files resulting from source files which contain only modules. The order of object files in such a command does not matter. For example:
pathf95 -c mymodule.f95
pathf95 -c myprogram.f95
pathf95 myprogram.o mymodule.o

Notice that a source file containing multiple modules will generate one object (.o) file, which takes its name from the source file, plus multiple module-information (.mod) files, which take their names from the modules themselves. For example, compiling a file my3modules.f95 that contains three modules generates MYMODULE1.mod, MYMODULE2.mod, MYMODULE3.mod, and my3modules.o:
$ pathf95 -c my3modules.f95

Then generate the main program which uses modules:
$ pathf95 -c myprogram.f95
$ pathf95 my3modules.o myprogram.o
3.3 Linking When the Main Program Is In a Library
When working with a long list of object files, it is possible to put them all into a single library, then specify the library in place of the object files when linking the program. If the main program is coded in Fortran, however, its linker symbol is MAIN__ rather than main, and when you link the program with pathf95, the linker will not automatically import it from a library. The usual symptom is a program that links without error but then prints:
Someone linked a Fortran program with no MAIN__!

The solution is to tell the linker explicitly to import the symbol MAIN__ (with two underscores):
$ pathf90 -Wl,--undefined=MAIN__ mylibrary.a
3.3.1 Module-related Error Messages
Error messages report the error as the first line in the module, even if the real error is further inside the module. The real error is reported after this first standard message. An example is given below.


Here is a program, hellow.f95, which contains this module:
MODULE HELLOW
CONTAINS
  SUBROUTINE HELLO( )
    SPRINTZ *,"Hello, World!"
  END SUBROUTINE HELLO
END MODULE HELLOW

Next compile the program containing the module, and look at the error that is generated:
$ pathf95 hellow.f95

  MODULE HELLOW
  ^
pathf95-855 pathf95: ERROR HELLOW, File = hellow.f95, Line = 1, Column = 8
  The compiler has detected errors in module "HELLOW". No module information file will be created for this module.

  SPRINTZ *,"Hello, World!"
  ^
pathf95-724 pathf95: ERROR HELLO, File = hellow.f95, Line = 5, Column = 11
  Unknown statement. Expected assignment statement but found "*" instead of "=" or "=>".

pathf95: PathScale(TM) Fortran Version 2.1.99 (f14) Tue Nov 21, 2006 14:22:16
pathf95: 9 source lines
pathf95: 2 Error(s), 0 Warning(s), 0 Other message(s), 0 ANSI(s)
pathf95: "explain pathf95-message number" gives more information about each message

Note that the real error is pointed out after the first error on line 1 is reported.
3.4 Extensions
The PathScale Fortran compiler supports a number of extensions to the Fortran standard, which are described in this section.
3.4.1 Promotion of REAL and INTEGER Types
Section 5 has more information about porting code, but it is useful to mention here the following options, which can help in porting your Fortran code.


-r8, -i8
These options promote the default representations for REAL and INTEGER types, respectively, from 4 bytes to 8 bytes. They are useful for porting Cray code, where integer and floating point data are 8 bytes long by default. Watch out for type mismatches with external libraries.

NOTE: The -r8 and -i8 flags affect only default reals and integers, not variable declarations or constants that specify an explicit KIND. This can cause incorrect results if a 4-byte default real or integer is passed into a subprogram that declares a KIND=4 integer or real. Using an explicit KIND value like this is unportable and is not recommended. Correct usage of KIND (i.e. KIND=KIND(1) or KIND=KIND(0.0d0)) will not result in any problems.

3.4.2 Cray Pointers
The Cray pointer is a data type extension to Fortran to specify dynamic objects, different from the Fortran pointer. Both Cray and Fortran pointers use the POINTER keyword, but they are specified in such a way that the compiler can differentiate between them. The declaration of a Cray pointer is:
POINTER (<pointer_name>, <pointee>)

Fortran pointers are declared using:
POINTER :: <object_name> [, <object_name>] ...

PathScale's implementation of Cray pointers follows the Cray implementation, which is stricter than the implementations in some other compilers. In particular, the PathScale Fortran compiler does not treat pointers exactly like integers. The compiler will report an error if you do something like p = ((p+7) / 8) * 8 to align a pointer.
3.4.3 Directives
Directives within a program unit apply only to that program unit, reverting to the default values at the end of the program unit. Directives that occur outside of a program unit alter the default value, and therefore apply to the rest of the file from that point on, until overridden by a subsequent directive. Directives within a file override the command line options by default. To have the command line options override directives, use the command line option:
-LNO:ignore_pragmas

Use the following option to control the behavior for directives contained within comments:
-[no-]directives


-no-directives ignores all directives (such as !$OMP or C*$* PREFETCH_REF) inside comments. The default is -directives, which scans the comments for directives. Note that certain directives may have no effect unless additional options, such as -mp, are present. For the 3.1 release, the PathScale Compiler Suite supports the following prefetch directives.
3.4.3.1 F77 or F90 Prefetch Directives
C*$* PREFETCH(N [,N])
Specify prefetching for each level of the cache. The scope is the entire function containing the directive. N can be one of the following values:
0  Prefetching off (the default)
1  Prefetching on, but conservative
2  Prefetching on, and aggressive (the default when prefetch is on)

C*$* PREFETCH_MANUAL(N)
Specify whether manual prefetches (through directives) should be respected or ignored. Scope: entire function containing the directive. N can be one of the following values:
0  Ignore manual prefetches
1  Respect manual prefetches

C*$* PREFETCH_REF_DISABLE=A [, size=num]
This directive explicitly disables prefetching of all references to array A in the current function. The auto-prefetcher runs (if enabled) ignoring array A. The size is used for volume analysis. Scope: entire function containing the directive.
size=num is the size of the array references in this loop, in Kbytes. This is an optional argument and must be a constant.

C*$* PREFETCH_REF=array-ref, [stride=[str][,str]], [level=[lev][,lev]], [kind=[rd/wr]], [size=[sz]]
This directive generates a single prefetch instruction to the specified memory location. It searches for array references that match the supplied reference in the current loop-nest. If such a reference is found, that reference is connected to this prefetch node with the specified parameters. If no such reference is found, this prefetch node stays free-floating and is scheduled "loosely". All references to this array in this loop-nest are ignored by the automatic prefetcher (if enabled). If the size is supplied, then the auto-prefetcher (if enabled) reduces the effective cache size by that amount in its calculations.


The compiler tries to issue one prefetch per stride iteration, but cannot guarantee it. Redundant prefetches are preferred to transformations (such as inserting conditionals) that incur other overhead. Scope: no scope; this directive just generates a prefetch instruction. The following arguments are used with this directive:
array-ref  Required. The reference itself, for example, A(i, j).
str  Optional. Prefetch every str iterations of this loop. The default is 1.
lev  Optional. The level in the memory hierarchy to prefetch. The default is 2. If lev=1, prefetch from L2 to L1 cache. If lev=2, prefetch from memory to L1 cache.
rd/wr  Optional. The default is read/write.
sz  Optional. The size (in Kbytes) of the array referenced in this loop. This must be a constant.
3.4.3.2 Changing Optimization Using Directives
Optimization flags can now be changed via directives in the user program. In Fortran, the directive is used in the form:
C*$* options <"list-of-options">

Any number of these can be specified inside function scopes. Each affects only the optimization of the entire function in which it is specified. The literal string can contain any number of different options separated by spaces, and must include the enclosing quotes. The compilation of the next function reverts to the settings specified on the compiler command line.

In this release, there are limitations to the options that are processed in this options directive, and to their effects on optimization. No warning or error is given for options that are not processed. These directives are processed only in the optimizing backend, so only options that affect optimizations are processed. In addition, they will not affect the phase invocation of the backend components. For example, specifying -O0 will not suppress the invocation of the global optimizer, though the invoked backend phases will honor the specified optimization level. Apart from the optimization level flags, only flags belonging to the following option groups are processed: -LNO, -OPT, and -WOPT.


3.5 Compiler and Runtime Features
The compiler offers three different preprocessing options: -cpp, -ftpp, and -fcoco.
3.5.1 Preprocessing Source Files with -cpp
Before being passed to the compiler front-end, source files are optionally passed through a source code preprocessor. The preprocessor searches for certain directives in the file and, based on these directives, can include or exclude parts of the source code, include other files or define and expand macros. By default, Fortran .F, .F90, and .F95 files are passed through the C preprocessor -cpp.
3.5.2 Preprocessing Source Files with -ftpp
The Fortran preprocessor -ftpp accepts many of the same "#" directives as the C preprocessor but differs in significant details (for example, it does not allow C-style comments beginning with "/*" to extend across multiple lines). Use the -cpp option if you wish to use the C preprocessor on Fortran source files ending in .f, .f90, or .f95. These files will not be preprocessed unless you use either -ftpp (to select the Fortran preprocessor) or -cpp (to select the C preprocessor) on the command line.
3.5.3 Support for Varying Length Character Strings
Beginning with release 2.5, the PathScale Fortran compiler supports ISO/IEC Standard 1539-2, which provides support for varying length character strings. This is an optional add-on to the Fortran standard. You can download and compile this module; it is available from this location: http://www.fortran.com/fortran/iso_varying_string.f95
3.5.4 Preprocessing Source Files with -fcoco
Beginning with release 2.4, the PathScale Fortran compiler now supports the ISO/IEC 1539-3 conditional compilation preprocessor. When you use the -fcoco option, the compiler runs this preprocessor on each individual source file before compiling that source file, overriding the default whereby files suffixed with .F, .F90, or .F95 are preprocessed with cpp but files suffixed with .f, .f90, or .f95 are not preprocessed. The ISO/IEC standard does not specify any command-line options for the preprocessor, but as an extension, we pass -I and -D options to it, just as we do for the -cpp and -ftpp preprocessors. As with the other preprocessors, an option


like -Isubdir (no trailing "/" is needed) tells the preprocessor to add subdir to the list of directories in which it will search for included files. Unlike the -cpp and -ftpp preprocessors, this one requires that its identifiers be declared with a data type, so an option like -DIVAR=5 declares a constant (not a variable) IVAR with the type integer and the value 5, while an option like -DLVAR declares a constant LVAR with the type logical and the value ".true.". Only integer and logical constants are allowed. You can use the -D option to override the value of a constant declaration for that identifier which might appear in the source file.

The standard requires that the preprocessor read a "setfile" capable of defining constants, variables, and modes of operation, but it does not specify how to find the setfile. If you use -fcoco, the preprocessor looks for coco.set in the current directory. If no such file exists, the preprocessor quietly proceeds without it. If you use an option like -fcoco=somedir/mysettings, the preprocessor looks for the file somedir/mysettings. You cannot use the -D option to override a constant declaration which appears in the setfile.

The open-source package on which this feature is based does provide additional extensions and command-line options, described at http://users.erols.com/dnagle/coco.html. To pass those options through the compiler driver to the preprocessor, you can use the -Wp, flag. For example, you can use -Wp,-m to pass the -m option to the preprocessor to turn off macro preprocessing. Note that the instructions given in that web page for passing file names to the preprocessor and identifying the setfile are not relevant when you use the PathScale compiler, since the compiler automatically passes each source file name to the preprocessor for you, captures the preprocessor output for compilation, and identifies the setfile as described in the preceding paragraphs.
More information about the -fcoco option can be found in the eko man page.
3.5.4.1 Pre-defined Macros
The PathScale compiler pre-defines some macros for preprocessing code. When you use the C preprocessor cpp with Fortran, or rely on the .F, .F90, and .F95 suffixes to use the default cpp preprocessor, the PathScale compiler uses the same preprocessor it uses for C, with the addition of the following macros:

LANGUAGE_FORTRAN 1
_LANGUAGE_FORTRAN 1
_LANGUAGE_FORTRAN90 1
LANGUAGE_FORTRAN90 1
__unix 1


unix 1
__unix__ 1

NOTE: When using an optimization level of -O1 or higher, the compiler will set and use the __OPTIMIZE__ macro with cpp.

See the complete list of macros for cpp in Section 4.2.1.1. If you use the Fortran preprocessor -ftpp, only these five macros are defined for you:

LANGUAGE_FORTRAN 1
__LANGUAGE_FORTRAN90 1
LANGUAGE_FORTRAN90 1
__unix 1
unix 1

NOTE: By default, Fortran uses cpp. You must specify the -ftpp command-line switch with Fortran code to use the Fortran preprocessor.

This command will print to stdout all of the "#define"s used with -cpp on a Fortran file:
$ echo > junk.F90; pathf95 -cpp -Wp,-dD -E junk.F90

There is no corresponding way to find out what is defined by the Fortran preprocessor (-ftpp). See Section 4.2.1.1 for information on how to find pre-defined macros in C and C++. No macros are predefined for the -fcoco preprocessor.
3.5.5 Error Numbers: The explain Command
By default, the Fortran compiler and its runtime library print brief error messages, such as this one:
lib-4081 : UNRECOVERABLE library error
  An unformatted read or write is not allowed on a formatted file.

If you set the environment variable PSC_ERR_VERBOSE, the compiler and library will print a longer explanation following each message, such as this:
lib-4081 : UNRECOVERABLE library error
  An unformatted read or write is not allowed on a formatted file.
  A Fortran READ or WRITE statement attempted an unformatted I/O
  operation on a file that was opened for formatted I/O.


Either change the I/O statement to formatted (add a FORMAT specifier) or open the file for unformatted I/O. See the description of input/output statements in your Fortran reference manual.

Since the verbose messages print more slowly and take up more room on the screen, you may wish to unset the environment variable and instead use a tool called explain to print the longer message only when you need further explanation for a particular message. When the Fortran compiler or runtime prints out an error message, it prefixes the message with a string in the format "subsystem-number". For example, "pathf95-0724". The "pathf95-0724" is the message ID string that you will give to explain. When you type explain pathf95-0724, the explain program provides a more detailed error message:
$ explain pathf95-0724
Error : Unknown statement. Expected assignment statement but found
"%s" instead of "=" or "=>".
The compiler expected an assignment statement but could not find an
assignment or pointer assignment operator at the correct point.

Another example:
$ explain pathf95-0700
Error : The intrinsic call "%s" is being made with illegal arguments.
A function or subroutine call which invokes the name of an intrinsic
procedure does not match any specific intrinsic. All dummy arguments
without the OPTIONAL attribute must match in type and rank exactly.

The explain command can also be used with iostat= error numbers. When the iostat= specifier in a Fortran I/O statement provides an error number such as 4098, or when the program prints out such an error number during execution, you can look up its meaning using the explain command by prefixing the number with lib-, as in explain lib-4098. For example:
$ explain lib-4098
A BACKSPACE is invalid on a piped file.
A Fortran BACKSPACE statement was attempted on a named or unnamed
pipe (FIFO file) that does not support backspace. Either remove the
BACKSPACE statement or change the file so that it is not a pipe. See
the man pages for pipe(2), read(2), and write(2).


3.5.6 Fortran 90 Dope Vector
Modern Fortran provides constructs that permit the program to obtain information about the characteristics of dynamically allocated objects such as the size of arrays and character strings. Examples of the language constructs that return this information include the ubound and the size intrinsics. To implement these constructs, the compiler may maintain information about the object in a data structure called a dope vector. If there is a need to understand this data structure in detail, it can be found in the source distribution in the file clibinc/cray/dopevec.h. See Appendix D for an example of a simplified version of that data structure, extracted from that file.
3.5.7 Bounds Checking
The PathScale Fortran compiler can perform bounds checking on arrays. To enable this feature, use the -C option:
$ pathf95 -C gasdyn.f90 -o gasdyn

The generated code checks all array accesses to ensure that they fall within the bounds of the array. If an access falls outside the bounds of the array, you will get a warning from the program printed on the standard error at runtime:
$ ./gasdyn
lib-4961 : WARNING
Subscript 20 is out of range for dimension 1 for array 'X'
at line 11 in file 't.f90' with bounds 1:10.

If you set the environment variable F90_BOUNDS_CHECK_ABORT to YES, then the resulting program will abort on the first bounds check violation. Obviously, array bounds checking will have an impact on code performance, so it should be enabled only for debugging and disabled in production code that is performance sensitive.
3.5.8 Pseudo-random Numbers
The pseudo-random number generator (PRNG) implemented in the standard PathScale Fortran library is a non-linear additive feedback PRNG with a 32-entry long seed table. The period of the PRNG is approximately 16*((2**32)-1).
3.6 Mixed Code
If you have a large application that mixes Fortran code with code written in other languages, and the main entry point to your application is from C or C++, you can


optionally use pathcc or pathCC to link the application, instead of pathf95. If you do, you must manually add the Fortran runtime libraries to the link line. As an example, you might do something like this:
$ pathCC -o my_big_app file1.o file2.o -lpathfstart -lpathfortran

If the main program is written in C or C++ but some procedures are written in Fortran, you may wish to call the function _PSC_ftn_init to initialize the Fortran runtime library. While standard Fortran I/O and most intrinsic functions will work correctly without this initialization, it is needed for runtime error messages, automatic stack sizing, and the intrinsics dealing with the command line arguments. You should call it prior to executing any Fortran-generated code, passing it the arguments argc and argv from the C main program:
int main(int argc, char **argv)
{
    extern void _PSC_ftn_init(int argc, char **argv);
    _PSC_ftn_init(argc, argv);
    . . .
3.6.1 Calls between C and Fortran
In calls between C and Fortran, the two issues are: mapping Fortran procedure names onto C function names, and matching argument types.

Normally a pathf90 procedure name "x" not containing an underscore creates a linker symbol "x_", and a pathf90 name "x_y" containing an underscore creates a linker symbol "x_y__" (note the second underscore). A pathcc function name, by contrast, does not append any underscores when creating a linker symbol. You can write your C code to conform to this: use "x_" in C so that it will match Fortran's "x". Or you can use the -fdecorate option, described in man pathf90, to provide a mapping from each Fortran name onto some (possibly quite different) linker symbol. Or you can use the -fno-underscoring option, but in many cases that will create symbols that conflict with those in the Fortran and C runtime libraries, so it is not the preferred choice.

Normally pathf90 passes arguments by reference, so C needs to use pointers in order to interoperate with Fortran. In many cases you can use the %val() intrinsic function in Fortran to pass an argument by value. The programmer must be careful to match argument data types. For instance, pathf90 integer*4 matches C int, integer*8 matches C long long, real matches C float (provided the C function has an explicit prototype), and doubleprecision matches C double. Fortran character is problematic because in addition to passing a pointer to the first character, it appends an integer


length-count argument to the end of the usual argument list. Fortran Cray pointers, declared with the pointer statement, correspond to C pointers, but Fortran 90 pointers, declared with the pointer attribute, are unique to Fortran. The sequence keyword makes it more likely that a Fortran 90 structure will use the same layout as a C structure, although it is wise to verify this by experiment in each case. For arrays, it is wise to limit the interface to the kinds of arrays provided in Fortran 77, since the arrays introduced in Fortran 90 add to the data structures information that C cannot understand. Thus, for example, an argument "a(5, 6)" or "a(n)" or "a(1:*)" (where "n" is a dummy argument) will pass a simple pointer that corresponds well to a C array, whereas "a(:,:)" or an allocatable array or a Fortran 90 pointer array does not correspond to anything in C.

NOTE: Fortran arrays are placed in memory in column-major order, whereas C arrays use row-major order. And, of course, one must adjust for the fact that C array indices originate at zero, whereas Fortran array indices originate at 1 by default but can be declared with other origins instead.

Calls between C++ and Fortran are more difficult, for the same reason that calls between C and C++ are difficult: the C++ compiler must "mangle" symbol names to implement overloading, and the C++ compiler must add to data structures various information (such as virtual table pointers) that other languages cannot understand. The simplest solution is to use the extern "C" declaration within the C++ source code to tell it to generate a C-compatible interface, which reduces the problem to that of interfacing C and Fortran.
3.6.1.1 Example: Calls between C and Fortran
Here are three files you can compile and execute that demonstrate calls between C and Fortran. This is the C source code (c_part.c):
#include <stdio.h>
#include <string.h>
#include <alloca.h>

extern void f1_(char *c, int *i, long long *ll, float *f, double *d,
                int *l, int c_len);

/* Demonstrate how to call Fortran from C */
void call_fortran()
{
    char *c = "hello from call_fortran";
    int i = 123;
    long long ll = 456ll;
    float f = 7.8;
    double d = 9.1;
    int nonzero = 10;  /* Any nonzero integer is .true. in Fortran */
    f1_(c, &i, &ll, &f, &d, &nonzero, strlen(c));


}

/* C function designed to be called from Fortran, passing arguments by
 * reference */
void c_reference__(double *d1, float *f1, int *i1, long long *i2,
                   char *c1, int *l1, int *l2, char *c2, char *c3,
                   int c1_len, int c2_len, int c3_len)
{
    /* A Fortran string has no null terminator, so make a local copy and
     * add a terminator. Depending on the situation, it might be
     * preferable to put the terminator in place of the first trailing
     * blank. */
    char *null_terminated_c1 = memcpy(alloca(c1_len + 1), c1, c1_len);
    char *null_terminated_c2 = memcpy(alloca(c2_len + 1), c2, c2_len);
    char *null_terminated_c3 = memcpy(alloca(c3_len + 1), c3, c3_len);
    null_terminated_c1[c1_len] = null_terminated_c2[c2_len] =
        null_terminated_c3[c3_len] = '\0';
    printf("d1=%.1f, f1=%.1f, i1=%d, i2=%lld, l1=%d, l2=%d, "
           "c1_len=%d, c2_len=%d, c3_len=%d\n",
           *d1, *f1, *i1, *i2, *l1, *l2, c1_len, c2_len, c3_len);
    printf("c1='%s', c2='%s', c3='%s'\n", null_terminated_c1,
           null_terminated_c2, null_terminated_c3);
    fflush(stdout);  /* Flush output before switching languages */
    call_fortran();
}

/* C function designed to be called from Fortran, passing arguments by
 * value */
int c_value__(double d, float f, int i, long long i8)
{
    printf("d=%.1f, f=%.1f, i=%d, i8=%lld\n", d, f, i, i8);
    fflush(stdout);  /* Flush output before switching languages */
    return 4;  /* Nonzero will be treated as ".true." by Fortran */
}

Here is the Fortran source code (f_part.f90):
program f_part
implicit none
! Explicit interface is not required, but adds some error-checking
interface
subroutine c_reference(d1, f1, i1, i2, c1, l1, l2, c2, c3)
doubleprecision d1
real f1
integer i1
integer*8 i2
character*(*) c1, c3
character*4 c2
logical l1, l2
end subroutine c_reference
logical function c_value(d, f, i, i8)
doubleprecision d
real f


integer i
integer*8 i8
end function c_value
end interface
logical l
pointer (p_user, user)
character*32 user
integer*8 getlogin_nounderscore ! File decorate.txt maps this to
external getlogin_nounderscore  ! "getlogin" without underscore
intrinsic char
! Demonstrate calling from Fortran a C function taking arguments by
! reference
call c_reference(9.8d0, 7.6, 5, 4_8, 'hello', .false., .true., &
  'from', 'f_part')
! Demonstrate calling from Fortran a C function taking arguments by
! value.
l = c_value(%val(9.8d0), %val(7.6), %val(5), %val(4_8))
write(6, "(a,l8)") "l=", l
! "getlogin" is a standard C library function which returns "char *".
! When a C function returns a pointer, you must use a Cray pointer
! to receive the address and examine the data at that address,
! instead of assigning to an ordinary variable
p_user = getlogin_nounderscore()
write(6, "(3a)") "'", user(1:index(user, char(0)) - 1), "'"
end program f_part

! Subroutine to be called from C
subroutine f1(c, i, i8, f, d, l)
implicit none
intrinsic flush
character*(*) c
integer i
integer*8 i8
real f
doubleprecision d
logical l
write(6, "(3a,2i5,2f5.1,l8)") "'", c, "'", i, i8, f, d, l
call flush(6) ! Flush output before switching languages
end subroutine f1

And here is the third file (decorate.txt):
getlogin_nounderscore getlogin


Compile and execute these three files (c_part.c, f_part.f90, and decorate.txt) like this:
$ pathf90 -Wall -intrinsic=flush -fdecorate decorate.txt f_part.f90 c_part.c
$ ./a.out
d1=9.8, f1=7.6, i1=5, i2=4, l1=0, l2=1, c1_len=5, c2_len=4, c3_len=6
c1='hello', c2='from', c3='f_part'
'hello from call_fortran'  123  456  7.8  9.1       T
d=9.8, f=7.6, i=5, i8=4
l=       T
'johndoe'
3.6.1.2 Example: Accessing Common Blocks from C
Variables in Fortran 90 modules are grouped into common blocks, one for initialized data and another for uninitialized data. It is possible to use -fdecorate to access these common blocks from C, as shown in this example:
$ cat mymodule.f90
module mymodule
public
integer :: modulevar1
doubleprecision :: modulevar2
integer :: modulevar3 = 44
doubleprecision :: modulevar4 = 55.5
end module mymodule

program myprogram
use mymodule
modulevar1 = 22
modulevar2 = 33.3
call mycfunction()
end program myprogram

$ cat mycprogram.c
#include <stdio.h>
extern struct {
    int modulevar1;
    double modulevar2;
} mymodule_data;
extern struct {
    int modulevar3;
    double modulevar4;
} mymodule_data_init;
void mycfunction()
{
    printf("%d %g\n", mymodule_data.modulevar1, mymodule_data.modulevar2);
    printf("%d %g\n", mymodule_data_init.modulevar3,
           mymodule_data_init.modulevar4);


}
$ cat dfile
.data_init.in.mymodule mymodule_data_init
.data.in.mymodule.in.mymodule mymodule_data
mycfunction mycfunction
$ pathf90 -fdecorate dfile mymodule.f90 mycprogram.c
mymodule.f90:
mycprogram.c:
$ ./a.out
22 33.3
44 55.5
3.7 Runtime I/O Compatibility
Files generated by the Fortran I/O libraries on other systems may contain data in different formats than that generated or expected by codes compiled by the PathScale Fortran compiler. This section discusses how the PathScale Fortran compiler interacts with files created by other systems.
3.7.1 Performing Endian Conversions
Use the assign command, or the ASSIGN() procedure, to perform endian conversions while doing file I/O.
3.7.1.1 The assign Command
The assign command changes or displays the I/O processing directives for a Fortran file or unit. The assign command allows various processing directives to be associated with a unit or file name. This can be used to perform numeric conversion while doing file I/O. The assign command uses the file pointed to by the FILENV environment variable to store the processing directives. This file is also used by the Fortran I/O libraries to load directives at runtime. For example:
$ FILENV=.assign
$ export FILENV
$ assign -N mips u:15

This instructs the Fortran I/O library to treat all numeric data read from or written to unit 15 as being MIPS-formatted data. This effectively means that the contents of the file will be translated from big-endian format (MIPS) to little-endian format (Intel) while being read. Data written to the file will be translated from little-endian format to big-endian format. See the assign(1) man page for more details and information.


3.7.1.2 Using the Wildcard Option
The wildcard option for the assign command is:
assign -N mips p:%

Before running your program, run the following commands:
$ FILENV=.assign
$ export FILENV
$ assign -N mips p:%

This example matches all files.
3.7.1.3 Converting Data and Record Headers
To convert numeric data in all unformatted units from big endian, and convert the record headers from big endian, use the following:
$ assign -F f77.mips -N mips g:su
$ assign -I -F f77.mips -N mips g:du

The su specifier matches all sequential unformatted open requests. The du specifier matches all direct unformatted open requests. The -F option sets the record header format to big endian (f77.mips).
3.7.1.4 The ASSIGN() Procedure
The ASSIGN() procedure provides a programmatic interface to the assign command. It takes as an argument a string specifying the assign command and an integer to store a returned error code. For example:
integer :: err
call ASSIGN("assign -N mips u:15", err)

This example has the same effect as the example in section 3.7.1.1.
3.7.1.5 I/O Compilation Flags
Two compilation flags have been added to help with I/O: -byteswapio and -convert conversion. The -byteswapio flag swaps bytes during I/O so that unformatted files on a little-endian processor are read and written in big-endian format (or vice versa). The -convert conversion flag controls the same byte swapping, with the conversion selected by its argument. To be effective, these options must be used when compiling the Fortran main program.


Setting the environment variable FILENV when running the program will override the compiled-in choice in favor of the choice established by the assign command. The -convert conversion flag can take one of three arguments:
native - no conversion (the default)
big_endian - files are big-endian
little_endian - files are little-endian
For more details, see the pathf95 man page.
3.7.2 Reserved File Units
The PathScale Fortran compiler reserves Fortran file units 5, 6, and 0.
3.8 Source Code Compatibility
This section discusses our compatibility with source code developed for other compilers. Different compilers represent types in various ways, and this may cause some problems.
3.8.1 Fortran KINDs
The Fortran KIND attribute is a way to specify the precision or size of a type. Modern Fortran uses KINDS to declare types. This system is very flexible, but has one drawback. The recommended and portable way to use KINDS is to find out what they are like this:
integer :: dp_kind = kind(0.0d0)

In actuality, some users hard-wire the actual values into their programs:
integer :: dp_kind = 8

This is an unportable practice, because some compilers use different values for the KIND of a double-precision floating point value. The majority of compilers use the number of bytes in the type as the KIND value. For floating point numbers, this means KIND=4 is 32-bit floating point, and KIND=8 is 64-bit floating point. The PathScale compiler follows this convention. Unfortunately for us and our users, this is incompatible with unportable programs written using GNU Fortran, g77. g77 uses KIND=1 for single precision (32 bits) and KIND=2 for double precision (64 bits). For integers, however, g77 uses KIND=3 for 1 byte, KIND=5 for 2 bytes, KIND=1 for 4 bytes, and KIND=2 for 8 bytes.

We are investigating the cost of providing a compatibility flag for unportable g77 programs. If you find this to be a problem, the best solution is to change your program to inquire for the actual KIND values instead of hard-wiring them.


If you are using -i8 or -r8, see section 3.4.1 for more details on usage.
3.9 Library Compatibility
This section discusses our compatibility with libraries compiled with C or other Fortran compilers. Linking object code compiled with other Fortran compilers is a complex issue. Fortran 90 or 95 compilers implement modules and arrays so differently that it is extremely difficult to attempt to link code from two or more compilers. For Fortran 77, run-time libraries for things like I/O and intrinsics are different, but it is possible to link both runtime libraries to an executable. We have experimented using object code compiled by g77. This code is not guaranteed to work in every instance. It is possible that some of our library functions have the same name but different calling conventions than some of g77's library functions. We have not tested linking object code from other compilers, with the exception of g77.
3.9.1 Name Mangling
Name mangling is a mechanism by which names of functions, procedures, and common blocks from Fortran source files are converted into an internal representation when compiled into object files. For example, a Fortran subroutine called foo gets turned into the name "foo_" when placed in the object file. We do this to avoid name collisions with similar functions in other libraries. This makes mixing code from C, C++, and Fortran easier. Name mangling ensures that function, subroutine, and common-block names from a Fortran program or library do not clash with names in libraries from other programming languages. For example, the Fortran library contains a function named "access", which performs the same function as the function access in the standard C library. However, the Fortran library access function takes four arguments, making it incompatible with the standard C library access function, which takes only two arguments. If your program links with the standard C library, this would cause a symbol name clash. Mangling the Fortran symbols prevents this from happening. By default, we follow the same name mangling conventions as the GNU g77 compiler and libf2c library when generating mangled names. Names without an underscore have a single underscore appended to them, and names containing an underscore have two underscores appended to them. The following examples should help make this clear:
molecule -> molecule_
run_check -> run_check__
energy_ -> energy___


This behavior can be modified by using the -fno-second-underscore and the -fno-underscoring options to the pathf95 compiler. The default policies for Intel ifort, PGI pgf90, Sun f90, GNU gfortran and g95 all correspond to our -fno-second-underscore option. Common block names are also mangled. Our name for the blank common block is the same as g77's (_BLNK__). PGI's compiler uses the same name for the blank common block, while Intel's compiler uses _BLANK__.
3.9.2 ABI Compatibility
The PathScale compilers support the official x86_64 Application Binary Interface (ABI), which is not always followed by other compilers. In particular, g77 does not pass the return values from functions returning COMPLEX or REAL values according to the x86_64 ABI. (Double precision REALs are OK.) For more details about what g77 does, see the "info g77" entry for the -ff2c flag. This issue is a problem when linking binary-only libraries such as Kazushige Goto's BLAS library or ACML, the AMD Core Math Library (we have not tested ACML on the EM64T version of the compiler suite). Libraries such as FFTW and MPICH don't have any functions returning REAL or COMPLEX, so there are no issues with these libraries. For linking with g77-compiled functions returning COMPLEX or REAL values, see section 3.9.3. Like most Fortran compilers, we represent character strings passed to subprograms with a character pointer, and add an integer length parameter to the end of the call list.
3.9.3 Linking with g77-compiled Libraries
If you wish to link with a library compiled by g77, and if that library contains functions that return COMPLEX or REAL types, you need to tell the compiler to treat those functions differently. Use the -ff2c-abi switch at compile time to point the PathScale compiler at a file that contains a list of functions in the g77-compiled libraries that return COMPLEX or REAL types. When the PathScale compiler generates code that calls these listed functions, it will modify its ABI behavior to match g77's expectations. The -ff2c-abi flag is used at compile time and not at link time. NOTE: You can only specify the -ff2c-abi switch once on the command line. If you have multiple g77-compiled libraries, you need to place all the appropriate symbol names into a single file.


The format of the file is one symbol per line. Each symbol should be as you would specify it in your Fortran code (i.e. do not mangle the symbol). As an example:
$ cat example-list
sdot
cdot
$

You can use the fsymlist program to generate a file in the appropriate format. For example:
$ fsymlist /opt/gnu64/lib/mylibrary.a > mylibrary-list

This will find all Fortran symbols in the mylibrary.a library and place them into the mylibrary-list file. You can then use this file with the -ff2c-abi switch. NOTE: The fsymlist program generates a list of all Fortran symbols in the library, including those that do not return COMPLEX or REAL types. The extra symbols will be ignored by the compiler.

3.9.3.1 AMD Core Math Library (ACML)
The AMD Core Math Library (ACML) incorporates BLAS, LAPACK, and FFT routines, and is designed to obtain maximum performance from applications running on AMD platforms. This highly optimized library contains numeric functions for mathematical, engineering, scientific, and financial applications. ACML is available both as a 32-bit library (for compatibility with legacy x86 applications), and as a 64-bit library that is designed to fully exploit the large memory space and improved performance offered by the x86_64 architecture (we have not tested ACML on the EM64T version of the compiler suite). To use ACML 1.5 with the PathScale Fortran compiler, use the following:
$ pathf95 foo.f bar.f -lacml

To use ACML 2.0 with the PathScale Fortran compiler, use the following:
$ pathf95 -L foo.f bar.f -lacml

ACML 2.5.1 and later, built with the PathScale compilers, is available from the AMD website at http://developer.amd.com/acml.aspx. With these later versions of ACML, the workarounds described above are unnecessary.
3.9.4 List Directed I/O and Repeat Factors
By default, when list directed I/O is used and two or more consecutive values are identical, the output uses a repeat factor.


For example:
real :: a(5)=88.0
write (*,*) a
end

This example generates the following output:
5*88.

This behavior conforms to the language standard. However, some users prefer to see multiple values instead of the repeat factor:
88., 88., 88., 88., 88.

There are two ways to accomplish this, using an environment variable and using the assign command.
3.9.4.1 Environment Variable
If the environment variable FTN_SUPPRESS_REPEATS is set before the program starts executing, then list-directed "write" and "print" statements will output multiple values instead of using the repeat factor. To output multiple values when running within the bash shell:
export FTN_SUPPRESS_REPEATS=yes

To output multiple values when running within the csh shell:
setenv FTN_SUPPRESS_REPEATS yes

To output repeat factors when running within the bash shell:
unset FTN_SUPPRESS_REPEATS

To output repeat factors when running within the csh shell:
unsetenv FTN_SUPPRESS_REPEATS
3.9.4.2 assign Command
Using the -y on option to the assign command will cause all list directed output to the specified file names or unit numbers to output multiple values; using the -y off option will cause them to use repeat factors instead. For example, to output multiple values on logical unit 6 and on any logical unit which is associated with file test2559.out, type these commands before running the program:
export FILENV=myassignfile
assign -I -y on u:6
assign -I -y on f:test2559.out


The following program would then use no repeat factors, because the first write statement refers explicitly to unit 6, the second write statement refers implicitly to unit 6 (by using "*" in place of a logical unit), and the third is bound to file test2559.out:
real :: a(5)=88.0
write (6,*) a
write (*,*) 77.0, 77.0, 77.0, 77.0, 77.0
open(unit=17, file='test2559.out')
write (17,*) 99.0, 99.0, 99.0, 99.0, 99.0
end
3.10 Porting Fortran Code
The following options can help you fix problems prior to porting your code:
-r8, -i8  Respectively promote the default representation for REAL and INTEGER types from 4 bytes to 8 bytes. Useful for porting Cray code, where integer and floating point data are 8 bytes long by default. Watch out for type mismatches with external libraries.
These sections contain helpful information for porting Fortran code:
Section 3.8.1 has information on porting code that uses KINDs, sometimes a problem when porting Fortran code
Section 3.8 has information on source code compatibility
Section 3.9 has information on library compatibility
3.11 Debugging and Troubleshooting Fortran
The flag -g tells the PathScale compilers to produce data in the form used by modern debuggers, such as PathScale's pathdb, GDB, Etnus' TotalView®, Absoft Fx2™, and Streamline's DDT™. This format is known as DWARF 2.0 and is incorporated directly into the object files. Code that has been compiled using -g will be capable of being debugged using pathdb, GDB, or other debuggers. The -g option automatically sets the optimization level to -O0 unless an explicit optimization level is provided on the command line. Debugging at higher levels of optimization is possible, but the code transformation performed by the optimizations may make it more difficult. Bounds checking is quite a useful debugging aid. This can also be used to debug allocated memory. If you are noticing numerical accuracy problems, see section 7.7 for more information on numerical accuracy.


See section 10 for more information on debugging and troubleshooting. See the PathScale Debugger User Guide for more information on pathdb.
3.11.1 Writing to Constants Can Cause Crashes
Some Fortran compilers allocate storage for constant values in read-write memory. The PathScale Fortran compiler allocates storage for constant values in read-only memory. Both strategies are valid, but the PathScale compiler's approach allows it to propagate constant values aggressively. This difference in constant handling can result in crashes at runtime when Fortran programs that write to constant variables are compiled with the PathScale Fortran compiler. A typical situation is that an argument to a subroutine or function is given a constant value such as 0 or .FALSE., but the subroutine or function tries to assign a new value to that argument.

We recommend that where possible, you fix code that assigns to constants so that it no longer does this. Such a change will continue to work with other Fortran compilers, but will allow the PathScale Fortran compiler to generate code that will not crash and will run more efficiently. If you cannot modify your code, we provide an option called -LANG:rw_const=on that will change the compiler's behavior so that it allocates constant values in read-write memory. We do not make this option the default, as it reduces the compiler's ability to propagate constant values, which makes the resulting executables slower.

You might also try the -LANG:formal_deref_unsafe option. This option tells the compiler whether it is unsafe to speculate a dereference of a formal parameter in Fortran. The default is OFF, which is better for performance. See the eko man page for more details on these two flags.
3.11.2 Runtime Errors Caused by Aliasing Among Fortran Dummy Arguments
The Fortran standards require that arguments to functions and subroutines not alias each other. As an example, this is illegal:
program bar
...
call foo(c,c)
...
subroutine foo (a,b)
integer i
real a(100), b(100)
do i = 2, 100
  a(i) = b(i) - b(i-1)
enddo


Because a and b are dummy arguments, the compiler relies on the assumption that a and b are in non-overlapping areas of memory when it optimizes the program. The resulting program when run will give wrong results. Programmers occasionally break this aliasing rule, and as a result, their programs get the wrong answer only under high levels of optimization. This sort of bug frequently is thought to be a compiler bug, so we have added this option to the compiler for testing purposes. If your failing program gets the right answer with -OPT:alias=no_parm or -WOPT:fold=off, then it is likely that your program is breaking this Fortran aliasing rule.
3.11.3 Fortran malloc Debugging
The PathScale Compiler Suite includes a feature to debug Fortran memory allocations. By setting the environment variable PSC_FDEBUG_ALLOC, memory allocations will be initialized during execution to the following values:
PSC_FDEBUG_ALLOC    Value
----------------    -----
ZERO                0
NaN                 0xffa5a5a5 (4-byte NaN)
NaN8                0xffa5a5a5fff5a5a5ll (8-byte NaN)

For example, to initialize all memory allocations to zeroes, set PSC_FDEBUG_ALLOC=ZERO before running the program. The four-byte and eight-byte NaNs will only initialize arrays that are aligned with their width (32 and 64 bits, respectively).
3.11.4 Arguments Copied to Temporary Variables
In some situations, the Fortran standard requires that actual arguments to procedure calls be copied to and from temporary variables. Often this occurs because a program employs array features introduced in the Fortran 90 standard along with procedures having traditional Fortran 77 style implicit interfaces. In particular, Fortran 77 style procedures expect all arrays to be contiguous in memory, but Fortran 90 permits arrays whose elements are scattered or strided. The copying takes time, but contiguous arrays may better use the processor cache memory. Whether the program runs faster or slower depends on whether one of those factors dominates the other, and that depends on the details of the program. Because unintended copying can slow program execution, the compiler provides optional warnings about it. The example below shows two out of many situations in which copying takes place: one in which copying is conditional on the nature of the array, and another in which copying is unconditional.


$ cat cico.f90
subroutine possible(a, n)
  implicit none
  integer :: n
  integer, dimension(n) :: a
  print '(a,25i5)', "possible:", a
end subroutine possible

program copier
  implicit none
  logical :: l
  integer :: i
  integer, target :: a(5,5) = reshape((/ (i, i=1,25) /), (/ 5, 5 /))
  integer, pointer, dimension(:,:) :: p
  read *, l
  if (l) then
    p => a
  else
    p => a(1:5:2, 1:5:2)
  endif
  ! Because "possible" does not have an explicit interface, it
  ! expects a contiguous array. Therefore, the compiler generates a
  ! runtime test to check a "contiguous" bit belonging to the
  ! pointer "p", and if the target is not contiguous, the values are
  ! copied to a temporary array before the call and copied back
  ! after the call
  call possible(p, size(p))
  ! The compiler must always copy this sequence array to a
  ! temporary variable to make it contiguous
  call possible(a((/1,2,5/),(/2,3,5/)), size(a((/1,2,5/),(/2,3,5/))))
end program copier
$ pathf90 -fullwarn -c cico.f90
call possible(p, size(p))
              ^
pathf95-1438 pathf90: CAUTION COPIER, File = cico.f90, Line = 26, Column = 17
  This argument produces a possible copy in and out to a temporary variable.
call possible(a((/1,2,5/),(/2,3,5/)), size(a((/1,2,5/),(/2,3,5/))))
              ^
pathf95-1438 pathf90: CAUTION COPIER, File = cico.f90, Line = 30, Column = 18
  This argument produces a copy in to a temporary variable.
pathf95: PathScale(TM) Fortran Version 2.9.99 (f14) Thu Dec 7, 2006 06:03:17
pathf95: 32 source lines
pathf95: 0 Error(s), 0 Warning(s), 2 Other message(s), 0 ANSI(s)
pathf95: "explain pathf95-message number" gives more information about each message

One way to minimize copying, while still taking advantage of Fortran 90 features, is to use Fortran 90 style assumed-shape and deferred-shape arrays (that is, arrays whose bounds look like "(:,:)" rather than "(2,3)" or "(n,m)") for all dummy array arguments, so that procedure calls pass a bit indicating whether the array is contiguous. This requires that the program use explicit interfaces for all procedures, with interface blocks, with module use statements, or by nesting one procedure inside another with contains. Each of those methods provides the compiler with an explicit interface from the viewpoint of the Fortran standard. NOTE: Redundant interfaces are incorrect: don't provide an interface block for a procedure whose interface is already imported via a use statement.

The compiler will also copy noncontiguous arrays to temporary variables in some situations where the standard does not require it, but where heuristics suggest that this will improve performance by better using the cache. To disable this category of copying, use the command-line option "-LANG:copyinout=off".
3.12 Fortran Compiler Stack Size
The Fortran compiler allocates data on the stack by default. Some environments set a low limit on the size of a process's stack, which may cause Fortran programs that use a large amount of data to crash shortly after they start. If the PathScale Fortran runtime environment detects a low stack size limit, it will automatically increase the size of the stack allocated to a Fortran process before the Fortran program begins executing. By default, it automatically increases this limit to the total amount of physical memory on the system, less 128 megabytes per CPU. For example, when run on a 4-CPU system with 1G of memory, the Fortran runtime will attempt to raise the stack size limit to 1G - (128M * 4), or 512M.

To have the Fortran runtime tell you what it is doing with the stack size limit, set the PSC_STACK_VERBOSE environment variable before you run a Fortran program. You can control the stack size limit that the Fortran runtime attempts to use using the PSC_STACK_LIMIT environment variable.


If this is set to the empty string, the Fortran runtime will not attempt to modify the stack size limit in any way. Otherwise, this variable must contain a number. If the number is not followed by any text, it is treated as a number of bytes. If it is followed by the letter "k" or "K", it is treated as kilobytes (1024 bytes). If "m" or "M", it is treated as megabytes (1024K). If "g" or "G", it is treated as gigabytes (1024M). If "%", it is treated as a percentage of the system's physical memory.

If the number is negative, it is treated as the amount of memory to leave free, i.e. it is subtracted from the amount of physical memory on the machine. If all of this text is followed by "/cpu", it is treated as a "per cpu" number, and that number is multiplied by the number of CPUs on the system. This is useful for multiprocessor systems that are running several processes concurrently. The value specified (implicitly or explicitly) is the memory value per process. Here are some sample stack size settings (on a 4 CPU system with 1G of memory):
Value       Meaning
100000      100000 bytes
820K        820K (839680 bytes)
-0.25g      all but 0.25G, or 0.75G total
128M/cpu    128M per CPU, or 512M total
-10M/cpu    all but 10M per CPU (all but 40M total), or 0.96G total

If the Fortran runtime encounters problems while attempting to modify the stack size limit, it will print some warning messages, but will not abort.



Section 4
The PathScale C/C++ Compiler
The PathScale C and C++ compilers conform to the following set of standards and extensions.

The C compiler:
- Conforms to ISO/IEC 9899:1990, Programming Languages - C standard
- Supports extensions to the C programming language as documented in "Using GCC: The GNU Compiler Collection Reference Manual," October 2003, for GCC version 3.3.1 (refer to section 4.4 of this document for the list of extensions that are currently not supported)
- Complies with the C Application Binary Interface as defined by the GNU C compiler (gcc) as implemented on the platforms supported by the PathScale Compiler Suite
- Supports most of the widely used command-line options supported by gcc
- Generates code that complies with the x86_64 ABI and the 32-bit x86 ABI

The C++ compiler:
- Conforms to ISO/IEC 14882:1998(E), Programming Languages - C++ standard
- Supports extensions to the C++ programming language as documented in "Using GCC: The GNU Compiler Collection Reference Manual," October 2003, for GCC version 3.3.1 (refer to section 4.4 of this document for the list of extensions that are currently not supported)
- Complies with the C++ Application Binary Interface as defined by the GNU C++ compiler (g++) as implemented on the platforms supported by the PathScale Compiler Suite
- Supports most of the widely used command-line options supported by g++
- Generates code that complies with the x86_64 ABI and the 32-bit x86 ABI

To invoke the PathScale C and C++ compilers, use these commands:
pathcc - invokes the C compiler
pathCC - invokes the C++ compiler


The command-line flags for both compilers are compatible with those taken by the GCC suite. See section 4.1 for more discussion of this.
4.1 Using the C/C++ Compilers
If you currently use the GCC compilers, the PathScale compiler commands will be familiar. Makefiles that presently work with GCC should operate with the PathScale compilers effortlessly; simply change the command used to invoke the compiler and rebuild. See section 5.7.1 for information on modifying existing scripts. The invocation of the compiler is identical to the GCC compilers, but the flags to control the compilation are different. We have sought to provide flags compatible with GCC's flag usage whenever possible, and also provide optimization features that are absent in GCC, such as IPA and LNO. Generally speaking, instead of being a single component as in GCC, the PathScale compiler is structured into components that perform different classes of optimizations. Accordingly, compilation flags are provided under group names like -IPA, -LNO, -OPT, -CG, etc. For this reason, many of the compilation flags in our compiler will differ from those in GCC. See the eko man page for more information. The default optimization level is 2. This is equivalent to passing -O2 as a flag. The following three commands are identical in their function:
$ pathcc hello.c
$ pathcc -O hello.c
$ pathcc -O2 hello.c

See section 7.1 for information about the optimization levels available for use with the compiler. To run with -Ofast or with -ipa, the flag must also be given on the link command.
$ pathCC -c -Ofast warpengine.cc
$ pathCC -c -Ofast wormhole.cc
$ pathCC -o ftl -Ofast warpengine.o wormhole.o
See section 7.3 for information on -ipa and -Ofast.
4.1.1 Accessing the GCC 4.x Front-ends for C and C++
This release is compatible with version 4.2.0 of the GNU C/C++ compiler in terms of the source language constructs they support. This is the default on Linux distributions whose compiler is GNU 4.x. On systems with GNU 3.x compilers, pathcc/pathCC will generate code compatible with GNU 3.x. You can use the "-gnu4" option to direct pathcc/pathCC to be compatible with GNU 4.x. A sample command for C is:
$ pathcc -gnu4 world.c


This default behavior can be changed in your compiler.defaults file by adding this line: -gnu4 See section 2.3 for an example compiler.defaults file. The option has no effect on pathf90 or pathf95. There are currently some limitations when using this option. Please see the Release Notes for more information.
4.2 Compiler and Runtime Features
4.2.1 Preprocessing Source Files
Before being passed to the compiler front-end, source files are optionally passed through a source code preprocessor. The preprocessor searches for certain directives in the file and, based on these directives, can include or exclude parts of the source code, include other files, or define and expand macros. All C and C++ files are passed through the C preprocessor unless the -noccp flag is specified.


4.2.1.1 Pre-defined Macros
The PathScale compiler pre-defines some macros for preprocessing code. These include the following: Table 4-1. Pre-defined Macros
__linux 1, __linux__ 1, linux 1, __unix 1, __unix__ 1, unix 1, __gnu_linux__ 1
    These macros specify the type of operating system.
__GNUC__ 4, __GNUC_MINOR__ 1, __GNUC_PATCHLEVEL__ 1, __PATHSCALE__ "3.1", __PATHCC__ 3, __PATHCC_MINOR__ 1, __PATHCC_PATCHLEVEL__ 0
    The __GNU* and __PATH* values are derived from the respective compiler version numbers, and will change with each release.
_LANGUAGE_FORTRAN 1, LANGUAGE_FORTRAN 1, _LANGUAGE_FORTRAN90 1, LANGUAGE_FORTRAN90 1
    These Fortran macros will also be used if the source file is Fortran but cpp is used.
__i386 1, __i386__ 1, i386 1
    These macros specify 32-bit x86 compilation.
__x86_64__ 1, __x86_64 1
    These macros specify 64-bit x86 compilation.
__LP64__ 1, _LP64 1
    These macros specify that long and pointer are 64-bit, while int is 32-bit.
__OPTIMIZE__ 1
    Defined when using an optimization level of -O1 or higher.
_mips 1, __mips__ 1, mips 1
    MIPS-specific. Indicates the target is a MIPS processor.
__mips64 1
    MIPS-specific. The target MIPS processor has 64-bit capability.
_MIPS_SIM _ABIN32 or _ABI64
    MIPS-specific. For the _MIPS_SIM macro, _ABIN32 indicates the -n32 ABI and _ABI64 indicates the -64 ABI.
_MIPS_ISA _MIPS_ISA_MIPS3, _MIPS_ARCH_MIPS3 1, _MIPS_ARCH "mips3", _MIPS_TUNE "mips3", _MIPS_TUNE_MIPS3 1, __mips 3
    MIPS-specific. Indicates that the target supports the MIPS3 instruction set.
__MIPSEL__ 1, __MIPSEL 1, _MIPSEL 1, MIPSEL 1
    MIPS-specific. Indicates that the target is little-endian.
_MIPS_SZPTR 32, _MIPS_SZINT 32, _MIPS_SZLONG 32
    MIPS-specific. Sizes of pointer, int, and long, in bits.

A quick way to list all the predefined cpp macros would be to compile your program with the flags -dD -keep. You can find all the defines (or predefined macros) in the resulting .i file. Here is an example for C:
$ cat hello.c
main(){
    printf ("Hello World\n");
}
$ pathcc -dD -keep hello.c
$ wc hello.i
94 278 2606 hello.i
$ cat hello.i

The hello.i file will contain the list of pre-defined macros. NOTE: Generating an .i file doesn't work well with Fortran, because if the preprocessor sends the "#define"s to the .i file, Fortran can't parse them. See section 3.5.4.1 for information on finding pre-defined macros in Fortran.


4.2.2 Pragmas
4.2.2.1 Pragma pack
In this release, we have tested and verified that the pragma pack is supported. The syntax for this pragma is:
#pragma pack (n)
This pragma specifies that each field of the next structure should be aligned to n bytes if its natural alignment is not smaller than n.
4.2.2.2 Changing Optimization Using Pragmas
Optimization flags can now be changed via directives in the user program. In C and C++, the directive is of the form:
#pragma options

Any number of these can be specified inside function scopes. Each affects only the optimization of the entire function in which it is specified. The literal string can also contain an unlimited number of different options separated by spaces. The compilation of the next function reverts back to the settings specified on the compiler command line. In this release, there are limitations to the options that are processed in this options directive, and their effects on the optimization. There is no warning or error given for options that are not processed. These directives are processed only in the optimizing backend. Thus, only options that affect optimizations are processed. In addition, it will not affect the phase invocation of the backend components. For example, specifying -O0 will not suppress the invocation of the global optimizer, though the invoked backend phases will honor the specified optimization level. Apart from the optimization level flags, only flags belonging to the following option groups are processed: -LNO, -OPT, and -WOPT.
4.2.2.3 Code Layout Optimization Using Pragmas
This pragma is applicable to C/C++. The user can provide a hint to the compiler regarding which branch of an IF-statement is more likely to be executed at runtime. This hint allows the compiler to optimize code generated for the different branches.


The directive is of the form:
#pragma frequency_hint <value>

where <value> is one of:
never: The branch is rarely or never executed.
init: The branch is executed only during initialization.
frequent: The branch is executed frequently.
The branch of the IF-statement that contains the pragma will be affected.
4.2.3 Mixing Code
If you have a large application that mixes Fortran code with code written in other languages, and the main entry point to your application is from C or C++, you can optionally use pathcc or pathCC to link the application, instead of pathf95. If you do, you must manually add the Fortran runtime libraries to the link line. See section 3.6 for details. To link object files that were generated with pathCC using pathcc or pathf95, include the option -lstdc++.
4.2.4 Linking
Note that the pathcc (C language) user needs to add -lm to the link line when calling libm functions. The second pass of feedback compilation may require an explicit -lm.
4.3 Debugging and Troubleshooting C/C++
The flag -g tells the PathScale C and C++ compilers to produce data in the form used by modern debuggers, such as pathdb or GDB. This format is known as DWARF 2.0 and is incorporated directly into the object files. Code that has been compiled using -g will be capable of being debugged using pathdb, GDB, or other debuggers. The -g option automatically sets the optimization level to -O0 unless an explicit optimization level is provided on the command line. Debugging of higher levels of optimization is possible, but the code transformation performed by the optimizations may make it more difficult. See section 10 for more information on troubleshooting and debugging. See the PathScale Debugger User Guide for more information on pathdb.


4.4 Unsupported GCC Extensions
The PathScale C and C++ Compiler Suite supports most of the C and C++ extensions supported by the GCC version 4.2.0 suite. In this release, we do not support the following extensions.

For C:
- Nested functions
- Complex integer data types: Although the PathScale Compiler Suite fully supports floating point complex numbers, it does not support complex integer data types, such as _Complex int.
- SSE3 intrinsics
- Many of the __builtin functions
- A goto outside of the block. PathScale compilers do support taking the address of a label in the current function and doing indirect jumps to it.
- Structs generated on the fly (a GCC extension); the compiler generates incorrect code for these.

For C++:
- Java-style exceptions
- java_interface attribute
- init_priority attribute


Section 5
Porting and Compatibility

5.1 Getting Started
Here are some tips to get you started compiling selected applications with the PathScale Compiler Suite.
5.2 GNU Compatibility
The PathScale Compiler Suite C, C++, and Fortran compilers are compatible with gcc and g77. Some packages check strings like the gcc version or the name of the compiler to make sure you are using gcc; you may have to work around these tests. See section 5.7.1 for more information.

Some packages continue to use deprecated features of gcc. While gcc may print a warning and continue compilation, the PathScale compilers may print an error and exit. Use the instructions in the error message to substitute an updated flag. For example, some packages specify the deprecated "-Xlinker" gcc flag to pass arguments to the linker, while the PathScale Compiler Suite uses the modern -Wl flag.

Some gcc flags may not yet be implemented. These are documented in the release notes.

If a configure script is being used, the gcc wrapper scripts that PathScale provides are frequently helpful. See section 5.7.1 for more information.
5.3 Compatibility with Other Fortran Compilers
For Fortran, the term "compatibility" can mean two different things:

- Do two compilers accept the same source code?
- Can object files generated by two different compilers be linked together?

With respect to source code, PathScale Fortran is compatible with all other compilers provided the program conforms strictly to the Fortran 95 standard. It is compatible with g77 (with relatively few exceptions, such as the meaning of kind= type parameters) even if the program uses extensions, such as additional intrinsic functions which g77 implements. With respect to linking, the PathScale Fortran compiler is not generally compatible with other Fortran compilers (such as gfortran, g95, or commercial compilers) when
source code makes use of language features beyond Fortran 77, although careful programming may make linking possible. PathScale Fortran is compatible with g77 with respect to linking, provided you use the command line option -ff2c-abi.

There are five major issues affecting linking compatibility:

1. ABI (application binary interface) and data representation: the size and encoding of each data type, and how each data type is passed as an argument in a procedure call. For example, one compiler might use the integer 1 to represent .true. while another might use -1; one compiler might interpret integer(kind=2) as a two-byte integer while another interprets it as a two-word integer.

2. Each compiler may use a different runtime library to perform tasks such as I/O, string manipulation, and certain other operations that are too bulky to perform inline. For example, in contrast with the C language, where the standard dictates that the runtime library provides functions named strcpy, strcmp, and fputs to copy, compare, and write strings, the Fortran standard merely describes the behavior of assignment using "=", operators like ".ge.", and statements like "write" and "format". The Fortran standard leaves it to the implementation to choose names for any runtime library functions used to implement that behavior.

3. Each compiler may use a different data structure (often called a "dope vector") to implement an assumed-shape array argument, allocatable array, or Fortran pointer. In contrast with the C language, the data structure is more elaborate than a simple hardware pointer, because it must be capable of describing the shape, element type, and stride of an array or a section of an array.

4. Each compiler uses a different strategy to "mangle" or "decorate" module-level identifiers to generate symbols that will not collide in the "flat" namespace of the linker. For example, two modules M1 and M2 may each define a public procedure named x, and the program may define a third Fortran 77-style external procedure also named x: all three must have different names from the point of view of the linker. One compiler might use ___M1__x to represent procedure x belonging to module M1, where another might use X.in.M1.

5. Each compiler pursues a different strategy to implement the use statement. Even if two compilers both employ a .mod file to communicate module information from one compilation to another, they generally assume different formatting of the data inside the .mod file.

For the special case of the g77 compiler, PathScale addresses issue (1) by using the same data representation for default data types, and by providing the -ff2c-abi option to address a situation where g77 deviates from the Linux standard ABI for the x86_64 machine. We address issue (2) by including the g77 runtime library in the PathScale library. Issues (3), (4), and (5) do not arise because
g77 does not support any of the Fortran 90/95 features which require a dope vector, the decoration of identifiers, or the generation of a .mod file.

For compilers other than g77, it may nevertheless be possible to link their object files with those generated by PathScale Fortran, even if the program uses features from Fortran 90 and later standards, provided one manages to circumvent incompatibilities when coding. Some tips:

1. When code generated by one compiler calls a procedure generated by another, use the Fortran 77 style of procedure call, avoiding any of the sorts of dummy arguments which would require the calls to be "explicit" in Fortran 90 and later standards. Do not use a module generated by one compiler in a procedure generated by another.

2. Use options like -fno-second-underscore and -fdecorate as needed. The gfortran, g95, ifort, pgf90, and Sun f90 compilers all behave like our -fno-second-underscore; g77 behaves like our -fsecond-underscore. These options address the name-mangling problems for Fortran 77-style external identifiers, not for Fortran 90-style module-level identifiers.

3. When linking with one compiler, explicitly specify the additional runtime library or libraries needed by the other compiler. If you need additional control over the order in which the linker scans libraries, run the linker directly, specifying the startup object file which the first compiler would use and the union of the sets of libraries which the two compilers would use. For PathScale Fortran, running pathf95 with the command-line option -show will print the names of these objects and libraries.

4. If possible, perform all I/O in code generated by one compiler. If that is not possible, make sure that all I/O related to a particular logical unit and file occurs within code generated by one compiler.
5.4 Porting Fortran
If you are porting Fortran code, see section 3.10 for more information about Fortran-specific issues.
5.4.1 Intrinsics
The PathScale Fortran compiler supports many intrinsics and also has many unique intrinsics of its own. See Appendix C for the complete list of supported intrinsics.

5.4.1.1 An Example
Here is some sample output from compiling Amber 8 using only ANSI intrinsics. You get this series of error messages:

$ pathf95 -O3 -msse2 -m32 -o fantasian fantasian.o ../../lib/random.o ../../lib/mexit.o
fantasian.o: In function `simplexrun_':
fantasian.o(.text+0xaad4): undefined reference to `rand_'
fantasian.o(.text+0xab0e): undefined reference to `rand_'
fantasian.o(.text+0xab48): undefined reference to `rand_'
fantasian.o(.text+0xab82): undefined reference to `rand_'
fantasian.o(.text+0xabbf): undefined reference to `rand_'
fantasian.o(.text+0xee0a): more undefined references to `rand_' follow
collect2: ld returned 1 exit status

The problem is that RAND is not ANSI. The solution is to build the code with the flag -intrinsic=PGI.
5.4.2 Name-mangling
Name mangling ensures that function, subroutine, and common-block names from a Fortran program or library do not clash with names in libraries from other programming languages. This makes mixing code from C, C++, and Fortran easier. See section 3.9.1 for details on name mangling.
5.4.3 Static Data
Some codes expect data to be initialized to zero and allocated in the heap. If this is the case with your code, use the -static flag when compiling.
5.5 Porting to x86_64
Keep these things in mind when porting existing code to x86_64:

Some source packages make assumptions about the locations of libraries and fail to look in lib64-named directories, resulting in unresolved symbols during the link.

For the x86 platform, use the -mcpu flag to specify the target; for example, -mcpu=x86any or -mcpu=x86_64.

5.6 Migrating from Other Compilers
Here is a suggested step-by-step approach to migrating code from other compilers to the PathScale compilers:

1. Check the compiler name in your makefile; is the correct compiler being called? For example, you may need to run configure with a line like this:
$ CC=pathcc ./configure

Change the compiler in your makefile to pathcc or pathf95.

2. Check any flags that are used to be sure that the PathScale Compiler Suite supports them. See the eko man page in Appendix E for a complete listing of supported flags.

3. If you plan on using IPA, see section 7.3 for suggestions.

4. Compile your code and look at the results.

a. Did the program compile and link correctly? Are there missing libraries that were previously linked automatically?

b. Look for behavior differences; does the program behave correctly? Are you getting the right answer (for example, with numerical analysis)?
5.7 Compatibility

5.7.1 gcc Compatibility Wrapper Script
Many software build packages check for the existence of gcc, and may even require the compiler used to be called gcc in order to build correctly. To provide complete compatibility with gcc, we provide a set of gcc compatibility wrapper scripts in /opt/pathscale/compat-gcc/bin (or /compat-gcc/bin). These scripts can be invoked with different names:

gcc, cc - to look like the GNU C compiler, and call pathcc
g++, c++ - to look like the GNU C++ compiler, and call pathCC
g77, f77 - to look like the GNU Fortran compiler, and call pathf95

To use these scripts, you must put the path to this directory in your shell's search path before the location of your system's gcc (which is usually /usr/bin). You can confirm the order in the search path by running "which gcc" after modifying your search path. The output should print the location of the gcc wrapper, not /usr/bin/gcc.
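The ordering requirement can be sketched in shell. Here a temporary directory stands in for the wrapper directory so the sketch is self-contained; on a real system you would prepend /opt/pathscale/compat-gcc/bin instead:

```shell
# Create a stand-in wrapper directory with a fake "gcc" in it.
wrapper_dir=$(mktemp -d)
printf '#!/bin/sh\necho "pathcc wrapper"\n' > "$wrapper_dir/gcc"
chmod +x "$wrapper_dir/gcc"

# Prepend the wrapper directory so it is searched before /usr/bin:
PATH="$wrapper_dir:$PATH"

# "which gcc" (or command -v) must now resolve to the wrapper,
# not to /usr/bin/gcc:
command -v gcc
```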

Section 6
Tuning Quick Reference

This section provides some ideas for tuning your code's performance with the PathScale compilers. The following sections describe a small set of tuning options that are relatively easy to try and often give good results. These options do not require Makefile changes and do not risk the correctness of your code's results. More detail on these flags can be found in the next section and in the man pages. A comprehensive list of the options for the PathScale compilers can be found in the eko man page.

6.1 Basic Optimization
Here are some things to try first when optimizing your code. The basic optimization flag -O is equivalent to -O2. This is the first flag to think about using when tuning your code. Try:

-O2

then

-O3

and then

-O3 -OPT:Ofast

For more information on the -O flags and -OPT:Ofast, see section 7.1.
6.2 IPA
Inter-Procedural Analysis (IPA), invoked most simply with -ipa, is a compilation technique that analyzes an entire program. This allows the compiler to perform optimizations without regard to which source file the code appears in, and it can improve performance significantly. IPA can be used in combination with the other optimization flags: -O3 -ipa or -O2 -ipa will typically provide increased performance over -O3 or -O2 alone. -ipa must be used in both the compile and link steps of a build. See section 7.3 for more details on how to use -ipa.

6.3 Feedback Directed Optimization (FDO)
Feedback directed optimization uses a special instrumented executable to collect profile information about the program that is then used in later compilations to tune the executable. See section 7.6 for more information.
6.4 Aggressive Optimization
The PathScale compilers provide an extensive set of additional options to cover special-case optimizations. Section 7 documents options that may significantly improve the speed or performance of your code. This section briefly introduces some of the first tuning flags to try beyond -O2 or -O3. Some of these options require knowledge of the program's algorithms and coding style, and may affect the program's correctness if applied blindly. Some of these options depend on certain coding practices to be effective.

One word of caution: the PathScale Compiler Suite, like all modern compilers, has a range of optimizations. Some produce program output identical to that of the non-optimized program; some can change the program's behavior slightly. The first class of optimizations is termed "safe" and the second "unsafe". See section 7.7 for more information on these optimizations.

-OPT:Olimit=0 is a generally safe option, but it may result in the compilation taking a long time or consuming large quantities of memory. This option tells the compiler to optimize the files being compiled at the specified levels no matter how large they are.

The option -fno-math-errno bypasses the setting of ERRNO in math functions. This can result in a performance improvement if the program does not rely on IEEE exception handling to detect runtime floating point errors.

-OPT:roundoff=2 allows fairly extensive code transformations that may result in floating point round-off or overflow differences in computations. Refer to section 7.7.4.2 and section 7.7.4 for more information.

The option -OPT:div_split=ON allows the conversion of x/y into x*(recip(y)), which may result in less accurate floating point computations. Refer to section 7.7.4.2 and section 7.7.4 for more information.

The -OPT:alias settings allow the compiler to apply more aggressive optimizations to the program.
The option -OPT:alias=typed assumes that the program has been coded in adherence with the ANSI/ISO C standard, which states that two pointers of different types cannot point to the same location in memory. Setting -OPT:alias=restrict allows the compiler to assume that pointers refer
to distinct, non-overlapping objects. If these options are specified and the program violates the assumptions being made, the program may behave incorrectly. Refer to section 7.7.1 for more information.

There are several shorthand options that can be used in place of the above options. The option -OPT:Ofast is equivalent to -OPT:roundoff=2:Olimit=0:div_split=ON:alias=typed. -Ofast is equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno. When using these shorthand options, make sure their impact is understood by building up the functionality stepwise from the equivalent individual options.

There are many more options that may help the performance of the program. These options are discussed elsewhere in this User Guide and in the associated man pages.
6.5 Compiler Flag Recommendations
As a general methodology, we usually recommend that you start tuning with -O2, then -O3, then -O3 -OPT:Ofast and then -Ofast. With -O3 -OPT:Ofast and -Ofast, you should look to see if the results are accurate. The -OPT:Ofast flag uses optimizations selected to maximize performance. Although the optimizations are generally safe, they may affect floating point accuracy due to rearrangement of computations. This effectively turns on the following optimizations:
-OPT:ro=2:Olimit=0:div_split=ON:alias=typed

If there are numerical problems with -O3 -OPT:Ofast, then try either of the following:
-O3 -OPT:Ofast:ro=1
-O3 -OPT:Ofast:div_split=OFF

Note that 'ro' is short for roundoff. -Ofast is equivalent to -O3 -ipa -OPT:Ofast -fno-math-errno -ffast-math so similar cautions apply to it as to -O3 -OPT:Ofast. To use interprocedural analysis without the "Ofast-type" optimizations, use either of the following:
-O3 -ipa
-O2 -ipa

Testing different optimizations can be automated by pathopt2. This program compiles and runs your program with a variety of compiler options and creates a sorted list of the execution times for each run.

The try5 target tests five flag combinations, which is easily done using pathopt2. The combinations are:
-O2, -O3, -O3 -ipa, -O3 -OPT:Ofast, and -Ofast.

For more information on using pathopt2, see section 7.9.
6.6 Performance Analysis
In addition to these suggestions for optimizing your code, here are some other ideas to assist you in tuning. Section 2.11 discusses figuring out where to tune your code, using time to get an overview of your code, and using pathprof to find your program's hot spots.
6.7 Optimize Your Hardware
Make sure you are optimizing your hardware as well. Section 7.8 discusses getting the best performance out of x86_64-based hardware (Opteron, Athlon 64, Athlon 64 FX, and Intel EM64T). Hardware configuration can have a significant effect on the performance of your application.

Section 7
Tuning Options

This section discusses in more depth some of the major groups of flags available in the PathScale Compiler Suite.

7.1 Basic Optimizations: The -O flag
The -O flag is the first flag to think about using. See table 7-3 for the default flag settings at various levels of optimization. -O0 (O followed by a zero) specifies no optimization; this is useful for debugging. The -g debugging flag is fully compatible with this level of optimization.

NOTE: Using -g by itself without specifying -O will change the default optimization level from -O2 to -O0 unless an optimization level is explicitly specified.

-O1 specifies minimal optimizations with no noticeable impact on compilation time compared with -O0. Such optimizations are limited to those applied within straight-line code (basic blocks), like peephole optimizations and instruction scheduling. The -O1 level of optimization minimizes compile time.

-O2 turns on only those optimizations which always increase performance, where the increased compile time (compared to -O1) is commensurate with the increased performance. This is the default if you don't use any of the -O flags. The optimizations performed at level 2 are:

For inner loops:
- Loop unrolling
- Simple if-conversion
- Recurrence-related optimizations
- Two passes of instruction scheduling
- Global register allocation based on the first scheduling pass

Global optimizations within function scopes:
- Partial redundancy elimination
- Strength reduction and loop termination test replacement
- Dead store elimination
- Control flow optimizations
- Instruction scheduling across basic blocks

-O2 implies the flag -OPT:goto=on, which enables the conversion of GOTOs into higher-level structures like FOR loops. -O2 also sets -OPT:Olimit=6000.

-O3 turns on additional optimizations which will most likely speed your program up, but may, in rare cases, slow your program down. This level includes all -O1 and -O2 optimizations and also includes, but is not limited to, the flags noted below:

- -LNO:opt=1 turns on Loop Nest Optimization (for more details, see section 7.4)
- -OPT with the following options in the OPT group (see the -opt man pages for more information):
  - -OPT:roundoff=1 (see section 7.7.4.2)
  - -OPT:IEEE_arith=2 (see section 7.7.4)
  - -OPT:Olimit=9000 (see section 6.3)
  - -OPT:reorg_common=1 (see the eko(7) man page)

NOTE: In our in-house testing, we have noticed that several codes which are slower at -O3 than -O2 are fixed by using -O3 -LNO:prefetch=0. This seems mainly to help codes that fit in cache.

7.2 Syntax for Complex Optimizations (-CG, -IPA, -LNO, -OPT, -WOPT)
The group optimizations control a variety of behaviors and can override defaults. This section covers the syntax of these options. The group options allow multiple sub-options to be set in two ways:

- separating each sub-flag by colons, or
- using multiple flags on the command line.

For example, the following command lines are equivalent:
pathcc -OPT:roundoff=2:alias=restrict wh.c pathcc -OPT:roundoff=2 -OPT:alias=restrict wh.c

Some sub-options either enable or disable a feature. To enable a feature, either specify the sub-flag name alone or append =1, =ON, or =TRUE. To disable a feature, append =0, =OFF, or =FALSE. The following command lines mean the same thing:
pathf95 -OPT:div_split:fast_complex=FALSE:IEEE_NaN_inf=OFF wh.F pathf95 -OPT:div_split=1:fast_complex=0:IEEE_NaN_inf=false wh.F

7.3 Inter-Procedural Analysis (IPA)
Software applications are normally written and organized into multiple source files that make up the program. The compilation process, usually defined by a Makefile, invokes the compiler to compile each source file, called a compilation unit, separately. This traditional build process is called separate compilation. After all compilation units have been compiled into .o files, the linker is invoked to produce the final executable.

The problem with separate compilation is that it does not provide the compiler with complete program information. The compiler has to make worst-case assumptions at places in the program that access external data or call external functions. In whole program optimization, the compiler can collect information over the entire program, so it can make better decisions about whether it is safe to perform various optimizations. Thus, the same optimization performed under whole program compilation becomes much more effective, and more types of optimization can be performed than under separate compilation.

This section presents the compilation model that enables whole program optimization in the PathScale compiler and how it relates to the -ipa flag that invokes it at the user level. Various analyses and optimizations performed by IPA are described, along with how IPA improves the quality of the backend optimization. Various IPA-related flags that can be used to tune for program performance are presented and described. Finally, we give an example of the difference that IPA makes in the performance of the SPEC CPU2000 benchmark suite.
7.3.1 The IPA Compilation Model
Inter-procedural compilation is the mechanism that enables whole program compilation in the PathScale compiler. The mechanism requires a different compilation model than separate compilation; this new mode of compilation is used when the -ipa flag is specified. Whole program compilation requires the entire program to be presented to the compiler for analysis and optimization. This is possible only after a link step is applied. Ordinarily, the link step is applied to .o files, after all optimization and code generation have been performed. In the IPA compilation model, the link step is applied very early in the compilation process, before most optimization and code generation. In this scenario, the program code being linked is not in object code format. Instead, it is in the form of the intermediate representation (IR) used during compilation and optimization. After the program has been linked at the IR level, inter-procedural analysis and optimization are applied to the whole program. Subsequently, compilation continues with the backend phases to generate the final object code.

The IPA compilation model (see Figure 7-1) has been implemented with ease of use as one of its main objectives. At the user level, it is sufficient to just add the -ipa flag to both the compile line and the link line. Thus, users can avoid having to restructure their Makefiles to use IPA. To make this work, we introduce a new kind of .o file that we call an IPA .o. These are .o files in which the program code is in the form of IR, and they are different from ordinary .o files, which contain object code. IPA .o files are produced when a file is compiled with the flags -ipa -c. IPA .o files can only be linked by the IPA linker. The IPA linker is invoked by adding the -ipa flag to the link command. This appears as if it is the final link step. In reality, this link step performs the following tasks:

1. Invokes the IPA linker
2. Performs inter-procedural analysis and optimization on the linked program
3. Invokes the backend phases to optimize and generate the object code
4. Invokes the real linker to produce the final executable

Under IPA compilation, the user will notice that the compilation of separate files proceeds very fast, because it does not involve the backend phases. On the other hand, the linking phase will appear much slower because it now encompasses the compilation and optimization of the entire program.
7.3.2 Inter-procedural Analysis and Optimization
We call the phase that operates on the IR of the linked program IPA, for Inter-Procedural Analysis, but its tasks can be divided into two categories:

- Analysis to collect information over the entire program
- Optimization to transform the program so it can run faster
7.3.2.1 Analysis
IPA first constructs the program call graph. Each node in the call graph corresponds to a function in the program; the graph represents the caller-callee relationships in the program. Once the call graph is built, based on different inlining heuristics, IPA prepares a list of function calls where it wants to inline the callee into the caller. Based on the call graph, IPA computes the mod-ref information for the program variables. This represents whether a variable is modified or referenced inside a function call. IPA also computes alias information for all the program variables. Whenever a variable has its address taken, it can potentially be pointed to by a pointer. Places that dereference or store through the pointer potentially access the variable. IPA's alias analysis keeps track of this information so that in the presence of pointer
accesses, as few variables are affected as possible so they can be optimized more aggressively. The mod-ref and alias information collected by IPA are not just used by IPA itself. The information is also recorded in the program representation so the optimizations in the backend phases also benefit.
7.3.3 Optimization
The most important optimization performed by IPA is inlining, in which the call to a function is replaced by the actual body of the function. Inlining is most versatile in IPA because all the user function definitions are visible to it. Apart from eliminating the function call overhead, inlining increases the optimization opportunities of the backend phases by letting them work on larger pieces of code. For instance, inlining may result in the formation of a loop nest that enables aggressive loop transformations.

Inlining requires careful benefit analysis because overdoing it may result in performance degradation. The increased program size can cause a higher instruction cache miss rate. If a function is already quite large, inlining may result in the compiler running out of registers, so it has to use memory more often, which causes program slow-down. In addition, too much inlining can slow down the later phases of the compilation process.

Many function calls pass constants (including addresses of variables) as parameters. Replacing a formal parameter by its known constant value helps in the optimization of the function body. Very often, part of the code of the function can be determined to be useless and deleted. Function cloning creates different clones of a function with its parameters customized to the forms of the calls. It provides a subset of the benefits of inlining without increasing the size of the function that contains the call. Like inlining, it also increases the total size of the program. If IPA can determine that all the calls pass the same constant parameter, it will perform constant propagation for the parameter. This has the same benefit as function cloning but does not increase the size of the program.

[Figure 7-1. IPA Compilation Model -- diagram: each source file is compiled by the language front-end with pathcc -ipa -c into an IPA .o file; the link step pathcc -ipa *.o runs IPA over the linked IR together with other .o's, .a's, and .so's, the backend then produces ordinary .o files, and ld emits a.out.]

Constant propagation also applies to global variables. If a global variable is found to be constant throughout the entire program execution, IPA will replace the variable by the constant value.

Dead variable elimination finds global variables that are never used over the program and deletes them. These variables are often exposed by IPA's constant propagation. Dead function elimination finds functions that are never called and deletes them. They can be the by-product of inlining and cloning.

Common padding applies to common blocks in Fortran programs. Ordinarily, compilers are incapable of changing the layout of the user variables in a common block, because this has to be coordinated among all the subroutines that use the same common block, and the subroutines may belong to different compilation units. But under IPA, all the subroutines are available. The padding improves the
alignments of the arrays so they can be accessed more efficiently and even vectorized. The padding can also reduce data cache conflicts during execution.

Common block splitting also applies to common blocks in Fortran programs. This splits a common block into a number of smaller blocks, which also reduces data cache conflicts during execution.

Procedure re-ordering lays out the functions of the program in an order based on their call relationship. This can reduce thrashing in the instruction cache during execution.
7.3.4 Controlling IPA
Although the compiler tries to make the best decisions regarding how to optimize a program, it is hard to make the optimal choice in general. Thus, the compiler provides many compilation options so that users can tune for the peak performance of their programs. This section presents the IPA-related compilation options that are useful in tuning programs.

First, it is worth mentioning that IPA is one of the compilation phases that can benefit substantially from feedback compilation. In feedback compilation, a feedback data file containing a profile of a typical run of the program is presented to the compiler. This enables IPA to make better decisions regarding which functions to inline and clone. By ensuring that busy callers and callees are placed next to each other, IPA's procedure re-ordering can also be more effective. Feedback compilation is enabled by the -fb-create and -fb-opt options. See section 7.6 for more details.
7.3.4.1 Inlining
There are actually two incarnations of the inliner in the PathScale compiler, depending on whether -ipa is specified. This is because inlining is nowadays a language feature and has to be performed independent of IPA. The inliner invoked when -ipa is not specified is the lightweight inliner, and it can only operate on a single compilation unit. The lightweight inliner does not do automatic inlining. It inlines strictly according to the C++ language requirements, the C inline keyword, or any -INLINE options specified by the user. It may be invoked by default. The basic options to control inlining in the lightweight inliner are:

-inline or -INLINE causes the lightweight inliner to be invoked when -ipa is not specified.

-INLINE:=off suppresses the invocation of the lightweight inliner.

The options below are applicable to both the lightweight inliner and IPA's inliner:

-INLINE:all performs all possible inlining. Since this results in code bloat, it should only be used if the program is small.



7 – Tuning Options: Inter-Procedural Analysis (IPA)

-INLINE:list=ON makes the inliner list its actions on the fly. This is a useful option for finding out which functions are getting inlined, which functions are not being inlined, and why. Thus, if you want to force or suppress the inlining of a function, tweaking the inlining controls based on the reasons reported by this flag should help.

-INLINE:must=name1[,name2,...] forces inlining for the named functions.

-INLINE:never=name1[,name2,...] suppresses inlining for the named functions.

When -ipa is specified, IPA invokes its own inliner and the lightweight inliner is not invoked. IPA's inliner automatically determines additional functions to inline beyond those that are required. Small callees or callers are favored over larger ones. If profile data is available, calls executed more frequently are preferred. Otherwise, calls inside loops are preferred. Leaf routines (functions containing no calls) are also favored. Inlining continues until no more calls satisfy the inlining criteria, which can be controlled by the inlining options:

-IPA:inline=OFF turns off IPA's inliner; the lightweight inliner is also suppressed since IPA is invoked. Default is ON.

-INLINE:none turns off automatic inlining by IPA, but required inlining implied by the language or specified by the user is still performed. By default, automatic inlining is turned ON.

-IPA:specfile=filename directs the compiler to open the given file to read more -IPA: or -INLINE: options.

The following options can be used to tune the aggressiveness of the inliner. Very aggressive inlining can cause performance degradation, as discussed in section 7.3.3.

-OPT:Olimit=N specifies the size limit N, where N is computed from the number of basic blocks that make up a function; inlining will never cause a function to exceed this size limit. The default is 6000 under -O2 and 9000 under -O3. The value 0 means no limit is imposed.
-IPA:space=N specifies that inlining should continue until a factor of N% increase in code size is reached. The default is 100%. If the program size is small, the value of N could be increased.

-IPA:plimit=N suppresses inlining into a function once its size reaches N, where N is measured in terms of the number of basic blocks and the number of calls inside a function. The default is 2500.

-IPA:small_pu=N specifies that a function with size smaller than N basic blocks is not subject to the -IPA:plimit restriction. The default is 30.

-IPA:callee_limit=N specifies that a function whose size exceeds this limit will never be automatically inlined by IPA. The default is 500.


-IPA:min_hotness=N is applicable only under feedback compilation. A call site's invocation count must be at least N before it can be inlined by IPA. The default is 10.

-INLINE:aggressive=ON increases the aggressiveness of the inlining, so that more non-leaf and out-of-loop calls are inlined. Default is OFF.

We mentioned that leaf functions are good candidates to be inlined. These functions do not contain calls that may inhibit various backend optimizations. To amplify the effect of leaf functions, IPA provides two options that exploit its call-tree-based inlining feature. This is based on the fact that a function that calls only leaf functions can itself become a leaf function if all of its calls are inlined. This in turn can be applied repeatedly up the call graph. In the description of the following two options, a function is said to be at depth N if it is never more than N edges from a leaf node in the call graph. A leaf function has depth 0.

-IPA:maxdepth=N causes IPA to inline all routines at depth N in the call graph, subject to the space limitation.

-IPA:forcedepth=N causes IPA to inline all routines at depth N in the call graph, regardless of the space limitation.
7.3.5 Cloning
There are two options for controlling cloning:

-IPA:multi_clone=N specifies the maximum number of clones that can be created from a single function. The default is 0, which means cloning is turned off by default.

-IPA:node_bloat=N specifies the maximum percentage growth in the number of procedures, relative to the original program, that cloning can produce. The default is 100.
7.3.6 Other IPA Tuning Options
The following options are unrelated to inlining and cloning, but useful in tuning:

-IPA:common_pad_size=N specifies that common block padding should use a pad size of up to N bytes. The default value is 0, which specifies that the compiler will determine the best padding size.

-IPA:linear=ON enables linearization of array references. When inlining Fortran subroutines, IPA tries to map formal array parameters to the shape of the actual parameters. The default is OFF, which means IPA will suppress the inlining if it cannot do the mapping. Turning this option ON instructs IPA to still perform the inlining but linearize the array references. Such linearization may cause performance problems, but the inlining may produce more performance gain.


-IPA:pu_reorder=N controls IPA's procedure reordering optimization. A value of 0 disables the optimization. N = 1 enables reordering based on the frequency with which different procedures are invoked. N = 2 enables procedure reordering based on caller-callee relationships. The default is 0.

-IPA:field_reorder=ON enables IPA's field reordering optimization to minimize data cache misses. This optimization is based on reference patterns of fields in large structs, learned during feedback compilation. The default is OFF.

-IPA:ctype=ON optimizes interfaces to constructs defined in the standard header file ctype.h by assuming that the program will not run in a multi-threaded environment. The default is OFF.
7.3.6.1 Disabling Options
The following options are for disabling various optimizations in IPA. They are useful for studying the effects of the optimizations.

-IPA:alias=OFF disables IPA's alias and mod-ref analyses.
-IPA:addressing=OFF disables IPA's address-taken analysis, which is a component of the alias analysis.
-IPA:cgi=OFF disables constant propagation for global variables (constant global identification).
-IPA:cprop=OFF disables constant propagation for parameters.
-IPA:dfe=OFF disables dead function elimination.
-IPA:dve=OFF disables dead variable elimination.
-IPA:split=OFF disables common block splitting.
7.3.7 Case Study on SPEC CPU2000
This section presents experimental data to show the importance of IPA in improving program performance. Our experiment is based on the SPEC CPU2000 benchmark suite compiled using release 1.2 of the PathScale compiler. The compiled benchmarks are run on a 1.4 GHz Opteron system. Two sets of data are shown here. The first set studies the effects of using the single option -ipa. The second set shows the effects of additional IPA-related tuning flags on the same files.

Table 7-1. Effects of IPA on SPEC CPU2000 Performance

Benchmark     Time w/o -ipa   Time with -ipa   Improvement
164.gzip      170.7 s         164.7 s          3.5%
175.vpr       202.4 s         192.3 s          5%
176.gcc       113.6 s         113.2 s          0.4%


Table 7-1. Effects of IPA on SPEC CPU2000 Performance (Continued)

Benchmark     Time w/o -ipa   Time with -ipa   Improvement
181.mcf       391.9 s         390.8 s          0.3%
186.crafty    83.5 s          83.4 s           0.1%
197.parser    301.4 s         289.3 s          4%
252.eon       152.8 s         126.8 s          17%
253.perlbmk   196.2 s         192.3 s          2%
254.gap       153.5 s         128.6 s          16.2%
255.vortex    175.2 s         132.1 s          24.6%
256.bzip2     210.2 s         181.0 s          13.9%
300.twolf     376.5 s         362.2 s          3.8%
168.wupwise   220.0 s         161.5 s          26.6%
171.swim      181.4 s         180.7 s          0.4%
172.mgrid     184.7 s         182.3 s          1.3%
173.applu     282.5 s         245.2 s          13.2%
177.mesa      155.4 s         131.5 s          15.4%
178.galgel    150.4 s         149.9 s          0.3%
179.art       245.7 s         221.1 s          10%
183.equake    143.7 s         143.2 s          0.3%
187.facerec   154.3 s         147.4 s          4.5%
188.ammp      266.5 s         261.7 s          1.8%
189.lucas     165.9 s         167.9 s          -1.2%
191.fma3d     239.6 s         244.6 s          -2.1%
200.sixtrack  265.0 s         276.9 s          -4.5%
301.apsi      280.7 s         273.7 s          2.5%

Table 7-1 shows how -ipa affects the base runs of the CPU2000 benchmarks. IPA improves the running times of 17 of the 26 benchmarks; the improvements range from 1.3% to 26.6%. Six benchmarks improve by less than 0.5%, which is within the noise threshold. Three FP benchmarks slow down by 1.2% to 4.5% under -ipa. The slowdown indicates that these benchmarks do not benefit from the default settings of the IPA parameters. By using additional IPA tuning flags, such slowdown can often be converted to a performance gain. The average performance improvement over all the benchmarks listed in Table 7-1 is 6%.

Table 7-2. Effects of IPA tuning on some SPEC CPU2000 benchmarks
Benchmark     Time: peak flags   Time: peak flags   Improvement   IPA Tuning Flags
              w/o IPA tuning     with IPA tuning
181.mcf       325.3 s            275.5 s            15.3%         -IPA:field_reorder=on -IPA:ctype=on
197.parser    296.5 s            245.2 s            17.3%         -IPA:min_hotness=5:plimit=20000 -IPA:space=1000:linear=on
253.perlbmk   195.1 s            177.7 s            8.9%          -IPA:plimit=50000:callee_limit=5000
168.wupwise   147.7 s            129.7 s            12.2%         -INLINE:aggressive=on
187.facerec   144.6 s            141.6 s            2.1%          -IPA:plimit=1800

Table 7-2 shows the effects of using additional IPA tuning flags on the peak runs of the CPU2000 benchmarks. In the peak runs, each benchmark can be built with its own combination of any number of tuning flags. We started with the peak flags of the benchmarks used in PathScale's SPEC CPU2000 submission, and we found that five of the benchmarks use IPA tuning flags. Table 7-2 lists these five benchmarks. The second column gives the running times when the IPA-related tuning flags are omitted. The third column gives the running times with the IPA-related tuning flags. The fifth column lists their IPA-related tuning flags. As this second table shows, proper IPA tuning can produce major improvements in applications.
7.3.8 Invoking IPA
Inter-procedural analysis is invoked in several possible ways: -ipa, -IPA, and implicitly via -Ofast. IPA can be used with any optimization level, but gives the biggest potential benefit when combined with -O3. The -Ofast flag turns on -ipa as part of its many optimizations. When compiling with -ipa the .o files that are created are not regular .o files. IPA uses the .o files in its analysis of your program, and then does a second compilation using that information to optimize the executable. The IPA linker checks to see if the entire program is compiled with the same set of optimization options. If different optimization options are used, IPA will give a warning:
Warning: Inconsistent optimization options detected between files involved in

For example, the following invocation will generate this warning for two C files a.c and b.c.


~ $ pathcc -O2 -ipa -c a.c
~ $ pathcc -O3 -ipa -c b.c
~ $ pathcc -ipa a.o b.o

The user can pass consistent optimization options to the individual compilations to remove the warning. In the above example, the user can pass either -O2 or -O3 to both files. The -ipa flag by itself implies -O2 -ipa, because -O2 is the default. Flags like -ipa can be used in combination with a very large number of other flags, but some typical combinations with the -O flags are shown below: -O3 -ipa or -O2 -ipa is a typical additional attempt at improved performance over the -O3 or -O2 flag alone. -ipa needs to be used in both the compile and link steps of a build. Using IPA with your program is usually straightforward. If you have only a few source files, you can simply use it like this:
pathf95 -O3 -ipa main.f subs1.f subs2.f

If you compile files separately, the *.o files generated by the compiler do not actually contain object code; they contain a representation of the source code. Actual compilation happens at link time. The link command also needs the -ipa flag added. For example, you could separately compile and then link a series of files like this:
pathf95 -c -O3 -ipa main.f
pathf95 -c -O3 -ipa subs1.f
pathf95 -c -O3 -ipa subs2.f
pathf95 -O3 -ipa main.o subs1.o subs2.o

Currently, there is a restriction that each archive (for example libfoo.a) must contain either .o files compiled with -ipa or .o files compiled without -ipa, but not both. Note that, in a non-IPA compile, most of the time is incurred compiling the files to create the object files (the .o's), and the link step is quite fast. In an IPA compile, the creation of .o files is very fast, but the link step can take a long time. The total compile time can be considerably longer with IPA than without.

When invoking the final link phase with -ipa (for example, pathcc -ipa -o foo *.o), significant portions of this process can be done in parallel on a system with multiple processing units. To use this feature of the compiler, use the -IPA:max_jobs flag. Here are the options for the -IPA:max_jobs flag: -IPA:max_jobs=N This option limits the maximum parallelism when invoking the compiler after IPA to (at most) N compilations running at once. The option can take the following values:


0 = The parallelism chosen is equal to the number of CPUs, the number of cores, or the number of hyperthreading units in the compiling system, whichever is greatest.
1 = Disable parallelization during compilation (default)
>1 = Specifically set the degree of parallelism
7.3.9 Size and Correctness Limitations to IPA
IPA often works well on programs up to 100,000 lines, but is not recommended for use in larger programs in this release.
7.4 Loop Nest Optimization (LNO)
If your program has many nests of loops, you may want to try some of the Loop Nest Optimization group of flags. This group defines transformations and options that can be applied to loop nests. One of the nice features of the PathScale compilers is that its powerful Loop Nest Optimization feature is invoked by default at -O3. This feature can provide up to a 10-20x performance advantage over other compilers on certain matrix operations at -O3. In rare circumstances, this feature can make things slower, so you can use -LNO:opt=0 to disable nearly all loop nest optimization. Trying to make an -O2 compile faster by adding -LNO:opt=0 will not work, because the -LNO feature is only active with -O3 (or -Ofast, which implies -O3).

Some of the features that one can control with the -LNO: group are:

  Loop fusion and fission
  Blocking to optimize cache line reuse
  Cache management
  TLB (Translation Lookaside Buffer) optimizations
  Prefetch

In this section we will highlight a few of the LNO options that have frequently been valuable.
7.4.1 Loop Fusion and Fission
Sometimes loop nests have too few instructions, and consecutive loops should be combined to improve utilization of CPU resources. Another name for this process is loop fusion.

Sometimes a loop nest will have too many instructions, or deal with too many data items in its inner loop, leading to too much pressure on the registers, resulting in spills of registers to memory. In this case, splitting loops can be beneficial. Like splitting an atom, splitting loops is termed fission. These are the LNO options to control these transformations:

-LNO:fusion=n performs loop fusion; n: 0 off, 1 conservative, 2 aggressive. Level 2 implies that outer loops in consecutive loop nests should be fused, even if it is found that not all levels of the loop nests can be fused. The default level is 1 (standard outer loop fusion), but 2 has been known to benefit a number of well-known codes.

-LNO:fission=n performs loop fission; n: 0 off, 1 standard, 2 try fission before fusion. The default level is 0, but 2 has been known to benefit a number of well-known codes.

Be careful when mixing the above two flags, because fusion has some precedence over fission: if -LNO:fission=[1 or 2] and -LNO:fusion=[1 or 2], then fusion is performed.

-LNO:fusion_peeling_limit=n controls the limit for the number of iterations allowed to be peeled in fusion, where n has a default of 5 but can be any non-negative integer. Peeling is done when the iteration counts in consecutive loops are different, but close, and several iterations are replicated outside the loop body to make the loop counts the same.
7.4.2 Cache Size Specification
The PathScale compilers are currently targeted primarily at the Opteron CPU, so they assume an L2 cache size of 1MB. Athlon 64 can have either a 512KB or 1MB L2 cache. If your target machine is Athlon 64 and you have the smaller cache size, then setting -LNO:cs2=512k could help. You can also specify your target machine instead, using -march=athlon64. That will automatically set the standard cache sizes for that machine. Here is a more general description of some of what is available.
-LNO:cs1=n,cs2=n,cs3=n,cs4=n

This option specifies the cache size. n can be 0 or a positive integer followed by one of the following letters: k, K, m, or M. These letters specify the cache size in Kbytes or Mbytes. Specifying 0 indicates there is no cache at that level.

  cs1 is the primary cache
  cs2 refers to the secondary cache
  cs3 refers to memory
  cs4 is the disk


Default cache size for each type of cache depends on your system. Use -LIST:options=ON to see the default cache sizes used during compilation. With a smaller cache, the cache set associativity is often decreased as well. The flagset: -LNO:assoc1=n,assoc2=n,assoc3=n,assoc4=n can define this appropriately for your system. Once again, the above flags are already set appropriately for Opteron.
7.4.3 Cache Blocking, Loop Unrolling, Interchange Transformations
Cache blocking, also called 'tiling', is the process of choosing the appropriate loop interchanges and loop unrolling sizes at the correct levels of the loop nests so that cache reuse can be optimized and memory accesses reduced. This whole LNO feature is on by default, but can be turned off with -LNO:blocking=off.

-LNO:blocking_size=n specifies a block size that the compiler must use when performing any blocking, where n is a positive integer that represents the number of iterations.

-LNO:interchange is on by default; setting it to 0 disables the loop interchange transformation in the loop nest optimizer.

The LNO group controls outer loop unrolling, while the -OPT: group controls inner loop unrolling. Here are the major -LNO: flags to control loop unrolling:

-LNO:outer_unroll_max,ou_max=n specifies that the compiler may unroll outer loops in a loop nest by up to n per loop, but no more. The default is 10.

-LNO:ou_prod_max=n indicates that the product of the unrolling levels of the outer loops in a given loop nest is not to exceed n, where n is a positive integer. The default is 16.

To be more specific about how much unrolling is to be done, use -LNO:outer_unroll,ou=n. This indicates that exactly n outer loop iterations should be unrolled, if unrolling is legal. For loops where outer unrolling would cause problems, unrolling is not performed.
7.4.4 Prefetch
The LNO group can provide guidance to the compiler about the level and type of prefetching to enable. General guidance on how aggressively to prefetch is specified by -LNO:prefetch=n, where n=1 is the default level. n=0 disables prefetching in loop nests, while n=2 means to prefetch more aggressively than the default. -LNO:prefetch_ahead=n defines how many cache lines ahead of the current data being loaded should be prefetched. The default is n=2 cache lines.


7.4.5 Vectorization
Vectorization is an optimization technique that works on multiple pieces of data at once. For example, the compiler will turn a loop computing the mathematical function sin() into a call to the vsin() function, which is twice as fast.

The use of vectorized versions of math library functions such as sin() and cos() is controlled by the flag -LNO:vintr=0|1|2. 0 turns off vectorization of math intrinsics, while 1 is the default. Under -LNO:vintr=2 the compiler will vectorize all math functions. Note that vintr=2 could be unsafe, in that the vector forms of some of the functions could have accuracy problems.

Vectorization of user code (excluding these mathematical functions) is controlled by the flag -LNO:simd[=(0|1|2)], which enables or disables inner loop vectorization. 0 turns off the vectorizer, 1 (the default) causes the compiler to vectorize only if it can determine that there is no undesirable performance impact due to sub-optimal alignment, and 2 will vectorize without any constraints (this is the most aggressive).

-LNO:simd_verbose=ON prints vectorizer information (from vectorizing user code) to stdout.

-LNO:vintr_verbose=ON prints information about whether or not the math intrinsic functions were vectorized.

See the eko man page for more information.
7.5 Code Generation (-CG:)
The code generation group governs some aspects of instruction-level code generation that can have benefits for code tuning.

-CG:gcm=OFF turns off the instruction-level global code motion optimization phase. The default is ON.

-CG:load_exe=n specifies the threshold for subsuming a memory load operation into the operand of an arithmetic instruction. The value 0 turns off this subsumption optimization. By default this subsumption is performed only when the result of the load has only one (n=1) use. The subsumption is not performed if the number of times the result of the load is used exceeds the value n, a non-negative integer. We have found that load_exe=2 or load_exe=0 is occasionally profitable. The default for the 64-bit ABI and Fortran is n=2; otherwise the default is n=1.

-CG:use_prefetchnta=ON instructs the compiler to use the prefetch operation that assumes the data is Non-Temporal at All (NTA) levels of the cache hierarchy. This is for data streaming situations in which the data will not need to be re-used soon. The default is OFF.


7.6 Feedback Directed Optimization (FDO)
Feedback directed optimization uses a special instrumented executable to collect profile information about the program; for example, it records how frequently every if () statement is true. This information is then used in later compilations to tune the executable. FDO is most useful if a program's typical execution is roughly similar to the execution of the instrumented program on its input data set; if different input data has dramatically different if () frequencies, using FDO might actually slow down the program. This section also discusses how to invoke this feature with the -fb-create and -fb-opt flags. NOTE: If the -fb-create and -fb-opt compiles are done with different compilation flags, it may or may not work, depending on whether the different compilation flags cause different code to be seen by the phase that is performing the instrumentation/feedback. We recommend using the same flags for both instrumentation and feedback.

FDO requires compiling the program at least twice. In the first pass:
pathcc -O3 -ipa -fb-create fbdata -o foo foo.c

The executable foo will contain extra instrumentation library calls to collect feedback information; this means foo will actually run a bit slower than normal. We are using fbdata for the file name in this example; you can use any name for your file. Next, run the program foo with an example dataset:
./foo

During this run, a file with the prefix "fbdata" will be created, containing feedback information. The file name you use will become the prefix for your output file. For example, the output file from this example dataset might be named fbdata.instr0.ab342. Each file will have a unique string as part of its name so that files can't be overwritten. To use this data in a subsequent compile:
pathcc -O3 -ipa -fb-opt fbdata -o foo foo.c

This new executable should run faster than a non-FDO foo, and will not contain any instrumentation library calls. Experiment to see if FDO provides significant benefit for your application. More details on feedback compilation with the PathScale compilers can be found under the -fb-create and -fb-opt options in the eko man page.


7.7 Aggressive Optimizations
The PathScale Compiler Suite, like all modern compilers, has a range of optimizations. Some produce identical program output to the original, some can change the program's behavior slightly. The first class of optimizations is termed "safe" and the second "unsafe". As a general rule, our -O1,-O2,-O3 flags only perform "safe" optimizations. But the use of "unsafe" optimizations often can produce a good speedup in a program, while producing a sufficiently accurate result. Some "unsafe" optimizations may be "safe" depending on the coding practices used. We recommend first trying "safe" flags with your program, and then moving on to "unsafe" flags, checking for incorrect results and noting the benefit of unsafe optimizations. Examples of unsafe optimizations include the following.
7.7.1 Alias Analysis
Both C and Fortran have occasions where it is possible that two variables might occupy the same memory. For example, in C, two pointers might point to the same location, such that writing through one pointer changes the value of the variable pointed to by the other. While the C standard prohibits some kinds of aliasing, many real programs violate these rules, so the aliasing behavior of the compiler is controlled by the -OPT:alias flag. See section 7.7.4.2 for more information.

Aliases are hidden definitions and uses of data due to:

  Accesses through pointers
  Partial overlap in storage locations (e.g. unions in C)
  Procedure calls for non-local objects
  Raising of exceptions

The compiler normally has to assume that aliasing will occur. The compiler does alias analysis to identify when there is no alias, so later optimizations can be performed. Certain C and C++ language rules allow some level of alias analysis. Fortran has additional rules which make it possible to rule out aliasing in more situations: subroutine parameters have no aliases, and side effects of calls are limited to global variables and actual parameters. For C or C++, the coding style can help the compiler make the right assumptions. Using type qualifiers such as const, restrict, or volatile can help the compiler. Furthermore, if you tell the compiler which assumptions it may make about your program, more optimizations can then be applied. The following are some of the aliasing models you can specify, listed in order of the increasingly stringent, and potentially dangerous, assumptions you are telling the compiler to make about your program:

7-19


7 ­ Tuning Options Aggressive Optimizations

-OPT:alias=any is the default level; it implies that any two memory references can be aliased.

-OPT:alias=typed activates the ANSI rule that objects are not aliased if they have different base types. This option is activated by -Ofast.

-OPT:alias=unnamed assumes that pointers never point to named objects.

-OPT:alias=restrict tells the compiler to assume that all pointers are restricted pointers and point to distinct, non-overlapping objects. This allows the compiler to invoke as many optimizations as if the program were written in Fortran. A restricted pointer behaves as though the C 'restrict' keyword had been used with it in the source code.

-OPT:alias=disjoint says that any two pointer expressions are assumed to point to distinct, non-overlapping objects.

To make the opposite assertion about your program's behavior, put 'no_' before the value. For example, -OPT:alias=no_restrict means that distinct pointers may point to overlapping storage.

Additional -OPT:alias values are relevant to Fortran programmers in some situations:

-OPT:alias=cray_pointer asserts that an object pointed to by a Cray pointer is never overlaid on another variable's storage. This flag also specifies that the compiler can assume that the pointed-to object is stored in memory before a call to an external procedure and is read out of memory at its next reference. It is also stored before an END or RETURN statement of a subprogram.

-OPT:alias=parm promises that Fortran parameters do not alias any other variable. This is the default. -OPT:alias=no_parm asserts that parameter aliasing is present in the program.
7.7.2 Numerically Unsafe Optimizations
Rearranging mathematical expressions and changing the order or number of floating point operations can slightly change the result. Example:
A = 2. * X
B = 4. * Y
C = 2. * (X + 2. * Y)

A clever compiler will notice that C = A + B. But the order of operations is different, so a slightly different C will result. This particular transformation is controlled by the -OPT:roundoff flag, but there are several other numerically unsafe flags. The options that fall into this category include those that control IEEE behavior, such as -OPT:roundoff=N and -OPT:IEEE_arithmetic=N. Here are a couple of others:


-OPT:div_split=(ON|OFF) enables or disables transforming expressions of the form X/Y into X*(1/Y). The reciprocal is inherently less accurate than a straight division, but may be faster.

-OPT:recip=(ON|OFF) allows expressions of the form 1/X to be converted to use the reciprocal instruction of the computer. This is inherently less accurate than a division, but will be faster.

These options can have performance impacts. For more information, see the eko manual page. You can view the manual page by typing man eko at the command line.
7.7.3 Fast-math Functions
When -OPT:fast_math=on is specified, the compiler uses fast versions of math functions tuned for the processor. The affected math functions include log, exp, sin, cos, sincos, expf, and pow. In general, the accuracy is within 1 ulp of the fully precise result, though the accuracy may be worse than this in some cases. The routines may not raise IEEE exception flags. They call no error handlers, and denormal number inputs/outputs are typically treated as 0, but may also produce unexpected results. -OPT:fast_math=on takes effect when -OPT:roundoff is set to 2 or above.

A different flag, -ffast-math, improves floating-point speed by relaxing ANSI and IEEE rules. -fno-fast-math tells the compiler to conform to ANSI and IEEE math rules at the expense of speed. -ffast-math implies -OPT:IEEE_arithmetic=2 -fno-math-errno, while -fno-fast-math implies -OPT:IEEE_arithmetic=1 -fmath-errno. These flags apply to all languages.

Both -OPT:fast_math=on and -ffast-math are implied by -Ofast.
7.7.4 IEEE 754 Compliance
It is possible to control the level of IEEE 754 compliance through options. Relaxing the level of compliance allows the compiler greater latitude to transform the code for improved performance. The following subsections discuss some of those options.
7.7.4.1 Arithmetic
Sometimes it is possible to allow the compiler to use operations that deviate from the IEEE 754 standard to obtain significantly improved performance, while still obtaining results that satisfy the accuracy requirements of your application.


The flag regulating the level of conformance to ANSI/IEEE 754-1985 floating point roundoff and overflow behavior is -OPT:IEEE_arithmetic=N (where N = 1, 2, or 3):

=1 Requires strict conformance to the standard.

=2 Allows use of any operations, as long as exact results are produced; inexact results may be less accurate. For example, X*0 may be replaced by 0, and X/X may be replaced by 1, even though this is inaccurate when X is +inf, -inf, or NaN. This is the default level at -O3.

=3 Allows any mathematically valid transformations. For example, replacing X/Y by X*recip(Y).

For more information on the defaults for IEEE arithmetic at different levels of optimization, see Table 7.3.
7.7.4.2 Roundoff
Use -OPT:roundoff= to specify the extent of roundoff error the compiler is allowed to introduce:

0 No roundoff error
1 Limited roundoff error allowed
2 Roundoff error caused by re-associating expressions allowed
3 Any roundoff error allowed

The default roundoff level is 0 with -O0, -O1, and -O2, and 1 with -O3. Listing some of the other -OPT: sub-options activated by the various roundoff levels gives more insight into what the levels mean.

-OPT:roundoff=1 implies:

-OPT:fast_exp=ON, which enables optimization of exponentiation by replacing the run-time call with multiplication and/or square root operations for certain compile-time constant exponents (integers and halves).

-OPT:fast_trunc, which inlines the NINT, ANINT, AINT, and AMOD Fortran intrinsics.

-OPT:roundoff=2 additionally turns on:

-OPT:fold_reassociate, which allows optimizations involving re-association of floating-point quantities.

-OPT:roundoff=3 turns on the following sub-options:

-OPT:fast_complex When set ON, complex absolute value (norm) and complex division use fast algorithms that overflow when an operand (the divisor, in the case of division) has an absolute value larger than the square root of the largest representable floating-point number.

-OPT:fast_nint Uses a hardware feature to implement single- and double-precision versions of NINT and ANINT.
7.7.5 Other Unsafe Optimizations
A few advanced optimizations intended to exploit exotic instructions such as CMOVE (conditional move) can slightly change the behavior of programs that write into variables guarded by an if statement. For example:
if (a .eq. 1) then
  a = 3
endif

In this example, the fastest code on an x86 CPU avoids a branch by always writing to a: if the condition is false, it writes a's existing value back into a; otherwise, it writes 3. If a is a read-only value not equal to 1, this optimization will cause a segmentation fault in an odd but perfectly valid program.
7.7.6 Assumptions About Numerical Accuracy
See the following table for the assumptions made about numerical accuracy at different levels of optimization.

Table 7-3. Numerical Accuracy with Options
-OPT: option name      -O0  -O1  -O2  -O3  -Ofast  Notes
div_split              off  off  off  off  on      on if IEEE_arithmetic=3
fast_complex           off  off  off  off  off     on if roundoff=3
fast_exp               off  off  off  on   on      on if roundoff>=1
fast_nint              off  off  off  off  off     on if roundoff=3
fast_sqrt              off  off  off  off  off
fast_trunc             off  off  off  on   on      on if roundoff>=1
fold_reassociate       off  off  off  off  on      on if roundoff>=2
fold_unsafe_relops     on   on   on   on   on
fold_unsigned_relops   off  off  off  off  off
IEEE_arithmetic        1    1    1    2    2

Table 7-3. Numerical Accuracy with Options (Continued)

-OPT: option name      -O0  -O1  -O2  -O3  -Ofast  Notes
IEEE_NaN_inf           off  off  off  off  off
recip                  off  off  off  off  on      on if roundoff>=2
roundoff               0    0    0    1    2
fast_math              off  off  off  off  on      on if roundoff>=2
rsqrt                  0    0    0    0    1       1 if roundoff>=2

For example, if you use -OPT:IEEE_arithmetic at -O3, the flag is set to IEEE_arithmetic=2 by default.
7.7.6.1 Flush-to-Zero Behavior
The processor hardware that implements IEEE floating-point arithmetic generally runs faster if it is allowed to generate zero, rather than a denormalized number, when an arithmetic operation underflows. Therefore, at optimization level -O3, the PathScale compiler allows this behavior, which is commonly known as flush-to-zero. Flush-to-zero is controlled by the -OPT:IEEE_arithmetic= flag: setting it to either 2 or 3 results in flush-to-zero. The flag defaults to 1 under -O0/-O1/-O2 and to 2 under -O3, as seen in the table above. The compilation flag works by generating instructions that configure this setting at the entry to main(). At run time, the setting can be changed with the IEEE_SET_UNDERFLOW_MODE Fortran intrinsic found in the intrinsic module IEEE_ARITHMETIC:
! Gradual underflow means "produce denormalized numbers"
USE, INTRINSIC :: IEEE_ARITHMETIC
CALL IEEE_SET_UNDERFLOW_MODE(GRADUAL=.TRUE.)
7.8 Hardware Performance
Although the x86_64 platform has excellent performance, there are a number of subtleties in configuring your hardware and software, each of which can cause a substantial performance degradation. Many of these are not obvious, yet each can reduce performance by 30% or more. We have collected below a set of techniques for obtaining the best performance.
7.8.1 Hardware Setup
There is no "catch all" memory configuration that works best across all systems. We have seen instances where the number, type, and placement of memory modules on a motherboard can each affect the memory latency and bandwidth that you can achieve.

Most motherboard manuals have tables that document the effects of memory placement in different slots. We recommend that you read the table for your motherboard, and experiment. If you fail to set up your memory correctly, this can account for up to a factor-of-two difference in memory performance. In extreme cases, this can even affect system stability.
7.8.2 BIOS Setup
Some BIOSes allow you to change your motherboard's memory interleaving options. Depending on your configuration, this may have an effect on performance. For a discussion of memory interleaving across nodes, see section 7.8.3 below.
7.8.3 Multiprocessor Memory
Traditional small multiprocessor (MP) systems use symmetric multiprocessing (SMP), in which the latency and bandwidth of memory is the same for all CPUs. This is not the case on Opteron multiprocessor systems, which provide non-uniform memory access, known as NUMA. On Opteron MP systems, each CPU has its own direct-attached memory. Although every CPU can access the memory of all others, memory that is physically closest has both the lowest latency and highest bandwidth. The larger the number of CPUs, the higher will be the latency and the lower the bandwidth between the two CPUs that are physically furthest apart. Most multiprocessor BIOSes allow you to turn on or off the interleaving of memory across nodes. Memory interleaving across nodes masks the NUMA variation in behavior, but it imposes uniformly lower performance. We recommend that you turn node interleaving off.
7.8.4 Kernel and System Effects
To achieve best performance on a NUMA system, a process or thread and as much as possible of the memory that it uses must be allocated to the same single CPU. The Linux kernel has historically had no support for setting the affinity of a process in this way. Running a non-NUMA kernel on a NUMA system can result in changes in performance while a program is running, and non-reproducibility of performance across runs. This occurs because the kernel will schedule a process to run on whatever CPU is free without regard to where the process's memory is allocated. Recent kernels have some degree of NUMA support. They will attempt to allocate memory local to the CPU where a process is running, but they still may not prevent that process from later being run on a different CPU after it has allocated memory.

Current NUMA-aware kernels do not migrate memory across NUMA nodes, so if a process moves relative to its memory, its performance will suffer in unpredictable ways. Note that not all vendors ship NUMA-aware kernels or C libraries that can interface to them. If you are unsure of whether your kernel supports NUMA, check with your distribution vendor.
7.8.5 Tools and APIs
Recent Linux distributions include tools and APIs that allow you to bind a thread or process to a specific CPU. This provides an effective workaround for the problem of the kernel moving a process away from its memory. Your Linux distribution may come with a package called schedutils, which includes a program called taskset. You can use taskset to specify that a program must run on one particular CPU. For low-level programming, this facility is provided by the sched_setaffinity(2) call in the C library. You will need a recent C library to be able to use this call. On systems that lack NUMA support in the kernel, and on runs that do not set process affinity before they start, we have seen variations in performance of 30% or more between individual runs.
7.8.6 Testing Memory Latency and Bandwidth
To test your memory latency and bandwidth, we recommend two tools. For memory latency, the LMbench package provides a tool called lat_mem_rd, which gives a cryptic but fairly accurate view of your memory hierarchy latency. LMbench is available from http://www.bitmover.com/lmbench/. For measuring memory bandwidth, the STREAM benchmark is a useful tool. Compiling either the Fortran or C version of the benchmark with the following command lines will provide excellent performance:
$ pathf95 -Ofast stream_d.f second_wall.c -DUNDERSCORE
$ pathcc -Ofast -lm stream_d.c second_wall.c

(If you do not compile with at least -O3, performance may drop by 40% or more.) The STREAM benchmark is available from http://www.streambench.org/. For both of these tools, we recommend that you perform a number of identical runs and average your results, as we have observed variations of more than 10% between runs.

7.9 The pathopt2 Tool
The pathopt2 tool iteratively tests different options and option combinations by compiling a set of application source files, measuring the performance of the executable, and tracking the results. The best options obtained from the output of these runs are used to adaptively tune successive runs, yielding the best set of compiler options for a given combination of application code, data set, hardware, and environment. A sorted list of execution times is produced for each run. The tool uses an XML option configuration file that defines one or more execution targets. Each execution target specifies options to try and indicates how they are to be combined into a series of tests.

In general, using pathopt2 involves these steps:

1. Run pathopt2 using an execution target in the supplied option configuration file.
2. Interpret the results.
3. Choose a more detailed execution target based on the results from the first run, and repeat the process until the best compiler options are found.

The pathopt2 tool can be driven entirely from its command line, or it can use scripts to build and test the programs. Scripts are useful for more complex runs, for interfacing to existing build and test mechanisms, and for automating the process. For a standard installation, the program pathopt2 is located in:
/opt/pathscale/bin

This is the same directory that contains pathcc, pathCC, pathf95, pathf90, and so on. An option configuration file, pathopt2.xml, is provided. The default location is:
/opt/pathscale/share/pathopt2/pathopt2.xml

See section 7.9.3 for details on this file format. Sample programs are found in:
/opt/pathscale/share/pathopt2/examples

In the following sections we review the command syntax, the option configuration file structure, and general usage information. Step-by-step examples show how to use the different features of pathopt2.

7.9.1 A Simple Example
An example is provided here to show basic usage of pathopt2. In this example you will copy a test program into your working directory, and then run pathopt2 with the options file and the test program. Copy the program factorial.c from /opt/pathscale/share/pathopt2/examples into your own working directory. factorial.c is a program that calculates a table of 50,000 factorials, from 1! to 50000!. You can now run this simple example by typing:
$ pathopt2 -f pathopt2.xml -t try5 \
    -r ./factorial pathcc @ -o factorial factorial.c

NOTE: If you do not have '.' set in your PATH, you need to use './factorial' to run this command from the current working directory. The PATH for the program pathopt2 is the same as for pathcc, etc., and should already be set correctly. See the PathScale Compiler Suite and Subscription Manager Install Guide for general information on setting your PATH.

You should see output summarizing the results of all the runs. The first set of flags is listed in the order in which the runs were made, followed by a summary table that sorts the same output by time, from fastest to slowest. Sample output from this run is shown below:
Flags            Build  Test  Real  User  System
-O2              PASS   PASS  2.83  2.82  0.00
-O3              PASS   PASS  2.39  2.39  0.00
-O3 -ipa         PASS   PASS  2.40  2.40  0.01
-O3 -OPT:Ofast   PASS   PASS  2.37  2.38  0.00
-Ofast           PASS   PASS  2.38  2.38  0.00

Sorted summary from all runs:

Flags            Build  Test  Real  User  System
-O3 -OPT:Ofast   PASS   PASS  2.37  2.38  0.00
-Ofast           PASS   PASS  2.38  2.38  0.00
-O3              PASS   PASS  2.39  2.39  0.00
-O3 -ipa         PASS   PASS  2.40  2.40  0.01
-O2              PASS   PASS  2.83  2.82  0.00
From these results, we see that the best option from this run is -O3 -OPT:Ofast. The next sections will discuss details on usage, command line options, and the configuration file format.

7.9.2 pathopt2 Usage
Basic usage is as follows:
pathopt2 [-n num_iterations] [-f configfile] [-t execute_target] \
         [-r test_command] [-S real|user|system] build_command @ [args] ...

The command line above shows the most commonly used options; for the complete list of options, see Table 7-4. The pathopt2 tool runs build_command with the provided arguments, using additional options as specified in configfile. The build command can be a PathScale invocation command (pathcc, pathCC, pathf95), a make command, or a script that eventually invokes the compiler, perhaps via a make command. The character @ is replaced in the command with the list of options from the configfile being considered. The configfile is typically the provided pathopt2.xml file, although you can write your own. The execute_target parameter specifies the execution target from the configfile. The test_command parameter is the command to run the program, and can be replaced with a script. The program is expected to return a status value of 0 to indicate success, or a non-zero status to indicate failure.

The -S option specifies the metric used for comparing performance:

real: the elapsed real time (this is the default).
user: the CPU time spent executing in user mode.
system: the CPU time spent executing in system mode.
timing-file: use a file containing a timing value.
rate-file: use a file containing a rate value.

The chosen metric guides the choices made by the pathopt2 algorithms when selecting options for the best performance, and is used to sort the final output. The interpretation of real, user, and system time is the same as for the time(1) command: real is equivalent to wall-clock time, while user and system account separately for the time an application spends executing in user mode and in kernel mode. Since the OS is typically time-slicing between many processes, the sum of user and system does not necessarily equal real, because other processes could also have run. The default metric used when comparing the performance of one set of options with another is real time. All three times are displayed in the output.
Additionally, pathopt2 allows arbitrary performance metrics to be used to guide option selection using the timing-file and rate-file choices. When either of these options is used, pathopt2 sets an environment variable called PSC_METRIC_FILE with the name of a temporary file before running the command. The run command is required to write the performance metric into this file before it terminates. The pathopt2 tool then opens this file, reads a value from the file as a double-precision floating-point number, and deletes the temporary file. The only interpretation placed on these values is that smaller is better for timing, and that

larger is better for rate. The actual units of the values do not matter as far as pathopt2 is concerned since it just performs comparisons on the values. Using the above usage as a guide, we can now summarize the simple command from the previous section:
$ pathopt2 -f pathopt2.xml -t try5 \
    -r ./factorial pathcc @ -o factorial factorial.c

This example directs pathopt2 to use pathopt2.xml as the configuration file. The build command pathcc @ -o factorial factorial.c is used for the building phase, where the "@" option is iteratively replaced with the rules specified in the try5 subset within the configuration file pathopt2.xml. The "@" character must be included somewhere in the build command, since this is the mechanism by which the chosen optimization options are propagated to the build command. Finally, ./factorial is used as the test_command. For simple cases, the -o flag can be omitted, and the default executable output a.out can be used as the test_command:
$ pathopt2 -f pathopt2.xml -t try5 \
    -r ./a.out pathcc @ factorial.c

NOTE: The order of the options in the command line does not matter. However, the required build_command comes last, since it may have an arbitrary number of options and arguments of its own. When the -f option is not specified, pathopt2 uses the file pathopt2.xml if it is present in the current working directory; otherwise it uses the default pathopt2.xml that ships with the software.

The available pathopt2 options are given in Table 7-4. You can also type:
$ pathopt2 -h

on the command line to get usage information.

Table 7-4. pathopt2 Options

-D
    Do not redirect I/O to /dev/null. This is useful for debugging
    problems with the compilation, the run, or the build and test
    scripts.
    Default: all I/O from the build and test commands is sent to
    /dev/null, under the assumption that the program will build and
    run cleanly.

-f configfile
    Specifies the filename of the pathopt2 XML configuration file.
    Default: if not specified, the tool first checks for a file called
    pathopt2.xml in the current working directory and uses it if
    present; otherwise it uses
    /opt/pathscale/share/pathopt2/pathopt2.xml.

-g external_configfile
    Loads additional user-defined configfile(s). This allows a user to
    extend the pathopt2.xml file without having to modify it.

-h
    Show usage.

-j
    Number of jobs. Default: 1

-k
    Keep the temporary directory (with -T). Default: remove the
    temporary directory

-M
    Directory name. Default: 'pwd'

-n num_iterations
    Number of iterations to run on each option. Default: 1

-r test_command
    Test script. Default: if this option is not specified, there is no
    test run and the performance of the build command is used. This is
    useful when the program is built and run in one step, and the
    timing-file or rate-file mechanism is used to report the
    performance.

-S real|user|system|timing-file|rate-file
    Selects the performance metric for choosing options and for
    sorting the results. Default: real

-t execute_target
    Use execute_target, which corresponds to a tag found in
    configfile. Default: the first target in configfile

-T
    Run the script in a temporary directory. Default: do not use a
    temporary directory

-v
    Generate more verbose output.

-w columns
    Number of columns to use in formatting output. Default: 40

-X
    Don't print out a summary table.

7.9.3 Option Configuration File

The PathScale Compiler Suite includes pathopt2.xml, a pre-configured option configuration file found in /opt/pathscale/share/pathopt2/ that contains about 200 test flags and options. This XML file specifies a tree of options to try, using a small set of tags and attributes. The file supports many common combinations of options in a framework that enables pathopt2 to adapt as it runs. pathopt2.xml can be used on its own, or as a framework for creating a custom configuration file. More than one configuration can be described in a single file. A single configuration in pathopt2.xml consists of two parts: A list of options. This list is contained within a tag. This list can also contain any number of
The PathScale Compiler Suite includes pathopt2.xml, a pre-configured option configuration file found in /opt/pathscale/share/pathopt2/ that contains about 200 test flags and options. This XML file specifies a tree of options to try. A small set of tags and attributes are used. The file supports many common combinations of options in a framework that enables pathopt2 to adapt as it runs. pathopt2.xml can be used on its own, or as a framework for creating a custom configuration file. More than one configuration can be described in a single file. A single configuration in pathopt2.xml consists of two parts: A list of options. This list is contained within a tag. This list can also contain any number of