ACESgrid
Alliance for Computational Earth Science

1. Compilers and compiler options for the cluster nodes

The ACESgrid clusters have the following compilers installed:

  • C compilers:

    • GNU compiler collection v. 3.3.3: gcc or cc

      • Also covers Objective C.

    • GNU compiler collection v. 2.96: gcc296

    • GNU compiler collection v. 3.4.0: gcc34

    • GNU compiler collection v. 3.2.3: gcc or cc on IA64 systems

    • Intel C/C++ compiler v. 8.1: icc

    • Portland Group compiler v. 5.2: pgcc

  • C++ compilers:

    • GNU compiler collection v. 3.3.3: g++ or c++

    • GNU compiler collection v. 2.96: g++296

    • GNU compiler collection v. 3.4.0: g++34

    • GNU compiler collection v. 3.2.3: g++ or c++ on IA64 systems

    • Intel C/C++ compiler v. 8.1: icpc or icc

    • Portland Group compiler v. 5.2: pgCC

  • Fortran 77 compilers:

    • GNU compiler collection v. 3.3.3: g77 or f77

    • GNU compiler collection v. 3.2.3: g77 or f77 on IA64 systems

    • Intel Fortran compiler v 8.1: ifort or ifc

    • Portland Group compiler v. 5.2: pgf77

  • Fortran 90/95/2003 compilers:

    • GNU Fortran (G95) v 3.5 (beta) and 4.0.0: g95 for both IA32 and IA64 systems

    • GNU Fortran (GFortran) v 4.0.0: gfortran for IA32 systems

    • Intel Fortran compiler v 8.1: ifort or ifc

    • Portland Group compiler v. 5.2: pgf90

  • HPF compilers:

    • Portland group compiler v. 5.2: pghpf

    • ADAPTOR source-to-source compiler v. 10.2: adaptor

  • Java

    • GNU compiler collection v 3.3.3: gcj

    • GNU compiler collection v 3.4.0: gcj34

    • GNU compiler collection v 3.2.3: gcj on IA64 systems

    • Sun JDK 1.4.2_06: javac for both IA32 and IA64 systems

    • Sun JDK 1.5.0: javac for IA32 systems

    • IBM JDK 1.4.2: javac with JAVA_HOME=/usr/lib/jvm/java-1.4.2-ibm-1.4.2.0 for IA64 systems

2. Compiler options

2.1. Basic options for all compilers

Please run all production code on the ACESgrid clusters with optimization enabled, unless you cannot get correct results from an optimized binary. Basic flags accepted by all compilers are:

  • Basic optimizations: -O

  • Basic debugging: -g

2.1.1. Useful optimization options for the GNU compiler collection

  • For more information please look at the man and info pages (e.g. man gcc or info gcc) and the online documentation.

  • Suggested flags:

    • Safe:

      • IA32: -O2 -march=pentium4 -mfpmath=sse -msse -msse2
      • IA64: -O2

    • Aggressive but relatively safe:

      • IA32: -O3 -funroll-loops -march=pentium4 -malign-double -mfpmath=sse -msse -msse2
      • IA64: -O3 -funroll-loops

    • Aggressive and unsafe:

      • IA32: -O3 -funroll-loops -ffast-math -march=pentium4 -malign-double -mfpmath=sse -msse -msse2
      • IA64: -O3 -funroll-loops -ffast-math

  • -O or -O1

  • Optimize. Optimizing compilation takes somewhat more time, and a lot more memory for a large function.

    With -O, the compiler tries to reduce code size and execution time, without performing any optimizations that take a great deal of compilation time.

  • -O2

  • Optimize even more. GCC performs nearly all supported optimizations that do not involve a space-speed tradeoff. The compiler does not perform loop unrolling or function inlining when you specify -O2. As compared to -O, this option increases both compilation time and the performance of the generated code.
  • -Os

  • Optimize for size. -Os enables all -O2 optimizations that do not typically increase code size. It also performs further optimizations designed to reduce code size.
  • -O3

  • Optimize yet more. -O3 turns on all optimizations specified by -O2 and also turns on function inlining and other options.
  • -funroll-loops

  • Unroll loops whose number of iterations can be determined at compile time or upon entry to the loop. This option makes code larger, and may or may not make it run faster.
  • -funroll-all-loops

  • Unroll all loops, even if their number of iterations is uncertain when the loop is entered. This usually makes programs run more slowly. -funroll-all-loops implies the same options as -funroll-loops.
  • Relevant optimization flags for the Pentium4/Xeon processors used in the ACESgrid clusters:

    • -march=pentium4 (or -march=pentiumpro for v. 2.96)

    • Generate instructions for Pentium4 (or at least Pentium Pro) CPUs. Tune the code generation to the specific CPU.
    • -mfpmath=sse -msse -msse2 (not for v. 2.96)

    • Use the SSE floating point unit for all floating point operations.
    • -mfpmath=387 (not for v. 2.96 where this is default)

    • Use the x387 floating point unit for all floating point operations
    • -malign-double

    • Control whether GCC aligns "double", "long double", and "long long" variables on a two word boundary or a one word boundary. Aligning "double" variables on a two word boundary will produce code that runs somewhat faster on a Pentium at the expense of more memory.

      Warning: if you use the -malign-double switch, structures containing the above types will be aligned differently than the published application binary interface specifications for the 386 and will not be binary compatible with structures in code compiled without that switch.

  • Optimization flags related to floating point behaviour:

    • -ffloat-store

    • Do not store floating point variables in registers, and inhibit other options that might change whether a floating point value is taken from a register or memory.

      This option prevents undesirable excess precision on machines such as the 68000 where the floating registers (of the 68881) keep more precision than a "double" is supposed to have. Similarly for the x86 architecture. For most programs, the excess precision does only good, but a few programs rely on the precise definition of IEEE floating point. Use -ffloat-store for such programs, after modifying them to store all pertinent intermediate computations into variables.

    • -ffast-math

      • Sets -fno-math-errno, -funsafe-math-optimizations, -fno-trapping-math, -ffinite-math-only and -fno-signaling-nans.

        This option causes the preprocessor macro __FAST_MATH__ to be defined.

        This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.

      • -fno-math-errno

        Do not set ERRNO after calling math functions that are executed with a single instruction, e.g., sqrt. A program that relies on IEEE exceptions for math error handling may want to use this flag for speed while maintaining IEEE arithmetic compatibility.

        This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.

        The default is -fmath-errno.

      • -funsafe-math-optimizations

        Allow optimizations for floating-point arithmetic that (a) assume that arguments and results are valid and (b) may violate IEEE or ANSI standards. When used at link-time, it may include libraries or startup files that change the default FPU control word or other similar optimizations.

        This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.

        The default is -fno-unsafe-math-optimizations.

      • -ffinite-math-only

        Allow optimizations for floating-point arithmetic that assume that arguments and results are not NaNs or +-Infs.

        This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications.

        The default is -fno-finite-math-only.

      • -fno-trapping-math

        Compile code assuming that floating-point operations cannot generate user-visible traps. These traps include division by zero, overflow, underflow, inexact result and invalid operation. This option implies -fno-signaling-nans. Setting this option may allow faster code if one relies on "non-stop" IEEE arithmetic, for example.

        This option should never be turned on by any -O option since it can result in incorrect output for programs which depend on an exact implementation of IEEE or ISO rules/specifications for math functions.

        The default is -ftrapping-math.

      • -fsignaling-nans

        Compile code assuming that IEEE signaling NaNs may generate user-visible traps during floating-point operations. Setting this option disables optimizations that may change the number of exceptions visible with signaling NaNs. This option implies -ftrapping-math.

        This option causes the preprocessor macro __SUPPORT_SNAN__ to be defined.

        The default is -fno-signaling-nans.

        This option is experimental and does not currently guarantee to disable all GCC optimizations that affect signaling NaN behavior.

      • -mieee-fp and -mno-ieee-fp

        Control whether or not the compiler uses IEEE floating point comparisons. These handle correctly the case where the result of a comparison is unordered.
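The repeated warning that -ffast-math "can result in incorrect output for programs which depend on an exact implementation of IEEE rules" can be made concrete. The sketch below assumes a gcc-compatible cc: -ffast-math implies -ffinite-math-only, which lets the compiler fold the IEEE NaN self-comparison (x != x) to false at compile time.

```shell
# Compile the same NaN test with and without -ffast-math and compare
# the observable behavior.
cat > nantest.c <<'EOF'
int main(void)
{
    volatile double zero = 0.0;
    double nan = zero / zero;     /* quiet NaN, produced at run time */
    /* IEEE: comparisons with NaN are unordered, so nan != nan is true */
    return (nan != nan) ? 0 : 1;
}
EOF

cc -O2 -o nan_strict nantest.c
cc -O2 -ffast-math -o nan_fast nantest.c

if ./nan_strict; then strict_rc=0; else strict_rc=1; fi
if ./nan_fast;   then fast_rc=0;   else fast_rc=1;   fi
echo "strict=$strict_rc fast=$fast_rc"
```

The strict build exits 0 (the NaN compares unequal to itself, as IEEE requires); under -ffast-math the test is typically folded away and the program exits 1, silently taking the "impossible" branch.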

2.1.2. Useful optimization options for the Intel compilers

  • For more information please look at the man and info pages (e.g. man icc or man ifort), the online C/C++ and Fortran documentation, or the local documentation (in /usr/local/pkg/i?c/version/doc/ or /usr/local/pkg-ia64/e?c/version/doc/).

  • Suggested flags:

    • Safe:

      • IA32: -O -xN -ip -mp1
      • IA64: -O -ip -mp1

    • Aggressive but relatively safe:

      • IA32: -fast -xN
      • IA64: -fast

    • Aggressive and unsafe (for Fortran):

      • IA32: -fast -xN -pad
      • IA64: -fast -pad

  • -O1

  • For C/C++: Optimize to favor code size and code locality. Disables loop unrolling. -O1 may improve performance for applications with very large code size, many branches, and execution time not dominated by code within loops. In most cases, -O2 is recommended over -O1; in fact, for Fortran -O1 is the same as -O2.
  • -O2 or -O (the default)

  • Optimize for code speed. This is the generally recommended optimization level.
  • -O3

  • Enable -O2 optimizations and in addition, enable more aggressive optimizations such as loop and memory access transformation. The -O3 optimizations may slow down code in some cases compared to -O2 optimizations. Recommended for applications that have loops with heavy use of floating point calculations and process large data sets.
  • -fast

  • The -fast option maximizes speed across the entire program. It sets command options that can improve run-time performance, as follows: -O3 -ipo -static. You need to add -xN to override the default of -xP on IA32 systems.
  • -Os

  • Enable speed optimizations, but disable some optimizations that increase code size for small speed benefit.
  • -unroll[n]

  • Set maximum number of times to unroll loops. This applies only to loops that the compiler determines should be unrolled. Omit n to let the compiler decide whether to perform unrolling or not. Use n=0 to disable loop unrolling. Enabled by default.
  • -scalar_rep (for Fortran on IA32 only)

  • Enable scalar replacement performed during loop transformations. Requires -O3.
  • -complex_limited_range

  • This option enables the use of the basic algebraic expansions of some complex arithmetic operations. At the loss of some exponent range, the -complex_limited_range option can allow for some performance improvement in programs which utilize complex arithmetic.
  • -align

  • Analyze and reorder memory layout for variables and arrays.
  • -pad (Fortran only)

  • Enables the changing of the variable and array memory layout. The default is -nopad. The -pad option is effectively not different from -align when applied to structures and derived types. However, the scope of -pad is greater because it applies also to common blocks, derived types, sequence types, and structures.
  • -ip

  • Enable single-file IP optimizations (within files). With this option, the compiler performs inline function expansion for calls to functions defined within the current source file.
  • -ipo

  • Enables multifile IP optimizations (between files). When you specify this option, the compiler performs inline function expansion for calls to functions defined in separate files. To use when creating libraries add -ipo_obj.
  • -static

  • Prevents linking with shared libraries. Causes the executable to link all libraries statically (if possible) for a slight performance boost. May break the linking step if all libraries do not have static versions available.
  • Relevant optimization flags for the Pentium4/Xeon processors used in the ACESgrid clusters:

    • -tpp7

    • Optimize for Intel Pentium 4 processors. (default on IA32 and not necessary to add)
    • -tpp2

    • Optimize for Intel Itanium2 processors. (default on IA64 and not necessary to add)
    • -xcodes

      • Generate specialized code to run exclusively on processors supporting the extensions indicated by codes. codes includes one or more of the following characters:
      • K -- Intel Pentium III processors and compatible Intel processors. Generated binary should work on non-Intel processors supporting SSE instructions (e.g. AthlonMP, AthlonXP, Athlon4, Athlon64, Opteron).

      • W -- Intel Pentium 4 processors and compatible Intel processors. Generated binary should work on non-Intel processors supporting SSE2 instructions (e.g. Athlon64, Opteron).

      • N -- Intel Pentium 4 processors and compatible Intel processors. Enables new optimizations in addition to Intel processor-specific optimizations. Will not work on non-Intel processors even if they support SSE2 instructions.

      • B -- Intel Pentium M processors and compatible Intel processors. Enables new optimizations in addition to Intel processor-specific optimizations. Will not work on non-Intel processors even if they support SSE2 instructions.

      • P -- Intel Pentium 4 processors with Streaming SIMD Extensions 3 (SSE3) instruction support. Enables new optimizations in addition to Intel processor-specific optimizations. Will not work on non-Intel processors even if they support SSE3 instructions.
  • Optimization flags related to floating point behaviour:

    • -mp

    • Maintain floating-point precision (disables some optimizations). The -mp option restricts optimization to maintain declared precision and to ensure that floating-point arithmetic conforms more closely to the ANSI and IEEE standards. For most programs, specifying this option adversely affects performance. If you are not sure whether your application needs this option, try compiling and running your program both with and without it to evaluate the effects on both performance and precision. A lot of the same behaviour can be replicated by using -mp1 or -prec_div -fp_port -xN at a lesser cost in speed.
    • -mp1

    • Improve floating-point precision. -mp1 disables fewer optimizations and has less impact on performance than -mp.
    • -prec_div (only for IA32)

    • Improve precision of floating-point divides (some speed impact). With some optimizations the Intel compilers change floating-point division computations into multiplication by the reciprocal of the denominator. For example, A/B is computed as A x (1/B) to improve the speed of the computation. However, for values of B greater than 2^126, the value of 1/B is "flushed" (changed) to 0. When it is important to maintain the value of 1/B, use -prec_div to disable the floating-point division-to-multiplication optimization. The result of -prec_div is more accurate, with some loss of performance.
    • -pcn (only for IA32)

      • Enable floating-point significand precision control. Some floating-point algorithms are sensitive to the accuracy of the significand, or fractional part of the floating-point value. For example, iterative operations like division and finding the square root can run faster if you lower the precision with the -pcn option. Set n to one of the following values to round the significand to the indicated number of bits:
      • 32: 24 bits (single precision - 32 bits total extent)

      • 64: 53 bits (double precision - 64 bits total extent)

      • 80: 64 bits (extended/long double precision - 80 bits total extent)

      • Caution: This is the default behaviour for all floating-point operations in the x387 floating point unit, whose registers are 80 bits long internally. Any SSE/SSE2/SSE3 operations that employ the XMM registers use 32-bit (single precision) or 64-bit (double precision) variables.
        Caution: A change of the default precision control or rounding mode (for example, by using the -pc32 flag or by user intervention) may affect the results returned by some of the mathematical functions.
    • -rcd (only for IA32)

    • Enable fast float-to-int conversions. The Intel compiler uses the -rcd option to improve the performance of code that requires floating-point-to-integer conversions. The system default floating point rounding mode is round-to-nearest. However, the C & Fortran languages require floating point values to be truncated when a conversion to an integer is involved. To do this, the compiler must change the rounding mode to truncation before each floating-point-to-integer conversion and change it back afterwards. The -rcd option disables the change to truncation of the rounding mode for all floating point calculations, including floating point-to-integer conversions. Turning on this option can improve performance, but floating point conversions to integer will not conform to language semantics.
    • -fp_port (only for IA32)

    • Round floating-point results at assignments and casts (some speed impact).
    • -fpstkchk (only for IA32)

    • Generate extra code after every function call to assure that the FP stack is in the expected state. Generally, when the FP stack overflows, a NaN value is put into FP calculations, and the program's results differ. Unfortunately, the overflow point can be far away from the point of the actual bug. The -fpstkchk option inserts code that causes an access violation immediately after an incorrect call, making it easier to locate such issues when debugging.
    • -ftz

    • Flush denormal results to zero. Denormal handling can slow down Pentium4/Xeon systems appreciably, so this option helps application speed at a potential accuracy cost. This option has effect only when compiling the main program. It is the default for modern Pentium4 systems and can be disabled with -ftz-.
    • -IPF_fma[-] (only for IA64)

    • Enable/disable the combining of floating point multiplies and add/subtract operations
    • -IPF_fltacc[-] (only for IA64)

    • Enable/disable optimizations that affect floating point accuracy
    • -IPF_flt_eval_method0 (only for IA64)

    • Floating point operands evaluated to the precision indicated by program
    • -IPF_fp_speculationmode (only for IA64)

    • Enable floating point speculation with the following mode conditions:
      • fast - speculate floating point operations (DEFAULT)
      • safe - speculate only when safe
      • strict - same as off
      • off - disables speculation of floating-point operations
    • -IPF_fp_relaxed[-] (only for IA64)

    • Enable/disable use of faster but slightly less accurate code sequences for math functions

2.1.3. Useful optimization options for the Portland Group compilers

  • For more information please look at the man and info pages (e.g. man pgcc or info pgf77) and the online or local documentation (in /usr/local/pkg/pgi/pgi-5.2/linux86/5.2/doc/).

  • Suggested flags:

    • Safe: -fastsse -Mvect=sse,assoc,cachesize:524288 -Mlre=noassoc -Mnoflushz -Kieee

    • Aggressive but relatively safe: -fastsse -Mvect=sse,assoc,cachesize:524288

    • Very aggressive and unsafe: -fastsse -O3 -Mvect=sse,assoc,cachesize:524288,recog,transform -Mipa=fast -nodepchk

  • -O[level]

    • Set the optimization level. If -O is not specified, then the default level is 1 if -g is not specified, and 0 if -g is specified. If a number is not supplied with -O then the optimization level is set to 2. The optimization levels and their meanings are as follows:
    • -O0

    • A basic block is generated for each C statement. No scheduling is done between statements. No global optimizations are performed.
    • -O1

    • Scheduling within extended basic blocks is performed. Some register allocation is performed. No global optimizations are performed.
    • -O2

    • All level 1 optimizations are performed. In addition, traditional scalar optimizations such as induction recognition and loop invariant motion are performed by the global optimizer.
    • -O3

    • Aggressive global optimization. This level performs all -O2 optimizations and enables more aggressive hoisting and scalar replacement optimizations that may or may not be profitable.
  • -fast

  • Chooses generally optimal flags for the target platform. Use -fast -help to see the equivalent switches. Note this sets the optimization level to a minimum of 2; see -O. Currently set to -O2 -Munroll=c:1 -Mnoframe -Mlre.
  • -fastsse

  • Chooses generally optimal flags for a processor that supports the SSE (Pentium 3/4, AthlonXP/MP, Opteron) and SSE2 (Pentium 4, Opteron) instructions. Use -fastsse -help to see the equivalent switches. Currently set to -fast -Mvect=sse -Mscalarsse -Mcache_align -Mflushz.
  • -Mcache_align

  • Align unconstrained data objects of size greater than or equal to 16 bytes on cache-line boundaries. An unconstrained object is a variable or array that is not a member of an aggregate structure or common block, is not allocatable, and is not an automatic array.
  • -Mdepchk

  • Assume that potential data dependencies exist. -Mnodepchk may result in incorrect code. For C/C++, the -Msafeptr switch (see the manual) provides a less dangerous way to accomplish the same thing.
  • -Mlre[=assoc|noassoc] or -Mnolre

  • Enable (disable) loop-carried redundancy elimination. The assoc option allows expression reassociation (default), and the noassoc option disallows expression reassociation.
  • -Mframe or -Mnoframe (default)

  • Set up (don't set up) a true stack frame pointer for functions; -Mnoframe allows slightly more efficient operation when a stack frame is not needed, but some options override -Mnoframe.
  • -Munroll[=option[,option...]] or -Mnounroll (default)

    • Invoke (don't invoke) the loop unroller. This also sets the optimization level to a minimum of 2; see -O. The option is one of the following:
    • c:m

    • Instructs the compiler to completely unroll loops with a constant loop count less than or equal to m, a supplied constant. If this value is not supplied, the m count is set to 4.
    • n:u

    • Instructs the compiler to unroll u times, a loop which is not completely unrolled, or has a non-constant loop count. If u is not supplied, the unroller computes the number of times a candidate loop is unrolled.
      -Mnounroll instructs the compiler not to unroll loops.
  • -Mipa[=option[,option,...]]

    • Enable and specify options for InterProcedural Analysis (IPA). This also sets the optimization level to a minimum of 2; see -O. If no option list is specified, then it is equivalent to -Mipa=const. This requires two passes of the compiler (the first one may appear to fail). The minimal and maximal options are:
    • const (default) or noconst

    • Enable (disable) propagation of constants across procedure calls.
    • fast

    • Chooses generally optimal -Mipa flags for the target platform; use -Mipa -help to see the equivalent options. Currently set to -Mipa=align,arg,const,f90ptr,shape,globals,localarg,ptr.
  • Relevant optimization flags for the Pentium4/Xeon processors used in the ACESgrid clusters:

    • -tp p7 or -tp piv

    • Optimize for the Pentium 4 processor
    • -tp piii

    • Optimize for the Pentium III processor
    • -tp px

    • Blended code generation that will work on any x86-compatible processor

      The default in the absence of the -tp flag is to compile for the type of CPU on which the compiler is running, namely Pentium4/Xeon.

    • -Mscalarsse or -Mnoscalarsse

    • Scalar SSE code generation for the XMM registers; implies -Mflushz. Utilize (don't use) SSE (Pentium 3, 4, AthlonXP/MP, Opteron) and SSE2 (Pentium 4, Opteron) instructions to perform all the floating-point operations coded. This requires the assembler to be capable of interpreting SSE/SSE2 instructions. The default is -Mnoscalarsse.
    • -Mnontemporal

    • Allow nontemporal move prefetching. -Mnontemporal used with -fastsse can sometimes be faster than -fastsse alone.
    • -Mvect[=option[,option,...]]

      • Pass options to the internal vectorizer. This also sets the optimization level to a minimum of 2; see -O. If no option list is specified, then the following vector optimizations are used: assoc,cachesize:262144,nosse. The vect options are:
      • altcode:n or noaltcode (default)

      • Generate (don't generate) alternate scalar code for vectorized loops. If altcode is specified without arguments, the vectorizer determines an appropriate cutoff length and generates scalar code to be executed whenever the loop count is less than or equal to that length. If altcode:n is specified, the scalar altcode is executed whenever the loop count is less than or equal to n.
      • assoc or noassoc (default)

      • Enable (disable) certain associativity conversions that can change the results of a computation due to floating point roundoff error differences. A typical optimization is to change the order of additions, which is mathematically correct, but can be computationally different, due to roundoff error.
      • cachesize:number (default=automatic)

      • Instructs the vectorizer, when performing cache tiling optimizations, to assume a cache size of number.
      • prefetch

      • Use prefetch instructions in loops where profitable.
      • sse

      • Use SSE, SSE2, 3Dnow, and prefetch instructions in loops where possible. Overrides prefetch.
  • Optimization flags related to floating point behaviour:

    • -Kieee or -Knoieee (default)

    • Perform (don't perform) float and double divides in conformance with the IEEE 754 standard. This is done by replacing the usual in-line divide algorithm with a subroutine call, at the expense of performance. The default algorithm produces results that differ from the correctly rounded result by no more than 3 units in the last place. Also, on some systems, a more accurate math library may be linked if -Kieee is used during the link step.
    • -Ktrap=[option,[option]...]

    • Controls the behavior of the processor when floating-point exceptions occur. Possible options include fp, align (ignored), inv, denorm, divz, ovf, unf, and inexact. -Ktrap is only processed when compiling a main function/program. The options inv, denorm, divz, ovf, unf, and inexact correspond to the processor's exception mask bits: invalid operation, denormalized operand, divide-by-zero, overflow, underflow, and precision, respectively. Normally, the processor's exception mask bits are on (floating-point exceptions are masked; the processor recovers from exceptions and continues). If a floating-point exception occurs and its corresponding mask bit is off (or "unmasked"), execution terminates with an arithmetic exception (C's FPE signal). -Ktrap=fp is equivalent to -Ktrap=inv,divz,ovf.
    • -Mflushz (default) or -Mnoflushz

    • Set SSE to flush-to-zero mode.
    • -Mfptrap (default) or -Mnofptrap

    • -Mnofptrap performs the semantics of -Knoieee (use in-line divide, link in non-IEEE libraries if available, and disable underflow traps) and disables floating point traps.
    • -pc val

      • The IA-32 architecture implements a floating-point stack using eight 80-bit registers. Each register uses bits 0-63 as the significand, bits 64-78 for the exponent, and bit 79 as the sign bit. This 80-bit real format is the default format (called the extended format). When values are loaded into the floating point stack they are automatically converted into extended real format. The precision of the floating point stack can, however, be controlled by setting the precision control bits (bits 8 and 9) of the floating control word appropriately. In this way, the programmer can explicitly set the precision to standard IEEE double precision using 64 bits, or to single precision using 32 bits. The default precision setting is system dependent; for Linux systems the default precision is extended. If you use -pc to alter the precision setting for a routine, the main program must be compiled with the same value for -pc. The command line option -pc val lets the programmer set the compiler's precision preference. Valid values for val are:
      • 32 single precision

      • 64 double precision

      • 80 extended precision

        Operations performed exclusively on the floating point stack using extended precision, without storing into or loading from memory, can cause problems with accumulated values within the extra 16 bits of extended precision values. This can lead to answers, when rounded, that do not match expected results.

2.2. OpenMP and automatic parallelization options

Of the installed compilers, only the Intel and Portland Group ones are capable of interpreting OpenMP directives and producing OpenMP code.

  • As the cluster nodes have only two CPUs, the available parallelization benefit is limited.

  • If planning to mix MPI and OpenMP keep in mind that the installed MPI implementations are at best capable of handling MPI calls from the master thread only (preferably outside parallel regions).

2.2.1. Useful parallelization options for the Intel compilers

  • -openmp

  • enable the compiler to generate multi-threaded code based on the OpenMP directives
  • -openmp_stubs

  • enables the user to compile OpenMP programs in sequential mode. The OpenMP directives are ignored and a stub (sequential) OpenMP library is linked
  • -openmp_report{0|1|2}

  • control the OpenMP parallelizer diagnostic level
  • -parallel

  • enable the auto-parallelizer to generate multi-threaded code for loops that can be safely executed in parallel
  • -par_report{0|1|2|3}

  • control the auto-parallelizer diagnostic level
  • -par_threshold[n]

  • set threshold for the auto-parallelization of loops where n is an integer from 0 to 100

2.2.2. Useful parallelization options for the Portland Group compilers

  • -mp

    • Enable OpenMP and SGI parallelization directives
    • -Mnoopenmp

    • Ignore OpenMP directives; use with -mp
    • -Mnosgimp

    • Ignore SGI parallelization directives; use with -mp
  • -Mconcur[=option[,option,...]]

    • Automatically generate parallel loops. Instructs the compiler to enable auto-concurrentization of loops. This also sets the optimization level to a minimum of 2; see -O. If -Mconcur is specified, multiple processors will be used to execute loops which the compiler determines to be parallelizable. When linking, the -Mconcur switch must be specified or unresolved references will occur. The NCPUS environment variable controls how many processors will be used to execute parallelized loops. The options can be one or more of the following:
    • altcode:n or noaltcode

    • Generate (don't generate) alternate scalar code for parallelized loops. The parallelizer generates scalar code to be executed whenever the loop count is less than or equal to n. If noaltcode is specified, the parallelized version of the loop is always executed regardless of the loop count.
    • altreduction or altreduction:n

    • Generate alternate scalar code for parallelized loops containing a reduction. If a parallelized loop contains a reduction, the parallelizer generates scalar code to be executed whenever the loop count is less than or equal to n.
    • assoc (default) or noassoc

    • Enable (disable) parallelization of loops with reductions.
    • dist:block

    • Parallelize with block distribution. Contiguous blocks of iterations of a parallelizable loop are assigned to the available processors.
    • dist:cyclic

    • Parallelize with cyclic distribution. The outermost parallelizable loop in any loop nest is parallelized. If a parallelized loop is innermost, its iterations are allocated to processors cyclically. For example, if there are 3 processors executing a loop, processor 0 performs iterations 0, 3, 6, etc; processor 1 performs iterations 1, 4, 7, etc; and processor 2 performs iterations 2, 5, 8, etc.
    • cncall or nocncall (default)

    • Assume (don't assume) that loops containing calls are safe to parallelize. Also, no minimum loop count threshold must be satisfied before parallelization will occur, and last values of scalars are assumed to be safe.
    • levels:n

    • Parallelize loops nested at most n levels deep; the default is 3.
  • -Mdepchk (default) or -Mnodepchk

  • Assume (don't assume) that potential data dependencies exist. -Mnodepchk may result in incorrect code.
  • -Msafe_lastval

  • Allow parallelization of loops with conditional scalar assignments. In the case where a scalar is used after a loop but is not defined on every iteration of the loop, the compiler does not by default parallelize the loop. However, this option tells the compiler it is safe to parallelize the loop.