Compilation flags advice for production and distribution

Following this nice suggestion, maybe a good thread is one that discusses the preferred compilation flags, considering these (and other?) characteristics:

1. The source code is distributed: which flags provide the best performance, and which flags are safe in the sense of being available in all compilers and for any architecture? One would not like to put a set of default flags in a Makefile and have users run into errors because of them.

2. Distribution of a binary: which performance flags are safe in the same sense as above? Is using -static the correct and best way to distribute standalone binaries?
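One way to avoid shipping default flags that break on a user's compiler is to probe each flag before using it. The following is only a minimal sketch under stated assumptions: the function names `flag_supported` and `filter_flags` are hypothetical, and the probe relies on the compiler returning a non-zero exit status for an unknown flag. Some compilers only print a warning and carry on, so the result is a heuristic, not a guarantee.

```shell
# Return success if compiler $1 accepts flag $2 when compiling a trivial program.
# Assumption: the compiler exits with a non-zero status on an unsupported flag.
flag_supported() {
    fs_fc=$1
    fs_flag=$2
    fs_tmp=$(mktemp -d) || return 1
    printf 'program probe\nend program probe\n' > "$fs_tmp/probe.f90"
    "$fs_fc" "$fs_flag" -c "$fs_tmp/probe.f90" -o "$fs_tmp/probe.o" >/dev/null 2>&1
    fs_status=$?
    rm -rf "$fs_tmp"
    return "$fs_status"
}

# Print the subset of the candidate flags ($2, $3, ...) that compiler $1 accepts.
filter_flags() {
    ff_fc=$1
    shift
    ff_accepted=""
    for ff_flag in "$@"; do
        if flag_supported "$ff_fc" "$ff_flag"; then
            ff_accepted="$ff_accepted $ff_flag"
        fi
    done
    printf '%s\n' "${ff_accepted# }"
}
```

A Makefile or build script could then call, for example, `FFLAGS=$(filter_flags gfortran -O3 -funroll-loops -ftree-vectorize)` and pass only the accepted flags on to the real build.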

The choice of compiler flags depends a lot on the goal. Is it a shareable library? Is it a final main program on a supercomputer?
These are the flags I use to generate libraries,

  • Intel ifort debug on UNIX:
    -debug full           # generate full debug information
    -g3                   # generate full debug information
    -O0                   # disable optimizations
    -CB                   # Perform run-time bound-checks on array subscript and substring references (same as the -check bounds option)
    -init=snan,arrays     # initialize arrays and scalars to signaling NaN
    -warn all             # enable all warnings
    -gen-interfaces       # generate interface block for each routine in the source files
    -traceback            # trace back for debugging
    -check all            # check all
    -check bounds         # check array bounds
    #-fpe-all=0            # Floating-point invalid, divide-by-zero, and overflow exceptions are enabled
    -fpe0                 # Ignore underflow (yield 0.0); Abort on other IEEE exceptions.
    -diag-error-limit=10  # max diagnostic error count
    -diag-disable=5268    # Extension to standard: The text exceeds right hand column allowed on the line.
    -diag-disable=7025    # This directive is not standard Fxx.
    -diag-disable=10346   # optimization reporting will be enabled at link time when performing interprocedural optimizations.
    -ftrapuv              # Initializes stack local variables to an unusual value to aid error detection.
    
  • Intel ifort debug on Windows
     /debug:full
     /stand:f18      # issue compile-time messages for nonstandard language elements.
     /Zi
     /CB
     /Od
     /Qinit:snan,arrays
     /warn:all
     /gen-interfaces
     /traceback
     /check:all
     /check:bounds
     #/fpe-all:0
     /fpe:0
     /Qdiag-error-limit:10
     /Qdiag-disable:5268
     /Qdiag-disable:7025
     /Qtrapuv
    
  • Intel ifort release on Windows:
     /O3                             # Enable O3 optimization.
     /Qvec                           # enable vectorization.
     /Qunroll                        # [:n] set the maximum number of times to unroll loops (no number n means automatic).
     /Qunroll-aggressive             # use more aggressive unrolling for certain loops.
     /Qinline-forceinline            # Instructs the compiler to force inlining of functions suggested for inlining whenever the compiler is capable doing so.
     #/Qguide-vec:4                   # enable guidance for auto-vectorization, causing the compiler to generate messages suggesting ways to improve optimization (default=4, highest).
     #/Qparallel                      # generate multithreaded code for loops that can be safely executed in parallel.
     #/Qipo-c:                        # Tells the compiler to optimize across multiple files and generate a single object file ipo_out.obj without linking
                                     # info at: https://software.intel.com/en-us/Fortran-compiler-developer-guide-and-reference-ipo-c-qipo-c
     /Qftz
     /Qipo               # enable interprocedural optimization between files.
     /Qip                # determines whether additional interprocedural optimizations for single-file compilation are enabled.
    
  • Intel ifort release on UNIX:
     -stand f18                      # issue compile-time messages for nonstandard language elements.
     -O3                             # set the optimizations level
     -unroll                         # [=n] set the maximum number of times to unroll loops (no number n means automatic).
     -unroll-aggressive              # use more aggressive unrolling for certain loops.
     -diag-disable=10346             # optimization reporting will be enabled at link time when performing interprocedural optimizations.
     -diag-disable=10397             # optimization reports are generated in *.optrpt files in the output location.
     #-guide-vec=4                    # enable guidance for auto-vectorization, causing the compiler to generate messages suggesting ways to improve optimization (default=4, highest).
     #-parallel                       # generate multithreaded code for loops that can be safely executed in parallel. This option requires MKL libraries.
     #-qopt-subscript-in-range        # assumes there are no "large" integers being used or being computed inside loops. A "large" integer is typically > 2^31.
     -ftz                            # Flushes subnormal results to zero.
     -inline-forceinline # Instructs the compiler to force inlining of functions suggested for inlining whenever the compiler is capable doing so.
     -finline-functions  # enables function inlining for single file compilation.
     -ipo                # enable interprocedural optimization between files.
     -ip                 # determines whether additional interprocedural optimizations for single-file compilation are enabled.
    
  • gfortran debug
      -g3                                 # generate full debug information
      -O0                                 # disable optimizations
     #-fsanitize=undefined                # enable UndefinedBehaviorSanitizer for undefined behavior detection.
     #-fsanitize=address                  # enable AddressSanitizer, for memory error detection, like out-of-bounds and use-after-free bugs.
     #-fsanitize=leak                     # enable LeakSanitizer for memory leak detection.
      -fcheck=all                         # enable the generation of run-time checks
      -ffpe-trap=invalid,zero,overflow    # ,underflow : Floating-point invalid, divide-by-zero, and overflow exceptions are enabled
      -ffpe-summary=all                   # Specify a list of floating-point exceptions, whose flag status is printed to ERROR_UNIT when invoking STOP and ERROR STOP.
                                          # Can be either 'none', 'all' or a comma-separated list of the following exceptions:
                                          # 'invalid', 'zero', 'overflow', 'underflow', 'inexact' and 'denormal'
      -finit-integer=-2147483647          # initialize all integers to a large negative value (integers have no infinity)
      -finit-real=snan                    # initialize REAL and COMPLEX variables with a signaling NaN
      -fbacktrace                         # trace back for debugging
     #-pedantic                           # issue warnings for uses of extensions to the Fortran standard. Gfortran10 with MPICH 3.2 in debug mode crashes with this flag at mpi_bcast. Excluded until MPICH upgraded.
      -fmax-errors=10                     # max diagnostic error count
      -Wno-maybe-uninitialized            # avoid warning of no array pre-allocation.
      -Wall                               # enable all warnings:
                                          # -Waliasing, -Wampersand, -Wconversion, -Wsurprising, -Wc-binding-type, -Wintrinsics-std, -Wtabs, -Wintrinsic-shadow,
                                          # -Wline-truncation, -Wtarget-lifetime, -Winteger-division, -Wreal-q-constant, -Wunused, -Wundefined-do-loop
                                          # gfortran10 crashes and cannot compile MPI ParaMonte with MPICH in debug mode. Therefore -Wall is disabled for now, until MPICH upgrades its interface.
     #-Wconversion-extra                  # Warn about implicit conversions between different types and kinds. This option does not imply -Wconversion.
     #-Werror=conversion                  # Turn all implicit conversions into an error. This is important to avoid inadvertent implicit change of precision in generic procedures of various kinds, due to the use of `RK` to represent different kinds.
     #-Werror=conversion-extra            # Turn all implicit conversions into an error. This is too aggressive and as such currently deactivated. For example, it yields an error on the multiplication of integer with real.
      -fno-unsafe-math-optimizations
      -fsignaling-nans
      -frounding-math
      -Wno-surprising                     # -Wsurprising yields many false positives like "Array x at (1) is larger than limit set by '-fmax-stack-var-size='".
    
  • gfortran release on MacOS
     -fauto-inc-dec
     -fbranch-count-reg
     -fcombine-stack-adjustments
     -fcompare-elim
     -fcprop-registers
     -fdce
     -fdefer-pop
     #-fdelayed-branch
     -fdse
     -fforward-propagate
     -fguess-branch-probability
     -fif-conversion
     -fif-conversion2
     -finline-functions-called-once
     -fipa-profile
     -fipa-pure-const
     -fipa-reference
     -fipa-reference-addressable
     -fmerge-constants
     -fmove-loop-invariants
     -fomit-frame-pointer
     -freorder-blocks
     -fshrink-wrap
     -fshrink-wrap-separate
     -fsplit-wide-types
     -fssa-backprop
     -fssa-phiopt
     -ftree-bit-ccp
     -ftree-ccp
     -ftree-ch
     -ftree-coalesce-vars
     -ftree-copy-prop
     -ftree-dce
     -ftree-dominator-opts
     -ftree-dse
     -ftree-forwprop
     -ftree-fre
     -ftree-phiprop
     -ftree-pta
     -ftree-scev-cprop
     -ftree-sink
     -ftree-slsr
     -ftree-sra
     -ftree-ter
     -funit-at-a-time
     -falign-functions  -falign-jumps
     -falign-labels  -falign-loops
     -fcaller-saves
     -fcode-hoisting
     -fcrossjumping
     -fcse-follow-jumps  -fcse-skip-blocks
     -fdelete-null-pointer-checks
     -fdevirtualize  -fdevirtualize-speculatively
     -fexpensive-optimizations
     -fgcse  -fgcse-lm
     -fhoist-adjacent-loads
     -finline-functions
     -finline-small-functions
     -findirect-inlining
     -fipa-bit-cp  -fipa-cp  -fipa-icf
     -fipa-ra  -fipa-sra  -fipa-vrp
     -fisolate-erroneous-paths-dereference
     -flra-remat
     -foptimize-sibling-calls
     -foptimize-strlen
     -fpartial-inlining
     -fpeephole2
     -freorder-blocks-algorithm=stc
     -freorder-blocks-and-partition  -freorder-functions
     -frerun-cse-after-loop
     -fschedule-insns  -fschedule-insns2
     -fsched-interblock  -fsched-spec
     -fstore-merging
     -fstrict-aliasing
     -fthread-jumps
     -ftree-builtin-call-dce
     -ftree-pre
     -ftree-switch-conversion  -ftree-tail-merge
     -ftree-vrp
     -fgcse-after-reload
     -fipa-cp-clone
     -floop-interchange
     -floop-unroll-and-jam
     -fpeel-loops
     -fpredictive-commoning
     -fsplit-paths
     -ftree-loop-distribute-patterns
     -ftree-loop-distribution
     -ftree-loop-vectorize
     -ftree-partial-pre
     -ftree-slp-vectorize
     -funswitch-loops
     -fvect-cost-model
     -fversion-loops-for-strides
    
  • gfortran linux/windows
     -ftree-vectorize        # perform vectorization on trees. enables -ftree-loop-vectorize and -ftree-slp-vectorize.
     -funroll-loops          # unroll loops whose number of iterations can be determined at compile time or upon entry to the loop.
     -O3                     # set the optimizations level
     -finline-functions      # consider all functions for inlining, even if they are not declared inline.
     #-fwhole-program         # allow the compiler to make assumptions on the visibility of the symbols leading to more aggressive optimization decisions.
     -flto=3                 # enable interprocedural optimization between files in parallel on 3 processors.
    

I don’t remember the specific reasons, but I ended up separating the GNU release flags for macOS from Linux/Windows because a particular unknown flag, switched on by -O3 (or -flto), led to segfaults on MacOS. This happened several years ago with GNU 7/8/9(?). The bugs may have been resolved in the newer releases of GNU compilers.
These flags are taken from the CMake files of the ParaMonte library. These flags do not include architecture-specific optimization flags that Steve Kargl listed (to ensure portability of the generated library).
Perhaps, a compilation of the flags similar to the above should appear on the FortranLang website, if there is not any there already.

Intel has an excellent summary of its compiler flags.

One final note: enabling interprocedural optimizations (e.g., with -flto or -ipo) can significantly lengthen the build, possibly extending a 30-second process to 30 minutes.

I mentioned only gfortran and ifort because I have experience primarily with these two.

More a shared library or a standalone package for use on personal or small HPC systems. That is, how to have a more or less standard set of flags that provides both good performance and portability. The main distinction I make here is whether it is the binary or the source that is being distributed. For instance, it is now my understanding that -march=native should be used with care if the goal is to distribute a binary, but it is important for performance if the user can compile the code locally.
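The binary-versus-source distinction can be sketched as a small helper that picks the architecture flags. This is an illustration only: the function name `arch_flags` and the mode names `source` and `binary` are hypothetical, and the flags assume a GNU-style compiler (gfortran) targeting x86_64.

```shell
# Print architecture flags for a GNU-style compiler on x86_64.
# $1 is a hypothetical mode name:
#   source  the user compiles locally, so tuning for the build machine is fine
#   binary  the result is shipped to other machines, so stay at the baseline ISA
arch_flags() {
    case "$1" in
        source) printf '%s\n' "-march=native" ;;
        binary) printf '%s\n' "-march=x86-64 -mtune=generic" ;;
        *) printf 'unknown distribution mode: %s\n' "$1" >&2; return 1 ;;
    esac
}
```

For example, `gfortran -O3 $(arch_flags binary) -c mylib.f90` would build an object that does not rely on instruction-set extensions of the build machine.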

For high end HPC a deeper understanding of all options is probably expected from the user. I’m more concerned about users that are not necessarily familiar with programming and will download an executable or a source code and just follow simple build instructions.

I have used the above flags to generate portable x86_64 libraries for several years and have not encountered problems. The nature of the libraries that I distribute does not require native optimizations because the heavy computations are supposed to happen somewhere else (in the user-supplied external functions). But if such optimizations are needed it is easy to add them to the build generators (like CMake).
I’d caution against the use of some of the flags listed above. For example,

-parallel

of the Intel ifort compiler creates a runtime dependency on the Intel OpenMP shared library, if I remember correctly. That is not a big deal if you also share the Intel runtime libraries with the users, or simply ship the OMP library along with yours. This is what I frequently do with gfortran runtime dependencies.
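To see which compiler runtime libraries would need to be shipped, one option on Linux is to filter the output of `ldd`. The sketch below is an assumption-laden illustration: the function name `runtime_deps` is hypothetical, and the list of library name prefixes is a guess at common gfortran and Intel runtimes, not an exhaustive list.

```shell
# Read `ldd` output on stdin and print the resolved paths of compiler runtime
# libraries. The library name list is an assumption; extend it as needed.
runtime_deps() {
    awk '$2 == "=>" && $3 ~ /^\// &&
         $1 ~ /^lib(gfortran|quadmath|gomp|iomp5|ifcore|imf|svml|intlc)\./ { print $3 }'
}
```

Typical usage would be `ldd ./libmylib.so | runtime_deps`, where `libmylib.so` stands for your own library. On macOS, `otool -L` plays the role of `ldd`, but its output format differs, so the filter would need adjusting.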

Another example is the -fwhole-program option of gfortran which I think should be specified only for local main programs (and that is why I commented it out in the above).

Over the years, I realized the easiest way to handle different build scenarios with different compilers is to list them all along with the relevant compile flags in a build generator like CMake, something like,

if (intel_enabled AND release_enabled)
...
endif()
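The same dispatch idea can be sketched in shell instead of CMake. This is a hypothetical illustration, not the ParaMonte build logic: the function name `select_flags` and the labels `intel`, `gnu`, `debug`, and `release` are made up for the example, and each flag set is a shortened subset of the lists given earlier in this thread.

```shell
# Print compile flags for a compiler family ($1) and build type ($2).
# Labels and the shortened flag sets are illustrative assumptions.
select_flags() {
    case "$1:$2" in
        intel:debug)   printf '%s\n' "-O0 -g3 -traceback -check all -fpe0 -warn all" ;;
        intel:release) printf '%s\n' "-O3 -unroll -ftz" ;;
        gnu:debug)     printf '%s\n' "-O0 -g3 -fbacktrace -fcheck=all -ffpe-trap=invalid,zero,overflow" ;;
        gnu:release)   printf '%s\n' "-O3 -funroll-loops -finline-functions -ftree-vectorize" ;;
        *) printf 'unsupported combination: %s %s\n' "$1" "$2" >&2; return 1 ;;
    esac
}
```

Keeping every compiler and build combination in one place like this makes it easy to see at a glance which scenarios are covered and to add new ones.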

One such example is this file, although this file is more reflective of my learning history than of good CMake file design (a clean, fully revamped version is currently under construction in a private branch).
Once the CMake (or meson, …) build generator files are set up, it truly becomes bliss to compile and run complex codebases for intricate build scenarios with any compiler, even for the end-users. For example, to build the ParaMonte library, the user would only need to type

install.sh --build release

in Bash, or

install.bat --build release

in Windows command line. The rest is all taken care of for the user, even the compiler installation (on Linux, for example). On Linux, the user does not even need to know if the codebase is in Fortran, C, or … . Then, one can add more build scenarios to the scripts, like a --build fast option to build for the native arch. Currently, I separate ifort’s -ipo and gfortran’s -flto from the rest of the flags and invoke them once all development is done via a separate build flag, --build ipo, because these two enable interprocedural optimizations and function inlining that can make the linking process too long. I specify these flags only in the final production stage.
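How a `--build` option might map to flag sets can be sketched as follows. This is not the actual ParaMonte install.sh; the function name `build_flags`, the default mode, and the exact flag sets are assumptions for illustration, using gfortran flags from the lists above.

```shell
# Parse a "--build <mode>" option and print gfortran flags for that mode.
# Hypothetical sketch: modes mirror the ones described above (release, fast, ipo)
# plus debug; "release" is assumed as the default.
build_flags() {
    bf_mode=release
    while [ "$#" -gt 0 ]; do
        case "$1" in
            --build)
                [ "$#" -ge 2 ] || { printf '%s\n' "--build needs a value" >&2; return 2; }
                bf_mode=$2
                shift 2 ;;
            *) printf 'unknown option: %s\n' "$1" >&2; return 2 ;;
        esac
    done
    bf_release="-O3 -funroll-loops -finline-functions -ftree-vectorize"
    case "$bf_mode" in
        debug)   printf '%s\n' "-O0 -g3 -fcheck=all -fbacktrace" ;;
        release) printf '%s\n' "$bf_release" ;;
        fast)    printf '%s\n' "$bf_release -march=native" ;;   # native arch, not portable
        ipo)     printf '%s\n' "$bf_release -flto" ;;           # slow link, final production only
        *) printf 'unknown build mode: %s\n' "$bf_mode" >&2; return 2 ;;
    esac
}
```

A wrapper script could call `FFLAGS=$(build_flags "$@")` and hand the result to the underlying build generator, so the end user only ever types something like `--build release`.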

In sum, if portability is desired, the above optimization flags should be good. But if that is not enough, having a build script that automates the process for the end-users is a great solution. It takes some time to write and fine-tune, but once written, it is there forever, or for as long as Bash terminals are available on computers. The compiler options and Bash scripting are unlikely to change for the foreseeable future, making such build scripts a good time investment.

The Intel runtime libraries can be downloaded from this website.

Note: For my money, the best way to get the Intel libs nowadays is via conda: Package repository for intel :: Anaconda.org

Just cross-linking this other interesting post for future reference: How to correctly determine the `-march` flag when working with Intel oneAPI on Intel processors? - #3 by Hongyi

A question that has occurred to me when reading these two topics: do “standard libraries” (containing Fortran intrinsic procedures) provided by compiler vendors contain any alternative code to use advanced CPU instructions when run on such processors?
