Optimization flags for ifort and gfortran

I have a very specific question regarding flags for optimized code that runs only on the machine where the application was built:

Is my understanding correct that ifort’s -xHost is equivalent to gfortran’s -march=native?

I am not sure, but I would say that you need to specify “-mtune=native” to get a close enough equivalent. “-march=native” might still cater for other machines and I do not know if that would slow down the program or not. But that is just my reading of the descriptions.

@Arjen: My understanding was exactly the other way round. The gcc manual says: “Specifying -march=cpu-type implies -mtune=cpu-type” and “-march=cpu-type allows GCC to generate code that may not run at all on processors other than the one indicated.”

The Intel manual says for -xcode: Do not use code values to create binaries that will execute on a processor that is not compatible with the targeted processor.

I guess you’re right - I overlooked the first sentence, jumping directly to the “native” entry.

-xHost is not like other -x options, which require running on an Intel CPU with an instruction set compatible with what follows the x. -xHost queries the CPU type and applies the appropriate -x option for Intel CPUs, or the appropriate -arch option for non-Intel CPUs. As it happens, I wrote the initial implementation of this for Intel compilers, though I don’t know what it looks like now.

-xHost uses the best-available instruction set for the CPU on which you compiled the source. That doesn’t mean it runs ONLY on that CPU, but rather any CPU you do run it on has to support that level of instruction set. If you compile on an Intel CPU, a run-time check is added to the main program that will give an error if the CPU type is not compatible (or it’s not an Intel CPU.) If you compile on a non-Intel CPU, no check is added and you might get reserved instruction faults, etc.

It is also the case that some optimizations are enabled only for Intel CPUs. (There is no “deoptimization” for non-Intel.)

The Intel math library and Intel MKL also do “CPU dispatch” which, for some Intel CPUs, executes a customized instruction path. Non-Intel CPUs take the “generic” path. There are options to always use the generic path for better cross-platform reproducibility.

I’ll also mention that -fast implies -xHost, along with a bunch of other options.

The title describes “Optimization” but your question may be more specific about “-march=native”.

Addressing optimization: with gfortran, I also always use -ffast-math for 8-byte reals. I have never found this to produce unreliable results.

I have found that code compiled on an older Intel i5 or i7 works on a newer Intel i7 (as expected). It also runs on an AMD Ryzen, but performs poorly there. Recompiling with -march=native on the processor where the code will run gives a better result.

My mainly FE computation is dominated by 8-byte reals in ddotp and daxpy calculations, where I hope AVX instructions are efficiently used. I have found changing the Fortran code provides little benefit, with gfortran -march=native -ffast-math -O3 the best approach where AVX instructions are targeted. I have also used -fopenmp -fstack-arrays to maximise stack usage.
I have also been advised to use -mno-fma, but this does not show a significant effect.
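
For readers who want something concrete to try these flags on, here is a minimal sketch of the two kernel shapes mentioned above. The names ddot_sketch and daxpy_sketch are hypothetical stand-ins, not the routines from the post, and the sketch assumes "ddotp" means a dot product and "daxpy" means y = y + a*x. The comments describe how gfortran typically treats such loops; that is worth checking with your own compiler version.

```fortran
module fe_kernels_sketch
   use, intrinsic :: iso_fortran_env, only: dp => real64
   implicit none
contains
   ! Dot product of two 8-byte real vectors.
   ! This is a floating-point reduction: vectorizing it reorders the
   ! additions, which gfortran generally does only when reassociation
   ! is permitted (for example via -ffast-math).
   pure function ddot_sketch(x, y) result(s)
      real(dp), intent(in) :: x(:), y(:)
      real(dp) :: s
      integer :: i
      s = 0.0_dp
      do i = 1, min(size(x), size(y))
         s = s + x(i)*y(i)
      end do
   end function ddot_sketch

   ! y = y + a*x, element by element.
   ! No reduction here, so this loop can usually be vectorized at -O3
   ! without -ffast-math. The multiply-add is a candidate for an FMA
   ! instruction when the target supports it (what -mno-fma turns off).
   pure subroutine daxpy_sketch(a, x, y)
      real(dp), intent(in)    :: a, x(:)
      real(dp), intent(inout) :: y(:)
      integer :: i
      do i = 1, min(size(x), size(y))
         y(i) = y(i) + a*x(i)
      end do
   end subroutine daxpy_sketch
end module fe_kernels_sketch
```

Timing a driver that calls these on large arrays, built once with and once without -march=native -ffast-math, is one way to see whether the flags matter on a given machine.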

In reality, the Kahan sum is only as accurate as the least accurate value in x(:), which is typically the largest value. Any suggestion of improved accuracy from the carry-over approach is ignoring the accuracy of the original data, although we do like to assess the accuracy of the calculation and the data separately.
That my calculations are not significantly affected by the features you are describing is due to the robust nature of the Finite Element algorithm being used.
After solving for x in “f = K . x”, I then calculate the error vector e = abs (f - K . x), as an estimate of round-off errors for all equations due to the K matrix reduction and x calculation. (There was a time when I used “real*10 e(neq)”.)
There can be significant power variation in the accumulated values in the K matrix, which is why 8-byte storage is used, even though many contributing elastic values used are accurate to only 4 or 5 sig figures.
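
The residual check described above can be sketched as follows. This is only an illustration under stated assumptions: K is held as a dense 8-byte real matrix (a real FE code would more likely use banded or sparse storage), and the name residual_error is hypothetical.

```fortran
module residual_check_sketch
   use, intrinsic :: iso_fortran_env, only: dp => real64
   implicit none
contains
   ! Element-wise error vector e = abs(f - K.x) for a computed solution x.
   ! Assumption: K is dense, so the intrinsic matmul forms K.x directly.
   pure function residual_error(K, x, f) result(e)
      real(dp), intent(in) :: K(:,:), x(:), f(:)
      real(dp) :: e(size(f))
      e = abs(f - matmul(K, x))
   end function residual_error
end module residual_check_sketch
```

Comparing maxval(residual_error(K, x, f)) with maxval(abs(f)) gives a rough relative measure of the round-off accumulated in the solve.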

@skargl
I modified your example, with various values of n (3,4 or 6 million), but am not sure what the results demonstrate. They are different from your example. I didn’t use the compile options you provided, but did get considerable variation in the accumulated sum during the calculation of y4 for differing n.
Dsum is a simple version of extended precision (that would have worked well on an 8087!)
Does random_number(x) reproduce a similar sequence for varying n?

!   
!% gfcx -o z -O3 -march=native k.f90 && ./z
! 3EA688C4 BE9E0A9A BEB061E0 3EF8833C
!-3.24500702E+02 C3A24017
!-3.24496216E+02 C3A23F84
!-3.24496216E+02 C3A23F84
!
!% gfcx -o z -O3 -march=native -ffast-math k.f90 && ./z
! 3EA688C4 BE9E0A9A BEB061E0 3EF8833C
!-3.24494629E+02 C3A23F50
!-3.24500702E+02 C3A24017
!-3.24496216E+02 C3A23F84
!
!% gfcx -o z -O3 -march=native -ffast-math -fno-associative-math k.f90 && ./z
! 3EA688C4 BE9E0A9A BEB061E0 3EF8833C
!-3.24500702E+02 C3A24017
!-3.24496216E+02 C3A23F84
!-3.24496216E+02 C3A23F84
!
program k
   implicit none
   integer, parameter :: million = 1000*1000
   integer, parameter :: n = 6*million
   real x(n), y1, y2, y3, y4

   call random_init(.true., .false.)
   call random_number(x)
   x = x - 0.5

   write(*,'(6Z9)') x(1:6)
   write(*,'(6f9.6)') x(1:6)
   y1 = sum(x)
   y2 = mysum(x)
   y3 = real(sum(real(x,kind(1.d0))),kind(1.e0))
   y4 = Dsum(x)

   write(*,'(ES15.8,1X,Z8)') y1, y1
   write(*,'(ES15.8,1X,Z8)') y2, y2
   write(*,'(ES15.8,1X,Z8)') y3, y3
   write(*,'(ES15.8,1X,Z8)') y4, y4

   contains
      function mysum(x) result(r)
         real r
         real, intent(in) :: x(:)
         integer i
         real c, y, t
         c = 0
         r = x(1)
         do i = 2, size(x)
            y = x(i) - c
            t = r + y
            c = (t - r) - y
            r = t
         end do
      end function mysum

      function Dsum(x) result(r)
         real r
         real, intent(in) :: x(:)
         integer i,next,inc
         real*10 y, xi
         inc  = size(x)/20
         next = inc
         y = 0
         do i = 1, size(x)
            xi = x(i)
            y  = y + xi
            if ( i==next ) then
              write (*,*) i,y
              next = next+inc
            end if
         end do
         r = y
      end function Dsum
end program k

I think OP is correct.
I think Intel’s -O3 -xHost is equivalent to gfortran’s -O3 -march=native.
I have tested some of my programs with the above flags, and the speeds with Intel and gfortran are usually similar.
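
When comparing timings between compilers like this, it can help to record which compiler and flags produced each binary. A minimal sketch, assuming a compiler that supports the Fortran 2008 inquiry functions compiler_version and compiler_options from iso_fortran_env; the module and function names are hypothetical, and the format of the returned strings is compiler-dependent (for instance, a compiler may report the expanded form of -march=native rather than the flag as typed).

```fortran
module build_info_sketch
   use, intrinsic :: iso_fortran_env, only: compiler_version, compiler_options
   implicit none
contains
   ! Two-line description of how this binary was built.
   ! The content of both strings is chosen by the compiler.
   function build_summary() result(s)
      character(len=:), allocatable :: s
      s = 'compiler: ' // compiler_version() // new_line('a') // &
          'options: '  // compiler_options()
   end function build_summary
end module build_info_sketch
```

Printing build_summary() at program start, next to the timing output, keeps the flags and the numbers together in the same log.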

But gfortran also has a flag called -fast; I believe it includes -O3 -march=native. It may do some more aggressive optimization, but perhaps it has drawbacks, so I do not use it with gfortran.
https://docs.oracle.com/cd/E19059-01/stud.10/819-0492/3_options.html
It says:

-fast

Select options that optimize execution performance.

Note - This option is defined as a particular selection of other options that is subject to change from one release to another, and between compilers. Also, some of the options selected by -fast might not be available on all platforms. Compile with the -v (verbose) flag to see the expansion of -fast for any release.

-fast provides high performance for certain benchmark applications. However, the particular choice of options may or may not be appropriate for your application. Use -fast as a good starting point for compiling your application for best performance. But additional tuning may still be required. If your program behaves improperly when compiled with -fast, look closely at the individual options that make up -fast and invoke only those appropriate to your program that preserve correct behavior.

Note also that a program compiled with -fast may show good performance and accurate results with some data sets, but not with others. Avoid compiling with -fast those programs that depend on particular properties of floating-point arithmetic.

Because some of the options selected by -fast have linking implications, if you compile and link in separate steps be sure to link with -fast also.

-fast selects the following options:

  • -dalign
  • -depend (SPARC)
  • -fns
  • -fsimple=2
  • -ftrap=common
  • -libmil
  • -xtarget=native
  • -O5
  • -xlibmopt
  • -pad=local (SPARC)
  • -xvector=yes (SPARC)
  • -xprefetch=yes
  • -xprefetch_level=2
  • -nofstore (x86)

Details about the options selected by -fast:

  • The -xtarget=native hardware target.
    If the program is intended to run on a different target than the compilation machine, follow the -fast with a code-generator option. For example:
    f95 -fast -xtarget=ultra …
  • The -O5 optimization level option.
  • The -depend option analyzes loops for data dependencies and possible restructuring. (SPARC)
  • The -libmil option for system-supplied inline expansion templates.
    For C functions that depend on exception handling, follow -fast by -nolibmil (as in -fast -nolibmil). With -libmil, exceptions cannot be detected with errno or matherr(3m).
  • The -fsimple=2 option for aggressive floating-point optimizations.
    -fsimple=2 is unsuitable if strict IEEE 754 standards compliance is required. See -fsimple[={1|2|0}].
  • The -dalign option to generate double loads and stores for double and quad data in common blocks. Using this option can generate nonstandard Fortran data alignment in common blocks.
  • The -xlibmopt option selects optimized math library routines.
  • -pad=local inserts padding between local variables, where appropriate, to improve cache usage. (SPARC)
  • -xvector=yes transforms certain math library calls within DO loops to single calls to a vectorized library equivalent routine with vector arguments.
  • -fns selects non-standard floating-point arithmetic exception handling and gradual underflow. See -fns[={yes|no}].
  • Trapping on common floating-point exceptions, -ftrap=common, is enabled with Fortran 95.
  • -xprefetch=yes enables the compiler to generate hardware prefetch instructions where appropriate.
  • -xprefetch_level=2 sets the default level for insertion of prefetch instructions.
  • -nofstore cancels forcing expressions to have the precision of the result. (x86)

It is possible to add or subtract from this list by following the -fast option with other options, as in:

f95 -fast -fsimple=1 -xnolibmopt …

which overrides the -fsimple=2 option and disables the -xlibmopt selected by -fast.

Because -fast invokes -dalign, -fns, -fsimple=2, programs compiled with -fast can result in nonstandard floating-point arithmetic, nonstandard alignment of data, and nonstandard ordering of expression evaluation. These selections might not be appropriate for most programs.

Note that the set of options selected by the -fast flag can change with each compiler release.