I have a very specific question regarding flags for optimized code that runs only on the machine where the application was built:

Is my understanding correct that ifort’s `-xHost` is equivalent to gfortran’s `-march=native`?

I am not sure, but I would say that you need to specify “-mtune=native” to get a close enough equivalent. “-march=native” might still cater for other machines and I do not know if that would slow down the program or not. But that is just my reading of the descriptions.

@Arjen: My understanding was exactly the other way round. The GCC manual says: “Specifying `-march=cpu-type` implies `-mtune=cpu-type`” and “`-march=cpu-type` allows GCC to generate code that may not run at all on processors other than the one indicated.”

The Intel manual says for `-xcode`: Do not use `code` values to create binaries that will execute on a processor that is not compatible with the targeted processor.

I guess you’re right - I overlooked the first sentence, jumping directly to the “native” entry.

-xHost is not like other -x options, which require running on an Intel CPU with an instruction set compatible with what follows the x. -xHost queries the CPU type and applies the appropriate -x option for Intel CPUs, or the appropriate -arch option for non-Intel CPUs. As it happens, I wrote the initial implementation of this for Intel compilers, though I don’t know what it looks like now.

-xHost uses the best-available instruction set for the CPU on which you compiled the source. That doesn’t mean it runs ONLY on that CPU, but rather any CPU you do run it on has to support that level of instruction set. If you compile on an Intel CPU, a run-time check is added to the main program that will give an error if the CPU type is not compatible (or it’s not an Intel CPU.) If you compile on a non-Intel CPU, no check is added and you might get reserved instruction faults, etc.

It is also the case that some optimizations are enabled only for Intel CPUs. (There is no “deoptimization” for non-Intel.)

The Intel math library and Intel MKL also do “CPU dispatch” which, for some Intel CPUs, executes a customized instruction path. Non-Intel CPUs take the “generic” path. There are options to always use the generic path for better cross-platform reproducibility.

I’ll also mention that -fast implies -xHost, along with a bunch of other options.
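
One way to see what an executable was actually built with is to ask the compiler itself. The sketch below uses the standard `compiler_version` and `compiler_options` functions from `iso_fortran_env`; the program name is only illustrative. The exact text printed is processor dependent, so treat it as a diagnostic aid rather than a guaranteed format. With gfortran the reported options may show what `-march=native` was expanded to on the build machine.

```fortran
program show_build_flags
   ! Prints the compiler identification and the options used for this build.
   ! The format of both strings is processor dependent.
   use, intrinsic :: iso_fortran_env, only: compiler_version, compiler_options
   implicit none
   print '(a)', 'Compiler: ' // compiler_version()
   print '(a)', 'Options:  ' // compiler_options()
end program show_build_flags
```

Building this once with `-march=native` (or `-xHost`) and once without, then comparing the output, gives a quick check of which target the compiler selected.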

The title describes “Optimization” but your question may be more specific about “-march=native”.

Addressing optimization: using gfortran, I also always use -ffast-math for 8-byte reals. I have never found this to produce unreliable results.

I have found that code compiled on an older Intel i5 or i7 works on a newer Intel i7 (as expected); on an AMD Ryzen it still runs, but performs poorly. Recompiling with -march=native and running on the same processor provides a better result.

My computation is mainly FE and is dominated by 8-byte reals in ddotp and daxpy calculations, where I hope AVX instructions are efficiently used. I have found that changing the Fortran code provides little benefit; gfortran with -march=native -ffast-math -O3 is the best approach where AVX instructions are targeted. I have also used -fopenmp -fstack-arrays to maximise stack usage.

I have also been advised to use -mno-fma, but this does not show a significant effect.
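
For readers who want to try this, here is a minimal sketch of the two kernel shapes mentioned above (a daxpy-style update and a ddot-style reduction). The array size, scale factor and program name are made-up values for illustration, not taken from the FE code. As far as I know, GCC will only vectorize the floating-point reduction loop when reassociation is allowed (for example through -ffast-math), which may be part of why that flag helps here; the update loop has no such restriction.

```fortran
program blas1_kernels
   implicit none
   integer, parameter :: dp = kind(1.d0)
   integer, parameter :: n = 100000      ! hypothetical problem size
   real(dp), allocatable :: x(:), y(:)
   real(dp) :: a, s
   integer :: i

   allocate(x(n), y(n))
   call random_number(x)
   call random_number(y)
   a = 2.0_dp

   ! daxpy-style update: y = y + a*x (iterations are independent)
   do i = 1, n
      y(i) = y(i) + a*x(i)
   end do

   ! ddot-style reduction: the summation order matters for rounding,
   ! so vectorizing it requires permission to reassociate
   s = 0.0_dp
   do i = 1, n
      s = s + x(i)*y(i)
   end do

   print '(a,es22.15)', 'dot = ', s
end program blas1_kernels
```

Compiling with gfortran and `-fopt-info-vec` reports which loops were vectorized, so the effect of adding or removing -ffast-math can be checked directly.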

Then you’ve never used summation-with-carry in your Fortran code or written code where the order of operations enforced by parentheses matters.

```
!
!% gfcx -o z -O3 -march=native k.f90 && ./z
! 3EA688C4 BE9E0A9A BEB061E0 3EF8833C
!-3.24500702E+02 C3A24017
!-3.24496216E+02 C3A23F84
!-3.24496216E+02 C3A23F84
!
!% gfcx -o z -O3 -march=native -ffast-math k.f90 && ./z
! 3EA688C4 BE9E0A9A BEB061E0 3EF8833C
!-3.24494629E+02 C3A23F50
!-3.24500702E+02 C3A24017
!-3.24496216E+02 C3A23F84
!
!% gfcx -o z -O3 -march=native -ffast-math -fno-associative-math k.f90 && ./z
! 3EA688C4 BE9E0A9A BEB061E0 3EF8833C
!-3.24500702E+02 C3A24017
!-3.24496216E+02 C3A23F84
!-3.24496216E+02 C3A23F84
!
program k
   implicit none
   integer, parameter :: n = 3000000
   real x(n), y1, y2, y3
   call random_init(.true., .false.)
   call random_number(x)
   x = x - 0.5
   write(*,'(4Z9)') x(1:4)
   y1 = sum(x)
   y2 = mysum(x)
   y3 = real(sum(real(x,kind(1.d0))),kind(1.e0))
   write(*,'(ES15.8,1X,Z8)') y1, y1
   write(*,'(ES15.8,1X,Z8)') y2, y2
   write(*,'(ES15.8,1X,Z8)') y3, y3
contains
   function mysum(x) result(r)
      real r
      real, intent(in) :: x(:)
      integer i
      real c, y, t
      c = 0
      r = x(1)
      do i = 2, size(x)
         y = x(i) - c
         t = r + y
         c = (t - r) - y
         r = t
      end do
   end function
end program k
```

In reality, the Kahan sum is only as accurate as the least accurate value in x(:), which is typically the largest value. Any suggestion of improved accuracy from the carry-over approach ignores the accuracy of the original data, although we do like to assess the accuracy of the calculation and the data separately.

That my calculations are not significantly affected by the features you are describing is due to the robust nature of the Finite Element algorithm being used.

After solving for x in “f = K . x”, I then calculate the error vector e = abs (f - K . x), as an estimate of round-off errors for all equations due to the K matrix reduction and x calculation. (There was a time when I used “real*10 e(neq)”.)
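
A minimal sketch of that residual check, on a made-up 2x2 system solved with Cramer's rule (the matrix, right-hand side and solver are placeholders for illustration only, not the FE solver):

```fortran
program residual_check
   implicit none
   integer, parameter :: dp = kind(1.d0)
   real(dp) :: K(2,2), f(2), x(2), e(2), det

   ! Hypothetical small symmetric system, standing in for the FE equations
   K = reshape([4.0_dp, 1.0_dp, 1.0_dp, 3.0_dp], [2, 2])
   f = [1.0_dp, 2.0_dp]

   ! Solve f = K . x by Cramer's rule (adequate for a 2x2 illustration)
   det  = K(1,1)*K(2,2) - K(1,2)*K(2,1)
   x(1) = (f(1)*K(2,2) - K(1,2)*f(2)) / det
   x(2) = (K(1,1)*f(2) - f(1)*K(2,1)) / det

   ! Error vector: one round-off estimate per equation
   e = abs(f - matmul(K, x))

   print '(a,2es12.4)', 'x = ', x
   print '(a,2es12.4)', 'e = ', e
   print '(a,es12.4)',  'max error = ', maxval(e)
end program residual_check
```

Comparing `maxval(e)` between builds with and without -ffast-math is one simple way to see whether the flag matters for a given problem.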

There can be significant power variation in the accumulated values in the K matrix, which is why 8-byte storage is used, even though many contributing elastic values used are accurate to only 4 or 5 sig figures.

@skargl

I modified your example, with various values of n (3,4 or 6 million), but am not sure what the results demonstrate. They are different from your example. I didn’t use the compile options you provided, but did get considerable variation in the accumulated sum during the calculation of y4 for differing n.

Dsum is a simple version of extended precision (that would have worked well on an 8087!)

Does random_number(x) reproduce a similar sequence for varying n?

```
!
!% gfcx -o z -O3 -march=native k.f90 && ./z
! 3EA688C4 BE9E0A9A BEB061E0 3EF8833C
!-3.24500702E+02 C3A24017
!-3.24496216E+02 C3A23F84
!-3.24496216E+02 C3A23F84
!
!% gfcx -o z -O3 -march=native -ffast-math k.f90 && ./z
! 3EA688C4 BE9E0A9A BEB061E0 3EF8833C
!-3.24494629E+02 C3A23F50
!-3.24500702E+02 C3A24017
!-3.24496216E+02 C3A23F84
!
!% gfcx -o z -O3 -march=native -ffast-math -fno-associative-math k.f90 && ./z
! 3EA688C4 BE9E0A9A BEB061E0 3EF8833C
!-3.24500702E+02 C3A24017
!-3.24496216E+02 C3A23F84
!-3.24496216E+02 C3A23F84
!
program k
   implicit none
   integer, parameter :: million = 1000*1000
   integer, parameter :: n = 6*million
   real x(n), y1, y2, y3, y4
   call random_init(.true., .false.)
   call random_number(x)
   x = x - 0.5
   write(*,'(6Z9)') x(1:6)
   write(*,'(6f9.6)') x(1:6)
   y1 = sum(x)
   y2 = mysum(x)
   y3 = real(sum(real(x,kind(1.d0))),kind(1.e0))
   y4 = Dsum(x)
   write(*,'(ES15.8,1X,Z8)') y1, y1
   write(*,'(ES15.8,1X,Z8)') y2, y2
   write(*,'(ES15.8,1X,Z8)') y3, y3
   write(*,'(ES15.8,1X,Z8)') y4, y4
contains
   function mysum(x) result(r)
      real r
      real, intent(in) :: x(:)
      integer i
      real c, y, t
      c = 0
      r = x(1)
      do i = 2, size(x)
         y = x(i) - c
         t = r + y
         c = (t - r) - y
         r = t
      end do
   end function mysum
   function Dsum(x) result(r)
      real r
      real, intent(in) :: x(:)
      integer i,next,inc
      real*10 y, xi
      inc = size(x)/20
      next = inc
      y = 0
      do i = 1, size(x)
         xi = x(i)
         y = y + xi
         if ( i==next ) then
            write (*,*) i,y
            next = next+inc
         end if
      end do
      r = y
   end function Dsum
end program k
```
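
The question about `random_number` and varying n can be checked empirically. The sketch below assumes a compiler that supports the Fortran 2018 `random_init` intrinsic; the sizes and program name are arbitrary. It resets the generator to the same repeatable state twice and tests whether the shorter array is a prefix of the longer one. How an array is filled from the stream is processor dependent, so the outcome should be verified on the compiler in use rather than assumed.

```fortran
program rng_prefix_check
   implicit none
   integer, parameter :: n_small = 1000, n_large = 4000   ! arbitrary sizes
   real :: a(n_small), b(n_large)

   ! Reset to the same repeatable state before each fill
   call random_init(.true., .false.)
   call random_number(a)
   call random_init(.true., .false.)
   call random_number(b)

   ! Bitwise-equal values are expected if both fills draw from the same stream
   if (all(a == b(1:n_small))) then
      print '(a)', 'Shorter sequence is a prefix of the longer one.'
   else
      print '(a)', 'Sequences differ; prefix property does not hold here.'
   end if
end program rng_prefix_check
```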

Summation-with-carry is simply a convenient algorithm to show that the use of -ffast-math can be dangerous. The important concept to note is that -ffast-math allows the compiler to violate parentheses (among other possibly questionable numerical shortcuts). If you have any algorithm where parentheses are important, then one must avoid -ffast-math. It will simply give you the wrong result fast. Your post seems to advocate for the use of -ffast-math without acknowledging its shortcomings. I believe that that is reckless.
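
To make the point about parentheses concrete, here is a small sketch (values chosen purely for illustration) in which the two groupings of the same three single-precision numbers give different answers. The `volatile` attribute is used only to discourage the compiler from folding everything at compile time.

```fortran
program assoc_demo
   implicit none
   ! volatile: force the values to be loaded at run time
   real, volatile :: a, b, c
   real :: left, right

   a =  1.0e8
   b = -1.0e8
   c =  1.0

   left  = (a + b) + c   ! a and b cancel first, then c is added
   right = a + (b + c)   ! c is lost when added to b (spacing near 1e8 is 8)

   print *, 'left  = ', left
   print *, 'right = ', right
end program assoc_demo
```

With IEEE single precision and the parentheses honoured, `left` should come out as 1.0 and `right` as 0.0. Once the compiler is allowed to reassociate, as with -ffast-math, it is free to pick either grouping, so that distinction is no longer guaranteed.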

As for your question, the section of code

```
call random_init(.true., .false.)
call random_number(x)
x = x - 0.5
write(*,'(4Z9)') x(1:4)
```

guarantees the same sequence of random numbers is used when the code is recompiled with different options. The `write` statement produces the line `3EA688C4 BE9E0A9A BEB061E0 3EF8833C` in my comment, which demonstrates this. More important are the lines

```
y1 = sum(x)
y2 = mysum(x)
y3 = real(sum(real(x,kind(1.d0))),kind(1.e0))
```

`y1` is computed with the gfortran intrinsic routine `sum`. The use of -ffast-math changes its result! `y2` is the summation-with-carry result, which is the exact result in the precision of `real` if -ffast-math is not used. If -ffast-math is used, then `y2` gives a wrong result, because the parentheses are ignored and the optimizer does its job. `y3` is also an exact result, as the 24-bit `real` values are converted to 53-bit `double precision` values and then summed with a 53-bit accumulator. You’ll note that -ffast-math does not affect the value computed for `y3`, because `53 > 2 * 24`. Finally, your Dsum is equivalent to `y3 = real(sum(real(x,10)),kind(1.e0))`.

I think OP is correct.

I think Intel’s -O3 -xHost equals gfortran’s -O3 -march=native.

I have tested some of my programs with the above flags, and the Intel and gfortran speeds are usually similar.

But gfortran also has a flag called -fast, which I believe includes -O3 -march=native. It may do some more aggressive optimization. But perhaps it has drawbacks, so in gfortran I do not use it.

https://docs.oracle.com/cd/E19059-01/stud.10/819-0492/3_options.html

It says:

Select options that optimize execution performance.

**Note -** This option is defined as a particular selection of other options that is subject to change from one release to another, and between compilers. Also, some of the options selected by `-fast` might not be available on all platforms. Compile with the `-v` (verbose) flag to see the expansion of `-fast` for any release.

`-fast` provides high performance for certain benchmark applications. However, the particular choice of options may or may not be appropriate for your application. Use `-fast` as a good starting point for compiling your application for best performance. But additional tuning may still be required. If your program behaves improperly when compiled with `-fast`, look closely at the individual options that make up `-fast` and invoke only those appropriate to your program that preserve correct behavior.

Note also that a program compiled with `-fast` may show good performance and accurate results with some data sets, but not with others. Avoid compiling with `-fast` those programs that depend on particular properties of floating-point arithmetic.

Because some of the options selected by `-fast` have linking implications, if you compile and link in separate steps be sure to link with `-fast` also.

`-fast` selects the following options:

- `-dalign`
- `-depend` **(SPARC)**
- `-fns`
- `-fsimple=2`
- `-ftrap=common`
- `-libmil`
- `-xtarget=native`
- `-O5`
- `-xlibmopt`
- `-pad=local` **(SPARC)**
- `-xvector=yes` **(SPARC)**
- `-xprefetch=yes`
- `-xprefetch_level=2`
- `-nofstore` **(x86)**

Details about the options selected by `-fast`:

- The `-xtarget=native` hardware target. If the program is intended to run on a different target than the compilation machine, follow the `-fast` with a code-generator option. For example: `f95 -fast -xtarget=ultra …`
- The `-O5` optimization level option.
- The `-depend` option analyzes loops for data dependencies and possible restructuring. **(SPARC)**
- The `-libmil` option for system-supplied inline expansion templates. For C functions that depend on exception handling, follow `-fast` by `-nolibmil` (as in `-fast -nolibmil`). With `-libmil`, exceptions cannot be detected with `errno` or `matherr`(3m).
- The `-fsimple=2` option for aggressive floating-point optimizations. `-fsimple=2` is unsuitable if strict IEEE 754 standards compliance is required. See the section on `-fsimple[={1|2|0}]`.
- The `-dalign` option to generate double loads and stores for double and quad data in common blocks. Using this option can generate nonstandard Fortran data alignment in common blocks.
- The `-xlibmopt` option selects optimized math library routines.
- `-pad=local` inserts padding between local variables, where appropriate, to improve cache usage. **(SPARC)**
- `-xvector=yes` transforms certain math library calls within DO loops to single calls to a vectorized library equivalent routine with vector arguments.
- `-fns` selects non-standard floating-point arithmetic exception handling and gradual underflow. See the section on `-fns[={yes|no}]`.
- Trapping on common floating-point exceptions, `-ftrap=common`, is enabled with Fortran 95.
- `-xprefetch=yes` enables the compiler to generate hardware prefetch instructions where appropriate.
- `-xprefetch_level=2` sets the default level for insertion of prefetch instructions.
- `-nofstore` cancels forcing expressions to have the precision of the result. **(x86)**

It is possible to add or subtract from this list by following the `-fast` option with other options, as in:

`f95 -fast -fsimple=1 -xnolibmopt …`

which overrides the `-fsimple=2` option and disables the `-xlibmopt` selected by `-fast`.

Because `-fast` invokes `-dalign`, `-fns`, `-fsimple=2`, programs compiled with `-fast` can result in nonstandard floating-point arithmetic, nonstandard alignment of data, and nonstandard ordering of expression evaluation. These selections might not be appropriate for most programs.

Note that the set of options selected by the `-fast` flag can change with each compiler release.