Strange behavior of `ifort`

@zaikunzhang ,

Hopefully your post at the Intel forum will catch the attention of Intel Support Team members and elicit a response that will be helpful. You can check with Intel team on the details with the compiler’s -fp compiler option and how it comes into play with your example: you will note the compiler uses as default -fp:fast.

C:\temp>ifort /standard-semantics p.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.7.0 Build 20220726_000000
Copyright (C) 1985-2022 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.33.31630.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:p.exe
-subsystem:console
p.obj

C:\temp>p.exe
 T
 0.9999999 2.2718171E-02 -0.3623963 0.9314599 -0.3845887
 2.2718171E-02 2.2718171E-02 -0.2600954 -0.2734710 0.3337145
 -0.3623963 -0.2600954 -6.2047265E-02 -0.3354508 -2.7986396E-02
 0.9314599 -0.2734710 -0.3354508 0.2075594 -0.2006450
 -0.3845887 0.3337145 -2.7986394E-02 -0.2006450 0.2823821
 F

C:\temp>ifort /standard-semantics /fp:precise p.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.7.0 Build 20220726_000000
Copyright (C) 1985-2022 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.33.31630.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:p.exe
-subsystem:console
p.obj

C:\temp>p.exe
 T
 1.000000 2.2718173E-02 -0.3623963 0.9314600 -0.3845887
 2.2718173E-02 2.2718173E-02 -0.2600954 -0.2734710 0.3337145
 -0.3623963 -0.2600954 -6.2047265E-02 -0.3354508 -2.7986396E-02
 0.9314600 -0.2734710 -0.3354508 0.2075594 -0.2006450
 -0.3845887 0.3337145 -2.7986396E-02 -0.2006450 0.2823821
 T

C:\temp>
1 Like

As you stated, could not reproduce with anything but specific optimization levels with ifort.
I generalized it to pick random numbers. Somewhat surprised when using BOZ values copied
from the original problem that they did not cause problems at 5x5. Surprised by the size of the differences; was wondering if this was very difficult to hit and you had stumbled on “magic numbers” but get inconsistencies very easily.

No matter what the values the differences occur at the same locations. So if you run it twice
the T/F values will be the same.

rm -f a.out
ifort odd.f90 -O1 -o a.out -mP2OPT_hlo -mP2OPT_hlo_level=2 && ./a.out
exit

given an array of constant values, divide it by a single value and
some of the resulting elements do not test as equal with ifort with
certain optimization levels

**CODE FOR RANDOM NUMBER TESTS**
program test
use, intrinsic :: iso_fortran_env, only : compiler_version, compiler_options
implicit none
character(len=*),parameter :: g='(5(g0,1x))'
real,allocatable :: S(:,:)
integer :: i,j,k, enough=1000
logical :: equal
print '(4a)', 'This file was compiled by ', compiler_version(),  &
              ' using the options ', compiler_options()
call random_seed()
k=0
do
   call random_number(div)
   div=div*1.0e22
   call random_number(valbefore)
   valbefore=valbefore*1.0e20
   !div=nearest(div,1.0)
   !valbefore=nearest(valbefore,1.0)
   do i=1,20
      do j=1,20
         if(allocated(S))deallocate(S)
         allocate(S(i,j))
         S = valbefore
         S = S / div
         equal=all(S.eq.S(1,1))
         equal=all(abs(S-S(1,1))<=epsilon(div))
         if(.not.equal)then
            write(*,*)'value before=',valbefore, 'div=',div, 'nominal value=',S(1,1), &
            & 'maxdelta=',maxval(abs(S-S(1,1))), 'minvalue=',minval(abs(S-S(1,1)))
            
            call printl(S.eq.S(1,1))
            k=k+1
            if(k.ge.enough) stop 'enough'
         endif
      enddo
   enddo
enddo
contains
subroutine printl(b)
implicit none
!@(#) print small 2d logical scalar, vector, matrix in row-column format
logical,intent(in)           :: b(:,:)
character(len=*),parameter   :: row='(i3," > [ ",*(l1:,","))'
character(len=*),parameter   :: all='(" ",*(g0,1x))'
integer                      :: i
   write(*,all) '>shape=',shape(b),',size=',size(b)
   do i=1,size(b,dim=1)
      write(*,fmt=row,advance='no')i,b(i,:)
      write(*,'(" ]")')
   enddo
   write(*,*)
end subroutine printl

end program test

So just confirming that it is a generalized problem and could not get ifx (or any other compiler) to produce a similar inconsistency; although a test that ignores the insignificant bytes does avoid the problems and testing float values is problematic as discussed earlier so if this were showing up with --fast I personally would not consider it an undue problem if it gave me a significant speedup; but an inconsistently appearing at the -O2 level, which is the recommended generic level to use really should be avoided.

1 Like

@zaikunzhang . Attn: @greenrongreen ,

As to the -fp:fast vs other -fp compiler options with IFORT, here’s a simpler variant to also consider:

   real, parameter :: foo = 1.2409463E+22, bar = -4.4971432E+21, baz = -3.4729614E+20 
   real :: x(6)
   x = [ foo, bar, baz, bar, baz, foo ] ; x = x / maxval(x)
   print *, abs( x - x([6,4,5,2,3,1]) ) <= 0
   print *, x
   print *, x([6,4,5,2,3,1])
   print "(*(b0,1x))", x(1), x(6)
   print "(*(b0,1x))", x(2), x(4)
   print "(*(b0,1x))", x(3), x(5)
end 
  1. With the default i.e., -fp:fast option:
C:\temp>ifort /standard-semantics p.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.7.0 Build 20220726_000000
Copyright (C) 1985-2022 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.33.31630.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:p.exe
-subsystem:console
p.obj

C:\temp>p.exe
 F T F T F F
 0.9999999 -0.3623963 -2.7986394E-02 -0.3623963 -2.7986396E-02
 1.000000
 1.000000 -0.3623963 -2.7986396E-02 -0.3623963 -2.7986394E-02
 0.9999999
111111011111111111111111111111 111111100000000000000000000000
10111110101110011000110000000001 10111110101110011000110000000001
10111100111001010100001110111001 10111100111001010100001110111010

C:\temp>
  1. With the -fp:precise option,
C:\temp>ifort /standard-semantics /fp:precise p.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.7.0 Build 20220726_000000
Copyright (C) 1985-2022 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.33.31630.0
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:p.exe
-subsystem:console
p.obj

C:\temp>p.exe
 T T T T T T
 1.000000 -0.3623963 -2.7986396E-02 -0.3623963 -2.7986396E-02
 1.000000
 1.000000 -0.3623963 -2.7986396E-02 -0.3623963 -2.7986396E-02
 1.000000
111111100000000000000000000000 111111100000000000000000000000
10111110101110011000110000000001 10111110101110011000110000000001
10111100111001010100001110111010 10111100111001010100001110111010

C:\temp>

@zaikunzhang , you can also review at Compiler Explorer some of the aggressive actions coming into play with -fp:fast with the arrays and vector subscripts or such situations de facto with the TRANSPOSE intrinsic.

1 Like

@zaikunzhang , next time onwards with any Intel Fortran related matter, consider first posting at the Fortran forum. Also, note the Fortran compilers are part of Intel’s oneAPI HPC toolkit, not the Base toolkit:

1 Like

This example appears to show that repeating the calculation of x/y, for identical bit pattern of x and y stored in two locations produces a different round-off outcome, in essence a random outcome.
( With ifort, the use of write (,) S should identify the final bit )

A possible explaination could be if AVX registers are used and they have different round-off for different positions in the AVX register ?

I thought that conforming to IEEE754 would exclude this outcome ?

1 Like

yeah. Ifort isn’t IEEE by default.

1 Like

This is an interesting little example for many reasons. Where to begin. Let’s start with some simple things

  1. Intel compilers when invoked without an explicit -O option defaults to O2. And Intel’s O2 is aggressive.
  2. Optimizations on most compilers come at the expense of precision. You have to balance your needs for highest FP precision and performance.
  3. Division, amongst other mathematical expressions and intrinsics, may be done differently depending on optimization level.
  4. Every compiler does optimization differently. Does not mean one is ‘right’ and the other ‘wrong’. Just different choices in balancing precision and speed.

Skip to the solution: For IFORT use

-prec-div

option to force precise divisions even under optimzation. Read up on this option HERE. -fp-model is a macro option that sets a number of different FP behaviors. PRECISE triggers -prec-div amongst other behaviors.

You can read on if you want more details.

I’ll start with #4. As noted, IFX and IFORT behave differently with this example. Our LLVM-based IFX optimization is radically different than that employed with our legacy proprietary IFORT compiler. You SHOULD expect possible FP differences between these 2 compilers under optimizations. Note in the documentation on -prec-div that this is only an option for IFORT. Why? LLVM FP behavior is different than on our Classic proprietary compiler.

On point #1 above: When you encounter questions about ‘what is the compiler doing?’ there are easier ways than dumping assembly and reading that. First, use

-dryrun

option. This will show you all internal defines and options passed to the compiler. It’s quite helpful.

Next, with IFORT use the

-qopt-report=5

option and examine the .optrpt output.
On Point # 3: Now doing this, and comparing the differences between -prec-div and not at O2 you can see the “Code gen” or code generator is different - difference in number of temporary values (“Locals”). This was a clear indicator that strategy used by the code generator for the division is quite different. Again, read the -O2 and -prec-div options in the Developer Guide

On point #1, O2 default. Intel prioritizes performance by default “out of box”, favoring that over numerical precision when you do not bother to add additional compiler options (if you don’t give the compiler any -O options). This philosophy can and has been argued in the community. Some ASSUME that without -O that you would get O0. IFORT has been around for a lot of years and is widely adopted. Where possible, we try to make IFX “out of the box” as close to behavior of IFORT as possible. So no surprises for existing customers, O2 is default. But if you are coming from FLang or gfortran then this default of O2 may be a surprise to you.

Thanks for showing us this example. It was quite interesting.

ron
#IAmIntel

5 Likes

So in the end, what exactly is it in the ifort compiler that caused the difference in those two values?

1 Like

the div is changed to mult by reciprocal in the optimized case.

Ok, but why are the two values different from each other? x*R computed twice would still be expected to result in the same value both times.

1 Like

@greenrongreen ,

I was expecting you to state there is a bug with IFORT when -prec-div is not in effect.

Please consider the trivial example in the other comment upthread. Why would x(3) and x(5) end up with different values when they are subjected to the same arithmetic operation? Whereas x(2) and x(4) match up ok?

1 Like

If you want sane IEEE behavior, don’t use ifort without telling it to give you IEEE behavior.

1 Like

It does make me think we need some review and reconsideration of the default fpm switches used on ifort for --profile debug and --profile release. There are a LOT of switches.

Sounds like the speculation on simultaneous multiply and divide was correct(?).

I am thinking of some code that for better or for worse does not condition the data nor use tolerances when testing equality of floating point values and thinking (though in development) that fpm currently allows users to build packages with essentially any flags; and curious if conditioning the data versus tolerance tests is robust enough to not have to rework some old
code to do better tests. So many people re-invent the wheel along these lines; seems like there
should be standard methods in Fortran like if(a.almost_equals.b). Maybe the IEEE standards count to some extent. Interestingly, note the effect NEAREST() has on the minimal test from above:

   real, parameter :: foo = 1.2409463E+22, bar = -4.4971432E+21, baz = -3.4729614E+20 
   real :: x(6)
   x = [ foo, bar, baz, bar, baz, foo ] 

   x =NEAREST(x,1.0)

   x = x / maxval(x)
   print *, abs( x - x([6,4,5,2,3,1]) ) <= 0
   print *, x
   print *, x([6,4,5,2,3,1])
   print "(*(b0,1x))", x(1), x(6)
   print "(*(b0,1x))", x(2), x(4)
   print "(*(b0,1x))", x(3), x(5)
end 

I think IEEE behavior is a red herring in the original case. One would expect the same results for the two values with or without IEEE. With intel hardware (and probably other hardware too), the IEEE treatment of denormals slows down the computations, so many programmers would prefer to ignore those corner cases, knowing they are irrelevant to their application, but they would still expect x*y to give the same result when computed twice (particularly when computed twice within the same expression).

1 Like

Does it make any difference if this is written as

r = 1.0 / maxval(x)
x = r * x

I’m thinking the error, whatever it is, is still going to be there regardless of that scalar optimization.

1 Like

Thank you @RonShepard for making the question clear, much clearer than I did in my original post. I am particularly amused to see the word “red herring”. I did not know it before and learn it now. Thank you :slight_smile:

A question for everybody. Philosophically, if you were about to design a new compiler, would it be acceptable as the default behavior of the compiler that

f(a, b) == f(a, b)

sometimes produces .TRUE. but sometimes produces .FALSE. (without getting NaN involved) with an intrinsic procedure f that is deterministic by nature? If yes, why, and how would you predict/trust the behavior of a piece of reasonably complicated code that uses f?

Anyone who has a basic understanding of compiler and coding may ask themselves this question. Maybe we should start a poll about it …

Talking about optimization (in the programming sense), I understand that there is no free lunch, and sometimes accuracy is sacrificed for some benefits. The question is, what is the benefit in this particular case?

Thanks.

Thanks for the detailed explanation.

It is very surprising that the calculation of S(3,5) and S(5,3) are different, when using array syntax,
S = S / maxval(abs(S)).
Hard to understand how a different optimisation approach is applied to the two elements of the array S ?

1 Like

Totally agree @kargl . Maybe we have different understandings about what is “deterministic by nature”. For me, your example of f is not deterministic by nature. Sure, the random number generator may be implemented in a deterministic way, but its nature / intention is to simulate randomness, at least in the majority of cases.

I would say that if the function is pure and does not depend on any external environment or saved internal state, and if the two evaluations are in the same statement (as above), then they should evaluate to the same value, and thus the expression should evaluate to .true.. However, there are many, many situations in which .false. would be allowed, or even expected, so it would require some careful wording to differentiate all these various situations. What if f() depends on the system clock? What if f() depends on some random number generator (a deterministic one, or a thermal temperature one). What if f() has an internal state, such as a counter that is incremented each call?

A related question is under what circumstances is it allowed for a language to make a single call, and to use that result multiple times. When both references are in the same statement? How about when they are in different nearby statements, where the compiler can determine that no changes to rounding modes have occurred? How about in far away statements, or on different nodes of a parallel thread, etc.? How about if one reference occurs within some cpu code, and the other within some SSE/AVX or GPU code? What if one reference is evaluated by the compiler at compile time and the other by the cpu at run time? The answers to these questions are not simple.

1 Like

That is precisely what I have in mind. Thank you again for making it clearer.

Or, to keep things simpler, we may focus on the intrinsic and deterministic mathematical procedures specified by the latest Fortran standard.