Strange behavior of `ifort`

In hindsight, that was not the best term. Since you have looked up the definition, you can see that it is often associated with an intentional distraction, and I did not mean to imply that. I only meant that while IEEE would indeed address the issue of reproducibility, there are many other circumstances short of full IEEE compliance where a programmer would still expect x*r to evaluate to the same value.

This reminds me of something I remember from the 1980s with CRAY arithmetic. I think the story was that there was a multiplication instruction involving a scalar register and a vector register, and the result of that multiplication was different if the two operands were swapped. The brief description of this was that multiplication was not commutative, which is one possibility for what is happening in this thread.

Something else that is related to this discussion is that some computers do not have a floating point divide instruction. They instead do a table lookup of the leading mantissa bits and then two or three multiplications within a Newton iteration to compute the reciprocal. Some computers (I think the Cray-1 was one) did just two Newton steps, leaving some low-order bits incorrect for some fp values. This is why results computed with a reciprocal approximation might differ from those computed with a divide instruction. However, as discussed above, that would not explain why two evaluations of the same x*r value would differ; it would only explain why x/y is different from x*r.
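For the curious, here is a minimal sketch of that Newton scheme in Fortran. It is illustrative only: real hardware seeds the iteration from a table of leading mantissa bits, whereas this version seeds from the exponent (good to only about 1 bit) and so needs more steps; the name recip and the nsteps argument are mine, not anything from the thread.

program newton_recip
  implicit none
  print *, 'two steps: ', recip(3.0, 2), '  true 1/3:', 1.0 / 3.0
  print *, 'five steps:', recip(3.0, 5)
contains
  ! One Newton step r <- r*(2 - y*r) roughly squares the relative error,
  ! so a ~12-bit hardware table seed converges in 1-2 steps, while this
  ! crude 1-bit seed needs about 5 for single precision. Stopping early
  ! leaves low-order bits wrong for some values, as described above.
  real function recip(y, nsteps)
    real, intent(in) :: y
    integer, intent(in) :: nsteps
    integer :: n
    recip = 2.0 ** (-exponent(y))   ! crude seed; assumes y > 0
    do n = 1, nsteps
      recip = recip * (2.0 - y * recip)
    end do
  end function recip
end program newton_recip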

1 Like

I would say that, given your constraint that the function is deterministic (i.e., it depends only on its inputs), two identical expressions should evaluate to identical values. However, things get tricky if what you really mean is that two expressions using identical values should always produce the same value. For example:

program main
  implicit none
  real :: a, b
  a = 2.0
  b = 3.0
  call foo(a, b)
contains
  subroutine foo(a, b)
    real, intent(in) :: a, b
    ! Schematic: f stands for some pure function. As written this is not
    ! valid Fortran (a parameter requires a constant expression); read it
    ! as "rhs is evaluated by the compiler at compile time".
    real, parameter :: rhs = f(2.0, 3.0)
    print *, f(a, b) == rhs
  end subroutine foo
end program main

Do you expect that to always print T no matter what? If the answer is yes, then what you are saying is that the compiler must always emit the exact same set of instructions for evaluating f at run time as it used to evaluate f at compile time. That would eliminate whole classes of potential optimizations.

1 Like

Yes, I agree that the f(2.0, 3.0) computed during compilation (if possible) does not need to be identical to f(a, b) with a=2.0 and b=3.0 computed at runtime.

However, if both a and b are unknown at compile time, and both sides of f(a, b) == f(a, b) are computed at runtime, I still think that this expression should always return .TRUE. (assuming no randomness, no NaN, etc.).

1 Like

Agreed.

1 Like

I gave several counterexamples above where this would not occur. What if the programmer intentionally resets the rounding mode before the function evaluations as a way to determine how sensitive the function is for those arguments? In this case the programmer wants and expects a different result for the two evaluations. Should the language allow him to do that?
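For concreteness, the standard ieee_arithmetic module makes this possible. A minimal sketch (the expression is a stand-in for the function under study, and a strict floating-point model such as ifort's -fp-model strict may be needed to keep the compiler from folding it at compile time):

program rounding_probe
  use, intrinsic :: ieee_arithmetic
  implicit none
  real :: x, y_nearest, y_upward
  x = 0.1
  call ieee_set_rounding_mode(ieee_nearest)
  y_nearest = x / 3.0 + x / 7.0   ! stand-in for the function under study
  call ieee_set_rounding_mode(ieee_up)
  y_upward  = x / 3.0 + x / 7.0   ! same expression, different rounding
  print *, 'rounding sensitivity:', abs(y_upward - y_nearest)
end program rounding_probe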

What if the program is running in a heterogeneous parallel environment where different nodes have different CPUs, say a mixture of Intel, arm64, and PowerPC hardware, some big endian, some little endian, all with different GPU processors. The programmer wants to evaluate the function on all the nodes in order to measure the differences among all of the hardware combinations. Should constraints in the language prevent the programmer from doing that?

Or say a program sometimes generates denormals, and the IEEE treatment of those denormals slows down the program by a factor of 10. The programmer has the option to disable that part of IEEE and get a 10x speedup. So he runs the function with denormals enabled and again with denormals disabled to see what the difference is. Should constraints in the language prevent the programmer from doing that?
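Again the standard modules support this; a sketch (whether underflow control is available, and whether flushing actually speeds anything up, is processor dependent):

program underflow_probe
  use, intrinsic :: ieee_arithmetic
  implicit none
  logical :: gradual
  real :: r
  if (ieee_support_underflow_control()) then
    call ieee_get_underflow_mode(gradual)   ! remember the current mode
    call ieee_set_underflow_mode(.false.)   ! flush denormals to zero
    r = tiny(0.0) / 16.0                    ! denormal under gradual underflow
    print *, 'flush-to-zero result:', r
    call ieee_set_underflow_mode(gradual)   ! restore the saved mode
  end if
end program underflow_probe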

I think you can see that in some cases, the more a language constrains the programmer, the less useful it becomes.

3 Likes

I have some insight into this issue. The mathematics is actually reproducible. The insight came from changing the data in the 5x5 matrix and simplifying the scaling to a simple division, then using -qopt-report=5 and -S and examining the data in both hex and default float formats.
Here is my modified test:

program test

implicit none
integer :: row, col
! Why these strange numbers? Because they come from a real project.
real, parameter :: A(25) = &
    & [-3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &
    &  -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &
    &  -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &
    &  -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &
    &  -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20]

real :: S(5, 5)

S = reshape(A, [5, 5])

! The numbers in S are horrible. Let's scale it.
!S = S / maxval(abs(S))
do col = 1, 5
  do row = 1, 5
    S(row, col) = S(row, col) / 1.2409463E+22
  end do
end do

write (*, *) "S after"
do row = 1, 5
  write (*, '(5(Z8,3x))') S(row, :)
end do
print *, ""

write (*, *) S

! Is S symmetric?
write (*, *) 'Is S symmetric?', all(abs(S - transpose(S)) <= 0)

end program test

Now just run a simple compilation:

$ ifort test3.f90
$ ./a.out
 S after
BCE543B9   BCE543B9   BCE543B9   BCE543B9   BCE543B9
BCE543B9   BCE543B9   BCE543B9   BCE543B9   BCE543B9
BCE543B9   BCE543B9   BCE543B9   BCE543B9   BCE543B9
BCE543B9   BCE543B9   BCE543B9   BCE543B9   BCE543B9
BCE543BA   BCE543BA   BCE543BA   BCE543BA   BCE543BA
 
 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
 Is S symmetric? F

Do you see the pattern? It is systematic, not random.
Now, with inside knowledge, I know that the default (no -O option) gives us -O2, which vectorizes with SSE2 and its 128-bit registers. You can also verify this by using -dryrun:

$ ifort test3.f90 -dryrun |& grep -i sse
    -D__SSE__ \
    -D__SSE_MATH__ \
    -D__SSE2__ \
    -D__SSE2_MATH__ \

Here we are telling the backend it MAY use SSE2. Not demanding, but suggesting to the backend. You'll see the same defines in the -prec-div case - but remember, it's a suggestion, not a directive. It may use SSE2; that is all the define is doing. The backend will see -prec-div and decide how to process the loop nest.

How many 32-bit data items fit in an XMM register? 4. Go down each column, in memory order - see the pattern? At optimization, we're packing 4 elements into an XMM register and doing vectorized math. The 5th element of each column is a remainder that is not done in a packed vector register with vector instructions. Repeat this for each column. You can see this in the opt report.
Here's our loop of interest:

     19 !S = S / maxval(abs(S))
     20 do col=1,5
     21   do row=1,5
     22    S(row,col) = S(row,col) / 1.2409463E+22
     23   end do
     24 end do

and the opt-report on that loop

LOOP BEGIN at test3.f90(20,1)
   remark #15542: loop was not vectorized: inner loop was already vectorized

   LOOP BEGIN at test3.f90(21,3)
      remark #15389: vectorization support: reference s(row,col) has unaligned access   [ test3.f90(22,4) ]
      remark #15389: vectorization support: reference s(row,col) has unaligned access   [ test3.f90(22,4) ]
      remark #15381: vectorization support: unaligned access used inside loop body
      remark #15305: vectorization support: vector length 4
      remark #15309: vectorization support: normalized vectorization overhead 0.158
      remark #15300: LOOP WAS VECTORIZED
      remark #15450: unmasked unaligned unit stride loads: 1
      remark #15451: unmasked unaligned unit stride stores: 1
      remark #15475: --- begin vector cost summary ---
      remark #15476: scalar cost: 30
      remark #15477: vector cost: 9.500
      remark #15478: estimated potential speedup: 2.020
      remark #15486: divides: 1
      remark #15488: --- end vector cost summary ---
      remark #25015: Estimate of max trip count of loop=1
   LOOP END

   LOOP BEGIN at test3.f90(21,3)
   <Remainder loop for vectorization>
      remark #25436: completely unrolled by 1
   LOOP END
LOOP END

Reading this, lots of good stuff. It's vectorizing the inner loop, the 5 elements in a column. But of course the XMM registers process 128 bits, or 4 elements. So there is a leftover, which is done differently than the other 4. Note also that we didn't use -align array32byte and tag the loop as aligned, which might help efficiency here. But that is just an additional optimization we could do manually, not necessary for this discussion.
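To make the split concrete, here is the column-major picture for one column of S (illustrative, following the vector length of 4 reported above):

! One column of S(5,5), contiguous in memory (Fortran is column-major):
!   S(1,col), S(2,col), S(3,col), S(4,col) -> one 128-bit XMM group
!                                             (rcpps without -prec-div)
!   S(5,col)                               -> scalar remainder (divss)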

In the original example, element (3,5) is computed in vector mode. Element (5,3) is a remainder, not in vector mode. Examining the assembly for both cases, with and without -prec-div, you can see the vector mode; specifically, without -prec-div:

rcpps %xmm2, %xmm0 #22.4

which is a reciprocal vector instruction of lesser precision. This is done for the first 4 elements. The remainder is done with a div:

divss %xmm1, %xmm3 #22.4

whereas the code with -prec-div uses div everywhere:

    divps     %xmm1, %xmm2                                  #22.4

So - predictable, reproducible results.

Turning on -prec-div, or alternatively using -no-vec, has the same effect.

5 Likes

Well those last two really seem to reinforce my general rule, "never expect floating point math operations to end up equal".

3 Likes

And that's why testing real numbers for equality is usually discouraged in favor of epsilon(0.0)-based comparisons… but I do acknowledge that this is a very good example showing FP math's far-from-trivial implications.
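For instance, a common tolerance-based test looks something like this (a sketch; the scale factor and the choice of relative vs absolute tolerance are application dependent):

! Relative comparison scaled by the larger magnitude; epsilon(0.0) is
! the spacing of default reals near 1.0.
logical function approx_equal(x, y)
  implicit none
  real, intent(in) :: x, y
  approx_equal = abs(x - y) <= epsilon(0.0) * max(abs(x), abs(y))
end function approx_equal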

2 Likes

This is a nice post that explains the original results. One question is why aren't all of the divisions eliminated and replaced with a multiplication using the single scalar reciprocal computation? I would have thought that would have been the "first" optimization to occur at any optimization level, including in particular the "aggressive" default, -O2.

Maybe one lesson from this is, if you want something done right, you have to do it yourself?

r = 1.0 / maxval(abs(x))   ! compute the reciprocal once
x = r * x                  ! scale every element by the same value
3 Likes

Thank you very much @greenrongreen for this wonderful explanation. Everything makes sense now. So the behavior is a side effect of optimization, not a bug.

It also explains why things change when the loop order changes in the experiments of @urbanjost. However, what about the effect of 'nearest'? Is it because 'nearest' slightly perturbs the numbers, and then the difference in the divisions disappears by chance due to the perturbation?

Many thanks again!

In addition to providing a solid explanation I think many found useful, this is a really nice example of using some powerful ifort switches to gain insight into the code behavior. Much appreciated.

It brings back memories of the first time I encountered the perils of floating-point operations, although I learned more about human nature than computation that day (long story). It's not always the best idea to prove your instructor wrong.

I am not a mathematician or numerical analyst, but here is my understanding. The accuracy of the reciprocal calculation can be affected by the value. For example, in decimal the reciprocal of 2 is 0.5 - easy to represent in one digit. But the reciprocal of 3 is 0.3333333…; for a fixed number of digits, what do you do? You will get some inaccuracy whether you truncate or round the last digit. As we saw, the vector-mode rcpps instruction caused the last bit in the mantissa to differ from a more accurate div operation. This is why the value of S can affect the accuracy, and why only that one value, -3.4729614E+20, in the original problem was causing the symmetry test to fail. ANY of the values in column 5 except (5,5) had the potential, after scaling, to break symmetry, since their symmetric partners sit in row 5, i.e. S(5,:), which was computed non-vectorized.
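As a quick way to see the last-bit effect for yourself, you can compare a true divide against an explicit reciprocal-multiply in hex (a sketch using the values from this thread; whether the two bit patterns actually differ depends on the compiler and flags):

program recip_demo
  implicit none
  real :: x, y, q_div, q_rcp
  x = -3.4729614E+20
  y = 1.2409463E+22
  q_div = x / y          ! one correctly rounded operation
  q_rcp = x * (1.0 / y)  ! two roundings: reciprocal, then multiply
  ! transfer() reinterprets the bits as an integer for Z (hex) editing
  write (*, '(A, Z8.8, 2X, Z8.8)') ' bits: ', transfer(q_div, 0), transfer(q_rcp, 0)
  print *, 'equal?', q_div == q_rcp
end program recip_demo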

Why not be consistent in using div versus a reciprocal and multiplication? Historical, practical reasons. Older processors were quite slow at division, and replacing a div with a reciprocal and a multiplication was faster. It takes more transistors to design circuitry to do a division, whereas the reciprocal has several algorithms that can be implemented in hardware, and mult is also not a lot of circuitry, relatively speaking. That said, research went into this problem, and together with shrinking feature dimensions in integrated circuits it has resulted in modern processors having division units that are significantly faster.

So why not have the compiler default to division always? Legacy. There are a LOT of really, really old processors still in use that are important to their users. I'm still getting questions from users with original Pentium processors, and I had a question 2 weeks ago involving an Itanium server. Occasionally we get a user with a MicroVAX. Compilers default to the least common denominator in their support. For example, without a -x, -ax, -mtune, or -march option, ifort defaults to SSE2! So if you only use -O options, you're tuning for something in the SSE2 era - egads! That is a Pentium 4 from 22 years ago! This is why I recommend that, if you primarily run on one computer or on one cluster, you use -xhost to tune for that host. At SSE2 you're only using 128 bits of your vector registers, which are probably 512 bits wide at this point for most of us. And aside from the size of the registers, newer hardware can vectorize more loops, such as loops with conditionals, whereas older technology could not.

Compilers do move with the times as well. We saw that IFX did not have this issue. I'd have to look, but I'd be curious to see the opt report and assembly for IFX. I suspect it may differ. The backend is LLVM-based and may be modern enough to realize that it doesn't need to bow to the limitations of a Pentium 4.
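For anyone who wants to check, -S works the same way with ifx, so grepping the generated assembly for the instructions discussed above should answer it (the grep pattern is just a convenience):

$ ifx -O2 test3.f90 -S
$ grep -E 'rcpps|divps|divss' test3.s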

3 Likes

You're overlooking the main reason for reciprocal and multiplication: a lot of the time you are dividing many numbers by the same value, in which case you only need one reciprocal instead of n divisions.
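In loop form, the transformation being described is roughly this (a sketch; the array, its size, and the divisor are illustrative):

program hoist_demo
  implicit none
  real :: x(1000), d, r
  integer :: i
  call random_number(x)
  d = 1.2409463E+22
  ! n divisions by the same value d ...
  do i = 1, size(x)
    x(i) = x(i) / d
  end do
  ! ... versus one reciprocal and n multiplications
  r = 1.0 / d
  do i = 1, size(x)
    x(i) = x(i) * r
  end do
  print *, sum(x)   ! keep the result live so the optimizer cannot drop it
end program hoist_demo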

Thank you again @greenrongreen for this detailed and informative explanation.

I forgot to mention another interesting point: the same behavior does not occur if the real kind is real64 or real128. Do you think this is merely by chance, or is there some explanation for it?

In addition, as you mentioned, ifx does not have this behavior. Would you mind elaborating more on this fact?

Sorry for asking these additional questions. This discussion has been much more fruitful than expected, and I think many others would agree that your observations have been particularly insightful. So I do not want to let any question go. Thank you!

[Update: sorry, I think I misunderstood your comment. You were talking about the compiler part rather than the human part. Then I agree with you. ]

I agree and do not agree.

I do agree that it is more efficient to take a reciprocal and perform a multiplication than to do a division, especially if the division is done repeatedly.

However, I do believe that modern compilers should be smart enough to choose the best way to evaluate divisions, particularly in the case discussed here. Programming represents a collaboration between humans and compilers. Given ideal compilers, humans should focus on the higher-level logic of the code rather than on how to evaluate a division, at least in the early stages of a project. Although I am well aware that we are not living in an ideal world, I stay optimistic.

1 Like

If you find the discussion here interesting, you might have a look at my new question on another surprising behavior of ifort. In that example, we see a similar pattern of computed results, but there is something even more surprising.

@zaikunzhang
The easy one - real128 arithmetic is not supported in IA hardware. All FP128 calculations are done in software emulation. Hence no vectorization, and the software can calculate out to the last mantissa bit.

Real64 could have enough bits to accurately represent the reciprocal. Or, more than likely, the compiler may decide that it's not profitable to do the expression in vector mode. I'd have to look at the -qopt-report=5 output, get the assembly with -S, and look at the resulting .s file for 'div' instructions vs rcpps instructions.

Changing data representation affects many many things in the vectorizer.

2 Likes

If you find the discussion here interesting, check this out: Ifort (IFORT) 2021.8.0 20221119: `-1.7633055E+37 / 1.0E+38 = 0`