I have some insight into this issue. The mathematics is actually reproducible. The insight came from changing the data in the 5x5, and simplifying the scaling to a simple division. Then using -qopt-report=5 and -S and examining the data in both hex and default float format
Here is my modified test
program test
implicit none
integer :: row , col
! Why these strange numbers? Because they come from a real project.
real, parameter :: A(25) = &
& [-3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &
& -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &
& -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20,&
& -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, &
& -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20, -3.4729614E+20]
real :: S(5, 5)
S = reshape(A, [5, 5])
! The numbers in S are horrible. Let's scale it.
!S = S / maxval(abs(S))
do col=1,5
do row=1,5
S(row,col) = S(row,col) / 1.2409463E+22
end do
end do
write (*,*) "S after"
do row=1,5
write(*,'(5(Z8,3x))') S(row,:)
end do
print *, ""
write (*, *) S
! Is S symmetric?
write (*, *)'Is S symmetric?', all(abs(S - transpose(S)) <= 0)
end program test
Now just run a simple compilation
]$ ifort test3.f90
$ ./a.out
S after
BCE543B9 BCE543B9 BCE543B9 BCE543B9 BCE543B9
BCE543B9 BCE543B9 BCE543B9 BCE543B9 BCE543B9
BCE543B9 BCE543B9 BCE543B9 BCE543B9 BCE543B9
BCE543B9 BCE543B9 BCE543B9 BCE543B9 BCE543B9
BCE543BA BCE543BA BCE543BA BCE543BA BCE543BA
-2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
-2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
-2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
-2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
-2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986394E-02 -2.7986396E-02
Is S symmetric? F
Do you see the pattern? It is stochastic. Not random.
Now, with inside knowledge I know that default, no -O option, gives us -O2 which vectorizes with SSE2, 128 bit registers. You can also verify this by using -dryrun:
$ ifort test3.f90 -dryrun |& grep -i sse
-D__SSE__ \
-D__SSE_MATH__ \
-D__SSE2__ \
-D__SSE2_MATH__ \
Here we are telling the backend it MAY use SSE2. Not demanding, but suggesting to the backend. Youāll see the same define in the -prec-div case - but remember, itās a suggestion not a directive. It may use SSE2, that is all the define is doing. the backend will see -prec-div and decide how to process the loop nest.
How many 32bit data items in the XMM registers? 4. Go down the row, in memory order - see the pattern? at optimization, weāre packing 4 elements into an XMM register and doing vectorized math. For the 5th element in each row we have a remainder that is not done in a packed vector register with vector instructions. Repeat this for each column. You can see this in the OPT REPORT.
Hereās our loop of interest:
19 !S = S / maxval(abs(S))
20 do col=1,5
21 do row=1,5
22 S(row,col) = S(row,col) / 1.2409463E+22
23 end do
24 end do
and the opt-report on that loop
LOOP BEGIN at test3.f90(20,1)
remark #15542: loop was not vectorized: inner loop was already vectorized
LOOP BEGIN at test3.f90(21,3)
remark #15389: vectorization support: reference s(row,col) has unaligned access [ test3.f90(22,4) ]
remark #15389: vectorization support: reference s(row,col) has unaligned access [ test3.f90(22,4) ]
remark #15381: vectorization support: unaligned access used inside loop body
remark #15305: vectorization support: vector length 4
remark #15309: vectorization support: normalized vectorization overhead 0.158
remark #15300: LOOP WAS VECTORIZED
remark #15450: unmasked unaligned unit stride loads: 1
remark #15451: unmasked unaligned unit stride stores: 1
remark #15475: --- begin vector cost summary ---
remark #15476: scalar cost: 30
remark #15477: vector cost: 9.500
remark #15478: estimated potential speedup: 2.020
remark #15486: divides: 1
remark #15488: --- end vector cost summary ---
remark #25015: Estimate of max trip count of loop=1
LOOP END
LOOP BEGIN at test3.f90(21,3)
<Remainder loop for vectorization>
remark #25436: completely unrolled by 1
LOOP END
LOOP END
Reading this, lots of good stuff. Itās vectorizing the inner loop, the 5 elements in a column. But of course the XMM registers process 128 bits or 4 elements. So there is a leftover, which is done in differently that the other 4. Note also that we didnāt use -align array32byte and tag the loop as aligned, which might help efficiency here. but that is just an additional optimization we could do manually, not necessary for this discussion.
In the original example, element (3,5) is in a vector mode. Element (5,3) is a remainder, not in a vector mode. Examining the assembly with both cases with and without -prec-div you can see the vector mode, specifically this:
< rcpps %xmm2, %xmm0 #22.4
which is a reciprocal vector instruction of lesser precision. This is done for the first 4 elements. The remainder is done with a div.
< divss %xmm1, %xmm3 #22.4
whereas the code with -prec-div uses div everywhere
divps %xmm1, %xmm2 #22.4
So - predictable, reproducible results.
Turning on prec-div OR you can also use -no-vec, same effect.