Is accessing array elements in storage order necessary?

Is accessing array elements in storage order still necessary nowadays, or can compilers optimize away the difference?

do j = 1, n
    do i = 1, n
        ! operate on a(i, j) ...
    end do
end do

vs.

do i = 1, n
    do j = 1, n
        ! operate on a(i, j) ...
    end do
end do
1 Like

For a little program I wrote to add two matrices, I am not seeing much difference between traversing the matrices by row or by column, using gfortran -O3 or ifort -O3:

program test_loop_order
implicit none
integer, parameter :: n = 5*10**3, wp = kind(1.0d0), niter = 100
real(kind=wp) :: a(n,n), b(n,n), c(n,n), t1, t2, loop_time, &
                 tinit, tfinal
character (len=1) :: order
integer :: i, j, iter
call cpu_time(tinit)
call get_command_argument(1,order)
loop_time = 0.0_wp
do iter=1,niter
   call random_number(a)
   call random_number(b)
   call cpu_time(t1)
   if (order == "c") then ! traverse arrays one column at a time
      do j = 1, n
         do i = 1, n
            c(i,j) = a(i,j) + b(i,j)
         end do
      end do
   else ! traverse arrays one row at a time
      do i = 1, n
         do j = 1, n
            c(i,j) = a(i,j) + b(i,j)
         end do
      end do
   end if
   call cpu_time(t2)
   loop_time = loop_time + t2 - t1
   if (maxval(c) > 2.0) stop "bad variates" ! force c(:,:) to be computed
end do
call cpu_time(tfinal)
print "(a,3(1x,g0),2(1x,f0.2))", "order, n, niter, loop_time, total_time =", &
      merge("col","row",order=="c"), n, niter, loop_time, tfinal-tinit
end program test_loop_order
c:\fortran\test>ifort -O3 loop_order.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.4.0 Build 20210910_000000
Copyright (C) 1985-2021 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.16.27027.1
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:loop_order.exe 
-subsystem:console 
loop_order.obj 

c:\fortran\test>loop_order c
order, n, niter, loop_time, total_time = col 5000 100 2.78 32.48

c:\fortran\test>loop_order r
order, n, niter, loop_time, total_time = row 5000 100 2.56 32.28
c:\fortran\test>gfortran -O3 -o a.exe loop_order.f90

c:\fortran\test>a c
order, n, niter, loop_time, total_time = col 5000 100 3.44 26.59

c:\fortran\test>a r
order, n, niter, loop_time, total_time = row 5000 100 3.50 26.84

With optimization turned off, however, column-major access is much faster, as we were taught:

c:\fortran\test>ifort -Od loop_order.f90
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.4.0 Build 20210910_000000
Copyright (C) 1985-2021 Intel Corporation.  All rights reserved.

Microsoft (R) Incremental Linker Version 14.16.27027.1
Copyright (C) Microsoft Corporation.  All rights reserved.

-out:loop_order.exe 
-subsystem:console 
loop_order.obj 

c:\fortran\test>loop_order c
order, n, niter, loop_time, total_time = col 5000 100 8.73 76.62

c:\fortran\test>loop_order r
order, n, niter, loop_time, total_time = row 5000 100 59.86 128.31
c:\fortran\test>gfortran -O0 -o a.exe loop_order.f90

c:\fortran\test>a.exe c
order, n, niter, loop_time, total_time = col 5000 100 7.17 29.91

c:\fortran\test>a.exe r
order, n, niter, loop_time, total_time = row 5000 100 59.34 81.94
3 Likes

I guess this should be clarified in the best practices on fortran-lang.org. I see a lot of people think this is a must-do.

This is also used as an answer in many SO posts. See, e.g., this one by Ian Bush: openmp - How to optimize nested for-loop in Fortran when the order of the loop matters? - Stack Overflow

Optimizing compilers reorder nested loops if they can determine from the source code that the result is unaffected by reordering. Many compilers have an option to issue informative messages reporting that such reordering was performed.

1 Like

Does that mean that any call to a non-pure procedure that is not defined in the current scope will prevent reordering?

1 Like

Loop reordering and similar “code-rewriting” optimizations were the subject of a nice talk by Thomas Koenig at FortranCon 2020: Front-end optimization in gfortran.

1 Like

There is an old NASA NAS report, circa 1997, that looked at the effect of array order and various optimization techniques on early cache-based commodity processors. It is somewhat dated, but the code fragments are in Fortran and are probably useful in the context of what effect array access patterns have on performance (at least for a fragment of a CFD code). The report is:

Rob F. Van der Wijngaart, “Efficacy of Code Optimizations on Cache-based Processors”, NAS-97-012. Link to PDF is

If someone has some free time (I don't), it might be interesting to see whether the results obtained in 1997 still hold up on modern Intel and AMD processors.

2 Likes

I cannot access the paper (can others?). Could you upload it somewhere?

Here is a link: nas-97-012.pdf - Google Drive

2 Likes

@Beliavsky - Not sure why you can't access it, because I see no US-distribution-only restrictions on the web site. Also, I forgot to mention that the code fragments in the report are part of the original NAS Parallel Benchmarks. Recent versions can be downloaded from

I think that the conclusions reached in the paper are still valid. It's almost impossible to define a single "best" approach because anything you do will be problem-, hardware-, and compiler-dependent.

1 Like

My experience is that for simple loops like those above in this thread, the order does not matter, but for more complicated algorithms it does. Typically you define and fill the multidimensional array elsewhere and then access it from some subroutine. You want to make sure the access is contiguous as much as possible. Often you have to figure out which order is better, say whether to access orbitals as C(:,i) or as C(i,:), depending on the operations you perform on them.
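For example, a minimal sketch (the array sizes and the norm_of_column helper are made up for illustration) of how storing the orbitals as columns makes each slice C(:,i) contiguous:

module orbitals_demo
implicit none
contains
   pure function norm_of_column(col) result(s)
      real, intent(in) :: col(:)
      real :: s
      s = sqrt(sum(col*col))
   end function norm_of_column
end module orbitals_demo

program demo
use orbitals_demo
implicit none
integer, parameter :: nbas = 1000, norb = 200
real :: c(nbas, norb), norms(norb)
integer :: i
call random_number(c)
do i = 1, norb
   norms(i) = norm_of_column(c(:,i)) ! contiguous column slice
   ! norm_of_column(c(i,:)) would instead pass a strided row section
end do
print *, maxval(norms)
end program demo

Passing C(i,:) instead would hand the routine a strided row section, which typically means scattered memory accesses or a temporary copy.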

1 Like

Designing Fortran data structures and ordering DO loops for sequential memory access is a “no-brainer”.
It will not produce a worse outcome, and when multi-threading is utilised for larger-memory problems it should mitigate significant memory-cache bottlenecks.
If the arrays can fit in L3 cache, the problem is less significant, but there still are L1 cache advantages for vector instructions.
Compiler optimisation can help in some cases, but not in all of the more complex practical calculations.

If you are aware of the problem, why ignore it?

I think a problem with the test example is that the initialisation stage is significant and that CPU_TIME is not a good accumulator of elapsed time.
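
For what it is worth, a sketch of timing the kernel with SYSTEM_CLOCK instead, which measures elapsed (wall-clock) time rather than CPU time:

program wall_time_demo
use iso_fortran_env, only: int64
implicit none
integer, parameter :: wp = kind(1.0d0), n = 2000
real(wp), allocatable :: a(:,:), b(:,:), c(:,:)
integer(int64) :: t1, t2, rate
integer :: i, j
allocate (a(n,n), b(n,n), c(n,n))
call random_number(a); call random_number(b)
call system_clock(t1, rate)
do j = 1, n
   do i = 1, n
      c(i,j) = a(i,j) + b(i,j)
   end do
end do
call system_clock(t2)
print "(a,f0.6,a)", "elapsed time = ", real(t2-t1, wp)/real(rate, wp), " s"
if (maxval(c) > 2.0_wp) stop "bad variates" ! force c(:,:) to be computed
end program wall_time_demo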

3 Likes

I don’t intend to ignore it. I just want this issue to be clarified in the Fortran best practices mini-book so users know it.

2 Likes

This problem was much more significant in the '70s and '80s with Pr1me / VAX virtual-memory computers, especially for codes that were developed on CDC and other fixed-memory architectures. I helped address this issue for a number of code transfers.

It is now certainly more of an issue for OpenMP, where memory access is shared by threads.
Interestingly, for gfortran, -O3 appears to attempt re-ordering loops, while -O2 does not.
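
A sketch of what I mean for OpenMP (the array size is arbitrary; compile with the OpenMP flag, e.g. -fopenmp for gfortran): parallelising the outer (column) loop keeps each thread working on whole, contiguous columns.

program omp_order_demo
implicit none
integer, parameter :: wp = kind(1.0d0), n = 2000
real(wp), allocatable :: a(:,:), b(:,:), c(:,:)
integer :: i, j
allocate (a(n,n), b(n,n), c(n,n))
call random_number(a); call random_number(b)
!$omp parallel do private(i)
do j = 1, n        ! each thread gets a block of whole columns
   do i = 1, n     ! innermost index varies fastest: stride-1 access
      c(i,j) = a(i,j) + b(i,j)
   end do
end do
!$omp end parallel do
print *, maxval(c)
end program omp_order_demo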

When testing efficiency, elapsed time rather than CPU time is the more relevant measure.

1 Like

This is my experience too. Whilst loop order might not matter for code speed, I still think it is good practice to maintain, so that you don't miss the cases where it does matter.

7 Likes

Checking whether the compiler does this is pretty easy:

$> gfortran -O2 -fopt-info-loop test.f90
<no output -- no re-ordering>

$> gfortran -O3 -fopt-info-loop test.f90
test.f90:22:12: optimized: loops interchanged in loop nest
test.f90:17:15: optimized: loop vectorized using 16 byte vectors
test.f90:23:16: optimized: loop vectorized using 16 byte vectors
test.f90:12:24: optimized: basic block part vectorized using 16 byte vectors
test.f90:12:24: optimized: basic block part vectorized using 16 byte vectors
test.f90:13:24: optimized: basic block part vectorized using 16 byte vectors
test.f90:13:24: optimized: basic block part vectorized using 16 byte vectors
test.f90:34:70: optimized: basic block part vectorized using 16 byte vectors

It should be pretty clear what happens and when.
As has been noted, this optimization depends on the complexity of the inner loop body, so it is not so easily generalized.

1 Like

Basically yes. When the compiler sees a function/subroutine call it cannot know whether it has side effects, and so it preserves the linear loop progression: no loop interchange and no vectorization.

This is what pure and elemental are for: to tell the compiler there are no side effects.
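
A minimal sketch (the kernels module and scaled_sum function are hypothetical) of how the attribute helps:

module kernels
implicit none
contains
   pure elemental function scaled_sum(x, y) result(z)
      ! pure/elemental: no side effects, so the compiler is free to
      ! inline the call, interchange the surrounding loops, and vectorize
      real, intent(in) :: x, y
      real :: z
      z = 0.5*(x + y)
   end function scaled_sum
end module kernels

program pure_demo
use kernels
implicit none
integer, parameter :: n = 2000
real, allocatable :: a(:,:), b(:,:), c(:,:)
integer :: i, j
allocate (a(n,n), b(n,n), c(n,n))
call random_number(a); call random_number(b)
do i = 1, n               ! "wrong" loop order on purpose
   do j = 1, n
      c(i,j) = scaled_sum(a(i,j), b(i,j))
   end do
end do
print *, maxval(c)
end program pure_demo

With an external, possibly impure procedure in the loop body instead, the compiler would have to keep the nest exactly as written.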

2 Likes