Advice for approaching data locality improvements in legacy code

Why do you think so? Array syntax was introduced into Fortran 90 for the specific purpose of enabling compilers to exploit SIMD features of processors more easily. You can learn more about this from the Fortran 90 edition of the well-known “Numerical Recipes” book.

In your example you are relying on the compiler to first inline your function, and then to optimize the result further. With what I suggested you are in complete control of the performance of your code.

1 Like

I read the part on Numerical Recipes (Fortran 90 edition) about array syntax, but I think it is a bit outdated. I tried coding up a simple dynamic programming example in the post linked below and it turned out that the versions using array syntax instead of loops are slower. The code of the example below is also available on github

Please note that you don’t have to use array syntax to vectorize Fortran code, i.e. to make a compiler emit vector instructions.

You can achieve the same effect also with simple do loops (and possibly some compiler directives) or by using do concurrent within your vectorized function. What is important is that the function operates on entire arrays, rather than scalars.

When it comes to drawing general conclusions on the efficiency of array- vs. loop-syntax, I would also be cautious in quoting timings hat were obtained from the use of a single vendor’s compiler – in particular if that compiler is still somewhat of a work in progress (performance-wise).

1 Like

What do you mean exactly by not having to use array syntax?

I came up with a MWE. The bottleneck is the three nested loops in the main program. How would you vectorize this part of the code? The function is defined in a module and accepts scalar arguments and returns a scalar result.

! Construct 3D array to hold results using nested loops
do iz = 1, n_z
  do ia = 1, n_a
    do iap = 1, n_a
        payoff(iap,ia,iz) = return_fn(a_grid(iap),a_grid(ia),z_grid(iz),r,w)
    enddo
  enddo
enddo

Here is the complete MWE:

module mymod
  implicit none

contains

function linspace(a, b, n) result(x)
    implicit none
    real(kind=8), intent(in) :: a, b
    integer, intent(in)      :: n
    real(kind=8) :: x(n)

    integer :: i
    real(kind=8) :: step

    if (n < 1) then
        ERROR STOP 'linspace: n must be at least 1'
    endif

    if (n == 1) then
        x(1) = a
    else
        step = (b - a) / real(n - 1, kind=8)
        do i = 1, n
            x(i) = a + step * real(i - 1, kind=8)
        enddo
    endif
end function linspace

function return_fn(aprime,a,z,r,w) result(res)
    implicit none
    ! Declare input arguments
    real(8), intent(in) :: aprime, a, z
    real(8), intent(in) :: r, w
    ! Declare function result
    real(8) :: res
    ! Declare local variables
    real(8) :: consumption

    consumption = (1.0d0+r)*a + w*z - aprime
    if (consumption > 0.0d0) then
        res = log(consumption)
    else
        res = -1.0d10
    endif
end function return_fn

end module mymod

program main
use mymod, only: return_fn, linspace
implicit none
 
integer, parameter :: n_a = 1000, n_z = 50
real(8), parameter :: r = 0.04d0, w = 1.0d0 
integer :: iap,ia,iz, alloc_status
real(8) :: a_grid(n_a), z_grid(n_z)
real(8), allocatable :: payoff(:,:,:)
real(8) :: t_start, t_end

! Build vectors for a and z
a_grid = linspace(0.001d0, 50.0d0, n_a)
z_grid = linspace(0.5d0, 1.5d0, n_z)

! Allocate 3D array to hold results
allocate(payoff(n_a, n_a, n_z), stat=alloc_status)
if (alloc_status /= 0) then
    print *, "Error allocating memory for payoff array"
    stop
endif 

call cpu_time(t_start)
! Construct 3D array to hold results using nested loops
do iz = 1, n_z
  do ia = 1, n_a
    do iap = 1, n_a
        payoff(iap,ia,iz) = return_fn(a_grid(iap),a_grid(ia),z_grid(iz),r,w)
    enddo
  enddo
enddo

call cpu_time(t_end)
print *, "CPU time (s): ", t_end - t_start

!do iap=1,5
!  do ia=1,5
!    do iz=1,5
!      print *, "payoff(",iap,",",ia,",",iz,") = ", payoff(iap,ia,iz)
!    enddo
!  enddo
!enddo

deallocate(payoff)

end program main


This MWE is also available on github: GitHub - aledinola/FortranVec2: Loops vs vectorized code in Fortran

Any help is really appreciated!

P.S. I can create a separate post if users think it would be more appropriate

Function ReturnFN is too trivial to be used in this way.
Adopting OpemMP for this trivial computation example is unlikely to be effective, given the values for Nz, Na and Nap.

My recommendation would be to rempve ReturnFN and convert DO iap to vector syntax when possible.

Hopefully there is more computation in the real case, than in the MWE, which might justify OpenMP. If Nz = 50, collapse would not provide significant improvement for typical thread counts.

1 Like

I modified return_fn to make the MWE more similar to the actual application I am working on

Basicaslly you had

program main_program

 implicit none
 integer, parameter :: Nap = 1000, Na = 500, Nz = 50
 integer :: iap, ia, iz, iaz

 real :: array(Nap, Na, Nz)

! Loop order optimized for Fortran column-major storage

!$OMP PARALLEL DO SHARED (array) PRIVATE (iz,ia,iap,iaz)
 do iz = 1, Nz
  do ia = 1, Na
   iaz = ia+iz
   do iap = 1, Nap
     array(iap, ia, iz) = iap + iaz
   end do
  end do
 end do
!$OMP END PARALLEL DO

end program main_program

There is no justificastion for OpenMP in this calculation load with a 95 MByte array.
What is the new calc?

1 Like

I wrote another post (in this thread) where I improved the MWE, you can see the new function above.

I second that. This function needs to be doing sufficient work. Otherwise both the function call overhead, and the overhead due to the multi-threading won’t be negligible.

Please do, rather than hijacking the OP’s thread.

1 Like

I will do a separate post, then. (No intention to hijack anything of course!)

1 Like

1: if the iteration count of the outer loops is significantly larger than the number of threads, there’s no need to collapse the loops

3: it can help; but you will get the same effect with Link Time Optimization (-flto in gfortran) with the procedure in a separate module.

Indeed, but array syntax requires more complex analysis by the optimizers. The F77 compilers were very good at optimizing regular loops, and it took quite a long time to get F90 compilers having the the same efficiency with array syntax. A major problem for the compilers was to determine if an array expression could simply be replaced by loops or if temporary arrays were needed, and the conservative choice was to use temporary arrays (with the associated performance penalty).

Even today, in the most critical parts of the codes, I tend using loops or simple array expressions, and not complex array expressions where a compiler may struggle optimizing them.

1 Like

Sure. There are cases where a compiler would introduce array temporaries when confronted with array syntax, and in these cases it is better to use loops.

But there are also cases where a compiler won’t vectorize a loop because it will falsely believe that there are loop-carried dependencies present in that loop. Rather than using a (non-portable) compiler directive to enforce vectorization, one could be better off using array syntax in such cases.

It all depends on the situation.

1 Like

To reduce the number of function calls, an alternative approach might be something like the following, although some of the arguments aprime,a,z,r or w may need to be re-defined as arrays.
However, when we are worrying about the number of function calls, the computation load must be trivial.

subroutine return_sub ( Nap, Vector, aprime,a,z,r,w )
    implicit none
    ! Declare function result
    integer(4) :: Nap, k
    real(8)    :: Vector(Nap)
    ! Declare input arguments
    real(8), intent(in) :: aprime, a, z
    real(8), intent(in) :: r, w
    ! Declare local variables
    real(8) :: consumption, res

    do k = 1,nap
      consumption = (1.0d0+r)*a + w*z - aprime
      if (consumption > 0.0d0) then
        res = log(consumption)
      else
        res = -1.0d10
      end if
      Vector(k) = res
    end do

end subroutine return_sub

I did some tests to see if any approach was better, but the inner loop needs more work !
This is the code I used to test this approach, which shows:
3 loops vs 2 loops + vector subroutine is no difference
OpenMP vs Serial is no difference ( using 12 threads )

module timer
 implicit none
   integer*8 :: start_tick, end_tick, tick_rate

   contains

   subroutine start_my_timer (desc)
    character desc*(*)
    write (*,*) 'start timer ',desc
    call system_clock ( start_tick, tick_rate )
   end subroutine start_my_timer

   subroutine end_my_timer
    call system_clock ( end_tick, tick_rate )
    write (*,*) 'end timer ',real(end_tick-start_tick) / real(tick_rate)
   end subroutine end_my_timer

end module timer

program main_program
 use timer
 implicit none

 integer, parameter :: Nap = 1000, Na = 500, Nz = 50
 integer :: ia, iz, iaz, kk

 common /zzz/ array
 real :: array(Nap, Na, Nz)

 do kk = 1,2
! Loop order optimized for Fortran column-major storage
 call start_my_timer ( 'openMP' )
!$OMP PARALLEL DO SHARED (array) PRIVATE (iz,ia,iaz)
 do iz = 1, Nz
  do ia = 1, Na
   iaz = ia+iz
   call return_sub( Nap, array(1, ia, iz), iaz )
  end do
 end do
!$OMP END PARALLEL DO

 call end_my_timer
 end do

 call start_my_timer ('Serial')
 do iz = 1, Nz
  do ia = 1, Na
   iaz = ia+iz
   call return_sub( Nap, array(1, ia, iz), iaz )
!   do iap = 1, Nap
!     array(iap, ia, iz) = iap + iaz
!   end do
  end do
 end do
 call end_my_timer
 
end program main_program

subroutine return_sub( Nap, Vector, iaz )
    implicit none
    ! Declare function result
    integer :: Nap, k, iaz
    real    :: Vector(Nap)

    do k = 1,nap
      Vector(k) = iaz + k
    end do

end subroutine return_sub

These are the timing results, for 3 tests, which show no significant changes

# no serial test
 start timer 
 end timer    9.63599980E-03
 start timer 
 end timer    4.68889996E-03

# using 3 X loops
 start timer 
 end timer    1.06913997E-02
 start timer 
 end timer    4.64260019E-03
 start timer 
 end timer    4.84750001E-03

# using return_SUB
 start timer openMP
 end timer    9.62499995E-03
 start timer openMP
 end timer    4.70399996E-03
 start timer Serial
 end timer    5.02829999E-03
1 Like