Performance impact of how a large array is accessed

Nothing prevents a compiler to generate a runtime test about the contiguity of an argument.

If you already had an array in memory and needed to performs operations on non contiguous sections, it was actually more efficient to use the inc* and ld* arguments rather than making copies 1) that were using even more memory (which was very limited in that times, so increasing the risk of paging) and 2) that were taking tike on their own.

This was particularly true on the machines that had no cache memory (i.e. Cray): accessing non-contiguous data was as fast as accessing contiguous data, so making copies had actually only drawbacks.

Going from disk based paging to today, the equivalent to “data on disk delays” is now the clash between data in memory vs data in cache, although there are multiple levels of cache.

Although “nothing prevents a compiler to generate a runtime test about the contiguity of an argument”, I am not familiar with alternative machine instructions being provided.
However there has been suggestions about alternative machine instructions where a .exe must support a variety of hardware types, so this is a bit vague.

What do you mean precisely by “disk based paging”? And what was different then from today? memory paging has always been on disk in practice, and this is true even today (just, this is much faster since SSDs came into the equation). This is orthogonal to cache issues.

My point is that it was often preferable to use strides (incx, etc) rather than duplicating the data.

Sorry, but I really don’t get what you mean here…

It use to be that having data paged out to the disk resulted in a performance penalty. (it is amazing we tolerated this performance delay, as today if data is paged to disk, I am checking to see if the pc has crashed)
Now, having data paged out of cache to memory results in a significant performance penalty.
Using strides imposes a reduced efficiency of cache availability so a noticeable performance hit.

My understanding that when linking Gfortran, it is possible to create an .exe that supports multiple processor instruction sets. This may be via flexible libraries, demonstrating that alternative instructions can be supported in the .exe. I was suggesting this could be an example where alternative instruction set solutions were provided.
I was attempting to equate this compiler capability to providing alternatives of allocating an array on the stack or the heap, with the decision at run time depending on the available stack size. After all, it is only an alternative address to associate with the array, plus the required alternative clean-up at the end of the routine.
Imagine a friendly OS without stack overflows.

I hope this clarifies my thoughts

It’s still the case as of today.

If for instance you have a 2D array and you need to alternatively work on columns and on rows, whatever the reason, you have two options:

  • each time copy/transpose the array, with the associated cost and occupying even more memory
  • use strides when working on the rows

There is no “best option”, all depends on each specific case, and you cannot say “don’t use strides”.

Yes, I was programming in the 1970s, and the programmer was required to make choices between temporary arrays (either statically allocated or manually allocated out of a workspace) and the use of nonunit strides, not unlike the present situation. Then a decade later, RISC architectures also had performance penalties for nonunit strides, and the programmer had to make the same kinds of decisions in those cases. I always regarded these on a case-by-case basis, depending on the number of times the nonunit stride vector was accessed, the size of the temporary copy, and so on. I never just avoided nonunit stride vectors as a general rule. Nowadays, the same questions arise, but instead of actual hard disk access, it is more likely involves an SSD access, and instead of register access, it is probably more related to multilevel cache access. I still think that generally avoiding the use of array slices by making manual copies is not the best approach. Sometimes, a nonunit stride array slice might be optimal, sometimes a contiguous compiler-generated local copy is optimal, and other times a programmer-created contiguous local copy is optimal.

1 Like

I started this thread because I wondered about how best to access a large subarray and because I was curious about a 4% performance change. But in the process of this discussion and modifying the code, I observed an orders-of-magnitude slowdown with the gnu (gfortran) compiler on Perlmutter. So, to pursue this to its conclusion, I have developed a simple (50 line) toy code that illustrates the issue.

The are 3 Fortran files:
aamain.f90 : main program
probcons.f90 : module that defines a 3D array (called hist), and number of history points (called nhistpoints)
blahroutines.f90 : module containing subroutine blah that needs to access a 2D subarray of hist

One issue raised earlier involves whether the call to blah contains just colons (indicating the whole array dimension), or whether the call contains explicit array indices that are in fact the same as the whole array dimensions. (Here I’m referring to the left 2 dimensions, not the 3rd right-most dimension).

Another issue involves whether or not one uses the CONTIGUOUS directive in subroutine blah.

It appears that this has nothing to do with the code being parallel, so I ran it on a front end node of Perlmutter with 4 compilers: gfortran, crayftn, ifort, and nvfortran.

After running the code I’ve concluded the following: There’s nothing unexpected happening with the crayftn, ifort, and nvfortran compilers. In all cases that I ran, the code finished in a few times 10^-3 or 10^-4 seconds.

But something unexpected is happening with gfortran. It executes in around 10^-4 sec except in one case: If I run it with explicit array indices in the call to blah, along with the CONTIGUOUS attribute within blah, the code makes an array copy, and it slows down by many orders of magnitude (it ran in 30 seconds!). The other compilers don’t do this.

I find this particularly strange because array hist is allocated the main program — so it knows the array size there — and the call to blah occurs in the main program of this toy code. But, call blah( hist( :,0:,i )…) produces no array copy while call blah( hist( 1:6,0:nhistpoints,i ),…) does produce an array copy. It’s as though the presence of CONTIGUOUS in subroutine blah causes gfortran to make an array copy even though it has all the information needed (in main where blah is called) to see that the array argument is contiguous and a copy is not needed.

Here is the code for anyone who wants to try it.

aamain.f90:
  program arraytemptest
  use probcons, only : hist,nhistpoints
  use blahroutines, only : blah
  implicit none
  integer :: npart_lcl,i,j
  real(8) :: rslt
  integer(8) :: ihertz,iticks0,iticks1
  call system_clock(count_rate=ihertz)

  npart_lcl=256
  write(6,*)'Running with nhistpoints,npart_lcl=',nhistpoints,npart_lcl

!allocate arrays:
  allocate(hist(6,0:nhistpoints,npart_lcl))
  hist(:,0:,:)=1.d0
 
! do the calculation:
  call system_clock(count=iticks0)

  do i=1,npart_lcl
    do j=1,npart_lcl
      call blah(hist(1:6,0:nhistpoints,i),rslt,i,j)
!     call blah(hist(:,0:,i),rslt,i,j)
      if(rslt.eq.-123456.d0)write(6,*)rslt
    enddo
  enddo

  call system_clock(count=iticks1)
  write(6,*)'Done. Elapsed time=',(iticks1-iticks0)/(1.d0*ihertz),' sec'
  end

probcons.f90:
module probcons
      real(8), allocatable, dimension(:,:,:) :: hist
      integer, parameter :: nhistpoints=50000
contains
end module probcons

blahroutines.f90:
  module blahroutines
  implicit none
  contains
      subroutine blah(hist,res,i,j)
      real(8), contiguous, dimension(:,0:) :: hist
!     real(8), dimension(:,0:) :: hist
      real(8) :: res
      integer :: i,j,k,n
      res=1.d0
      do k=1,10
        n=min(j,k)
        res=res*hist(i,n)
      enddo
      if(res.eq.-12345.d0)write(6,*)'res=',res
      return
      end
  end module
1 Like

CONTIGUOUS certainly causes a problem in Gfortran

I have included a modified code that demonstrates some call alternatives.

!! probcons.f90:
  module probcons
   use iso_fortran_env, only: dp=>real64
      real(dp), allocatable, dimension(:,:,:) :: hist
      integer, parameter :: nhistpoints=50000
! contains
  end module probcons

!! blahroutines.f90:
  module blahroutines
   use iso_fortran_env, only: dp=>real64
   implicit none
   contains
    subroutine blah_orig (hist,res,i,j)
      real(dp), contiguous, dimension(:,0:) :: hist
!     real(8), dimension(:,0:) :: hist
      real(dp) :: res
      integer :: i,j,k,n
      res=1.d0
      do k=1,10
        n=min(j,k)
        res=res*hist(i,n)
      enddo
      if(res.eq.-12345.d0)write(6,*)'res=',res
      return
    end subroutine blah_orig

    subroutine blah (hist,res,i,j)
!      real(8), contiguous, dimension(:,0:) :: hist
      real(dp), dimension(:,0:) :: hist
      real(dp) :: res
      integer :: i,j,k,n
      res=1.d0
      do k=1,10
        n=min(j,k)
        res=res*hist(i,n)
      enddo
      if(res.eq.-12345.d0)write(6,*)'res=',res
      return
    end subroutine blah

    subroutine blah_star (hist,res,i,j)
!      real(8), contiguous, dimension(:,0:) :: hist
      real(dp), dimension(6,0:*) :: hist
      real(dp) :: res
      integer :: i,j,k,n
      res=1.d0
      do k=1,10
        n=min(j,k)
        res=res*hist(i,n)
      enddo
      if(res.eq.-12345.d0)write(6,*)'res=',res
      return
    end subroutine blah_star

    function delta_sec ()
      ! high precision timer.
      implicit none
      real(dp) :: delta_sec
      integer*8 :: tick, last=-1, rate=-1

      if ( rate < 0 ) call system_clock ( last, rate )
      call system_clock (tick)
      delta_sec = dble (tick-last) / dble (rate)
      last      = tick
    end function delta_sec

  end module blahroutines

! aamain.f90:
  program arraytemptest
  use probcons    ! , only : hist,nhistpoints
  use blahroutines   ! , only : blah, blah_star, blah_orig
  implicit none
   integer     :: npart_lcl,i,j
   real(dp)    :: rslt, sec

   sec = delta_sec ()

   npart_lcl=256
   write(6,*)'Running with nhistpoints,npart_lcl=',nhistpoints,npart_lcl

!allocate arrays:
   allocate(hist(6,0:nhistpoints,npart_lcl))
   hist(:,0:,:)=1.d0
 
! do the calculation: with contiguous
   sec = delta_sec ()

   do i=1,npart_lcl
     do j=1,npart_lcl
       call blah_orig (hist(1:6,0:nhistpoints,i),rslt,i,j)
!      call blah(hist(:,0:,i),rslt,i,j)
       if (rslt.eq.-123456.d0) write(6,*)rslt
     end do
   end do

   sec = delta_sec ()
   write(6,*)'BLAH_orig Done. Elapsed time=',sec,' sec', rslt

! do the calculation: with contiguous
   sec = delta_sec ()

   do i=1,npart_lcl
     do j=1,npart_lcl
       call blah_orig (hist(:,:,i),rslt,i,j)
!      call blah(hist(:,0:,i),rslt,i,j)
       if (rslt.eq.-123456.d0) write(6,*)rslt
     end do
   end do

   sec = delta_sec ()
   write(6,*)'BLAH_o:,: Done. Elapsed time=',sec,' sec', rslt

 ! do the calculation: no contiguous
   sec = delta_sec ()

   do i=1,npart_lcl
     do j=1,npart_lcl
       call blah (hist(1:6,0:nhistpoints,i),rslt,i,j)
 !     call blah(hist(:,0:,i),rslt,i,j)
       if (rslt.eq.-123456.d0) write(6,*)rslt
     end do
   end do

   sec = delta_sec ()
   write(6,*)'BLAH(1:6  Done. Elapsed time=',sec,' sec', rslt

 ! do the calculation: no array section definition
   sec = delta_sec ()

   do i=1,npart_lcl
     do j=1,npart_lcl
 !      call blah (hist(1:6,0:nhistpoints,i),rslt,i,j)
       call blah (hist(:,0:,i),rslt,i,j)
       if (rslt.eq.-123456.d0) write(6,*)rslt
     end do
   end do

   sec = delta_sec ()
   write(6,*)'BLAH(:,:) Done. Elapsed time=',sec,' sec', rslt

 ! do the calculation: F77 wrapper approach
   sec = delta_sec ()

   do i=1,npart_lcl
     do j=1,npart_lcl
       call blah_star (hist(1,0,i),rslt,i,j)
       if (rslt.eq.-123456.d0) write(6,*)rslt
     end do
   end do

   sec = delta_sec ()
   write(6,*)'BLAH_star Done. Elapsed time=',sec,' sec', rslt

  end program arraytemptest
Target: x86_64-w64-mingw32
Thread model: win32
gcc version 11.1.0 (GCC) 
COLLECT_GCC_OPTIONS='-v' '-fimplicit-none' '-fallow-argument-mismatch' '-O2' '-march=native' '-ffast-math' '-o' 'blah.exe'
 Running with nhistpoints,npart_lcl=       50000         256
 BLAH_orig Done. Elapsed time=   31.638044000000001       sec   1.0000000000000000     
 BLAH_o:,: Done. Elapsed time=   4.9350000000000002E-004  sec   1.0000000000000000     
 BLAH(1:6  Done. Elapsed time=   4.6920000000000002E-004  sec   1.0000000000000000     
 BLAH(:,:) Done. Elapsed time=   4.7029999999999999E-004  sec   1.0000000000000000     
 BLAH_star Done. Elapsed time=   3.4120000000000000E-004  sec   1.0000000000000000     

Use of contiguous combined with array section call certainly fails in this case
F77 wrapper works best, as it enforces no temporary arrays, but this is not always the best case.

2 Likes

If you’re correct, you should fill a bug report. It’s not a bug per se, but at least an unexpected behavior.

Thanks very much for making a minimal test code! I’ve also tried a slightly modified version with gfortran-12 (on Ubuntu22), and got similar speed-down when passing array slices (with explicit lower bounds) to dummy arguments with contiguous.

!! var.f90
module var_mod
    real(8), allocatable :: hist(:,:,:)
end module

!! proc.f90
module proc_mod
implicit none
contains

subroutine blah( hist, res, i, j )

    !! real(8) :: hist( :, 0: )
    real(8), contiguous :: hist( :, 0: )

    real(8) :: res
    integer :: i, j, k, n

    if (i == 1 .and. j == 1) print *, "blah: ", is_contiguous( hist )

    res = 1
    do k = 1, 10
        n = min( j, k )
        res = res * hist( i, n )
   enddo
end

end module

!! main.f90
program main
  use var_mod,  only : hist
  use proc_mod, only : blah
  implicit none

  integer :: npts, npart, i, j
  real(8) :: res
  integer(8) :: tstart, tend, trate
  call system_clock( count_rate=trate )

  npts = 10000
  npart = 256
  print *, 'npts, npart =',npts, npart

  allocate( hist( 6, 0:npts, npart ) )
  hist(:,:,:) = 1

  print *, is_contiguous( hist( :,   :,      1 ) )
  print *, is_contiguous( hist( :,   0:,     1 ) )
  print *, is_contiguous( hist( 1:6, 0:npts, 1 ) )
  print *, is_contiguous( hist( 1:,  :,      1 ) )
  print *, is_contiguous( hist( :6,  :,      1 ) )
  print *, is_contiguous( hist( 1:6, :,      1 ) )

  call system_clock( count= tstart )

  do i = 1, npart
  do j = 1, npart

      !! call blah( hist( :,   :,      i ), res, i, j )   !! fast
      !! call blah( hist( :,   0:,     i ), res, i, j )   !! fast
      !! call blah( hist( 1:6, 0:npts, i ), res, i, j )   !! slow (array-temp)
      !! call blah( hist( 1:,  :,      i ), res, i, j )   !! slow (array-temp)
      !! call blah( hist( :6,  :,      i ), res, i, j )   !! slow (array-temp)
      call blah( hist( 1:6, :,      i ), res, i, j )   !! slow (array-temp)

  enddo
  enddo

  call system_clock( tend )
  print *, 'time = ', dble( tend - tstart ) / trate, ' sec'
  print *, "res = ", res

end program

$ gfortran-12 -O2 var.f90 proc.f90 main.f90

Because is_contiguous() always returns true both at the main and the subroutine, I think the compiler fails to use that information when creating an array “descriptor” to be passed to the subroutine. Also, because other compilers do not make array temporaries,
it looks like an optimization issue of gfortran <= 12 (but not tried >= 13 yet…)

1 Like