Effect of `contiguous` attribute for dummy array arguments

Hello,

I have never used the contiguous attribute for dummy array arguments so far, but in the following test code, gfortran-12 seems to give 2x speed-up as compared to the code with no such attribute. Is this kind of speed-up typical of other (more complicated) codes, or am I doing something wrong in the test code…? (FYI, I tried attaching contiguous to my own codes much ago, but I observed no speed-up at that time.)

[ test code ]
sumexp.f90:

module test_m
    use iso_fortran_env, only: dp => real64
    implicit none
contains

subroutine sumexp_1d( vec, val )
    real(dp), intent(in) _CONT_ :: vec(:)
    real(dp), intent(out) :: val

    val = sum( exp(vec) )
end

subroutine sumexp_2d( vecs, ans )
    real(dp), intent(in) _CONT_ :: vecs(:,:)
    real(dp), intent(out) :: ans

    ans = sum( exp( vecs ) )
end

subroutine sumexp_2d_loop( vecs, ans )
    real(dp), intent(in) _CONT_ :: vecs(:,:)
    real(dp), intent(out) :: ans
    integer :: i1, i2

    ans = 0
    do i2 = 1, size( vecs, 2 )
    do i1 = 1, size( vecs, 1 )
        ans = ans + exp( vecs( i1, i2 ) )
    enddo
    enddo
end
end module

main.f90:

program main
    use test_m
    implicit none
    integer, parameter :: ndat = 10 ** 6
    real(dp), allocatable :: arr(:,:)
    real(dp) :: ans, val, c_start, c_end
    integer :: idat

    allocate( arr( 4, ndat ) )
    call random_number( arr )

    print *, "---- 1d ----"
    call cpu_time(c_start)
    ans = 0
    do idat = 1, ndat
        call sumexp_1d( arr( :, idat ), val )
        ans = ans + val
    enddo
    call cpu_time(c_end)

    print *, "ans  = ", ans
    print *, "time = ", c_end - c_start

    print *, "---- 2d ----"
    call cpu_time(c_start)
    call sumexp_2d( arr, ans )
    call cpu_time(c_end)

    print *, "ans  = ", ans
    print *, "time = ", c_end - c_start

    print *, "---- 2d ----"
    call cpu_time(c_start)
    call sumexp_2d( arr, ans )
    call cpu_time(c_end)

    print *, "ans  = ", ans
    print *, "time = ", c_end - c_start

    print *, "--- 2d_loop ---"
    call cpu_time(c_start)
    call sumexp_2d_loop( arr, ans )
    call cpu_time(c_end)

    print *, "ans  = ", ans
    print *, "time = ", c_end - c_start
end

[ Result (gfortran-12) ]
Note1: machine = Ryzen 7 5700X + Ubuntu22
Note2: Timing fluctuates to some extent, but the fluctuation seems within 20 % or so.

$ gfortran-12 -O3 -march=native -cpp -D_CONT_= sumexp.f90 main.f90 && ./a.out
 ---- 1d ----
 ans  =    6873536.5875237575
 time =    1.5854000000000000E-002
 ---- 2d ----
 ans  =    6873536.5875230227
 time =    1.0854000000000003E-002
 --- 2d_loop ---
 ans  =    6873536.5875230227
 time =    1.1753000000000000E-002

$ gfortran-12 -O3 -march=native -cpp -D_CONT_= -flto sumexp.f90 main.f90 && ./a.out
 ---- 1d ----
 ans  =    6871727.5196960503
 time =    6.3569999999999946E-003   <-- becomes faster with -flto (due to inlining?)
 ---- 2d ----
 ans  =    6871727.5196956908
 time =    1.1256000000000002E-002
 --- 2d_loop ---
 ans  =    6871727.5196956908
 time =    1.1327999999999998E-002

$ gfortran-12 -O3 -march=native -cpp -D_CONT_=",contiguous" -flto sumexp.f90 main.f\
90 && ./a.out
 ---- 1d ----
 ans  =    6872925.9581768718
 time =    7.0910000000000001E-003
 ---- 2d ----
 ans  =    6872925.9581758250
 time =    6.7070000000000046E-003  <-- becomes faster with "contiguous"
 --- 2d_loop ---
 ans  =    6872925.9581758250
 time =    6.6529999999999992E-003 <-- becomes faster with "contiguous"

(Result of CompilerExplorer for comparison with ifort/ifx)

gfortran-14.2 : options =  -O3 -march=native -cpp -D_CONT_=
 ---- 1d ----
 ans  =    6873122.9830409009     
 time =    6.3490000000000005E-003
 ---- 2d ----
 ans  =    6873122.9830414364     
 time =    1.0876000000000004E-002
 --- 2d_loop ---
 ans  =    6873122.9830414364     
 time =    1.1318000000000002E-002

gfortran-14.2 : options =  -O3 -march=native -cpp -D_CONT_=",contiguous"
---- 1d ----
 ans  =    6873095.4191845488     
 time =    7.2559999999999986E-003
 ---- 2d ----
 ans  =    6873095.4191840515     
 time =    7.7909999999999993E-003  <-- now speed-up is not so much?
 --- 2d_loop ---
 ans  =    6873095.4191840515     
 time =    9.9569999999999936E-003

[ Result (ifort2021.11.0 with CompilerExplorer) ]

ifort options = -O3 -march=native -cpp -D_CONT_=
 ---- 1d ----
 ans  =    6873735.16051230     
 time =   8.883999999999996E-003
 ---- 2d ----
 ans  =    6873735.16051221     
 time =   7.436999999999999E-003
 --- 2d_loop ---
 ans  =    6873735.16051221     
 time =   7.478000000000005E-003

ifort options = -O3 -march=native -cpp -D_CONT_=",contiguous"
---- 1d ----
 ans  =    6873735.16051230     
 time =   6.016000000000001E-003  <-- somewhat faster with `contiguous`
 ---- 2d ----
 ans  =    6873735.16051221     
 time =   5.009000000000000E-003   <-- somewhat faster with `contiguous`
 --- 2d_loop ---
 ans  =    6873735.16051221     
 time =   5.124999999999998E-003  <-- somewhat faster with `contiguous`

[ Result (ifx2024.0.0) with CompilerExplorer

ifx options = -O3 -march=native -cpp -D_CONT_=
---- 1d ----
 ans  =    6873735.16051203     
 time =   6.585000000000008E-003
 ---- 2d ----
 ans  =    6873735.16051203     
 time =   1.010300000000000E-002
 --- 2d_loop ---
 ans  =    6873735.16051203     
 time =   9.993000000000002E-003

ifx options = -O3 -march=native -cpp -D_CONT_=",contiguous"
---- 1d ----
 ans  =    6873735.16051203     
 time =   9.002000000000010E-003  <-- slower with `contiguous` (?)
 ---- 2d ----
 ans  =    6873735.16051203     
 time =   9.013000000000007E-003  <-- not much different with `contiguous`
 --- 2d_loop ---
 ans  =    6873735.16051203     
 time =   8.839000000000000E-003  <-- not much different with `contiguous`
  • As compared to gfortran, the speed-up seems moderate for ifort and very little for ifx. So the result seems to vary depending on compilers.
  • Another surprise to me is that sumexp_1d() is rather fast as compared to the other 2D routines. I expected that the 1D routine would be very slow because a short vector is passed to a routine many times, so with a lot of overhead. But in this simple case, inlining actually eliminates such an overhead…?
1 Like

A quick test reveals that the Fortran-77 old-style assumed-size (*) method always achieves the fastest result (test it vs compiler here):

subroutine sumexp_2d_n( n, vecs, ans )
    integer, intent(in) :: n
    real(dp), intent(in)  :: vecs(4,n)
...

Produces:

 ---- 1d ----
 ans  =    6873735.16051203     
 time =   1.890900000000001E-002
 ---- 2d ----
 ans  =    6873735.16051203     
 time =   1.926500000000000E-002
 ---- 2d - assumed size ----
 ans  =    6873735.16051221     
 time =   1.329300000000000E-002
 --- 2d_loop ---
 ans  =    6873735.16051203     
 time =   1.879100000000000E-002

To what extent, it depends of course on the machine and the compiler this is being run on.

Actually, marking array(*) has probably a similar effect to declaring it contiguous: the compiler may turn on vectorization more easily because it knows that all elements are contiguous at compile time rather than at runtime (the assumed-shape input(:) will carry over a stride descriptor that is not necessarily 1)

2 Likes

Thanks very much for trying the assumed-size (or explicit-shape) version! I think it is very likely & reasonable that the F77 style will give the best performance. And then, my next question is how well various compilers can come close to that limit with assumed-shape arrays (hopefully without the contiguous attribute, e.g., by dynamically recognizing the contiguity of the passed array (e.g. via some info in array descriptors…?).


BTW, I noticed that the result of CompilerExplorer varies significantly from run to run, sometimes over a factor of 3 (!!), so not very much suited for comparing the performance of different compilers / options from different runs… I think this is also because the problem size & time is too small. Initially, I used ndat = 10 ** 7 on my PC, but the job was killed by CompilerExplorer due to too much allocation (seemingly). So, I now added an external loop for performing the same calculation nloops = 20 times. (But here again, CompilerExplorer kills the calculation if the run time is “too long”, so I cannot use, say nloops = 100. So, that site may not be very good for performance tests, as expected :sweat_smile:


With the updated code here with nloops = 20, the timing is like below (with gfortran-14.2), but the result still varies to some extent. (BTW, by pressing the round-arrow mark at the bottom of the page, we can redo the calculation repeatedly to roughly estimate the mean value, if necessary.)

gfortran-14.2 -O3 -march=native
(Note: (1) `contiguous` already in the code; (2) timing still varies considerably
---- 1d ----
 ans  =    26873446.303585552     
 time =   0.17595700000000000     
 ---- 2d ----
 ans  =    6873446.3035847349     
 time =   0.13273800000000002     
 ---- 2d - assumed size ----
 ans  =    6873446.3035847349     
 time =   0.12937199999999999     
 --- 2d_loop ---
 ans  =    6873446.3035847349     
 time =   0.13063300000000000 

Code without contiguous (nloops = 20)
(Note: Results still varies significantly from run to run.)

---- 1d ----
 ans  =    26872730.721662957     
 time =   0.12188599999999999     
 ---- 2d ----
 ans  =    6872730.7216632087     
 time =   0.26507399999999998     
 ---- 2d - assumed size ----
 ans  =    6872730.7216632087     
 time =   0.13548399999999999     
 --- 2d_loop ---
 ans  =    6872730.7216632087     
 time =   0.28365899999999999  

When contiguous is missing, ifort and ifx automatically generate two separate versions of the code to which the program dispatches at runtime, depending on whether the input argument is contiguous. gfortran does not do so (in versions that I have experimented with). This makes the interface design somewhat difficult regarding the contiguous attribute because the code performance depends on the compiler choice. I have chosen to add contiguous attribute in all my source codes (preferring good performance with gfortran at all times) and add extra interfaces where non-contiguity matters. I confess it bloats the source code, but no better alternative exists. Here is an example of a generic interface that ensures input argument contiguity through extra interfaces that take optional row and column offset arguments (roffA, coffA, …).

1 Like

A runtime test of the contiguity could indeed do, but the complexity and size of the binary code may grow up exponentially with the number of arrays that are passed to the routine.

1 Like

From my experience, it does when -O3 is set (perhaps other heuristics also come into play). With -fopt-info the optimizer emits a message like:

/app/example.f90:4:15: optimized: versioned this loop for when certain strides are 1

I posted an example here: What's the purpose of array size inside subroutine arguments? - #8 by ivanpribec

It goes without saying that machine architecture-specific optimizations need to be enabled too, -march=..., for vectorization to come into play.

2 Likes

Very interesting tests! But do you mean assumed size or explicit shape arrays?

1 Like

It would be really interesting and significant if it did! But I am positive that gfortran lacked this capability when I tested it about two years ago. The compiler may have improved since then. I used all relevant optimization flags at the time. But then again, I did not check the output optimization report mentioned in your comment; I only looked at the final benchmark results.

1 Like

This almost certainly depends on whether the actual argument is itself contiguous. If the actual argument is not contiguous, then the argument association involves expensive copy-in/copy-out steps. If the number of arithmetic operations per each vector element is small, then this overhead will dominate the overall effort.

If the actual argument is not contiguous, then it might well happen that an assumed shape declaration without the contiguous attribute results in the optimal code.

Then there is the question of whether the compiler knows that the actual argument is contiguous at compile time, or if it can only be determined at run time. The contiguous attribute is not something that can be used in the TKR resolution of generic procedures. If that attribute could be used in that way, then the compiler, either at compile time or at run time, could identify which specific procedure to call, and it would even give the programmer control over the various algorithm choices that the contiguous and noncontiguous versions of the procedure could use internally.

So the various cases can be complicated:

  1. compile-time knowledge of the actual argument.
  2. run-time knowledge of the actual argument.
  3. compile-time knowledge of the dummy argument (explicit shape, assumed size, contiguous attribute).
  4. run-time knowledge of the dummy argument by the compiler.
  5. explicit testing within the procedure by the programmer to adapt/switch algorithms depending on the contiguous nature of one or more dummy arguments. This could use both compile-time and run-time information of the arguments.
2 Likes

My experience of similar tests is that ifort often produces slower performance with the Fortran-77 old-style assumed-size (*) method, less so with Gfortran.

I would also like to recommend that SYSTEM_CLOCK is preferable / more accurate than CPU_TIME, especially for tests that take less than 0.015 seconds.

Based on this thread, I will now try tests of the contiguous attribute, especially to identify the array types where this may help.

3 Likes