Optimizing vectorized array operations

@PierU @sshestov @ivanpribec @VladimirF @JohnCampbell @certik
Thank you very much for all of your answers and your help! I can now see that the difference in times comes from the number of times I update the arrays. My general question is whether, in my understanding, simple operations like setting an array to zero or adding one array to another are expected to take this long.
To clarify: in the MRE I perform all operations 1e5 times because, ultimately, the final code needs to handle up to 5e10 iterations. I am therefore trying to optimize every possible aspect to achieve the best performance, and currently these two operations (line_pack = 0.0d0 and pack_tot = pack_tot + line_tot) are a huge bottleneck.

So the original question shifts from “Optimizing vectorized array operations” to “Why are two nested loops slow?”. I’ve commented out almost everything, keeping only the random_number call in the outer loop, and the runtime stays at ~20 s.

Though I haven’t understood the meaning of the calculation, might it be useful to change line_pack(:) to an array of size N_samples, so that only the necessary part of pack_tot(:) is updated (to reduce the overall memory traffic)?

!! test1.f90
    subroutine sample_lines(N_lines_per_pack, N_grid_pack, S_pack, pack_tot)

        implicit none

        ! input parameters
        integer, intent(in) :: N_lines_per_pack, N_grid_pack

        real(8), intent(in), dimension(N_lines_per_pack) :: S_pack
        real(8), intent(inout), dimension(N_grid_pack) :: pack_tot

        ! temporary array
        integer, parameter :: N_samples = 1000
        real(8) :: line_pack( N_samples )

        ! floats for the sampling elements
        real(8) :: w, tot, rand_num, tmp

        ! integers needed for loops and indexing
        integer :: i, j, k, rand_index

       !------------------------------
        do i = 1, N_lines_per_pack

            call random_number( rand_num )

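            ! random start of the output window; the (N_grid_pack - N_samples - 1)
            ! range keeps every rand_index + j within the bounds of pack_tot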
            rand_index = int( rand_num * (N_grid_pack - N_samples - 1) ) + 1

            tot = 0.0d0

            do j = 1, N_samples

                w = 1.0d0  !! may be j-dependent

                line_pack( j ) =  w

                tot = tot + w
            end do

            tmp = S_pack( i ) / tot

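            ! scatter: accumulate the normalized line profile into a random
            ! N_samples-wide window of pack_tot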
            do j = 1, N_samples

                k = rand_index + j

                pack_tot( k ) = pack_tot( k ) + tmp * line_pack( j )
            end do

        end do
        !------------------------------

    end subroutine

program main
    !! .....

    integer nseed
    integer, allocatable :: seed(:)
    call random_seed( size=nseed )
    allocate( seed( nseed ), source=100 )
    call random_seed( put=seed )

    ! Time the computation
    call cpu_time(t1)
    call sample_lines(N_lines_per_pack, N_grid_pack, S_pack, pack_tot)
    call cpu_time(t2)

    write(*,*) 'Time: ', t2 - t1
    print *, pack_tot( N_grid_pack/2 ), sum(pack_tot)

end program    

Or, if the index is more complicated, it might be useful to use a separate index array (to reduce overall memory access, at the cost of indirect addressing).

!! test2.f90
        ...
        integer :: grid_index( N_samples )

        !------------------------------
        do i = 1, N_lines_per_pack

            call random_number( rand_num )

            rand_index = int( rand_num * (N_grid_pack - N_samples - 1) ) + 1

            tot = 0.0d0

            do j = 1, N_samples

                w = 1.0d0  !! may be j-dependent

                grid_index( j ) = rand_index + j   !! may be more complicated
                line_pack( j ) =  w

                tot = tot + w
            end do

            tmp = S_pack( i ) / tot

            do j = 1, N_samples

                k = grid_index( j )

                pack_tot( k ) = pack_tot( k ) + tmp * line_pack( j )
            end do

        end do
        !------------------------------
        
    end subroutine

The timing on my PC (Ubuntu) becomes shorter, but I hope I am not doing something fundamentally wrong in the code… (e.g., something being optimized away).

$ gfortran-11 -O3 test1.f90 && ./a.out
 Time:   0.12617700000000001     
  0.12500000000000008        100000.00000001158   

$ gfortran-11 -O3 test2.f90 && ./a.out
 Time:   0.15730600000000000     
  0.12500000000000008        100000.00000001158 

$ gfortran-11 -O3 orig_code.f90 && ./a.out
 Time:    56.627208000000003     
  0.12500000000000008        100000.00000001158 

Sadly, I have also found this to be the case myself. Array operations for very simple things, like small fixed-size arrays doing basic arithmetic, are usually optimized well. Beyond that… it’s kind of a crap shoot.

2 Likes

This is why I never understood the enthusiasm for array syntax. Once performance becomes non-negotiable (as in HPC), you end up rewriting everything with C-style loops and manual memory management.

2 Likes

I tried a few approaches to improve speed, with minimal effect, including:

  • moving the arrays pack_tot(), S_pack() and line_pack() to the module,
  • modifying sample_lines to introduce subroutines for do_inner_loop and DAXPY,
  • introducing a subroutine accumulate_ticks (id) to report timing.

All of these had minimal effect (gfortran is smarter than we give it credit for).
-O3 includes AVX instructions, so try -O1 to check!

However, to improve the repeated “line_pack = 0.0d0”, I noted this could become
“line_pack(rand_index+1:rand_index+N_samples) = 0.0d0”,
with a similar change to DAXPY, giving a dramatic change to run times.

Utilising sparsity has been an important approach for solution efficiency.

Is this an OK fix?

! set options=-O3 -march=native -ffast-math -funroll-loops --param max-unroll-times=4
! gfortran %1.f90 %options% -o %1.exe
 
  module calc_line_profile

    implicit none
    integer, parameter :: r8 = kind (1.0d0)

!  place key arrays in module and allocate once
    real(r8), allocatable :: pack_tot(:), S_pack(:)

  ! temporary line array
    real(r8), allocatable :: line_pack(:)

    integer*8 :: tick_sum(5), tick_count(5)
    integer*8 :: clock, clock_rate, last_tick
    character*15 :: stages(6) = [ 'initialise     ', &
                                  'zero line_pack ', &
                                  'do_inner_loop  ', &
                                  'add pack_tot   ', &
                                  '               ', &
                                  'Total          ' ]

contains

    real(r8) function elapsed_time ()
      integer*8 :: clock, clock_rate
        call system_clock ( clock, clock_rate )
        elapsed_time = dble (clock) / dble (clock_rate)
    end function elapsed_time

    subroutine accumulate_ticks (id)
      integer :: id, K
      real*8  :: sec

      select case (id )
        case (-1)
          call system_clock ( last_tick, clock_rate )
          tick_sum   = 0
          tick_count = 0
        case (-2)
          do k = 1, size(tick_count)
            if ( tick_count(k) < 1 ) cycle
            sec = dble(tick_sum(k)) / dble (clock_rate)
            write (*,11) k, stages(k), sec
          end do
            write (*,11) k, stages(k), sum(tick_sum) / dble (clock_rate)
   11       format ( i4,2x,a,f12.6)
        case (1:10)
          call system_clock ( clock )
          tick_count(id) = tick_count(id) + 1
          tick_sum(id)   = tick_sum(id)   + (clock-last_tick)
          last_tick      = clock
      end select
    END subroutine accumulate_ticks

    subroutine sample_lines (N_lines_per_pack, N_grid_pack )

        implicit none

        ! input parameters
        integer, intent(in) :: N_lines_per_pack, N_grid_pack

!z mod       real(r8), intent(in), dimension(N_lines_per_pack) :: S_pack
!z mod       real(r8), intent(inout), dimension(N_grid_pack)   :: pack_tot

        ! temporary line array
!z mod        real(r8), dimension(N_grid_pack) :: line_pack

        ! floats for the sampling elements
        real(r8) :: tot, rand_num, s_pack_i, norm_factor

        ! integers needed for loops and indexing
        integer :: i, N_samples, rand_index

        line_pack = 0.0d0

        do i = 1, N_lines_per_pack

                N_samples = 1000

            !  select start position in line_pack
                call random_number (rand_num)
                rand_index = int (rand_num * (N_grid_pack - N_samples - 1)) + 1
                 call accumulate_ticks (2)

                call do_inner_loop ( line_pack(rand_index+1), N_samples, tot )
                 call accumulate_ticks (3)

                s_pack_i = S_pack(i)
                norm_factor = S_pack_i / tot

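                ! scatter-accumulate norm_factor * line_pack into the same window of pack_tot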
                call DAXPY ( pack_tot(rand_index+1), line_pack(rand_index+1), norm_factor, N_samples )
                 call accumulate_ticks (4)

                line_pack(rand_index+1:rand_index+N_samples) = 0.0d0 ! time in call accumulate_ticks (2)

        end do
        
    end subroutine sample_lines

end module calc_line_profile

program main
    use calc_line_profile
    implicit none

    integer :: N_grid_pack, N_lines_per_pack

    real(r8) :: t1, t2, e1, e2
    
                     call accumulate_ticks (-1)

    N_lines_per_pack = 100000         ! Amount times the outer loop is executed

    N_grid_pack      = 800000         ! Size of the arrays that are updated

    allocate ( pack_tot(N_grid_pack) )
    allocate ( line_pack(N_grid_pack) )

    allocate ( S_pack(N_lines_per_pack) )

    pack_tot = 0.0
    S_pack = 1.0
                 call accumulate_ticks (1)
    
    ! Time the computation
      call cpu_time(t1)
      e1 = elapsed_time ()
      
    call sample_lines ( N_lines_per_pack, N_grid_pack )
    
      call cpu_time(t2)
      e2 = elapsed_time () - e1

    write(*,*) 'Time: ', t2 - t1, e2
                 call accumulate_ticks (-2)

end program main

    subroutine do_inner_loop ( line_pack_j, N_samples, tot )
       use calc_line_profile
        implicit none

        real(r8) :: line_pack_j(*)
        integer  :: N_samples
        real(r8) :: tot

        integer  :: j
        real(r8) :: w

        tot = 0.0d0

        do j = 1, N_samples

            w = 1.0d0

            ! for the fast version, comment out the next line and uncomment the one after
            line_pack_j(j) = line_pack_j(j) + w
            ! line_pack_j(j) = w   ! plain store; equivalent here because line_pack is re-zeroed after each use

            tot = tot + w

        end do

    end subroutine do_inner_loop

    subroutine DAXPY ( Y, X, a, n )
!
!   Performs the vector operation  [Y] = [Y] + a * [X]
!
      implicit none
      integer n
      real*8  Y(n), X(n), a
! 
      if ( n == 1 ) then
         Y(1) = Y(1) + a * X(1)

      else if ( n > 0 ) then
         Y = Y + a * X

      end if

    end subroutine DAXPY

 Time:    6.2500000000000000E-002   5.2247499999793945E-002
   1  initialise         0.000198
   2  zero line_pack     0.015773
   3  do_inner_loop      0.018354
   4  add pack_tot       0.018128
   6  Total              0.052454
1 Like

Very interesting! I have some very basic questions:

  1. Why put the two subroutines DAXPY and do_inner_loop outside of the module? I learned on this website that best practice is always to put all subroutines and functions in modules.

  2. Why use an assumed-size array when passing line_pack_j to the subroutine? I thought assumed-size arrays were not good practice.

Thanks!

I placed them outside the module because I did not want to check for conflicts with variable scope through the module. The only variables transferred into “do_inner_loop” were via the argument list. Similarly, DAXPY is a library routine I use elsewhere, outside a module.

I wanted to simplify the argument list being used. I am not sure about “good practice”, as both approaches can be used to advantage in the appropriate situation. N_samples is a routine argument (in these F77 wrapper approaches).
However, the alternative is quite verbose! (I wouldn’t use “rand_index” as a subscript name.)

call DAXPY ( pack_tot(rand_index+1:rand_index+N_samples), &
             line_pack(rand_index+1:rand_index+N_samples), norm_factor, N_samples )

perhaps use:
j1 = rand_index+1
jn = rand_index+N_samples

The alternatives you imply by your questions are valid approaches.
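For reference, here is a sketch of the assumed-shape alternative implied by your second question (daxpy_section is an illustrative name, assumed to live in the module so r8 is in scope; it is not the routine timed above):

    ! assumed-shape version: the extent travels with the array descriptor,
    ! so no explicit n argument is needed
    subroutine daxpy_section ( Y, X, a )
      real(r8), intent(inout) :: Y(:)
      real(r8), intent(in)    :: X(:)
      real(r8), intent(in)    :: a
      Y = Y + a * X
    end subroutine daxpy_section

    ! call site, with the verbose bounds hoisted into j1/jn:
    ! j1 = rand_index + 1
    ! jn = rand_index + N_samples
    ! call daxpy_section ( pack_tot(j1:jn), line_pack(j1:jn), norm_factor )

Because the sections are contiguous, no copy-in/copy-out should be needed; the cost is passing an array descriptor instead of a bare address.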

1 Like

This is a rather harsh assessment of Fortran array syntax. I do not agree.

With large arrays in OpenMP, where memory bandwidth is the limit, I am not aware of any benefit from “rewriting everything with C-style loops and manual memory management”. Remember that subscript order differs between C and Fortran; getting this wrong can destroy performance.

As a coding approach, I find replacing Fortran inner DO loops with Fortran array syntax to be helpful for efficiency and for generating SIMD instructions.
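As a minimal illustration of both points (the array names here are mine):

    ! Fortran is column-major: the first subscript varies fastest in memory,
    ! so the inner loop (or array section) should run over the first index
    do j = 1, m
        a(:,j) = a(:,j) + alpha * b(:,j)   ! unit stride; readily vectorized with SIMD
    end do

Swapping the subscript order, i.e. updating a(j,:) in the inner position, strides through memory and can be dramatically slower.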

6 Likes

This is also my experience. Array indexing and “functional” Matlab-style operations are great for prototyping, because they give you decent performance that is still far better than that of scripting languages, but for efficiency you have to go back to no allocations, explicit loops, and possibly reference-based arguments.

1 Like

A tool to translate array operations into efficient loops would be good. Maybe LLMs can already do this. Modern Fortran programmers should not have to manually translate their code back to Fortran-77-style code.

2 Likes

I think you’re right. I am no compiler expert, but it seems like most compilers nowadays optimize against some dialect of C, or some similar intermediate language, that all language “frontends” translate to.

Imagine optimizing a code that uses a find function like this:

pure function find(mask)
   logical, intent(in) :: mask(:)
   integer, allocatable :: find(:)
   integer :: j
   find = pack([(j,j=1,size(mask))],mask)
end function find

That can always be unrolled into a loop. However, the compiler must:

  1. access find’s contents (it must be inlinable, i.e., it must be in the same module),
  2. understand that the implicit loop in pack is just a loop (OK, we can assume this was manually coded as a loop instead),
  3. recognize that find returns an allocatable variable, remove that variable, and replace it with a loop,
  4. and do all of this in the Fortran frontend.

Not even mentioning higher dimensions.

So, not exactly easy. But the hardest part is that the compiler must know the contents of find outside of the conventional build->object->link workflow, and as far as I know, lfortran is the only compiler that is designed to do that.
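For concreteness, here is roughly the loop a compiler would have to synthesize from find (a hand-written sketch; find_loop and idx are my names):

    pure function find_loop(mask) result(idx)
       logical, intent(in) :: mask(:)
       integer, allocatable :: idx(:)
       integer :: j, k
       ! one allocation up front, then a plain counted fill loop
       allocate ( idx(count(mask)) )
       k = 0
       do j = 1, size(mask)
          if ( mask(j) ) then
             k = k + 1
             idx(k) = j
          end if
       end do
    end function find_loop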

But array syntax should in principle be great for HPC. When using array syntax, you are implicitly telling the compiler that the loop iterations are independent and can be parallelized: it’s like using a forall or do concurrent construct.

Compare

do i = 1, size(xvec)
   yvec(i) = sin(xvec(i))
enddo

to

yvec = sin(xvec)

Indeed, the introduction to the famous “Numerical Recipes in Fortran 90”, written in the 90s, highlights array syntax as an important step towards writing parallel code.

It’s a real pity that compilers haven’t adapted to take advantage of this.

Moreover, array syntax is also an advantage of Fortran over C: it is really useful for programmers who come from Matlab or Python.

In my non-expert opinion, close to the core of the issue is always going to be the fact that anything higher-level (e.g., array operations) is simply not how any current hardware functions.

This means that using such constructs levies an additional translation step on the compiler, which must turn the language syntax into semantics that can be executed in a largely serial manner on the CPU. Starting from a loop simply means starting closer to what the CPU will ultimately end up doing, as I understand it.

I am therefore led to believe that, theoretically, a language offering intrinsic types that more closely resemble the intended executing hardware would have an easier time achieving maximum performance. For the x86 architecture today, that means vector types 128, 256, and 512 bits long.

I am still an amateur Fortranner, so I can be totally off 🙂. From my C/C++ experience, the dominant bottleneck in numerical codes for PDE modeling is memory access. Once you start optimizing for L1/L2 cache utilization, avoiding memory-page contention across OpenMP threads, etc., you gravitate toward C-style loops. This way you can predict data movement from RAM to the CPU caches. It requires fine-grained control over memory layout and access patterns, which in my view makes array syntax less critical for HPC. I appreciate its expressiveness; it’s a nice feature, just not absolutely essential for the performance-critical sections.
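As one concrete instance of that kind of control (a minimal sketch; the sizes are illustrative), a blocked transpose keeps each tile cache-resident while it is read along one index and written along the other:

    integer, parameter :: n = 4096, bs = 64   ! illustrative sizes
    real(8) :: a(n,n), b(n,n)
    integer :: i, j, ii, jj

    do jj = 1, n, bs
        do ii = 1, n, bs
            do j = jj, min(jj+bs-1, n)
                do i = ii, min(ii+bs-1, n)
                    b(j,i) = a(i,j)           ! each bs-by-bs tile stays in cache
                end do
            end do
        end do
    end do

Tiling like this is straightforward with explicit loops but awkward to express in pure array syntax.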

1 Like

This isn’t exactly correct, and I think this has been one of the problems with array syntax since it was introduced in f90. Array syntax and forall both adopt the semantics that the rhs of the expression is fully evaluated before the assignment to the lhs. That sometimes requires the compiler to allocate temporary array storage in order to evaluate the rhs expression. This is a relatively expensive operation, sometimes requiring more machine cycles for the allocation and deallocation steps than the array operations themselves. In some cases, a clever compiler can determine that only scalar register storage is required and can avoid the memory allocation overhead, but sometimes not, particularly when the array on the lhs appears in the expression on the rhs.

In contrast, with do concurrent the programmer tells the compiler that the rhs can be evaluated in any order and that the implied loop operations within the rhs expression are independent. That seems like only a minor detail, but it has had enormous consequences over the last 40 years or so since f90 was introduced. Ironically, in the 1980s, many f77 compilers had directives for normal do loops to tell the compiler that the loop iterations were independent, and we all knew from experience the remarkable performance improvements that were possible with that approach on the vector and pipelined scalar hardware of that era. Hindsight is 20/20, but I think the compiler writers of that era were too optimistic about the performance that would be achieved from array syntax and forall. Of course, compilers did get better over time since then, but that was why do concurrent was added to the language (f2008), and it is why do concurrent, rather than array syntax or forall, is now the primary focus for offloading instructions to GPU hardware.
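A minimal sketch of that overlap case (the array names are illustrative):

    ! the rhs is evaluated fully before the assignment; because a appears on
    ! both sides with shifted indices, the compiler may allocate a temporary
    a(2:n) = a(1:n-1) + b(2:n)

    ! the equivalent explicit loop needs no temporary, but it is only correct
    ! when run in an order that reads each a(i-1) before overwriting it
    ! (here, descending); a do concurrent version of this loop would be
    ! invalid, because the iterations are not independent
    do i = n, 2, -1
        a(i) = a(i-1) + b(i)
    end do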

This was one of many examples where the compiler writers could not achieve the imagined performance goals of array syntax. One of the reasons that final approval of f8x/f88/f90 took so long was that established vendors had invested much effort into optimizing standard fortran do loops, and they feared that array syntax would give their new competitors (and there were many of them in the 1980s) an unfair shortcut to achieving optimal performance on the new, cheap, vector and parallel hardware. So by slowing the fortran standards process, they could delay what they considered to be unfair market competition. One of the (presumably) unintended consequences of that was the fall of fortran from the dominant computing language in science, engineering, numerical analysis, statistics, and high-performance computing to what it is today.

I don’t agree with this at all. Array operations, including such things as the level-1, -2, and -3 BLAS, have always been easy to map to whatever simd or spmd hardware is available. In the past, that included pipelined scalar hardware such as the various RISC cpus, low-level fused multiply-add instructions, vector hardware such as CRAY, Convex, and Cyber machines, AltiVec and SSE hardware, and for the last 20 years or so GPU hardware.

I agree with the first part of that statement, that it does require the compiler to translate the code to the hardware, but not the last part. The translated code is not largely serial in nature; it is inherently parallel, meaning both multiple functional units (fp adds, fp multiplies) operating simultaneously and also multiple threads (e.g. OpenMP) and multiple processes (MPI, coarrays, etc.) running simultaneously.

3 Likes

Thanks for your comments, very interesting!

Besides do concurrent, what are other ways to “tell” a compiler that the loop iterations are independent?

You’re right, it is significantly incorrect to say “largely serial on the CPU.” They are very parallel nowadays, even within a single core, with ILP and all of that being as complicated as it is.

I still think that a language designed today to operate on vectors would end up looking rather different from the Fortran we have at the moment. All of the optimized BLAS implementations are hardware-specific, handwritten assembly kernels at the lowest level, and there doesn’t seem to be much effort currently going into designing a higher-level language that reliably compiles to match that.

1 Like

Here is a link to OpenBLAS, one of the projects that attempt to produce optimized BLAS routines through automatic tuning rather than handwritten code: OpenBLAS: An optimized BLAS library. I think in the future AI technology will be used more often for these kinds of coding projects.

As for high-level languages that compile directly to optimal code, that has been the challenge, for fortran in particular, since such languages were invented in the 1950s. However, I would say that in the last 20-30 years, much emphasis has shifted away from optimal performance to other goals, such as clarity, or reusability with object-oriented methods, and so on. Sometimes these goals conflict with each other, in which case optimal performance is not achieved. There has also been a trend among programmers, especially in the last couple of decades with more powerful computers, that many projects simply do not need optimal performance; they perform fast enough, so why put in any extra effort to improve them? That is why scripting languages, such as perl, python, matlab, mathematica, and so on, have become popular.

I think in the future that AI technology will be used more often for these kinds of coding projects.

I’ve just recalled this article (about the usage of ML for matrix multiplication):

Discovering faster matrix multiplication algorithms with reinforcement learning (2022)
https://www.nature.com/articles/s41586-022-05172-4