Optimizing vectorized array operations

Hi everyone,

I’m working on optimizing a Fortran code that performs calculations on large double-precision arrays (around 800,000 elements). The code includes loops that calculate different weights that are added to a temporary array. After the loop, I scale and add the temporary array to the permanent array:

pack_tot(:) = pack_tot(:) + (S_pack / tot) * line_pack(:)

and set the temporary array to zero:

line_pack(:) = 0.0d0

These two operations, especially the first one, take the most time in my calculation, even though I perform many other operations to calculate the weights that I add to the temporary array. To minimize this time, I tried aggressive optimization and vectorization of the code. I compile it using

gfortran -O3 -ftree-vectorize -funroll-loops -ffast-math -march=native

and checked with “-fopt-info-vec” whether the line actually gets vectorized, and it does. However, even with vectorization enabled, the performance remains very slow (see the time difference below). If I skip the addition of the temporary array to the permanent array and use the permanent array directly in every calculation, the code becomes 25 times faster (from 21 s to 0.8 s). Additionally, I tried precomputing the scaling factor, but that didn’t help. I also ran the code in a different environment and still encountered the same issue.
Questions:
Do you know what the issue could be?
Could this be related to memory bandwidth limitations or cache inefficiencies due to the large size of the arrays?
I am using an M2 pro chip (ARM architecture) and GNU Fortran (Homebrew GCC 14.2.0_1) 14.2.0.
Many thanks in advance and any suggestions would be very appreciated!

1 Like

Welcome to the forum!

I have no immediate answer to your question, but what happens if you leave out the “(:)” bit in the above statements? That would mean that you work on the entire arrays, instead of sections that just happen to cover the whole arrays. It may be that the compiler can optimise that better. Just a thought, mind you.
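
That is, the two statements from the original post would simply become:

    pack_tot = pack_tot + (S_pack / tot) * line_pack
    line_pack = 0.0d0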

2 Likes

It’s quite difficult to analyse this performance issue without seeing more code and context. A possible guess and related questions:

  • is your temporary array an allocatable or a pointer array?
  • if yes, do you frequently allocate/deallocate it?

If the answer to these 2 questions is “yes”, then that is a likely explanation. Without going into details, allocating an array has a cost that is not negligible (*), and this cost can become dominant if only a few operations are performed on the array.

(*) in practice the cost is not the allocation itself, but the first access to each element of the array.

allocate( a(10**6) )   ! the OS just takes note of the request
                       ! but does not assign any physical memory at this point
a(1) = 0.0   ! a(1) is not in physical memory yet, so it triggers a "page fault":
             ! the OS finds and assigns a new memory page.
             ! A page is usually 4 kB, so it can hold
             ! the elements a(1:1000) (if the type of a(:) is a default real/integer)
a(2) = 0.0   ! a(2) is already in physical memory, so this one is fast
...
a(1001) = 0.0   ! a(1001) is not in physical memory, so page fault again, etc.
2 Likes

Hey, thank you very much! I tried that, but it did not change anything.

What happens if you write your code as a do loop rather than with array operations?
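
For instance, something along these lines (just a sketch; the scalar name norm_factor is only illustrative):

    norm_factor = S_pack / tot
    do k = 1, size(pack_tot)
        pack_tot(k) = pack_tot(k) + norm_factor * line_pack(k)
    end do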

3 Likes

I recommend creating an MRE (minimal reproducible example) of this performance issue. Then people can try it out (with different compilers as well) and figure out how to fix it.

2 Likes

800,000*8 = 6.4 Megabytes (Not really a lot, but still)

You are loading 6.4 MB for line_pack, another 6.4 MB for the pack_tot. Then you are writing 6.4 MB for pack_tot. Then you are writing 6.4 MB for line_pack.

In total, you are reading 12.8 MB and writing 12.8 MB.

6.4 MB is roughly 100,000 cache lines. 6.4 MB may or may not fit in the L3 cache, depending on the hardware. I’d say you are probably just beyond the point where this becomes a memory-bound problem.

My first thought is to do ‘loop fusion’: rewrite the statements as loops, and then combine the two loops together. This way, you don’t have to reload all 100,000 cache lines a second time just to reset the temporary array to 0.

If s_pack and tot are scalars - to be safe, precompute the division and use that instead. Then see if the vectorization is turning this into an FMA instruction.
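
Combining the two ideas, a rough sketch (the scalar name factor and the size n are only illustrative, and s_pack and tot are assumed to be scalars):

    factor = s_pack / tot        ! division precomputed once
    do k = 1, n
        pack_tot(k) = pack_tot(k) + factor * line_pack(k)   ! candidate for an FMA
        line_pack(k) = 0.0d0     ! reset in the same pass, no second sweep over the array
    end do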

As others have said, make sure the temporary array is only allocated once for the entire run of the program; repeated allocation will kill your performance for sure.

Is there some downside to using the final destination array directly?

5 Likes

This is a very good suggestion. I had similar issues and I would be happy to try out an MWE on my Fortran setup.

1 Like

@certik @aledinola @Arjen @rfarmer @PierU @dwwork
Thank you so much for your quick reactions and suggestions, that is really amazing! Some answers: I do not frequently allocate the array, so I don’t think that should be the problem. Apart from that, using a loop instead of the array operation does not speed up the code, not even if I precompute s_pack / tot.
I need the temporary array because I want to perform loop-index-specific normalization and other operations on it after computing it. Therefore, I cannot use the permanent array for everything.
I created an MWE, which you can see below. I hope it works for you and is what you imagined; otherwise, I am happy to change it!

module calc_line_profile

    implicit none
contains

    subroutine sample_lines(N_lines_per_pack, N_grid_pack, S_pack, pack_tot)

        implicit none

        ! input parameters
        integer, intent(in) :: N_lines_per_pack, N_grid_pack

        real(8), intent(in), dimension(N_lines_per_pack) :: S_pack
        real(8), intent(inout), dimension(N_grid_pack) :: pack_tot

        ! temporary line array
        real(8), dimension(N_grid_pack) :: line_pack

        ! floats for the sampling elements
        real(8) :: w, tot, rand_num, s_pack_i, norm_factor

        ! integers needed for loops and indexing
        integer :: i, j, k, N_samples, rand_index, index_del_nu1


        do i = 1, N_lines_per_pack

                tot = 0.0d0

                ! for 'fast' version comment the next line
                line_pack = 0.0d0

                N_samples = 1000
                s_pack_i = S_pack(i)

                call random_number(rand_num)

                rand_index = int(rand_num * (N_grid_pack - N_samples - 1)) + 1

                do j = 1, N_samples

                    w = 1.0d0
                    index_del_nu1 = rand_index + j

                    ! for fast version comment the next line and uncomment the line after
                    line_pack(index_del_nu1) = line_pack(index_del_nu1) + w
                    !pack_tot(index_del_nu1) = pack_tot(index_del_nu1) + w

                    tot = tot + w

                end do

                ! for 'fast' version comment the next line, this is the most expensive operation
                pack_tot = pack_tot + (s_pack_i / tot) * line_pack


                ! Version to use precomputed values and loop instead of array operations

                !norm_factor = (S_pack_i / tot)
                !do k=1, N_grid_pack
                !    pack_tot(k) = pack_tot(k) + norm_factor * line_pack(k)
                !end do

        end do
        
    end subroutine sample_lines

end module calc_line_profile


program main
    use calc_line_profile
    implicit none

    integer :: N_grid_pack, N_lines_per_pack
    real(8), allocatable :: pack_tot(:), S_pack(:)

    real(8) :: t1, t2
    
    ! Amount times the outer loop is executed
    N_lines_per_pack = 100000
    
    ! Size of the arrays that are updated
    N_grid_pack = 800000

    allocate(pack_tot(N_grid_pack))
    allocate(S_pack(N_lines_per_pack))

    pack_tot = 0.0
    S_pack = 1.0
    
    ! Time the computation
    call cpu_time(t1)
    call sample_lines(N_lines_per_pack, N_grid_pack, S_pack, pack_tot)
    call cpu_time(t2)

    write(*,*) 'Time: ', t2 - t1

end program main

And I compile that with

gfortran -O3 -ftree-vectorize -funroll-loops -ffast-math -march=native

I hope that the example makes sense to you and is useful. Thanks again so much for your help!!

3 Likes

The reason in this MRE is actually obvious :slight_smile:

At each iteration of the outer loop on i:

  • In the fast version, the inner loop on j is executed only N_samples=1000 times, which means that only 1000 elements of pack_tot are updated.
  • In the slow version all the N_grid_pack=800000 elements of pack_tot are updated (and the whole line_pack is set to zero, although you really need only 1000 elements).
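
For illustration, a minimal sketch of what this implies, reusing rand_index, N_samples, s_pack_i and tot from the MRE: only the window that actually received weights needs to be scaled, added, and reset:

    pack_tot(rand_index+1 : rand_index+N_samples) = &
        pack_tot(rand_index+1 : rand_index+N_samples) + &
        (s_pack_i / tot) * line_pack(rand_index+1 : rand_index+N_samples)
    line_pack(rand_index+1 : rand_index+N_samples) = 0.0d0
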
2 Likes

@David_Haegele let us know if @PierU’s suggestion fixes the issue. If not, and this MRE is not the right one, go ahead and send us another MRE that doesn’t have this issue but is still slow.

1 Like

You should consider monitoring elapsed (wall-clock) time using SYSTEM_CLOCK, as this may show the improvement you are looking for.

  real(8) function elapsed_time ()
    integer(8) :: clock, rate     ! 64-bit integers
    call system_clock ( clock, rate )
    elapsed_time = dble (clock) / dble (rate)
  end function elapsed_time
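
A possible way to use it around the existing call in the main program (assuming the function is accessible there):

    t1 = elapsed_time ()
    call sample_lines(N_lines_per_pack, N_grid_pack, S_pack, pack_tot)
    t2 = elapsed_time ()
    write(*,*) 'Wall-clock time: ', t2 - t1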

I recommend avoiding using the word vectorization in two different meanings here. You use it

  1. for array operations in array notation,
  2. for SIMD CPU operations when calling the compiler with -ftree-vectorize and examining what gets vectorized.

These are really two completely different things, and that may give you the false impression that the array operations (1) should somehow be faster, as they are in interpreted languages such as Python. However, the opposite is often true: raw loops can be fast in Fortran, and the compiler may be able to vectorize them (in the SIMD sense) perfectly well. Some forms of array notation may make it even harder for the compiler to figure out how to do the SIMD vectorization.

If you are trying to make the Fortran code more concise and easier to read, array operations (sense 1) are a great thing to go for. If you are trying to make the code faster, they are a dubious way to achieve that: loops are usually at least as fast, and faster in some more complicated cases.

1 Like

I would like to add that in theory array expressions are meant to be SIMD vectorizable, see (J3/24-007, 5.4.6, paragraph 2):

Any intrinsic operation defined for scalar objects may be applied to conformable objects. Such operations are performed elementally to produce a resultant array conformable with the array operands. If an elemental operation is intrinsically pure or is implemented by a pure elemental function (15.9), the element operations can be performed simultaneously or in any order.

and (J3/24-007, 10.1.4, paragraph 5):

When an elemental unary operator operates on an array operand, the operation is performed element-by-element, and the result is the same shape as the operand. If an elemental operation is intrinsically pure or is implemented by a pure elemental function (15.9), the element operations may be performed simultaneously or in any order.

In practice performance can vary depending on the optimization capabilities of a particular compiler.

1 Like

That is correct, but that rule also makes array temporaries necessary for some more complicated expressions, as the right-hand side has to be fully evaluated before assigning.

I think that in practice array expressions have always been harder to optimize for existing compilers. For simpler cases the compilers can optimize them as well as loops, but it is dubious that array expressions bring more performance, except perhaps for some fringe cases.
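
For instance (a small illustrative case, not from the code above), an overlapping assignment like this one cannot be turned into a straightforward forward loop, so the compiler may have to create a hidden temporary for the right-hand side:

    ! array semantics: every a(1:n-1) element on the right is the *old* value
    a(2:n) = a(1:n-1) + a(2:n)
    ! the naive loop would use already-updated values and give a different result:
    ! do i = 2, n
    !     a(i) = a(i-1) + a(i)
    ! end do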

2 Likes

The sizes you are speaking about (~800 k entries) are really not big; there should not be a problem connected with that. I ran your MRE and it is indeed very slow, ~70 s on my PC. However, if I simplify your code (just put the following at the end of your main), remove the procedure call, and do it with array notation, I get down to ~8e-4 s.

    ! declarations needed in main for this snippet:
    !   integer :: MM
    !   real(8) :: A, B
    !   real(8), allocatable :: arr(:), tmp(:)
    MM=N_grid_pack
    allocate(arr(MM),tmp(MM))
    arr=pack_tot
    tmp=0.5
    call random_number(A)
    call random_number(B)
    B=B+0.5
    write(*,*) "A and B are", A, B

    ! Time the computation
    call cpu_time(t1)
    arr=arr+A*tmp/B
    call cpu_time(t2)

    write(*,*) 'Time: ', t2 - t1

Why bother doing the computation if there is no side effect? The compiler is free to optimize away a result that is never used.

Ok, you are right:

    ! Time the computation
    call cpu_time(t1)
    arr=arr+A*tmp/B
    write(*,*) sum(arr)
    call cpu_time(t2)

and I get 2.9e-3 seconds :slight_smile: tmp is also random right now.
My original point was that the MRE is unbelievably slow; there must be some good reason for it to be so slow.

1 Like

I have actually explained the reason

I tend to agree, but compilers can also sometimes struggle to optimize some complex loops. Hence the once-popular directives such as !DIR$ IVDEP, which tell the compiler that there is no dependency between the iterations, or the OpenMP simd directive, which more or less forces the compiler to vectorize a loop.
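
For example, a minimal sketch of the OpenMP form, reusing the loop from earlier in the thread (with gfortran this needs -fopenmp or -fopenmp-simd):

    !$omp simd
    do k = 1, n
        pack_tot(k) = pack_tot(k) + norm_factor * line_pack(k)
    end do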

1 Like