Please, No More Loops (Than Necessary): New Patterns in Fortran 2023

@rouson will your talk be available as a recording or PDF? I couldn’t attend but would love to hear/see it.

2 Likes

It’s already online on YouTube. I’ve just watched it during my lunch break. Here’s the link.

5 Likes

LFortran can also offload a subset of do concurrent constructs to NVIDIA GPUs.

1 Like

@rouson, this helped indeed. My post on the Intel Fortran Forum (Re: Problem running a simple program using OpenMP and GPU - Intel Community) has a copy of the code and describes the problem that occurred.

I changed the double loop to a DO CONCURRENT and compiled using:

ifx diffu_gpu_concur.f90 -Qopenmp -Qopenmp-targets=spir64 -Qopenmp-target-do-concurrent -Qopt-report

The program runs with n = 12800, while my naïve OpenMP version crashed. I can see there is a lot of activity on the GPU (using the Task Manager on Windows). The only thing I am worried about right now is that the time loop (the big one, see the fragment below) will copy the result back to the CPU at every iteration.

do k=1,1000
    write(*,*) k
    do concurrent( i = 2:n-1, j = 2:n-1 )
        unew(i,j) = u(i,j) + delt * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) &
                    - 4.0 * u(i,j) + alpha * exp(u(i,j)) )
    enddo

    u = unew  !<== copied back after each iteration?
enddo  

How can you control this? For instance, have it copied back only every 50 iterations, or even only at the end?

2 Likes

@certik that’s outstanding news! Thanks for letting us know. I’ll be sure to mention LFortran on this topic from now on.

1 Like

amdflang can offload do concurrent to GPUs using OpenMP as the backend! @mklemm here has helped and verified this.

Using do concurrent to offload to GPUs is very cool and all, but sometimes a lot of the performance will come from efficiently using memory, i.e. allocating things on the device from the start and reusing buffers, etc.

Currently the only way to do this with Fortran + do concurrent is to use OpenMP or OpenACC directives to handle the memory. Have you (@rouson) had any experience with this?
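For reference, here is the kind of thing I mean, applied to the diffusion example above: an OpenMP target data region keeps u and unew resident on the device for the whole time loop, and u is copied back to the host only once, when the region ends. This is an untested sketch; it assumes the compiler’s do concurrent offload composes with an enclosing target data region as the OpenMP mapping rules intend, which is worth checking against the compiler documentation (e.g. the ifx -Qopenmp-target-do-concurrent notes). The delt and alpha values are placeholders.

program diffusion_device_resident
    implicit none
    integer, parameter :: n = 12800
    real, allocatable :: u(:,:), unew(:,:)
    real :: delt, alpha
    integer :: i, j, k

    allocate(u(n,n), unew(n,n))
    u = 0.0; unew = 0.0
    delt = 1.0e-4; alpha = 0.1        ! placeholder values

    !$omp target data map(tofrom: u) map(alloc: unew)
    do k = 1, 1000
        do concurrent (i = 2:n-1, j = 2:n-1)
            unew(i,j) = u(i,j) + delt * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1) &
                        - 4.0 * u(i,j) + alpha * exp(u(i,j)))
        end do
        ! Copy unew into u with another offloaded loop; a plain "u = unew"
        ! here would run on the host against stale host data.
        do concurrent (i = 2:n-1, j = 2:n-1)
            u(i,j) = unew(i,j)
        end do
    end do
    !$omp end target data    ! u is copied back to the host here, once

    print *, 'u(n/2,n/2) =', u(n/2, n/2)
end program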

@jorgeg I’m anxiously awaiting the start of my GPU journey. For historical and funding reasons, my focus is mainly LLVM flang. The code that I’m currently most interested in offloading is in Fiats. Fiats and flang have co-evolved, with changes on each side made so that one can support the other. I suspect (but haven’t verified) that flang can offload the inference code in Fiats. I’ve been holding out for the ability to offload neural-net training in Fiats using flang. I suspect that will happen sometime this year.

I can probably try that with amdflang, which should be pretty close to upstream LLVM flang. Whenever you start the GPU journey, please let me know. I’ve been doing some Fortran GPU offloading using OpenMP/do concurrent (with nvfortran), and there’s quite a bit of fun to be had!

@jorgeg thanks for looking into this!

The “Running aerosol inference” section of the following paper includes a link to download an aerosol model:

FWIW, the prior “Running microphysics training” section of the paper describes how to generate a model.json file, but the aerosol one is bigger and likely better for offloading.

Also, in case it’s of interest, the following paper shows parallel scalability results of automatically parallelizing inference on a CPU using the same model:

https://escholarship.org/uc/item/5z08n75q

1 Like

Nice talk, @rouson!

The PRIMA project strictly followed the “No More Loops (Than Necessary)” principle. It is nice to know that the same rule has been adopted at Berkeley Lab.

I also take the viewpoint that loops should be replaced with array (matrix/vector) operations whenever possible.

Quoting my old post “Automatic arrays and intrinsic array operations: to use or not to use?”:

2 Likes

I think an interesting question is: when you define an elemental procedure and it automatically generates maybe around 40-ish specific procedures (as mentioned in the talk), what happens if there’s a compiler bug in one of them? How do you go about debugging and diagnosing it if you only have the elemental to work with?

You have to create an MRE (minimal reproducible example) and report it to the compiler developers so they can fix it.
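For example, if one particular argument combination misbehaves, the MRE can pin down just that instantiation of the elemental. A hypothetical sketch (not from the talk):

program elemental_mre
    implicit none
    real :: a(3, 3)
    a = 1.0
    ! Exercise exactly one combination (rank-2 array + scalar) so that a
    ! suspected compiler bug can be reproduced in isolation and reported.
    print *, add_scaled(a, 2.0)      ! expected: nine values of 3.0
contains
    elemental function add_scaled(x, y) result(z)
        real, intent(in) :: x, y
        real :: z
        z = x + y
    end function
end program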

Can someone who knows more about how array syntax works in Fortran explain why, some 30 years after it was introduced into the language, we still see examples where it doesn’t perform as well as the do loops it is supposed to replace?

2 Likes

I think it is an optimization issue. The language requires the assignment to occur “as if” the rhs of the array expression is fully evaluated before assigning to the lhs array. If the compiler cannot determine that there are no memory access conflicts in the rhs expression evaluation, then that requirement can most easily be satisfied by allocating one or more temporary arrays to evaluate the rhs, then copy from those temporary arrays to the output vector, and then deallocate the temporary arrays. Memory allocations are expensive, worse on the heap than on the stack, but still relatively expensive in either case compared to the very fast arithmetic operations that occur on modern processors. On the other hand, with explicit do loops, the compiler only uses registers to store the required intermediate values, bypassing the expensive heap/stack allocation step. Or with do concurrent, the programmer is telling the compiler that no memory conflicts will occur, again allowing the compiler to freely optimize the expression evaluation with that new information. So the answer, given the state of the language at this time, is for compilers to do a better job of optimizing away the unnecessary temporary array allocation/deallocation steps in array expressions.
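To make the “as if” rule concrete, here is a small sketch (my own illustration) in which the rhs reads elements that the lhs also writes, so a naive loop over the same expression is not equivalent and the compiler must either prove independence or introduce a temporary:

program aliasing_demo
    implicit none
    integer, parameter :: n = 5
    real :: a(n), b(n)
    integer :: i

    a = [1., 2., 3., 4., 5.]
    b = a

    ! Array assignment: the rhs must behave as if it were fully evaluated
    ! before any element of the lhs is updated.
    a(2:n) = a(1:n-1) + a(2:n)
    print *, a      ! 1. 3. 5. 7. 9.

    ! A naive loop reuses already-updated elements, so the result differs.
    do i = 2, n
        b(i) = b(i-1) + b(i)
    end do
    print *, b      ! 1. 3. 6. 10. 15.
end program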

1 Like

Thanks. That makes a lot of sense. Still, it is disconcerting that after 30 years compiler developers haven’t found a way (or maybe they have, and it’s just too expensive and disruptive to their current compiler code bases) to implement optimization strategies that eliminate temporary array allocations in all instances.

2 Likes

How would you evaluate the general rank-1 update A = A + \alpha x y^T, where A is a matrix and x and y are vectors? This is one of the building blocks of the LU factorization algorithm.

The rank-1 update is encapsulated by the BLAS Level 2 call:

call sger(m, n, alpha, x, 1, y, 1, a, lda)

It is possible to express this operation with array intrinsics:

a(1:m,1:n) = a(1:m,1:n) + spread(x,2,n) * spread(y,1,m)

(This statement was used in the data-parallel MasPar BLAS, presumably named after the MasPar parallel computer.)

I used an LLM to generate a micro-benchmark measuring 7 different variants (loops, matmul and spread) and an 8th variant which just forwards the call to BLAS:

dger_abstraction_penalty.f90
! dger_abstraction_penalty.f90 -- Evaluate speed of SGER operation
!
! Most of this example has been generated using Google Gemini.
! Ivan Pribec, 25/01/2026

module sger_kernels
implicit none
contains
    ! --- KERNELS ---

    subroutine sger_nested(m, n, alpha, x, y, a)
        integer, intent(in) :: m, n
        real, intent(in) :: alpha, x(m), y(n)
        real, intent(inout) :: a(m, n)
        integer :: i, j
        do j = 1, n
            do i = 1, m
                a(i, j) = a(i, j) + alpha * x(i) * y(j)
            end do
        end do
    end subroutine

    subroutine sger_col(m, n, alpha, x, y, a)
        integer, intent(in) :: m, n
        real, intent(in) :: alpha, x(m), y(n)
        real, intent(inout) :: a(m, n)
        integer :: j
        do j = 1, n
            a(:, j) = a(:, j) + (alpha * y(j)) * x
        end do
    end subroutine

    subroutine sger_concurrent(m, n, alpha, x, y, a)
        integer, intent(in) :: m, n
        real, intent(in) :: alpha, x(m), y(n)
        real, intent(inout) :: a(m, n)
        integer :: i, j
        do concurrent (j=1:n, i=1:m)
            a(i, j) = a(i, j) + alpha * x(i) * y(j)
        end do
    end subroutine

    subroutine sger_ptr(m, n, alpha, x, y, a)
        integer, intent(in) :: m, n
        real, intent(in) :: alpha
        real, intent(in), target :: x(m), y(n)
        real, intent(inout) :: a(m, n)
        real, pointer :: xm(:,:), ym(:,:)
        xm(1:m, 1:1) => x
        ym(1:1, 1:n) => y
        a = a + alpha * matmul(xm, ym)
    end subroutine

    subroutine sger_ptr_wrap(m, n, alpha, x, y, a)
        integer, intent(in) :: m, n
        real, intent(in) :: alpha, x(m), y(n)
        real, intent(inout) :: a(m, n)
        call sger_ptr(m, n, alpha, x, y, a)
    end subroutine

    subroutine sger_reshape(m, n, alpha, x, y, a)
        integer, intent(in) :: m, n
        real, intent(in) :: alpha, x(m), y(n)
        real, intent(inout) :: a(m, n)
        a = a + alpha * matmul(reshape(x, [m, 1]), reshape(y, [1, n]))
    end subroutine

    subroutine sger_spread(m, n, alpha, x, y, a)
        integer, intent(in) :: m, n
        real, intent(in) :: alpha, x(m), y(n)
        real, intent(inout) :: a(m, n)
        a = a + alpha * spread(x, 2, n) * spread(y, 1, m)
    end subroutine

    elemental function ef(al, val, xi, yj)
        real, intent(in) :: al, val, xi, yj
        real :: ef
        ef = val + al * xi * yj
    end function

    subroutine sger_elem(m, n, alpha, x, y, a)
        integer, intent(in) :: m, n
        real, intent(in) :: alpha, x(m), y(n)
        real, intent(inout) :: a(m, n)
        a = ef(alpha, a, spread(x, 2, n), spread(y, 1, m))
    end subroutine

    subroutine sger_blas(m, n, alpha, x, y, a)
        integer, intent(in) :: m, n
        real, intent(in) :: alpha, x(m), y(n)
        real, intent(inout) :: a(m, n)
        external :: sger
        call sger(m,n,alpha,x,1,y,1,a,m)
    end subroutine

end module

program abstraction_penalty
    use sger_kernels
    implicit none
    integer, parameter :: M = 100, N = 100
    integer, parameter :: TRIALS = 10
    integer, parameter :: NUM_VARIANTS = 8

    real(8), parameter :: G_OPS = (2.0d0 * M * N) / 1.0d9

    real, allocatable, target :: A(:,:), x(:), y(:)
    real    :: alpha, ref_time
    real    :: times(NUM_VARIANTS), ratios(NUM_VARIANTS), gflops(NUM_VARIANTS)
    character(len=25) :: names(NUM_VARIANTS)

    character(len=*), parameter :: fmt = '(A25, F12.4, F12.3)'

    allocate(A(M, N), x(M), y(N))
    call random_number(x); call random_number(y); alpha = 1.1

    names = [character(len=25) :: "Classic Nested Loops", &
                                  "Column-wise (1-loop)", &
                                  "Do Concurrent", &
                                  "Pointer Remap + Matmul", &
                                  "Reshape + Matmul", &
                                  "Spread (High-Level)", &
                                  "Elemental (Mapped)",&
                                  "BLAS"]

    print '(A, I0, A, I0, A)', "Benchmarking Matrix ", M, " x ", N
    print '(A)', "-----------------------------------------------------------------------"
    print '(A25, A12, A12, A12)', "Variant", "Avg Time(s)", "Speedup (x)", "GFLOPS"


    ! Reference run
    ref_time = run_kernel(sger_nested)

    call record_stats(1, ref_time)
    call record_stats(2, run_kernel(sger_col))
    call record_stats(3, run_kernel(sger_concurrent))
    call record_stats(4, run_kernel(sger_ptr_wrap))
    call record_stats(5, run_kernel(sger_reshape))
    call record_stats(6, run_kernel(sger_spread))
    call record_stats(7, run_kernel(sger_elem))
    call record_stats(8, run_kernel(sger_blas))

    print '(A)', "-----------------------------------------------------------------------"
    print '(A, F12.4)', "Global Abstraction Penalty (Log-Avg Speedup): ", &
          exp(sum(log(ratios)) / NUM_VARIANTS)

contains

    subroutine record_stats(idx, t_avg)
        integer, intent(in) :: idx
        real, intent(in)    :: t_avg
        times(idx)  = t_avg
        ratios(idx) = ref_time / t_avg
        gflops(idx) = real(G_OPS / t_avg)
        print '(A25, E12.4, F12.3, G12.3)', &
            names(idx), times(idx), ratios(idx), gflops(idx)
    end subroutine

    function run_kernel(kernel) result(avg_t)
        interface
            subroutine kernel(m, n, alpha, x, y, a)
                integer, intent(in) :: m, n
                real, intent(in) :: alpha, x(m), y(n)
                real, intent(inout) :: a(m, n)
            end subroutine
        end interface
        integer :: current_trials, t
        real(kind(1.0d0)) :: tstart, tend, total
        real :: avg_t

        current_trials = TRIALS
        do
            A = 0.0
            call cpu_time(tstart)
            do t = 1, current_trials
                call kernel(M, N, alpha, x, y, A)
            end do
            call cpu_time(tend)
            total = tend - tstart
            if (total >= 0.1) exit
            if (current_trials > 1000) exit
            current_trials = current_trials * 2
        end do
        avg_t = total / current_trials
    end function

end program

The results I got on an Apple M2 processor (using BLAS from Apple Accelerate):

$ gfortran -O3 -mcpu=native dger_abstraction_penalty.F90 -lblas && ./a.out
Benchmarking Matrix 100 x 100
-----------------------------------------------------------------------
                  Variant Avg Time(s) Speedup (x)      GFLOPS
Classic Nested Loops       0.8633E-06       1.000    23.2
Column-wise (1-loop)       0.8570E-06       1.007    23.3
Do Concurrent              0.8641E-06       0.999    23.1
Pointer Remap + Matmul     0.2793E-05       0.309    7.16
Reshape + Matmul           0.4941E-05       0.175    4.05
Spread (High-Level)        0.1045E-04       0.083    1.91
Elemental (Mapped)         0.1044E-04       0.083    1.91
BLAS                       0.5812E-06       1.485    34.4
-----------------------------------------------------------------------
Global Abstraction Penalty (Log-Avg Speedup):       0.3914
$ flang -O3 -mcpu=native dger_abstraction_penalty.F90 -lblas && ./a.out
Benchmarking Matrix 100 x 100
-----------------------------------------------------------------------
                  Variant Avg Time(s) Speedup (x)      GFLOPS
Classic Nested Loops       0.4844E-06       1.000    41.3
Column-wise (1-loop)       0.4742E-06       1.021    42.2
Do Concurrent              0.4734E-06       1.023    42.2
Pointer Remap + Matmul     0.2844E-05       0.170    7.03
Reshape + Matmul           0.2849E-05       0.170    7.02
Spread (High-Level)        0.1385E-03       0.003   0.144
Elemental (Mapped)         0.1404E-03       0.003   0.142
BLAS                       0.5578E-06       0.868    35.9
-----------------------------------------------------------------------
Global Abstraction Penalty (Log-Avg Speedup):       0.1540

For this particular example, using spread or matmul is slower than using a do loop. The BLAS result, on the other hand, is consistent across compilers.

I suppose the matmul results could be improved by using a special kernel for the case (m x 1) * (1 x n), where the inner dimension of the product is 1. The same goes for spread() * spread(). But it is a lot of extra work to add such corner cases to a compiler.

(Note: I’ve brought this example up previously - `target` attribute seemingly affecting performances - #25 by ivanpribec)

1 Like

I think this is due to the temporary memory allocation that is required by both of those approaches when the code is implemented in a straightforward way. Optimal performance can be achieved only if the compiler optimizer can determine that the memory allocation is not required and can perform the operation without it.

I would guess that this operation could also be performed efficiently with the (now obsolescent) forall construct. This was another way for the programmer to tell the compiler that temporary array allocations are unnecessary and that the array references can be done in any order. For a simple expression like this, I would expect the same performance as do concurrent, since the compiler has basically the same information in both cases.
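For reference, a forall version of the same rank-1 update kernel would look like the untested sketch below; it could be dropped into the benchmark module above alongside the other variants.

    subroutine sger_forall(m, n, alpha, x, y, a)
        integer, intent(in) :: m, n
        real, intent(in) :: alpha, x(m), y(n)
        real, intent(inout) :: a(m, n)
        integer :: i, j
        ! Obsolescent since Fortran 2018; shown only for comparison with
        ! the do concurrent kernel.
        forall (j = 1:n, i = 1:m) a(i, j) = a(i, j) + alpha * x(i) * y(j)
    end subroutine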

The BLAS routine probably works by subblocking the matrix and/or by unrolling the loops. I would expect a compiler optimizer to recognize this possibility too with do or do concurrent.

@Arjen that’s a great question – probably one to ask in the Intel Forum. There’s no explicit way in Fortran to control data movement between CPU and GPU, but @JeffH is likely to know the best way to handle this.

1 Like

@jorgeg I don’t know how compilers implement elemental, but I suspect they aren’t actually generating all those additional procedures. I know very little about compiler internals, but I can imagine better ways to handle elemental. For example, if the array arguments are contiguous, then each array element can be accessed via one subscript as if the array were one-dimensional (much like what is possible with the pointer rank-remapping feature that Fortran 2008 introduced), so only one additional implementation needs to be generated.
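Here is a sketch of that idea (my own illustration, not a claim about how any particular compiler works): with contiguous arrays, a single rank-1 pass covers every element, and the Fortran 2008 pointer bounds-remapping feature lets you express the flattened view directly in source code.

program elemental_flatten
    implicit none
    real, target :: a(100, 50), b(100, 50)
    real, pointer :: a1(:), b1(:)

    call random_number(a)

    ! View the simply contiguous rank-2 arrays as rank-1.
    a1(1:size(a)) => a
    b1(1:size(b)) => b

    b1 = square(a1)      ! one flat application covers every element of b

    print *, 'max error:', maxval(abs(b - a*a))
contains
    elemental function square(v) result(r)
        real, intent(in) :: v
        real :: r
        r = v * v
    end function
end program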

1 Like

In the NVIDIA implementation, allocatable memory is CUDA managed memory, so you can use cudaMemPrefetchAsync to migrate data explicitly rather than having it migrated by page faults.
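If I understand correctly, with nvfortran (-stdpar=gpu, plus -cuda for the cudafor module) that would look roughly like the sketch below. The exact cudaMemPrefetchAsync argument conventions (element count versus bytes, stream kind) are assumptions on my part and should be checked against the CUDA Fortran reference.

program prefetch_sketch
    use cudafor
    implicit none
    integer, parameter :: n = 12800
    real, allocatable :: u(:,:), unew(:,:)   ! managed memory under -stdpar=gpu
    integer(kind=cuda_stream_kind) :: stream = 0
    integer :: istat, dev, i, j, k

    allocate(u(n,n), unew(n,n))
    u = 0.0; unew = 0.0
    istat = cudaGetDevice(dev)

    ! Migrate both arrays to the GPU once, before the compute loop,
    ! instead of relying on page faults inside it.
    istat = cudaMemPrefetchAsync(u,    size(u),    dev, stream)
    istat = cudaMemPrefetchAsync(unew, size(unew), dev, stream)

    do k = 1, 1000
        do concurrent (i = 2:n-1, j = 2:n-1)
            unew(i,j) = 0.25 * (u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))  ! placeholder stencil
        end do
        do concurrent (i = 2:n-1, j = 2:n-1)   ! keep the copy on the device too
            u(i,j) = unew(i,j)
        end do
    end do

    ! Bring the result back to the host only at the end.
    istat = cudaMemPrefetchAsync(u, size(u), cudaCpuDeviceId, stream)
    istat = cudaDeviceSynchronize()
    print *, u(2, 2)
end program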