Elemental already implies pure.
Parallel in which sense, across SIMD units or across threads? Most Fortran compilers do a good job of auto-vectorizing and inlining elemental procedures, as long as their definitions are visible. For example, given
```fortran
module foo
implicit none
contains
   elemental function sqr(x)
      real, intent(in) :: x
      real :: sqr
      sqr = x**2
   end function
end module

program bar
   use foo
   implicit none
   integer, parameter :: n = 1000
   real, allocatable :: x(:), y(:)
   allocate(x(n), y(n))
   call random_number(x)
   y = sqr(x)
   print *, maxval(y)
end program
```
compiled with gfortran 13.2 and the flags `-O3 -mavx2`, the assembly contains the main computational loop:
```asm
.L7:
        vmovups ymm2, YMMWORD PTR [rsi+rax]   # load eight reals from x
        add     rdx, 1
        vmulps  ymm0, ymm2, ymm2              # vector multiply packed single
        vmovups YMMWORD PTR [rbx+rax], ymm0   # store eight reals in y
        add     rax, 32
        cmp     rdi, rdx
        jne     .L7
```
For parallelism across CPU threads my personal favorite is to use OpenMP constructs, as they provide better control over the parallelism.
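As a minimal sketch of what the OpenMP route looks like (my own illustrative variant, reusing the `sqr` function from the module above; the combined `parallel do simd` construct is from OpenMP 4.0+), the loop can be parallelized explicitly:

```fortran
program bar_omp
   use foo, only: sqr
   implicit none
   integer, parameter :: n = 100000
   real, allocatable :: x(:), y(:)
   integer :: i
   allocate(x(n), y(n))
   call random_number(x)
   ! distribute iterations across threads and ask for SIMD within each chunk
   !$omp parallel do simd
   do i = 1, n
      y(i) = sqr(x(i))
   end do
   !$omp end parallel do simd
   print *, maxval(y)
end program
```

Compiled with `gfortran -O3 -fopenmp bar_omp.f90`, the thread count can then be controlled at runtime via `OMP_NUM_THREADS`, which is exactly the kind of control the automatic approaches below don't give you.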
In principle it is also possible to use `do concurrent` and the right compiler flags:

- with gfortran, use `-ftree-parallelize-loops=N`, where `N` is the number of threads
- with ifort, use `-parallel`
- with nvfortran, use `-stdpar=multicore`
When using this form of automatic parallelization it’s always a good idea to check whether parallelization actually occurred. A few options to inspect this are:

- for gfortran: `-fopt-info`
- for ifort: `-qopt-report` or `-parallel-source-info` (for some reason this option doesn’t work for me)
- for nvfortran: `-Minfo=accel`
More information about the use of do concurrent can be found in @sumseq’s work: Can Fortran’s ‘do concurrent’ replace directives for accelerated computing? (PDF, 3.78 MB)
Just for demonstration, I took the program above and called the function within a `do concurrent` loop:
```fortran
program bar
   use foo, only: sqr
   implicit none
   integer, parameter :: n = 100000
   real, allocatable :: x(:), y(:)
   integer :: i
   allocate(x(n), y(n))
   call random_number(x)
   do concurrent (i = 1:n)
      y(i) = sqr(x(i))
   end do
   print *, maxval(y)
end program
```
The result with gfortran:
```
$ gfortran -O3 -march=native -fopt-info bar.f90
bar.f90:20:0: optimized: Inlining sqr/0 into bar/1.
bar.f90:20:0: optimized: loop vectorized using 64 byte vectors
bar.f90:17:0: optimized: basic block part vectorized using 32 byte vectors

$ gfortran -O3 -march=native -ftree-parallelize-loops=2 -fopt-info bar.f90
bar.f90:20:0: optimized: Inlining sqr/0 into bar/1.
bar.f90:20:0: optimized: parallelizing inner loop 1
bar.f90:17:0: optimized: basic block part vectorized using 32 byte vectors
```
With ifort:
```
$ ifort -O2 -parallel -qopenmp -qopt-report=5 -qopt-report-phase=openmp,par bar.f90
ifort: remark #10397: optimization reports are generated in *.optrpt files in the output location
```

The generated file `bar.optrpt` states:

```
LOOP BEGIN at bar.f90(19,5)
remark #17108: loop was not parallelized: insufficient computational work
LOOP END
```
When the parallelization threshold is lowered,

```
$ ifort -O2 -parallel -par-threshold=80 -qopenmp -qopt-report=5 -qopt-report-phase=openmp,par bar.f90
ifort: remark #10397: optimization reports are generated in *.optrpt files in the output location
```

the optimization report changes to

```
LOOP BEGIN at bar.f90(19,5)
remark #17109: LOOP WAS AUTO-PARALLELIZED
remark #17101: parallel loop shared={ } private={ } firstprivate={ i X Y i } lastprivate={ } firstlastprivate={ } reduction={ }
LOOP END
```
With nvfortran:
```
$ nvfortran -fast -stdpar=multicore -Minfo=accel,vect bar.f90
bar:
     19, Generating Multicore code
         19, Loop parallelized across CPU threads
     19, Loop not vectorized/parallelized: contains call
     22, Generated vector simd code for the loop containing reductions
```
Whether this actually leads to any speed-up, and under what conditions, has to be measured at runtime. It’s worth noting that ifort was able to both vectorize the loop and parallelize it across threads, whereas gfortran and nvfortran did not manage both at once.
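For the runtime measurement, a rough sketch (my own addition, not from the discussion above) is to wrap the loop in `system_clock` calls and compare wall time with and without the parallelization flags:

```fortran
program bench
   use foo, only: sqr
   implicit none
   integer, parameter :: n = 10000000   ! larger n, so the loop dominates
   real, allocatable :: x(:), y(:)
   integer :: i, t0, t1, rate
   allocate(x(n), y(n))
   call random_number(x)
   call system_clock(t0, rate)          ! start count and ticks per second
   do concurrent (i = 1:n)
      y(i) = sqr(x(i))
   end do
   call system_clock(t1)
   print '(a,f8.3,a)', 'elapsed: ', real(t1 - t0)/rate, ' s'
   print *, maxval(y)   ! consume the result so the loop is not optimized away
end program
```

Building this once plainly and once with, e.g., `-ftree-parallelize-loops=2` (or the corresponding ifort/nvfortran flags) shows directly whether the transformation pays off for a given problem size; for small `n` the threading overhead typically eats any gain.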