Simplify loop on an array of derived type?

Elemental already implies pure.
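As a small illustration (the module and procedure names here are mine): an elemental procedure is implicitly pure, so the compiler rejects side effects inside it, and Fortran 2008 added impure elemental to opt out of that.

module purity_demo
implicit none
contains
    ! "elemental" alone implies "pure": no I/O or other side effects allowed
    elemental function twice(x)
        real, intent(in) :: x
        real :: twice
        twice = 2*x
    end function
    ! Fortran 2008 allows opting out explicitly
    impure elemental subroutine trace(x)
        real, intent(in) :: x
        print *, 'x =', x   ! legal only because of "impure"
    end subroutine
end module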

Parallel in which sense, across SIMD units or across threads? Most Fortran compilers tend to do a good job with auto-vectorization and inlining of elemental procedures, as long as their definitions are visible at the call site. For example, the program

module foo
implicit none
contains
    elemental function sqr(x)
        real, intent(in) :: x
        real :: sqr
        sqr = x**2
    end function
end module

program bar
    use foo
    implicit none
    integer, parameter :: n = 1000
    real, allocatable :: x(:), y(:)
    allocate(x(n),y(n))
    call random_number(x)
    y = sqr(x)
    print *, maxval(y)
end program

compiled with gfortran 13.2 and the flags -O3 -mavx2 yields assembly whose main computational loop is:

.L7:
        vmovups ymm2, YMMWORD PTR [rsi+rax]   # load eight reals from x
        add     rdx, 1
        vmulps  ymm0, ymm2, ymm2              # vector multiply packed single
        vmovups YMMWORD PTR [rbx+rax], ymm0   # store eight reals in y
        add     rax, 32
        cmp     rdi, rdx
        jne     .L7
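To reproduce such a listing yourself, the assembly can be dumped with gfortran's -S flag (the Intel syntax shown above additionally needs -masm=intel), assuming module and program sit together in a file bar.f90:

$ gfortran -O3 -mavx2 -S -masm=intel bar.f90   # writes the assembly to bar.s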

For parallelism across CPU threads my personal favorite is OpenMP constructs, as they provide finer control over the parallelism.
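A minimal sketch reusing the module foo from above (compile with -fopenmp on gfortran, or -qopenmp on ifort):

program bar_omp
    use foo, only: sqr
    implicit none
    integer, parameter :: n = 100000
    real, allocatable :: x(:), y(:)
    integer :: i
    allocate(x(n), y(n))
    call random_number(x)
    !$omp parallel do
    do i = 1, n
        y(i) = sqr(x(i))
    end do
    !$omp end parallel do
    print *, maxval(y)
end program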

In principle it is also possible to use do concurrent together with the right compiler flags; the examples below use -ftree-parallelize-loops=<n> with gfortran, -parallel with ifort, and -stdpar=multicore with nvfortran.

When using this form of automatic parallelization it’s always a good idea to check whether parallelization actually occurred. A few options to inspect this are:

- the compilers' optimization reports (-fopt-info with gfortran, -qopt-report with ifort, -Minfo with nvfortran), as demonstrated below
- inspecting the generated assembly, as done above
- timing the program and watching the thread utilization at runtime

More information about the use of do concurrent can be found in @sumseq’s work: Can Fortran’s ‘do concurrent’ replace directives for accelerated computing? (PDF, 3.78 MB)

Just for demonstration, I took the program above and called the function within a do concurrent loop:

program bar
    use foo, only: sqr
    implicit none
    integer, parameter :: n = 100000
    real, allocatable :: x(:), y(:)
    integer :: i
    allocate(x(n),y(n))
    call random_number(x)
    do concurrent(i=1:n)
        y(i) = sqr(x(i))
    end do
    print *, maxval(y)
end program

The result with gfortran:

$ gfortran -O3 -march=native -fopt-info bar.f90
bar.f90:20:0: optimized:  Inlining sqr/0 into bar/1.
bar.f90:20:0: optimized: loop vectorized using 64 byte vectors
bar.f90:17:0: optimized: basic block part vectorized using 32 byte vectors
$ gfortran -O3 -march=native -ftree-parallelize-loops=2 -fopt-info bar.f90
bar.f90:20:0: optimized:  Inlining sqr/0 into bar/1.
bar.f90:20:0: optimized: parallelizing inner loop 1
bar.f90:17:0: optimized: basic block part vectorized using 32 byte vectors

With ifort:

$ ifort -O2 -parallel -qopenmp -qopt-report=5 -qopt-report-phase=openmp,par bar.f90
ifort: remark #10397: optimization reports are generated in *.optrpt files in the output location

The generated file bar.optrpt states,

LOOP BEGIN at bar.f90(19,5)
   remark #17108: loop was not parallelized: insufficient computational work
LOOP END

When the parallelization threshold is lowered (-par-threshold accepts values from 0 to 100; with the default of 100, ifort parallelizes a loop only when a performance gain is practically certain, while lower values make it more aggressive),

$ ifort -O2 -parallel -par-threshold=80 -qopenmp -qopt-report=5 -qopt-report-phase=openmp,par bar.f90
ifort: remark #10397: optimization reports are generated in *.optrpt files in the output location

the optimization report changes to

LOOP BEGIN at bar.f90(19,5)
   remark #17109: LOOP WAS AUTO-PARALLELIZED
   remark #17101: parallel loop shared={ } private={ } firstprivate={ i X Y i } lastprivate={ } firstlastprivate={ } reduction={ }
LOOP END

With nvfortran:

$ nvfortran -fast -stdpar=multicore -Minfo=accel,vect bar.f90
bar:
     19, Generating Multicore code
         19, Loop parallelized across CPU threads
     19, Loop not vectorized/parallelized: contains call
     22, Generated vector simd code for the loop containing reductions

Whether this actually leads to any speed-up, and under what conditions, has to be measured at runtime. It’s worth noting that ifort was able to both vectorize and parallelize the loop across threads, whereas gfortran and nvfortran each achieved only one of the two here.
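To get an idea of the runtime, a minimal timing sketch based on system_clock could look like this (the problem size and the single, non-repeated measurement are simplifications; for stable numbers, repeat the kernel and take the best of several runs):

program time_bar
    use foo, only: sqr
    implicit none
    integer, parameter :: n = 100000
    real, allocatable :: x(:), y(:)
    integer :: i
    integer(8) :: t0, t1, rate
    allocate(x(n), y(n))
    call random_number(x)
    call system_clock(t0, rate)   ! start tick and ticks per second
    do concurrent(i=1:n)
        y(i) = sqr(x(i))
    end do
    call system_clock(t1)         ! end tick
    print *, 'elapsed [s]:', real(t1 - t0)/real(rate), ' check:', maxval(y)
end program

Comparing serial and parallel builds of such a harness (e.g. with and without -stdpar=multicore) then gives the actual speed-up.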
