Simplify loop on an array of derived type?

In my application, I often need to operate on every element of an array of derived type.

To this end, I find this construct very convenient :

type(my_type), dimension(:) :: my_array
integer i

do i=1,size(my_array)
  associate(el => my_array(i))
  ! operate on el...
  end associate

However, C++ has now a nice way to iterate over the elements of a std::vector :

for(auto el : my_array)

Similarly python has for el in my_array.

Couldn’t we have something similar in fortran ? Maybe with the syntax :

do (el => my_array)
  ! operate on el...

That would save declaring i and calling size, and automatically make the association.

I discovered fortran very recently and I love it, so I apologize in advance if my suggestion is off-topic.

What do you think ?

Well, Fortran has elemental routines, which allow you to do all manner of operations on the array without having to specify a loop. It depends a bit on what your operations are exactly, but it may be worthwhile to check that feature. (No time right now to concoct an example :slight_smile: but there should be plenty information about them)

Here is a simple example of an elemental function operating on an array of a derived type.

program elemental_example
  implicit none
  type derived
    real :: x=0.0
    integer :: i=2
  end type derived

  integer i
  type(derived) :: data(10)
  real, allocatable :: y(:)
  do i = 1, size(data)
    data(i) = derived(x=real(i))
  end do

  y = f(data)
  write(*,'(*(f6.1,:,1x))') y
  real elemental function f(der) 
    type(derived), intent(in) :: der
    f = der%x**der%i
  end function f
end program elemental_example

In F2023, I believe it’s possible to do:

do integer :: i = 1, size(my_array)
end do

but I don’t know if any compilers support it yet.

Depending on the elements of the derived type and nature of operations, you could also think of “inverting” the structure of your program, doing the association outside, and using elemental operations inside the association scope:

do i=1,size(my_array)
  associate(el => my_array(i))
     if (el%y > 5) el%x = el%y + 2
  end associate

associate(x => my_array%x, y => my_array%y)
   where(y > 5) x = y + 2
end associate

If the derived type is C-interoperable you can also do the processing in C++, although this increases the complexity of building the code. Borrowing the example of @DavidB:

// range_based_for.cpp
#include <iostream>
#include <span> // C++20
#include <cmath>

extern "C" {
struct derived 
    float x;
    int i;

void range_based_for(int n, derived *a_) 
    std::span<derived> a(a_,n);
    for (auto el: a) {
        std::cout << std::pow(el.x,el.i) << '\n';

} // extern "C"
! cpp_example.f90 --
!      Demo of calling a C++ routine using range-based for
program cpp_example
  use, intrinsic :: iso_c_binding
  implicit none
  type, bind(c) :: derived
    real(c_float) :: x = 0.0
    integer(c_int) :: i = 2
  end type

    subroutine range_based_for(n,a) bind(c,name="range_based_for")
      import c_int, derived
      integer(c_int), value :: n
      type(derived) :: a(n)
    end subroutine
  end interface

  type(derived) :: data(10)
  integer :: i 

  ! initialize array
  do i = 1, size(data)
    data(i) = derived(x=real(i))
  end do

  ! perform work in C++
  call range_based_for(size(data),data)

end program cpp_example
~/fortran/20240102_range_for$ make run
g++-13 -Wall -std=c++20 -c range_based_for.cpp
gfortran-13 -Wall -o cpp_example cpp_example.f90 range_based_for.o -lstdc++

If we were to do this the “Fortranic” way,

  print '(F6.1)', data%x**data%i

the program is shorter than the C++ one-liners:

// range-based for with structured binding
    for (auto [x,i]: a) std::cout << std::pow(x,i) << '\n';
// STL "for each"
    std::ranges::for_each(a,[](const auto& el){ std::cout << std::pow(el.x,el.i) << '\n'; });

Anyways, besides using elemental routines, depending on the application, you can also use transformational intrinsic functions (pack, unpack, merge, findloc), or one of the two-forms of where statements:

where (mask_expr) y = ...

where (mask_expr)
end where

An exercise exploring this topic can be found in this thread, where the challenge was to implement a “filter”-type program without explicit do loops.

In modern C++ they rely heavily on range-based for since it solves subtle problems such as,

for (int i = 1; i <= n; i++) ... // off by 1 on lower on upper bounds
for (size_t i = 1; ... ) ...     // size_t, unsigned, or int for index variable?
for (...; ++i) ...               // i++ or ++i, does it even matter?

In Fortran, missing the bounds can be a problem when using allocatable arrays with custom bounds,

subroutine process_array(a)
   real, intent(in), allocatable :: a(:)

   do i = 1, size(a) !! potentially wrong
   end do

   do i = lbound(a), ubound(a)
   end do
end subroutine

But typically you’d use custom arrays bounds for indexing purposes, so I’m not sure a range-based loop syntax would be useful here.

The integer size can be problematic in programs dealing with very large arrays, so that 64-bit integers are needed for indexing. In such cases, it can be painful to remember and consistently use a kind specifier:

subroutine process_array(a)
  real, intent(inout) :: a(:)  ! a huge array
  ! covers range -10**12 (exclusive) to 10**12 (exclusive)
  integer, parameter :: ip = selected_int_kind(12) 
  integer(ip) :: i

  do i = 1_ip, size(a,ip)  ! don't forget the kind!!!
  end do
end subroutine

With that said, personally I wouldn’t mind having some syntactic sugar for looping over elements. But some good examples of the new syntax (and counter-examples of why the existing options are insufficient) are needed.

1 Like

You can create an impure elemental subroutine to operate on el and then call that subroutine with argument my_array. The subroutine will effectively be called with successive elements of my_array. Here is an example of an impure elemental subroutine from my FortranTips:

module m
implicit none
impure elemental subroutine print_pow(i,pow)
integer, intent(in) :: i, pow
print "(i0,'^',i0,' = ',i0)",i, pow, i**pow
end subroutine print_pow
end module m
program main
use m, only: print_pow
call print_pow(i=[1,3,5], pow=3) ; print*
call print_pow(i=2, pow=[1,2,3]) ; print*
call print_pow(i=[2,3], pow=[4,5])
end program main


1^3 = 1
3^3 = 27
5^3 = 125

2^1 = 2
2^2 = 4
2^3 = 8

2^4 = 16
3^5 = 243

Thank you very much for these excellent answers!

Elemental routines are very interesting. After a bit of mental gymnastic (to unlearn years of C and explicit loops), my program is much smoother. This is an awesome feature of fortran, one that makes it really different from C/C++.

Syntactic sugar was exactly what I had in mind. I’ll try to think about it more deeply.

One additional question:

If I’m not mistaken, this function could be declared pure too. To run each call to f in parallel, I’d have to revert to an explicit loop:

! Assuming y is allocated
  y(i) = f(data(i))
end forall

or use a !$OMP or similar $!-prefixed instruction. Right?

According to elemental in Fortran Wiki elemental implies pure so no need to add it. You should declare impure if you need some flexibility IMPURE

This feature is obsolescent since Fortran 2018. You might preferer a simple do, or a do concurrent.

1 Like

Elemental already implies pure.

Parallel in which sense, across SIMD units or threads? Most Fortran compilers tend to do a good job with auto-vectorization and inlining of elemental procedures, as long as these are visible. For example on,

module foo
implicit none
    elemental function sqr(x)
        real, intent(in) :: x
        real :: sqr
        sqr = x**2
    end function
end module

program bar
    use foo
    implicit none
    integer, parameter :: n = 1000
    real, allocatable :: x(:), y(:)
    call random_number(x)
    y = sqr(x)
    print *, maxval(y)
end program

compiled with gfortran 13.2 and the flags -O3 -mavx2, the assembly contains the main computational loop:

        vmovups ymm2, YMMWORD PTR [rsi+rax]   # load eight reals from x
        add     rdx, 1
        vmulps  ymm0, ymm2, ymm2              # vector multiply packed single
        vmovups YMMWORD PTR [rbx+rax], ymm0   # store eight reals in y
        add     rax, 32
        cmp     rdi, rdx
        jne     .L7

For parallelism across CPU threads my personal favorite is to use OpenMP constructs, as they provide better control over the parallelism.

In principle it is also possible to use do concurrent and the right compiler flags:

When using this form of automatic parallelization it’s always a good idea to check if parallelization occurred. A few options to inspect this are:

More information about the use of do concurrent can be found in @sumseq’s work: Can Fortran’s ‘do concurrent’ replace directives for accelerated computing? (PDF, 3.78 MB)

Just for demonstration, I took the program above and called the function within a do concurrent loop:

program bar
    use foo, only: sqr
    implicit none
    integer, parameter :: n = 100000
    real, allocatable :: x(:), y(:)
    integer :: i
    call random_number(x)
    do concurrent(i=1:n)
        y(i) = sqr(x(i))
    end do
    print *, maxval(y)
end program

The result with gfortran:

$ gfortran -O3 -march=native -fopt-info bar.f90
bar.f90:20:0: optimized:  Inlining sqr/0 into bar/1.
bar.f90:20:0: optimized: loop vectorized using 64 byte vectors
bar.f90:17:0: optimized: basic block part vectorized using 32 byte vectors
$ gfortran -O3 -march=native -ftree-parallelize-loops=2 -fopt-info bar.f90
bar.f90:20:0: optimized:  Inlining sqr/0 into bar/1.
bar.f90:20:0: optimized: parallelizing inner loop 1
bar.f90:17:0: optimized: basic block part vectorized using 32 byte vectors

With ifort:

$ ifort -O2 -parallel -qopenmp -qopt-report=5 -qopt-report-phase=openmp,par bar.f90
ifort: remark #10397: optimization reports are generated in *.optrpt files in the output location

The generated file bar.optrpt states,

LOOP BEGIN at bar.f90(19,5)
   remark #17108: loop was not parallelized: insufficient computational work

When the parallelization threshold is lowered,

$ ifort -O2 -parallel -par-threshold=80 -qopenmp -qopt-report=5 -qopt-report-phase=openmp,par bar.f90
ifort: remark #10397: optimization reports are generated in *.optrpt files in the output location

the optimization report changes to

LOOP BEGIN at bar.f90(19,5)
   remark #17101: parallel loop shared={ } private={ } firstprivate={ i X Y i } lastprivate={ } firstlastprivate={ } reduction={ }

With nvfortran:

$ nvfortran -fast -stdpar=multicore -Minfo=accel,vect bar.f90
     19, Generating Multicore code
         19, Loop parallelized across CPU threads
     19, Loop not vectorized/parallelized: contains call
     22, Generated vector simd code for the loop containing reductions

Whether this actually leads to any speed-up, and under what conditions, has to be measured at runtime. It’s worth noting that ifort was able to both vectorize and parallelize the loop across threads, but with gfortran or nvfortran this didn’t happen.

1 Like