Please, No More Loops (Than Necessary): New Patterns in Fortran 2023

Hi @ivanpribec , you have provided a perfect example supporting @rwmsu 's viewpoint quoted below, with which I totally agree.

1 Like

I don’t know enough compiler stuff to interpret output from compiler explorer, but @ivanpribec does, so I will bother him now

Thanks for the suggestion. I added a forall variant.

gfortran:

                  Variant Avg Time(s) Speedup (x)      GFLOPS
Classic Nested Loops       0.8813E-06       1.000    22.7
Column-wise (1-loop)       0.8953E-06       0.984    22.3
Do Concurrent              0.8953E-06       0.984    22.3
...
Forall                     0.8953E-06       0.984    22.3

flang:

Classic Nested Loops       0.4938E-06       1.000    40.5
Column-wise (1-loop)       0.5359E-06       0.921    37.3
Do Concurrent              0.4852E-06       1.018    41.2
...
Forall                     0.1696E-05       0.291    11.8

One thing which I believe my example shows is that even a simple operation can be expressed in many different ways. It is a challenging task for the compiler to ā€œsee throughā€ the programming language abstractions and deliver the optimally performing code. Not to mention that different optimizations may be needed depending on the sizes (m, n) (which may only be determinable at runtime) and for different hardware.

Procedures have always offered a convenient way of capturing the important building blocks, which can be optimized separately (or even implemented by other means than Fortran).


I’ve done a few experiments (and so has @hkvzjal) - if your elemental procedure is not in the same translation unit as the site where it is called, the compilers I’ve checked (gfortran, flang) fallback to calling the default scalar variant in a loop.

Here’s an example:

! stencil_helpers.f90
module stencil_helpers
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  elemental function stencil_point(c, w, e, s, n, cx, cy) result(val)
    real(dp), intent(in) :: c, w, e, s, n, cx, cy
    real(dp) :: val
    val = c + cx*(e - 2.0_dp*c + w) + cy*(n - 2.0_dp*c + s)
  end function stencil_point
end module

When compiled with gfortran -O3 -mcpu=native -S stencil_helpers.f90, the assembly contains a scalar function (we can infer it is scalar, because all registers start with d for double precision scalar):

___stencil_helpers_MOD_stencil_point:
LFB0:
	ldr	d31, [x0]
	fmov	d28, 2.0e+0
	ldr	d29, [x2]
	ldr	d30, [x4]
	ldr	d26, [x1]
	fmsub	d29, d31, d28, d29
	ldr	d27, [x3]
	fmsub	d30, d31, d28, d30
	ldr	d28, [x5]
	ldr	d0, [x6]
	fadd	d29, d29, d26
	fadd	d30, d30, d27
	fmadd	d31, d28, d29, d31
	fmadd	d0, d0, d30, d31
	ret

In a second file we can now call the elemental function:

! stencil.f90
module stencil
  use stencil_helpers, only: dp, stencil_point
  implicit none
contains
  subroutine stencil_elemental(uold, unew, nx, ny, coeffx, coeffy)
    integer, intent(in) :: nx, ny
    real(dp), intent(in) :: uold(0:nx+1,0:ny+1)
    real(dp), intent(out) :: unew(0:nx+1,0:ny+1)
    real(dp), intent(in) :: coeffx, coeffy
    unew(1:nx,1:ny) = stencil_point( &
      c=uold(1:nx,1:ny), &
      w=uold(0:nx-1,1:ny), &
      e=uold(2:nx+1,1:ny), &
      s=uold(1:nx,0:ny-1), &
      n=uold(1:nx,2:ny+1), &
      cx=coeffx, cy=coeffy)
  end subroutine stencil_elemental
end module

The assembly now contains a loop applying the scalar function, element after element:

___stencil_MOD_stencil_elemental:

; ... omitted ...

L3:
	mov	x0, x27
	sub	x1, x27, #8
	add	x27, x27, 8
	add	x4, x0, x19
	add	x3, x0, x20
	mov	x2, x27
	add	x6, x29, 152
	add	x5, x29, 144
	bl	___stencil_helpers_MOD_stencil_point
	str	d0, [x26], 8
	cmp	x21, x27
	bne	L3

If we pack the two modules together and compile it (Compiler Explorer):

! stencil_fused.f90
module stencil
  implicit none
  integer, parameter :: dp = kind(1.0d0)
contains
  elemental function stencil_point(c, w, e, s, n, cx, cy) result(val)
    real(dp), intent(in) :: c, w, e, s, n, cx, cy
    real(dp) :: val
    val = c + cx*(e - 2.0_dp*c + w) + cy*(n - 2.0_dp*c + s)
  end function stencil_point
  subroutine stencil_elemental(uold, unew, nx, ny, coeffx, coeffy)
    use stencil_helpers, only: dp, stencil_point
    integer, intent(in) :: nx, ny
    real(dp), intent(in) :: uold(0:nx+1,0:ny+1)
    real(dp), intent(out) :: unew(0:nx+1,0:ny+1)
    real(dp), intent(in) :: coeffx, coeffy
    unew(1:nx,1:ny) = stencil_point( &
      c=uold(1:nx,1:ny), &
      w=uold(0:nx-1,1:ny), &
      e=uold(2:nx+1,1:ny), &
      s=uold(1:nx,0:ny-1), &
      n=uold(1:nx,2:ny+1), &
      cx=coeffx, cy=coeffy)
  end subroutine stencil_elemental
end module

the elemental function gets inlined and the compiler generates neat looking SIMD instructions (the v stands for vector registers, and postfix .2d means there are two doubles in the vector registers). Here it looks like it applied loop unrolling to a depth of 4:

.L6:
        add     x7, x7, 2
        add     x8, x8, 32
        ldp     q25, q24, [x6], 32
        ldp     q5, q2, [x2], 32
        ldp     q1, q23, [x5], 32
        ldp     q21, q26, [x4], 32
        mov     v4.16b, v25.16b
        mov     v0.16b, v24.16b
        mov     v3.16b, v5.16b
        mov     v20.16b, v2.16b
        fmls    v4.2d, v5.2d, v29.2d
        fmls    v0.2d, v2.2d, v29.2d
        fmls    v23.2d, v2.2d, v29.2d
        fmls    v1.2d, v5.2d, v29.2d
        fadd    v4.2d, v4.2d, v27.2d
        fadd    v0.2d, v0.2d, v25.2d
        mov     v27.16b, v24.16b
        fadd    v26.2d, v23.2d, v26.2d
        fadd    v21.2d, v1.2d, v21.2d
        fmla    v20.2d, v0.2d, v31.d[0]
        fmla    v3.2d, v4.2d, v31.d[0]
        fmla    v20.2d, v26.2d, v30.d[0]
        fmla    v3.2d, v21.2d, v30.d[0]
        stp     q3, q20, [x3], 32
        cmp     x9, x7
        bne     .L6

TL;DR, just because a function is marked as elemental, doesn’t mean it uses parallel evaluation.

8 Likes

very nice and informative talk @rouson (not surprised by it, of course). For those who missed it, here is a recording of the talk: https://www.youtube.com/watch?v=DKV2Whf4MKg