Hi @ivanpribec , you have provided a perfect example supporting @rwmsu 's viewpoint quoted below, with which I totally agree.
I donāt know enough compiler stuff to interpret output from compiler explorer, but @ivanpribec does, so I will bother him now
Thanks for the suggestion. I added a forall variant.
gfortran:
Variant Avg Time(s) Speedup (x) GFLOPS
Classic Nested Loops 0.8813E-06 1.000 22.7
Column-wise (1-loop) 0.8953E-06 0.984 22.3
Do Concurrent 0.8953E-06 0.984 22.3
...
Forall 0.8953E-06 0.984 22.3
flang:
Classic Nested Loops 0.4938E-06 1.000 40.5
Column-wise (1-loop) 0.5359E-06 0.921 37.3
Do Concurrent 0.4852E-06 1.018 41.2
...
Forall 0.1696E-05 0.291 11.8
One thing which I believe my example shows is that even a simple operation can be expressed in many different ways. It is a challenging task for the compiler to āsee throughā the programming language abstractions and deliver the optimally performing code. Not to mention that different optimizations may be needed depending on the sizes (m, n) (which may only be determinable at runtime) and for different hardware.
Procedures have always offered a convenient way of capturing the important building blocks, which can be optimized separately (or even implemented by other means than Fortran).
Iāve done a few experiments (and so has @hkvzjal) - if your elemental procedure is not in the same translation unit as the site where it is called, the compilers Iāve checked (gfortran, flang) fallback to calling the default scalar variant in a loop.
Hereās an example:
! stencil_helpers.f90
module stencil_helpers
implicit none
integer, parameter :: dp = kind(1.0d0)
contains
elemental function stencil_point(c, w, e, s, n, cx, cy) result(val)
real(dp), intent(in) :: c, w, e, s, n, cx, cy
real(dp) :: val
val = c + cx*(e - 2.0_dp*c + w) + cy*(n - 2.0_dp*c + s)
end function stencil_point
end module
When compiled with gfortran -O3 -mcpu=native -S stencil_helpers.f90, the assembly contains a scalar function (we can infer it is scalar, because all registers start with d for double precision scalar):
___stencil_helpers_MOD_stencil_point:
LFB0:
ldr d31, [x0]
fmov d28, 2.0e+0
ldr d29, [x2]
ldr d30, [x4]
ldr d26, [x1]
fmsub d29, d31, d28, d29
ldr d27, [x3]
fmsub d30, d31, d28, d30
ldr d28, [x5]
ldr d0, [x6]
fadd d29, d29, d26
fadd d30, d30, d27
fmadd d31, d28, d29, d31
fmadd d0, d0, d30, d31
ret
In a second file we can now call the elemental function:
! stencil.f90
module stencil
use stencil_helpers, only: dp, stencil_point
implicit none
contains
subroutine stencil_elemental(uold, unew, nx, ny, coeffx, coeffy)
integer, intent(in) :: nx, ny
real(dp), intent(in) :: uold(0:nx+1,0:ny+1)
real(dp), intent(out) :: unew(0:nx+1,0:ny+1)
real(dp), intent(in) :: coeffx, coeffy
unew(1:nx,1:ny) = stencil_point( &
c=uold(1:nx,1:ny), &
w=uold(0:nx-1,1:ny), &
e=uold(2:nx+1,1:ny), &
s=uold(1:nx,0:ny-1), &
n=uold(1:nx,2:ny+1), &
cx=coeffx, cy=coeffy)
end subroutine stencil_elemental
end module
The assembly now contains a loop applying the scalar function, element after element:
___stencil_MOD_stencil_elemental:
; ... omitted ...
L3:
mov x0, x27
sub x1, x27, #8
add x27, x27, 8
add x4, x0, x19
add x3, x0, x20
mov x2, x27
add x6, x29, 152
add x5, x29, 144
bl ___stencil_helpers_MOD_stencil_point
str d0, [x26], 8
cmp x21, x27
bne L3
If we pack the two modules together and compile it (Compiler Explorer):
! stencil_fused.f90
module stencil
implicit none
integer, parameter :: dp = kind(1.0d0)
contains
elemental function stencil_point(c, w, e, s, n, cx, cy) result(val)
real(dp), intent(in) :: c, w, e, s, n, cx, cy
real(dp) :: val
val = c + cx*(e - 2.0_dp*c + w) + cy*(n - 2.0_dp*c + s)
end function stencil_point
subroutine stencil_elemental(uold, unew, nx, ny, coeffx, coeffy)
use stencil_helpers, only: dp, stencil_point
integer, intent(in) :: nx, ny
real(dp), intent(in) :: uold(0:nx+1,0:ny+1)
real(dp), intent(out) :: unew(0:nx+1,0:ny+1)
real(dp), intent(in) :: coeffx, coeffy
unew(1:nx,1:ny) = stencil_point( &
c=uold(1:nx,1:ny), &
w=uold(0:nx-1,1:ny), &
e=uold(2:nx+1,1:ny), &
s=uold(1:nx,0:ny-1), &
n=uold(1:nx,2:ny+1), &
cx=coeffx, cy=coeffy)
end subroutine stencil_elemental
end module
the elemental function gets inlined and the compiler generates neat looking SIMD instructions (the v stands for vector registers, and postfix .2d means there are two doubles in the vector registers). Here it looks like it applied loop unrolling to a depth of 4:
.L6:
add x7, x7, 2
add x8, x8, 32
ldp q25, q24, [x6], 32
ldp q5, q2, [x2], 32
ldp q1, q23, [x5], 32
ldp q21, q26, [x4], 32
mov v4.16b, v25.16b
mov v0.16b, v24.16b
mov v3.16b, v5.16b
mov v20.16b, v2.16b
fmls v4.2d, v5.2d, v29.2d
fmls v0.2d, v2.2d, v29.2d
fmls v23.2d, v2.2d, v29.2d
fmls v1.2d, v5.2d, v29.2d
fadd v4.2d, v4.2d, v27.2d
fadd v0.2d, v0.2d, v25.2d
mov v27.16b, v24.16b
fadd v26.2d, v23.2d, v26.2d
fadd v21.2d, v1.2d, v21.2d
fmla v20.2d, v0.2d, v31.d[0]
fmla v3.2d, v4.2d, v31.d[0]
fmla v20.2d, v26.2d, v30.d[0]
fmla v3.2d, v21.2d, v30.d[0]
stp q3, q20, [x3], 32
cmp x9, x7
bne .L6
TL;DR, just because a function is marked as elemental, doesnāt mean it uses parallel evaluation.
very nice and informative talk @rouson (not surprised by it, of course). For those who missed it, here is a recording of the talk: https://www.youtube.com/watch?v=DKV2Whf4MKg