Personally, I think this rule is too strong, and both assumed-size and assumed-shape arrays have valid use-cases.
When it comes to explicit-size array arguments (including assumed-size):
- the known length and contiguity can offer better optimization opportunities
- they are very commonly used for interoperability with C APIs
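On the C interoperability point, an assumed-size dummy in a bind(c) interface maps directly onto a bare C pointer, with no descriptor involved. A minimal sketch, assuming a hypothetical C function void csqrsum(int n, const float *a, const float *b, float *c):

interface
   subroutine csqrsum(n, a, b, c) bind(c, name="csqrsum")
      use, intrinsic :: iso_c_binding, only: c_int, c_float
      integer(c_int), value :: n
      ! assumed-size dummies: the C side sees plain float pointers
      real(c_float), intent(in) :: a(*), b(*)
      real(c_float), intent(out) :: c(*)
   end subroutine
end interface

An assumed-shape dummy in the same interface would instead require the C side to accept and unpack a CFI_cdesc_t descriptor (Fortran 2018 C interoperability).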
Just to give you a simple example:
subroutine sqrsum(a,b,c)
real, intent(in) :: a(:), b(:)
real, intent(out) :: c(:)
c = a**2 + b**2
end subroutine
subroutine sqrsumn(n,a,b,c)
integer, intent(in) :: n
real, intent(in) :: a(n), b(n)
real, intent(out) :: c(n)
c = a**2 + b**2
end subroutine
When compiled with gfortran -O2 -march=skylake, the instructions produced are:
1 sqrsum_: | sqrsumn_:
2 push rbx | movsx rdi, DWORD PTR [rdi]
3 mov ebx, 1 | test edi, edi
4 mov r8, QWORD PTR [rdx+40] | jle .L5
5 mov r10, QWORD PTR [rdi+40] | sal rdi, 2
6 mov r9, QWORD PTR [rsi+40] | xor eax, eax
7 mov r11, QWORD PTR [rdi+56] | .L3:
8 test r8, r8 | vmovss xmm1, DWORD PTR [rdx+rax]
9 mov rax, QWORD PTR [rdx] | vmovss xmm0, DWORD PTR [rsi+rax]
10 mov rcx, QWORD PTR [rdi] | vmulss xmm1, xmm1, xmm1
11 cmove r8, rbx | vfmadd132ss xmm0, xmm1, xmm0
12 test r10, r10 | vmovss DWORD PTR [rcx+rax], xmm0
13 mov rdx, QWORD PTR [rsi] | add rax, 4
14 cmove r10, rbx | cmp rdi, rax
15 test r9, r9 | jne .L3
16 cmove r9, rbx | .L5:
17 sub r11, QWORD PTR [rdi+48] | ret
18 js .L11
19 sal r10, 2
20 xor esi, esi
21 sal r9, 2
22 sal r8, 2
23 .L6:
24 vmovss xmm1, DWORD PTR [rdx]
25 vmovss xmm0, DWORD PTR [rcx]
26 mov rdi, rsi
27 add rcx, r10
28 inc rsi
29 add rdx, r9
30 vmulss xmm1, xmm1, xmm1
31 vfmadd132ss xmm0, xmm1, xmm0
32 vmovss DWORD PTR [rax], xmm0
33 add rax, r8
34 cmp rdi, r11
35 jne .L6
36 .L11:
37 pop rbx
38 ret
See how the assumed-shape version does extra register shuffling at the beginning, whereas the contiguous version proceeds almost immediately into the computational loop. With many assumed-shape arguments this contributes to higher register pressure, and some arguments may need to be spilled to the stack. In contrast, the explicit-size version uses only four registers:
- rdi for the address at which the common size n is stored
- rsi for the base address of array a
- rdx for the base address of array b
- rcx for the base address of array c
In the computational loop .L6, the assumed-shape version has 12 instructions, whereas the contiguous version has 8. The additional instructions track the element positions separately in the registers rdx, rcx, and rax, because assumed-shape implies each argument could have a different stride.
When compiled with -O3 -march=skylake -fopt-info, the message
/app/example.f90:4:15: optimized: versioned this loop for when certain strides are 1
says the compiler generated two code-paths: one optimized for the case when all arguments are contiguous, and a second, sequential loop for when any of the strides is larger than one.
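If you want to keep the assumed-shape interface but avoid the versioned loop, one option is the contiguous attribute (Fortran 2008), which lets the compiler assume unit stride; the trade-off is that a non-contiguous actual argument then triggers copy-in/copy-out. A sketch, adapting the example above:

subroutine sqrsum_c(a,b,c)
real, contiguous, intent(in) :: a(:), b(:)
real, contiguous, intent(out) :: c(:)
c = a**2 + b**2
end subroutine

With this attribute the compiler can emit a single unit-stride loop, at the cost of a possible temporary copy at the call site.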
NVIDIA also recommends avoiding assumed-shape arrays in GPU kernels (that is, CUDA kernels, or functions called inside OpenACC regions). For more details see the talk by David Appelhans, Best Practices for Programming GPUs using Fortran, OpenACC, and CUDA | NVIDIA On-Demand. Here is a snapshot taken from that video:
There are, however, very valid use-cases for assumed-shape arrays too. One of my favourites is the situation where both SoA (structure-of-arrays) and AoS (array-of-structures) layouts are in common use. Say you have some kind of spatial search tree in 3-d:
subroutine search(x,y,z, ...)
real, intent(in) :: x(:), y(:), z(:)
! ...
end subroutine
By using assumed-shape arrays, users can do the following,
type :: point
real :: x, y, z
end type
type :: point_collection(n)
integer, len :: n
real :: x(n), y(n), z(n)
end type
type(point), allocatable :: pa(:)
type(point_collection(n=:)), allocatable :: pc
! ... initialize pa and pc
call search(pa%x,pa%y,pa%z, ...) ! AoS
call search(pc%x,pc%y,pc%z, ...) ! SoA
The code-path generated for the AoS layout won’t be optimal, but at least it doesn’t make a temporary copy, and the locality of the data accesses remains good.
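The no-copy behaviour can be checked with the is_contiguous intrinsic (Fortran 2008). A minimal sketch, using a plain derived-type array so it stands alone (the parameterized type behaves the same way):

program layout_check
implicit none
type :: point
   real :: x, y, z
end type
type(point) :: pa(4)
real :: xs(4)
pa%x = 0.0; pa%y = 0.0; pa%z = 0.0
xs = 0.0
! pa%x selects every third real in memory: a strided, non-contiguous section
print *, is_contiguous(pa%x)   ! prints F
! a plain real array is contiguous
print *, is_contiguous(xs)     ! prints T
end program

Passing the strided section pa%x to an assumed-shape dummy associates it by descriptor, without creating a temporary.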
Edit: for anyone seeking to combine two files for a side-by-side comparison, I used this script:
#!/usr/bin/env bash
# Combines two files side-by-side and numbers the lines
# Usage: combine.sh <file1> <file2>
nl -w3 <(paste "$1" <(awk '{print "| " $0}' "$2") | column -s $'\t' -t)