If the community is interested I can move the implementations of sum and dot_product in fast_math to stdlib. From the tests in the library, this reduces the choices to two: fastest (chunked), accurate (chunked kahan). Even the kahan one can be faster than the intrinsic (according to the test I’ve done with gfortran 13; with intel is slower)
I’ve opened a draft PR in stdlib following the discussion in this thread. For the moment I just included fsum
/fsum_kahan
and fprod
/fprod_kahan
. I poked a little bit the fast_math library before starting the PR, see some updated results.
In order to avoid hijacking this thread, I propose to follow up the discussion directly on the PR. Or a new thread about instrinsics
in stdlib would be more appropriate? @certik I feel this is also related to some of your comments here Building a Fortran compiler in pure Fortran (Mega thread) - #11 by certik
@hkvzjal excellent, thanks!
Regarding my comment that you linked, we since found it hard to maintain functions like “sum” in Fortran, because it has the optional “dim” argument, and if it is compile-time, you want to inline the loops with minimal runtime overhead. Also it has to work for many types and kinds and ranks. So we currently implemented such functions inside the compiler itself, effectively instantiating them for a particular type/kind/rank and a compile-time (if available) value of “dim”. See the above blog post for more details.
Being able to maintain them in Fortran would be a great use case for the generics subgroup.
Thanks for your feedback @certik!
So, If I were to push the current PR to handle ND arrays and the issue with dim, it could be done in stdlib thanks to fypp, here a draft that I’m starting:
header module interface for fsum
interface fsum
#:for rk, rt, rs in RC_KINDS_TYPES
pure module function fsum_1d_${rs}$(a) result(s)
${rt}$, intent(in) :: a(:)
${rt}$ :: s
end function
pure module function fsum_1d_${rs}$_mask(a,mask) result(s)
${rt}$, intent(in) :: a(:)
logical, intent(in) :: mask(:)
${rt}$ :: s
end function
#:for rank in RANKS
pure module function fsum_${rank}$d_${rs}$( x, mask ) result( s )
${rt}$, intent(in) :: x${ranksuffix(rank)}$
logical, intent(in), optional :: mask${ranksuffix(rank)}$
${rt}$ :: s
end function
pure module function fsum_${rank}$d_dim_${rs}$( x , dim, mask ) result( s )
${rt}$, intent(in) :: x${ranksuffix(rank)}$
integer, intent(in):: dim
logical, intent(in), optional :: mask${ranksuffix(rank)}$
${rt}$ :: s${reduced_shape('x', rank, 'dim')}$
end function
#:endfor
#:endfor
end interface
public :: fsum
submodule implementation details
#:for rk, rt, rs in RC_KINDS_TYPES
pure module function fsum_1d_${rs}$(a) result(s)
${rt}$, intent(in) :: a(:)
${rt}$ :: s
${rt}$ :: abatch(chunk)
integer :: i, dr, rr
! -----------------------------
dr = size(a)/chunk
rr = size(a) - dr*chunk
abatch = zero_${rs}$
do i = 1, dr
abatch(1:chunk) = abatch(1:chunk) + a(chunk*i-chunk+1:chunk*i)
end do
abatch(1:rr) = abatch(1:rr) + a(size(a)-rr+1:size(a))
s = zero_${rs}$
do i = 1, chunk/2
s = s + abatch(i)+abatch(chunk/2+i)
end do
end function
pure module function fsum_1d_${rs}$_mask(a,mask) result(s)
${rt}$, intent(in) :: a(:)
logical, intent(in) :: mask(:)
${rt}$ :: s
${rt}$ :: abatch(chunk)
integer :: i, dr, rr
! -----------------------------
dr = size(a)/chunk
rr = size(a) - dr*chunk
abatch = zero_${rs}$
do i = 1, dr
abatch(1:chunk) = abatch(1:chunk) + merge( zero_${rs}$ , a(chunk*i-chunk+1:chunk*i) , mask(chunk*i-chunk+1:chunk*i) )
end do
abatch(1:rr) = abatch(1:rr) + merge( zero_${rs}$ , a(size(a)-rr+1:size(a)) , mask(size(a)-rr+1:size(a)) )
s = zero_${rs}$
do i = 1, chunk/2
s = s + abatch(i)+abatch(chunk/2+i)
end do
end function
!============ N-D =================
#:for rank in RANKS
pure module function fsum_${rank}$d_${rs}$( x , mask ) result( s )
${rt}$, intent(in) :: x${ranksuffix(rank)}$
logical, intent(in), optional :: mask${ranksuffix(rank)}$
${rt}$ :: s
if(.not.present(mask)) then
s = sum_recast(x,size(x))
else
s = sum_recast_mask(x,mask,size(x))
end if
contains
pure ${rt}$ function sum_recast(b,n)
integer, intent(in) :: n
${rt}$, intent(in) :: b(n)
sum_recast = fsum(b)
end function
pure ${rt}$ function sum_recast_mask(b,m,n)
integer, intent(in) :: n
${rt}$, intent(in) :: b(n)
logical, intent(in) :: m(n)
sum_recast_mask = fsum(b,m)
end function
end function
pure module function fsum_${rank}$d_dim_${rs}$( x , dim, mask ) result( s )
${rt}$, intent(in) :: x${ranksuffix(rank)}$
integer, intent(in):: dim
logical, intent(in), optional :: mask${ranksuffix(rank)}$
${rt}$ :: s${reduced_shape('x', rank, 'dim')}$
integer :: j
if(.not.present(mask)) then
if(dim<${rank}$)then
do j = 1, size(x,dim=${rank}$)
#:if rank == 2
s${select_subarray(rank-1, [(rank-1, 'j')])}$ = fsum( x${select_subarray(rank, [(rank, 'j')])}$ )
#:else
s${select_subarray(rank-1, [(rank-1, 'j')])}$ = fsum( x${select_subarray(rank, [(rank, 'j')])}$, dim=dim )
#:endif
end do
else
do j = 1, size(x,dim=1)
#:if rank == 2
s${select_subarray(rank-1, [(1, 'j')])}$ = fsum( x${select_subarray(rank, [(1, 'j')])}$ )
#:else
s${select_subarray(rank-1, [(1, 'j')])}$ = fsum( x${select_subarray(rank, [(1, 'j')])}$, dim=${rank-1}$ )
#:endif
end do
end if
else
if(dim<${rank}$)then
do j = 1, size(x,dim=${rank}$)
#:if rank == 2
s${select_subarray(rank-1, [(rank-1, 'j')])}$ = fsum( x${select_subarray(rank, [(rank, 'j')])}$, mask=mask${select_subarray(rank, [(rank, 'j')])}$ )
#:else
s${select_subarray(rank-1, [(rank-1, 'j')])}$ = fsum( x${select_subarray(rank, [(rank, 'j')])}$, dim=dim, mask=mask${select_subarray(rank, [(rank, 'j')])}$ )
#:endif
end do
else
do j = 1, size(x,dim=1)
#:if rank == 2
s${select_subarray(rank-1, [(1, 'j')])}$ = fsum( x${select_subarray(rank, [(1, 'j')])}$, mask=mask${select_subarray(rank, [(1, 'j')])}$ )
#:else
s${select_subarray(rank-1, [(1, 'j')])}$ = fsum( x${select_subarray(rank, [(1, 'j')])}$, dim=${rank-1}$, mask=mask${select_subarray(rank, [(1, 'j')])}$ )
#:endif
end do
end if
end if
end function
#:endfor
#:endfor
Should I try to push it here? the fsum
implementation has proven very interesting for handling contigous 1D arrays, I have no idea how it might behave performance-wise for ND arrays when requesting to sum over dim>1
.
EDIT: in the implementation details, it can handle dim
and mask
for all ranks
Yes, send a PR, and we’ll go from there.