Some Intrinsic SUMS

As @kargl mentioned, it is important to maintain support for both the dim and mask arguments of sum, as well as to support input arrays of any rank.

To maximize performance, optional arguments are a no-go, along with assumed-rank arguments or really any other runtime checks. So right off the bat, each rank of sum becomes 6 functions:

  • sum with no optionals
  • sum with dim
  • sum with scalar mask
  • sum with array mask
  • sum with dim and scalar mask
  • sum with dim and array mask

The above would also need to be repeated for each type and rank the sum replacement was meant to support - not impossible, just annoying. My initial attempt to implement something like pairsum for 2D arrays turned out to be no faster than gfortran’s intrinsic implementation, even with LTO.
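
For concreteness, here is a minimal sketch of what that per-combination dispatch could look like for one type and rank (rank-2 default real). All names here are hypothetical, and the bodies simply forward to the intrinsic as placeholders for the tuned loops:

module sum_r2_variants
   implicit none
   private
   public :: my_sum
   ! One specific procedure per optional-argument combination, so each body
   ! is compiled without any present()/rank checks at runtime.
   interface my_sum
      module procedure s_plain, s_dim, s_smask, s_amask, s_dim_smask, s_dim_amask
   end interface
contains
   pure function s_plain(x) result(s)
      real, intent(in) :: x(:,:)
      real :: s
      s = sum(x)                        ! placeholder for a tuned loop
   end function s_plain
   pure function s_dim(x, dim) result(s)
      real, intent(in) :: x(:,:)
      integer, intent(in) :: dim
      real :: s(size(x, 3-dim))         ! extent of the dimension that survives
      s = sum(x, dim=dim)
   end function s_dim
   pure function s_smask(x, mask) result(s)
      real, intent(in) :: x(:,:)
      logical, intent(in) :: mask       ! scalar mask
      real :: s
      s = sum(x, mask=mask)
   end function s_smask
   pure function s_amask(x, mask) result(s)
      real, intent(in) :: x(:,:)
      logical, intent(in) :: mask(:,:)  ! array mask
      real :: s
      s = sum(x, mask=mask)
   end function s_amask
   pure function s_dim_smask(x, dim, mask) result(s)
      real, intent(in) :: x(:,:)
      integer, intent(in) :: dim
      logical, intent(in) :: mask
      real :: s(size(x, 3-dim))
      s = sum(x, dim=dim, mask=mask)
   end function s_dim_smask
   pure function s_dim_amask(x, dim, mask) result(s)
      real, intent(in) :: x(:,:)
      integer, intent(in) :: dim
      logical, intent(in) :: mask(:,:)
      real :: s(size(x, 3-dim))
      s = sum(x, dim=dim, mask=mask)
   end function s_dim_amask
end module sum_r2_variants

Generic resolution can tell the six specifics apart because they differ in argument count, argument type (integer dim vs. logical mask), or mask rank.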

I haven’t done a lot of testing yet, but I am curious to see how timings will work out for things like x(1000,1000) ... sum(x, dim=2). The compiler implementation may be smarter about it than I can be since the function I write won’t have any context about how data is arranged, just that it was called to sum on dimension 2.
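
To illustrate the data-arrangement point: with Fortran’s column-major storage, a dedicated dim=2 routine can keep the inner operation at unit stride by accumulating whole columns into the result vector. A minimal sketch (my own, not necessarily what any compiler actually generates):

function sum_dim2(x) result(s)
   implicit none
   real, intent(in) :: x(:,:)
   real :: s(size(x,1))
   integer :: j
   s = 0.0
   do j = 1, size(x,2)
      s = s + x(:,j)   ! each column x(:,j) is contiguous in memory
   end do
end function sum_dim2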

I’m somewhat surprised that this discussion is focusing more on performance than accuracy, given the speed of modern processors and the option of doing things in parallel or on a GPU. When the intrinsic sum takes about 6E-2 seconds to sum 100 million components, and the more accurate sums are about the same speed but the intrinsic sum gives wildly inaccurate results, I really don’t care that much about the difference in performance. In this particular case accuracy trumps performance (and yes, that is not always the case).

Say you have a CFD code where you are summing global residuals or errors (usually mass or energy) and using the change in the residuals relative to their initial value to monitor the convergence of a given iterative scheme. In an ideal world you would like to be able to drive the residuals to machine zero; however, for engineering applications, stopping when they reduce to around 10E-8 or so will give you an accurate estimation of drag. If SUM is giving you a bogus value (even if it’s only a little off, as in double precision), you stand the chance of burning a lot more CPU cycles than you need to (or, conversely, of not knowing that you need to run longer to get an accurate answer, or that there might be a bug in your solver).

Yes, performance matters, but as I stated previously, given the speed of modern processors, performance is in the eye of the beholder. To me, getting 1000 wrong answers in the time it takes to get 1 correct one is not “performance”.

Here is an example of how the summation recursion can be done without recursive calls. This is the way one might have done this kind of thing in f77 (except for the trailz() and leadz() intrinsics).

program sums
   implicit none
   integer, parameter :: n=10**8
   real :: s1, s2, s3, x(n)
   character(*), parameter :: sfmt = '(a,es15.7,1x,z8)'
   call random_number(x)
   x = x * 0.5 + 0.5    ! values between .5 and 1.0 have the same exponent.
   !x = 256 * (x - 0.5)  ! mixed signs and exponents.
   s1 = sum(x)
   write(*,sfmt) 'intrinsic:', s1, s1
   s2 = e_sum(x)
   write(*,sfmt) 'extended :', s2, s2
   s3 = r_sum(x)
   write(*,sfmt) 'recursive:', s3, s3
contains
   function e_sum(x)
      ! extended precision summation.
      use, intrinsic :: iso_fortran_env, only: ep=>real64
      implicit none
      real :: e_sum
      real, intent(in) :: x(:)
      integer :: i
      real(ep) :: temp
      temp = 0.0_ep
      do i = 1, size(x)
         temp = temp + real( x(i), ep )
      enddo
      e_sum = temp
      return
   end function e_sum
   function r_sum(x)
      ! recursive summation.
      implicit none
      real :: r_sum
      real, intent(in) :: x(:)
      integer :: i, j, p
      real :: psum( bit_size(1) - leadz(size(x)) )  ! max recursion depth.
      p = 0
      do i = 1, size(x)
         p = p + 1
         psum(p) = x(i)
         do j = 1, trailz(i)
            psum(p-1) = psum(p-1) + psum(p)
            p = p - 1
         enddo
      enddo
      do while (p > 1)  ! cleanup loop. accumulate in reverse order.
         psum(p-1) = psum(p-1) + psum(p)
         p = p - 1
      enddo
      r_sum = psum(1)
      return
   end function r_sum
end program sums

$ gfortran sums.F90 && a.out   ! .5 to 1.0 vector
intrinsic:  1.6777216E+07 4B800000
extended :  7.5001832E+07 4C8F0DFD
recursive:  7.5001832E+07 4C8F0DFD

$ gfortran sums.F90 && a.out   ! mixed-sign vector.
intrinsic:  7.5558744E+05 49387837
extended :  7.5553831E+05 49387525
recursive:  7.5553825E+05 49387524

To achieve optimal accuracy with real32 arithmetic, the internal summations should probably be done in a compensating way, but to my surprise, the recursive summation alone does a pretty good job with mixed-sign arguments, as can be seen by using the second assignment to the x(:) values (uncommenting it and commenting out the first) instead.

In my tests, the Higham method is faster than the Kahan one when the code is compiled with the -O3 flag.

I modified the code from @rwmsu’s post, as follows:

program kahan_sum_example_02 

!#ifdef R64
  use ISO_FORTRAN_ENV, ONLY: wp=>REAL64
!#else
!  use ISO_FORTRAN_ENV, ONLY: wp=>REAL32
!#endif

  implicit none

!Some constants

  integer, parameter     :: n = 10**8 + 1  ! = 100,000,001
  integer                :: i
  real(wp)               :: dpc, a_sum, k_sum, h_sum, i_sum
  real(wp)               :: dt1, dt2,  dt3,   dt4
  real(wp), dimension(n) :: dp

!-----------------------------------------------------------------
! We compute sum = 1.0 + sum_{i=2,n} dp(i).
! For simplicity, use a constant dp: dp = dpc, where

  dpc     = 1.234567891234567e-02_wp
  dp(1)   = 1.0_wp
  dp(2:n) = dpc

! Note: This is a simplified example. In this case, the correct sum
!       should be 1 + dpc*(n-1), which we know because dp is a vector of
!       the same constant value: dp = [1,dpc,dpc,...,dpc]. The correct sum
!       is used here to show how accurate the Kahan summation is.
!
!       In general, if dp(i) is different for each i, then we must
!       add up 1 and dp(i) over i=2,n as below; this is where the round-off
!       error can cause a loss of precision and Kahan's technique helps
!       keep the precision.

!-----------------------------------------------------------------
! (1)Simple summation loop

  call evaltime( 0, dt1 ) 

  a_sum = dp(1)
  do i = 2, n
   a_sum = a_sum + dp(i)
  end do

  call evaltime( 1, dt1 ) 

!-----------------------------------------------------------------
! (2)Summation by Kahan's technique

  call evaltime( 0, dt2 ) 

  k_sum = ksum(dp)

  call evaltime( 1, dt2 ) 

!-----------------------------------------------------------------
! (3) Intrinsic sum

  call evaltime( 0, dt3 ) 

  i_sum = SUM(dp)

  call evaltime( 1, dt3 ) 

!-----------------------------------------------------------------
! (4) Summation by Higham's technique

  call evaltime( 0, dt4 ) 

  h_sum = hsum(dp)

  call evaltime( 1, dt4 ) 

!-----------------------------------------------------------------
! (5) what's else ? 


!-----------------------------------------------------------------
! Compare the results with the correct sum.
!
  write(*,*)
  write(*,*) "      loop      = ", a_sum, dt1, '(sec)'
  write(*,*) "      Kahan     = ", k_sum, dt2, '(sec)' 
  write(*,*) "      Intrinsic = ", i_sum, dt3, '(sec)'
  write(*,*) "      Higham    = ", h_sum, dt4, '(sec)' 
!  
  write(*,*) "    Correct sum = ", 1.0_wp + dpc*real(n-1,wp)
  write(*,*)

Contains
!
! Kahan summation
!
  pure function ksum(a) Result(ks)

    real(wp), intent(in) :: a(:)
    real(wp)             :: ks

    integer  :: i, n
    real(wp) :: temp, c, t

    n = SIZE(a)
    ks = a(1)
    c  = 0.0_wp

    do i=2,n
      temp = a(i) - c
      t    = ks + temp
      c    = (t-ks) -temp
      ks   = t
    end do

  end function ksum
!
! Higham summation
!
  pure function hsum(a) Result(s)

    real(wp), intent(in) :: a(:)
    real(wp)             :: s
    real(wp) :: e, t, y, r 
    integer  :: i, n
!
    n = SIZE(a)
    s = 0
    e = 0               
    do i = 1,n         
       t = s          
       y = a(i) + e  
       s = t + y    
       r = t - s   
       e = r + y  
    end do

  end function hsum
!  
!
  subroutine evaltime ( istat, dt )
    implicit none
    integer,  intent(in)    :: istat
    real(8),  intent(inout) :: dt
    integer(8) :: clock_rate, clock_max, t

    if ( istat == 0 ) then
       call system_clock ( t, clock_rate, clock_max )
       dt = dble(t)                                ! start: store the raw tick count
    else
       call system_clock ( t, clock_rate, clock_max )
       dt = ( dble(t) - dt ) / dble(clock_rate)    ! stop: elapsed time in seconds
    endif

  end subroutine evaltime
!
!
end program kahan_sum_example_02

Here is the result:

   loop      =    1234568.8891491259        9.5671355000000000E-002 (sec)
   Kahan     =    1234568.8912345669       0.38310695900000002      (sec)
   Intrinsic =    1234568.8891491259        9.3410794000000005E-002 (sec)
   Higham    =    1234568.8912345669       0.35454762299999998      (sec)
 Correct sum =    1234568.8912345669

@RonShepard,

30 years ago we had a much simpler approach, shown below, although perhaps 10**8 elements might have been a challenge.

   function e_sum(x)
      implicit none
      real :: e_sum
      real, intent(in) :: x(:)

      real*10 :: temp, xe
      integer :: i

      temp = 0
      do i = 1, size(x)
         xe = x(i)
         temp = temp + xe
      end do
      e_sum = temp
      return
   end function e_sum

Is the 80-bit thing actually supported by hardware? My understanding was that modern CPUs didn’t actually have any hardware to handle >64 bits, so things like real128 were emulated by compilers, which is why it’s like 10x slower than 64 bit.

There was hardware support on Intel x86 CPUs starting in the 1980s (the x87 floating point unit); I’m not sure exactly when it began and when it ended. Also on Motorola CPUs there was the Apple SANE library, which also supported 80-bit floating point, but I don’t know how much hardware support the chips provided. I’ve used several fortran compilers on a variety of hardware that supported REAL*10. Historically it was implemented with 10-byte storage, but modern implementations, such as in gfortran, use 16-byte storage conventions, which makes it less useful than the old conventions.
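
If you want to see what a given compiler provides, asking selected_real_kind for more digits than real64 offers will surface the 80-bit kind where it exists. A small probe (the commented values are what I would expect from gfortran on x86-64; other compilers may report differently, or return -1 if no such kind exists):

program probe_extended
   implicit none
   integer, parameter :: ep = selected_real_kind(p=18)  ! more than real64's 15 digits
   print *, 'kind number  =', ep                    ! gfortran/x86-64: 10 (80-bit x87)
   print *, 'precision    =', precision(1.0_ep)     ! 18 decimal digits
   print *, 'storage bits =', storage_size(1.0_ep)  ! 128, i.e. the 16-byte convention
end program probe_extended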

This general idea of accumulating extended precision values, in things like array sums and dot products, was popular in numerical analysis in the 1970s. It solved many problems related to numerical stability in differential equations, large condition numbers in linear algebra, and so on. The DPROD() fortran intrinsic was part of this too: it takes two default reals and returns the full, exact product in double precision for use in operations like dot products. However, DPROD() was never generalized to take arguments of nondefault kinds. I’ve always thought that it should have been, perhaps by adding a KIND= optional argument.
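
As an illustration of the DPROD() pattern (a sketch; the name ddot is mine): each product of two 24-bit significands fits exactly in the 53-bit double precision significand, so only the accumulation rounds.

function ddot(a, b) result(d)
   implicit none
   real, intent(in) :: a(:), b(:)   ! default (single precision) reals
   double precision :: d
   integer :: i
   d = 0d0
   do i = 1, size(a)
      d = d + dprod(a(i), b(i))     ! exact double precision product
   end do
end function ddot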

Yes, I see this in codes to this day that originated during that period. Every call to the summation routine, dot product, or anything matrix-related in the utils folder does all of its innermost calculations in double precision, while 99%+ of the rest of the code is default real (presumably 4 bytes).

Introducing yet another way to do sums: I tried using the intrinsic sum in higher precision in this program, where the first three methods are copied from previous contributors, but p2sum is my own.

program sums
  use, intrinsic :: iso_fortran_env, only: p1=>real32, & ! use real32 or real64
       compiler_version, compiler_options
  implicit none
  integer, parameter :: n=10**8, p2 = selected_real_kind(precision(1._p1)+1)
  real :: start,finish
  real(p1) :: s, x(n)
  character(*), parameter :: cfmt = '(a/a)', sfmt(2) = &
       ['(a,es16.07,1x,z0,1x,g0.3,a)','(a,es23.15,1x,z0,1x,g0.3,a)']
  integer :: whichp = merge(1,2,kind(x) == kind(1.0)), whichx
  write(*,cfmt) 'Compiler_version:',compiler_version()
  write(*,cfmt) 'Compiler_options:',compiler_options()
  call random_number(x)
  x = x * 0.5_p1 + 0.5_p1 ! values between 0.5 and 1.0 have the same exponent.
  do whichx = 1,2
     if(whichx==2) x = 256 * (x - 0.75_p1) ! mixed signs and exponents
     write(*,*) merge('0.5 <= x <= 1.0','-64 <= x <= +64',whichx==1)
     call cpu_time(start)
     s = sum(x)
     call cpu_time(finish)
     write(*,sfmt(whichp)) 'intrinsic p1:', s, s, finish-start, ' sec'
     call cpu_time(start)
     s = e_sum(x)
     call cpu_time(finish)
     write(*,sfmt(whichp)) 'extended  p2:', s, s, finish-start, ' sec'
     call cpu_time(start)
     s = r_sum(x)
     call cpu_time(finish)
     write(*,sfmt(whichp)) 'recursive p1:', s, s, finish-start, ' sec'
     call cpu_time(start)
     s = p2sum(x)
     call cpu_time(finish)
     write(*,sfmt(whichp)) 'intrinsic p2:', s, s, finish-start, ' sec'
  end do
contains
  function e_sum(x) ! extended precision summation.
    implicit none
    real(p1) :: e_sum
    real(p1), intent(in) :: x(:)
    integer :: i
    real(p2) :: temp
    temp = 0.0_p2
    do i = 1, size(x)
       temp = temp + real( x(i), p2 )
    enddo
    e_sum = real(temp,p1)
    return
  end function e_sum
  function r_sum(x) ! recursive summation.
    implicit none
    real(p1) :: r_sum
    real(p1), intent(in) :: x(:)
    integer :: i, j, p
    real(p1) :: psum( bit_size(1) - leadz(size(x)) )  ! max recursion depth.
    p = 0
    do i = 1, size(x)
       p = p + 1
       psum(p) = x(i)
       do j = 1, trailz(i)
          psum(p-1) = psum(p-1) + psum(p)
          p = p - 1
       enddo
    enddo
    do while (p > 1)  ! cleanup loop. accumulate in reverse order.
       psum(p-1) = psum(p-1) + psum(p)
       p = p - 1
    enddo
    r_sum = psum(1)
    return
  end function r_sum
  function p2sum(x) ! intrinsic higher precision sum
    real(p1) p2sum
    real(p1),intent(in):: x(:)
    p2sum = real(sum(real(x,p2)),p1)
  end function p2sum
    
end program sums

The output from gfortran was

Compiler_version:
GCC version 12.1.0
Compiler_options:
-mtune=generic -march=x86-64 -g -Og -Wall -Wuninitialized -Wconversion -Wsurprising -Wcharacter-truncation -Wpedantic -Wno-unused-dummy-argument -fmax-errors=10 -fcheck=all -ffpe-trap=invalid,zero,overflow -fbacktrace -fpre-include=/usr/include/finclude/math-vector-fortran.h
 0.5 <= x <= 1.0
intrinsic p1:   1.6777216E+07 4B800000 0.105 sec
extended  p2:   7.4999016E+07 4C8F0C9D 0.156 sec
recursive p1:   7.4999016E+07 4C8F0C9D 0.362 sec
intrinsic p2:   7.4999016E+07 4C8F0C9D 0.106 sec
 -64 <= x <= +64
intrinsic p1:  -2.5215947E+05 C8763FDE 0.104 sec
extended  p2:  -2.5201980E+05 C8761CF3 0.155 sec
recursive p1:  -2.5201975E+05 C8761CF0 0.362 sec
intrinsic p2:  -2.5201980E+05 C8761CF3 0.106 sec

Ifort gave

Compiler_version:
Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel(R) 64, Version 2021.9.0 Build 20230302_000000
Compiler_options:
-L/opt/intel//oneapi/compiler/latest/lib/pkgconfig/../..//linux/compiler/lib/intel64/ -L/lib/gcc/x86_64-linux-gnu/12 -lgcc -lgcc_s -liomp5 -assume protect_parens -check all -fmath-errno -ftrapuv -traceback -warn interface -standard-semantics -warn nostderrors -check noarg_temp_created -O0 -g -o sums_p1p2.f90oi
 0.5 <= x <= 1.0
intrinsic p1:   1.6777216E+07 4B800000 .229 sec
extended  p2:   7.4998536E+07 4C8F0C61 .321 sec
recursive p1:   7.4998536E+07 4C8F0C61 1.12 sec
intrinsic p2:   7.4998536E+07 4C8F0C61 .231 sec
 -64 <= x <= +64
intrinsic p1:  -3.7506603E+05 C8B72341 .228 sec
extended  p2:  -3.7500778E+05 C8B71BF9 .321 sec
recursive p1:  -3.7500778E+05 C8B71BF9 1.12 sec
intrinsic p2:  -3.7500778E+05 C8B71BF9 .232 sec

Why did both compilers give exactly 2.0**24 for the first result? I was surprised that gfortran was noticeably faster than ifort, but I was aiming for good bug hunting rather than high speed with both.

I suppose a problem with long threads like this one is that it can be difficult to read every post in detail.

The following program demonstrates the problem:

! test_spacing.f90
program test
real :: a, b, c
a = (2.0)**24
b = a + 1
c = a + spacing(a)
print '(G0,1X,Z8)', a, a
print '(G0,1X,Z8)', b, b
print '(G0,1X,Z8)', c, c
end program
~/fortran$ gfortran test_spacing.f90 
~/fortran$ ./a.out
16777216.0 4B800000
16777216.0 4B800000
16777218.0 4B800001

In other words, at 2^24 the spacing of 32-bit floating point numbers is already 2.0, so the sum can no longer capture increments of 1.0 or less.


I am surprised that this code was faster for gfortran than the loop version. With optimization, I would have expected both to compile to exactly the same instructions. Without optimization, I would have expected the loop to be faster due to the memory allocation and the extra memory references.

Regarding the 2^24 value, that is of course the upper bound for the summation when the values are between .5 and 1: once the partial sum reaches 2^24, the spacing there is 2.0, and adding a value of at most 1.0 (half the spacing, with ties rounding to even) leaves the sum unchanged, so the summation cannot possibly ever be larger than that. As to why the upper bound is actually attained, I think that is because 10**8 values is large enough that the sum reaches that value by chance. With smaller vector lengths, it might get close but not quite reach the upper bound, depending on the random numbers for each run.

The compiler is smart enough that it doesn’t create an array temporary. See my previous response here: Size of long array - #44 by ivanpribec

Update:

Under -O2 they compile to the same instructions, but under -Og as used by @Harper the compiler uses extra move instructions to traverse the array. You can inspect the instructions generated in Compiler Explorer.

And these quantities are allocated on the stack… (unless the programmer uses allocate() for them). Anyway, we are talking about a handful of scalar quantities for each recursion depth, so something of the order of 1 kB when dealing with a sum over 10**9 elements: that looks definitely reasonable for any system as of today.

As written above, in practice there’s no stack issue for the pairwise summation, IMO. However, your assertion that one just has to unlimit the stack size and that’s it is wrong. If setting the stack size to “unlimited” had no drawback, then it would be the default. It is not, even on Linux, and there are some reasons for that. Where I work, we are requested to write codes that do not require unreasonably large stack sizes.

This is true… and false. If the computations reach a point where the whole process is memory bound, then working with a real32 array should be 2x faster. The point is that the straightforward sum (with a loop) is not memory bound; however, it can become memory bound if one forces vectorization of the loop.

See the code below, which performs the summation of a real32 array into a real32 accumulator, the summation of a real32 array into a real64 accumulator, and the summation of a real64 array into a real64 accumulator. The loops are optionally vectorized with an !$OMP SIMD REDUCTION(+:xx) directive. Here are the outputs without and with vectorization:

$ gfortran -O3 sums.f90 && ./a.out 
 sum32/32=   16777216.0     time=   1.0866290000000001     
 sum32/64=   536851968.     time=   1.1117680000000001     
 sum64/64=   536851968.     time=   1.1423500000000000     

$ gfortran -O3 -fopenmp sums.f90 && ./a.out 
 sum32/32=   268435456.     time=  0.33729900000000002     
 sum32/64=   536897600.     time=  0.51473999999999998     
 sum64/64=   536897600.     time=  0.67588800000000004     

Observations:

  • Without vectorization the timings are similar.
  • Operations on real64 are only slightly slower than on real32.
  • Converting real32 to real64 has a visible but very limited impact.
  • With vectorization the full real32 loop is 2x faster than the full real64 loop, showing that the code is now memory bound.
  • The impact of the real32 to real64 conversion is much more visible; still, this is significantly faster than using a real64 array.

program sums
use iso_fortran_env
implicit none

real(real32), allocatable :: a32(:)
real(real64), allocatable :: a64(:)
integer, parameter :: N=2**28, ITER=4
real(real32) :: s32
real(real64) :: s64
integer :: i, j
integer(int64) :: tic, toc, rate

allocate(a64(N))
call random_number(a64)
a32 = a64


call system_clock(tic,rate)
s32 = 0.0
do j = 1, ITER
   !$OMP SIMD REDUCTION(+:s32)
   do i = 1, N
      s32 = s32 + a32(i)
   end do
end do
call system_clock(toc,rate)
write(*,*) "sum32/32=", s32, "time=", real(toc-tic,kind=real64)/rate


call system_clock(tic,rate)
s64 = 0.0
do j = 1, ITER
   !$OMP SIMD REDUCTION(+:s64)
   do i = 1, N
      s64 = s64 + a32(i)
   end do
end do
call system_clock(toc,rate)
write(*,*) "sum32/64=", real(s64,kind=real32), "time=", real(toc-tic,kind=real64)/rate


call system_clock(tic,rate)
s64 = 0.0
do j = 1, ITER
   !$OMP SIMD REDUCTION(+:s64)
   do i = 1, N
      s64 = s64 + a64(i)
   end do
end do
call system_clock(toc,rate)
write(*,*) "sum64/64=", real(s64,kind=real32), "time=", real(toc-tic,kind=real64)/rate

end program

@PierU, I guess I have been fortunate, because all the Linux workstations I’ve ever used, and all of the government and university HPC systems I’ve used that run Linux, never limited the amount of stack space I could access. If I were you, I would be asking why you are being restricted in stack size. In my experience, sys-admins rarely give you an answer that doesn’t sound like “a dog ate my homework”, and when you spend the time to dig deeper you find there is no real reason for doing it.

As to the problem of sums, I’ve found that the FABsum algorithm in the SIAM paper by Blanchard, Higham, and Mary referenced above is the fastest of the accurate algorithms when using the Intel compiler. The following is a translation of the MATLAB code from the GitLab site referenced in the paper:

Module fab_sum

  USE ISO_C_BINDING
  USE ISO_FORTRAN_ENV, ONLY: REAL32, REAL64

  Implicit NONE
  PRIVATE

  Interface rsum
    Module Procedure rsum32
    Module Procedure rsum64
  End Interface

  Interface FABsum
    Module Procedure FABsum32
    Module Procedure FABsum64
  End Interface

  integer, parameter :: DEFAULT_BLOCKSIZE=128

  Public :: FABsum, rsum

Contains

  pure function rsum32(a,chop) result(rsum)

    real(REAL32),           intent(in) :: a(:)
    logical,      optional, intent(in) :: chop
    real(REAL32)                       :: rsum

    integer      :: i, n
    logical      :: chopr64
    real(real64) :: s64

    chopr64 = .FALSE.
    If (present(chop)) chopr64 = chop

    n = SIZE(a)

    if (chopr64) then
      s64 = REAL(a(1),REAL64)
      Do i=2, n
        s64 = s64 + REAL(a(i), REAL64)
      End Do
      rsum = REAL(s64, REAL32)
    else
      rsum = a(1)
      do i=2,n
        rsum = rsum + a(i)
      end do
    end if

  end function rsum32

  pure function rsum64(a) result(rsum)

    real(REAL64), intent(in) :: a(:)
    real(REAL64)             :: rsum

    integer      :: i, n

    n = SIZE(a)
    rsum = a(1)
    do i=2,n
      rsum = rsum + a(i)
    end do

  end function rsum64

  pure function FABsum32(a, block_size) result(FABsum)

    real(REAL32),           intent(in) :: a(:)
    integer,      optional, intent(in) :: block_size
    real(REAL32)                       :: FABsum

    integer      :: i, n, is, ie, npart, b
    real(REAL32) :: s, e, temp, y

    n = SIZE(a)
    b = DEFAULT_BLOCKSIZE
    if (present(block_size)) b = block_size
    npart = (n + b - 1)/b   ! number of blocks; n/b + 1 would add an empty trailing block when b divides n

    s = 0.0_REAL32
    e = 0.0_REAL32
    do i=1,npart
      is   = (i-1)*b + 1
      ie   = MIN(i*b,n)
      temp = s
      y    = e + rsum(a(is:ie))
      s    = temp + y
      e    = y + (temp - s)
    end do

    FABsum = s

  end function FABsum32

  pure function FABsum64(a, block_size) result(FABsum)

    real(REAL64),           intent(in) :: a(:)
    integer,      optional, intent(in) :: block_size
    real(REAL64)                       :: FABsum

    integer      :: i, n, is, ie, npart, b
    real(REAL64) :: s, e, temp, y

    n = SIZE(a)
    b = DEFAULT_BLOCKSIZE
    if (present(block_size)) b = block_size

    npart = (n + b - 1)/b   ! number of blocks; n/b + 1 would add an empty trailing block when b divides n

    s = 0.0_REAL64
    e = 0.0_REAL64
    do i=1, npart
      is   = (i-1)*b + 1
      ie   = MIN(i*b,n)
      temp = s
      y    = e + rsum(a(is:ie))
      s    = temp + y
      e    = y + (temp - s)
    end do

    FABsum = s

  end function FABsum64

End Module fab_sum
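
For reference, a minimal driver for the module above (my own, not the benchmark harness used for the timings below), reusing the constant-vector test case from earlier in the thread:

program try_fabsum
   use fab_sum, only: FABsum, rsum
   implicit none
   integer, parameter :: n = 10**8 + 1
   real, allocatable :: a(:)
   allocate(a(n))
   a(1)  = 1.0
   a(2:) = 1.234567891234567e-02
   print *, 'rsum   =', rsum(a)                    ! plain running-sum loop
   print *, 'FABsum =', FABsum(a)                  ! default block size of 128
   print *, 'FABsum =', FABsum(a, block_size=256)  ! explicit block size
end program try_fabsum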

Here are some timings for the 100M test case I posted above using the default block size of 128 used in the paper. I compiled with -O3 -xHost -mavx2

  Test of REAL32 summations

   FABsum block size =          128

  recursive loop                  =    1398584.      time =  1.560000330209732E-002
  recursive loop with 64 bit chop =    1234569.      time =  3.184799849987030E-002
  intrinsic SUM                   =    524288.0      time =  3.182400763034821E-002
  recursive pair sum              =    1234569.      time =  3.374199569225311E-002
  FABsum                          =    1234569.      time =  1.746800541877747E-002

  Correct sum   =    1234569.

  intrinsic sum speed over pair sum speed  =   0.943157242997774
  intrinsic sum speed over FABsum speed    =    1.82184553229750

  Test of REAL64 summations

   FABsum block size =          128

  recursive loop                  =    1234568.89122361       time =  2.625399827957153E-002
  intrinsic SUM                   =    1234568.89114850       time =  3.363801538944244E-002
  recursive pair sum              =    1234568.89123457       time =  3.789800405502319E-002
  FABsum                          =    1234568.89123457       time =  2.886599302291870E-002

  Correct sum   =    1234568.89123457

  intrinsic sum speed over pair sum speed  =   0.887593323928200
  intrinsic sum speed over FABsum speed    =    1.16531641100082

I tried the same code with gfortran-11 and saw similar times. However, the gfortran code would not reproduce the “exact” answer like Intel does. I’m still playing around with the code to try to figure out why; maybe a later version of gfortran would give a better result.

I did. The answer I got was that when the stack size is limited, the space is reserved in virtual memory from the start of the process, and the heap cannot grab it. That means that as long as you don’t reach the limit, your process is safe.

In contrast, with an unlimited stack size nothing is reserved, and if your process runs concurrently with other processes that have huge memory requirements (including yours), you never know at which point your process may be short of stack space.

I don’t know if it’s technically correct, but at least that is the reason I was given.

And I know that the problem is similar for the heap. And yes, the heap size is never artificially limited… or is it? Actually it can be, and sometimes it is on shared computing systems.


In pre-f77 code, all data would have been allocated statically; nothing would have been on a stack (except for function arguments and return addresses). In f77, I think compilers had begun to mix stack and static allocation of local data, but even in f77 everything could still be done statically. It was only with f90, with the introduction of automatic variables and recursion, that stack allocation was actually required in order to support the language features.

I’m curious whether you compared times and memory requirements for the recursive algorithms with and without recursive function calls? The code I posted without recursive function calls has this declaration:

   real :: psum( bit_size(1) - leadz(size(x)) )  ! max recursion depth.

That automatic array probably uses stack allocation with modern compilers, but the allocation is done once, not once per recursive call. Its maximum size is 31 (with 32-bit signed integers), so in f77 it would have just been allocated statically as real psum(31). What is the number of function calls? If the recursion is carried all the way down to single elements, then there are size(x)-1 function calls. Without the recursive calls, when the programmer manages everything, there is a single function call.

At least to my eye, the code with recursive function calls is simpler to write and easier to understand. When the programmer takes on the stack and recursion logic, it looks more complicated. But it is often faster when done that way, simply because it avoids the recursive function calls. I don’t know how to measure how much memory is actually used with each recursive function invocation; it is more than a single real variable, since it includes all the local entities within the recursive function. Like so many other things in programming, the implementation of the algorithm involves choices that the programmer must make. At least now, with modern fortran, the programmer has that choice – with f77 and earlier, he didn’t.
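
For comparison, the recursive-call formulation being discussed looks something like this minimal sketch (not anyone’s exact code from the thread; the base-case threshold of 8 is arbitrary):

   recursive function pr_sum(x) result(s)
      implicit none
      real, intent(in) :: x(:)
      real :: s
      integer :: m
      if (size(x) <= 8) then
         s = sum(x)                             ! small base case: plain summation
      else
         m = size(x)/2
         s = pr_sum(x(:m)) + pr_sum(x(m+1:))    ! split in half and recurse
      end if
   end function pr_sum

With the recursion carried down to small blocks like this, the call count is proportional to size(x); enlarging the base case is the usual way to amortize the call overhead.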

For things like summations, I think the general consensus is that the compensated algorithms all take 2-4 times the effort of the simple loop, and the extended precision summations take a comparable effort. Thus the simpler extended precision accumulation would probably be easier to write and maintain when it is available; if an extended precision KIND is not available for the real KIND of interest, then a compensated summation algorithm would be used. I’m not sure yet exactly how pairwise summation algorithms fit into that picture - what kinds of vector values work well for them, what kinds don’t work well, and so on.

Indeed…

It’s not only a problem of the availability of an extended precision kind; it’s also about performance. If one has to use an emulated quadruple precision accumulator in order to sum a double precision array, performance can drop.

The pairwise summation could be used when this problem arises.

That is also when a compensated summation algorithm might be used. Do you think a pairwise recursive algorithm is both faster and more accurate?

That is why I said I’m not sure how a pairwise algorithm fits into the picture. What exactly is the domain where it is the best choice?

BTW, I did some timings comparing the pairwise algorithm with and without recursive function calls. It looks like the recursive function calls add about 50% overhead. That’s not so bad; I expected to see a 4x or larger cost factor.

If you implement pairwise well, it’s basically free.