Some Intrinsic SUMS

@Beliavsky, After playing around a little more here are a couple of things I found related to using Intel’s compilers. For the LANL kahan vector sum routines, compiling with icx instead of icc (with options -Ofast -xHost -mavx2) and linking to ifort generated Fortran objects on my Zen 3 processor gives the wrong answers (as in not exactly equal to the correct answers) compiling them with icc works. Using -Ofast instead of -O3 also helps with the speed. The interesting thing for fsum_v is for the test program I posted above, the times for REAL32 and REAL64 are about the same.

Here are some new timings.

REAL32

loop = 524288.0 time = 3.197800368070602E-002
Fortran serial Kahan = 1234569. time = 0.247595988214016
Fortran vector Kahan = 1234569. time = 8.467298746109009E-002
LANL vector Kahan = 1234569. time = 2.398902177810669E-002
intrinsic SUMM = 524288.0 time = 3.185501694679260E-002

Correct sum = 1234569.

REAL64

loop = 1234568.89114850 time = 3.342800587415695E-002
Fortran serial Kahan = 1234568.89123457 time =
0.247623980045319
Fortran vector Kahan = 1234568.89123457 time =
8.850398659706116E-002
LANL vector Kahan = 1234568.89123457 time =
4.786601662635803E-002
intrinsic SUMM = 1234568.89114850 time =
3.257399797439575E-002

Correct sum = 1234568.89123457

I reworked the LANL code to be truly 32 bit.

Here is my implementation:

#include <x86intrin.h>

static float sum[8] __attribute__ ((aligned (32)));

float kahan_intel_vector_32(float* restrict var, long ncells)
{
#ifndef __PGI
   float const zero = 0.0;
   __m256 local_sum = _mm256_broadcast_ss((float const*) &zero);
   __m256 local_correction = _mm256_broadcast_ss((float const*) &zero);
   __m256 var_v;

#ifdef __INTEL_COMPILER
   #pragma ivdep
#else
   #pragma simd
#endif
   #pragma vector aligned
   for (long i = 0; i < ncells; i+=8) {
       var_v = _mm256_load_ps(&var[i]);
       __m256 corrected_next_term = var_v + local_correction;
       __m256 new_sum = local_sum + local_correction;
       local_correction = corrected_next_term - (new_sum - local_sum);
       local_sum = new_sum;
   }
   __m256 sum_v;
   sum_v  = local_correction;
   sum_v += local_sum;
   _mm256_store_ps(sum, sum_v);

   struct esum_type{
      float sum;
      float correction;
   } local;
   local.sum = 0.0;
   local.correction = 0.0;

   for (long i = 0; i < 8; i++) {
      float corrected_next_term_s = sum[i] + local.correction;
      float new_sum_s             = local.sum + local.correction;
      local.correction   = corrected_next_term_s - (new_sum_s - local.sum);
      local.sum          = new_sum_s;
   }
   float final_sum = local.sum + local.correction;
#else
   float final_sum = 0.0;
#endif
   return(final_sum);
}

Here is my C-interop code

Module fast_sums

  USE ISO_C_BINDING
  USE ISO_FORTRAN_ENV, ONLY: INT32, INT64, REAL32, REAL64

  Implicit NONE
  PRIVATE

  Interface fsum
    Module Procedure fsum_32
    Module Procedure fsum_64
  End Interface

  Public fsum

Contains

  Function fsum_32(a,n) Result(s)

    Integer,      Intent(IN) :: n
    Real(REAL32), Intent(IN) :: a(n)
    Real(REAL32)             :: s

    Interface
      Function c_fsum_32(a, ncells) BIND(C,name="kahan_intel_vector_32")
        IMPORT :: C_FLOAT, C_LONG
        Real(C_FLOAT),   Intent(IN) :: a(*)
        Integer(C_LONG), VALUE      :: ncells
        Real(C_FLOAT)               :: c_fsum_32
      End Function c_fsum_32
    End Interface

    Integer(C_LONG) :: c_ncells


    If (n < 8) Then
      s = SUM(a(1:n))
    Else
      c_ncells = INT(n, C_LONG)
      s        = c_fsum_32(a, c_ncells)
    End If

  End Function fsum_32

 Function fsum_64(a,n) Result(s)

    Integer,      Intent(IN) :: n
    Real(REAL64), Intent(IN) :: a(n)
    Real(REAL64)             :: s

    Interface
      Function  c_fsum_64(a, ncells) BIND(C,name="kahan_intel_vector_64")
        IMPORT :: C_DOUBLE, C_LONG
        Real(C_DOUBLE),  Intent(IN) :: a(*)
        Integer(C_LONG), VALUE      :: ncells
        Real(C_DOUBLE)              :: c_fsum_64
     End Function c_fsum_64

    End Interface

    Integer(C_LONG) :: c_ncells


    If (n < 4) Then
      s = SUM(a(1:n))
    Else
      c_ncells = INT(n, C_LONG)
      s        = c_fsum_64(a, c_ncells)
    End If

  End Function fsum_64

End Module fast_sums

The 64 bit version of the LANL code is the same as the version on the LANL GlobalSums github site. My only mod was to change the function name to kahan_intel_vector_64. Also, CHUNK = 8 appeared to be the sweet spot for fsum_v

Thank you so much @rwmsu and @Beliavsky , this is an excellent sampling of methods/compilers/hardware. On my own system (AMD 5800X, gfortran 13.1.1) I get the following timings for real64 with -O3 -march=native:

                       intrinsic                ksum_v               pairsum              sum_quad
   value   50001774.111924558878 50001774.111902222037 50001774.111902222037 50001774.111902214587
   |error|        0.000022344291        0.000000007451        0.000000007451        0.000000000000
   time           0.064962000000        0.037743000000        0.041773000000        1.869729000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   49997773.206654027104 49997773.206675805151 49997773.206675797701 49997773.206675797701
   |error|        0.000021770597        0.000000007451        0.000000000000        0.000000000000
   time           0.065199000000        0.037598000000        0.041618000000        1.872213000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   50001552.732013523579 50001552.732035480440 50001552.732035472989 50001552.732035480440
   |error|        0.000021956861        0.000000000000        0.000000007451        0.000000000000
   time           0.065174000000        0.037981000000        0.041781000000        1.872009000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   49997898.828710362315 49997898.828720159829 49997898.828720152378 49997898.828720159829
   |error|        0.000009797513        0.000000000000        0.000000007451        0.000000000000
   time           0.065125000000        0.037641000000        0.041329000000        1.872546000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   49999586.593744568527 49999586.593743167818 49999586.593743175268 49999586.593743167818
   |error|        0.000001400709        0.000000000000        0.000000007451        0.000000000000
   time           0.065180000000        0.037680000000        0.041244000000        1.872239000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   49998894.620561607182 49998894.620571263134 49998894.620571255684 49998894.620571263134
   |error|        0.000009655952        0.000000000000        0.000000007451        0.000000000000
   time           0.065198000000        0.037606000000        0.041311000000        1.872259000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   49999039.065408855677 49999039.065418615937 49999039.065418623388 49999039.065418623388
   |error|        0.000009767711        0.000000007451        0.000000000000        0.000000000000
   time           0.065163000000        0.037580000000        0.041263000000        1.873081000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   49998743.145561598241 49998743.145547457039 49998743.145547457039 49998743.145547449589
   |error|        0.000014148653        0.000000007451        0.000000007451        0.000000000000
   time           0.065184000000        0.037784000000        0.041401000000        1.873880000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   50002901.606332279742 50002901.606322593987 50002901.606322601438 50002901.606322601438
   |error|        0.000009678304        0.000000007451        0.000000000000        0.000000000000
   time           0.065166000000        0.037583000000        0.041293000000        1.874839000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   49999507.260967962444 49999507.260967038572 49999507.260967038572 49999507.260967038572
   |error|        0.000000923872        0.000000000000        0.000000000000        0.000000000000
   time           0.065217000000        0.037613000000        0.041414000000        1.874796000000

Interestingly, benchmarks like these also highlight how effective inlining can be, see when using the flags -O3 -march=native -flto -fwhole-program:

                       intrinsic                ksum_v               pairsum              sum_quad
   value   50001755.763393588364 50001755.763394705951 50001755.763394713402 50001755.763394713402
   |error|        0.000001125038        0.000000007451        0.000000000000        0.000000000000
   time           0.063389000000        0.036643000000        0.023472000000        1.798793000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   50000111.560542389750 50000111.560532115400 50000111.560532107949 50000111.560532107949
   |error|        0.000010281801        0.000000007451        0.000000000000        0.000000000000
   time           0.063425000000        0.036531000000        0.023299000000        1.798704000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   49994522.241867199540 49994522.241878405213 49994522.241878405213 49994522.241878412664
   |error|        0.000011213124        0.000000007451        0.000000007451        0.000000000000
   time           0.063424000000        0.036522000000        0.023398000000        1.798426000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   50005795.038908720016 50005795.038928024471 50005795.038928024471 50005795.038928024471
   |error|        0.000019304454        0.000000000000        0.000000000000        0.000000000000
   time           0.063409000000        0.036545000000        0.023352000000        1.798384000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   50000008.428244762123 50000008.428249299526 50000008.428249299526 50000008.428249292076
   |error|        0.000004529953        0.000000007451        0.000000007451        0.000000000000
   time           0.063460000000        0.036529000000        0.023399000000        1.798362000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   49999029.123580917716 49999029.123581461608 49999029.123581446707 49999029.123581454158
   |error|        0.000000536442        0.000000007451        0.000000007451        0.000000000000
   time           0.063436000000        0.036538000000        0.023311000000        1.798300000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   49995668.560971729457 49995668.560941107571 49995668.560941100121 49995668.560941100121
   |error|        0.000030629337        0.000000007451        0.000000000000        0.000000000000
   time           0.063413000000        0.036512000000        0.023364000000        1.799789000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   49997107.842044033110 49997107.842042587698 49997107.842042587698 49997107.842042587698
   |error|        0.000001445413        0.000000000000        0.000000000000        0.000000000000
   time           0.063442000000        0.036557000000        0.023319000000        1.798573000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   49999342.662176750600 49999342.662189945579 49999342.662189945579 49999342.662189945579
   |error|        0.000013194978        0.000000000000        0.000000000000        0.000000000000
   time           0.063418000000        0.036501000000        0.023387000000        1.799013000000

                       intrinsic                ksum_v               pairsum              sum_quad
   value   49995028.967114172876 49995028.967113137245 49995028.967113137245 49995028.967113137245
   |error|        0.000001035631        0.000000000000        0.000000000000        0.000000000000
   time           0.063425000000        0.036522000000        0.023376000000        1.798460000000

In this case it only seems to have really affected the pairsum routine, but for that function the impact is nearly 50%. See code below:

        pure recursive function pairsum(a) result(bout)
            real(wp), intent(in) :: a(:)
            real(wp) :: bout
            integer :: n
            n = size(a)
            if (n.le.32) then
                bout = sum(a)
            else
                bout = pairsum(a(1:n/2)) + pairsum(a(n/2+1:n))
            end if
        end function pairsum 

Thanks. I am also finding pairsum to be fast and accurate with the compiler options you used.

revised code
module ksum_mod
implicit none
public :: wp, sum_quad, ksum_v, pairsum
integer, parameter :: wp=kind(1.0d0), CHUNK=8, qp = selected_real_kind(32)
contains
pure function sum_quad(a) result(qsum)
real(WP), intent(in) :: a(:)
real(WP)             :: qsum
qsum = sum(real(a, kind=qp))
end function sum_quad

 pure function ksum_v(a) result(ksum)

! Fortran vector version of kahan_sum
! CHUNK is the desired vector length
! WP is the working precision (set to either REAL32 or REAL64)

    real(WP), intent(in) :: a(:)
    real(WP)             :: ksum

    integer :: i, n, jend, kend

    real (WP) :: ks(CHUNK), temp(CHUNK), t(CHUNK), c(CHUNK)

    n    = SIZE(a)
    ks   = 0.0_WP
    temp = 0.0_WP
    t    = 0.0_WP
    c    = 0.0_WP

   do i=1,n,CHUNK
      jend         = MERGE(i+CHUNK-1,n, (n-i+1)>=CHUNK)
      kend         = MERGE((n-i)+1, CHUNK, (n-i+1)<CHUNK)
      temp(1:kend) = a(i:jend)  - c(1:kend)
      t(1:kend)    = ks(1:kend) + temp(1:kend)
      c(1:kend)    = (t(1:kend) - ks(1:kend)) - temp(1:kend)
      ks(1:kend)   = t(1:kend)
    end do

    ksum = SUM(ks(1:CHUNK))

  end function ksum_v

pure recursive function pairsum(a) result(bout)
    real(wp), intent(in) :: a(:)
    real(wp) :: bout
    integer :: n
    n = size(a)
    if (n.le.32) then
       bout = sum(a)
    else
       bout = pairsum(a(1:n/2)) + pairsum(a(n/2+1:n))
    end if
end function pairsum 

end module ksum_mod

program main
use ksum_mod, only: wp, sum_quad, ksum_v, pairsum
implicit none
integer, parameter :: n = 10**8, ncalc = 4, niter = 10
integer :: iter
real(kind=wp) :: x(n), xsum(ncalc), times(0:ncalc)
character (len=*), parameter :: fmt_cr = "(a10,*(f22.12))"
call random_seed()
do iter=1,niter
   call random_number(x)
   call cpu_time(times(0))
   xsum(1) = sum(x)      ; call cpu_time(times(1))
   xsum(2) = ksum_v(x)   ; call cpu_time(times(2))
   xsum(3) = pairsum(x)  ; call cpu_time(times(3))
   xsum(4) = sum_quad(x) ; call cpu_time(times(4))
   print "(/,a10,*(a22))", "", "intrinsic", "ksum_v", "pairsum", "sum_quad"
   print fmt_cr,"value  ",xsum
   print fmt_cr,"|error|",abs(xsum - xsum(ncalc))
   print fmt_cr,"time   ",times(1:) - times(0:ncalc-1)
end do
end program main

One may even slightly expand the cutoff region and combine with OpenMP for SIMD reduction speedup while maintaining accuracy as so:

        recursive function simdsum(a) result(bout)
            real(wp), intent(in) :: a(:)
            real(wp) :: bout
            integer :: i, n
            n = size(a)
            if (n.le.256) then
                bout = 0.0
                !$omp simd reduction (+:bout)
                do i=1,n
                    bout = bout + a(i)
                end do
                !$omp end simd
            else
                bout = simdsum(a(1:n/2)) + simdsum(a(n/2+1:n))
            end if
        end function simdsum

But, I think the real treat with the pairwise summation algorithms over other methods for increased accuracy is that it still works with -Ofast or more specifically -ffast-math, which may break other algorithms seeking increased accuracy.

The sum intrinsic has an optional mask argument. Sum(x, mask=mask) is the same as
sum(pack(x, mask)), but I find the former to be considerably faster (which I guess is why sum and other array intrinsics have an optional mask argument). Likewise, a version of pairsum that takes a mask argument is faster than pairsum(pack(x, mask)), but pairsum with a mask is no longer faster than the intrinsic sum.

code
module ksum_mod
implicit none
public :: wp, sum_quad, sum_quad_mask, ksum_v, pairsum, pairsum_explicit_shape, &
          pairsum_mask
integer, parameter :: wp=kind(1.0d0), CHUNK=8, qp = selected_real_kind(32)
contains
pure function sum_quad(a) result(qsum)
real(WP), intent(in) :: a(:)
real(WP)             :: qsum
qsum = sum(real(a, kind=qp))
end function sum_quad
!
pure function sum_quad_mask(a,mask) result(qsum)
real(WP), intent(in) :: a(:)
logical , intent(in) :: mask
real(WP)             :: qsum
qsum = sum(real(a, kind=qp), mask=mask)
end function sum_quad_mask
!
 pure function ksum_v(a) result(ksum)

! Fortran vector version of kahan_sum
! CHUNK is the desired vector length
! WP is the working precision (set to either REAL32 or REAL64)

    real(WP), intent(in) :: a(:)
    real(WP)             :: ksum

    integer :: i, n, jend, kend

    real (WP) :: ks(CHUNK), temp(CHUNK), t(CHUNK), c(CHUNK)

    n    = SIZE(a)
    ks   = 0.0_WP
    temp = 0.0_WP
    t    = 0.0_WP
    c    = 0.0_WP

   do i=1,n,CHUNK
      jend         = MERGE(i+CHUNK-1,n, (n-i+1)>=CHUNK)
      kend         = MERGE((n-i)+1, CHUNK, (n-i+1)<CHUNK)
      temp(1:kend) = a(i:jend)  - c(1:kend)
      t(1:kend)    = ks(1:kend) + temp(1:kend)
      c(1:kend)    = (t(1:kend) - ks(1:kend)) - temp(1:kend)
      ks(1:kend)   = t(1:kend)
    end do

    ksum = SUM(ks(1:CHUNK))

  end function ksum_v
!

pure recursive function pairsum(a) result(bout)
    real(wp), intent(in) :: a(:)
    real(wp) :: bout
    integer :: n
    n = size(a)
    if (n.le.32) then
       bout = sum(a)
    else
       bout = pairsum(a(1:n/2)) + pairsum(a(n/2+1:n))
    end if
end function pairsum 

pure recursive function pairsum_mask(a, mask) result(bout)
    real(wp), intent(in) :: a(:)
    logical , intent(in) :: mask(:)
    real(wp) :: bout
    integer :: n
    n = size(a)
    if (n.le.32) then
       bout = sum(a, mask)
    else
       bout = pairsum_mask(a(1:n/2), mask(1:n/2)) + &
              pairsum_mask(a(n/2+1:n), mask(n/2+1:n))
    end if
end function pairsum_mask

pure recursive function pairsum_explicit_shape(n, a) result(bout)
    integer, intent(in) :: n
    real(wp), intent(in) :: a(n)
    real(wp) :: bout
    integer  :: nhalf
    if (n.le.32) then
       bout = sum(a)
    else
       nhalf = n/2
       bout = pairsum_explicit_shape(nhalf  , a(1:nhalf)) + &
              pairsum_explicit_shape(n-nhalf, a(nhalf+1:n))
    end if
end function pairsum_explicit_shape

end module ksum_mod

program main
use ksum_mod, only: wp, sum_quad, ksum_v, pairsum, pairsum_mask
implicit none
integer, parameter :: n = 5*10**7, ncalc = 6, niter = 10
integer :: iter
real(kind=wp) :: x(n), y(n), xsum(ncalc), times(0:ncalc)
logical :: xmask(n)
character (len=*), parameter :: fmt_cr = "(a10,*(f22.12))"
call random_seed()
do iter=1,niter
   call random_number(x)
   call random_number(y)
   xmask = y > 0.5_wp
   call cpu_time(times(0))
   xsum(1) = sum(x, mask=xmask)       ; call cpu_time(times(1))
   xsum(2) = sum(pack(x,xmask))       ; call cpu_time(times(2))
   xsum(3) = ksum_v(pack(x, xmask))   ; call cpu_time(times(3))
   xsum(4) = pairsum_mask(x, xmask)   ; call cpu_time(times(4))
   xsum(5) = pairsum(pack(x, xmask))  ; call cpu_time(times(5))
   xsum(6) = sum_quad(pack(x, xmask)) ; call cpu_time(times(6))
   print "(/,a10,*(a22))", "", "sum_mask", "sum_pack", "ksum_v", "pairsum_mask", &
                               "pairsum_pack", "sum_quad"
   print fmt_cr,"value  ",xsum
   print fmt_cr,"|error|",abs(xsum - xsum(ncalc))
   print fmt_cr,"time   ",times(1:) - times(0:ncalc-1)
end do
end program main

To apply pairsum in production codes, I guess one could write a script that looks for cases where sum is applied to a 1-D array of reals and replace those calls with pairsum?

I am not sure it would be possible to completely replace the intrinsic sum with this type of method, for all intrinsic types, maintain the dim and mask functionality, and remain faster + more accurate. I do intend to try, and see what issue I run into, but that is a slightly longer project for another day. Perhaps this weekend I will update this post with some progress.

There are 2 ways to bring the pairsum to have similar options as the intrinsic sum to make it safer to replace the intrinsic wherever you deem safe to do so. Using an interface (psum) or using optionals (psumop):

pair sum
module psum_mod
use iso_fortran_env
implicit none
public :: wp, psum, psumop
integer, parameter :: wp=real64

interface psum
    module procedure pairsum
    module procedure pairsum_mask
end interface

contains

pure recursive function pairsum(a) result(bout)
    real(wp), intent(in) :: a(:)
    real(wp) :: bout
    integer :: n
    n = size(a)
    if (n.le.32) then
       bout = sum(a)
    else
       bout = pairsum(a(1:n/2)) + pairsum(a(n/2+1:n))
    end if
end function pairsum 

pure recursive function pairsum_mask(a,mask) result(bout)
    real(wp), intent(in) :: a(:)
    logical, intent(in) :: mask(:)
    real(wp) :: bout
    integer :: n
    n = size(a)
    if (n.le.32) then
       bout = sum(a,mask)
    else
       bout = pairsum_mask(a(1:n/2),mask(1:n/2)) + pairsum_mask(a(n/2+1:n),mask(n/2+1:n))
    end if
end function pairsum_mask 

pure recursive function psumop(a,mask) result(bout)
    real(wp), intent(in) :: a(:)
    logical, intent(in), optional:: mask(:)
    real(wp) :: bout
    integer :: n
    n = size(a)
    if(.not.present(mask)) then
    if (n.le.32) then
       bout = sum(a)
    else
       bout = psumop(a(1:n/2)) + psumop(a(n/2+1:n))
    end if
    else
    if (n.le.32) then
       bout = sum(a,mask)
    else
       bout = psumop(a(1:n/2),mask(1:n/2)) + psumop(a(n/2+1:n),mask(n/2+1:n))
    end if
    end if
end function psumop 

end module psum_mod

program main
use psum_mod, only: wp, psum, psumop
implicit none
integer, parameter :: n = 1e7, ncalc = 3, niter = 10
integer :: iter
real(kind=wp) :: x(n),y(n), xsum(ncalc), times(0:ncalc)
logical :: xmask(n)
character (len=*), parameter :: fmt_cr = "(a10,*(f22.12))"
call random_seed()
do iter=1,niter
   call random_number(x)
   call random_number(y)
   xmask = y > 0.5_wp
   call cpu_time(times(0))
   xsum(1) = sum(x, mask=xmask)      ; call cpu_time(times(1))
   xsum(2) = psum(x, mask=xmask)     ; call cpu_time(times(2))
   xsum(3) = psumop(x, mask=xmask) ; call cpu_time(times(3))
   print "(/,a10,*(a22))", "", "intrinsic", "pairsum", "pairsumop"
   print fmt_cr,"value  ",xsum
   print fmt_cr,"|error|",abs(xsum - xsum(ncalc))
   print fmt_cr,"time   ",times(1:) - times(0:ncalc-1)
end do
end program main
gfortran 13.1 with -O3
                      intrinsic               pairsum             pairsumop
   value    2499431.050868517254  2499431.050868323538  2499431.050868323538
   |error|        0.000000193715        0.000000000000        0.000000000000
   time           0.061566000000        0.066122000000        0.077667000000

                       intrinsic               pairsum             pairsumop
   value    2498096.828970524017  2498096.828970700502  2498096.828970700502
   |error|        0.000000176486        0.000000000000        0.000000000000
   time           0.061986000000        0.067264000000        0.078617000000

                       intrinsic               pairsum             pairsumop
   value    2499812.911002283450  2499812.911002350040  2499812.911002350040
   |error|        0.000000066590        0.000000000000        0.000000000000
   time           0.060234000000        0.065094000000        0.077779000000

                       intrinsic               pairsum             pairsumop
   value    2499574.012997150421  2499574.012996914797  2499574.012996914797
   |error|        0.000000235625        0.000000000000        0.000000000000
   time           0.060748000000        0.066549000000        0.078401000000

                       intrinsic               pairsum             pairsumop
   value    2499346.838352363091  2499346.838352249470  2499346.838352249470
   |error|        0.000000113621        0.000000000000        0.000000000000
   time           0.062522000000        0.064719000000        0.074323000000

                       intrinsic               pairsum             pairsumop
   value    2498131.836015343666  2498131.836015568115  2498131.836015568115
   |error|        0.000000224449        0.000000000000        0.000000000000
   time           0.061447000000        0.067101000000        0.077593000000

                       intrinsic               pairsum             pairsumop
   value    2500425.956582235638  2500425.956582487561  2500425.956582487561
   |error|        0.000000251923        0.000000000000        0.000000000000
   time           0.058957000000        0.062763000000        0.082188000000

                       intrinsic               pairsum             pairsumop
   value    2501297.238809194881  2501297.238809351809  2501297.238809351809
   |error|        0.000000156928        0.000000000000        0.000000000000
   time           0.063083000000        0.067556000000        0.080685000000

                       intrinsic               pairsum             pairsumop
   value    2499745.321712223813  2499745.321712172125  2499745.321712172125
   |error|        0.000000051688        0.000000000000        0.000000000000
   time           0.062663000000        0.067312000000        0.079224000000

                       intrinsic               pairsum             pairsumop
   value    2498784.273390782531  2498784.273390878458  2498784.273390878458
   |error|        0.000000095926        0.000000000000        0.000000000000
   time           0.057145000000        0.058294000000        0.065160000000
ifx 2022.2.1 with -O3
                     intrinsic               pairsum             pairsumop
   value    2499986.848785410635  2499986.848785587586  2499986.848785587586
   |error|        0.000000176951        0.000000000000        0.000000000000
   time           0.036627000000        0.042614000000        0.042874000000

                       intrinsic               pairsum             pairsumop
   value    2499234.623811629601  2499234.623811814003  2499234.623811814003
   |error|        0.000000184402        0.000000000000        0.000000000000
   time           0.038537000000        0.049261000000        0.052331000000

                       intrinsic               pairsum             pairsumop
   value    2498310.842287337873  2498310.842287327163  2498310.842287327163
   |error|        0.000000010710        0.000000000000        0.000000000000
   time           0.036518000000        0.042467000000        0.042731000000

                       intrinsic               pairsum             pairsumop
   value    2501574.442508993670  2501574.442508786917  2501574.442508786917
   |error|        0.000000206754        0.000000000000        0.000000000000
   time           0.036898000000        0.047126000000        0.052493000000

                       intrinsic               pairsum             pairsumop
   value    2499425.782121376134  2499425.782121527474  2499425.782121527474
   |error|        0.000000151340        0.000000000000        0.000000000000
   time           0.043838000000        0.052087000000        0.053888000000

                       intrinsic               pairsum             pairsumop
   value    2499159.644677983597  2499159.644677705597  2499159.644677705597
   |error|        0.000000278000        0.000000000000        0.000000000000
   time           0.044212000000        0.053041000000        0.053731000000

                       intrinsic               pairsum             pairsumop
   value    2499450.881585741416  2499450.881585448515  2499450.881585448515
   |error|        0.000000292901        0.000000000000        0.000000000000
   time           0.036601000000        0.042545000000        0.042770000000

                       intrinsic               pairsum             pairsumop
   value    2499690.190008586273  2499690.190008680336  2499690.190008680336
   |error|        0.000000094064        0.000000000000        0.000000000000
   time           0.036510000000        0.042385000000        0.042710000000

                       intrinsic               pairsum             pairsumop
   value    2500430.581034909468  2500430.581034730189  2500430.581034730189
   |error|        0.000000179280        0.000000000000        0.000000000000
   time           0.039096000000        0.047734000000        0.052476000000

                       intrinsic               pairsum             pairsumop
   value    2499783.791602896992  2499783.791603248566  2499783.791603248566
   |error|        0.000000351574        0.000000000000        0.000000000000
   time           0.036501000000        0.042540000000        0.042733000000

From the results the best option would be using an interface such that each function can be fully optimized and selection is done at compile time, instead of runtime as with the optional arguments. Yet, the performance for both implentations is quite close.

1 Like

I think it is pairsum_mask(a(n/2+1:n),mask(n/2+1:n))

Thanks! fixed the typo :slight_smile:

A final thought on SUM. As demonstrated in this thread, it is possible to write a highly accurate global sum for very large arrays that performs at about the speed of the current intrinsic SUM. Why then is something like pairsum not the standard SUM algorithm in current compilers. This seems like a no-brainer to me. Am I missing something here?. Judging by the identical times and results for an intrinsic SUM and a simple loop, it looks like SUM is just a scalar loop.

1 Like

The splitting of the vector into parts works for this test case, because the only problem is that the exponent of the accumulated value eventually becomes too large, and splitting the long vector into pairs keeps the exponents of the two partial sums closer together.

The programmer is going to effort to break up large arrays into separate pairs in the attempt to keep the exponents of the two pairs about the same (and thereby to eliminate roundoff for long vectors of roughly equal elements). But then by using the intrinsic sum() which is just a simple loop over the elements in sequential order, there is no attempt at all to compute those terms accurately (e.g. with Kahan summation, or extended precision accumulation, or summation into bins, etc.). That effort would be necessary for more general vectors, including those with elements with mixed signs and varying magnitudes.

I would think that a general summation algorithm (i.e. a replacement for the intrinsic sum()) might try to do both things. Of course, that shifts the balance between accuracy and performance, so there will always be someone who is unhappy with the algorithm choice…

Another comment is that some timings presented in this discussion suggest that the real32 to real64 conversions and the real64 accumulation are relatively expensive, several times longer than the simple real32 accumulation, while other timings suggest that the two summations can be done at the same speed. Did I miss some detail of these two implementations, or is it just a question of which optimizations are enabled and used by the compiler?

Real32 vs real64 speed depends on array size. For small arrays, real32 will be roughly 2x faster, but for larger sizes you will be memory limited and the speed will be the same.

According to this: Julia equivalent of Python's "fsum" for floating point summation - #4 by stevengj - General Usage - Julia Programming Language, Julia’s sum() uses pairwise summation.

1 Like

@certik, any chance of eventually making pairwise sum or similar algoritm a default for SUM in LFortran. I think this topic would make a good future GSOC project for some motivated young person. I can also see this work leading to a better DOT_PRODUCT. At least its something that could (should ?) go into stdlib. Also, a quick comment about performance vs accuracy. Some of the discussions about performance seem to neglect the major advances in processor speeds over the last couple of decades. Speed differences between algorithms that were measured in minutes a few decades ago are now down to tenths or hundredths of a second. Except in a very exascale calculation, where even a .01 difference in time could amount to a couple hours of run time, the speed of modern processors make some of the arguments against using slower but more accurate algorithms moot. There may be some folks that can discern between a 0.1 delta in elapsed time but I’m not one of them. I guess the point I’m trying to make is the sheer processing speed of both modern commodity processors and HPC systems should allow compiler writers to favor more accurate algorithms particularly if they can run at about the same speed as the “fast” inaccurate ones.

Why does that happen? It would seem like, especially for large arrays, each clock cycle will always pull in twice as many real32 words as real64 words.

1 Like

This is always a choice in designing fortran intrinsic functions. Should the intrinsic always give the best performance, and then require the programmer to write code for accuracy if necessary, or should the intrinsic always give accurate results, and then require the programmer to write fast code if necessary?

Algorithms like in pairsum() may also be implemented without recursive calls. Before recursion was added to the language, that was the only way to implement things like that. The programmer maintains the intermediate quantities that would otherwise be on the call stack. Now, the language supports recursion and makes it a little easier, but sometimes at a performance cost.

As for whether one needs to worry about stack limits, I would point out that in many cases it is the exact same compiler and operating system that is being used on a laptop or desktop machine as is being used on a high end cluster, so if you must worry about it for one you also must worry about it for the other. Further, many sites buy hardware and then run it until it fails and can no longer be repaired. So even if new hardware is available, the programmer must accommodate also 10- to 20-year old machines at the same site (sometimes even in the same cluster).

@kargl,

Yes a standard does not specify an implementation. It defines an interface between the user and the language. All the things you describe (dim, mask) etc. are just part of the interface that tells the compiler what to sum. My issue is whats happening with the implementation (how the final sum is calculated). While being able to do multple ranks is sometimes nice, the most common use case is a rank 1 array. The issues with dim etc. are red herrings. If they can be implemented by a competent programmer to do what looks like just a serial loop, they can also be implemented for alternative ways of doing the sum (at least by a competent programmer). As to the stack issue, that’s an OS problem. If you have issues with stack then start using a real OS. I run Linux and set the stack size to unlimited in bashrc so I don’t see that problem. My example sums over about 100 Million elements and I have no problems with pairsum. I consider that an extreme example because most problems that use arrays of that size are on HPC systems not desktops and in that case the codes are using MPI with the global problem size is cut down by the number of processors you are using (today in the 100K range for some systems) so your 100 Mil arrays are cut down to something on the order of 1 Mil or less. Again my position is that the problems you state are not encountered in the real world (at least by people that know what they are doing)

@rwmsu, @kargl I think the solution to this is to offer both sum implementations (or even more if needed) and let the user select. I know for sure that for trigonometric functions we need at least two implementations: one very fast, but slightly less accurate and one slower but very accurate. So we can do the same with sum(). The only question remaining is how the user should select it, one obvious option is via command line arguments, but there might also be a need to have more fine grained approach, perhaps with some pragmas.

My point was that with larger real32 arrays you can sum into a real64 result by pulling in real32s, converting them in registers to real64 and accumulating the sum.

In case the Higham, Blanchard, Mary paper was not mentioned:

A Class of Fast and Accurate Summation Algorithms

https://epubs.siam.org/doi/10.1137/19M1257780

The Fortran 2018 standard requires (my emphasis) “The result of SUM (ARRAY) has a value equal to a processor-dependent approximation to the sum of all the elements of ARRAY or has the value zero if ARRAY has size zero.”

3 Likes