Some Intrinsic SUMS

Lately I have been reading many discussions here about fortran intrinsic procedures, their performance, correctness, etc. I have known for a while about some odd behavior across gfortran, ifort, and ifx. For instance, consider the following program:

program main
use, intrinsic :: iso_fortran_env, sp=>real32, dp=>real64
implicit none

    real(sp), allocatable :: xsp(:)
    real(sp) :: xsp_sum
    real(dp), allocatable :: xdp(:)
    real(dp) :: xdp_sum, xdp_xsp_difference
    integer :: i, j, n

    write(*,*) 'COMPILER: ',compiler_version()
    write(*,*) 'OPTIONS: ',compiler_options()
    write(*,*) ''

    do i=0,7
        do j=1,9
            n = j*(10**i)
            if (allocated(xsp)) deallocate(xsp)
            if (allocated(xdp)) deallocate(xdp)
            allocate(xsp(n), xdp(n))
            xsp = 1.0
            xsp_sum = sum(xsp)
            xdp = 1.0
            xdp_sum = sum(xdp)
            xdp_xsp_difference = xdp_sum - real(xsp_sum, dp)
            if (xdp_xsp_difference.gt.0.001) then
                write(*,'(a,i8,a,e13.6,2(a,e22.15))') 'n: ',n, &
                                                      ', xsp_sum: ',xsp_sum, &
                                                      ', xdp_sum: ',xdp_sum, &
                                                      ', xdp_xsp_difference: ',xdp_xsp_difference
            end if
        end do
    end do

end program main

Depending on the compiler and optimization options you use, different results can be obtained. To summarize, this is what I got using “debugging,” “normal optimization,” and “aggressive optimization” sets of flags.
Debugging:

fpm: Entering directory '/home/tyranids/projects/some-sums'
Project is up to date
 COMPILER: GCC version 11.3.0
 OPTIONS: -I build/gfortran_020AF3609BFD7C6D -mtune=generic -march=x86-64 -g -Wall -Wextra -Werror=implicit-interface -fPIC -fmax-errors=1 -fcheck=bounds -fcheck=array-temps -fbacktrace -fcoarray=single -fimplicit-none -ffree-form -J build/gfortran_020AF3609BFD7C6D -fpre-include=/usr/include/finclude/math-vector-fortran.h
 
n: 20000000, xsp_sum:  0.167772E+08, xdp_sum:  0.200000000000000E+08, xdp_xsp_difference:  0.322278400000000E+07
n: 30000000, xsp_sum:  0.167772E+08, xdp_sum:  0.300000000000000E+08, xdp_xsp_difference:  0.132227840000000E+08
n: 40000000, xsp_sum:  0.167772E+08, xdp_sum:  0.400000000000000E+08, xdp_xsp_difference:  0.232227840000000E+08
n: 50000000, xsp_sum:  0.167772E+08, xdp_sum:  0.500000000000000E+08, xdp_xsp_difference:  0.332227840000000E+08
n: 60000000, xsp_sum:  0.167772E+08, xdp_sum:  0.600000000000000E+08, xdp_xsp_difference:  0.432227840000000E+08
n: 70000000, xsp_sum:  0.167772E+08, xdp_sum:  0.700000000000000E+08, xdp_xsp_difference:  0.532227840000000E+08
n: 80000000, xsp_sum:  0.167772E+08, xdp_sum:  0.800000000000000E+08, xdp_xsp_difference:  0.632227840000000E+08
n: 90000000, xsp_sum:  0.167772E+08, xdp_sum:  0.900000000000000E+08, xdp_xsp_difference:  0.732227840000000E+08
fpm: Leaving directory '/home/tyranids/projects/some-sums'
fpm: Entering directory '/home/tyranids/projects/some-sums'
Project is up to date
 COMPILER: 
 Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel
 (R) 64, Version 2021.9.0 Build 20230302_000000
 OPTIONS: 
 -Ibuild/ifort_64365A593998FE80 -c -warn all -check all -error-limit 1 -O0 -g -a
 ssume byterecl -traceback -free -module build/ifort_64365A593998FE80 -o build/i
 fort_64365A593998FE80/some-sums/app_main.f90.o
 
n: 20000000, xsp_sum:  0.167772E+08, xdp_sum:  0.200000000000000E+08, xdp_xsp_difference:  0.322278400000000E+07
n: 30000000, xsp_sum:  0.167772E+08, xdp_sum:  0.300000000000000E+08, xdp_xsp_difference:  0.132227840000000E+08
n: 40000000, xsp_sum:  0.167772E+08, xdp_sum:  0.400000000000000E+08, xdp_xsp_difference:  0.232227840000000E+08
n: 50000000, xsp_sum:  0.167772E+08, xdp_sum:  0.500000000000000E+08, xdp_xsp_difference:  0.332227840000000E+08
n: 60000000, xsp_sum:  0.167772E+08, xdp_sum:  0.600000000000000E+08, xdp_xsp_difference:  0.432227840000000E+08
n: 70000000, xsp_sum:  0.167772E+08, xdp_sum:  0.700000000000000E+08, xdp_xsp_difference:  0.532227840000000E+08
n: 80000000, xsp_sum:  0.167772E+08, xdp_sum:  0.800000000000000E+08, xdp_xsp_difference:  0.632227840000000E+08
n: 90000000, xsp_sum:  0.167772E+08, xdp_sum:  0.900000000000000E+08, xdp_xsp_difference:  0.732227840000000E+08
fpm: Leaving directory '/home/tyranids/projects/some-sums'
fpm: Entering directory '/home/tyranids/projects/some-sums'
Project is up to date
 COMPILER: 
 Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023
 .1.0 Build 20230320
 OPTIONS: 
 -Ibuild/ifx_64365A593998FE80 -c -warn all -check all -error-limit 1 -O0 -g -ass
 ume byterecl -traceback -free -module build/ifx_64365A593998FE80 -o build/ifx_6
 4365A593998FE80/some-sums/app_main.f90.o 
 
n: 20000000, xsp_sum:  0.167772E+08, xdp_sum:  0.200000000000000E+08, xdp_xsp_difference:  0.322278400000000E+07
n: 30000000, xsp_sum:  0.167772E+08, xdp_sum:  0.300000000000000E+08, xdp_xsp_difference:  0.132227840000000E+08
n: 40000000, xsp_sum:  0.167772E+08, xdp_sum:  0.400000000000000E+08, xdp_xsp_difference:  0.232227840000000E+08
n: 50000000, xsp_sum:  0.167772E+08, xdp_sum:  0.500000000000000E+08, xdp_xsp_difference:  0.332227840000000E+08
n: 60000000, xsp_sum:  0.167772E+08, xdp_sum:  0.600000000000000E+08, xdp_xsp_difference:  0.432227840000000E+08
n: 70000000, xsp_sum:  0.167772E+08, xdp_sum:  0.700000000000000E+08, xdp_xsp_difference:  0.532227840000000E+08
n: 80000000, xsp_sum:  0.167772E+08, xdp_sum:  0.800000000000000E+08, xdp_xsp_difference:  0.632227840000000E+08
n: 90000000, xsp_sum:  0.167772E+08, xdp_sum:  0.900000000000000E+08, xdp_xsp_difference:  0.732227840000000E+08
fpm: Leaving directory '/home/tyranids/projects/some-sums'

Under debugging flags, all 3 compilers seem to encounter errors due to precision after 2**24. This is surprising to me, as it indicates that the intrinsic SUM function is implemented as nothing more than a naive loop. I will update this post with some timings I have done in another program that leads me to this conclusion.
EDIT:

Click to see timing and algorithm comparisons.

Here is the timing code. It compares results and timing information for the intrinsic sum, a naive do-loop, and a pairwise reduction with cutoff at 8192 elements.

program timings
use, intrinsic :: iso_fortran_env, sp=>real32, dp=>real64, i64=>int64
implicit none

    integer, parameter :: n = 2**27, i_max = 128
    real(sp) :: xsp(n), xsp_sum
    real(dp) :: xdp(n), xdp_sum, xdp_xsp_difference
    integer(i64) :: c1, c2, cr
    integer :: i

    write(*,*) 'COMPILER: ',compiler_version()
    write(*,*) 'OPTIONS: ',compiler_options()
    write(*,*) ''
    write(*,*) 'n: ',n,', i_max: ',i_max
    
    xsp = 1.0
    xdp = 1.0
    xdp_sum = sum(xdp)
    write(*,*) '"correct" answer (xdp_sum): ',xdp_sum

    call system_clock(c1, cr)
    do i=1,i_max
        xsp_sum = sum(xsp)
    end do
    call system_clock(c2)
    xdp_xsp_difference = xdp_sum - real(xsp_sum, dp)
    write(*,'(a32,e13.6,a,e22.15,a,e13.6)') 'intrinsic sum ',xsp_sum, &
                                            ' diff: ',xdp_xsp_difference, &
                                            ', time per call: ',(real(max(c2 - c1, 1_i64))/cr)/i_max

    call system_clock(c1, cr)
    do i=1,i_max
        xsp_sum = s0(xsp)
    end do
    call system_clock(c2)
    xdp_xsp_difference = xdp_sum - real(xsp_sum, dp)
    write(*,'(a32,e13.6,a,e22.15,a,e13.6)') 'naive do loop ',xsp_sum, &
                                            ' diff: ',xdp_xsp_difference, &
                                            ', time per call: ',(real(max(c2 - c1, 1_i64))/cr)/i_max

    call system_clock(c1, cr)
    do i=1,i_max
        xsp_sum = s1(xsp)
    end do
    call system_clock(c2)
    xdp_xsp_difference = xdp_sum - real(xsp_sum, dp)
    write(*,'(a32,e13.6,a,e22.15,a,e13.6)') 'pairwise sum, cutoff 8192 ',xsp_sum, &
                                            ' diff: ',xdp_xsp_difference, &
                                            ', time per call: ',(real(max(c2 - c1, 1_i64))/cr)/i_max

    contains

        pure function s0(xin) result(xout)
            real(sp), intent(in) :: xin(:)
            real(sp) :: xout
            integer :: i
            xout = 0.0
            do i=1,size(xin)
                xout = xout + xin(i)
            end do
        end function s0

        pure recursive function s1(xin) result(xout)
            real(sp), intent(in) :: xin(:)
            real(sp) :: xout
            integer :: n
            n = size(xin)
            if (n.gt.8192) then
                xout = s1(xin(1:n/2)) + s1(xin(n/2+1:n))
            else
                xout = sum(xin)
            end if
        end function s1

end program timings

I compile and run with the “aggressive otimization” options. Results follow:

fpm: Entering directory '/home/tyranids/projects/some-sums'
Project is up to date
 COMPILER: GCC version 11.3.0
 OPTIONS: -I build/gfortran_75FF7B055CBF6CA1 -march=core2 -mmmx -mno-popcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -mno-sse4.2 -mno-avx -mno-avx2 -mno-sse4a -mno-fma4 -mno-xop -mno-fma -mno-avx512f -mno-bmi -mno-bmi2 -mno-aes -mno-pclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -mno-adx -mno-abm -mno-cldemote -mno-clflushopt -mno-clwb -mno-clzero -mcx16 -mno-enqcmd -mno-f16c -mno-fsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mno-lzcnt -mno-movbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mno-prefetchwt1 -mno-prfchw -mno-ptwrite -mno-rdpid -mno-rdrnd -mno-rdseed -mno-rtm -mno-serialize -mno-sgx -mno-sha -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mno-xsave -mno-xsavec -mno-xsaveopt -mno-xsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni --param=l1-cache-size=32 --param=l1-cache-line-size=64 --param=l2-cache-size=3072 -mtune=core2 -Ofast -Werror=implicit-interface -flto -fwhole-program -fimplicit-none -ffree-form -J build/gfortran_75FF7B055CBF6CA1 -fpre-include=/usr/include/finclude/math-vector-fortran.h
 
 n:    134217728 , i_max:          128
 "correct" answer (xdp_sum):    134217728.00000000     
                  intrinsic sum  0.671089E+08 diff:  0.671088640000000E+08, time per call:  0.983408E-01
                  naive do loop  0.671089E+08 diff:  0.671088640000000E+08, time per call:  0.984338E-01
      pairwise sum, cutoff 8192  0.134218E+09 diff:  0.000000000000000E+00, time per call:  0.104085E+00
fpm: Leaving directory '/home/tyranids/projects/some-sums'
fpm: Entering directory '/home/tyranids/projects/some-sums'
Project is up to date
 COMPILER: 
 Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel
 (R) 64, Version 2021.9.0 Build 20230302_000000
 OPTIONS: 
 -Ibuild/ifort_83E883136EE047D1 -c -fast -free -module build/ifort_83E883136EE04
 7D1 -o build/ifort_83E883136EE047D1/some-sums/app_timings.f90.o
 
 n:    134217728 , i_max:          128
 "correct" answer (xdp_sum):    134217728.000000     
                  intrinsic sum  0.671089E+08 diff:  0.671088640000000E+08, time per call:  0.498731E-01
                  naive do loop  0.134218E+09 diff:  0.000000000000000E+00, time per call:  0.970180E-01
      pairwise sum, cutoff 8192  0.134218E+09 diff:  0.000000000000000E+00, time per call:  0.996574E-01
fpm: Leaving directory '/home/tyranids/projects/some-sums'
fpm: Entering directory '/home/tyranids/projects/some-sums'
Project is up to date
 COMPILER: 
 Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023
 .1.0 Build 20230320
 OPTIONS: 
 -Ibuild/ifx_83E883136EE047D1 -c -fast -free -module build/ifx_83E883136EE047D1 
 -o build/ifx_83E883136EE047D1/some-sums/app_timings.f90.o 
 
 n:    134217728 , i_max:          128
 "correct" answer (xdp_sum):    134217728.000000     
                  intrinsic sum  0.134218E+09 diff:  0.000000000000000E+00, time per call:  0.108797E+00
                  naive do loop  0.134218E+09 diff:  0.000000000000000E+00, time per call:  0.109057E+00
      pairwise sum, cutoff 8192  0.134218E+09 diff:  0.000000000000000E+00, time per call:  0.120868E+00
fpm: Leaving directory '/home/tyranids/projects/some-sums'

We can see that gfortran arrives at the same wrong answer (with nearly identical computation times between methods) as below, stopping at 2**26 for both intrinsic sum and do-loop. Pairwise reduction yields the correct result although is slightly slower on my machine. Perhaps most surprising to me is that whatever optimizations ifort is doing allow it to get the correct result using a simple do-loop. However, the intrinsic sum fails for ifort specifically at this power of 2, when it has no issues with larger sums as shown below. Ifx is simultaneously wrong for intrinsic and do-loop, while being the slowest for all 3 algorithms as well.

Normal Optimization

fpm: Entering directory '/home/tyranids/projects/some-sums'
Project is up to date
 COMPILER: GCC version 11.3.0
 OPTIONS: -I build/gfortran_776EBCE5331A33CA -march=core2 -mmmx -mno-popcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -mno-sse4.2 -mno-avx -mno-avx2 -mno-sse4a -mno-fma4 -mno-xop -mno-fma -mno-avx512f -mno-bmi -mno-bmi2 -mno-aes -mno-pclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -mno-adx -mno-abm -mno-cldemote -mno-clflushopt -mno-clwb -mno-clzero -mcx16 -mno-enqcmd -mno-f16c -mno-fsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mno-lzcnt -mno-movbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mno-prefetchwt1 -mno-prfchw -mno-ptwrite -mno-rdpid -mno-rdrnd -mno-rdseed -mno-rtm -mno-serialize -mno-sgx -mno-sha -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mno-xsave -mno-xsavec -mno-xsaveopt -mno-xsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni --param=l1-cache-size=32 --param=l1-cache-line-size=64 --param=l2-cache-size=3072 -mtune=core2 -O2 -Werror=implicit-interface -fimplicit-none -ffree-form -J build/gfortran_776EBCE5331A33CA -fpre-include=/usr/include/finclude/math-vector-fortran.h
 
n: 20000000, xsp_sum:  0.167772E+08, xdp_sum:  0.200000000000000E+08, xdp_xsp_difference:  0.322278400000000E+07
n: 30000000, xsp_sum:  0.167772E+08, xdp_sum:  0.300000000000000E+08, xdp_xsp_difference:  0.132227840000000E+08
n: 40000000, xsp_sum:  0.167772E+08, xdp_sum:  0.400000000000000E+08, xdp_xsp_difference:  0.232227840000000E+08
n: 50000000, xsp_sum:  0.167772E+08, xdp_sum:  0.500000000000000E+08, xdp_xsp_difference:  0.332227840000000E+08
n: 60000000, xsp_sum:  0.167772E+08, xdp_sum:  0.600000000000000E+08, xdp_xsp_difference:  0.432227840000000E+08
n: 70000000, xsp_sum:  0.167772E+08, xdp_sum:  0.700000000000000E+08, xdp_xsp_difference:  0.532227840000000E+08
n: 80000000, xsp_sum:  0.167772E+08, xdp_sum:  0.800000000000000E+08, xdp_xsp_difference:  0.632227840000000E+08
n: 90000000, xsp_sum:  0.167772E+08, xdp_sum:  0.900000000000000E+08, xdp_xsp_difference:  0.732227840000000E+08
fpm: Leaving directory '/home/tyranids/projects/some-sums'
fpm: Entering directory '/home/tyranids/projects/some-sums'
Project is up to date
 COMPILER: 
 Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel
 (R) 64, Version 2021.9.0 Build 20230302_000000
 OPTIONS: 
 -Ibuild/ifort_D4C49DE7E65100D3 -c -O2 -xHost -free -module build/ifort_D4C49DE7
 E65100D3 -o build/ifort_D4C49DE7E65100D3/some-sums/app_main.f90.o
 
fpm: Leaving directory '/home/tyranids/projects/some-sums'
fpm: Entering directory '/home/tyranids/projects/some-sums'
Project is up to date
 COMPILER: 
 Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023
 .1.0 Build 20230320
 OPTIONS: 
 -Ibuild/ifx_D4C49DE7E65100D3 -c -O2 -xHost -free -module build/ifx_D4C49DE7E651
 00D3 -o build/ifx_D4C49DE7E65100D3/some-sums/app_main.f90.o 
 
n: 70000000, xsp_sum:  0.671089E+08, xdp_sum:  0.700000000000000E+08, xdp_xsp_difference:  0.289113600000000E+07
n: 80000000, xsp_sum:  0.671089E+08, xdp_sum:  0.800000000000000E+08, xdp_xsp_difference:  0.128911360000000E+08
n: 90000000, xsp_sum:  0.671089E+08, xdp_sum:  0.900000000000000E+08, xdp_xsp_difference:  0.228911360000000E+08
fpm: Leaving directory '/home/tyranids/projects/some-sums'

With “normal optimization,” gfortran’s behavior is identical to the debugging flags option. However, ifort now produces no errors, and ifx seemingly stops counting at 2**26 rather than 2**24 now.

Lastly, aggressive optimization:

fpm: Entering directory '/home/tyranids/projects/some-sums'
Project is up to date
 COMPILER: GCC version 11.3.0
 OPTIONS: -I build/gfortran_75FF7B055CBF6CA1 -march=core2 -mmmx -mno-popcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -mno-sse4.2 -mno-avx -mno-avx2 -mno-sse4a -mno-fma4 -mno-xop -mno-fma -mno-avx512f -mno-bmi -mno-bmi2 -mno-aes -mno-pclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -mno-adx -mno-abm -mno-cldemote -mno-clflushopt -mno-clwb -mno-clzero -mcx16 -mno-enqcmd -mno-f16c -mno-fsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mno-lzcnt -mno-movbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mno-prefetchwt1 -mno-prfchw -mno-ptwrite -mno-rdpid -mno-rdrnd -mno-rdseed -mno-rtm -mno-serialize -mno-sgx -mno-sha -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mno-xsave -mno-xsavec -mno-xsaveopt -mno-xsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni --param=l1-cache-size=32 --param=l1-cache-line-size=64 --param=l2-cache-size=3072 -mtune=core2 -Ofast -Werror=implicit-interface -flto -fwhole-program -fimplicit-none -ffree-form -J build/gfortran_75FF7B055CBF6CA1 -fpre-include=/usr/include/finclude/math-vector-fortran.h
 
n: 70000000, xsp_sum:  0.671089E+08, xdp_sum:  0.700000000000000E+08, xdp_xsp_difference:  0.289113600000000E+07
n: 80000000, xsp_sum:  0.671089E+08, xdp_sum:  0.800000000000000E+08, xdp_xsp_difference:  0.128911360000000E+08
n: 90000000, xsp_sum:  0.671089E+08, xdp_sum:  0.900000000000000E+08, xdp_xsp_difference:  0.228911360000000E+08
fpm: Leaving directory '/home/tyranids/projects/some-sums'
fpm: Entering directory '/home/tyranids/projects/some-sums'
Project is up to date
 COMPILER: 
 Intel(R) Fortran Intel(R) 64 Compiler Classic for applications running on Intel
 (R) 64, Version 2021.9.0 Build 20230302_000000
 OPTIONS: 
 -Ibuild/ifort_83E883136EE047D1 -c -fast -free -module build/ifort_83E883136EE04
 7D1 -o build/ifort_83E883136EE047D1/some-sums/app_main.f90.o
 
fpm: Leaving directory '/home/tyranids/projects/some-sums'
fpm: Entering directory '/home/tyranids/projects/some-sums'
Project is up to date
 COMPILER: 
 Intel(R) Fortran Compiler for applications running on Intel(R) 64, Version 2023
 .1.0 Build 20230320
 OPTIONS: 
 -Ibuild/ifx_83E883136EE047D1 -c -fast -free -module build/ifx_83E883136EE047D1 
 -o build/ifx_83E883136EE047D1/some-sums/app_main.f90.o 
 
n: 70000000, xsp_sum:  0.671089E+08, xdp_sum:  0.700000000000000E+08, xdp_xsp_difference:  0.289113600000000E+07
n: 80000000, xsp_sum:  0.671089E+08, xdp_sum:  0.800000000000000E+08, xdp_xsp_difference:  0.128911360000000E+08
n: 90000000, xsp_sum:  0.671089E+08, xdp_sum:  0.900000000000000E+08, xdp_xsp_difference:  0.228911360000000E+08
fpm: Leaving directory '/home/tyranids/projects/some-sums'

Here, gfortran appears to have joined ifx with stopping counting at 2**26, while ifort once again produces the (expected) correct results.

Does anyone have insight to offer on this topic?

1 Like

Thank you for the response. Yes, I am aware that floating point math in the computer is not the same as done by humans. That doesn’t really address at all the questions of why results are different for the same intrinsic procedure of a language intending to be following some standard? It also does not address the fact that an incorrect answer is emitted from some of the processors, while it is easily possible to come to the correct answer in much the same time of computation. It is especially odd that 2 compilers from the same vendor can give different results.

The issue you observe with summation “stopping” has been discussed previously at Discourse: Size of long array - #46 by ivanpribec

As @mecej4 shows in that thread, the consecutive integers 2^{24} and 2^{24}+1 both have the same IEEE-32 representation Z’4B800000’ .

I believe it has been mentioned several times on this forum, and in the Intel Fortran Developer forums and documentation, that you should not expect bitwise equal results. (I will update and link the discussions if I manage to find them). The compilers rely on different technologies for the the code generation and optimization leading to differences in the sequence of floating point operations taken, and hence different results:

image

(The slide is taken from the webinar GPU Offloading: The Next Chapter for Intel® Fortran Compiler given by Intel a few months ago).

2 Likes

Thanks for the link, as expected this isn’t anything new. I suppose ultimately I’m not that surprised the compiler developers refuse to consider such behavior a bug. That said, I will say that is extremely disappointing. An array of 20M “1.0” is valid input, 20.0E+6 (or thereabouts, certainly closer than 2**24) is valid for 32 bit real. I expect the language intrinsic function to give me the correct result. Anything less is an algorithmic failure of the compiler. As pointed out in the thread linked, there are many ways around the problem that remain performant. Personally I prefer the pairwise summation, as on newer hardware I was able to get a speed up along with increased accuracy. For whatever reason the 2008 core 2 duo used above never improves, but with an AMD 5800X I see about an order of magnitude speed up on gfortran going from intrinsic sum to pairwise with cutoff at 32 elements.

A related discussion is Compensated summation · Issue #402 · fortran-lang/stdlib · GitHub.

1 Like

You can try LFortran too:

$ lfortran a.f90

n:  20000000 , xsp_sum:  1.67772160e+07 , xdp_sum:  2.00000000000000000e+07 , xdp_xsp_difference:  3.22278400000000000e+06
n:  30000000 , xsp_sum:  1.67772160e+07 , xdp_sum:  3.00000000000000000e+07 , xdp_xsp_difference:  1.32227840000000000e+07
n:  40000000 , xsp_sum:  1.67772160e+07 , xdp_sum:  4.00000000000000000e+07 , xdp_xsp_difference:  2.32227840000000000e+07
n:  50000000 , xsp_sum:  1.67772160e+07 , xdp_sum:  5.00000000000000000e+07 , xdp_xsp_difference:  3.32227840000000000e+07
n:  60000000 , xsp_sum:  1.67772160e+07 , xdp_sum:  6.00000000000000000e+07 , xdp_xsp_difference:  4.32227840000000000e+07
n:  70000000 , xsp_sum:  1.67772160e+07 , xdp_sum:  7.00000000000000000e+07 , xdp_xsp_difference:  5.32227840000000000e+07
n:  80000000 , xsp_sum:  1.67772160e+07 , xdp_sum:  8.00000000000000000e+07 , xdp_xsp_difference:  6.32227840000000000e+07
n:  90000000 , xsp_sum:  1.67772160e+07 , xdp_sum:  9.00000000000000000e+07 , xdp_xsp_difference:  7.32227840000000000e+07

I don’t know where this puts it.

What is your suggested solution? Are you proposing the intrinsic sum to use the pair-wise summation by default?

I think what you might be hitting is the usual trade-off between performance, mathematical correctness, floating point behavior (fast-math vs keeping exactly the IEEE defined operations as written in the source code). The solution I think is to allow the user to select exactly what the compiler should do (via source code or compiler options), instead of relying on the compiler like a black box. You might have discovered something that is currently not configurable, but should be.

Try the following program (kinda long sorry) that implements Kahan summation and compares it with a simple loop and the SUM intrinsic function. I tried it with latest ifort, ifx, nvfortran and gfortran-11 in both REAL32 and REAL64. I ran with -O0 and -O3. Long story short, For REAL32, for ifort gave the “correct” result for both -O0 and -O3. A standard loop and the intrinsic SUM were badly in error. For ifx with -O0 Kahan sum again gave the correct result. With -O3 however, Kahan sum gave the SAME result as a loop and the intrinsic SUM. For nvfortran and gfortran-11, Kahan sum gave the correct result with both -O0 and -O3 with the loop and intrinsic sum in error. For REAL64, all compilers gave close to the correct answer with Kahan sum giving the correct answer except for -O3 with ifx. So if you need to do sums of a very large REAL32 array, I would avoid ifx.

! Kahan sum example program
!
! Details can be found in the following webpage:
!
!   https://en.wikipedia.org/wiki/Kahan_summation_algorithm
!
! Taken from code examples on Hioraki Nishikawa (aks Katate Masatsuka)
! I do like CFD website (http://www.ossanworld.com/cfdbooks/cfdcodes.html) and modified. 

program kahan_sum_example

#ifdef R64
  use ISO_FORTRAN_ENV, ONLY: wp=>REAL64
#else
  use ISO_FORTRAN_ENV, ONLY: wp=>REAL32
#endif

  implicit none

!Some constants

  integer, parameter     :: n = 10**8 + 1 != 100,000,001.
  integer                :: i
  real(wp)               :: dpc, asum, k_sum, isum
  real(wp), dimension(n) :: dp

!-----------------------------------------------------------------
! We compute sum = 1.0 + sum_{i=2,n} dp(i).
! For simplicity, use a constant dp: dp = dpc, where

  dpc     = 1.234567891234567e-02_wp
  dp(1)   = 1.0_wp
  dp(2:n) = dpc

! Note: This is a simplified example. In this case, the correct sum
!       should be 1+dpc*n-1, which we know because dp is a vecror of
!       the same constant value: dp = [1,dpc,dpc,...,dpc]. The correct sum
!       is used here to show how accurate the Kahan summation is.
!
!       In general, if dp(i) is different for each i, then we must
!       add up 1 and dp(i) over i=2,n as below; this is where the round-off
!       error can cause a loss of precision and Kahan's technique helps
!       keep the precision.

!-----------------------------------------------------------------
! (1)Simple summation loop

  asum = dp(1)
  do i = 2, n
   asum = asum + dp(i)
  end do

!-----------------------------------------------------------------
! (2)Summation by Kahan's technique

  k_sum = ksum(dp)

!-----------------------------------------------------------------
! (3) Intrinsic sum

  isum      = SUM(dp)

!-----------------------------------------------------------------
! Compare the results with the correct sum.

  write(*,*)
  write(*,*) "      loop      = ", asum
  write(*,*) "      Kahan     = ", k_sum
  write(*,*) "      Intrinsic = ", isum
  write(*,*) "    Correct sum = ", 1.0_wp + dpc*real(n-1,wp)
  write(*,*)

Contains

! Kahan summation

  pure function ksum(a) Result(ks)

    real(wp), intent(in) :: a(:)
    real(wp)             :: ks

    integer  :: i, n
    real(wp) :: temp, c, t

    n = SIZE(a)
    ks = a(1)
    c  = 0.0_wp

    do i=2,n
      temp = a(i) - c
      t    = ks + temp
      c    = (t-ks) -temp
      ks   = t
    end do

  end function ksum

end program kahan_sum_example


1 Like

I think others have pointed out the reasons for the floating point roundoff errors and have given references to compensated summation algorithms and double precision accumulation, so I’ll point out a couple more things.

There are n! ways to order a list of n values, and the straightforward summation of each of those n! orders will give the same result in exact arithmetic (because addition commutes and is associative) and can give different results with floating point arithmetic (because of roundoff error). Depending on the processor and how the intermediates are stored, that roundoff error can result from either the loss of commutativity or the loss of associativity.

Furthermore, given one of the n! orderings of the input values, there are

C_n = (1/(n+1)) * (2n .choose. n)

possible ways to put parentheses around the intermediate values in that list to evaluate the result. I don’t know how to write binomial coefficients in this forum, but hopefully the notation is clear. Those C_n values are called a Catalan numbers. Catalan number - Wikipedia

In your model problem, all the elements are equal, so the n! orderings are all equivalent, but the Catalan factor still applies because different parentheses choices can have different roundoffs.

So when you combine those two together, you end up with a very large number of possible ways to compute the summation result. A fortran compiler, or a mathematical library, cannot be expected to evaluate the expression in all of those possible ways in order to find the most accurate result.

If the programmer knows some properties of the array, then that can be used to advantage. For example, if the elements all have the same sign, then the programmer knows that catastrophic roundoff will not occur. If the elements are ordered somehow, then the small values can be added first, and that also helps to eliminate the effects of roundoff errors. Then there are also the compensated summation and higher precision accumulation approaches, all of which triple or quadruple the computational effort. So the programmer must choose which algorithms to use for this problem that are the best compromise between effort and accuracy. This problem is too complicated to expect the language, or the mathematical library, to always make the best choice.

2 Likes

For a modern machine, my findings indicate that your thoughts expressed here are incorrect. Consider the following test program:

program sumbench
use, intrinsic :: iso_fortran_env
implicit none
    integer, parameter :: n = 2**27, i_max = 2**10
    real(real32) :: x(n), sum_x
    integer :: i, i_cutoff, j
    integer(int64) :: c1, c2, cr
    write(*,'(a,a)') 'COMPILER: ',compiler_version()
    write(*,'(a,a)') 'OPTIONS: ',compiler_options()
    x = 1.0
    call system_clock(c1, cr)
    do i=1,i_max
        sum_x = sum(x)
    end do
    call system_clock(c2)
    write(*,'(2(a,e13.6))') '                      INTRINSIC SUM (', &
                            real(max(c2-c1,1_int64))/real(cr)/real(i_max),'s): ',sum_x
    call system_clock(c1, cr)
    do i=1,i_max
        sum_x = sum00(x)
    end do
    call system_clock(c2)
    write(*,'(2(a,e13.6))') '                  NAIVE DO-LOOP SUM (', &
                            real(max(c2-c1,1_int64))/real(cr)/real(i_max),'s): ',sum_x
    call system_clock(c1, cr)
    do i=1,i_max
        sum_x = sum01(x)
    end do
    call system_clock(c2)
    write(*,'(2(a,e13.6))') '         DO-LOOP (INNER REAL64) SUM (', &
                            real(max(c2-c1,1_int64))/real(cr)/real(i_max),'s): ',sum_x
    do j=20,1,-1
        i_cutoff = 2**j
        call system_clock(c1, cr)
        do i=1,i_max
            sum_x = sum02(x)
        end do
        call system_clock(c2)
        write(*,'(a,i7,2(a,e13.6))') 'RECURSIVE (CUTOFF ',i_cutoff,') PAIR-SUM (', &
                                     real(max(c2-c1,1_int64))/real(cr)/real(i_max),'s): ',sum_x
    end do
    contains
        pure function sum00(xin) result(xout)
            real(real32), intent(in) :: xin(:)
            real(real32) :: xout
            integer :: ii
            xout = 0.0
            do ii=1,size(xin)
                xout = xout + xin(ii)
            end do
        end function sum00
        function sum01(xin) result(xout)
            real(real32), intent(in) :: xin(:)
            real(real32) :: xout
            real(real64) :: temp
            integer :: ii
            temp = 0.0
            do ii=1,size(xin)
                temp = temp + real(xin(ii), real64)
            end do
            xout = temp
        end function sum01
        pure recursive function sum02(xin) result(xout)
            real(real32), intent(in) :: xin(:)
            real(real32) :: xout
            integer :: n
            n = size(xin)
            if (n.gt.i_cutoff) then
                xout = sum02(xin(1:n/2)) + sum02(xin(n/2+1:n))
            else
                xout = sum(xin)
            end if
        end function sum02
end program sumbench

When I run the code on an Intel Core 2 Duo from 2008 I get the following results:

COMPILER: GCC version 11.3.0
OPTIONS: -I build/gfortran_75FF7B055CBF6CA1 -march=core2 -mmmx -mno-popcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -mno-sse4.2 -mno-avx -mno-avx2 -mno-sse4a -mno-fma4 -mno-xop -mno-fma -mno-avx512f -mno-bmi -mno-bmi2 -mno-aes -mno-pclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mno-vpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -mno-adx -mno-abm -mno-cldemote -mno-clflushopt -mno-clwb -mno-clzero -mcx16 -mno-enqcmd -mno-f16c -mno-fsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mno-lzcnt -mno-movbe -mno-movdir64b -mno-movdiri -mno-mwaitx -mno-pconfig -mno-pku -mno-prefetchwt1 -mno-prfchw -mno-ptwrite -mno-rdpid -mno-rdrnd -mno-rdseed -mno-rtm -mno-serialize -mno-sgx -mno-sha -mno-shstk -mno-tbm -mno-tsxldtrk -mno-vaes -mno-waitpkg -mno-wbnoinvd -mno-xsave -mno-xsavec -mno-xsaveopt -mno-xsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni --param=l1-cache-size=32 --param=l1-cache-line-size=64 --param=l2-cache-size=3072 -mtune=core2 -Ofast -Werror=implicit-interface -flto -fwhole-program -fimplicit-none -ffree-form -J build/gfortran_75FF7B055CBF6CA1 -fpre-include=/usr/include/finclude/math-vector-fortran.h
                      INTRINSIC SUM ( 0.983126E-01s):  0.671089E+08
                  NAIVE DO-LOOP SUM ( 0.989637E-01s):  0.671089E+08
         DO-LOOP (INNER REAL64) SUM ( 0.117167E+00s):  0.134218E+09
RECURSIVE (CUTOFF 1048576) PAIR-SUM ( 0.128061E+00s):  0.134218E+09
RECURSIVE (CUTOFF  524288) PAIR-SUM ( 0.104334E+00s):  0.134218E+09
RECURSIVE (CUTOFF  262144) PAIR-SUM ( 0.102431E+00s):  0.134218E+09
RECURSIVE (CUTOFF  131072) PAIR-SUM ( 0.102907E+00s):  0.134218E+09
RECURSIVE (CUTOFF   65536) PAIR-SUM ( 0.103085E+00s):  0.134218E+09
RECURSIVE (CUTOFF   32768) PAIR-SUM ( 0.102669E+00s):  0.134218E+09
RECURSIVE (CUTOFF   16384) PAIR-SUM ( 0.103133E+00s):  0.134218E+09
RECURSIVE (CUTOFF    8192) PAIR-SUM ( 0.103869E+00s):  0.134218E+09
RECURSIVE (CUTOFF    4096) PAIR-SUM ( 0.104377E+00s):  0.134218E+09
RECURSIVE (CUTOFF    2048) PAIR-SUM ( 0.106350E+00s):  0.134218E+09
RECURSIVE (CUTOFF    1024) PAIR-SUM ( 0.109760E+00s):  0.134218E+09
RECURSIVE (CUTOFF     512) PAIR-SUM ( 0.110919E+00s):  0.134218E+09
RECURSIVE (CUTOFF     256) PAIR-SUM ( 0.113762E+00s):  0.134218E+09
RECURSIVE (CUTOFF     128) PAIR-SUM ( 0.119136E+00s):  0.134218E+09
RECURSIVE (CUTOFF      64) PAIR-SUM ( 0.125474E+00s):  0.134218E+09
RECURSIVE (CUTOFF      32) PAIR-SUM ( 0.137849E+00s):  0.134218E+09
RECURSIVE (CUTOFF      16) PAIR-SUM ( 0.190158E+00s):  0.134218E+09
RECURSIVE (CUTOFF       8) PAIR-SUM ( 0.324040E+00s):  0.134218E+09
RECURSIVE (CUTOFF       4) PAIR-SUM ( 0.568281E+00s):  0.134218E+09
RECURSIVE (CUTOFF       2) PAIR-SUM ( 0.963798E+00s):  0.134218E+09

With a cutoff of 2048, I measure ~10% longer computation time than the intrinsic sum routine, but yield the correct answer. Do note that this keeps the entire calculation in real32 reals, and is faster than the loop using a real64 temporary value to accumulate.

However, when I run the same program on a newer CPU, an AMD 5800X, I get the following results:

COMPILER: GCC version 13.1.1 20230429
OPTIONS: -march=znver3 -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -msse4a -mno-fma4 -mno-xop -mfma -mno-avx512f -mbmi -mbmi2 -maes -mpclmul -mno-avx512vl -mno-avx512bw -mno-avx512dq -mno-avx512cd -mno-avx512er -mno-avx512pf -mno-avx512vbmi -mno-avx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mno-avx512vpopcntdq -mno-avx512vbmi2 -mno-gfni -mvpclmulqdq -mno-avx512vnni -mno-avx512bitalg -mno-avx512bf16 -mno-avx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mclwb -mclzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mno-movdir64b -mno-movdiri -mmwaitx -mno-pconfig -mpku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mrdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -mno-sgx -msha -mshstk -mno-tbm -mno-tsxldtrk -mvaes -mno-waitpkg -mwbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 -mno-avxifma -mno-avxvnniint8 -mno-avxneconvert -mno-cmpccxadd -mno-amx-fp16 -mno-prefetchi -mno-raoint -mno-amx-complex --param=l1-cache-size=32 --param=l1-cache-line-size=64 --param=l2-cache-size=512 -mtune=znver3 -O3 -flto -fwhole-program -fpre-include=/usr/include/finclude/math-vector-fortran.h
                      INTRINSIC SUM ( 0.832319E-01s):  0.167772E+08
                  NAIVE DO-LOOP SUM ( 0.833409E-01s):  0.167772E+08
         DO-LOOP (INNER REAL64) SUM ( 0.836207E-01s):  0.134218E+09
RECURSIVE (CUTOFF 1048576) PAIR-SUM ( 0.833593E-01s):  0.134218E+09
RECURSIVE (CUTOFF  524288) PAIR-SUM ( 0.833508E-01s):  0.134218E+09
RECURSIVE (CUTOFF  262144) PAIR-SUM ( 0.832955E-01s):  0.134218E+09
RECURSIVE (CUTOFF  131072) PAIR-SUM ( 0.832843E-01s):  0.134218E+09
RECURSIVE (CUTOFF   65536) PAIR-SUM ( 0.832723E-01s):  0.134218E+09
RECURSIVE (CUTOFF   32768) PAIR-SUM ( 0.832000E-01s):  0.134218E+09
RECURSIVE (CUTOFF   16384) PAIR-SUM ( 0.830266E-01s):  0.134218E+09
RECURSIVE (CUTOFF    8192) PAIR-SUM ( 0.827628E-01s):  0.134218E+09
RECURSIVE (CUTOFF    4096) PAIR-SUM ( 0.821114E-01s):  0.134218E+09
RECURSIVE (CUTOFF    2048) PAIR-SUM ( 0.809698E-01s):  0.134218E+09
RECURSIVE (CUTOFF    1024) PAIR-SUM ( 0.785599E-01s):  0.134218E+09
RECURSIVE (CUTOFF     512) PAIR-SUM ( 0.739822E-01s):  0.134218E+09
RECURSIVE (CUTOFF     256) PAIR-SUM ( 0.649110E-01s):  0.134218E+09
RECURSIVE (CUTOFF     128) PAIR-SUM ( 0.461861E-01s):  0.134218E+09
RECURSIVE (CUTOFF      64) PAIR-SUM ( 0.347355E-01s):  0.134218E+09
RECURSIVE (CUTOFF      32) PAIR-SUM ( 0.304228E-01s):  0.134218E+09
RECURSIVE (CUTOFF      16) PAIR-SUM ( 0.316161E-01s):  0.134218E+09
RECURSIVE (CUTOFF       8) PAIR-SUM ( 0.453168E-01s):  0.134218E+09
RECURSIVE (CUTOFF       4) PAIR-SUM ( 0.804363E-01s):  0.134218E+09
RECURSIVE (CUTOFF       2) PAIR-SUM ( 0.145190E+00s):  0.134218E+09

Here, you can see that the intrinsic sum, real32 do-loop, and do-loop with real64 accumulator all run in nearly identical time, with the real64 accumulator loop yielding the correct result while the other 2 fail. The recursive summation increases in speed as the cutoff size approaches 32, hitting peak performance at ~36% the runtime of intrinsic and real64 accumulating loops.

I am curious to know what type of array would be required for the simple loop to yield both improved performance and improved correctness over the recursive method. I am no combinatorics expert so I truly do not know the answer.

1 Like

I found this article about accurate floating point summations. https://people.eecs.berkeley.edu/~demmel/AccurateSummation.pdf

1 Like

Both “xsp_sum” and “sum(xsp)” have inadequate precision for the calculation being proposed.
If you are writing the code, it is your responsibility to make sure that adequate precision variables are provided for the calculation.
I don’t accept that the compiler should be blamed for, or fix your failure to select adequate precision.

This is one of my objections to KIND, as it masks the true meaning of “selecting adequate precision for a computation method”.

It is interesting that pre-F90 Fortran, using 8087 80-bit registers and retaining the value of xsp_sum in an 80-bit register, then a DO loop approach would have provided a more accurate answer.

The Kahan sum method is just a more complex way of using a higher precision accumulator (which is required if a higher precision real is not available)

I think the canonical article for sums is Higham, N.J., 1993. The accuracy of floating point summation. SIAM Journal on Scientific Computing, 14(4), pp.783-799., which is the reference [5] in the J. Demmel article you posted.

4 Likes

A recent reference is

Brett Neuman, Andy Dubois, Laura Monroe, Robert W Robey, Fast, good, and repeatable: Summations, vectorization, and reproducibility, International Journal of High Performance Computing Applications, 01 September 2020, pp 519–531, https://doi.org/10.1177/1094342020938425

Abstract: Enhanced-precision global sums are key to reproducibility in exascale applications. We examine two classic summation algorithms and show that vectorized versions are fast, good and reproducible at exascale. Both 256-bit and 512-bit implementations speed up the operation by almost a factor of four over the serial version. They thus demonstrate improved performance on global summations while retaining the numerical reproducibility of these methods.

4 Likes

Very cool concept. Unfortunately I cannot access this paper. It is unfortunate that any academic papers would be put up behind paywalls, but that appears to be the world we live in today.

Often googling “pdf” followed by the paper title brings up an available working paper. It is here.

1 Like

Haha thanks. I tried paper name + arxiv and couldn’t find it. I’ll check it out

This link is to a long abstract or a conference summary. The paper I cited is 12 pages. I couldn’t find a freely available copy.

Fortunately, I was able to keep access to my university library e-journal accounts when I retired last month. The full paper gives complete implementations in C and C++. They rely on the low level vector intrinsics from x86intrin.h on intel. Not sure where the GCC intrinsics are coming from.

The code is open source (Apache license) and can be found at

An OpenMP implemention is given in

I presume there is way to implement these in Fortran but the path of least resistance is probably C-interop if someone really needs a fast exact sum

3 Likes

I spent some time playing around with the LANL code using the test program I posted above. For the LANL code, I wrote some C_Interop interfaces to call REAL32 and REAL64 versions of the LANL Intel code. To get the code to work for a REAL32 array, I had to create a temp array (size 4) and added a loop that converts the floats to doubles. The rest of the code is the same as the LANL code. I also wrote a “vectorized” Fortran version of the serial kahan code in my previous example. Everything was compiled using latest icx and ifort (haven’t tried ifx yet). The results are interesting. For REAL64 arrays, the LANL code is about 3 times faster than the intrinsic SUM and a serial loop. It does not give the exact result though (only the serial Kahan does that). For REAL32 arrays, the LANL code (at least my modified version) is about 75% slower that the loop or SUM (the conversion from float to double obviously is slow). Also interesting is my “vectorized” Fortran. With a vector length (chunk size) of 8 its about twice as fast as the serial version. Its an order of magnitude slower than the LANL code for REAL64 but is about the same speed for REAL32.

All runs were made on a AMD 8 core Zen 3 processor using latest Intel compilers. The compiler options I used were -O -xHost -mavx.

Here are some results

REAL64.

loop = 1234568.88914913 time = 6.375799328088760E-002
serial Kahan = 1234568.89123457 time = 0.247680008411407
LANL Kahan = 1234568.89108034 time = 2.137899398803711E-002
fvector Kahan = 1234568.89123457 time = 0.122158050537109
intrinsic = 1234568.88914913 time = 6.367000937461853E-002

Correct sum = 1234568.89123457

REAL32

loop = 262144.0 time = 6.232700496912003E-002
serial Kahan = 1234569. time = 0.247580990195274
LANL Kahan = 1234569. time = 0.108999013900757
fvector Kahan = 1234569. time = 0.119104981422424
intrinsic = 262144.0 time = 6.231200695037842E-002

Correct sum = 1234569.

Here is my Fortran vector routine. I used a chunk size of 8 for these test. There are probably more compiler options I could use that might speed it up. Also, this is just a quick hack. There is probably a better way to implement a vector version of the Kahan sum in Fortran.

 pure function ksum_v(a) result(ksum)

! Fortran vector version of kahan_sum
! CHUNK is the desired vector length
! WP is the working precision (set to either REAL32 or REAL64)

    real(WP), intent(in) :: a(:)
    real(WP)             :: ksum

    integer :: i, n, jend, kend

    real (WP) :: ks(CHUNK), temp(CHUNK), t(CHUNK), c(CHUNK)

    n    = SIZE(a)
    ks   = 0.0_WP
    temp = 0.0_WP
    t    = 0.0_WP
    c    = 0.0_WP

   do i=1,n,CHUNK
      jend         = MERGE(i+CHUNK-1,n, (n-i+1)>=CHUNK)
      kend         = MERGE((n-i)+1, CHUNK, (n-i+1)<CHUNK)
      temp(1:kend) = a(i:jend)  - c(1:kend)
      t(1:kend)    = ks(1:kend) + temp(1:kend)
      c(1:kend)    = (t(1:kend) - ks(1:kend)) - temp(1:kend)
      ks(1:kend)   = t(1:kend)
    end do

    ksum = SUM(ks(1:CHUNK))

  end function ksum_v


2 Likes

Thanks for the code. With gfortran 12.0.1 on Windows, compiling with -O3, for real64 arrays of size 10^8, I am finding ksum_v above to be about half as fast as the intrinsic sum but much more accurate and more than 10 times faster than using quadruple precision. For the code

module ksum_mod
implicit none
public :: wp, sum_quad, ksum_v
integer, parameter :: wp=kind(1.0d0), CHUNK=8, qp = selected_real_kind(32)
contains
pure function sum_quad(a) result(qsum)
real(WP), intent(in) :: a(:)
real(WP)             :: qsum
qsum = sum(real(a, kind=qp))
end function sum_quad

 pure function ksum_v(a) result(ksum)

! Fortran vector version of kahan_sum
! CHUNK is the desired vector length
! WP is the working precision (set to either REAL32 or REAL64)

    real(WP), intent(in) :: a(:)
    real(WP)             :: ksum

    integer :: i, n, jend, kend

    real (WP) :: ks(CHUNK), temp(CHUNK), t(CHUNK), c(CHUNK)

    n    = SIZE(a)
    ks   = 0.0_WP
    temp = 0.0_WP
    t    = 0.0_WP
    c    = 0.0_WP

   do i=1,n,CHUNK
      jend         = MERGE(i+CHUNK-1,n, (n-i+1)>=CHUNK)
      kend         = MERGE((n-i)+1, CHUNK, (n-i+1)<CHUNK)
      temp(1:kend) = a(i:jend)  - c(1:kend)
      t(1:kend)    = ks(1:kend) + temp(1:kend)
      c(1:kend)    = (t(1:kend) - ks(1:kend)) - temp(1:kend)
      ks(1:kend)   = t(1:kend)
    end do

    ksum = SUM(ks(1:CHUNK))

  end function ksum_v

end module ksum_mod

program main
use ksum_mod, only: wp, sum_quad, ksum_v
implicit none
integer, parameter :: n = 10**8, ncalc = 3, niter = 10
integer :: iter
real(kind=wp) :: x(n), xsum(ncalc), times(0:ncalc)
character (len=*), parameter :: fmt_cr = "(a10,*(f22.12))"
call random_seed()
do iter=1,niter
   call random_number(x)
   call cpu_time(times(0))
   xsum(1) = sum(x)
   call cpu_time(times(1))
   xsum(2) = ksum_v(x)
   call cpu_time(times(2))
   xsum(3) = sum_quad(x)
   call cpu_time(times(3))
   print "(/,a10,*(a22))", "", "intrinsic", "ksum_v", "sum_quad"
   print fmt_cr,"value  ",xsum
   print fmt_cr,"|error|",abs(xsum - xsum(ncalc))
   print fmt_cr,"time   ",times(1:) - times(0:ncalc-1)
end do
end program main

I get results such as

                       intrinsic                ksum_v              sum_quad
   value   49995471.340652205050 49995471.340660437942 49995471.340660437942
   |error|        0.000008232892        0.000000000000        0.000000000000
   time           0.109375000000        0.187500000000        2.375000000000

                       intrinsic                ksum_v              sum_quad
   value   49994981.394540734589 49994981.394532769918 49994981.394532769918
   |error|        0.000007964671        0.000000000000        0.000000000000
   time           0.109375000000        0.203125000000        2.359375000000

                       intrinsic                ksum_v              sum_quad
   value   50004272.729512684047 50004272.729511722922 50004272.729511730373
   |error|        0.000000953674        0.000000007451        0.000000000000
   time           0.109375000000        0.187500000000        2.375000000000

                       intrinsic                ksum_v              sum_quad
   value   50005300.672434933484 50005300.672399222851 50005300.672399222851
   |error|        0.000035710633        0.000000000000        0.000000000000
   time           0.109375000000        0.203125000000        2.359375000000

                       intrinsic                ksum_v              sum_quad
   value   49997116.065303988755 49997116.065316848457 49997116.065316848457
   |error|        0.000012859702        0.000000000000        0.000000000000
   time           0.109375000000        0.187500000000        2.390625000000

                       intrinsic                ksum_v              sum_quad
   value   49998428.190831281245 49998428.190844126046 49998428.190844126046
   |error|        0.000012844801        0.000000000000        0.000000000000
   time           0.125000000000        0.187500000000        2.375000000000

                       intrinsic                ksum_v              sum_quad
   value   50000059.049168981612 50000059.049166791141 50000059.049166791141
   |error|        0.000002190471        0.000000000000        0.000000000000
   time           0.109375000000        0.187500000000        2.359375000000

                       intrinsic                ksum_v              sum_quad
   value   50004361.297935038805 50004361.297945573926 50004361.297945566475
   |error|        0.000010527670        0.000000007451        0.000000000000
   time           0.093750000000        0.203125000000        2.359375000000

                       intrinsic                ksum_v              sum_quad
   value   50005629.889162346721 50005629.889165163040 50005629.889165170491
   |error|        0.000002823770        0.000000007451        0.000000000000
   time           0.109375000000        0.187500000000        2.359375000000

                       intrinsic                ksum_v              sum_quad
   value   49998030.355992607772 49998030.355996534228 49998030.355996541679
   |error|        0.000003933907        0.000000007451        0.000000000000
   time           0.109375000000        0.187500000000        2.375000000000
1 Like