Some Intrinsic SUMS

I though it was faster than the compensated algorithms, but it seems that it’s not

average accuracy runtime
straight \epsilon \ c \ \sqrt{n} x1
straight with HPA (*) \epsilon' \ c \ \sqrt{n} x1.1
straight with EHPA (*) \epsilon' \ c \ \sqrt{n} x30
pairwise \epsilon \ c \ \sqrt{\log_2{n}} x3
compensated \epsilon \ c \ x4

(*) (E)HPA = (Emulated) Higher Precision Accumulator

Where \epsilon is the machine precision of the base type, \epsilon' the machine precision for the accumulator type, and c the condition number of the summation (which is independent from the method).

The runtimes are obtained with gfortran12 -O3

Could you give a code that gives the pairwise summation for free ? The ones posted here are clearly not.

Edit: OK, it’s developped on the wikipedia page, by using a straightforward sum once the number of elements is smaller than a threshold, at the expense of a slightly larger error.

It cannot possibly be for free. If done with recursive function calls, then those calls do have a cost. Even if you don’t go all the way down to pairwise individual sums, there are still O(n) function calls for large n. And if you do it with a single function call that maintains the stack information manually, you still have the indexing operations that are done for each element, so that is also O(n) effort. Here is the code I used for the timings I mentioned.

program sums
   use, intrinsic :: iso_fortran_env, only: dp=>real64
   implicit none
   integer, parameter :: n=10**8
   integer :: itype
   real :: s1, s2, s3, s4, x(n)
   real(dp) :: cpu0, cpu1
   character(*), parameter :: sfmt = '(a,es15.7,1x,z8,es11.3," seconds")'
   do itype = 1, 2
      write(*,'("Random vector of type ",i0)') itype
      select case (itype)
      case (1)
         call random_number(x)
         x = x * 0.5 + 0.5    ! values between .5 and 1.0 have the same exponent.
      case (2)
         call random_number(x)
         x = 256 * (x - 0.5)  ! mixed signs and exponents.
      end select

      call cpu_time(cpu0)
      s1 = sum(x)
      call cpu_time(cpu1)
      write(*,sfmt) 'intrinsic:', s1, transfer(s1,1), (cpu1-cpu0)

      call cpu_time(cpu0)
      s2 = e_sum(x)
      call cpu_time(cpu1)
      write(*,sfmt) 'extended :', s2, transfer(s2,1), (cpu1-cpu0)

      call cpu_time(cpu0)
      s3 = r_sum(x)
      call cpu_time(cpu1)
      write(*,sfmt) 'recursive:', s3, transfer(s3,1), (cpu1-cpu0)

      call cpu_time(cpu0)
      s4 = p_sum(x)
      call cpu_time(cpu1)
      write(*,sfmt) 'p_sum    :', s4, transfer(s4,1), (cpu1-cpu0)
enddo
contains
   function e_sum(x)
      ! extended precision summation.
      implicit none
      real :: e_sum
      real, intent(in) :: x(:)
      integer :: i
      real(dp) :: temp
      temp = 0.0_dp
      do i = 1, size(x)
         temp = temp + real( x(i), dp )
      enddo
      e_sum = (temp)
      return
   end function e_sum
   function r_sum(x)
      ! recursive summation with manual stack.
      implicit none
      real :: r_sum
      real, intent(in) :: x(:)
      integer :: i, j, p
      real :: psum( bit_size(1) - leadz(size(x)) )  ! max reccursion depth.
      p = 0
      do i = 1, size(x)
         p = p + 1
         psum(p) = x(i)
         do j = 1, trailz(i)
            psum(p-1) = psum(p-1) + psum(p)
            p = p - 1
         enddo
      enddo
      do while (p > 1)  ! cleanup loop. accumulate in reverse order.
         psum(p-1) = psum(p-1) + psum(p)
         p = p - 1
      enddo
      r_sum = psum(1)
      return
   end function r_sum

   pure recursive function p_sum(x) result(r)
      ! recursive summation with function calls.
      implicit none
      real :: r
      real, intent(in) :: x(:)
      integer :: n
      n = size(x)
      select case (n)
      case (:0)
         r = 0.0
      case (1)
         r = x(1)
      case (2)
         r = x(1) + x(2)
      case (3:)
         r = p_sum( x(1:n/2) ) + p_sum( x(n/2+1:n) )
      end select
      return
   end function p_sum
end program sums

$ gfortran -O sums.F90 && a.out
Random vector of type 1
intrinsic:  1.6777216E+07 4B800000  9.303E-02 seconds
extended :  7.4999504E+07 4C8F0CDA  9.326E-02 seconds
recursive:  7.4999504E+07 4C8F0CDA  3.497E-01 seconds
p_sum    :  7.4999504E+07 4C8F0CDA  4.556E-01 seconds
Random vector of type 2
intrinsic:  7.7182125E+05 493C6ED4  9.314E-02 seconds
extended :  7.7176306E+05 493C6B31  9.340E-02 seconds
recursive:  7.7176300E+05 493C6B30  3.499E-01 seconds
p_sum    :  7.7176306E+05 493C6B31  4.555E-01 seconds

$ nagfor -O sums.F90 && a.out
NAG Fortran Compiler Release 7.1(Hanzomon) Build 7114
[NAG Fortran Compiler normal termination]
Random vector of type 1
intrinsic:  1.6777216E+07 4B800000  9.305E-02 seconds
extended :  7.5000056E+07 4C8F0D1F  9.317E-02 seconds
recursive:  7.5000056E+07 4C8F0D1F  2.689E-01 seconds
p_sum    :  7.5000056E+07 4C8F0D1F  4.266E-01 seconds
Random vector of type 2
intrinsic: -4.0762488E+05 C8C7091C  9.317E-02 seconds
extended : -4.0757409E+05 C8C702C3  9.315E-02 seconds
recursive: -4.0757419E+05 C8C702C6  2.681E-01 seconds
p_sum    : -4.0757412E+05 C8C702C4  4.289E-01 seconds

These timings look like 1x, 1x, 3x, 5x. That p_sum() function goes all the way down to pairwise summations, just like the r_sum() function, so that is a fair comparison of the difference between the manual stack and the recursive function versions of the algorithm.

Thank you @PierU for those speeds. But Higher Precision Accumulator means different things with
different original precisions and different compilers: gfortran in my system offers precisiion 6, 15, 18, 33 decimal digits (32, 64, 80, 128 bits) but ifort offers only 6, 15, 33 (32, 64, 128 bits). Hence if one starts with double precision (64 bits) and uses the lowest higher precision then gfortran should be faster than ifort. And did you really mean that straight with EHPA is really 40 times faster than straight? Or should “speed” at the top of the last column be replaced by “time taken”?

You’re correct, this refers to runtimes, not speed. I’ve corrected the table. I have also corrected some values, as I was previsouly fooled by some non pertinent compiler optimizations (it doesn’t change much, though). I shall post the test code.

@RonShepard
My preference is always for elapsed time, rather than processor time.

On all hardware I have used, CPU_Time is very imprecise.

2 Likes

More seriously… I have played with the different sums, and observe a strange behavior with the pairwise sum. I am summing n single precision random numbers between 1.0 and 2.0. This graph shows the relative errors for different “chunk” size to terminate the recursion (i.e. “psum 100” classically sums the elements when there are less than 100 of them):

With Err2(n) = \frac{(S(n) - S_{ref}(n))}{spacing(S_{ref}(n))} . What is plotted is actually the root mean square of Err2 over a sliding window. S_{ref}(n) is the reference summation, obtained by a straight summation with an extended precision accumulator (80 bits) (I checked that it was giving “always” the same result than with a 128 bits accumulator, and actually a 64 bits accumulator would be enough).

Two strange things to me here:

  • the error for the base pairwise summation is apparently constant over n, while the theory says it should be in O(\sqrt{log_2(n)})
  • for a chunk size of 1000, the error grows until n reaches 1000, which is expected. But after that it decreases until it reaches the same very low error as the base algorithm. I would expect it to remain first more or less constant past 1000, as the chunk size k should rule the error in O(\sqrt{log_2(n)+k})

Does anyone have an explanation? I double checked everything in the code…

Pairwise summation routine:

   pure recursive real(sp) function psum(a,chunk) result(s)
   real(sp), intent(in) :: a(:)
   integer, intent(in), optional :: chunk
   integer :: n, k, chunk___
   chunk___ = 2; if (present(chunk)) chunk___ = max(chunk___,chunk)
   n = size(a)
   if (n <= chunk___) then
      s = sum(a)
      return
   end if
   k = (n+1)/2
   s = psum(a(1:k),chunk) + psum(a(k+1:n),chunk)
   end function

Or whole code: misc/sums_bench.F90 at work · PierUgit/misc · GitHub

These tests for error are very dependent on the range of values being summed.

“summing n single precision random numbers between 1.0 and 2.0” is an extremely well behaved set of values and will only stall for very large n, while a range of values uniformly spread between say 1.0 and 1.0E10, could be a more severe test. Once a value of 1.0e10 is accumulated, all further small values < 100 will be lost.
It is a difficult balance to strike, as the value range of an actual problem does dictate the real type required.

Not sure why your error recovers as n increases, but may be due to error definition.

I thought about that, but this definition is close to the “relative error” used on the wikipedia page. Just, I am dividing by spacing(S(n)) instead of S(n). But spacing(S(n)) can be roughtly approximated by \epsilon S(n), so it’s really close (just a constant scale \epsilon between the two). Anyway, not only it recovers, but its gets independent from the chunk size for a large enough n, which is quite counter-intuitive (at least to me).

@RonShepard
I modified psum by using a minimum size of 64 to revert to the intrinsic sum and used an elapsed time timer. This shows a significant improvement to run time and a change to s4, which you may be better able to identify.
(also minor changes to use and format)

program sums
!   use, intrinsic :: iso_fortran_env, only: dp=>real64
   use iso_fortran_env, only: dp=>real64
   implicit none
   integer, parameter :: n=10**8
   integer :: itype
   real :: s1, s2, s3, s4, x(n)
   real(dp) :: cpu0, cpu1
   character(*), parameter :: sfmt = '(a,es15.7,1x,z8,f11.6," seconds")'

   write(*,'("Vector Size n = ",i0)') n
   do itype = 1, 2
      write(*,'("Random vector of type ",i0)') itype

      select case (itype)
       case (1)
         call random_number(x)
         x = x * 0.5 + 0.5    ! values between .5 and 1.0 have the same exponent.
       case (2)
         call random_number(x)
         x = 256 * (x - 0.5)  ! mixed signs and exponents.
      end select

        cpu0 = delta_sec ()
      s1 = sum(x)
        cpu0 = delta_sec ()
      write(*,sfmt) 'intrinsic:', s1, transfer(s1,1), cpu0

        cpu0 = delta_sec ()
      s2 = e_sum(x)
        cpu0 = delta_sec ()
      write(*,sfmt) 'extended :', s2, transfer(s2,1), cpu0

        cpu0 = delta_sec ()
      s3 = r_sum(x)
        cpu0 = delta_sec ()
      write(*,sfmt) 'recursive:', s3, transfer(s3,1), cpu0

        cpu0 = delta_sec ()
      s4 = p_sum(x)
        cpu0 = delta_sec ()
      write(*,sfmt) 'p_sum    :', s4, transfer(s4,1), cpu0

        cpu0 = delta_sec ()
      s4 = j_sum(x)
        cpu0 = delta_sec ()
      write(*,sfmt) 'j_sum    :', s4, transfer(s4,1), cpu0

   enddo
contains

   function e_sum(x)
      ! extended precision summation.
      implicit none
      real :: e_sum
      real, intent(in) :: x(:)
      integer :: i
      real(dp) :: temp
      temp = 0.0_dp
      do i = 1, size(x)
         temp = temp + real( x(i), dp )
      enddo
      e_sum = (temp)
      return
   end function e_sum

   function r_sum(x)
      ! recursive summation with manual stack.
      implicit none
      real :: r_sum
      real, intent(in) :: x(:)
      integer :: i, j, p
      real :: psum( bit_size(1) - leadz(size(x)) )  ! max reccursion depth.
      p = 0
      do i = 1, size(x)
         p = p + 1
         psum(p) = x(i)
         do j = 1, trailz(i)
            psum(p-1) = psum(p-1) + psum(p)
            p = p - 1
         enddo
      enddo
      do while (p > 1)  ! cleanup loop. accumulate in reverse order.
         psum(p-1) = psum(p-1) + psum(p)
         p = p - 1
      enddo
      r_sum = psum(1)
      return
   end function r_sum

   pure recursive function p_sum(x) result(r)
      ! recursive summation with function calls.
      implicit none
      real :: r
      real, intent(in) :: x(:)
      integer :: n
      n = size(x)
      select case (n)
      case (:0)
         r = 0.0
      case (1)
         r = x(1)
      case (2)
         r = x(1) + x(2)
      case (3:)
         r = p_sum( x(1:n/2) ) + p_sum( x(n/2+1:n) )
      end select
      return
   end function p_sum

   pure recursive function j_sum(x) result(r)
      ! recursive summation with function calls.
      implicit none
      real :: r
      real, intent(in) :: x(:)
      integer :: n
      integer, parameter :: min_n = 64 ! 16
      n = size(x)

      if ( n < min_n ) then
         r = sum (x)
      else
         r = j_sum ( x(1:n/2) ) + j_sum ( x(n/2+1:n) )
      end if
      return
   end function j_sum

   function delta_sec ()
      ! high precision timer.
      implicit none
      real(dp) :: delta_sec
      integer*8 :: tick, last=-1, rate=-1

      if ( rate < 0 ) call system_clock ( last, rate )
      call system_clock (tick)
      delta_sec = dble (tick-last) / dble (rate)
      last      = tick
   end function delta_sec

end program sums

gfortran -O ron_sum2.f90 -o ron_sum2.exe -v >> zz 2>&1

Driving: gfortran -O ron_sum2.f90 -o ron_sum2.exe -v -l gfortran
Built by Equation Solution <http://www.Equation.com>.
Using built-in specs.
COLLECT_GCC=gfortran
COLLECT_LTO_WRAPPER=c:/program\ files\ (x86)/gcc_eq/gcc_11.1.0/bin/../libexec/gcc/x86_64-w64-mingw32/11.1.0/lto-wrapper.exe
Target: x86_64-w64-mingw32
COLLECT_GCC_OPTIONS='-O' '-o' 'ron_sum2.exe' '-v' '-mtune=generic' '-march=x86-64' '-dumpdir' 'ron_sum2.'

Vector Size n = 100000000
Random vector of type 1
intrinsic:  1.6777216E+07 4B800000   0.061481 seconds
extended :  7.5002080E+07 4C8F0E1C   0.061670 seconds
recursive:  7.5002080E+07 4C8F0E1C   0.118890 seconds
p_sum    :  7.5002072E+07 4C8F0E1B   0.327316 seconds
j_sum    :  7.5002072E+07 4C8F0E1B   0.051229 seconds
Random vector of type 2
intrinsic:  3.6632600E+05 48B2DEC0   0.061909 seconds
extended :  3.6638588E+05 48B2E63C   0.062323 seconds
recursive:  3.6638600E+05 48B2E640   0.119196 seconds
p_sum    :  3.6638594E+05 48B2E63E   0.330983 seconds
j_sum    :  3.6638575E+05 48B2E638   0.049512 seconds

After all, this strange result at first glance is apparently correct. I have simultated how the errors accumulate during the pairwise summation (and also during the straight summation), and this correspond to the curve, with a relative error that recovers for large enough n.

The modified routine to simulate the (absolute) error:

   recursive real(sp) function psum_err(a,err,chunk) result(s)
   real(sp), intent(in) :: a(:)
   real(dp), intent(out) :: err
   integer, intent(in), optional :: chunk
   integer :: n, k, chunk___
   real(dp) :: tmp, err1, err2
   chunk___ = 2; if (present(chunk)) chunk___ = max(chunk___,chunk)
   n = size(a)
   if (n <= chunk___) then
      s = sum_err_sp(a,err)
      return
   end if
   k = (n+1)/2
   s = psum_err(a(1:k),err1,chunk) + psum_err(a(k+1:n),err2,chunk)
   call random_number(tmp)
   err = err1 + err2 + (tmp-0.5)*spacing(s)
   end function

   real(sp) function sum_err_sp(a,err) result(s)
   real(sp), intent(in) :: a(:)
   real(dp), intent(out) :: err
   integer :: i
   real(dp) :: tmp
   s = 0.0_sp
   err = 0.0_dp
   do i = 1, size(a)
      s = s + a(i)
      call random_number(tmp)
      err = err + (tmp-0.5)*spacing(s)
   end do
   end function

I’m just guessing, of course, but the larger errors for your new version of the pairwise summation are because it uses the intrinsic sum() for subvector lengths <64. The intrinsic sum apparently does not use extended precision or a compensated algorithm, so it naturally has larger errors, particularly for the type 2 vector argument which has mixed signs and exponents. The type 1 vector is a particularly simple case where all the elements have the same sign and exponent, so it doesn’t show this behavior.

The elapsed time is an interesting question. It must be because of some combination of multiple threads in addition to the vector length threshold. Presumably the p_sum() algorithm is also creating multiple threads, but it does its recursion all the way down to two elements, so that overhead (it is doing 32x more function calls) must cost more than what is gained by the use of multiple threads. If that is a correct conjecture for what is happening, then the j_sum() algorithm represents a tradeoff between reduced accuracy and increased speed. One might get the accuracy back by doing the <64 summation with extended precision (i.e. use e_sum() rather than sum()).

I don’t think the r_sum() algorithm will create multiple threads. So even if it looks more efficient than p_sum() or j_sum() with single thread times (due to only one function call to evaluate the whole sum), it has that limitation. Also, r_sum() would not be expected to use accelerator hardware, but apparently the other algorithms aren’t doing that either, even though that might occur with the right compiler optimization options.

@RonShepard
I have now included a modified j_sum as je_sum, that uses e_sum for sums < 65.
I also introduced a modified test order, as this can influence some large memory tests, but not so much help here for e_sum (Array x is 400MBytes ?).
We possibly need a random vector of type 3, with greater exponent range for x, say x = 10**(10*x), to better demonstrate the error response or overflow in the accumulator. ( edit: now included )
The results of your e_sum test vs je_sum > e_sum are surprisingly different for elapsed time. This is the type of results I attribute to better L1 cache usage/variability, but without any definate knowledge.
Would be pleased to get any comments / suggestions

Anyway, I have done these tests on Ryzen 5900X, Win 10 and Gfortran 11.1.0, and included alternative compile options in the lists.

program sums
!   use, intrinsic :: iso_fortran_env, only: dp=>real64
   use iso_fortran_env, only: dp=>real64
   implicit none
   integer, parameter :: n=10**8
   integer :: itype, k, order(6) = [ 1, 5, 6, 2, 3, 4 ]
   real :: x(n), s1, s2, s3, s4, s5, s6
   real(dp) :: cpu0, cpu1
   character(*), parameter :: sfmt = '(a,es15.7,1x,z8,f11.6," seconds")'

   write(*,'("Vector Size n = ",i0)') n
   do itype = 1, 3
      write(*,'("Random vector of type ",i0)') itype

      select case (itype)
       case (1)
         call random_number(x)
         x = x * 0.5 + 0.5    ! values between .5 and 1.0 have the same exponent.
       case (2)
         call random_number(x)
         x = 256 * (x - 0.5)  ! mixed signs and exponents.
       case (3)
         call random_number(x)
         x = 10 ** (10*x)      ! greater exponent range
      end select

     do k = 1, 6
      select case ( order(k) )
       case (1)
          cpu0 = delta_sec ()
        s1 = sum(x)
          cpu0 = delta_sec ()
        write(*,sfmt) 'intrinsic:', s1, transfer(s1,1), cpu0
  
       case (2)
          cpu0 = delta_sec ()
        s2 = e_sum(x)
          cpu0 = delta_sec ()
        write(*,sfmt) 'extended :', s2, transfer(s2,1), cpu0
  
       case (3)
          cpu0 = delta_sec ()
        s3 = r_sum(x)
          cpu0 = delta_sec ()
        write(*,sfmt) 'recursive:', s3, transfer(s3,1), cpu0
  
       case (4)
          cpu0 = delta_sec ()
        s4 = p_sum(x)
          cpu0 = delta_sec ()
        write(*,sfmt) 'p_sum    :', s4, transfer(s4,1), cpu0
  
       case (5)
          cpu0 = delta_sec ()
        s5 = j_sum(x)
          cpu0 = delta_sec ()
        write(*,sfmt) 'j_sum    :', s5, transfer(s5,1), cpu0
  
       case (6)
          cpu0 = delta_sec ()
        s6 = je_sum(x)
          cpu0 = delta_sec ()
        write(*,sfmt) 'je_sum   :', s6, transfer(s6,1), cpu0
      end select
     end do
   end do
contains

   pure function e_sum(x)
      ! extended precision summation.
      implicit none
      real :: e_sum
      real, intent(in) :: x(:)
      integer :: i
      real(dp) :: temp
      temp = 0.0_dp
      do i = 1, size(x)
         temp = temp + real( x(i), dp )
      enddo
      e_sum = (temp)
      return
   end function e_sum

   function r_sum(x)
      ! recursive summation with manual stack.
      implicit none
      real :: r_sum
      real, intent(in) :: x(:)
      integer :: i, j, p
      real :: psum( bit_size(1) - leadz(size(x)) )  ! max reccursion depth.
      p = 0
      do i = 1, size(x)
         p = p + 1
         psum(p) = x(i)
         do j = 1, trailz(i)
            psum(p-1) = psum(p-1) + psum(p)
            p = p - 1
         enddo
      enddo
      do while (p > 1)  ! cleanup loop. accumulate in reverse order.
         psum(p-1) = psum(p-1) + psum(p)
         p = p - 1
      enddo
      r_sum = psum(1)
      return
   end function r_sum

   pure recursive function p_sum(x) result(r)
      ! recursive summation with function calls.
      implicit none
      real :: r
      real, intent(in) :: x(:)
      integer :: n
      n = size(x)
      select case (n)
      case (:0)
         r = 0.0
      case (1)
         r = x(1)
      case (2)
         r = x(1) + x(2)
      case (3:)
         r = p_sum( x(1:n/2) ) + p_sum( x(n/2+1:n) )
      end select
      return
   end function p_sum

   pure recursive function j_sum(x) result(r)
      ! recursive summation with function calls.
      implicit none
      real :: r
      real, intent(in) :: x(:)
      integer :: n
      integer, parameter :: min_n = 65 ! 64 ! 16
      n = size(x)

      if ( n < min_n ) then
         r = sum (x)
      else
         r = j_sum ( x(1:n/2) ) + j_sum ( x(n/2+1:n) )
      end if
      return
   end function j_sum

   pure recursive function je_sum(x) result(r)
   
      ! recursive summation with extended precision function calls.
      implicit none
      real :: r
      real, intent(in) :: x(:)
      integer :: n
      integer, parameter :: min_n = 65 ! 128 ! 64 ! 16
      n = size(x)

      if ( n < min_n ) then
!         r = sum (x)
         r = e_sum (x)
      else
         r = je_sum ( x(1:n/2) ) + je_sum ( x(n/2+1:n) )
      end if
      return
   end function je_sum

   function delta_sec ()
      ! high precision timer.
      implicit none
      real(dp) :: delta_sec
      integer*8 :: tick, last=-1, rate=-1

      if ( rate < 0 ) call system_clock ( last, rate )
      call system_clock (tick)
      delta_sec = dble (tick-last) / dble (rate)
      last      = tick
   end function delta_sec

end program sums
set basic=-v -fimplicit-none -fallow-argument-mismatch -O2 -march=native -ffast-math
set omp=-fopenmp -fstack-arrays

set basic=-v -fimplicit-none -fallow-argument-mismatch -O -march=native
set basic=-v -fimplicit-none -fallow-argument-mismatch -O2 -march=native -ffast-math
set basic=-v -fimplicit-none -fallow-argument-mismatch -O1

gfortran %1.f90   %basic%  -o %1.exe  >> %1.tce  2>&1

%1 >> %1.tce

type %1.tce
COLLECT_GCC_OPTIONS='-v' '-fimplicit-none' '-fallow-argument-mismatch' '-O' '-march=native' '-o' 'ron_sum2.exe' '-dumpdir' 'ron_sum2.'
Vector Size n = 100000000
Random vector of type 1
intrinsic:  1.6777216E+07 4B800000   0.061096 seconds
extended :  7.4997416E+07 4C8F0BD5   0.061107 seconds
recursive:  7.4997416E+07 4C8F0BD5   0.120826 seconds
p_sum    :  7.4997416E+07 4C8F0BD5   0.327250 seconds
j_sum    :  7.4997416E+07 4C8F0BD5   0.050302 seconds
je_sum   :  7.4997416E+07 4C8F0BD5   0.052829 seconds
Random vector of type 2
intrinsic: -1.4523291E+05 C80DD43A   0.061420 seconds
extended : -1.4519922E+05 C80DCBCE   0.061530 seconds
recursive: -1.4519938E+05 C80DCBD8   0.118981 seconds
p_sum    : -1.4519919E+05 C80DCBCC   0.326990 seconds
j_sum    : -1.4519934E+05 C80DCBD6   0.050335 seconds
je_sum   : -1.4519922E+05 C80DCBCE   0.050940 seconds

COLLECT_GCC_OPTIONS='-v' '-fimplicit-none' '-fallow-argument-mismatch' '-O' '-march=native' '-ffast-math' '-o' 'ron_sum2.exe' '-dumpdir' 'ron_sum2.'
Vector Size n = 100000000
Random vector of type 1
intrinsic:  1.6777216E+07 4B800000   0.061198 seconds
extended :  7.4998288E+07 4C8F0C42   0.061058 seconds
recursive:  7.4998288E+07 4C8F0C42   0.118410 seconds
p_sum    :  7.4998288E+07 4C8F0C42   0.327628 seconds
j_sum    :  7.4998288E+07 4C8F0C42   0.050230 seconds
je_sum   :  7.4998288E+07 4C8F0C42   0.051019 seconds
Random vector of type 2
intrinsic:  9.2840462E+05 4962A94A   0.061880 seconds
extended :  9.2844181E+05 4962AB9D   0.061782 seconds
recursive:  9.2844188E+05 4962AB9E   0.119847 seconds
p_sum    :  9.2844188E+05 4962AB9E   0.327099 seconds
j_sum    :  9.2844188E+05 4962AB9E   0.050386 seconds
je_sum   :  9.2844188E+05 4962AB9E   0.050888 seconds

COLLECT_GCC_OPTIONS='-v' '-fimplicit-none' '-fallow-argument-mismatch' '-O2' '-march=native' '-ffast-math' '-o' 'ron_sum2.exe' '-dumpdir' 'ron_sum2.'
Vector Size n = 100000000
Random vector of type 1
intrinsic:  1.6777216E+07 4B800000   0.061357 seconds
extended :  7.4998368E+07 4C8F0C4C   0.061024 seconds
recursive:  7.4998368E+07 4C8F0C4C   0.092400 seconds
p_sum    :  7.4998368E+07 4C8F0C4C   0.232685 seconds
j_sum    :  7.4998368E+07 4C8F0C4C   0.042841 seconds
je_sum   :  7.4998368E+07 4C8F0C4C   0.042998 seconds
Random vector of type 2
intrinsic: -2.9935404E+04 C6E9DECF   0.063704 seconds
extended : -2.9868031E+04 C6E95810   0.063095 seconds
recursive: -2.9868047E+04 C6E95818   0.092844 seconds
p_sum    : -2.9868062E+04 C6E95820   0.233533 seconds
j_sum    : -2.9868000E+04 C6E95800   0.042905 seconds
je_sum   : -2.9868203E+04 C6E95868   0.043013 seconds
COLLECT_GCC_OPTIONS='-v' '-fimplicit-none' '-fallow-argument-mismatch' '-O1' '-o' 'ron_sum2.exe' '-mtune=generic' '-march=x86-64' '-dumpdir' 'ron_sum2.'
Vector Size n = 100000000
Random vector of type 1
intrinsic:  1.6777216E+07 4B800000   0.061273 seconds
extended :  7.5000264E+07 4C8F0D39   0.061080 seconds
recursive:  7.5000264E+07 4C8F0D39   0.125566 seconds
p_sum    :  7.5000264E+07 4C8F0D39   0.327155 seconds
j_sum    :  7.5000264E+07 4C8F0D39   0.049916 seconds
je_sum   :  7.5000264E+07 4C8F0D39   0.051608 seconds
Random vector of type 2
intrinsic:  9.8548633E+04 47C07A51   0.061245 seconds
extended :  9.8605258E+04 47C096A1   0.061045 seconds
recursive:  9.8605281E+04 47C096A4   0.125928 seconds
p_sum    :  9.8605109E+04 47C0968E   0.328060 seconds
j_sum    :  9.8605062E+04 47C09688   0.049556 seconds
je_sum   :  9.8605203E+04 47C0969A   0.051334 seconds

{{ Revised order }}
COLLECT_GCC_OPTIONS='-v' '-fimplicit-none' '-fallow-argument-mismatch' '-O1' '-o' 'ron_sum2.exe' '-mtune=generic' '-march=x86-64' '-dumpdir' 'ron_sum2.'
Vector Size n = 100000000
Random vector of type 1
intrinsic:  1.6777216E+07 4B800000   0.061068 seconds
j_sum    :  7.4997688E+07 4C8F0BF7   0.049448 seconds
je_sum   :  7.4997688E+07 4C8F0BF7   0.051313 seconds
extended :  7.4997688E+07 4C8F0BF7   0.061019 seconds
recursive:  7.4997688E+07 4C8F0BF7   0.126306 seconds
p_sum    :  7.4997688E+07 4C8F0BF7   0.330779 seconds
Random vector of type 2
intrinsic: -6.9337450E+05 C92947E8   0.061718 seconds
j_sum    : -6.9337188E+05 C92947BE   0.049567 seconds
je_sum   : -6.9337169E+05 C92947BB   0.051128 seconds
extended : -6.9337188E+05 C92947BE   0.061709 seconds
recursive: -6.9337188E+05 C92947BE   0.127979 seconds
p_sum    : -6.9337188E+05 C92947BE   0.325464 seconds
COLLECT_GCC_OPTIONS='-v' '-fimplicit-none' '-fallow-argument-mismatch' '-O1' '-o' 'ron_sum2.exe' '-mtune=generic' '-march=x86-64' '-dumpdir' 'ron_sum2.'
Vector Size n = 100000000
Random vector of type 1
intrinsic:  1.6777216E+07 4B800000   0.061153 seconds
j_sum    :  7.4999704E+07 4C8F0CF3   0.049519 seconds
je_sum   :  7.4999704E+07 4C8F0CF3   0.050971 seconds
extended :  7.4999704E+07 4C8F0CF3   0.061078 seconds
recursive:  7.4999704E+07 4C8F0CF3   0.126844 seconds
p_sum    :  7.4999704E+07 4C8F0CF3   0.331213 seconds
Random vector of type 2
intrinsic: -3.0828716E+05 C89687E5   0.062152 seconds
j_sum    : -3.0824225E+05 C8968248   0.049818 seconds
je_sum   : -3.0824238E+05 C896824C   0.051187 seconds
extended : -3.0824219E+05 C8968246   0.061058 seconds
recursive: -3.0824234E+05 C896824B   0.126645 seconds
p_sum    : -3.0824225E+05 C8968248   0.329529 seconds

COLLECT_GCC_OPTIONS='-v' '-fimplicit-none' '-fallow-argument-mismatch' '-O1' '-o' 'ron_sum2.exe' '-mtune=generic' '-march=x86-64' '-dumpdir' 'ron_sum2.'
Vector Size n = 100000000
Random vector of type 1
intrinsic:  1.6777216E+07 4B800000   0.061181 seconds
j_sum    :  7.4999504E+07 4C8F0CDA   0.049612 seconds
je_sum   :  7.4999504E+07 4C8F0CDA   0.051121 seconds
extended :  7.4999504E+07 4C8F0CDA   0.061106 seconds
recursive:  7.4999504E+07 4C8F0CDA   0.124811 seconds
p_sum    :  7.4999504E+07 4C8F0CDA   0.329647 seconds
Random vector of type 2
intrinsic: -6.9732494E+05 C92A3ECF   0.061310 seconds
j_sum    : -6.9740488E+05 C92A43CE   0.049676 seconds
je_sum   : -6.9740469E+05 C92A43CB   0.051098 seconds
extended : -6.9740481E+05 C92A43CD   0.061636 seconds
recursive: -6.9740494E+05 C92A43CF   0.125510 seconds
p_sum    : -6.9740488E+05 C92A43CE   0.327516 seconds
Random vector of type 3
intrinsic:  4.0709641E+16 5B10A134   0.062140 seconds
j_sum    :  4.3465584E+16 5B1A6BB8   0.049463 seconds
je_sum   :  4.3465584E+16 5B1A6BB8   0.050820 seconds
extended :  4.3465589E+16 5B1A6BB9   0.061240 seconds
recursive:  4.3465589E+16 5B1A6BB9   0.125593 seconds
p_sum    :  4.3465584E+16 5B1A6BB8   0.329287 seconds

Unless there is some kind of parallel execution involved, I do not understand how an expression like

could ever be executed faster than the simple e_sum(x) which is ultimately invoked for the numerical evaluation. I think the vector size is too large for any l1 cache effects to be exploited. So the difference must be somehow that simd hardware or accelerators are used in parallel, or compiler-generated threads allow execution in parallel, or something along those lines.

Regarding cpu_time() vs. elapsed time, this distinction was more important in the past. In the past, the cpu_time() (or its local equivalent) was supposed to be the stable value from run to run. Thus if you were optimizing an algorithm, or tuning some run time parameters, that is what the programmer would use. In contrast, the elapsed time depended on how many other users were on the machine at that time, or whether an i/o operation had to wait for a disk drive to spin up, or whether the mail deamon had to service an incoming mail request. Those things affected wall time, but not cpu time. These days, when mutiple compute cores are dedicated to just a single task, those types of things are less important, and the goal is to reduce the wall time. With parallel execution, the cpu time is sometimes the total over all the threads, so in those cases one can reduce the elapsed time by factors of 2x or 4x while the cpu time remains the same.

I do not believe that any current Fortran compiler will produce multi-threaded code without OpenMP/OpenACC and specific, additional compilation flags.

@RonShepard
Regarding the consistent difference between the esum test (.06 sec) and the je_sum test (.05 sec) using -O1:
e_sum is called with n = 100000000 ( which is stored in a combination of L1, L3 and memory )
je_sum > e_sum is called with n typically = 64 or smaller. ( which more likely could be in L1 cache )

I estimate that this enables the x section to be in L1 cache more often (due to interaction of other code) and so more efficient AVX, while e_sum with x(100000000) has x in L3 cache and beyond.
My “experience” with AVX is that when a large array is in L3 cache, the efficiency is less, as there appears to be an instruction timing problem moving the array from L3 to L1 for the AVX instruction.
I have experienced this reduced efficiency with AVX for large real*8 vectors for SUM and DOT_PRODUCT on both I7 and AMD processors. With these processors, SUM performance is very dependent on AVX efficiency.

(Unfortunately, this is based on my “experience” and not my knowledge of the interaction of the loading and calcluating AVX instructions. Does anyone know if there is an issue for loading the AVX register from L1 vs L3 cache ?)

There must be some explaination for this different performance ?

Another result is the performance improvement with -ffast-math, without a significant change in estimated error, but again this outcome is very dependent on the “x” values. Again more so with je_sum.

http://www.idris.fr/media/formations/simd/idrissimd.pdf

In this course (sorry it is in French) they show (pages 64 to 67) that there are basically 2 limiting factors, one is as you mentioned the clock ticks limit to transfer data between cache levels and the second one is the self regulation of the clock rate when using AVX instructions, which slows down with respect to the nominal GHz rate (I think to prevent overheating).

I have summarized my tests around the sums here: misc/sums_bench.md at main · PierUgit/misc · GitHub

3 Likes

In the rest of the document we assume that we are in the case where the denominator can be neglected:

The equation after this looks suspicious. I think you want something like "denominator ~ 1.

Indeed… Corrected. Thanks for pointing it.