Computing the mean: accuracy and speed

Related to the thread Some Intrinsic SUMS, I am finding the pairwise calculation of the mean is both faster and more accurate than sum(x)/size(x)

For summing single precision arrays with

module mean_mod
implicit none
private
public :: wp, mean_base, mean_accurate, mean_updated, mean_pair
integer, parameter :: wp = kind(1.0), qp = selected_real_kind(2*precision(1.0_wp))
contains
!
recursive pure function mean_pair(x) result(y)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y
integer                   :: n, nhalf
real(kind=wp)             :: frac
integer, parameter        :: size_no_recurse = 32
n = size(x)
if (n <= size_no_recurse) then
   y = sum(x)/max(1, size(x))
   return
end if
nhalf = n/2
if (mod(n,2) == 0) then
   y = 0.5_wp*mean_pair(x(1:nhalf)) + 0.5_wp*mean_pair(x(nhalf+1:n))
else
   frac = nhalf/real(n, kind=wp)
   y = frac*mean_pair(x(1:nhalf)) + (1.0_wp-frac)*mean_pair(x(nhalf+1:n))
end if
end function mean_pair
!
pure function mean_accurate(x) result(y)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y
y = sum(real(x, kind=qp))/max(1,size(x)) ! no check for size(x) == 0 since sum(x) = 0.0 then
end function mean_accurate
!
pure function mean_base(x) result(y)
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: y
y = sum(x)/max(1,size(x)) ! no check for size(x) == 0 since sum(x) = 0.0 then
end function mean_base
!
pure function mean_updated(x) result(mean)
! based on Alan Miller's code https://jblevins.org/mirror/amiller/update.f90
real(kind=wp), intent(in) :: x(:)
real(kind=wp)             :: mean
integer                   :: i, n
real(kind=wp)             :: dev
mean = 0.0_wp
n = 0
do i=1,size(x)
   n    = n + 1
   dev  = x(i) - mean
   mean = mean + dev/n
end do
end function mean_updated
end module mean_mod

program main
use mean_mod, only: wp, mean_base, mean_pair, mean_updated, mean_accurate
use iso_fortran_env, only: compiler_version, compiler_options
implicit none
integer, parameter :: n = 10**8, ncalc = 4, npow = 4
real(kind=wp), parameter :: base = 10.0_wp**3, xshift = 0.0_wp
real(kind=wp)      :: xorig(n), x(n), xmean(ncalc), xscale, t1(ncalc), t2(ncalc)
integer            :: icalc, ipow
character (len=20) :: fmt_s
fmt_s = merge("(/,*(a,10x))", "(/,*(a,18x))", wp == kind(1.0))
print*,"compiler_version: ", compiler_version()
print*,"compiler_options: ", compiler_options()
print*,"wp: ", wp
print*,"huge(x): ", huge(x)
print*,"n: ",n
print*,"xshift: ", xshift
call random_seed()
call random_number(xorig)
xorig = xorig - xshift
do ipow=0, npow
   xscale = base ** ipow
   x      = xorig * xscale
   do icalc = 1, ncalc
      call cpu_time(t1(icalc))
      select case (icalc)
         case(1)
            xmean(1) = mean_base(x)
         case(2)
            xmean(2) = mean_updated(x)
         case (3)
            xmean(3) = mean_pair(x)
         case(4)
            xmean(4) = mean_accurate(x)
      end select
      call cpu_time(t2(icalc))
   end do
   print fmt_s, "method:", "base", "updated", "pair", "accurate"
   print*,"xscale: ", xscale
   print*,"means: " , xmean
   print*,"errors: ", abs(xmean - xmean(ncalc))
   print*,"rel errors: ", abs(xmean - xmean(ncalc))/xmean(ncalc)
   print*,"times: ", t2-t1
end do
end program main

I get

 compiler_version: GCC version 13.0.0 20221218 (experimental)
 compiler_options: -march=tigerlake -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -mno-sse4a -mno-fma4 -mno-xop -mfma -mavx512f -mbmi -mbmi2 -maes -mpclmul -mavx512vl -mavx512bw -mavx512dq -mavx512cd -mno-avx512er -mno-avx512pf -mavx512vbmi -mavx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mavx512vpopcntdq -mavx512vbmi2 -mgfni -mvpclmulqdq -mavx512vnni -mavx512bitalg -mno-avx512bf16 -mavx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mclwb -mno-clzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mmovdir64b -mmovdiri -mno-mwaitx -mno-pconfig -mno-pku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mrdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -mno-sgx -msha -mshstk -mno-tbm -mno-tsxldtrk -mvaes -mno-waitpkg -mno-wbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 -mno-avxifma -mno-avxvnniint8 -mno-avxneconvert -mno-cmpccxadd -mno-amx-fp16 -mno-prefetchi -mno-raoint --param=l1-cache-size=48 --param=l1-cache-line-size=64 --param=l2-cache-size=12288 -mtune=tigerlake -O3 -flto
 wp:            4
 huge(x):    3.40282347E+38
 n:    100000000
 xshift:    0.00000000    

method:          base          updated          pair          accurate
 xscale:    1.00000000    
 means:   0.167772159      0.500012517      0.500038147      0.500038147    
 errors:   0.332265973       2.56299973E-05   0.00000000       0.00000000    
 rel errors:   0.664481223       5.12560837E-05   0.00000000       0.00000000    
 times:    9.37500000E-02  0.421875000       4.68750000E-02   9.37500000E-02

method:          base          updated          pair          accurate
 xscale:    1000.00000    
 means:    171.798691       500.004608       500.038177       500.038147    
 errors:    328.239441       3.35388184E-02   3.05175781E-05   0.00000000    
 rel errors:   0.656428814       6.70725203E-05   6.10304980E-08   0.00000000    
 times:    9.37500000E-02  0.406250000       3.12500000E-02  0.109375000    

method:          base          updated          pair          accurate
 xscale:    1000000.00    
 means:    175921.859       499999.219       500038.156       500038.125    
 errors:    324116.250       38.9062500       3.12500000E-02   0.00000000    
 rel errors:   0.648183048       7.78065660E-05   6.24952321E-08   0.00000000    
 times:    9.37500000E-02  0.453125000       3.12500000E-02   9.37500000E-02

method:          base          updated          pair          accurate
 xscale:    1.00000000E+09
 means:    180143984.       499981376.       500038144.       500038144.    
 errors:    319894144.       56768.0000       0.00000000       0.00000000    
 rel errors:   0.639739454       1.13527341E-04   0.00000000       0.00000000    
 times:    9.37500000E-02  0.437500000       4.68750000E-02   9.37500000E-02

method:          base          updated          pair          accurate
 xscale:    9.99999996E+11
 means:    1.84467440E+11   4.99978568E+11   5.00038173E+11   5.00038140E+11
 errors:    3.15570717E+11   59572224.0       32768.0000       0.00000000    
 rel errors:   0.631093323       1.19135359E-04   6.55310046E-08   0.00000000    
 times:    7.81250000E-02  0.421875000       6.25000000E-02   9.37500000E-02

and setting wp = kind(1.0d0) in mean_mod to sum double precision arrays, I get

 compiler_version: GCC version 13.0.0 20221218 (experimental)
 compiler_options: -march=tigerlake -mmmx -mpopcnt -msse -msse2 -msse3 -mssse3 -msse4.1 -msse4.2 -mavx -mavx2 -mno-sse4a -mno-fma4 -mno-xop -mfma -mavx512f -mbmi -mbmi2 -maes -mpclmul -mavx512vl -mavx512bw -mavx512dq -mavx512cd -mno-avx512er -mno-avx512pf -mavx512vbmi -mavx512ifma -mno-avx5124vnniw -mno-avx5124fmaps -mavx512vpopcntdq -mavx512vbmi2 -mgfni -mvpclmulqdq -mavx512vnni -mavx512bitalg -mno-avx512bf16 -mavx512vp2intersect -mno-3dnow -madx -mabm -mno-cldemote -mclflushopt -mclwb -mno-clzero -mcx16 -mno-enqcmd -mf16c -mfsgsbase -mfxsr -mno-hle -msahf -mno-lwp -mlzcnt -mmovbe -mmovdir64b -mmovdiri -mno-mwaitx -mno-pconfig -mno-pku -mno-prefetchwt1 -mprfchw -mno-ptwrite -mrdpid -mrdrnd -mrdseed -mno-rtm -mno-serialize -mno-sgx -msha -mshstk -mno-tbm -mno-tsxldtrk -mvaes -mno-waitpkg -mno-wbnoinvd -mxsave -mxsavec -mxsaveopt -mxsaves -mno-amx-tile -mno-amx-int8 -mno-amx-bf16 -mno-uintr -mno-hreset -mno-kl -mno-widekl -mno-avxvnni -mno-avx512fp16 -mno-avxifma -mno-avxvnniint8 -mno-avxneconvert -mno-cmpccxadd -mno-amx-fp16 -mno-prefetchi -mno-raoint --param=l1-cache-size=48 --param=l1-cache-line-size=64 --param=l2-cache-size=12288 -mtune=tigerlake -O3 -flto
 wp:            8
 huge(x):    1.7976931348623157E+308
 n:    100000000
 xshift:    0.0000000000000000     

method:                  base                  updated                  pair                  accurate
 xscale:    1.0000000000000000     
 means:   0.50001808444994267       0.50001808444984286       0.50001808444992779       0.50001808444992779     
 errors:    1.4876988529977098E-014   8.4932061383824475E-014   0.0000000000000000        0.0000000000000000     
 rel errors:    2.9752900930260033E-014   1.6985797919140990E-013   0.0000000000000000        0.0000000000000000     
 times:    9.3750000000000000E-002  0.48437500000000000        7.8125000000000000E-002   2.1406250000000000     

method:                  base                  updated                  pair                  accurate
 xscale:    1000.0000000000000     
 means:    500.01808445017298        500.01808444995379        500.01808444992781        500.01808444992781     
 errors:    2.4516566554666497E-010   2.5977442419389263E-011   0.0000000000000000        0.0000000000000000     
 rel errors:    4.9031359698994258E-013   5.1953005755716148E-014   0.0000000000000000        0.0000000000000000     
 times:    9.3750000000000000E-002  0.46875000000000000        6.2500000000000000E-002   2.1562500000000000     

method:                  base                  updated                  pair                  accurate
 xscale:    1000000.0000000000     
 means:    500018.08444989804        500018.08444995317        500018.08444992773        500018.08444992779     
 errors:    2.9744114726781845E-008   2.5378540158271790E-008   5.8207660913467407E-011   0.0000000000000000     
 rel errors:    5.9486077907569048E-014   5.0755244555186119E-014   1.1641111136510578E-016   0.0000000000000000     
 times:    7.8125000000000000E-002  0.48437500000000000        6.2500000000000000E-002   2.1718750000000000     

method:                  base                  updated                  pair                  accurate
 xscale:    1000000000.0000000     
 means:    500018084.45003015        500018084.45005620        500018084.44992781        500018084.44992781     
 errors:    1.0234117507934570E-004   1.2838840484619141E-004   0.0000000000000000        0.0000000000000000     
 rel errors:    2.0467494729101988E-013   2.5676752269356833E-013   0.0000000000000000        0.0000000000000000     
 times:    7.8125000000000000E-002  0.48437500000000000        6.2500000000000000E-002   2.2812500000000000     

method:                  base                  updated                  pair                  accurate
 xscale:    1000000000000.0000     
 means:    500018084449.78223        500018084449.74268        500018084449.92773        500018084449.92780     
 errors:   0.14556884765625000       0.18511962890625000        6.1035156250000000E-005   0.0000000000000000     
 rel errors:    2.9112716556320351E-013   3.7022586715018707E-013   1.2206589751077715E-016   0.0000000000000000     
 times:    9.3750000000000000E-002  0.48437500000000000        6.2500000000000000E-002   2.1718750000000000     
2 Likes

I asked ChatGPT-4 to

Write a Fortran function in a module that computes the mean of real(kind=wp) :: x(:), where wp is a module kind parameter, by recursively computing the mean of equal-size subsets of x. Supply a main program to test the function.

Its first try only worked when size(x) was a power of 2, which I pointed out. Its second try was slow because it never reverted to the sum(x)/size(x) formula. When I asked it to do so for small size(x), it came up with

module statistics
  implicit none
  integer, parameter :: wp = kind(1.0d0), size_no_recurse = 32
contains

  recursive function mean(x) result(mean_value)
    real(kind=wp), intent(in) :: x(:)
    real(kind=wp) :: mean_value
    integer :: n
    n = size(x)
    if (n <= size_no_recurse) then
      mean_value = sum(x) / max(1, n)
    elseif (mod(n, 2) == 0) then
      mean_value = (mean(x(:n/2)) + mean(x(n/2+1:))) / 2
    else
      mean_value = ((n-1) * mean(x(:n-1)) + x(n)) / n
    end if
  end function mean

end module statistics


program test
  use statistics
  implicit none
  real(kind=wp), dimension(:), allocatable :: x
  integer :: n
  real(kind=wp) :: mean_value

  n = 10
  allocate(x(n))
  x = [1.0_wp, 2.0_wp, 3.0_wp, 4.0_wp, 5.0_wp, 6.0_wp, 7.0_wp, 8.0_wp, 9.0_wp, 10.0_wp]
  
  mean_value = mean(x)
  print*, "Mean value: ", mean_value
  
  deallocate(x)
  
end program test

which I am finding to be more accurate than my mean_pair function in some cases.

2 Likes

I’m glad to see this topic. My original looking into sum was prompted by my basic statistical functions not working: avg, var, std for 1D arrays with large numbers of elements. I will say I am very surprised the pair mean routine is faster than using something like pair_sum(x)/size(x)

1 Like

All of this is so intriguing. It seems to suggest that the stdlib project should be including native Fortran optimized (ie. accuracy, speed, …) versions of the intrinsics as benchmarks and as alternatives; and that LFortran in particular might benefit from the routines produced. The existing intrinsics are so heavily used that large gains in performance and/or accuracy could have a large immediate impact versus just proposed new intrinsics. Perhaps this could be a new joint project that would produce an fpm repository package as well. And chatGPT keeps looking more and more valuable if given guidance in producing parts of that. I assume if it can generate the code with comments, and a unit test, that it can even more easily generate a man-page for the procedure as well.

If given a request like “create a regular expression|xy plot|… library in Fortran” what does chatGPT respond? I am wondering how far away AI is from being able to refactor netlib into modern code structure; or just writing a new one with some mathematical guidance(?). It was sometimes stated in the early days of Fortran that Fortran would eliminate the need for programmers. Well, you can see how that prediction went. But AI is obviously going to revolutionize it if not eliminate it.

2 Likes

When I tried the @Beliavsky - chatGPT method with a large array
x = [(n, n=0,10**8)]
ifort gave a run-time segmentation fault unless I had compiled with the -heap-arrays option.

You may try increasing stack size, for instance in bash ulimit -s unlimited. I have that in my .bashrc file so it’s never a problem.

@Harper
Is your example “x = [(n, n=0,10**8)]” suggesting that auto-allocate goes to the stack, while explicit allocate goes to the heap ?

This is indeed the case by default with ifort

1 Like

I changed my program using auto-allocate by including

  if(allocated(x))deallocate(x)
  allocate(x(10**8+1))
  x(:) = [(n, n=0,10**8)]

It made no difference: ifort still worked as expected with -heap-arrays but seg.faulted without that.
That suggests that explicit allocation also went by default to the stack.

Could you try either

allocate(x(0:10**8))
or
x(:) = [(n, n=1,10**8)]

No: as an allocatable, x is allocated on the heap regardless it is allocated explicitely or implicitely on assignement.

What is going to the stack is the temporary array generated by the array constructor and the implied do loop. This code won’t segfault:

allocate(x(10**8+1))
do n = 1, size(x)
    x(n) = n
end do

I did not know that - pretty crazy. Do we have any indication why implied do-loops are not expanded by the compiler to an actual loop like that inline? Isn’t allocating an entire temporary array (even on the stack) always going to be worse than the straight forward loop?

The problem is not the implied loop in itself, it is the array constructor [ ], which is an expression. As such, it can be involved in more complex expressions, and the all-road strategy of some compilers is to always allocate temporary arrays for them.

Array syntax is nice, but behind the apparent simplicity for the developers, it can be a nightmare to optimize in all possible cases for the compiler writers.

A compiler could do that, but the straightforward formula translation is to do exactly what the programmer said to do, namely to assign the lhs of the expression with the array on the rhs.

Another “gotcha” related to this issue is the size of the executable file. The one with the array assignment might well be 400MB in size (or 800MB for real64), while the do-loop version might be less than 1kB. For such a program, the do loop version of the code might well get loaded, execute, and finish before the array version of the code is even loaded from disk.

Implied do-loops may work poorly for large arrays and should be avoided in such cases. Compiling with gfortran on Windows the following trivial program took 14 seconds, vs 0.6 second for the do-loop equivalent, and 0.9 second when n = 10**4 in the code.

implicit none
integer, parameter :: n = 10**6
integer :: i, vec(n)
vec = [(i, i=1,n)]
print*,vec(n)
end

I tried a forall statement and an ordinary do loop instead of an implied-do loop, the ifort seg.fault went away, .and forall and do were about equally fast. With gfortran ithere never was a seg.fault, and forall was a little faster than the do loop.

When the numbers have a huge dynamic range, the order of operations can have a big impact on the result (if all the big numbers come first, the impact of a huge number of tiny numbers may be completely lost). Sadly, sorting is more expensive than the summation … but you may want to at least detect such cases.

Also, it is almost never a good idea to compute the mean without variance|standard deviation. The canonical example is where someone has a pair of thermometers … positioned at head and feet and their average temp is 100df. Either they are at the beach and comfy, or their head is an oven at, say, 300df and their feet at -100df. The standard deviation for one is a bit under 1.5, and the other more like 290.

I am going to do some experiments and see if the faster summing methods can maintain accuracy and speed over compensated methods by compiling at Ofast and sorting the input first versus just doing a compensated sum at O3 compilation optimization level.

UPDATE: On this idea – First go around this will definitely not work, as the sorting is way too slow. I tried with a basic quicksort so it could definitely be better, but still.

Indeed, I think stdlib should have fast implementation of Fortran intrinsics, maybe not all, but those that we can make faster, so that compilers like LFortran could use them. I proposed exactly that in Fortran runtime math library. I think back then it wasn’t received as beneficial, but maybe now is a better time. :slight_smile:

1 Like

In most cases, the regression sum/accumulation can mitigate this problem of large dynamic range of values, with no significant performance penalty. However there can always be cases of systematic spreading of large values that swamp any approach.

Also doing a second pass for stdev, once the average is known, assists accuracy in these cases.