Parallel sum function

I have a subroutine as below:

subroutine lm(n, x, y, res) 
  integer(8) :: n
  real(8) :: x(n), y(n), res  
  real(8) :: sum_x, sum_y, sum_xx, sum_xy
  integer(8) :: i
  
  sum_x = 0d0
  sum_y = 0d0
  sum_xx = 0d0
  sum_xy = 0d0

  !$omp parallel do reduction(+:sum_x)
  do i = 1, n
    sum_x = sum_x + x(i) 
  end do
  !$omp end parallel do
  
  !$omp parallel do reduction(+:sum_y)
  do i = 1, n
    sum_y = sum_y + y(i) 
  end do
  !$omp end parallel do

  !$omp parallel do reduction(+:sum_xx)
  do i = 1, n
    sum_xx = sum_xx + x(i)*x(i) 
  end do
  !$omp end parallel do
  
  !$omp parallel do reduction(+:sum_xy)
  do i = 1, n
    sum_xy = sum_xy + x(i)*y(i) 
  end do
  !$omp end parallel do

  res = (n*sum_xy - sum_x*sum_y)/(n*sum_xx - sum_x*sum_x)
end subroutine

If a parallel sum function can be used, the subroutine will be much less verbose. Is there a plan of adding a parallel sum function to the language or std? C++ is already doing this, see:
https://en.cppreference.com/w/cpp/experimental/reduce

1 Like

There is already such a function in the standard:
https://gcc.gnu.org/onlinedocs/gfortran/CO_005fSUM.html

Using images and co_sum() you can do the same thing than with OpenMP. I have starded playing with that:
https://github.com/vmagnin/exploring_coarrays/blob/main/pi_monte_carlo_co_sum.f90

See Milan’s book for more information.

2 Likes
res = (n*sum(x*y) - sum(x)*sum(y))/(n*sum(x**2) - sum(x)**2)

Edit: If you still want to do it the verbose way, you might consider merging the loops.

  !$omp parallel do reduction(+:sum_x,sum_y,sum_xx,sum_xy)
  do i = 1, n
    sum_x = sum_x + x(i)
    sum_y = sum_y + y(i)
    sum_xx = sum_xx + x(i)**2
    sum_xy = sum_xy + x(i)*y(i) 
  end do
  !$omp end parallel do

I expect this would also reduce the overhead of launching the thread group. Before adding parallel instructions, it’s a good idea to focus on single-core performance. Then you have a better picture if you’re actually profiting from the OMP pragmas or just wasting additional CPU cycles.

Edit 2: a MWE can be found in the folded region.

Click to open
module lm_mod
  
  ! These are the types used in the dotCall64 R package
  use, intrinsic :: iso_c_binding, only: &
    dp => c_double, &
    int64 => c_int64_t

  implicit none

contains

  subroutine lm(n, x, y, res) 
    integer(int64), intent(in) :: n
    real(dp), intent(in) :: x(n), y(n)
    real(dp), intent(out) :: res  
    real(dp) :: sum_x, sum_y, sum_xx, sum_xy
    integer(int64) :: i

#if PARALLEL
    sum_x = 0
    sum_y = 0
    sum_xx = 0
    sum_xy = 0
    !$omp parallel do reduction(+:sum_x,sum_y,sum_xx,sum_xy) shared(x,y)
    do i = 1, n
      sum_x = sum_x + x(i)
      sum_y = sum_y + y(i)
      sum_xx = sum_xx + x(i)**2
      sum_xy = sum_xy + x(i)*y(i)
    end do
    !$omp end parallel do
#else
    sum_x = sum(x)
    sum_y = sum(y)
    sum_xx = sum(x**2)
    sum_xy = sum(x*y)
#endif
    res = (n*sum_xy - sum_x*sum_y)/(n*sum_xx - sum_x*sum_x)

  end subroutine

end module

program test_lm

  use lm_mod
  implicit none

  real(dp), allocatable :: x(:), y(:)
  real(dp) :: dx, res

  character(len=8) :: argv
  integer(int64) :: n, i
  integer(int64) :: rate, t1, t2

  call system_clock(count_rate=rate)

  call get_command_argument(1,argv)
  read(argv,*) n

  allocate(x(n),y(n))

  dx = 1.0_dp/real(n-1,dp)
  x = [((i-1)*dx,i=1,n)]

  call random_number(y)

  y = 0.5_dp*x + 0.05_dp*y

  call system_clock(t1)
  call lm(n,x,y,res)
  write(*,*) "res = ", res
  call system_clock(t2)

  write(*,*) "Time(s) = ", real(t2-t1,dp)/real(rate,dp)

end program
3 Likes

Can co_sum be used in shared libs?

See that very recent post:

Note that the term Coarrays can be used to talk about coarrays themselves, but is also often used as a more generic term to talk about all the parallel features introduced by Fortran 2008/2018, like the collective subroutines.

1 Like