A simple example to compare coarrays and OpenMP

I thought it would be instructive to consider a simple, but not entirely trivial, computational example and compare parallel implementations using coarrays and OpenMP (and MPI too).

The example repeatedly applies a Laplace smoothing operator (3-point, nearest-neighbor stencil) to 1D data. Here is the serial implementation:

program main

  integer :: j, n, num_itr
  character(16) :: arg
  real, allocatable :: x(:), y(:)

  call get_command_argument(1, arg)
  read(arg,*) n

  call get_command_argument(2, arg)
  read(arg,*) num_itr

  allocate(x(0:n+1), y(n))

  call random_number(x(1:n))  ! initial state
  x(0) = 0.0; x(n+1) = 0.0    ! fixed boundary values

  do j = 1, num_itr
    call apply_laplacian_smoothing(x, y)
    x(1:n) = y
  end do

  write(*,'((i0,1x,g0))') (j, x(j), j=1,n)

contains

  subroutine apply_laplacian_smoothing(x, y)
    real, intent(in)  :: x(0:)
    real, intent(out) :: y(:)
    integer :: j
    real, parameter :: theta = 0.25
    do j = 1, size(y)
      y(j) = (1-2*theta)*x(j) + theta*(x(j-1) + x(j+1))
    end do
  end subroutine

end program

And here is my coarray implementation with interspersed comments to help explain what is going on:

program main

  integer :: j, num_pts, num_itr
  integer :: np, me, n, r, offset
  character(16) :: arg
  real, allocatable :: x(:), y(:)
  real :: xleft[*], xright[*]

  call get_command_argument(1, arg)
  read(arg,*) num_pts

  call get_command_argument(2, arg)
  read(arg,*) num_itr

  np = num_images()
  me = this_image()

  ! Partition the points as equally as possible amongst the images.
  n = num_pts / np
  r = modulo(num_pts, np) ! extras
  if (me <= r) n = n + 1  ! first r images get an extra point

  ! This image works on points offset+1 to offset+n (computed next),
  ! but the global index doesn't really matter; for the purposes of
  ! this image it is more convenient to use local indices 1 to n.

  offset = n*(me-1)
  if (me > r) offset = offset + r
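  ! For example, num_pts = 10 on np = 4 images gives n = 3,3,2,2 and
  ! offset = 0,3,6,8, so the images own global points 1-3, 4-6, 7-8 and 9-10.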

  ! To compute the smoothed value at index j we need data at j-1 and j+1
  ! thus our data array needs to extend to local indices 0 and n+1. Those
  ! are our boundary points: either a partition boundary or global boundary.

  allocate(x(0:n+1), y(n))
  call random_number(x(1:n))  ! initial state

  ! Fixed boundary values
  if (me == 1) x(0) = 0.0     ! first image has the left boundary point
  if (me == np) x(n+1) = 0.0  ! last image has the right boundary point

  do j = 1, num_itr

    ! Copy the value at my first and last point to coarrays
    ! for my neighbor images to retrieve.
    if (me > 1)  xleft  = x(1)
    if (me < np) xright = x(n)

    ! Set my partition boundary values using data from my neighbor images
    sync all
    if (me > 1)  x(0) = xright[me-1]
    if (me < np) x(n+1) = xleft[me+1]
    sync all  ! ensure neighbors have fetched xleft/xright before the next
              ! iteration overwrites them (avoids a cross-image data race)

    call apply_laplacian_smoothing(x, y)
    x(1:n) = y
  end do

  ! Each image writes its values one at a time and in image order
  if (me > 1) sync images (me-1)
  write(*,'((i0,1x,g0))') (offset+j, x(j), j=1,n)
  if (me < np) sync images (me+1)

contains

  ! This is exactly the same subroutine as in the serial version.
  subroutine apply_laplacian_smoothing(x, y)
    real, intent(in)  :: x(0:)
    real, intent(out) :: y(:)
    integer :: j
    real, parameter :: theta = 0.25
    do j = 1, size(y)
      y(j) = (1-2*theta)*x(j) + theta*(x(j-1) + x(j+1))
    end do
  end subroutine

end program

Note that the two programs don’t give the same results because they start from different random initial states; that is easily fixed by assigning a deterministic initial value instead of calling random_number.
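For example (just a sketch), both versions could build the initial state from the global point index, so that the serial and coarray runs start from identical data (in the serial program take offset = 0):

  x(1:n) = [(0.5*(1.0 + sin(real(offset + j))), j = 1, n)]  ! instead of call random_number(x(1:n))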

Can someone provide an OpenMP implementation (@adenchfi maybe?); I’ve only ever used OpenMP vicariously through students. I expect it to be far simpler than my coarray version in this simple case. I haven’t given any thought yet to what an MPI version would be, but I expect it to be much more complex (syntactically at least) than the coarray version.

1 Like

Here’s a straightforward example of work-sharing with a parallel do loop:

  subroutine apply_laplacian_smoothing(x, y)
    real, intent(in)  :: x(0:)
    real, intent(out) :: y(:)
    integer :: j
    real, parameter :: theta = 0.25
    
    !$omp parallel do shared(x, y)
    do j = 1, size(y)
      y(j) = (1-2*theta)*x(j) + theta*(x(j-1) + x(j+1))
    end do
    !$omp end parallel do

  end subroutine

A more advanced implementation could use OpenMP to offload the work onto an accelerator. Another approach would be to hoist the parallel region so that it encloses the iteration loop; in that case the copy x(1:n) = y can also be executed in parallel with !$omp workshare, as sketched below. Some timings I did for smaller n (100 - 1000) didn’t show any benefit.
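A rough, untested sketch of that variant (the directive inside the smoothing routine then becomes an orphaned !$omp do instead of !$omp parallel do):

  !$omp parallel private(j)
  do j = 1, num_itr
    call apply_laplacian_smoothing(x, y)   ! now contains "!$omp do ... !$omp end do"
    !$omp workshare
    x(1:n) = y
    !$omp end workshare
  end do
  !$omp end parallel

The implicit barriers at the end of the do and workshare constructs keep the threads in step between the smoothing and the copy.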

3 Likes

For completeness, perhaps @nncarlson and @ivanpribec can include compilation flags and execution lines, in case this is helpful to those new to using these features. Maybe @nncarlson can also change the title of the thread to include CUDA.

I was asked to contribute the OMP version, but since ivan did so, I’ll contribute the CUDA Fortran version, which has elements of both OMP- and MPI-style programming (though in this example only the OMP-like elements appear). In CUDA Fortran the kernel needs an explicit interface, which is most easily provided by putting it in a module; I keep the main program and the GPU-side module in separate files, main.cuf and smooth_mod.cuf, containing the following:

main.cuf
program main                                                                                                                                                                                                                                                           
  use cudafor                                                                                                                                     
  use smooth                   
  ! LOCAL CPU VARIABLES                                                                                                                         
  real, parameter :: theta = 0.25                                                                                                                 
  integer :: j, n, num_itr                                                                                                                        
  character(16) :: arg                                                                                                                            
  real, allocatable :: x(:)  ! Don't need y(:) on CPU-side, as it's just an intermediate array in the smoothing.                                  
  ! on-GPU or GPU-related variables                                                                                                               
  integer :: ierr  ! Sees if GPU-related stuff fails                                                                                              
  integer :: threads_per_block = 256   ! Thread blocks are an abstracted layer where you couple threads into a group. Think of this like a team,  
                                       ! or a group of processors on a single node (if using shared memory).                                      
  integer :: num_blocks                ! Divide the domain of your array into different blocks. This tracks how many blocks you need.             
  real, allocatable, device :: x_GPU(:), y_GPU(:)  ! Device label is what says this is on the GPU                                                 
  ! MAIN                                                                                                                                          
  call get_command_argument(1, arg)                                                                                                               
  read(arg,*) n                                                                                                                                   
  call get_command_argument(2, arg)                                                                                                               
  read(arg,*) num_itr                                                                                                                             
  allocate(x(0:n+1))                                                                                                                              
  call random_number(x(1:n))  ! initial state                                                                                                     
  x(0) = 0.0; x(n+1) = 0.0    ! fixed boundary values                                                                                             
                                                                                                                                                  
!!!! Prep GPU variables                                                                                                                           
  allocate(x_GPU(0:n+1), y_GPU(n))                                                                                                                
  num_blocks = ceiling(real(n)/threads_per_block)                                                                                                 
                     
  print *, "x before: "                                                                                                  
  write(*,'((i0,1x,g0))') (j, x(j), j=0,n+1)                                                                                                                                  
!!!! Transfer state to GPU device                                                                                                                 
  ! This memory transfer is often the bottleneck in real applications, and CUDA has more advanced ways of transferring memory efficiently.        
  ! This is, however, the simplest way to do it.                                                                                                  
                                                                                                 
  x_GPU(:) = x(:)                                                                                                                                 
                                                                                                                                                  
!!!! Compute laplacian smoothing on GPU device                                                                                                    
  smooth_many_times: do j=1, num_itr                                                                                                              
     call apply_laplacian_smoothing<<<num_blocks,threads_per_block>>>(x_GPU, y_GPU, theta)                                                        
     x_GPU(1:n) = y_GPU(:)                                                                                                                        
  end do smooth_many_times                                                                                                                        
                                                                                                                                                  
  ! ERROR CHECKING: often good to do some. I omit any here.                                                                                       
                                                                                                                                                  
!!!! Transfer back to CPU for printing                                                                                                            
  x(:) = x_GPU(:)                                                                                                                                 
!!!! Print Result                                                                                                                                 
                                                                                                                                                  
  print *, "x after smoothing: "                                                                                                                  
  write(*,'((i0,1x,g0))') (j, x(j), j=0,n+1)                                                                                                      
                                                                                                                                                  
  deallocate(x, x_GPU, y_GPU)                                                                                                                     
                                                                                                                                                  
end program main                                                                                

and smooth_mod.cuf:

smooth_mod.cuf
module smooth                                                                                                                                     
contains                                                                                                                                          
                                                                                                                                                  
  attributes(global) subroutine apply_laplacian_smoothing(arr, smootharr,theta)                                                                   
    implicit none                                                                                                                                 
    ! INPUT/OUTPUT                                                                                                                                
    real, dimension(0:), intent(in) :: arr                                                                                                        
    real, dimension(:), intent(out) :: smootharr                                                                                                  
    real, value :: theta  ! Passing theta from CPU by value                                                                                       
    ! LOCAL VARIABLES                                                                                                                             
    integer :: idx, n                                                                                                                             
    ! Find the idx of the global array that this thread will be updating                                                                          
    idx = blockDim%x * (blockIdx%x - 1) + threadIdx%x                                                                                             
    n = size(smootharr)                                                                                                                           
    if (idx <= n) then                                                                                                                            
       smootharr(idx) = (1-2*theta)*arr(idx) + theta*(arr(idx-1) + arr(idx+1))                                                                    
    end if                                                                                                                                        
                                                                                                                                                  
!!!! We don't need a loop here; the assumption is that the number of threads/threadBlocks we've allocated fully 'cover' the array's indices.      
!!!! If we had a large enough array, we would indeed have to have a loop here, assigning multiple idxs to each thread to update.                  
                                                                                                                                                  
  end subroutine apply_laplacian_smoothing                                                                                                        
                                                                                                                                                  
end module smooth      
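If the array were larger than the launched grid, the kernel body would need a grid-stride loop. A hypothetical, untested sketch of such a variant (the subroutine name is just for illustration):

  attributes(global) subroutine apply_laplacian_smoothing_strided(arr, smootharr, theta)
    implicit none
    real, dimension(0:), intent(in)  :: arr
    real, dimension(:),  intent(out) :: smootharr
    real, value :: theta
    integer :: idx, stride, n
    n = size(smootharr)
    stride = blockDim%x * gridDim%x                    ! total threads in the grid
    idx = blockDim%x * (blockIdx%x - 1) + threadIdx%x  ! this thread's first element
    do while (idx <= n)
       smootharr(idx) = (1-2*theta)*arr(idx) + theta*(arr(idx-1) + arr(idx+1))
       idx = idx + stride
    end do
  end subroutine apply_laplacian_smoothing_strided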

I compile with nvfortran -O2 smooth_mod.cuf main.cuf -o smooth.exe and one can usually just run with ./smooth.exe 10 50. However, I have a laptop with both an integrated GPU and an NVIDIA GPU, so to specify the NVIDIA GPU, my hardware vendor tells me I have to run with

__NV_PRIME_RENDER_OFFLOAD=1 __GLX_VENDOR_LIBRARY_NAME=nvidia ./smooth.exe 10 50

which results in

Output
 x before: 
0 0.
1 0.9079230
2 0.1906921
3 0.6716529E-01
4 0.8000845
5 0.7973146
6 0.6368300
7 0.5887827
8 0.1590656
9 0.3206477
10 0.4481761
11 0.
 x after smoothing: 
0 0.
1 0.6506269E-01
2 0.1247638
3 0.1742048
4 0.2093665
5 0.2274395
6 0.2270390
7 0.2082922
8 0.1727977
9 0.1234708
10 0.6429417E-01
11 0.
Notes on Efficiency

Note 1: The above is not an efficient approach by any means. However, it’s a minimal ‘can compile and use GPU to do something’ example. One would optimize this by using shared memory and/or register-level collectives within a ‘warp’.
Note 2: For this simple, inefficient version, one could instead use the OpenACC framework, with syntax very similar to the OMP version ivan posted above. However, using CUDA Fortran directly allows optimization well beyond what OpenACC provides [except for some simple cases where OpenACC already achieves good optimization].
Note 3: Instead of making theta a parameter inside the subroutine, I pass it as an argument to illustrate how scalar constants can be passed by value to a GPU kernel.
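As a point of comparison for Note 2, an OpenACC version of the kernel (a rough, untested sketch) would look much like the OpenMP routine above, with a directive on the loop and data clauses for the transfers:

  subroutine apply_laplacian_smoothing(x, y)
    real, intent(in)  :: x(0:)
    real, intent(out) :: y(:)
    integer :: j
    real, parameter :: theta = 0.25
    !$acc parallel loop copyin(x) copyout(y)
    do j = 1, size(y)
      y(j) = (1-2*theta)*x(j) + theta*(x(j-1) + x(j+1))
    end do
    !$acc end parallel loop
  end subroutine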

Timings

Note 4: Timings: doing 100 smoothing operations, I get the following (using time and reading off the "user" figure):

n=1000: 0.106s
n=10000: 0.112s
n=100000: 0.130s
n=1000000: 0.283s
n=10000000: 1.408s
n=100000000: 16.942s

We can see the GPU doesn’t really get ‘tested’ until n>100k elements in the array.

PS: Is there a ‘Hide’ or ‘Hide Spoiler’ syntax on here? So that anyone who doesn’t want to see the output, can just not click ‘Show’? The blur spoiler is nice but space-wasting. EDIT: Thanks!

3 Likes

This straightforward example would not be effective on the i7/Ryzen hardware that I use.
For small n, the OpenMP region start-up would swamp the computing time.
For large n, larger than cache, the memory delays would be the same as for the single-thread/serial case.
The inner-loop calculation is simply too light to justify multiple threads; it would be similar to using OpenMP for dot_product.
Modifying the code to target AVX would be the better outcome, as sketched below.
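For instance (an untested sketch), the loop can be kept on a single thread and explicitly marked for SIMD vectorization, which gfortran honours with -fopenmp or -fopenmp-simd:

  subroutine apply_laplacian_smoothing(x, y)
    real, intent(in)  :: x(0:)
    real, intent(out) :: y(:)
    integer :: j
    real, parameter :: theta = 0.25
    !$omp simd
    do j = 1, size(y)
      y(j) = (1-2*theta)*x(j) + theta*(x(j-1) + x(j+1))
    end do
  end subroutine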

I looked at the alternative of applying OpenMP to the iteration loop, as below, but you would need barriers after the smoothing and after the update to keep the threads from clashing at the interfaces.
In OpenMP these barriers are just too expensive to pay off. The alternative of using atomic/flush on a thread-status vector might be better, but that is getting too complex for a forum post.
My understanding is that this is something coarray Fortran does better than OpenMP.
Could using adjust(j) = theta*( x(j-1) + x(j+1) - 2*x(j) ) instead of y be a better approach, one which might remove one of the barriers?

!$OMP PARALLEL PRIVATE(ith, j, n1, n2) SHARED(x, y, nn1, nn2, num_itr, num_threads)
  ! A BARRIER is not allowed inside a worksharing DO, so use a plain PARALLEL
  ! region and let each thread pick its own range (needs use omp_lib).
  ith = omp_get_thread_num() + 1
  n1 = nn1(ith)
  n2 = nn2(ith)

  do j = 1, num_itr

    call apply_laplacian_smoothing_part (x, y, n1, n2)
    if ( num_threads > 1 ) then
!$OMP BARRIER
    end if

    x(n1:n2) = y(n1:n2)
    if ( num_threads > 1 ) then
!$OMP BARRIER
    end if

  end do

!$OMP END PARALLEL

What are the typical values of n and num_itr that are being considered?
Once x(:) exceeds the cache size, OpenMP will stall, although what matters is the performance relative to the serial case.
For large num_itr, the straightforward approach will not be effective.
I will be interested in other comments.

1 Like

A very interesting contribution @JohnCampbell. I need to take some time to finally learn some OMP and understand these examples.

I hadn’t primarily been thinking about performance (though in the end that’s critical), but I had been using large values of n (100 to 1000 points per core/thread) and a relatively small number of iterations (fewer than 100). I’m really thinking of the problem as a stripped-down 1D PDE problem; I thought posing it as a “smoothing” operation would be more accessible.

Your consideration of regimes where a strategy will work well or not is useful info.

I should go back and do some more careful timings of my coarray version, but for somewhat larger n (though certainly not too large) I was getting close to perfect speedup over serial on a 12-core Threadripper (using the NAG compiler).

Of course this could be exploited in a coarray program as well.

1 Like

You can enclose the output in a “Details section”. See Click-to-show code snippets - #2 by ivanpribec


This was really just a toy demo. I agree with your concerns about the overheads and delays.
As your question toward the end indicates, the payload also affects the choice of parallelization strategy. I also like to keep the 80-20 principle in mind: it is often hard to justify the increased complexity of a program that gets close to the theoretical maximum speed-up. The following comic comes to mind: [comic image not included]

2 Likes

Fully agreed. I ran the examples on a MacBook (gfortran 11.2, OpenMPI, OpenCoarrays) only to find, somewhat unexpectedly, that the serial version, at least when compiled with -O3, was much faster than both the OpenMP and coarray versions.

1 Like

We should add more computation per array element: either a higher-order finite difference, some other method, or perhaps a 2D or 3D domain, which would naturally require more computation per element; see the 2D sketch below.
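For illustration, an untested sketch of a 2D (five-point stencil) variant, which does more arithmetic per byte of data moved than the 1D kernel:

  subroutine apply_laplacian_smoothing_2d(x, y)
    real, intent(in)  :: x(0:,0:)   ! pass x(0:m+1,0:n+1), including the boundary layer
    real, intent(out) :: y(:,:)     ! y(m,n)
    integer :: i, j
    real, parameter :: theta = 0.125  ! smaller theta so the centre keeps a positive weight
    do j = 1, size(y,2)
      do i = 1, size(y,1)
        y(i,j) = (1-4*theta)*x(i,j) &
               + theta*(x(i-1,j) + x(i+1,j) + x(i,j-1) + x(i,j+1))
      end do
    end do
  end subroutine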

I have an !$OMP version running along the lines I suggested above: dividing x(1:n) between the threads and using a barrier after calculating the smoothing adjustment and another after updating x.

It shows that:

  • the barrier overhead is inversely proportional to n,
  • two threads are better than one,
  • but there is not much further gain with more threads, due to memory-bandwidth limits.

My test uses gFortran 11.1 on Windows 7 on an i5-2300, with 1 to 4 threads.
I have attached a link to the code, build script and run log, which could be adapted to other environments. (now and get_processor are my own reporting routines, and the environment variables g… describe gFortran.)

For each iteration, I calculate the adjustment first and update x only after a barrier, rather than modifying x during the smoothing calculation. (I haven’t tested the OP’s approach.)

My timing uses system_clock with integer*8 arguments (28 million ticks per second), which is implemented better than omp_get_wtime, whose resolution on Windows is poor (64 ticks per second).
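A minimal, untested sketch of that timing approach (do_work is just a placeholder for the smoothing iterations being timed):

  program time_demo
    implicit none
    integer(8) :: c0, c1, rate
    call system_clock(c0, rate)   ! rate = ticks per second for integer(8) counts
    call do_work()
    call system_clock(c1)
    print *, 'elapsed seconds =', real(c1 - c0) / real(rate)
  contains
    subroutine do_work()
      ! placeholder for the work being timed
    end subroutine
  end program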

When OpenMP works well I would expect it to be quicker than coarrays, but BARRIER or frequently opened OMP regions may not work well. !$OMP PARALLEL DO has about 10 microseconds of overhead (30k - 50k processor cycles), which looks very expensive for small n.

I have created a Dropbox link below (although I don’t see this approach in other posts?).
It contains:
the program: omp2.f90,
the build script: do_omp.bat, and
the run results: omp2.log from “do_omp omp2”.
Run times for n = 5 million and 200 iterations are 2.04 s for 1 thread and 1.39 s for 2 threads, so it does show some benefit.
(The i5-2300 is old, but it has 4 cores and no hyper-threading, so it is a good development environment.)

1 Like

For the OpenMP approach, I have modified the smoothing pass to minimise the memory demand and so increase the number of threads that can be efficiently used.
Array y has been eliminated and only the edge conditions x(n1) and x(n2) (that conflict with other threads) are returned for later update.
x(n1+1:n2-1) is updated in a single smoothing pass.

    subroutine apply_laplacian_smoothing_part_newer (x, x_edge, n1,n2)
!
!  this smoothing adjustment is applied to x(n1+1:n2-1) now in a single memory pass, 
!  returning only the edge adjustments for x(n1) and x(n2) in x_edge(1:2) for later update.
!  only x(n1-1:n2+1) is read through cache, without y(n1:n2) being required
!
      real, intent(inout) :: x(0:)
      real, intent(out)   :: x_edge(2)
      integer, intent(in) :: n1,n2

      integer :: j, k
      real    :: xn, xl
      real, parameter :: theta = 0.25
      real, parameter :: alpha = 1. - 2.*theta

      k = 0
      do j = n1,n1+1
         k = k+1
         x_edge(k) = theta*x(j-1) + alpha*x(j) + theta*x(j+1)
      end do

      xl = x_edge(2)
      do j = j,n2    ! j = n1+2 here, carried over from the loop above
         xn     = theta*x(j-1) + alpha*x(j) + theta*x(j+1)
         x(j-1) = xl
         xl     = xn
      end do
      x_edge(2) = xl

    end subroutine apply_laplacian_smoothing_part_newer

This approach demonstrates that the algorithm can be modified to address the practical cache-to-memory bottleneck, which is more significant for OpenMP.

3 Likes

The previous routine is an ideal candidate for a statement function “smooth” (a feature declared obsolescent in Fortran 95), which allows more concise coding:

    subroutine apply_laplacian_smoothing_part_newer (x, x_edge, n1,n2)

!  this smoothing adjustment is applied to x(n1+1:n2-1) now in a single memory pass, 
!  returning only the edge adjustments for x(n1) and x(n2) in x_edge(1:2) for later update.
!  only x(n1-1:n2+1) is read through cache, without y(n1:n2) being required

      real, intent(inout) :: x(0:)
      real, intent(out)   :: x_edge(2)
      integer, intent(in) :: n1,n2

      integer :: j
      real    :: xn, xl, smooth
      real, parameter :: theta = 0.25
      real, parameter :: alpha = 1. - 2.*theta

      smooth(j) = theta*x(j-1) + alpha*x(j) + theta*x(j+1)

      x_edge(1) = smooth (n1)
      xl        = smooth (n1+1)
      do j = n1+2,n2
         xn     = smooth (j)
         x(j-1) = xl
         xl     = xn
      end do
      x_edge(2) = xl

    end subroutine apply_laplacian_smoothing_part_newer

I would describe this code as clearer, but there is no change to the performance.
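Since statement functions are obsolescent, a contained internal function gives the same conciseness in standard-conforming modern Fortran, and compilers will normally inline it. A hypothetical, untested sketch (note this only works if the routine is itself a module or external procedure, because internal procedures cannot contain further internal procedures):

    subroutine apply_laplacian_smoothing_part_fn (x, x_edge, n1,n2)

      real, intent(inout) :: x(0:)
      real, intent(out)   :: x_edge(2)
      integer, intent(in) :: n1,n2

      integer :: j
      real    :: xn, xl
      real, parameter :: theta = 0.25
      real, parameter :: alpha = 1. - 2.*theta

      x_edge(1) = smooth (n1)
      xl        = smooth (n1+1)
      do j = n1+2,n2
         xn     = smooth (j)
         x(j-1) = xl
         xl     = xn
      end do
      x_edge(2) = xl

    contains

      pure real function smooth(i)
        integer, intent(in) :: i
        smooth = theta*x(i-1) + alpha*x(i) + theta*x(i+1)  ! x, theta, alpha by host association
      end function smooth

    end subroutine apply_laplacian_smoothing_part_fn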