Coarrays and multithreading (with Intel Fortran)?

Hello,

I started looking at using coarrays with Intel Fortran (classic 2021). Just toy examples at the moment…

The Intel documentation states that coarrays and OpenMP cannot be used in the same program. As a matter of fact, I tried inserting some OpenMP directives in a working coarray example, and got a crash… I tried using do concurrent constructs instead: no crash, but the compiler doesn’t seem to parallelize the loop (although I used -fopenmp for compilation, which should be enough).

Do I miss something? I find this restriction very annoying, as the hybrid MPI+OpenMP model is kinda state of the art, and I’d like coarrays just to hide the MPI stuff, keeping the OpenMP one.

2 Likes

Yes, MPI+OpenMP is the standard way to write parallel HPC codes for the past 20 years or so. I don’t know if I would call that the state-of-the-art, but it’s definitely the standard, established way. So if coarrays are to replace MPI, and do concurrent to replace OpenMP, then they have to work together.

1 Like

Just curious. How much of this is due to using MPI and OpenMP threading as the underlying communication/transport layer for co-arrays? My only real experience with co-arrays is on Cray systems that have PGAS aware hardware that makes Co-arrays competitive with MPI-2 and MPI-3 one sided communications.

I’ve also read that Intel has optimized a lot the communications between the images that run on the same node (maybe placing the coarrays in shared memory areas?), so that coarrays can be a valid alternative to OpenMP. But it’s hardly convincing, as it can’t really be as flexible as OpenMP is (and not to mention that it makes sense only if there’s way to know which images run on the same node… but maybe it’s possible ?)

I think that’s true even for MPI. If you use MPI on the same node, often times the MPI implementation has more efficient implementation.

From the title of this paper,

it seems like it might be possible, despite what the Intel documentation says.

Sounds logical, as the commication between 2 processes on the same machine doesn’t have any network bottleneck.

But one can imagine that accessing coarray data from another image on the same machine can come with almost no additionnal cost, if shared memory is used. In such a case, the MPI interface still has to copy the data, which is not needed with the coarray approach.

With the IFX compiler it is definitely possible. There are a set of examples here: monte carlo pi and we are also soon using it in a production code: NRQCD.

The code also works with openCoarrays (gfortran) and openMP. We have not tried the old ifort compiler.

As IFX and openCoarrays use MPI for co-arrays underneath, my understanding and our tests suggest that it basically respects the various MPI environment flags

I could finally run a coarray+OpenMP program without any crash on a single node (with ifort 2021)… But:

  1. I’m struggling a lot with the configuration, the compilation parameters, etc… (I still haven’t found a way to run less images than the number of cores on the machine).
  2. although the prints in the code report that several OpenMP threads are used by each process, it seems that the threads run sequentially rather than simultaneously (!)
program coarray_example
    use iso_fortran_env
    use omp_lib
    implicit none
    integer(int64), parameter :: Ntot = 2**22
    integer(int64) :: me, ni, n, i, k, s
    integer(int64), allocatable :: data(:)[:]  ! Coarray declaration
    integer(int64), allocatable :: local_data(:)

    call omp_set_num_threads(2)
    
    me = this_image()
    ni = num_images()
    
    if (me == 1) print *, 'Number of images:', ni
    
    n = Ntot / ni
    allocate( data(n)[*], local_data(n), source=0_int64 )

    print "(A,I3,A,I9,A)", 'image', me, ' has allocated a coarray of size (', n, ')[*]'
    
    ! Each image sets its own data
    call setdata( data(:)[me], me )

    print *, 'Image', me, ' has set its own values'
    
    ! Synchronize all images
    sync all
    
    ! Image 1 collects and sums the data from all images
    if (me == 1) then
        s = 0
        do k = 1, ni
            print *, 'Image 1 collecting and summing data from image', k
            local_data(:) = data(:)[k]
            s = s + sum( local_data )
        end do
        print *, 'collected sum =', s, '(should be =', Ntot*(Ntot+1)/2, ' )'
    end if
 
    
contains


    subroutine setdata(data,me)
    	integer(int64), intent(out) :: data(:)
    	integer(int64), intent(in)  :: me
    	integer(int64) :: i
    	!$OMP PARALLEL DO SCHEDULE(static, 10000)
    	do i = 1, size(data,kind=int64)
    	   if (me == 1 .and. mod(i,10000) == 0) print*, omp_get_thread_num(), i
    	   data(i) = (me-1)*size(data) + i
    	end do
    end subroutine
    
end program coarray_example

Compilation: ifort -O3 -coarray -coarray-config-file=caf.cfg -fopenmp /opt/intel/21/mpi/2021.4.0/lib/release/libmpi.so.12 coarrays.f90 -o coarrays

caf.cfg:
-genvall -genv I_MPI_FABRICS=shm:ofi -machinefile=./hostsfile -n 8 ./coarrays

hostsfile contains a single line with the machine name on which the program is launched.