Use coarrays with a C main program

I am experimenting a bit with the combination C and coarrays, in particular:

  • The main program is C
  • This calls a simple Fortran routine compiled with the coarray option (I use ifort on Windows for now)

I can build the program without any complaints, but it does not print anything from the Fortran routine and actually seems to stop.

I attach the code.
run.txt (618 Bytes)

I checked with Dependency Walker if any DLLs were missing, but that was not the case.

Should I perhaps start it in a different way or add a call in the main (C) program to make sure that the coarray environment is started correctly?

Found something here: CoarrayLib - GCC Wiki; not sure how up to date it is.

You may also like to (re-)read: Using coarrays in a shared library (OpenCoarrays) - #3 by themos

Aha, thanks! I will check these two.

At least to get a single image version working with the Intel compiler, you can try,

// Routines from libifcore
extern void for_rtl_init_(int *, char **);
extern int for_rtl_finish_();

extern void some_caf_subroutine(); // Fortran procedure using coarrays

int main(int argc, char **argv) {
  for_rtl_init_(&argc, argv);
  some_caf_subroutine();
  for_rtl_finish_(); // ignore status
  return 0;
}

When compiling with your C compiler, you’ll need to add the path of the Intel Fortran runtime libraries:

LDFLAGS=-L/opt/intel/oneapi/compiler/2024.0/lib
LDLIBS=-lifcore -licaf -ldl

main: main.c some_caf_subroutine.o
	$(CC) $(CFLAGS) $(LDFLAGS) -o $@ $^ $(LDLIBS)

I don’t think this is intended to work. Something has to initialise the images and whatever runtime library is used for the communication. If the main program isn’t Fortran then nothing knows to do so, let alone how.

Using a Fortran main program of course does solve it. In my experiment the program simply calls the C routine which then calls the Fortran print routine. The C routine is clearly run by each image.

(By the way, I use ifort to link the program, so I do not have to worry about the libraries.)

TL;DR: …what @everythingfunctional said.

:warning: unportable internals exposed below :warning:

Inspecting the symbols using nm from binutils, it wasn’t hard to find a routine responsible for launching multiple images:

// WARNING: may be incorrect!
int for_rtl_ICAF_LAUNCH(int /*num_images*/, char* /*config*/, void * /* ??? */);

The first argument is the number of images corresponding to the flag -coarray-num-images= (no flag results in -1, giving as many images as available cores). The second argument corresponds to the argument of the flag -coarray-config-file=. I don’t know what the third argument is for; values other than NULL would crash the entire process :bomb:.

If you compile a coarray program with ifx in compiler explorer, another procedure appears:

caflaunch$MAIN$blk:
        push    rax
        mov     edi, -1
        xor     esi, esi
        xor     edx, edx
        call    for_rtl_ICAF_LAUNCH@PLT
        test    al, 1
        jne     .LBB1_1
        mov     edi, offset __unnamed_3
        pop     rax
        jmp     for_exit@PLT
.LBB1_1:
        pop     rax
        ret

which can be roughly decompiled to:

// "fake" caflaunch$MAIN$blk
extern "C" int caflaunch(void) {
	int result = for_rtl_ICAF_LAUNCH(-1,nullptr,nullptr);
	if ((result & 1) != 0) {
		return result;
	}
	return for_exit(0);
}

What is weird is that I couldn’t find any reference to this procedure (or it’s address) in the object file, so it remains unclear when or how this procedure is invoked and why it is there in the first place.

Empirically, calling the routine seems to fork the process and create multiple images:

int main(int argc, char **argv) {
  for_rtl_init_(&argc, argv);
  caflaunch();
// vvv  multiple images running  vvv
  some_caf_subroutine();
  for_rtl_finish_(); // ignore status
  return 0;
}

No matter what I tried, I couldn’t find a way to join the images back together and this behavior essentially matches the MPI model (upon which the Intel Coarray Fortran implementation is based). After a call to MPI_Finalize the number of processes (images) remains undefined, and the expectation is the program will exit.

So my suggestion for anyone trying to incorporate a coarray section into a non-Fortran program, would be to either:

  • launch a separate process and read/write whatever values needed using an external file
  • launch a separate process and communicate using IPC (this could be as simple as reading and writing to stdin/stdout)
  • invoke the non-Fortran program from the coarray Fortran program as suggested in the other thread

Hm, interesting. The reason I asked about this is that the intended use is something along these lines:

In a loop (configurable via a C++ program via some user-defined input file):

  • Let program 1 (or rather the main routine from a shared library) calculate one step.
  • At the end of the step, results are passed on to program 2 and there is an exchange between other instances of the same program.
  • Let program 2 pick up the results and do its own step (again exchanges between other instances of program 2 may happen).
  • Repeat until the end of the calculation.

The C++ program currently is responsible for managing the parallel instances (which all works via MPI), but I would like to be able to use coarrays instead of MPI.

At least that is the most complicated scenario I can foresee.

I was testing this idea of running an application where the main is in C and it calls some code in Fortran, and to mimic the expected complexity, say that the fortran code uses MPI and tries to incorporate coarray:

With some help from Claude I got the following mwe:

app\main.c
#include <stdio.h>
#include <stdlib.h>
#include <mpi.h>

#if defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER)
  void for_rtl_init_  (int *, char ***);
  int for_rtl_finish_(void);
#elif defined(__GNUC__)
  void _gfortran_caf_init  (int *, char ***);
  void _gfortran_caf_finish (void);
#endif

void caf_runtime_init(int *argc, char ***argv)
{
#if defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER)
    for_rtl_init_(argc, argv);
#elif defined(__GNUC__)
    _gfortran_caf_init(argc, argv);
#endif
    /* Optional: verify MPI was brought up by the runtime */
    int mpi_ready;
    MPI_Initialized(&mpi_ready);
    if (!mpi_ready) {
        fprintf(stderr, "[caf_runtime] ERROR: MPI not initialized by CAF runtime\n");
        exit(EXIT_FAILURE);
    }
}

void caf_runtime_finish(void)
{
    /* Optional: flush/sync before tearing down */
    int mpi_finalized;
    MPI_Finalized(&mpi_finalized);
    if (mpi_finalized) {
        fprintf(stderr, "[caf_runtime] WARNING: MPI already finalized before caf_runtime_finish\n");
        return;
    }

#if defined(__INTEL_COMPILER) || defined(__INTEL_LLVM_COMPILER)
    for_rtl_finish_();
#elif defined(__GNUC__)
    _gfortran_caf_finish ();
#endif
}

void coarray_mpi_initialize(void);
void coarray_mpi_hello(void);

int main(int argc, char **argv)
{
    caf_runtime_init(&argc, argv);
    puts("C main entering Fortran coarray hello");
    coarray_mpi_initialize();
    coarray_mpi_hello();
    caf_runtime_finish();
    return 0;
}
src\coarray_mpi.f90
module coarray_mpi
  use, intrinsic :: iso_c_binding, only: c_int
  implicit none
  private

  public :: say_hello
contains
  subroutine coarray_mpi_initialize() bind(C, name="coarray_mpi_initialize")
    integer(c_int) :: image_count

    image_count = num_images()
    sync all
  end subroutine coarray_mpi_initialize

  subroutine coarray_mpi_hello() bind(C, name="coarray_mpi_hello")
    call say_hello()
    call example_mcurcic_mpi()
    call example_mcurcic_caf()
    call example_jacobi_mpi()
    call example_jacobi_caf()
  end subroutine coarray_mpi_hello

  subroutine say_hello()
    integer(c_int) :: image_index
    integer(c_int) :: image_count
    integer(c_int) :: turn

    real, allocatable :: A(:)

    image_index = this_image()
    image_count = num_images()

    do turn = 1, image_count
      sync all
      if (image_index == turn) then
        write (*, '(a,i0,a,i0)') 'Hello from coarray image ', image_index, ' of ', image_count
      end if
    end do

    sync all
  end subroutine say_hello

  subroutine example_mcurcic_mpi()
      use mpi
      integer :: ierr, rank, nproc, request
      integer :: stat(mpi_status_size)
      integer :: array(5) = 0
      integer, parameter :: sender = 0, receiver = 1

      call mpi_comm_rank(mpi_comm_world, rank, ierr)
      call mpi_comm_size(mpi_comm_world, nproc, ierr)

      if (nproc /= 2) then
        if (rank == 0) then
          write(*,*) 'skip example_mcurcic_mpi : This code must be run on 2 parallel processes'
        end if
        return
      end if

      if (rank == sender) array = [1, 2, 3, 4, 5]
      print '(a,i1,a,5(4x,i2))', 'array on proc ', rank, &
      ' before copy:', array

      call mpi_barrier(mpi_comm_world, ierr)

      if (rank == sender) then
          call mpi_isend(array, size(array), mpi_int, &
          receiver, 1, mpi_comm_world, request, ierr)
      else if (rank == receiver) then
          call mpi_irecv(array, size(array), mpi_int, &
          sender, 1, mpi_comm_world, request, ierr)
          call mpi_wait(request, stat, ierr)
      end if

      print '(a,i1,a,5(4x,i2))', 'array on proc ', rank, &
      ' after copy: ', array
  end subroutine

  subroutine example_mcurcic_caf()
      integer :: array(5)[*] = 0
      integer, parameter :: sender = 1, receiver = 2

      if (num_images() /= 2) then
          if( this_image() == 1 ) then
          write(*,*) 'skip example_mcurcic_caf : This code must be run on 2 parallel processes'
          end if
          return
      end if

      if (this_image() == sender) array = [1, 2, 3, 4, 5]

      print '(a,i2,a,5(4x,i2))', 'array on proc ', this_image(), &
          ' before copy:', array

      sync all

      if (this_image() == receiver) array(:) = array(:)[sender]

      print '(a,i1,a,5(4x,i2))', 'array on proc ', this_image(), &
      ' after copy: ', array
  end subroutine

  subroutine example_jacobi_mpi()
      use mpi
      integer :: ierr, rank, size
      integer :: nx, ny, local_nx
      integer :: i, j, iter, max_iter
      integer :: up, down
      integer :: status(MPI_STATUS_SIZE)

      real, allocatable :: A(:,:), Anew(:,:)
      real :: local_sum, global_sum

      call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
      call MPI_Comm_size(MPI_COMM_WORLD, size, ierr)

      nx = 100
      ny = 100
      max_iter = 100

      local_nx = nx / size

      allocate(A(0:local_nx+1, 0:ny+1), source = 1.0 )
      allocate(Anew(0:local_nx+1, 0:ny+1), source = 0.0 )

      up = rank - 1
      down = rank + 1
      if (rank == 0) up = MPI_PROC_NULL
      if (rank == size-1) down = MPI_PROC_NULL

      do iter = 1, max_iter

        ! Exchange halos
        call MPI_Sendrecv(A(1,0), ny+2, MPI_REAL, up, 0, &
                          A(local_nx+1,0), ny+2, MPI_REAL, down, 0, &
                          MPI_COMM_WORLD, status, ierr)

        call MPI_Sendrecv(A(local_nx,0), ny+2, MPI_REAL, down, 1, &
                          A(0,0), ny+2, MPI_REAL, up, 1, &
                          MPI_COMM_WORLD, status, ierr)

        ! Jacobi update
        do concurrent( i = 1:local_nx, j = 1:ny )
            Anew(i,j) = 0.25 * (A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1))
        end do

        ! Swap arrays
        A(1:local_nx,1:ny) = Anew(1:local_nx,1:ny)

      end do

      local_sum = norm2( A(1:local_nx,1:ny) )
      ! Global reduction
      call MPI_Allreduce(local_sum, global_sum, 1, MPI_REAL, MPI_SUM, MPI_COMM_WORLD, ierr)
      global_sum = sqrt(global_sum)
      if (rank == 0) print *, "MPI ", "Iter:", iter, " Norm:", global_sum

  end subroutine 

  subroutine example_jacobi_caf()
    integer :: nx, ny, local_nx
    integer :: i, j, iter, max_iter
    integer :: me, nimgs

    real, allocatable :: A(:,:)[:], Anew(:,:)[:]
    real :: local_sum, global_sum

    me = this_image()
    nimgs = num_images()

    nx = 100
    ny = 100
    max_iter = 100

    local_nx = nx / nimgs

    allocate(A(0:local_nx+1, 0:ny+1)[*], source = 1.0)
    allocate(Anew(0:local_nx+1, 0:ny+1)[*], source = 0.0)

    do iter = 1, max_iter

      sync all

      ! Exchange halos using coarrays
      if (me > 1) A(0,:) = A(local_nx,:)[me-1]
      if (me < nimgs) A(local_nx+1,:) = A(1,:)[me+1]

      sync all

      ! Jacobi update
      do concurrent( i = 1:local_nx, j = 1:ny )
        Anew(i,j) = 0.25 * (A(i-1,j) + A(i+1,j) + A(i,j-1) + A(i,j+1))
      end do

      A(1:local_nx,1:ny) = Anew(1:local_nx,1:ny)

    end do

    local_sum = norm2( A(1:local_nx,1:ny) )
    ! Global reduction
    global_sum = local_sum
    call co_sum(global_sum)
    global_sum = sqrt(global_sum)
    if (this_image() == 1) print *, "CAF ", "Iter:", iter, " Norm:", global_sum

  end subroutine 

end module coarray_mpi

A quick windows bash script to build and run using mpiexec (from intel MPI) instead of cafrun:

launch.bat
@echo off
setlocal

set "ROOT=%~dp0"
set "BUILD_DIR=%ROOT%build"
set "FORTRAN_OBJ=%BUILD_DIR%\coarray_mpi.obj"
set "C_OBJ=%BUILD_DIR%\main.obj"
set "EXE=%BUILD_DIR%\coarray_mpi.exe"
set "MPI_LIB=C:\Program Files (x86)\Intel\oneAPI\mpi\latest"

where ifx >nul 2>nul || (
	echo ifx was not found on PATH.
	echo Activate the oneapi environment first.
	exit /b 1
)

where icx >nul 2>nul || (
	echo icx was not found on PATH.
	echo Activate the oneapi environment first.
	exit /b 1
)

where mpiexec >nul 2>nul || (
	echo mpiexec was not found on PATH.
	echo Activate the oneapi environment first.
	exit /b 1
)

if not exist "%BUILD_DIR%" mkdir "%BUILD_DIR%"

echo Compiling Fortran coarray source...
ifx /O2 /nologo /c /Qcoarray:distributed /module:"%BUILD_DIR%" /object:"%FORTRAN_OBJ%" "%ROOT%src\coarray_mpi.f90" -I"%MPI_LIB%\include"
if errorlevel 1 exit /b 1

echo Compiling C main...
icx /O2 /nologo /c /Fo"%C_OBJ%" "%ROOT%app\main.c" -I"%MPI_LIB%\include"
if errorlevel 1 exit /b 1

echo Linking executable...
ifx /O2 /nologo /Qcoarray:distributed /exe:"%EXE%" "%C_OBJ%" "%FORTRAN_OBJ%" /link /LIBPATH:"%MPI_LIB%\lib\release" impi.lib
if errorlevel 1 exit /b 1

echo Running with Intel MPI...
"%MPI_LIB%\bin\mpiexec.exe" -n 4 "%EXE%"
exit /b %errorlevel%

This worked with intel compilers. Don’t know if it could work with gnu+opencoarrays.

What seems encouraging about this is that it keeps the door open for progressive integration/migration within Fortran codes while also allowing on using external libraries in mixed language environement where those might be written in C/C++ using MPI in their own way.

Interesting, it works, although I had to add “mpi” to the option -I"%MPI_LIB\include".