Error running coarray

Hi all,
I’m triing to run an example code with coarray with oneapi intel compiler on a shared memory system. When I run the code of example on a single image all seems to go ok, but if I set -coarray=shared and -coarray-num-images=6 (or any other number) I’ve the following error:

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 8 PID 953 RUNNING AT 41ca5757f459
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 9 PID 954 RUNNING AT 41ca5757f459
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 10 PID 955 RUNNING AT 41ca5757f459
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   RANK 11 PID 956 RUNNING AT 41ca5757f459
=   KILLED BY SIGNAL: 11 (Segmentation fault)
===================================================================================

This is not due to error proble, the code print only a welcome messages from any images!
I compile the code with the following command:

ifort -O3 -coarray=shared -coarray-num-images=6 hello.f90 -o hello

The compilation seems ok, no error after the command, but if I run it with ./hello I have the previows one error!
anyone have any Idea?
Thank you in advance

It’s impossible to help without more info. Please post the exact code you’re compiling – it sounds like it is just a few lines. What platform? What version of the compiler?

Hello and thanks for your help,
my example code is:

program main
  implicit none
  print *,"debug 1" ! 1st statement of program
  write (*,*) "Hello from image", this_image(), "of", num_images()
  sync all
  print *,"debug 2" ! 1st statement following first coarray-esk statement.
  write (*,*) "Goodbye from image", this_image(), "of", num_images()
end program main

the version of the compiler is:

root@8c6937a9880a:/home/data# root@8c6937a9880a:/home/data# ifort --version
ifort (IFORT) 2021.6.0 20220226
Copyright (C) 1985-2022 Intel Corporation.  All rights reserved.

and the platform is Linux Ubuntu.
I think the problem is in environment configuration, but I don’t understand where I’m wrong.
Thanks, best regards

I would agree. This works fine for me (same compiler, Fedora 35). Did you install the Intel MPI library from the OneAPI HPC toolkit? I would have expected the compiler to complain about missing libraries if you hadn’t, but perhaps not. Also, have you sourced the setvars.sh script from the oneapi installation directory?

Yes I’ve installed the MPI library from the oneAPI HPC toolkit, and yes I’ve runned the setvars.sh script fro the oneapi installation directory… I’ve no idea to solve this problem, any idea is welcome.
Thanks

Bummer. Since the compiler works with a single image, I’m afraid I’m out of ideas. If you haven’t already, you should consider posing your problem on Intel’s Fortran forum.

Actually one more thought. Are you able to run an MPI program using Intel’s MPI library?

An MPI version of your hello.f90 program:

program main
  use mpi
  implicit none
  integer :: nproc, this_rank, ierr
  call MPI_Init(ierr)
  call MPI_Comm_size(MPI_COMM_WORLD, nproc, ierr)
  call MPI_Comm_rank(MPI_COMM_WORLD, this_rank, ierr)
  write (*,*) "Hello from image", this_rank, "of", nproc
  call MPI_Barrier(MPI_COMM_WORLD, ierr)
  write (*,*) "Goodbye from image", this_rank, "of", nproc
  call MPI_Finalize(ierr)
end program main

Then

mpifc hello-mpi.f90
mpiexec -np 3 ./a.out

Strangely, I couldn’t get the mpi_f08 module version to work; the compiler complained about “undefined reference to `mpi_f08_compile_constants_MP_mpi_comm_world_'”. But I only ever use Intel’s MPI when I’m experimenting with coarrays and the Intel compiler.

Ok, I managed to reinstall and compile the program, now I can compile and run the Hello example. Now I’ve a new mistacke on my code, when I run the code after few seconds the program freeze, I’ve added some flag in the compilation to debug the code (-g -traceback). The problem seems to be in a subrutine which uses coarrays.

subroutine write_bat(medb,outf)

use iso_fortran_env, only: stdout => output_unit, &
                           stderr => error_unit
class (Sbathy),intent(in)::medb
character(len=*), intent(in)::outf
character*40,parameter :: odir="/home/data/bathy_img/"
character*12, parameter ::ofile='pippo1.bat'
real, allocatable, dimension(:,:),codimension[:] :: bt
integer ::i,j,ic, nlat,nlon
integer, dimension(4)::ib
print *, this_image(), 'write_bat test -----------------------'
nlat=medb%gmed%nlat
nlon=medb%gmed%nlon
ib= medb%get_inner()
print*, this_image(), 'bbox= ',ib
write(*,'(a10,4(f8.4,2x))') 'bbox exp= ',medb%gmed%lon(ib(1)),medb%gmed%lon(ib(2)),&
                        medb%gmed%lat(ib(3)),medb%gmed%lat(ib(4))
!if (this_image()==2) then
!print*,this_image(), medb%gemo%box%p0%plat, medb%gemo%box%p1%plat ,medb%gmed%box%p0%plat
!end if
allocate(bt(nlon,nlat)[*])

!local bathymetry

bt=medb%gmed%z(1:nlon,1:nlat)
sync all

do ic=2,num_images()
  if (this_image().eq.ic) then
    do j=ib(3),ib(4)
      do i=ib(1),ib(2)
         bt(i,j)[1]=bt(i,j)
      end do
    end do
  end if
 sync all
end do

open(stdout,file=outf,status='unknown',access='stream')
if(this_image()==1) write(stdout) bt(:,:)
close(stdout)

!open(stdout,file=trim(odir)//trim(ofile),status='unknown',access='stream')
!if (this_image()==2) write(stdout) medb%gmed%z
!close(stdout)

end subroutine write_bat 

I compile the code with:

ifort -O3 -traceback -g -coarray=shared -coarray-num-images=6 mod_bathy_field.f90 bathy.f90 `nf-config --fflags --flibs` -o pippon

and run it with:
./pippon

Anyone have some idea on what is the cause of the problem?
Can I provide more information to help me to solve it?
best regards

You could add write/print statements inside and after the loop to check where it’s stuck. (Or use a debugger.)

One line that stands out to me is:

open(stdout,file=outf,status='unknown',access='stream')

I thought it was not allowed to open an already open unit, especially stdout.

Hello Sideboard and thank you for your help.
Before the loop there is a print statements, but when I run the code seems to stuck at the beginning of the subroutine. This subrutine is called from the main, before the call there is a sync all, and if I use a debugger flag (-g) the problem seems the sync all, I’ve commented it, and after that the debugger point to the subrutine at this specific line:

allocate(bt(nlon,nlat)[*])

but the print statement not was printed before the incriminated line.

Note that output is buffered. To check that, you could use flush(stdout) directly after the print/write.

At the beginning of the subrutine I’ve put a print, but there is no print on screen, I think the subroutine don’t start. I’m very new user in fortran, maybe I’ve used it in the past, but now I’m very rusty and many things have changed. There is a way to test the problem writing a simple test code or something like that?
thank you, best regards

Glad you were to sort out your initial problem with your hello program.

Is your write_bat subroutine a module procedure (i.e., defined in a module) and your main program uses that module? Or have you written an explicit interface block for it in the main program? Because of its class(Sbathy) dummy argument you can’t call it with an implicit interface, as in old F77 style.

I attach the code here. I hope this clarify all your questions.
Thank’s for your help,
best regards
mod_bathy_field.f90 (10.8 KB)
bathy.f90 (1.8 KB)

You indicate that the problem appears to be happening at about the call to write_bat or early inside. I don’t see anything obviously wrong that would account for the hang. So you really just need to pin down the line where things go wrong as @Sideboard suggested.
Print statements are a low tech debugging approach, but can be very effective. Frankly it’s what I do 99% of the time. Add the line

print *, "FOO", this_image()

immediately before the call to write_bat (and after the sync all), and add the line

print *, "BAR", this_image()

as the very first executable statement in write_bat. Do all the images print “FOO” but only some, or none, print “BAR”? If so then something is going wrong making the call to write_bat. Otherwise move the print statements (or add new ones) to pin down where things are hanging. Note that I’ve never observed any issue with buffering of stdout on linux in this type of usage (i.e., I don’t think the flush is necessary). When you pin down the offending line report back.

A side remark: In looking over your code I happened to notice this line

medb=Sbathy(fcu,fg(this_image()),this_image())

Where the RHS is a structure constructor expression. You may have the impression (from C++) that this is doing an in-place definition of the medb LHS variable; I did at the beginning. But this isn’t the case. Fortran will create a temporary value for the RHS and then use intrinsic assignment to assign that temporary to medb (or a defined assignment if you created one). This probably isn’t really what you want, especially if these are large objects. For this reason I’ve found structure constructors to be far less useful than they appear at first blush. Instead you can modify the structure constructor, making it a type bound subroutine (say init), and can replace the assignment with

call medb%init(fcu,fg(this_image()),this_image())
2 Likes