I am learning to use coarrays on single-node clusters, typically with 24 cores. I can run tests on an interactive machine with 10 dual-threaded cores, and it works fine with:
But I would also like to learn how to launch coarray jobs with the SLURM workload manager. I tried the following job.slurm script, but I think it ran on only one core:
By default, the number of images created is equal to the number of execution units on the current system. You can override this by specifying a number using the [Q]coarray-num-images compiler option on the command line that compiles the main program. You can also specify the number of images at execution time in the environment variable FOR_COARRAY_NUM_IMAGES.
You could also fix the number of images in the executable at compile time, with -coarray -coarray-num-images=24. More guidelines are given in the documentation.
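As a rough sketch of how the run-time route could look in a job.slurm file: the script below sets FOR_COARRAY_NUM_IMAGES from the core count that SLURM reports. The job name, the time limit, the one-task/24-CPU layout and the executable name ./buddhabrot are assumptions for illustration, not something confirmed for this cluster; some sites may need --ntasks=24 instead.

```shell
#!/bin/bash
#SBATCH --job-name=coarray_test   # hypothetical job name
#SBATCH --nodes=1                 # stay on a single node
#SBATCH --ntasks=1                # assumption: one task, the coarray runtime starts the images
#SBATCH --cpus-per-task=24        # reserve 24 cores for that task
#SBATCH --time=00:10:00           # hypothetical time limit

# SLURM sets SLURM_CPUS_PER_TASK when --cpus-per-task is given;
# fall back to 24 if the script is run outside SLURM.
export FOR_COARRAY_NUM_IMAGES="${SLURM_CPUS_PER_TASK:-24}"
echo "Requesting ${FOR_COARRAY_NUM_IMAGES} images"

# Hypothetical executable name; it is launched only if it exists.
exe=./buddhabrot
if [ -x "$exe" ]; then
    "$exe"
else
    echo "Executable $exe not found; nothing launched"
fi
```

Deriving the image count from the allocation avoids hard-coding 24 in two places. If the executable was built with -coarray-num-images, the documentation is the place to check which setting applies when both are present.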
For running on a single node only, you can also try setting I_MPI_FABRICS=shm. This selects the shared-memory (intra-node) communication mechanism. There are further settings you can experiment with.
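A minimal sketch of how this could be set in the job script before the executable is launched. Intel MPI documents the variable as I_MPI_FABRICS (plural). Adding I_MPI_DEBUG is an assumption on my part: it is an Intel MPI variable that raises start-up verbosity, which may help confirm which fabric was actually picked.

```shell
#!/bin/bash
# Restrict Intel MPI to its shared-memory fabric; only sensible on a single node.
export I_MPI_FABRICS=shm

# Assumption: a moderate debug level makes the start-up output report the fabric in use.
export I_MPI_DEBUG=5

echo "I_MPI_FABRICS=${I_MPI_FABRICS} I_MPI_DEBUG=${I_MPI_DEBUG}"
```

Both variables must be exported before the coarray executable starts, since the images inherit the environment of the launching shell.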
Does anyone know if shm fabric implies use of an API like POSIX shared memory (shm_open) or could it be something different? The Intel docs mention /dev/shm/ is used on Linux.
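One way to look into this empirically, rather than an answer from the Intel documentation: on Linux, objects created with POSIX shm_open appear as files under /dev/shm, so listing that directory on the compute node while the job is running shows whether the runtime creates segments there. The sketch below only inspects the directory; it cannot tell which API created a given file.

```shell
#!/bin/bash
# Inspect the shared-memory filesystem while the coarray job is running.
shm_dir=/dev/shm

if [ -d "$shm_dir" ]; then
    df -h "$shm_dir"    # size and current usage of the filesystem
    ls -l "$shm_dir"    # segments currently present, with owner and size
else
    echo "$shm_dir does not exist on this system"
fi
```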
But this is not the end of the story. I have noticed that the images kept running after all Fortran statements had finished, until SLURM killed them when the allocated time ran out.
I added a few sync all statements to try to improve the situation, and also a stop at the end of the if (this_image() == 1) block, a suggestion from the Mistral AI agent. My code now looks schematically like this:
...
computation: do i = 1, num_samples
    ! All images working on their own array p()
end do computation
sync all
call co_sum(p, 1)
sync all
if (this_image() == 1) then
    write(*,'(A)') "I am image 1 saving the picture in 'buddhabrot.ppm'"
    ...
    ! Stopping image 1 stops all images
    ! It avoids problems with images sometimes continuing to run
    stop
end if
It’s better: the job sometimes stops, but sometimes still hangs.
On our HPC cluster, I can run tests on an interactive machine, and the Fortran images never hang at the end of the computation. The hang occurs only when the program is launched by the SLURM workload manager:
I am running a test right now. It is hanging: image 1 has started to write the file buddhabrot.ppm, but only 17 bytes (instead of 3 MB).
I cancelled it and restarted the job; this time it stopped correctly and I have a 3 MB ppm file.