Multiple parallelization layers vs the agnosticism of coarrays: suggestions for improvement

How can Fortran take advantage of the fact that communication is faster intra-node (shared memory) than inter-node? Coarrays reduce the need for, say, MPI, but current implementations do not seem to be able to act like hybrid MPI+OpenMP/pthreads programs.

Example problem: domain decomposition. One may wish to solve a PDE by decomposing it into different subdomains, solving it on each subdomain, and communicating boundary/overlap conditions between the subdomains. The less memory each process has to itself, the smaller the subdomain it can handle, and thus the more boundary-condition-matching iterations have to be done, which increases both processing and communication time. Shared-memory models allow larger subdomains to be solved efficiently. Thus, in the MPI+OpenMP/pthreads approach, one assigns subdomains to different nodes, where each node may use a shared-memory model to solve its subdomain efficiently, and less communication is needed overall.
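
For concreteness, here is a minimal sketch of that hybrid pattern, assuming one MPI rank per node and OpenMP threads within each rank; the halo exchange is elided and the array sizes and loop structure are illustrative only, not taken from any particular code.

    ! Hybrid MPI+OpenMP sketch: one rank per node owns a large subdomain,
    ! and the threads on that node share the subdomain's memory.
    program hybrid_mpi_openmp
      use mpi_f08
      implicit none
      integer :: provided, rank, nranks, i, j
      real, allocatable :: u(:,:), unew(:,:)

      ! Request a threaded MPI level, since each rank will spawn OpenMP threads.
      call MPI_Init_thread(MPI_THREAD_FUNNELED, provided)
      call MPI_Comm_rank(MPI_COMM_WORLD, rank)
      call MPI_Comm_size(MPI_COMM_WORLD, nranks)

      allocate(u(512,512), unew(512,512))   ! one large subdomain per node
      u = 0.0; unew = 0.0

      ! ... inter-node: exchange boundary/overlap data with neighbouring
      !     ranks via MPI_Sendrecv or nonblocking calls (omitted) ...

      ! Intra-node: OpenMP threads update the shared subdomain in parallel.
      !$omp parallel do private(i)
      do j = 2, 511
         do i = 2, 511
            unew(i,j) = 0.25*(u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))
         end do
      end do
      !$omp end parallel do

      call MPI_Finalize()
    end program hybrid_mpi_openmp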

  • There appears to be no way in coarray Fortran to take advantage of a hybrid parallelization scheme like the above without using external libraries such as MPI or OpenMP/pthreads. While Intel or other Fortran compilers may implement coarray communication in a hybrid way, the programmer does not know this and cannot design the algorithm around it; yet there are clear speed advantages to being able to specify this behavior at the algorithm/program level, the above being one such example.

Suggestions:

  1. Allow images to be grouped via TEAMS into groups that have faster communication, or that can otherwise share memory. The program would need to know at runtime how many images there are per node, which images are on which nodes, etc. This would require the compiler/underlying communication layer to be able to share that information with the Fortran program.
    • As far as I can tell, MPI already supports this behavior. Intel MPI, at least, lets you run mpirun -ppn N -n M ./prog.exe on the command line, specifying the number of processes per node and the total number of processes. The associated MPI function is MPI_Get_processor_name(). Creating intrinsics for the purpose of forming teams with faster communication, such as this_node() or this_processor(), would be a start. With Intel Fortran the IFPORT module provides this as the HOSTNAM/HOSTNM function, and gfortran provides HOSTNM (see the first sketch after this list).
    • If we assume the underlying coarray implementation automatically changes the memory model used for communicating between images on the same node, then using this_node() to assign teams might be all we need. However, I do not know of any current compiler that does this for coarrays; the Intel MPI and coarray documentation, at least, does not seem to mention it explicitly.
    • Perhaps there could also be a flag in FORM TEAM that specifies the memory model (shared vs. distributed) used for the team, though this goes against the implementation-agnostic nature of the standard. One could combine this with checking HOSTNM/this_node() to ensure teams are assigned appropriately.
      • MPI-3 already has the capability of shared-memory access to arrays between processes on the same node, so similar functionality seems natural for coarray teams (see the MPI-3 window sketch after this list). Examples: Documentation Ex1 Ex2
    • The standard currently specifies that all images start in the same (initial) team.
      • If we remove that specification and remain agnostic, then standard-conforming compilers could instead choose to initialize images into different teams based on the nodes they run on, removing some of the work the programmer has to do.
      • Going in the opposite direction from agnosticism, we could keep the requirement that the initial team contains all images, but further specify that child teams are automatically formed for the images on each node.
  2. The other solution I can think of involves implementing something like the proposal below.
    Intrinsic threads Proposal
    • If one creates a coarray program whose underlying implementation uses a distributed memory model, one could in principle execute it with one image per node and have each image launch a number of threads (all sharing the node's memory), achieving the same result as MPI+pthreads for a hybrid distributed+shared solution (see the last sketch after this list).
    • This solution allows the standard to remain agnostic about the coarray implementation, while giving programmers enough flexibility to design their algorithms to fully take advantage of the multiple levels of parallelization in practice today.
    • Additionally, intrinsic threads have an independent use case: they enable dynamic parallelism (sequential -> parallel -> sequential), which may be useful in areas where coarrays/images are not.
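
To illustrate suggestion 1, here is a sketch of forming teams by node with what is available today: the nonstandard HOSTNM/HOSTNAM extension stands in for a hypothetical this_node() intrinsic, and the team number is simply the lowest image index found on the same host. This is only an approximation under those assumptions, not a proposed standard interface.

    ! Sketch: group images into per-node teams (Fortran 2018 teams syntax).
    program teams_by_node
      use, intrinsic :: iso_fortran_env, only: team_type
      implicit none
      type(team_type) :: node_team
      character(len=255) :: myhost
      character(len=255), allocatable :: hosts(:)[:]
      integer :: i, my_node

      allocate(hosts(num_images())[*])

      ! Nonstandard: HOSTNM as in gfortran; Intel provides HOSTNAM/HOSTNM via IFPORT.
      call hostnm(myhost)
      hosts(this_image()) = myhost     ! each image records its own host name
      sync all

      ! The team number is the lowest image index that shares my host name.
      my_node = this_image()
      do i = 1, num_images()
         if (hosts(i)[i] == myhost) then
            my_node = i
            exit
         end if
      end do

      form team (my_node, node_team)   ! images on the same node join one team
      change team (node_team)
         ! ... intra-node work and communication here; an implementation
         !     could, in principle, use shared memory within this team ...
      end change team
    end program teams_by_node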
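
For comparison, the MPI-3 shared-memory capability referred to above looks roughly like the following in Fortran (standard MPI-3 calls; the sizes and data layout are illustrative only). Note that MPI_Comm_split_type with MPI_COMM_TYPE_SHARED gives exactly the per-node grouping discussed in suggestion 1.

    ! Sketch of an MPI-3 shared-memory window: ranks on the same node
    ! allocate adjoining segments that they can address directly.
    program mpi_shared_window
      use mpi_f08
      use, intrinsic :: iso_c_binding, only: c_ptr, c_f_pointer
      implicit none
      type(MPI_Comm) :: node_comm
      type(MPI_Win)  :: win
      type(c_ptr)    :: baseptr
      integer(MPI_ADDRESS_KIND) :: winsize
      integer :: disp_unit, node_rank
      real, pointer :: seg(:)

      call MPI_Init()
      ! Split COMM_WORLD into one communicator per shared-memory node.
      call MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0, &
                               MPI_INFO_NULL, node_comm)
      call MPI_Comm_rank(node_comm, node_rank)

      disp_unit = storage_size(1.0) / 8            ! bytes per default real
      winsize   = int(1000 * disp_unit, MPI_ADDRESS_KIND)
      call MPI_Win_allocate_shared(winsize, disp_unit, MPI_INFO_NULL, &
                                   node_comm, baseptr, win)
      call c_f_pointer(baseptr, seg, [1000])
      seg = real(node_rank)                        ! each rank fills its own segment

      ! Other ranks on the same node can locate this segment with
      ! MPI_Win_shared_query and read/write it directly (omitted).

      call MPI_Win_free(win)
      call MPI_Finalize()
    end program mpi_shared_window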
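
Finally, for suggestion 2 (or as a workaround today), the "one image per node plus threads" pattern could look like the sketch below, with OpenMP inside each image standing in for the proposed intrinsic threads. It assumes the launcher actually places one image per node (e.g. via the processes-per-node option of the underlying MPI launcher); the 1-D strip decomposition is illustrative only.

    ! Sketch: coarray images for inter-node halo exchange, threads for
    ! intra-node (shared-memory) work within each image.
    program image_per_node_plus_threads
      implicit none
      integer, parameter :: nx = 512, ny = 512
      real, allocatable :: u(:,:)[:]
      real, allocatable :: unew(:,:)
      integer :: i, j, it

      allocate(u(nx,ny)[*], unew(nx,ny))
      u = 0.0; unew = 0.0

      do it = 1, 100
         sync all
         ! Inter-node: pull halo columns from neighbouring images.
         if (this_image() > 1)            u(:,1)  = u(:,ny-1)[this_image()-1]
         if (this_image() < num_images()) u(:,ny) = u(:,2)[this_image()+1]
         sync all

         ! Intra-node: threads share this image's subdomain.
         !$omp parallel do private(i)
         do j = 2, ny-1
            do i = 2, nx-1
               unew(i,j) = 0.25*(u(i-1,j) + u(i+1,j) + u(i,j-1) + u(i,j+1))
            end do
         end do
         !$omp end parallel do
         u(2:nx-1, 2:ny-1) = unew(2:nx-1, 2:ny-1)
      end do
    end program image_per_node_plus_threads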

Notes:

  • Both approaches above would make it easier to design algorithms for heterogeneous computing ecosystems (including some nodes having GPUs, some having fewer processors, etc.).
  • Addressing this issue also affects asynchronous/real-time applications. If you can guarantee that your head/master node/team has a shared memory model, you can generally process the data it receives from the other nodes faster for real-time viewing/frontend use.