Fortran coarray shared memory

How do you implement shared memory with Fortran coarrays? As I understand it, every image has its own copy of local variables. Fortran is a partitioned global address space (PGAS) language, which means that one image can easily access the data of other images, but as far as I can tell this is valid only for coarrays. What about local variables?
In the source code of neural-fortran https://github.com/modern-fortran/neural-fortran, in the train_batch subroutine, there is a code snippet

    real(rk), intent(in) :: x(:,:), y(:,:), eta

    indices = tile_indices(im)
    is = indices(1)
    ie = indices(2)

    call db_init(db_batch, self % dims)
    call dw_init(dw_batch, self % dims)

    do concurrent(i = is:ie)
      call self % fwdprop(x(:,i))
      call self % backprop(y(:,i), dw, db)
      do concurrent(n = 1:nm)
        dw_batch(n) % array =  dw_batch(n) % array + dw(n) % array
        db_batch(n) % array =  db_batch(n) % array + db(n) % array
      end do
    end do

Clearly, is and ie are the indices of the data for the present image. But we simply use x(:,i) and y(:,i) to access the data, which are not coarrays by their declarations. Does this mean that every image has a copy of all of x and y? If so, when the number of images is increased, will the memory blow up?

In the paper “A parallel Fortran framework for neural networks and deep learning”, the author says that

Elapsed times on up to 12 parallel images on a shared-memory system are shown in Figure 4.

What is the meaning of a shared-memory system? I don't see how the memory is shared if every image has its own copy of the dataset.

To be clear, does Fortran coarray support shared memory? If yes, how do you write shared-memory code with it?


Welcome @haomiao!

In the paper, by “shared-memory system” I mean a single-node, multi-core computer. It means that the processes (images) physically share RAM, but it doesn’t mean that each image has a copy of every other image’s data.

The importance of mentioning the “shared-memory system” in the paper was to make note that there is no communication through the network.

From the Fortran code point of view, it makes no difference whether you're running on a single-node or a multi-node machine. For example, running a program on 8 images on one 8-core machine vs. two 4-core machines should yield the same results (assuming no race conditions) and the same memory use by each image.

Note that neural-fortran doesn't use coarrays, only the collective subroutines co_sum and co_broadcast. Each image allocates a different section of the array (is:ie). The decomposition is done by the library (in tile_indices), but the communication is done by the compiler or by a library that implements coarrays.
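
As a rough illustration of that pattern (a stand-alone sketch, not neural-fortran code), each image can work on its own block of an index range and then call co_sum so that every image ends up with the combined result; the block decomposition below is a simplified stand-in for what tile_indices does:

    program cosum_demo
        ! Toy example: distribute the sum of 1..100 across images, then
        ! combine the per-image partial sums with the collective co_sum.
        implicit none
        integer :: i, is, ie, me, ni
        real :: partial

        me = this_image()
        ni = num_images()

        ! simple block decomposition of 1..100 (a stand-in for tile_indices)
        is = (me - 1) * 100 / ni + 1
        ie = me * 100 / ni

        partial = 0.0
        do i = is, ie
            partial = partial + real(i)   ! each image sums only its own slice
        end do

        call co_sum(partial)              ! collective reduction across images
        if (me == 1) print *, "global sum = ", partial   ! 5050.0
    end program cosum_demo

No coarray is declared here, yet the images still cooperate through the collective call; you build and run it like any other coarray program (e.g. with caf/cafrun).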

Thanks @milancurcic
For example, when I run the following code

program main
    implicit none
    integer :: i
    integer :: a(4) = [1,2,3,4]
    do i = 1, 4
        ! each image modifies only element 1 of its own copy of a
        if (this_image() == i) a(1) = i*2
    end do
    print *, "process index: ", this_image(), a
end program main

and run cafrun -n 4 ./a.out
The results are

process index: 4 8 2 3 4
process index: 1 2 2 3 4
process index: 2 4 2 3 4
process index: 3 6 2 3 4

clearly there are four copies of a.

Just like neural-fortran declares x(:,:), there is no difference from me declaring a as a local array. In my opinion, every image has its own copy of x(:,:).

You’re right, all images have a local copy of all variables, no matter if they’re declared as coarray or not. The allocatable arrays (non-coarrays) don’t have to be allocated on all images, and they don’t need to have the same extent (lower and upper bounds). Coarrays do need to be allocated with the same lower and upper bounds on all images, and on all images at the same time (allocating a coarray is a blocking operation).
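
Here is a minimal sketch of that difference (the variable names are only for illustration): the ordinary allocatable is allocated, or not, with whatever extent each image chooses, while the allocatable coarray must be allocated collectively with the same bounds everywhere:

    program alloc_rules
        implicit none
        real, allocatable :: local(:)       ! ordinary allocatable: per-image decision
        real, allocatable :: shared(:)[:]   ! allocatable coarray

        ! each image may choose its own extent, or skip the allocation entirely
        if (this_image() > 1) allocate(local(10 * this_image()))

        ! coarray allocation is collective and blocking: every image must execute
        ! this statement with the same bounds before any of them can proceed
        allocate(shared(100)[*])

        print *, "image ", this_image(), " local allocated: ", allocated(local)
    end program alloc_rules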

Back to neural-fortran, I looked at how it’s implemented and it seems like all images have a copy of the full input arrays x and y. Each image only trains on its own portion of the workload, after which the weights and biases are globally updated using co_sum. Of course, this is not optimal for memory use, but allows a very simple high-level API which uses exactly the same code for serial and parallel execution. In other words, the parallel work distribution is managed inside of network_type % train_batch(), and not outside of it.

A more optimal approach memory-wise would be for each image to read only its portion of the data, and call network_type % fwdprop(), network_type % backprop(), and network_type % update() directly. You’d need to take care of data exchange yourself, but it’s simple enough, you can just follow how that’s implemented in network_type % train_batch().
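
For example, here is a rough sketch of that idea (the array sizes and the file-reading step are made up for illustration): each image allocates and fills only its own slice of the dataset, then loops over those samples and exchanges the gradients itself:

    program my_slice
        implicit none
        integer :: is, ie, me, ni, n_total, n_local
        real, allocatable :: x_local(:,:), y_local(:,:)

        n_total = 60000        ! total number of samples (made-up value)
        me = this_image()
        ni = num_images()

        ! block decomposition of the sample indices, as tile_indices does
        is = (me - 1) * n_total / ni + 1
        ie = me * n_total / ni
        n_local = ie - is + 1

        ! only the local portion of the dataset is allocated on this image
        allocate(x_local(784, n_local), y_local(10, n_local))

        ! ... read records is..ie from the data files into x_local, y_local ...
        ! ... call fwdprop/backprop per sample, co_sum the accumulated
        !     gradients, and apply the update on every image ...

        print *, "image ", me, " holds samples ", is, " to ", ie
    end program my_slice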

Each image has its own variables, which is true for both coarrays and non-coarrays. The difference with coarrays is that there is a corresponding coarray on every image that can be accessed from a different image. The method used for addressing coarrays on other images depends on the hardware and OS support.

For Linux-based systems (common in HPC) there is an OS plug-in called XPMEM that allows a process to access memory in a different process in the same shared-memory space. For the case of a single node with many cores, and one image for each core, each image is a separate process, and memory addressing to the remote process can be done through XPMEM. (The addressing capability is available in the hardware; XPMEM mainly enforces memory-space security between processes.) Within a node, memory can be accessed without employing the system's internal network.

There have been hardware designs with true global addressing: the upper bits of the memory address were the image number minus 1, relative to the initial image, and routing tables were built into the unified memory and network hardware. However, these machines are (were) pretty rare because of the cost, but the performance of distributed-memory coarray codes on them was pretty spectacular. Typically there was only one process per node, so all codes with more than one image were distributed memory. Remote accesses did not involve a library call; the compiler just generated inline code to stuff the correct bits into the upper part of the address, and a simple load or store instruction was executed.


@haomiao ,

Welcome to this forum.

Re: “when the number of images is increased, will the memory blow up?” is your concern that each image having its own copy of some “large” dataset will cause an issue?

By the way, note the ALLOCATABLE attribute and how the memory usage can be controlled depending on the needs across images:

   integer, allocatable :: a(:)
   if ( this_image() == 1 ) then
      a = [ 1, 2, 3, 4 ]
   end if
   print *, "On image ", this_image(), "; allocated(a)? ", allocated(a)
   if ( allocated(a) ) print *, "a = ", a 
end

On image 3 ; allocated(a)? F
On image 2 ; allocated(a)? F
On image 4 ; allocated(a)? F
On image 1 ; allocated(a)? T
a = 1 2 3 4

Re: “does Fortran coarray support shared memory?” do you mean the use of some “global” data that can be accessed (shared) the same way on all the images?

@FortranFan
Yes, that's exactly what I meant. How could we use “global” data that can be shared on all images if we don't allocate the variable on each image? As I understand it, OpenMP can do this easily, and MPI can allocate shared memory. But what about coarrays? We could allocate a on image 1, but how would other images get a if we do not allocate a on them?

Would they always reference a[1]?

You can’t allocate a coarray on some of the images and not on others. It has to happen on all images. Or at least on all images on the same team, IIRC.

That’s what I suspected and that’s why I asked to confirm.

Re: “How could we use ‘global’ data that can be shared on all images if we don't allocate the variable on each image,” to the best of my knowledge, the Fortran coarray model does not allow what you seek.

What you are asking sounds more like a threading approach, with multiple threads and a single data model. Fortran coarrays are not that.

That’s more or less what coarrays are. A coarray must have (or be allocated to) the same shape and size on every image (at least within a team), but any image can access any image’s data directly.
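
For example, here is a minimal sketch of the coarray route, reusing the a array from the earlier example: every image declares the coarray (so each has its own copy), only image 1 fills its copy, and any image can then read image 1's data with the [1] image selector:

    program remote_read
        implicit none
        integer :: a(4)[*]    ! one copy per image, addressable from other images

        if (this_image() == 1) a = [1, 2, 3, 4]   ! only image 1 fills its copy
        sync all                                  ! make image 1's writes visible

        ! every image reads image 1's copy through the codimension selector
        print *, "image ", this_image(), " sees a on image 1: ", a(:)[1]
    end program remote_read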

However, with collective subroutines, one can share data between images without using a coarray. For example, co_broadcast can send data from one image to all the others, and co_sum gives you the total from all images.
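
A minimal sketch of that alternative, again with the a array from above: no coarray is declared at all, and co_broadcast copies image 1's values into every image's ordinary local array:

    program broadcast_demo
        implicit none
        integer :: a(4)

        a = 0
        if (this_image() == 1) a = [1, 2, 3, 4]   ! only image 1 has the data

        call co_broadcast(a, source_image=1)      ! now every image holds 1, 2, 3, 4
        print *, "image ", this_image(), " has a = ", a
    end program broadcast_demo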

There are a variety of techniques possible with just the above features, and we haven’t even gotten into teams and event_type yet. But which technique/strategy you use depends heavily on the problem you’re trying to solve. “How do I have global data?” isn’t a specific enough question to give an appropriate answer, and it’s quite possible the answer to your question may be “having global data might not be a good design in this case”.