Using Coarrays and Memory Efficiently

I am making a program and using OpenCoarrays to parallelize it. I would like to do computations on all processors, and then feed those calculations into an array only allocated on image 1.

However, it seems like the transfer of data is not going as planned. I’ve made a small reproducing example.

program cotest                                                                                                                                    
  real, dimension(:,:), codimension[:], allocatable :: arr                                                                                        
  integer :: idx1, idx2, thisstart, thisend                                                                                                       
  if (this_image() .eq. 1) then                                                                                                                   
     allocate(arr(4,2)[*])                                                                                                                        
  else                                                                                                                                            
     allocate(arr(1,1)[*])                                                                                                                        
  end if                                                                                                                                          
                                                                                                                                                  
  thisstart = 2*(this_image() - 1) + 1  ! = 1 for image 1; = 2*1+1 = 3 for image 2                                                                
  thisend = thisstart + 1               ! = 2 for image 1; = 4 for image 2                                                                        
  print *, thisstart, thisend                                                                                                                     
  do idx1=thisstart, thisend                                                                                                                      
     print *, "idx1:", idx1                                                                                                                       
     arr(idx1, :)[1] = idx1 + this_image()                                                                                                          
     !do idx2=1,2                                                                                                                                 
     !   arr(idx1, idx2)[1] = idx1*idx2                                                                                                           
     !end do                                                                                                                                   
  end do                                                                                                                                          
  sync all                                                                                                                                        
  call execute_command_line('')  ! Forces things to print in order; flushes stdout                                                                
  ! sync all not blocking...?                                                                                                                     
  if (this_image() .eq. 1) then                                                                                                                   
     print *, this_image(), "is about to start the print loop."                                                                                   
     do idx1=1, 4                                                                                                                                 
        print *,"for idx1", idx1, "arr is ", arr(idx1, :)[1]                                                                                      
     end do                                                                                                                                       
  end if                                                                                                                                          
                                                                                                                                                  
end program cotest 

Using the same compilation/run parameters, I get the output

           1           2
 idx1:           1
 idx1:           2
           3           4
 idx1:           3
 idx1:           4
           1 is about to start the print loop.
 for idx1           1 arr is    2.00000000       2.00000000    
 for idx1           2 arr is    3.00000000       3.00000000    
 for idx1           3 arr is    5.00000000       0.00000000    
 for idx1           4 arr is    6.00000000       0.00000000

As you can see, the image 1 part works fine, but the assignment for image 2 is not working so well. Also, if I try to assign an array of dim 2 to arr(idx1, : ) then I get a memory crash.

I suspect what is occurring is that the coarray library makes some sort of assumption that if its allocated dimensions on image 2 are (1,1), then the same must be true for the partner on image 1. If so, then what I want to do seems fruitless. EDIT: It appears I am indeed not allowed to allocate different amounts of memory on the different images. The below question is still valid; I now just don’t know how to accomplish it.

Is what I wish to do possible, i.e. only allocate the memory I need for an object on one core instead of allocating the memory on all cores? And then transfer the intermediate values computed on other cores to the central array on image 1? If so, how?

EDIT2: I know if I were to use OMP I could just create an array in shared memory that’s shared by the processors. I think the Intel compilers have a shared memory version of coarrays, but I am not sure if I should change the syntax or what for the above example.

In principle I think the solution is to declare a coarray of a derived type, which contains an allocatable component. See discussion here.

In practice I am not sure how well OpenCoarrays (or other compilers, for that matter) supports this. It would be interesting to know.


I’ve used this myself and it does work with Intel, NAG, and, with one exception, gfortran/OpenCoarrays. With gfortran the component needs to be a pointer (and thus you need to manage its deallocation manually). If you use an allocatable you run into a “double free” memory error when the derived type object goes out of scope. I’ve reported the bug here.

With coarrays you are running multiple images of your program (asynchronously), and each one is running in its own address space. A coarray variable has its own independent storage in each image. Use of the [.] selector allows you to read/write to the version of the variable on another image. How the compiler runtime manages to accomplish the communication between images is a separate issue. It may use shared memory, e.g., MPI with shared memory transport, but that doesn’t alter the semantics of how coarrays work from a programming perspective.
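As a minimal sketch of those semantics (the program and variable names here are made up for illustration), each image below writes only its own copy of a scalar coarray, and image 1 then reads every other image's copy through the square-bracket selector:

```fortran
program coindex_demo
  implicit none
  integer :: val[*]   ! one independent copy of val exists on every image
  integer :: i

  val = 10 * this_image()   ! no brackets: each image writes its own copy
  sync all                  ! wait until every image has finished writing

  if (this_image() == 1) then
     do i = 1, num_images()
        ! val[i] refers to the copy of val stored on image i
        print *, "image", i, "holds", val[i]
     end do
  end if
end program coindex_demo
```

Whether the read of val[i] goes over shared memory or a network is up to the runtime; the source code is the same in either case.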

I find the coarray approach to parallel programming to be pretty much the same as with MPI; it just has a much nicer syntax. OMP is something very different. I’m not a user of OMP at all, but my understanding is that with OMP you are running a single copy of the program within which threads are spawned which can see the single address space (i.e., shared) of the host process. If you want to use coarrays (or MPI) you’ll need to think very differently about your program than you would with OMP.

Indeed. My program would probably be better written with OMP for the task I’m doing, but I’m purposefully getting coarrays involved to get used to them. I think my confusion partially stems from the fact that there seem to be both distributed and shared memory backends for coarrays [on at least Intel], but I understand now.

If you would describe what you wanted your toy example to do (I couldn’t quite decipher it from the code provided) I could probably show you what it would look like using coarrays.

I want a large array allocated on only one image [to save memory], say A( :,:,:,: )[1] (notation meaning I want it only allocated on image 1). I want the different images to compute elements of this array, in intermediate local variables/arrays A_loc(:), and end up assigning these to the large array on image 1, A(idx1, idx2, idx3, :)[1] = A_loc(:).
For simplicity, let’s assume I’ve subdivided the loop over idx1 across multiple images. This is roughly the main part of the program and what I want to accomplish. The below pseudocode is what I was hoping I could do to accomplish this.

do idx1=this_image_start,this_image_end
  do idx2=1,N
    do idx3=1,N
      tmp(:) = [some computation that different images can do independently]
      ! Collect into one array allocated only on image 1 since A is large
      A(idx1, idx2, idx3, :)[1] = tmp(:)   
    end do
  end do
end do

The idea is that since idx1 is non-overlapping between different images, the different images should be able to ‘fill-in’ different parts of A independently.

Could you make tmp a coarray instead (same size on every image), and then have image 1 get the data from each (following a sync all to ensure their calculations are complete)?

Do you mean abandon the large matrix A entirely? If I want A to be a coarray (have the images ‘fill it in’ like I imagine an OMP program would do), then apparently I need to allocate it to be a coarray of the same dimensions across all images, unless your suggestion of a derived-type form works. I personally want to avoid pointers, so if I do it that way I’ll have to switch to the free Intel compilers. Not the end of the world I suppose.
But yeah, allocating A on each image is the memory problem I’m trying to avoid. I have some reasons for wanting to have A of that rank (partially laziness with regards to implementing some symmetry-reduction and inverse mapping magic in my application, partially to avoid saving a lot of files to disk), but something like your suggestion would just translate to reducing the size of my matrix and doing the calculation in a streamed mode. Not the worst idea, but the cluster I use doesn’t like it when I save a lot of files to disk :slight_smile:

I meant to allocate a coarray of size tmp (with size corresponding to the maximum work-array size used on any individual image). This is instead of allocating the coarray A, which I believe would be a factor of num_images() times larger than tmp needs to be? You would still need to allocate a regular array A on image 1, but not on all images.

This is just based on my interpretation of your latest example (quite plausibly incorrect :wink: ).

The more flexible solution is likely to use a derived type coarray with allocatable components, as discussed earlier in this thread.
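A rough sketch of that derived-type idea, assuming a compiler that handles allocatable components of coarrays correctly (as reported earlier in the thread, gfortran/OpenCoarrays may need a pointer component instead). The type name box, the variable buffer, and the 2-rows-per-image sizes are illustrative assumptions, not taken from the original program:

```fortran
program box_demo
  implicit none
  type :: box
     real, allocatable :: a(:,:)   ! component may have a different size on each image
  end type box
  type(box) :: buffer[*]           ! the coarray itself is identical on every image
  integer :: row, first

  ! Only image 1 allocates the large component; the other images leave it unallocated.
  if (this_image() == 1) allocate(buffer%a(2*num_images(), 2))
  sync all   ! the allocation must be complete before any image writes into it

  ! Each image writes its own two rows directly into image 1's component.
  first = 2*(this_image() - 1)
  do row = first + 1, first + 2
     buffer[1]%a(row, :) = real(row + this_image())
  end do
  sync all   ! all remote writes are complete before image 1 reads

  if (this_image() == 1) then
     do row = 1, size(buffer%a, 1)
        print *, "row", row, ":", buffer%a(row, :)
     end do
  end if
end program box_demo
```

The key point is that allocating a component is a local operation, unlike allocating the coarray itself, so the sizes do not have to agree across images.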


You probably want something like this.

integer, allocatable :: tmp(:)[:]
integer, allocatable :: all(:, :)
integer :: i

! the last codimension of an allocatable coarray must be * in allocate
allocate(tmp(individual_sizes)[*])

! each image computes its part and does a plain assignment to tmp like
! tmp(:) = some_computation()

sync all

if (this_image() == 1) then
  allocate(all(your,sizes))
  do i = 1, num_images()
    all(i, :) = tmp(:)[i]
  end do
end if
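
For reference, here is one way the original cotest example might look when rewritten with this pattern. It is only a sketch: the names gather_demo, rows_per_image, ncols and a are assumptions made for illustration, and the stored values simply mirror the idx1 + this_image() assignment from the question.

```fortran
program gather_demo
  implicit none
  integer, parameter :: rows_per_image = 2, ncols = 2   ! assumed sizes
  real, allocatable :: tmp(:,:)[:]   ! small buffer, same shape on every image
  real, allocatable :: a(:,:)        ! full result, allocated on image 1 only
  integer :: img, row, first

  ! Coarray allocation is collective: identical bounds on all images.
  allocate(tmp(rows_per_image, ncols)[*])

  ! Each image fills its own local buffer; no coindexing is needed here.
  first = rows_per_image*(this_image() - 1)
  do row = 1, rows_per_image
     tmp(row, :) = real(first + row + this_image())
  end do

  sync all   ! every buffer is complete before image 1 reads it

  if (this_image() == 1) then
     allocate(a(rows_per_image*num_images(), ncols))
     do img = 1, num_images()
        ! Pull the block computed on image img into its rows of a.
        first = rows_per_image*(img - 1)
        a(first+1:first+rows_per_image, :) = tmp(:, :)[img]
     end do
     do row = 1, size(a, 1)
        print *, "row", row, ":", a(row, :)
     end do
  end if
end program gather_demo
```

The per-image memory cost is only the size of tmp; the large array a exists on image 1 alone.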

Oh, I see! Use a smaller coarray to transfer the memory from image B to image A, but then have the array on image 1 be a regular array allocated just on image 1. I like this approach, because it will work on gfortran/OpenCoarrays too. Thanks! This should work excellently, and I’m surprised I didn’t think of it, but there you go.

Indeed. Thanks for writing it out!
EDIT: It’s working great, and required very minor modification to the code. Thanks again!

I see others have beaten me to it. But as an alternative, @gareth’s original suggestion would have looked something like this:

type :: box
  real, pointer :: A(:,:,:,:) 
end type
type(box), allocatable :: buffer[:]
allocate(buffer[*])
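if (this_image() == 1) buffer%A => A  ! A must have the target (or pointer) attribute
sync all  ! other images wait here until image 1 has associated buffer%A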

do idx1=this_image_start,this_image_end
  do idx2=1,N
    do idx3=1,N
      tmp(:) = [some computation that different images can do independently]
      ! Collect into one array allocated only on image 1 since A is large
      buffer[1]%A(idx1, idx2, idx3, :) = tmp(:)   
    end do
  end do
end do

Observe that the %A pointer component is not associated (or allocated) with anything on any image but image 1.
