Memory locations of structure members get messed in MPI

Hi to all,

I am struggling to write down a code to write down simple structures of arrays in HDF5 format when using parallel version (hence MPI). The key idea in doing this is to create a compund object that needs memory locations of the structure members, in particular their offsets (e.g. in H5Tinsert function). In the serial version of the code, the members are layed regulary in memory and I have no problem. Passing to MPI version, the memory layout gets scrambled and I am not able to give valid memory locations in creating objects. This problem seems to be general because I encountered it using both Intel and GNU compilers in their recent versions on a x86-64 AMD CPU. Do you have any suggest to tackle this problem? Thanks. Below there is a simple version of the code that shows memory locations of the structure members and their offsets, you can compile MPI version by adding -DMPIV flag and preprocessing it (-cpp in GNU, -fpp in Intel).

PROGRAM main

  USE ISO_C_BINDING
#if defined MPIV
  USE MPI
#endif

IMPLICIT NONE

  !----------------------------------------------------------------

  INTEGER :: b,i,j,k,me,nb,np,mpierr
  integer,dimension(:),allocatable :: ni,nj,nk
  type vt
    double precision,dimension(:,:,:),allocatable :: r,p
  end type vt
  type(vt),target,dimension(:),allocatable :: var

!----------------------------------------------------------------

  nb = 4
  me = 0

#if defined MPIV
  call mpi_init(mpierr)
  call mpi_comm_size(mpi_comm_world,np,mpierr)
  call mpi_comm_rank(mpi_comm_world,me,mpierr)
#endif

  if (me==0) then

    allocate (ni(nb))
    allocate (nj(nb))
    allocate (nk(nb))

    allocate (var(nb))

    do b = 1,NB
      ni(b) = b/3+1
      nj(b) = b
      nk(b) = b/5+1
      allocate (var(b)%r(ni(b),nj(b),nk(b)))
      allocate (var(b)%p(ni(b),nj(b),nk(b)))
    END DO
    !
    ! Initialize the data
    !
    do b = 1,nb
      DO k = 1,nk(b)
        DO j = 1,nj(b)
          DO i = 1,ni(b)
            var(b)%r(i,j,k) = 10000*me+1000*b+i*100+j*10+k
            var(b)%p(i,j,k) = -var(b)%r(i,j,k)
          END DO
        END DO
      END DO
    end do

    write (*,*) 'var loc:',LOC(var)
    do i = 1,nb
      write(*,*) 'item:',i
      write (*,*) 'var loc:',LOC(var(i))
      write (*,*) 'var%r loc:',LOC(var(i)%r)
      write (*,*) 'var%p loc:',LOC(var(i)%p)
    end do

    do i = 1,nb
      write(*,*) 'item:',i
      if(i>1) write (*,*) 'offset var i/i-1:',LOC(var(i))-LOC(var(i-1))
      write (*,*) 'offset var%r/var:',LOC(var(i)%r)-LOC(var(i))
      write (*,*) 'offset var%p/var:',LOC(var(i)%p)-LOC(var(i))
      write (*,*) 'offset var%p/var%r:',LOC(var(i)%p)-LOC(var(i)%r)
    end do

    do b = 1,nb
      deallocate (var(b)%r,var(b)%p)
    END DO
    deallocate (var)
    deallocate (ni,nj,nk)

  end if

#if defined MPIV
  call mpi_finalize(mpierr)
#endif

END PROGRAM main

@artu72, welcome to the Forum. While I do not have a useable answer for this problem, I do wonder if you can not solve this by allocating the memory in one single piece of memory and then using pointers to address the various pieces in the way that you want. Just a thought. I have not studied your sample program yet.

As @Arjen pointed out I wouldn’t use LOC to create your HDF5 data structure, by design it’s a non-portable solution. Having said that, you should keep in mind that HDF5 object creation in parallel is synchronous so all MPI ranks need to know about all the other ranks’ data sizes.

You first need to create the same HDF5 object across all ranks and then you can populate it.

As a side note, I would use the iso_c_binding module with HDF5 and c_ptrs

It would also be helpful if you provided the HDF5 part of your code too.

You are adressing an interesting problem, where the var(i)%r can be allocated an address on the heap ( identified by LOC(var(i)%r) ), independent of any sequential structure.
I have looked at this for a single thread only, so I have no idea of the MPI requirements.

In a single thread case, the memory management is difficult to understand, as the memory allocated in “allocate (var(nb))” also reserves space to describe the size and location of all allocatable components.
Then there is the allocation of elements of the structure with : “allocate (var(b)%r(ni(b),nj(b),nk(b)))” which will provide a “pointer” to this allocatable array.

In the single thread version of what you have defined, it requires a complex memory reference structure to be associated with what is reserved in “allocate (var(nb))”.
Could the memory for var(nb) change as allocatable components are added/allocated ?
( My problem with this was I was trying to identify the amount of memory demanded by this data structure as it grew. My attempt at this had “type(vt) :: var” in a module )

As your “var” is local, can your allocatable components go out of scope ?

With your MPI approach, each image will potentially have a different memory mapping structure.
I am not sure how or when the HDF5 library would use this information.

It’s an interesting observation and I wonder how much it is related to MPI. Maybe linking to the MPI just results in a different memory allocation and many other modifications to the code would also reveal that there is no guarantee about the memory layout.

In any case, for the aim to create a compound datatype in HDF5 you don’t need memory locations at all. It suffices to know the size of the of the components. You find an example here: DAMASK/results.f90 at 0e4a8f284d4579e8f1827ae53ca6caf79addac09 · eisenforschung/DAMASK · GitHub. MPI parallel HDF5 had/has some issues with compound datatype, at least when Fortran strings are involved. I don’t remember the exact details, but I think that the code runs in serial in that case.

1 Like

Hi, thank you for the replies that gave me a better understandment of the problem. As you suggested, it is better that I explain in some more detail the situation and my requests. I am developing a MPI Fortran CFD code. The 3D CFD domain is divided in ordered cells along three computational coordinate indices. These cells are grouped in a number, say nb, of blocks. Each block has ni(b),nj(b),nk(b) cells along three coordinates for a total of ni(b)*nj(b)*nk(b) cells. The blocks are scattered between the np processes of a MPI run, in a way to get a balanced load in a run. The physical variables r (density) and p (pressure) are stored in a structure of double precision arrays for each process n in the range (0:np-1)

b=1,nb(n)
var(b)%r(1:ni(b),1:nj(b),1:nk(b))
var(b)%p(1:ni(b),1:nj(b),1:nk(b))
n=0,np

where nb(n) is the number of blocks hosted in process n.

Now, the central point is to save these arrays in hdf5 format in the most efficient way. Of course, the first choice is parallel collective writing. In hdf5 forum, they suggested me to create a compound datatype for each process and then to write in parallel an array of compound datatype scattered between processes.

In order to do this, hdf5 library needs to know the offset locations of the members of the structure. One could provide them by:

  1. their dimensions (but one needs that the data are contiguous in memory)
  2. the offset of their memory locations

and I write down a simple code that I provided in the first post of the topic, where I found that in the MPI run the memory locations of the structure members are scrambled in unpredictable manner.

Now you will find the hdf5 version of the same code below, where a compound datatype is created, populated with the 2*nb arrays in each process, and then written to disk: to my purpose, in a MPI run only the rank 0 will write data since I need to understand the memory layout behavior. The code has an input argument: if 0 or unspecified, it will use offset based on the dimension of arrays, otherwise will use memory locations offsets.

I have tested the code with intel and gnu compilers and the results are the same:

serial version (no -DMPIV preprocessing flag):

  • dimension offset (0) : run went fine and file created, but values in the file are wrong
  • memory location offset (1) : run went fine and file created, and values OK

parallel version (-DMPIV preprocessing flag):

  • dimension offset (0) : run went fine and file created, but values in the file are wrong
  • memory location offset (1) : hdf5 library complains about incorrect memory locations
    major: Datatype
    minor: Unable to insert object
    #001: H5Tcompound.c line 458 in H5T__insert(): member extends past end of compound type
    major: Datatype
    minor: Unable to insert object
    #000: H5Tcompound.c line 370 in H5Tinsert(): unable to insert member

So, in MPI implementation, there are no working alternatives.

module var_mod

  type vt
    double precision,dimension(:,:,:),allocatable :: r,p
  end type vt
  type(vt),target,dimension(:),allocatable :: var

end module var_mod

PROGRAM main

  USE hdf5
  USE ISO_C_BINDING
#ifdef MPIV
  USE MPI
#endif
  USE var_mod

  IMPLICIT NONE

! FILES

  CHARACTER(LEN=*),PARAMETER :: H5FILE_NAME = "mpivars.h5"
  CHARACTER(LEN=*),PARAMETER :: DATASETNAME = "CompoundOfArrays"
  CHARACTER(LEN=10) :: varname,FIRSTARG

  INTEGER,PARAMETER :: NB = 4
  INTEGER,PARAMETER :: RANK = 3

  !----------------------------------------------------------------

  INTEGER :: b,ilen,ioff,ist,i,j,k,me,np,mpierr
  INTEGER(hid_t) :: file,dataset,space
  INTEGER(hsize_t) :: DIM(1) = (/1/)   ! Dataspace dimensions
  INTEGER(SIZE_T)     ::   offset,type_size,type_sized  ! Size of the double precision datatype
  INTEGER :: hdferr
  TYPE(C_PTR) :: f_ptr
  INTEGER(hid_t) :: vwrite     ! File datatype identifier
  INTEGER(hid_t),dimension(nb) :: tidr,tidp ! /* Nested Array Datatype ID        */
  INTEGER(HSIZE_T),DIMENSION(3) :: tdims

  integer,dimension(:),allocatable :: ni,nj,nk

!----------------------------------------------------------------

  me = 0

#ifdef MPIV
  call mpi_init(mpierr)
  call mpi_comm_size(mpi_comm_world,np,mpierr)
  call mpi_comm_rank(mpi_comm_world,me,mpierr)
#endif

  if (me==0) then

    ioff = 0
    if (command_argument_count()/=0) then
      call get_command_argument(1,FIRSTARG,ilen,ist)
      read (FIRSTARG,*) ioff
    end if

    allocate (ni(nb))
    allocate (nj(nb))
    allocate (nk(nb))

    allocate (var(nb))

    do b = 1,NB
      ni(b) = b/3+1
      nj(b) = b
      nk(b) = b/5+1
      allocate (var(b)%r(ni(b),nj(b),nk(b)))
      allocate (var(b)%p(ni(b),nj(b),nk(b)))
    END DO
    !
    ! Initialize FORTRAN interface.
    !
    CALL h5open_f(hdferr)
    !
    ! Initialize the data
    !
    do b = 1,nb
      DO k = 1,nk(b)
        DO j = 1,nj(b)
          DO i = 1,ni(b)
            var(b)%r(i,j,k) = 10000*me+1000*b+i*100+j*10+k
            var(b)%p(i,j,k) = -var(b)%r(i,j,k)
          END DO
        END DO
      END DO
    end do
    !
    ! Create the file.
    !
    CALL H5Fcreate_f(H5FILE_NAME,H5F_ACC_TRUNC_F,file,hdferr)

    CALL h5tget_size_f(H5T_NATIVE_DOUBLE,type_sized,hdferr)

    type_size = sum(ni*nj*nk)*2*type_sized
    if(ioff>0) type_size = H5OFFSETOF(C_LOC(var),C_LOC(var(nb)%p))+H5OFFSETOF(C_LOC(var(nb)%r),C_LOC(var(nb)%p))

    write (*,*) 'var loc:',LOC(var)
    do i = 1,nb
      write (*,*) 'item:',i
      write (*,*) 'var loc:',LOC(var(i))
      write (*,*) 'var%r loc:',LOC(var(i)%r)
      write (*,*) 'var%p loc:',LOC(var(i)%p)
    end do

    do i = 1,nb
      write (*,*) 'item:',i
      if (i>1) write (*,*) 'offset var i/i-1:',LOC(var(i))-LOC(var(i-1))
      write (*,*) 'offset var%r/var:',LOC(var(i)%r)-LOC(var(i))
      write (*,*) 'offset var%p/var:',LOC(var(i)%p)-LOC(var(i))
      write (*,*) 'offset var%p/var%r:',LOC(var(i)%p)-LOC(var(i)%r)
    end do
    !
    ! Create the memory data type.
    !
    CALL H5Tcreate_f(H5T_COMPOUND_F,type_size,vwrite,hdferr)

    do b = 1,nb
      tdims = (/ni(b),nj(b),nk(b)/)
      CALL h5tarray_create_f(H5T_NATIVE_DOUBLE,RANK,tdims,tidr(b),hdferr)
      CALL h5tarray_create_f(H5T_NATIVE_DOUBLE,RANK,tdims,tidp(b),hdferr)
    end do

    offset = 0
    do b = 1,nb
      write (varname,10) b
10    format('var_r_b',I0)
      if(ioff>0) offset = H5OFFSETOF(C_LOC(var),C_LOC(var(b)%r))
      CALL h5tinsert_f(vwrite,varname,offset,tidr(b),hdferr)
      if(ioff==0) offset = offset+type_sized*ni(b)*nj(b)*nk(b)
      write (varname,11) b
11    format('var_p_b',I0)
      if(ioff>0) offset = H5OFFSETOF(C_LOC(var),C_LOC(var(b)%p))
      CALL h5tinsert_f(vwrite,varname,offset,tidp(b),hdferr)
      if(ioff==0) offset = offset+type_sized*ni(b)*nj(b)*nk(b)
    end do
    !
    ! Create the data space.
    !
    !
    CALL H5Screate_simple_f(1,dim,space,hdferr)
    !
    ! Create the dataset.
    !
    CALL H5Dcreate_f(file,DATASETNAME,vwrite,space,dataset,hdferr)
    !
    ! Write data to the dataset
    !
    f_ptr = C_LOC(var)
    CALL H5Dwrite_f(dataset,vwrite,f_ptr,hdferr)
    !
    ! Release resources
    !
    CALL H5Tclose_f(vwrite,hdferr)
    CALL H5Sclose_f(space,hdferr)
    CALL H5Dclose_f(dataset,hdferr)
    CALL H5Fclose_f(file,hdferr)

    do b = 1,nb
      deallocate (var(b)%r,var(b)%p)
    END DO
    deallocate (var)
    deallocate (ni,nj,nk)

  end if

#ifdef MPIV
  call mpi_finalize(mpierr)
#endif

END PROGRAM main

I can’t suggest for MPI, but for a parallel approach using OpenMP, where each thread performs output to the same shared file, I have an IO_routine that performs the the file related tasks.
Essentially, I place this routine call in a !$OMP CRITICAL region. Each call to the routine opens the file, positions to the required position, perform I/O then close the file. This operation is called once for each of the !$OMP DO loop steps (about 50 steps, more than threads) in my case, but multiple calls should be possible.
It is a conservative approach and works reliably.
Does MPI have the equivalence of OMP CRITICAL or can shared access be guaranteed by the HDF5 library?

Also, would it help to have the following ?

  type vt
    integer :: hdf5_offset_location
    integer :: ni,nj,nk
    double precision,dimension(:,:,:),allocatable :: r,p
  end type vt

Hence, “var(b)%hdf5_offset_location” could describe each block, assuming it is possible to allocate each a unique value over the range of MPI processes.

If I’m writing a CFD code, particularly one that is using block structured grids, I would probably use CGNS with HDF5 support enabled instead of HDF5 directly. CGNS was designed explicitly for CFD code data. See

https://cgns.github.io/hdf5.html

Also, if I remember correctly, the only way to ensure that data in a derived type is contiguous in memory is to add a SEQUENCE statement as the first statement in the type,
ie
type vt
SEQUENCE
etc

Is this still true. I know on some compilers way back when you could pass a Fortran derived type directly to a C structure but only if the derived type used sequence association.

For an (old) example of how to use CGNS with a block structured code, see the NASA CFL3D code. I don’t know if it uses parallel I/O though.

Edit. FYI, CFL3D is considered the “gold standard” for block structured codes. Most results published for other CFD codes will compare their results with CFL3D

I am not sure your approach makes a lot of sense for HDF5 in parallel. Having a single MPI rank do the writes is in general a bad idea, in my opinion at least, since you will inevitably run into situations where your variables overflow or run out of memory or spend a lot of time doing IO.

Also, exactly because you rely on memory locations that your rank does not own I suspect you will run into many difficulties. What I usually is just have all the ranks output the data they own. The code ends up being a lot simpler.

Remember that HDF5 is does collective reads/writes in parallel so all processes would run the same code. Quoting the HDF5 website

Parallel HDF5 opens a parallel file with a communicator. It returns a file handle to be used for future access to the file.
All processes are required to participate in the collective Parallel HDF5 API. Different files can be opened using different communicators.

That usually looks something like this

  1. Call MPI_AllGather to get the total size of your data that you need to write
  2. Open your file across all process, and create the same structures e.g. “/pressure, /velocity, …”
  3. Write your data. This is very case specific depending on your Datatype and/or Dataset, if you are using Groups, etc. There is an extensive list of examples for almost every case.
  4. Close your file across all processes.

It would be helpful if you could provide a mockup of the structure your HDF5 should have. For example, this is from one my nuclear codes (unstructured, adapted meshes on tensor-product spaces)

 "/"
  |
  |-- "Gset1"
  |    |
  |    |-- "rank1"
  |    |    |
  |    |    |-- "angle sizes"
  |    |    |-- "angle numbers"
  |    |    |-- "external fields"
  |    |
  |    |-- "rank2"
  |    |    |
  |    |    |-- "angle sizes"
  |    |    |-- "angle numbers"
  |    |    |-- "external fields"
  |    .
  |    .
  |    .
  |    |-- "rankN"
  |         |
  |         |-- "angle sizes"
  |         |-- "angle numbers"
  |         |-- "external fields"
  |
  |-- "Gset2"
  |    |
  |    ...
  .
  .
  .
  |
  |-- "halos"
       |
       |-- "rank1"
       |-- "rank2"
       .
       .
       .
       |-- "rankN"

Again, many thanks to you all for constant support.

A central point that I have missed in previous post is the following: we are looking for the best method available to write out data on parallel file system, i.e. focusing on the performances. In particular, we are studying these alternatives (requiring parallel collective writing):

  • pure MPI I/O: working in progress
  • hdf5: I need some help…
  • pnetcdf: completed and tested

and then compare the performances to select out the best option. We have no limitations on the library design and/or file/data structure. CGNS has been phased out because of the motivations that I will give in the following.

I now would like to give more detail about specific questions:

  • Campbell: I do not know if a MPI equivalent of OMP_CRITICAL is available, but I will look up.

  • rwmsu : We have considered CGNS, of course. But the performances are bad, and parallel collective writing is available only for a global array sliced over mpi processes (designed in particular for unstructured grids). The cngs output is already available in our code as a visualization file.

  • gnikit: I agree about your proposed solution. Doing a initial allgather is not a big concern, since it is necessary once after allocations of structutes in the initial phase of the run. Regarding the structure of hdf5 file, as I said before, there are no costraints, since our goal is pure performances and the files will be stored only for internal use, e.g. restarting the runs.

Trying to optimize performance is a good exercise but in my experience unless you are running on a large HPC system with a dedicated parallel file system such as LUSTRE you are going to be disappointed by any of the options you are investigating.

I can tell you right now, with a relatively high degree of confidence, that pure MPI will be the fastest. The difference in timings between pure MPI/HDF5/PNetCDF might not necessarily be substantial for a real life application like a CFD code, assuming you have chosen reasonable settings for each implementation and your implementation itself is scalable, but there is a non trivial, measurable overhead whenever you introduce another library.

Also, I think the performance comparison between HDF5 and pure MPI has already been done and you should be able to find some timings. As expected HDF5 is slower but not in a meaningful way IMO. A few minutes difference in a simulations that last hours is pretty trivial.

This is an 2020 presentation from the HDF5 group https://www.hdfgroup.org/wp-content/uploads/2020/06/2020-06-26-Parallel-HDF5-Performance-Tuning.pdf and I am sure that there is a tonne of research papers out there testing similar things.

Anyway, circling back to the HDF implementation; for simple data output, like structured arrays, you can use the HDF5 Lite API.

As for the HDF5 file structure, what I meant was that an HDF5 file is basically like a file system where you create “directories” and place data in them. Before you start coding the HDF5 solution you need to decide on that file “layout”. The layout (assuming again reasonable choices) does not impact performance. It does however impact how your code looks.

2 Likes

As I pointed out before, we work on a cluster with a parallel file system (yes, LUSTRE).

I’d like also to add that MPI has support for custom datatypes (useful for both C structs and Fortran derived types). Bill Gropp once taught an extensive course at UIUC that covers MPI quite deeply:

https://wgropp.cs.illinois.edu/courses/cs598-s16/

For creating custom MPI data types, see lecture 27a, “MPI Datatypes”, under “More on Efficient Halo Exchange”. He also covers MPI-IO in the same course, on lectures 32-33.

1 Like