Jagged array pain

I have a data structure like this:

type data_t
   real, allocatable :: a2d(:,:) 
end type

I am typically allocating an array of size 10000 or more of this type

type(data_t), allocatable :: d(:) 
allocate( d(10000) )

And the shape of each component a2d(:,:) is typically 400x5000 (varying from element to element of d, otherwise I would have used a global 3D array). It can end up with 100 GB or more, and I may need several instances of it in a conjugate gradient scheme.

What I am storing in each column of a2d(:,:) is however not always the same size, so I thought about defining a jagged array type in order to allocate only what is needed:

type jagged_t
   real, allocatable :: a1d(:)
end type

type data_t
   type( jagged_t), allocatable :: aj(:)
end type

In the end, I can allocate each a1d(:) array to the exact size I need, without wasting space…

Really?

Well, not exactly. With the ifort compiler, I can see that the storage size of a type(jagged_t) is 576 bits (72 bytes). That is the equivalent of 18 default real elements. So, for each column I have a memory overhead of 18 samples. This is not extremely bad: with this method I can save, say, 100 samples per column on average, so this is still interesting, and I am just paying about 20% of what I am saving as a tax to the compiler.
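For reference, this per-element overhead can be measured with the storage_size intrinsic, which reports a size in bits; a minimal sketch (the printed figure is compiler- and platform-dependent, 576 is just what ifort reportedly gives):

```fortran
program descriptor_size
   implicit none

   type jagged_t
      real, allocatable :: a1d(:)
   end type

   type(jagged_t) :: jt

   ! storage_size() returns the size in bits of one object of this type,
   ! i.e. essentially the array descriptor plus padding; the value is
   ! compiler-dependent (72 bytes = 18 default reals with ifort).
   print *, storage_size(jt), "bits per jagged_t element"
end program descriptor_size
```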

Nonetheless I wanted to mitigate these 20% by defining 2D subgroups instead of individual 1D arrays:

integer, parameter :: NCOL = 10

type subgroup_t
   real, allocatable :: asub2d(:,:)   ! always allocated with NCOL=10 columns
end type

type data_t
   type( subgroup_t ), allocatable :: as(:)
end type

If originally I had an a2d(*,5000) array, I now have an as(500) array, each element containing an asub2d(*,10) array. The number of rows of each asub2d is still potentially smaller than the number of rows of a2d (the number of rows being the max of what is needed across the columns), and the 576-bit overhead is now paid once per 10 columns instead of per single column.
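The bookkeeping to locate an original column j in this layout is just integer arithmetic; a small sketch, assuming 1-based column numbering and the NCOL=10 grouping above:

```fortran
program subgroup_index
   implicit none
   integer, parameter :: NCOL = 10
   integer :: j, g, c

   j = 37                      ! some original column number (1-based)
   g = (j - 1) / NCOL + 1      ! which subgroup element holds it: as(g)
   c = mod(j - 1, NCOL) + 1    ! which column of asub2d(:,:) it is: column c
   ! the original column j is then d(i)%as(g)%asub2d(:,c)
   print *, "column", j, "-> as(", g, ")%asub2d(:,", c, ")"   ! g=4, c=7
end program subgroup_index
```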

So, are we good now?

We are not… To my surprise, when I monitored the memory usage of the process while running, it was 15-20% above what I expected! And I found the explanation: the as(i)%asub2d(:,:) components are not allocated contiguously in memory; there are significant gaps in between. By checking a few addresses, I could see that the gaps were about 1500 bytes on average, for an average asub2d(:,:) size of 3000 elements, hence 12 000 bytes, which is about 12% wasted memory. I don’t get why it happens…
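For what it's worth, the gaps can be probed by printing the start addresses of consecutive allocations via c_loc; a sketch (it assumes 4-byte reals and that the allocator hands out ascending addresses, neither of which is guaranteed):

```fortran
program alloc_gaps
   use iso_c_binding, only: c_loc, c_intptr_t
   implicit none

   type subgroup_t
      real, allocatable :: asub2d(:,:)
   end type

   type(subgroup_t), target :: as(3)
   integer(c_intptr_t) :: addr(3)
   integer :: i

   do i = 1, 3
      allocate( as(i)%asub2d(300, 10) )   ! 3000 reals = ~12000 bytes each
      addr(i) = transfer(c_loc(as(i)%asub2d(1,1)), addr(i))
   end do
   ! gap = distance between consecutive starts minus the payload size
   ! (meaningful only if the blocks happen to be laid out in order)
   do i = 2, 3
      print *, "gap before block", i, ":", addr(i) - addr(i-1) - 300*10*4, "bytes"
   end do
end program alloc_gaps
```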

All of this is disappointing… It would be easier with C pointers, which have only a 64-bit overhead, but there could still be some gaps between all the individual columns. So, I am now thinking about using good old flat 1D arrays, with indexes to access the original columns:

type data_t
   real, allocatable :: a1d(:) 
   integer, allocatable :: idx(:)   ! idx(j)=index in 1d array of the original j-th column
end type
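A sketch of how such a flat layout could be filled in, using one possible convention where idx is built as a running sum of the (made-up) per-column lengths, with one extra entry so that column j spans idx(j) to idx(j+1)-1:

```fortran
program flat_jagged
   implicit none
   integer, parameter :: ncol = 4
   integer :: lens(ncol), idx(ncol+1), j
   real, allocatable :: a1d(:)

   lens = [3, 5, 2, 4]          ! hypothetical per-column sizes
   idx(1) = 1                   ! idx(j) = start of column j in a1d
   do j = 1, ncol
      idx(j+1) = idx(j) + lens(j)
   end do
   allocate( a1d(idx(ncol+1) - 1) )   ! one allocation, no per-column descriptors

   ! column j is then a1d(idx(j) : idx(j+1)-1)
   print *, "column 3 spans", idx(3), "to", idx(4)-1   ! 9 to 10
end program flat_jagged
```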

I don’t think fortran can match this. The C model is that arrays are equivalent to pointer addressing (64-bit addresses on modern hardware), while the fortran model is that they also carry rank and bounds information encoded in metadata. That gives fortran many advantages as a high-level language, such as the ability to check bounds during execution and to ensure that the extents match when doing vector and matrix algebra, but it comes at the cost of storing that metadata. Both allocatables and fortran pointers have this extra metadata, so I don’t think there is an easy workaround strictly within the fortran language to avoid that overhead.

This is likely because the low-level allocations are looking for page boundaries, or multiple-of-16 boundaries, or multiple-of-256 boundaries, and such. These restrictions on the low-level addresses make it easier to use vector hardware, gpus, and so on. This is memory- and cpu-dependent, so it may change from machine to machine, or even with compiler options, but the fortran programmer does not have direct control over these low-level details. This is also why, for example, some compilers store 80-bit floating point numbers in 128 bits, apparently wasting 48 bits of padding for every array element.


An alternative might be to allocate a big chunk and have the jagged type point to smaller pieces of that large chunk of memory. It will take a bit more work, but that way you do avoid wasting memory.
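A sketch of that idea, with hypothetical sizes: pointer components alias slices of one contiguous pool, so the data itself has no gaps (each pointer still carries its own descriptor overhead, though):

```fortran
program pool_jagged
   implicit none

   type jagged_t
      real, pointer :: a1d(:) => null()   ! a view into the pool, no allocation of its own
   end type

   integer, parameter :: ncol = 3
   integer :: lens(ncol), pos, j
   real, allocatable, target :: pool(:)
   type(jagged_t) :: col(ncol)

   lens = [4, 2, 5]                       ! hypothetical column sizes
   allocate( pool(sum(lens)) )            ! one contiguous allocation
   pos = 1
   do j = 1, ncol
      col(j)%a1d => pool(pos : pos+lens(j)-1)
      pos = pos + lens(j)
   end do
   print *, size(col(2)%a1d)              ! 2
end program pool_jagged
```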


If you know the sizes of each 2d array component at the start of your program, you might consider a sparse matrix like storage scheme. This would allow you to store everything in one very large flat 1D array and then use CSR (or is it CRS I can never remember) like indexing and pointer rank remapping (assuming everything is contiguous) to extract the 2D arrays you need locally. The advantage of this is you could then use something like PSBLAS to do your CG in parallel.
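Pointer rank remapping (F2008) makes the local 2D extraction convenient; a sketch with made-up sizes, viewing a slice of a flat array as a matrix without any copy:

```fortran
program rank_remap
   use iso_fortran_env, only: real32
   implicit none
   real(real32), allocatable, target :: flat(:)
   real(real32), pointer :: a2d(:,:)
   integer :: m, n, k

   allocate( flat(100) )
   flat = [(real(k, real32), k = 1, 100)]

   ! pointer rank remapping: view elements 21..32 of the flat array
   ! as a 3x4 matrix (column-major, no copy)
   m = 3; n = 4
   a2d(1:m, 1:n) => flat(21 : 21 + m*n - 1)
   print *, a2d(2, 3)    ! flat(21 + (2-1) + (3-1)*3) = flat(28) = 28.0
end program rank_remap
```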


This is more or less what I’m going to do, with a flat 1D array and indexes of the original columns. I won’t use an existing format such as CSR, though, as the structure of my data can be described with just two indexes per original column.

That’s why I said CSR “like” :smile: You might also want to look at the Morfeus Framework from our friends at the Sourcery Institute. Morfeus is a reworking to modern Fortran OOP of Stefano Toninel’s NEMO Finite Volume CFD code. Toninel’s code is described in his dissertation here, which uses a Fortran 90/95 object-based approach to build an OO CFD code. I like to point out this reference as an example of how you can do most of what is useful from OO using just what was in Fortran prior to F2003. The F2003 additions make some things easier, but they are (as Toninel’s dissertation points out) not necessary to get a flexible code based on OO design principles (and it is the OO design that is, in my opinion, the most useful part of OO, more so than how you actually program OO). Morfeus and NEMO both use PSBLAS as the basis for the underlying solvers.

It depends on the underlying malloc implementation. Typically, malloc implementations have certain heuristics; probably it tries to meet some alignment constraints, as @RonShepard suggests.

Technically, the gaps aren’t wasted, but they should be free for smaller allocations.

Concerning your jagged array: with the integer indexes idx it is practical to have one extra entry. So for n columns you’d have n+1 indexes. This way you can use them as lower and upper bounds, in the sense:

Type(data_t), target :: a
Integer, pointer :: lb(:), ub(:)

Allocate(a%a1d(nsz), a%idx(ncol+1))

lb => a%idx(1:ncol)
ub => a%idx(2:ncol+1)  ! exclusive

In practice you can also skip the pointers and just use a%a1d(a%idx(i):a%idx(i+1)-1) to reference column i.
