Incredibly slow allocation with jagged arrays

I found that allocation of an array of derived types can be between ~1,000 and ~10,000 times slower than the same with fixed sizes. With the following derived types:

module my_types
   type :: fixed
      integer :: a(512)
   end type fixed

   type :: dyn
      integer, allocatable :: a(:)
   end type dyn  
end module my_types

one array with ~1M-~10M elements can take up to O(1) seconds only to be allocated!

 fixed alloc time =   1.7999671E-05
 dynam alloc time =   5.0477996E-02  seconds.

Are there any ways that this could be mitigated somehow? (Besides, of course, reducing the number of allocations with temporary storage) I believe that if no initializers are present, then all the compiler should do is to flag the allocatable as .not.allocated i.e. point to no memory.

I think that’s the point: the compiler has to initialize the %a(:) component to “non allocated” for each element of the array, which means writing to the allocated memory for the array. Whereas for the fixed version the compiler just has to reserve virtual memory without initialiazing anything (hence no actual write), and just reserving virtual memory is quite fast.

2 Likes

I believe yours is exactly the point: doublechecking with C++ I see that performance for a std::vector initializer is similar to that of the fortran derived type array.

1 Like

Just curious, is there a reason a compiler can’t do this at compile or link time. Does it have to wait until run time.

There is no component that can be initialized until the space has been reserved for the object that contains it. In this case that doesn’t happen until the allocation at runtime.

The array of derived type is itself an allocatable, so it can happen only at runtime:

   type(dyn), allocatable :: d(:)
   allocate(d(1000000))

It would be diffferent if it was a static array:

   type(dyn) :: d(1000000)

Yes it must deal with the initialization. If I replace the Fortran 1D array with a simple representation of it:

   type :: arr1d
      integer(c_intptr_t) :: data = 0
      integer(c_size_t) :: ub = 0
      integer(c_size_t) :: lb = 1
   end type 

I get faster, but ~same order of magnitude, performance:

 arrDT alloc time =   2.6528999E-02
 dynam alloc time =   7.2440997E-02

The faster initialization is directly linked to the lesser amount of memory that has to be “first touched”:

   print*, "arr1d storage_size", storage_size(f)
   print*, "dyna  storage_size", storage_size(d)

 arr1d storage_size         192
 dyna  storage_size         576
1 Like

Interesting @PierU : so ifx uses 9 data points to describe a 1D array, while gfortran uses 8 (512 bits). I wonder (besides rank?) what else do they store, go figure.
Anyways if I make my DT the same size:

   type :: arr1d
      integer(c_intptr_t) :: data = 0
      integer(c_size_t) :: ub = 0
      integer(c_size_t) :: lb = 1
      integer(c_size_t) :: rank = 1
      integer(c_size_t) :: five = 1
      integer(c_size_t) :: six = 1
      integer(c_size_t) :: seven = 1
      integer(c_size_t) :: eight = 1
   end type 

then the allocation time becomes basically identical.

1 Like

@FedericoPerini ,

Chances here are quite high your findings here are specific to Intel Fortran and its design of derived type structure and their implementation in their processor that might yield its users “operating” (OPEX) benefits elsewhere (say, possibly, vectorization on Intel CPUs) but at certain (CAPEX) cost e.g., allocation and setup of the structure data.

My hunch is your analysis as to the “alloc time” will be inconclusive or a nothing-burger with other processors - I haven’t checked, but try gfortran and NAG Fortran if you can

And I doubt this has anything to Fortran language per se and its semantics, say in the ISO standard.

So you may want to also follow-up with Intel team at their forum, if you remain concerned:
Intel® Fortran Compiler - Intel Community

Gfortran’s array descriptor is defined here gcc/libgfortran/libgfortran.h at 52370c839edd04df86d3ff2b71fcdca0c7376a7f · gcc-mirror/gcc · GitHub

The descriptor includes things like the type, strides, length of an element, version, as well as the data and the upper/lower bounds.

1 Like

you may want to check with the Compiler Explorer link on the top post - performance with other compilers seems in line with that of ifx.

So… Why create types with arrays in them? If you have no idea how long the array will be then you might as well create a linked list. Yes; the performance of finding node 1000 in a linked list is horrible compared to an array, but you can read, write, insert, and delete from a linked list. And (de)allocating nodes for a linked list returns very quickly. Just make sure to hold the root of the list and make sure you set all unused pointers to null().

1 Like

Allocatable arrays can be worked with like a vector, growing and shrinking as needed. More than likely though, a maximum size is only needed once, and after that you get all the benefits of normal arrays like fast looping, vectorization being possible, etc. I think linked lists in general are much less in fashion compared to vectors and hash maps.

1 Like

As long as I use default initialisation (allocatable, null pointer or null values) I see comparable performance for gfortran, ifort as well as ifx, no matter whether I use fixed size, allocatable or pointer arrays. So it all comes down to how fast a memory copy of a (default initialised) template can be performed I guess.

I once worried about performance penalty of default initialisation. However, if you really do something non-trivial with the values in the arrays, then the performed operations (including read from and write to memory) takes much more time then the default initialisation.

The only exception I remember was a very performance sensible simple and small hash map with fixed size (from which only a half or less was used as is usual with hash maps). Any size slightly too large became visible due to the necessary initialisation of the memory, which was mostly not used.

The reason for my use case (unstructured CFD with AMR) is exactly what you’re pointing out @Knarfnarf: storing adjacencies for some of the FV quantities is better done with a (overprovisioned) sorted array rather than a list, because search is faster (I’ve tried basically all data structures). Anyways, it seems like even an array of linked lists would use more or less the same time, I guess due to the same initialization overhead:

:100:

Now, if one has a structure

type :: edge
   ! Cell adjacency
   type(sorted_array) :: cells
end type edge

we may argue that fixed 2D array storage would be faster (it is), but that comes at a tremendous cost in terms of code maintainance (all functions that operate on one edge should take this as an additional input, which also means having access to the whole cpu’s mesh topology):

integer(FVI), allocatable :: cells_2_edges(:,:)

This is also what I see. It’s surprising to me but at the end of the day, if I have a structure array with 1M elements, each of them being 512bits, and my RAM writes at 10GB/s, the initialization time will be around

1e6 * (512/8) * 1.0e-10 ~ 0.0064s ~ O(0.01s)

Does that include tree structures such as Javier Bonet’s Alternating Direction Trees (ADT) or KDtrees. These are used in just about every unstructured CFD code I’ve seen. That being said, I use my own vector (dynamic array) class and jagged arrays (to support a variety of cell types (tets, hexes, prisms etc) to store connectivity data. I also wrote my own package of utitlities that mimic MATLABS set functions to find common nodes etc. I agree that a list is not the best choice. However, tree structures are suppose to yield the fastest search (or so I’ve been told).

Fortran implementations of ADT and KDtrees can be found in the NASA cfdtools repository.

1 Like

My base mesh strucutre is an MPI “forest” of unstructured trees. Conventional cell types (tet/hex/prism/pyramid) are handled with fixed-size arrays (8 nodes, 6 faces, 12 edges), and I’m currently working to extend it to general polyhedral cells. I believe I’ve been using any possible data structure - 3D geometry is the least forgiving: everything’s got to be top notch or cpu/gpu time explodes easily.

If you’re interested, I shared some of the details at the seminar I gave at UW-Madison a couple weeks ago here: that shows some the automated meshing capability I’m building (conformal, adaptive, unstructured mesh + cut-cell boundaries).