I found that allocating an array of derived types with an allocatable component can be between ~1,000 and ~10,000 times slower than allocating the same array with a fixed-size component. With the following derived types:
module my_types
   type :: fixed
      integer :: a(512)
   end type fixed
   type :: dyn
      integer, allocatable :: a(:)
   end type dyn
end module my_types
an array with ~1M to ~10M elements can take on the order of a second just to be allocated!
fixed alloc time = 1.7999671E-05
dynam alloc time = 5.0477996E-02 seconds.
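For reference, here is a minimal sketch of a timing comparison along these lines (the element count and the cpu_time-based timer are illustrative choices, not the exact benchmark):

program bench_alloc
   use my_types, only: fixed, dyn
   implicit none
   integer, parameter :: n = 1000000        ! ~1M derived-type elements
   type(fixed), allocatable :: f(:)
   type(dyn),   allocatable :: d(:)
   real :: t0, t1

   call cpu_time(t0)
   allocate (f(n))       ! fixed-size component: no per-element initialization needed
   call cpu_time(t1)
   print *, 'fixed alloc time =', t1 - t0

   call cpu_time(t0)
   allocate (d(n))       ! each d(i)%a must come up as .not. allocated
   call cpu_time(t1)
   print *, 'dynam alloc time =', t1 - t0, 'seconds.'

   deallocate (f, d)
end program bench_alloc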
Are there any ways this could be mitigated (besides, of course, reducing the number of allocations by using temporary storage)? I believe that if no initializers are present, all the compiler should have to do is flag each allocatable component as .not. allocated, i.e. pointing to no memory.
I think that’s the point: the compiler has to initialize the %a(:) component to “not allocated” for each element of the array, which means writing to the memory just allocated for the array. For the fixed version, on the other hand, the compiler only has to reserve virtual memory without initializing anything (hence no actual write), and merely reserving virtual memory is quite fast.
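One way to test that hypothesis (a sketch I have not benchmarked: a hypothetical variant added to the my_types module above) is to give the fixed type a default initializer, which forces a write to every element at allocation time; the fixed case should then lose most of its advantage:

type :: fixed_init
   integer :: a(512) = 0   ! default initializer: every element must be written on allocation
end type fixed_init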
I believe that is exactly the point: double-checking with C++, I see that the performance of initializing a std::vector is similar to that of the Fortran derived-type array.
There is no component that can be initialized until the space has been reserved for the object that contains it. In this case that doesn’t happen until the allocation at runtime.
Interesting, @PierU: so ifx uses 9 data fields to describe a 1D array, while gfortran uses 8 (512 bits). I wonder what else they store (besides the rank?), go figure.
Anyway, if I make my DT the same size:
! needs: use, intrinsic :: iso_c_binding, only: c_intptr_t, c_size_t
type :: arr1d
   integer(c_intptr_t) :: data = 0   ! base address
   integer(c_size_t)   :: ub = 0     ! upper bound
   integer(c_size_t)   :: lb = 1     ! lower bound
   integer(c_size_t)   :: rank = 1
   integer(c_size_t)   :: five = 1   ! remaining fields just pad the type to the same size
   integer(c_size_t)   :: six = 1
   integer(c_size_t)   :: seven = 1
   integer(c_size_t)   :: eight = 1
end type arr1d
then the allocation time becomes basically identical.
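As a quick check of those per-element sizes, something like this can be used (a sketch; the reported numbers are compiler-specific):

program element_sizes
   use my_types, only: dyn
   implicit none
   type(dyn) :: d
   ! storage_size reports bits per element: per the numbers above, gfortran
   ! gives 512 bits (8 x 64-bit fields) for the lone allocatable rank-1
   ! component, while ifx reports more (9 fields).
   print *, 'bits per dyn element:', storage_size(d)
end program element_sizes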
Chances are quite high that your findings are specific to Intel Fortran and its design of derived-type structures: an implementation choice that might yield its users “operating” (OPEX) benefits elsewhere (say, possibly, vectorization on Intel CPUs), but at a certain (CAPEX) cost, e.g. allocation and setup of the structure data.
My hunch is that your analysis of the “alloc time” will be inconclusive, or a nothing-burger, with other processors. I haven’t checked, but try gfortran and NAG Fortran if you can.
And I doubt this has anything to do with the Fortran language per se and its semantics, say in the ISO standard.
So… why create types with arrays in them? If you have no idea how long the array will be, then you might as well create a linked list. Yes, the performance of finding node 1000 in a linked list is horrible compared to an array, but you can read, write, insert into, and delete from a linked list, and (de)allocating nodes for a linked list returns very quickly. Just make sure to hold on to the root of the list and to set all unused pointers to null().
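For example, a minimal list sketch along those lines (module and names are illustrative):

module int_list
   implicit none
   type :: node
      integer :: payload = 0
      type(node), pointer :: next => null()   ! unused pointers start out as null()
   end type node
contains
   ! Push a value at the front; the caller keeps hold of the root.
   subroutine push_front(root, val)
      type(node), pointer, intent(inout) :: root
      integer, intent(in) :: val
      type(node), pointer :: new_node
      allocate (new_node)                     ! allocating a single node returns quickly
      new_node%payload = val
      new_node%next => root
      root => new_node
   end subroutine push_front
end module int_list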
Allocatable arrays can be worked with like a vector, growing and shrinking as needed. More likely than not, though, a maximum size is needed only once, and after that you get all the benefits of normal arrays: fast looping, the possibility of vectorization, and so on. I think linked lists in general are much less in fashion these days compared to vectors and hash maps.
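For instance, a common grow-on-demand idiom with move_alloc (a sketch; names are illustrative):

module int_vector
   implicit none
contains
   ! Append val to vec, doubling the capacity whenever vec is full.
   ! nused tracks the number of elements in use and must start at 0.
   subroutine push_back(vec, nused, val)
      integer, allocatable, intent(inout) :: vec(:)
      integer, intent(inout) :: nused
      integer, intent(in)    :: val
      integer, allocatable :: tmp(:)
      if (.not. allocated(vec)) allocate (vec(8))
      if (nused == size(vec)) then
         allocate (tmp(2*size(vec)))
         tmp(1:nused) = vec(1:nused)
         call move_alloc(tmp, vec)   ! vec takes over tmp's allocation; tmp ends up deallocated
      end if
      nused = nused + 1
      vec(nused) = val
   end subroutine push_back
end module int_vector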
As long as I use default initialisation (allocatable, null pointer or null values) I see comparable performance for gfortran, ifort as well as ifx, no matter whether I use fixed size, allocatable or pointer arrays. So it all comes down to how fast a memory copy of a (default initialised) template can be performed I guess.
I once worried about the performance penalty of default initialisation. However, if you really do something non-trivial with the values in the arrays, then the operations performed (including reads from and writes to memory) take much more time than the default initialisation does.
The only exception I remember was a very performance-sensitive, simple and small fixed-size hash map (of which only half or less was used, as is usual with hash maps). Any size slightly too large became noticeable due to the necessary initialisation of memory that was mostly never used.
My use case (unstructured CFD with AMR) is exactly what you’re pointing out, @Knarfnarf: storing adjacencies for some of the FV quantities is better done with an (overprovisioned) sorted array rather than a list, because search is faster (I’ve tried basically every data structure). Anyway, it seems that even an array of linked lists would take more or less the same time, I guess due to the same initialization overhead:
type :: edge
   ! Cell adjacency
   type(sorted_array) :: cells
end type edge
We may argue that fixed 2D array storage would be faster (it is), but that comes at a tremendous cost in terms of code maintenance: every function that operates on a single edge would have to take that storage as an additional input, which also means having access to the whole CPU’s mesh topology.
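To illustrate the maintenance argument, a sketch with illustrative names (sorted_array stands in for the adjacency container above):

module edge_kernels
   implicit none
   type :: sorted_array
      integer, allocatable :: items(:)
   end type sorted_array
   type :: edge
      ! Cell adjacency travels with the edge
      type(sorted_array) :: cells
   end type edge
contains
   ! Derived-type storage: a kernel only needs the edge it works on.
   subroutine edge_kernel(e)
      type(edge), intent(inout) :: e
      ! ... operate on e%cells%items ...
   end subroutine edge_kernel
   ! Flat 2D storage: the same kernel also needs the whole adjacency table.
   subroutine edge_kernel_flat(ie, edge_cells)
      integer, intent(in) :: ie                ! edge index
      integer, intent(in) :: edge_cells(:,:)   ! whole-mesh topology as an extra argument
      ! ... operate on edge_cells(:, ie) ...
   end subroutine edge_kernel_flat
end module edge_kernels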
This is also what I see. It’s surprising to me, but at the end of the day, if I have a structure array with 1M elements, each of them being 512 bits (64 bytes), and my RAM writes at 10 GB/s, the initialization time will be around 64 MB / (10 GB/s) ≈ 6 ms.
Does that include tree structures such as Javier Bonet’s Alternating Digital Tree (ADT) or KD-trees? These are used in just about every unstructured CFD code I’ve seen. That being said, I use my own vector (dynamic array) class and jagged arrays (to support a variety of cell types: tets, hexes, prisms, etc.) to store connectivity data. I also wrote my own package of utilities that mimics MATLAB’s set functions to find common nodes and the like. I agree that a list is not the best choice. However, tree structures are supposed to yield the fastest search (or so I’ve been told).
Fortran implementations of ADT and KD-trees can be found in the NASA cfdtools repository.
My base mesh structure is an MPI “forest” of unstructured trees. Conventional cell types (tet/hex/prism/pyramid) are handled with fixed-size arrays (8 nodes, 6 faces, 12 edges), and I’m currently working on extending it to general polyhedral cells. I believe I’ve tried just about every possible data structure; 3D geometry is the least forgiving: everything has to be top notch or CPU/GPU time easily explodes.
If you’re interested, I shared some of the details at the seminar I gave at UW-Madison a couple of weeks ago here: it shows some of the automated meshing capability I’m building (conformal, adaptive, unstructured mesh + cut-cell boundaries).