Aligned allocation

When writing subroutines that are amenable to vectorization, it can be helpful to give the compiler alignment hints. Such hints are only valid if the memory really is aligned on a boundary equal to the vector register length (for example, 32 bytes for 256-bit registers).

There are several ways to obtain aligned memory:

Perhaps someone else is aware of alignment control directives from other vendors (Cray, NVIDIA, AMD)? gfortran and flang don’t appear to have them. (Alignment attributes are available for C/C++ with gcc and clang.) If you know of other options, I will add them to the list.

Going forward, the OpenMP directive combined with !$omp simd for explicit vectorization control appears to be the most portable option, assuming that vendors will come to support it.
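To illustrate the `simd` half of that combination, here is a minimal sketch in C rather than Fortran; the `aligned` clause plays the same role in both languages. The function name `saxpy_aligned` and the 64-byte figure are assumptions made for the example. The clause is a promise to the compiler, not a request: if the pointers are not actually aligned, the behaviour is undefined, which is why the sketch checks them with `assert` in debug builds.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: y = a*x + y over n floats.
 * Assumes x and y both start on a 64-byte boundary (an assumed
 * vector width; adjust for the target hardware). */
void saxpy_aligned(size_t n, float a, const float *x, float *y)
{
    /* Verify the promise we are about to make to the compiler. */
    assert((uintptr_t)x % 64 == 0);
    assert((uintptr_t)y % 64 == 0);

    /* With OpenMP enabled (e.g. -fopenmp or -fopenmp-simd on gcc/clang),
     * the loop is vectorized using the stated alignment.
     * Without it, the pragma is ignored and the loop still runs. */
    #pragma omp simd aligned(x, y : 64)
    for (size_t i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}
```

The arrays passed in must come from an allocation method that really delivers 64-byte alignment; the directive itself does not move or realign anything.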

What would be required to make alignment part of the base language allocate() statement? Does it even make sense without a memory model or OS support for dealing with different CPU modes and protection rings?




In f77 and earlier, the only alignment requirements might have been that a 4-byte entity be aligned on a 4-byte boundary, or that an 8-byte entity be aligned on an 8-byte boundary. This was before derived types were added to the language, of course. So the common approach in f77 was to declare an 8-byte entity, usually double precision or real*8, sometimes integer*8, and then use type punning through dummy arguments, equivalence, or storage association to coerce other “temporary” entities to that alignment. A loc() function is not required for this. So you might waste a few bytes here and there in order to force an integer*4 array onto an 8-byte boundary, but the benefit was that the code was relatively portable with no performance surprises.

Nowadays, with vector hardware, alignment requirements can be stricter, perhaps requiring 16-, 32-, or 64-byte alignment. Since the standard does not mandate or even acknowledge byte addressing, I don’t think it would be appropriate to introduce any byte-specific semantics into, for example, the allocate() statement. I think the current practice is that if, for example, a 16-byte entity is being allocated or declared locally, then it is aligned on a 16-byte boundary. Are there situations where this convention is inadequate to achieve optimal performance, and some additional addressing requirements are necessary?

How does malloc() work in C? Does it always return addresses with the maximal alignment requirements, without knowing what it is allocating? In this sense, the Fortran allocate() has more information available than C malloc().

In some SIMD cases it can be beneficial to align memory so that no peel or tail loops are needed. Say for an array of floats, instead of aligning it on an 8-byte boundary, you might want a 16-, 32-, or 64-byte boundary instead.
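As a rough sketch of what this looks like on the C side, the snippet below uses C11 `aligned_alloc` and pads the length up to a whole number of vector-sized chunks, so a vectorized loop over the padded length needs neither a peel nor a tail. The helper names `round_up` and `alloc_floats_aligned` are made up for this example, and the choice of alignment is left to the caller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round n up to the next multiple of m (m > 0). */
static size_t round_up(size_t n, size_t m)
{
    return (n + m - 1) / m * m;
}

/* Hypothetical helper: allocate room for at least n floats on an
 * `alignment`-byte boundary. The byte count is padded to a multiple of
 * `alignment`, and the padded element count is stored in *padded_n
 * (if non-NULL). Release the memory with free(). */
float *alloc_floats_aligned(size_t n, size_t alignment, size_t *padded_n)
{
    /* The alignment must hold a whole number of floats. */
    assert(alignment != 0 && alignment % sizeof(float) == 0);

    size_t bytes = round_up(n * sizeof(float), alignment);
    if (padded_n != NULL)
        *padded_n = bytes / sizeof(float);

    /* C11: the alignment must be one the implementation supports
     * (in practice a power of two); returns NULL on failure. */
    return aligned_alloc(alignment, bytes);
}
```

For 100 floats and a 64-byte alignment this requests 448 bytes, i.e. 112 floats; the 12 padding elements should be initialized (for example to zero) if the loop is going to run over them.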

malloc returns a buffer of bytes, without any knowledge of how they will be used. It is defined such that:

If allocation succeeds, returns a pointer that is suitably aligned for any object type with fundamental alignment.

I won’t attempt to explain what this means in C terms, but in practice it boils down to 8 or 16 bytes, which also happens to be the size of long double on common platforms.
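If you want to see the number on your own platform, a small C11 sketch like the following queries `alignof(max_align_t)`, which is the fundamental alignment the quoted guarantee refers to, and checks a `malloc` result against it. The function names are invented for this example.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* The strictest fundamental alignment on this platform, in bytes. */
size_t fundamental_alignment(void)
{
    return alignof(max_align_t);
}

/* Returns 1 if the address p is a multiple of `alignment`, else 0. */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0);
    return (uintptr_t)p % alignment == 0;
}

/* Hypothetical check: allocate `bytes` with malloc and report whether
 * the returned pointer meets the fundamental alignment.
 * Returns 0 if the allocation itself fails. */
int malloc_meets_fundamental_alignment(size_t bytes)
{
    void *p = malloc(bytes);
    if (p == NULL)
        return 0;
    int ok = is_aligned(p, fundamental_alignment());
    free(p);
    return ok;
}
```

Anything stricter than this value, such as 32 or 64 bytes for wide vector registers, is not covered by `malloc` and has to be requested explicitly, e.g. through `aligned_alloc` or `posix_memalign`.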