Aligned allocation

ivanpribec · July 27, 2023, 2:08pm

When writing subroutines that are amenable to vectorization, it can be helpful to provide alignment hints. This requires that the memory be aligned on boundaries that are a multiple of the vector register width (for example 16, 32, or 64 bytes).

There are several ways to obtain aligned memory:

- compiler directives
  - Intel Fortran: !DIR$ ATTRIBUTES ALIGN: n :: object
  - IBM XL Fortran: !IBM* ALIGN(n, object)
- compiler flags
  - Intel Fortran: -align arraynbyte
  - NVIDIA nvfortran: -Mcache_align
- manual memory management “F77 style” (use of a static memory pool and the non-standard loc extension to position arrays at aligned boundaries)
- pointer arrays with memory allocated via C (or C++) library routines:
  - aligned_alloc (C11, C++17)
  - operator new[] (aligned version since C++17)
  - posix_memalign (POSIX)
  - _aligned_malloc (Windows)
  - _mm_malloc (Intel, GCC, clang, on x86-64)
  - mkl_malloc (Intel oneMKL, C and Fortran versions available)
  - fftw_malloc (FFTW)
- the OpenMP 5.1 allocate directive with align clause (!$omp allocate align(...))
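
As a rough illustration of the C-library route, here is a minimal sketch in C of a wrapper around aligned_alloc. The names round_up, alloc_aligned and is_aligned are made up for this example. The rounding is there because C11 asks that the requested size be a multiple of the alignment. Reaching such a routine from Fortran would go through bind(C) and c_f_pointer, which is not shown here.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Round nbytes up to the next multiple of alignment.
   Assumes alignment is a power of two. */
size_t round_up(size_t nbytes, size_t alignment)
{
    return (nbytes + alignment - 1) & ~(alignment - 1);
}

/* Hypothetical helper: allocate at least nbytes on an alignment-byte
   boundary. Returns NULL on failure; release the memory with free(). */
void *alloc_aligned(size_t nbytes, size_t alignment)
{
    return aligned_alloc(alignment, round_up(nbytes, alignment));
}

/* Returns 1 if p sits on an alignment-byte boundary, 0 otherwise. */
int is_aligned(const void *p, size_t alignment)
{
    return ((uintptr_t)p % alignment) == 0;
}
```

posix_memalign or _aligned_malloc could be swapped in on platforms where aligned_alloc is not available; note that memory from _aligned_malloc has to be released with _aligned_free rather than free.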

Perhaps someone else is aware of alignment control directives from other vendors (Cray, NVIDIA, AMD)? gfortran and flang don’t appear to have them. (Alignment attributes are available for C/C++ with gcc and clang.) If you know of other options, I will add them to the list.

Going forward, the OpenMP directive combined with !$omp simd for explicit vectorization control appears to be the most portable option, assuming that vendors will come to support it.
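
For readers more familiar with C, the same pairing of aligned storage and an explicit SIMD loop can be sketched as follows. This is a C analogue, not the Fortran directive itself. The function name axpy_aligned and the 64-byte figure are assumptions for illustration, and the pragma has no effect unless the compiler's OpenMP (or OpenMP SIMD-only) support is switched on.

```c
#include <stddef.h>

/* y := a*x + y.
   The aligned clause is a promise by the caller that x and y both start
   on 64-byte boundaries; passing unaligned pointers breaks that promise. */
void axpy_aligned(size_t n, float a, const float *x, float *y)
{
    #pragma omp simd aligned(x, y : 64)
    for (size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}
```

The buffers passed in would have to come from one of the aligned allocation routes listed earlier in the thread.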

What would be required to make alignment part of the base language allocate() statement? Does it even make sense without a memory model or OS support for dealing with different CPU modes and protection rings?

RonShepard · July 27, 2023, 3:55pm

In f77 and earlier, the only alignment requirements might have been that a 4-byte entity be aligned on 4-byte boundaries, or that an 8-byte entity be aligned on 8-byte boundaries. This was before derived types were added to the language, of course. So the common approach in f77 involved declaring an 8-byte entity, usually double precision or real*8, sometimes integer*8, and then using type punning through dummy arguments, equivalence, or storage association to coerce other “temporary” entities to that alignment. A loc() function is not required for this. So you might waste a byte here and there in order to force an integer*4 array to be aligned on an 8-byte boundary, but the benefit was that the code was relatively portable with no performance surprises.
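
The bookkeeping behind that workspace idiom can be sketched in C. This only illustrates the index arithmetic, assuming 8-byte words; the names words_needed and carve are invented here and do not come from any library.

```c
#include <stddef.h>

/* Number of 8-byte words needed to hold nbytes, rounding up.
   The rounding is where a few bytes get wasted. */
size_t words_needed(size_t nbytes)
{
    return (nbytes + 7) / 8;
}

/* Reserve nbytes from a workspace measured in 8-byte words.
   *next is the index of the first free word. The function returns the
   starting word index of the new block and advances *next past it.
   Every block starts on a whole word, so every block is 8-byte aligned
   as long as the workspace itself is. */
size_t carve(size_t *next, size_t nbytes)
{
    size_t start = *next;
    *next += words_needed(nbytes);
    return start;
}
```

For example, carving 12 bytes (three integer*4 values) and then 80 bytes (ten double precision values) from an empty workspace places the second block at word 2, leaving 4 bytes of padding after the integers.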

Nowadays with vector hardware, alignment can be more extensive, perhaps requiring 16-, 32-, or 64-byte alignment. Since the standard does not mandate or even acknowledge byte addressing, I don’t think it would be appropriate to introduce any byte-specific semantics into, for example, the allocate() statement. I think the current practice is that if, for example, a 16-byte entity is being allocated, or declared locally, then it is aligned on a 16-byte boundary. Are there situations where this convention is inadequate to achieve optimal performance, and some additional addressing requirements are necessary?

How does malloc() work in C? Does it always return addresses with the maximal alignment requirements, without knowing what it is allocating? In this sense, the Fortran allocate() has more information available than C malloc().

ivanpribec · November 20, 2023, 8:39pm

In some SIMD cases it can be beneficial to align memory and make sure that no peel or tail loops are needed. Say for an array of floats, instead of aligning it on the 8-byte boundary, you’d maybe like 16, 32, or 64 instead.
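
To make the peel and tail terminology concrete, here is a small sketch in C of the arithmetic involved. The function names peel_count and padded_length are hypothetical, and the code assumes 4-byte floats and a power-of-two alignment.

```c
#include <stddef.h>
#include <stdint.h>

/* Number of leading float elements a loop would have to "peel" before
   reaching the next alignment-byte boundary. Assumes p is at least
   float-aligned and alignment is a power of two. */
size_t peel_count(const float *p, size_t alignment)
{
    size_t misalign = (uintptr_t)p % alignment;
    return misalign == 0 ? 0 : (alignment - misalign) / sizeof(float);
}

/* n rounded up to a whole number of vectors of `lanes` elements.
   Padding the array to this length removes the need for a scalar tail loop. */
size_t padded_length(size_t n, size_t lanes)
{
    return (n + lanes - 1) / lanes * lanes;
}
```

With a 32-byte aligned buffer the peel count is zero, and padding a 10-element array to 8-float vectors gives a length of 16, so the whole loop can run on full vectors.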

malloc returns a buffer of bytes, without any knowledge of how they are used. It is defined such that

“If allocation succeeds, returns a pointer that is suitably aligned for any object type with fundamental alignment.”

I won’t attempt to explain what this means in C terms, but in practice it boils down to 8 or 16 bytes, which also happens to be the size of long double.
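
The actual figure on a given platform can be checked with a few lines of C11. This is a sketch only; the function names below are invented for the example, and the result depends on the compiler and C library in use.

```c
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* The largest fundamental alignment on this platform, in bytes. */
size_t fundamental_alignment(void)
{
    return alignof(max_align_t);
}

/* Allocate nbytes with malloc and report whether the returned address
   meets the fundamental alignment.
   Returns 1 if it does, 0 if not, -1 if malloc failed. */
int malloc_meets_fundamental_alignment(size_t nbytes)
{
    void *p = malloc(nbytes);
    if (p == NULL) {
        return -1;
    }
    int ok = ((uintptr_t)p % alignof(max_align_t)) == 0;
    free(p);
    return ok;
}
```

Anything stricter than this value, such as 32 or 64 bytes for wide vector registers, has to be requested explicitly through one of the aligned allocation routines.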
