Memory alignment for SIMD

Is there a way to ensure memory alignment of allocated arrays ie to 16-byte boundaries for calling SIMD instructions? Ideally with gfortran. Thanks!

This post might be of interest.

1 Like

As @mEm has pointed out, unless your compiler supports the OpenMP 5.x standard, or your compiler has custom directives or external flags, you’ll probably have to do your memory allocation via a C routine. This is unpleasant, because you are forced to use pointer arrays.

For example the Intel compilers provide _mm_alloc and _mm_free, which can be called via a Fortran wrapper (generalized to cover the types/kinds/ranks you need):

interface
    function mm_malloc(size,align) bind(c,name="_mm_malloc")
        import c_size_t, c_ptr
        integer(c_size_t), value :: size, align
        type(c_ptr) :: mm_malloc
    end function
    subroutine mm_free(p) bind(c,name="_mm_free")
        import c_ptr
        type(c_ptr), value :: p
    end subroutine
end interface

integer :: n
real(8), pointer :: aligned(:) => null()
type(c_ptr) :: buffer

n = 1000
buffer = mm_malloc(n * c_sizeof(1.0_8), 16_c_size_t)
if (c_associated(buffer)) then
    call c_f_pointer(buffer,aligned,[n])
end

With gfortran, I’m not sure which library you need to link to obtain the _mm_alloc and _mm_free. It could be they are macros wrapping something else, so you’d need to write a Fortran compatible C wrapper first. Maybe you could use F2018 C descriptors to make this easier.

Perhaps using aligned_alloc is slightly easier, since it comes with the C standard library:

! test_aligned_alloc.f90
!
program test_aligned_alloc
use, intrinsic :: iso_c_binding
implicit none
interface
    ! size must be a multiple of alignment
    function c_aligned_alloc(alignment,size) bind(c,name="aligned_alloc")
         import c_size_t, c_ptr
         integer(c_size_t), value :: alignment, size
         type(c_ptr) :: c_aligned_alloc
    end function
    subroutine c_free(ptr) bind(c,name="free")
        import c_ptr
        type(c_ptr), value :: ptr
    end subroutine
end interface

integer, parameter :: wp = c_float

type(c_ptr) :: a_p = c_null_ptr
real(wp), pointer :: a(:)
integer :: n, npad

integer, parameter :: balign = 16

n = 5
print *, "n                  = ", n

npad = n + (balign - mod(n * c_sizeof(1.0_wp), balign))/c_sizeof(1.0_wp)
print *, "npad               = ", npad

! size of allocated array must be an integer multiple of the alignment
a_p = c_aligned_alloc( int(balign,c_size_t), npad * c_sizeof(1.0_wp))
call c_f_pointer(a_p,a,[n])
print *, "mod(loc(a),balign) = ", mod(loc(a),balign)

call random_number(a)
call square(a)

print '(F10.3,2X)', a

nullify(a)
call c_free(a_p)
a_p = c_null_ptr

contains 
    subroutine square(a)
        real(wp), intent(inout) :: a(:)
        a = a**2
    end subroutine
end program

The omp_alloc function also works with gfortran v13, unfortunately, the allocators directive for allocatable arrays is not supported yet:

! test_omp_alloc.f90
!
program test_aligned_alloc
use, intrinsic :: iso_c_binding
use omp_lib
implicit none

integer, parameter :: wp = c_float

integer(omp_memspace_handle_kind)  :: a_memspace = omp_default_mem_space
type(omp_alloctrait)                :: a_traits(1) = &
                                        [omp_alloctrait(omp_atk_alignment,16)]
integer(omp_allocator_handle_kind)  :: a_alloc

real(wp), pointer :: a(:)
type(c_ptr) :: a_p = c_null_ptr
integer :: i, n

a_alloc = omp_init_allocator( a_memspace, 1, a_traits)

n = 5
a_p = omp_alloc(n * c_sizeof(1.0_wp), a_alloc)
call c_f_pointer(a_p, a, [n])
print *, "mod(loc(a),16) = ", mod(loc(a),16)

call random_number(a)
call square(a)

print '(F10.3,2X)', a

! Edit 2: inserted lines to free memory
nullify(a)
call omp_free(a_p, a_alloc)
call omp_destroy_allocator(a_alloc)

contains 
    subroutine square(a)
        real(wp), intent(inout) :: a(:)
        a = a**2
    end subroutine
end program

Even if you obtain aligned memory, GNU Fortran doesn’t have a directive to force aligned loads and stores. The !$omp simd aligned(...) is there, but its unclear what type of variables are allowed to appear in it.

Currently, gfortran appears to follow the OpenMP 4.0 standard, which lists the restriction:

  • The type of list items appearing in the aligned clause must be C_PTR or Cray Fortran pointer, or the list item must have the POINTER or ALLOCATABLE attribute.

In OpenMP 5.0, this is changed to:

  • If a list item on the aligned clause has the ALLOCATABLE attribute, the allocation status must be allocated.
  • If a list item on the aligned clause has the POINTER attribute, the association status must be associated.
  • If the type of a list item on the aligned clause is either C_PTR or Cray pointer, the list item must be defined

Note there is no restriction regarding assumed-size or other arrays (on the stack).

In OpenMP 5.2 the aligned clause is changed again:

  • Each list item must have C_PTR or Cray pointer type or have the POINTER or ALLOCATABLE attribute. Cray pointer support has been deprecated.
  • If a list item has the ALLOCATABLE attribute, the allocation status must be allocated.
  • If a list item has the POINTER attribute, the association status must be associated.
  • If the type of a list item is either C_PTR or Cray pointer, it must be defined. Cray pointer support has been deprecated.

gfortran and the Intel Fortran compilers appears to interpret these differently: Compiler Explorer

Edit: In the upcoming OpenMP 6.0 standard (currently just a technical preview), the restrictions on aligned items are changed again:

Each list item must be an array.

2 Likes

if you are using Intel Fortran, it’s much easier. Use the -align arrayXbyte option. for example:
ifx -align array64byte foo.f90

2 Likes

With Intel Fotran compiler (ifort), I often use the following directives too:

 ...
 !dir$ attributes align:64 :: rdummy
 allocate(rdummy3(n))
 !dir$ assume_aligned rdummy(1):64
...
1 Like

I have again been looking at this issue of alignment, in particular for ALLOCATE of private arrays in OpenMP being positioned on a shared HEAP. The issues can quickly get more complex, especially when including GPU off-loading; ( “hetrogeneous” memory is the new problem! ).

But I am amazed at how complex the implementation for OpenMP Ver 5.1 and 5.2 have become. I invite anyone to read the OpenMP Ver 5.2+ Specifications.

I was hoping there could be something as simple as:

“ALLOOCATE ( array(n,m), ALIGN=4096, STAT=stat )”

which could start the array allocation on the heap at a memory address where mod(address,4096) = 0, ie the start of a physical memory page. Other values of =8, =16 etc may suit AVX registers, but AVX alignment has moved on from this also.

Unfortunately this can never happen with our hardware agnostic Fortran Standard, which I think is a shame.

IMO all of this should be transparently managed by the compiler, possibly with the help of command-line flags such as the one indicated by @greenrongreen

If you want to align on memory pages, then you also need inquiry functions to get the size of a page. Which in turn assumes that the memory management is based on pages, and so on…

But I’m still wondering (really): what is the motivation for aligning on pages? I get the reasons why aligning on vector register size and/or on cache lines can help, but I don’t get them on pages…

About aligning on vector register size: to which degree is it important? If an array is not properly aligned I assume that only the first elements cannot be processed by vector instructions and that the vector instructions start at the first aligned element : am I correct? If yes, it mainly matters for small arrays, while the performance hit is minimal for long arrays.

unaligned - yes, most vectorizers will create a “peel” loop to handle a few array elements to get to alignment. Then it will use vector-sized instructions in the kernel for most of the iterations. Any leftovers are done in a “remainder” section. And it can generate aligned loads/stores for aligned accesses, and unaligned load/stores for the other accesses. Consider A, B, and C all vector aligned. This expression:

C(i) = B(i) + C(i+1)

The compiler will use aligned loads for B. Unaligned for C. Aligned stores to C.

The downside for this approach, particularly for the allocation of many small arrays, is that the heap would be fragmented, on purpose by the programmer. Other allocations, without the ALIGN= keyword, might fill in some of the holes later, but there could be a lot of heap address space consumed for just a small amount of used memory on each page. For example, a million arrays of size 96 bytes would normally consume ~96MB of memory (not counting the metadata for simplicity), but with this ALIGN setting, it would instead consume ~4GB of heap space. I’m not saying this would be a bad idea in every case, it just seems like aiming the shotgun at your foot and hoping it doesn’t go off.