If we are considering enhancements to the ALLOCATE statement, an option that could help is ALIGN=, accepting values such as 8 or 16 (bytes) or, importantly for OpenMP where the heap is shared between all threads, ALIGN=PAGE.
Surely I am not the only one who wants a simple change that could improve efficiency by removing one of the causes of memory consistency problems.
The OpenMP 5.0 standard already covers your needs in that regard: Memory Allocators
In OpenMP 5.1, they also introduced an align clause in the allocate directive
Edit: In OpenMP 5.2 the use of the !$omp allocate directive to decorate an allocate statement was deprecated. Instead, a new !$omp allocators construct was introduced. The difference is shown below:
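Roughly, the two forms look like this (a sketch from my reading of the 5.1 and 5.2 specs, not tested against a particular compiler; my_alloc is a placeholder allocator handle):

! OpenMP 5.1: allocate directive immediately preceding the allocate statement (deprecated in 5.2)
!$omp allocate(x) allocator(my_alloc) align(64)
allocate(x(n))

! OpenMP 5.2: allocators construct with allocator/align modifiers on the allocate clause
!$omp allocators allocate(allocator(my_alloc), align(64): x)
allocate(x(n))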
Do you know where there is a Fortran code example of this being achieved, i.e. selecting that an allocatable array in a !$OMP region is positioned at the start of a memory page?
Also:
from what version of gfortran is this syntax supported?
is there any guidance on whether this needs to be applied to every allocate statement in a thread, or whether, once one array is positioned on a new page, subsequent arrays for that thread can share the page(s)?
I have wondered if this needs to be placed in a !$OMP CRITICAL region.
Have you been able to implement the Ver 5.1+ syntax ?
When Ver 5.1 was released, I found these memory options in OpenMP 5.1+ very difficult to understand, especially as they were not supported by the then-available Fortran compilers.
To address the memory consistency problem, I would like to specify that each heap memory page is used by only a single thread, but the available OpenMP syntax, based on target or device, did not appear to address the choice between stack and heap memory; it seemed aimed more at GPU off-loading. A workable example could help greatly.
Unfortunately Fortran is very quiet on the roles of stack and heap.
! omp_alloc_test.f90 --
! Demonstration of OpenMP 5.2 allocators
use omp_lib
integer, parameter :: dp = kind(1.0d0)
real(dp), allocatable :: x(:), y(:)
real(dp) :: a
integer(omp_memspace_handle_kind) :: xy_memspace = omp_default_mem_space
type(omp_alloctrait) :: xy_traits(1) = &
[omp_alloctrait(omp_atk_alignment,64)]
integer(omp_allocator_handle_kind) :: xy_alloc
integer :: i, n
xy_alloc = omp_init_allocator( xy_memspace, 1, xy_traits )   ! 1 = number of traits
n = 1000
!$omp allocators allocate(xy_alloc: x,y)
allocate(x(n),y(n))
! loc is a non-standard extension, but in practice all compilers have it
if(modulo(loc(x),64) /= 0 .or. modulo(loc(y),64) /= 0) then
print*,"ERROR: x|y not 64-byte aligned"
stop
endif
y = 0
x = 4
a = 1.0_dp/7.0_dp
! daxpy
!$omp simd simdlen(4) aligned(x,y: 64)
do i = 1, n
y(i) = y(i) + a*x(i)
end do
deallocate(x,y)
call omp_destroy_allocator(xy_alloc)
end
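Regarding placing an array at the start of a page: in principle one only needs to raise the alignment trait in the program above to the page size (a sketch, assuming 4 KiB pages; I have not checked which compilers honour alignments that large):

type(omp_alloctrait) :: xy_traits(1) = &
   [omp_alloctrait(omp_atk_alignment, 4096)]

and the check then becomes modulo(loc(x), 4096) /= 0.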
Modern Fortran Explained (2018) mentions heap (p105) and stack (p120). The f2023 standard mentions heap only once, in C.13.1, and STACK only as an example of a derived type in 7.5.4.2 Note 2. Section 1 Scope may help to explain why the standard does not say more.
Is this considered a good question ?
Having local thread data on the same memory page that is shared between threads is a performance problem. Separating memory pages between threads can mitigate this significant contributor to delays in guaranteeing cache coherency.
There is some performance hit if different cores try accessing the same cache line, and at least one of them is writing to it. Cache lines are usually 64 or 128 bytes long. Sure, by putting the local variables on different pages this problem is avoided, but it’s kind of overkill, as it’s enough if they are on different cache lines (unless I’m missing something…).
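To illustrate the cache-line point: it is usually enough to pad per-thread data out to the line size, so that no two threads write within the same 64 bytes (a sketch; the 64-byte line size and 8 threads are assumptions here):

! falseshare_pad.f90 -- hedged sketch: keep each thread's accumulator on its own cache line
use omp_lib
implicit none
integer, parameter :: dp = kind(1.0d0)
integer, parameter :: pad = 8            ! 8 * 8 bytes = 64 bytes, assumed line size
real(dp) :: partial(pad,8)               ! column it+1 is only written by thread it
integer :: it, i
partial = 0
call omp_set_num_threads(8)
!$omp parallel private(it,i)
it = omp_get_thread_num() + 1
do i = 1, 10000000
   partial(1,it) = partial(1,it) + 1     ! writes of different threads are 64 bytes apart
end do
!$omp end parallel
print *, sum(partial(1,:))               ! expect 8.0E7
end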
Anyway, tests on gfortran show that the compiler is actually taking care of that:
use omp_lib
implicit none
real, allocatable :: a(:), b(:)
integer :: i, it
integer*8 :: p(8,2)
call omp_set_num_threads(8)
!$OMP PARALLEL PRIVATE(it,a,b)
it = omp_get_thread_num()
allocate( a(10),b(10) )
p(it+1,:) = [loc(a),loc(b)]
deallocate( a )
!$OMP END PARALLEL
do i = 1, 8
print*, p(i,:)/4096   ! page index of a and b for each thread (4096-byte pages assumed)
end do
end
I can’t test the mingw compiler in Fortran. I converted the code to C and tested it on godbolt.org (it offers the mingw compiler for C, but not for Fortran), and indeed it reuses the same pages for different threads, and even the same cache lines (which is the real issue): Compiler Explorer
To me it’s a pitfall of the compiler. Or is it because of limitations of Windows?
@PierU, I have modified your code to also report the stack page for each thread. The code now runs as I would expect.
The L1 cache can be much larger than 64 or 128 bytes, which enables whole memory pages to be loaded.
use omp_lib
implicit none
real*8, allocatable :: a(:), b(:)
integer :: i, it
integer*8 :: p(3,8)
call omp_set_num_threads(8)
!$OMP PARALLEL PRIVATE(it,a,b) SHARED(p)
it = omp_get_thread_num()
allocate ( a(10),b(10) )
p(:,it+1) = [ loc(a), loc(b), loc(it) ]
deallocate ( a, b )
!$OMP END PARALLEL
do it = 1, 8
write (*,11) (p(i,it)/4096, mod(p(i,it),4096),i=1,3)
end do
11 format ( 3(5x,i0,' :',i5) )
end
Of course, but I was talking about the cache line size, which is different from the cache size. Data are copied from/to the L1 cache in segments of 64 bytes (or 128 bytes, depending on the CPU), and these segments are named cache lines. So what is important is that the local variables of the different threads do not share a cache line, even if they belong to the same page.
Does your CPU (or others you are using) have multiple NUMA nodes? Perhaps it’s the first-touch placement preventing you from exploiting the full bandwidth available? (On Linux you can find out the number of “nodes” with lscpu; on Windows perhaps systeminfo would show it? When the Intel HPC toolkit is available, cpuinfo will show you the number of “packages”.)
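For completeness, first-touch placement can be exploited by initialising a shared array in parallel with the same static schedule that the compute loops use; each page then typically lands on the NUMA node of the thread that touched it first (a sketch, only relevant on multi-node machines):

! firsttouch.f90 -- hedged sketch of first-touch initialisation
implicit none
integer, parameter :: dp = kind(1.0d0)
integer, parameter :: n = 10000000
real(dp), allocatable :: x(:), y(:)
integer :: i
allocate( x(n), y(n) )                   ! pages are usually not committed until touched
!$omp parallel do schedule(static)
do i = 1, n                              ! first touch decides which node each page lands on
   x(i) = 1
   y(i) = 0
end do
!$omp parallel do schedule(static)
do i = 1, n                              ! same schedule -> mostly node-local accesses
   y(i) = y(i) + 2*x(i)
end do
print *, y(1), y(n)
end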
Yes, this is a workable patch. My practical approach for ensuring allocated arrays are on different pages is to round each array size up to a multiple of the memory page size (4 KBytes), as in the rough sketch below. (Starting each array at the start of a page would be better.)
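For what it's worth, the rounding itself is just this (a sketch, assuming 4 KiB pages and 8-byte reals; whether the allocator then keeps separate arrays on separate pages is still implementation-dependent):

! pagepad.f90 -- hedged sketch: round each allocation up to whole 4 KiB pages
implicit none
integer, parameter :: page_bytes = 4096, elem_bytes = 8   ! assumed page and element sizes
real*8, allocatable :: x(:), y(:)
integer :: n, n_pad
n = 1000
n_pad = ( (n*elem_bytes + page_bytes - 1) / page_bytes ) * (page_bytes / elem_bytes)
allocate( x(n_pad), y(n_pad) )           ! use only x(1:n), y(1:n)
print *, n, n_pad                        ! 1000 -> 1024 elements (2 full pages)
deallocate( x, y )
end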
We can have these strategies, but one of the main problems I am having with OpenMP relates to the variability of thread performance, which makes confirming the benefit of any strategy difficult.
Variability of clock rate between threads is a significant issue for my computation. I have also had significant success using BARRIER to keep threads aligned, so that accesses to very large shared arrays are synchronised between threads in the L3 cache.
Optimising memory and cache for OpenMP is a complex and varied issue.