OpenMP and Allocate

If we are considering enhancements to the ALLOCATE statement, an option that could help is an ALIGN= specifier, with values of 8 or 16 (bytes) or, importantly for OpenMP where the heap is shared between all threads, ALIGN=PAGE.
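
A rough sketch of what such a hypothetical syntax could look like (this is proposed syntax only, not valid in any current Fortran standard):

real, allocatable :: x(:), y(:)
! hypothetical ALIGN= specifier -- not part of any Fortran standard
allocate ( x(n), align=16 )     ! 16-byte alignment
allocate ( y(n), align=page )   ! start y on a fresh memory page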

Surely I am not the only one who wants a simple change that could improve efficiency by removing one of the causes of memory consistency problems.

3 Likes

The OpenMP 5.0 standard already covers your needs in that regard: Memory Allocators

In OpenMP 5.1, they also introduced an align clause in the allocate directive.
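
For example, something along these lines should request 64-byte alignment for x (my reading of the 5.1 syntax, untested):

! OpenMP 5.1: align clause on the allocate directive
!$omp allocate(x) align(64)
allocate(x(n))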

Edit: In OpenMP 5.2, decorating an allocate statement with the !$omp allocate directive was deprecated. Instead, a new !$omp allocators construct was introduced. The difference is shown below:

! OpenMP 5.0
!$omp allocate(x,y) allocator(xy_alloc)
allocate(x(n),y(n))

! OpenMP 5.2
!$omp allocators allocate(xy_alloc: x,y)
allocate(x(n),y(n))
2 Likes

@ivanpribec

Thanks for this feedback on OpenMP updates.

Do you know where there is a Fortran code example of this being achieved, i.e. selecting that an allocatable array in a !$OMP region is positioned at the start of a memory page?

Also:
- From what version of gfortran is this syntax supported?
- Is there any guidance on whether this needs to be applied to every allocate statement in a thread, or whether, once one array is positioned on a new page, subsequent arrays for that thread can share the page(s)?
- I have wondered if this needs to be placed in a !$OMP CRITICAL region.

Have you been able to implement the Ver 5.1+ syntax?

When Ver 5.1 was released, I found these memory options in OpenMP 5.1+ very difficult to understand, especially as they were not supported by the then-available Fortran compilers.

To address the memory consistency problem, I would like to specify that each heap memory page is used by only a single thread, but the available OpenMP syntax, based on target or device, did not appear to address the alternatives of stack or heap memory; it seemed aimed more at GPU off-loading. A workable example could help greatly.

Unfortunately Fortran is very quiet on the roles of stack and heap.

Why is it important to have allocated objects in different threads on different pages?

See Section 11 of the OpenMP 5.2 Examples Book.

It is not supported yet in gfortran. Of the Intel compilers, you’ll have to use the latest one, ifx 2023.2.1.

Here is an example that seems to work in Compiler Explorer,

! omp_alloc_test.f90 --
!     Demonstration of OpenMP 5.2 allocators
use omp_lib

integer, parameter :: dp = kind(1.0d0)
real(dp), allocatable :: x(:), y(:)
real(dp) :: a

integer(omp_memspace_handle_kind )  :: xy_memspace = omp_default_mem_space
type(omp_alloctrait)                :: xy_traits(1) = &
                                            [omp_alloctrait(omp_atk_alignment,64)]
integer(omp_allocator_handle_kind)  :: xy_alloc

integer :: i, n

xy_alloc = omp_init_allocator( xy_memspace, 1, xy_traits)

n = 1000

!$omp allocators allocate(xy_alloc: x,y)
allocate(x(n),y(n))

! loc is a non-standard extension, but in practice all compilers have it
if(modulo(loc(x),64) /= 0 .or. modulo(loc(y),64) /= 0) then
   print*,"ERROR: x|y not 64-byte aligned"
   stop
endif

y = 0
x = 4
a = 1.0_dp/7.0_dp

! daxpy
!$omp simd simdlen(4) aligned(x,y: 64)
do i = 1, n
   y(i) = y(i) + a*x(i)
end do

deallocate(x,y)
call omp_destroy_allocator(xy_alloc)

end
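
Regarding the page-alignment question: presumably the same allocator can be used with the alignment trait set to the page size instead of 64, e.g. (assuming 4 KiB pages)

! variant of the traits above: request page alignment rather than 64 bytes
type(omp_alloctrait) :: xy_traits(1) = &
                            [omp_alloctrait(omp_atk_alignment, 4096)]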

Modern Fortran Explained (2018) mentions heap (p105) and stack (p120). The f2023 standard mentions heap only once, in C.13.1, and STACK only as an example of a derived type in 7.5.4.2 Note 2. Section 1 Scope may help to explain why the standard does not say more.

For good reason. An engineer or numerical analyst should be able to focus on their own domain, not on how the machine's memory works.

Maybe, we can discuss stack/heap and OpenMP in a new thread, not to derail the allocate enhancement discussion any further.

1 Like

@ivanpribec ,

Thanks very much for the info you provided.

I think your comment ignores a lot of practical issues related to performance and use of OpenMP.
As an engineer, achieving a result can be important.

1 Like

Is this considered a good question?
Having thread-local data on a memory page that is shared with other threads is a performance problem. Separating the memory pages used by different threads can mitigate this significant contributor to the delays incurred in guaranteeing cache coherency.

Thanks for the “good question”…

There is some performance hit if different cores access the same cache line and at least one of them is writing to it. Cache lines are usually 64 or 128 bytes long. Sure, putting the local variables on different pages avoids this problem, but it’s kind of overkill, as it’s enough for them to be on different cache lines (unless I’m missing something…).
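
To illustrate, here is a minimal sketch of the false-sharing pattern being discussed (my own example, assuming 64-byte cache lines): per-thread counters stored in adjacent 8-byte slots share a cache line, whereas padding each thread's slot to 64 bytes keeps the threads on separate lines.

! sketch: per-thread counters that share cache lines vs. padded counters
! (assumes 64-byte cache lines and 8-byte elements, i.e. 8 slots per line)
use omp_lib
implicit none
integer, parameter :: nthreads = 8
real(8) :: s_packed(nthreads)     ! 8 adjacent slots -> all threads on one cache line
real(8) :: s_padded(8, nthreads)  ! only s_padded(1,:) used -> ~one line per thread
integer :: i, it

call omp_set_num_threads(nthreads)
s_packed = 0
s_padded = 0

!$omp parallel private(it, i)
it = omp_get_thread_num() + 1
do i = 1, 10000000
   s_packed(it)   = s_packed(it)   + 1   ! all threads keep invalidating the same line
   s_padded(1,it) = s_padded(1,it) + 1   ! each thread stays on its own line
end do
!$omp end parallel

print *, sum(s_packed), sum(s_padded(1,:))
end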

Anyway, tests on gfortran show that the compiler is actually taking care of that:

use omp_lib
implicit none

    real, allocatable :: a(:), b(:)
    integer :: i, it
    integer*8 :: p(8,2)

    call omp_set_num_threads(8)

    !$OMP PARALLEL PRIVATE(it,a,b)
    it = omp_get_thread_num()
    allocate( a(10),b(10) )
    p(it+1,:) = [loc(a),loc(b)]
    deallocate( a, b )
    !$OMP END PARALLEL
    
    do i = 1, 8
        print*, p(i,:)/4096
    end do
end

Outputs:

                 7061                 7061
          34198437888          34198437888
          34198339584          34198339584
          34198372352          34198372352
          34198306816          34198306816
          34198274048          34198274048
          34198241280          34198241280
          34198405120          34198405120

Allocations in the different threads are on fully different pages. I would expect any decent compiler to do that.

1 Like

This is the point of my posts!
Not all compilers do as you might expect.

What memory addresses are you reporting?
34198437888 * 4096 = 1.40077E+14

  It is now Wednesday, 15 November 2023 at  0:54:23.362
gcc.ver=11.1.0
gcc_dir=C:\Program Files (x86)\gcc_eq\gcc_11.1.0
gcc_path=C:\Program Files (x86)\gcc_eq\gcc_11.1.0\bin;C:\Program Files (x86)\gcc_eq\gcc_11.1.0\libexec\gcc\x86_64-w64-mingw32\11.1.0
 AMD Ryzen 9 5900X 12-Core Processor
================================================================ 
  It is now Wednesday, 15 November 2023 at  0:54:23.597
options=-O2 -march=native -ffast-math -fopenmp
                 3307                 3307
                 3307                 3307
                 3307                 3307
                 3307                 3307
                 3307                 3308
                 3307                 3307
                 3307                 3307
                 3307                 3307

I can’t test the mingw compiler in Fortran. I converted the code to C and tested it on godbolt.org (it offers the mingw compiler for C, but not for Fortran), and indeed it reuses the same pages for different threads, and even the same cache lines (which is the real issue).
Compiler Explorer

To me this is a shortcoming of the compiler. Or is it because of limitations of Windows?

The ones returned by the loc() function.

Indeed… Could an @admins split the discussion?

Done. Let me know if I did it correctly.

@PierU, I have modified your code to also report the stack page for each thread. The code now runs as I would expect.
The L1 cache can be much larger than 64 or 128 bytes, which enables whole memory pages to be loaded.

use omp_lib
implicit none

    real*8, allocatable :: a(:), b(:)
    integer :: i, it
    integer*8 :: p(3,8)

    call omp_set_num_threads(8)

    !$OMP PARALLEL PRIVATE(it,a,b) SHARED(p)
      it = omp_get_thread_num()
      allocate ( a(10),b(10) )
      p(:,it+1) = [ loc(a), loc(b), loc(it) ]
      deallocate ( a, b )
    !$OMP END PARALLEL
    
    do it = 1, 8
       write (*,11) (p(i,it)/4096, mod(p(i,it),4096),i=1,3)
    end do
 11 format ( 3(5x,i0,' :',i5) )
end
  It is now Wednesday, 15 November 2023 at  9:10:20.820
gcc.ver=11.1.0
gcc_dir=C:\Program Files (x86)\gcc_eq\gcc_11.1.0
gcc_path=C:\Program Files (x86)\gcc_eq\gcc_11.1.0\bin;C:\Program Files (x86)\gcc_eq\gcc_11.1.0\libexec\gcc\x86_64-w64-mingw32\11.1.0
 AMD Ryzen 9 5900X 12-Core Processor
================================================================ 
  It is now Wednesday, 15 November 2023 at  9:10:21.273
options=-O2 -march=native -ffast-math -fopenmp
     3355 : 3568     3356 : 3392     2351 : 2444
     3355 :  192     3355 :  288     3871 : 3628
     3355 :  384     3355 : 1616     4383 : 3628
     3355 :  192     3355 : 1328     4895 : 3628
     3355 : 3664     3357 :  624     5407 : 3628
     3355 : 1712     3355 :  656     5919 : 3628
     3357 :  528     3357 : 3024     6431 : 3628
     3357 :  144     3357 : 2352     6943 : 3628

Of course, but I was talking about the cache line size, which is different from the cache size. Data are copied from/to the L1 cache in segments of 64 bytes (or 128 bytes, depending on the CPU), and these segments are called cache lines. So what matters is that the local variables of the different threads do not share a cache line, even if they are on the same page.
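
For instance, whether two variables start in the same cache line can be checked by comparing their addresses divided by the line size (a small sketch, assuming 64-byte lines and the non-standard loc() extension):

! sketch: do a and b start in the same 64-byte cache line?
real, allocatable :: a(:), b(:)
allocate ( a(10), b(10) )
print *, 'a line:', loc(a)/64, '  b line:', loc(b)/64, &
         '  same line?', loc(a)/64 == loc(b)/64
end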

I am always surprised that the delays due to achieving cache coherence are not much worse, especially for the array “p” case in your code example.

So the discussion has been about false sharing. These documents from Intel, Oracle, and the University of Oregon do a reasonable job of explaining it.

In the case of the array p, the easiest fix would be to pad the array along the first dimension to match the cache line length (typically 64 bytes):

    integer*8 :: p(8,8)   ! first dimension is 64 bytes. 

    call omp_set_num_threads(8)

    !$OMP PARALLEL PRIVATE(it,a,b) SHARED(p)
      it = omp_get_thread_num()
      allocate ( a(10),b(10) )
      p(1:3,it+1) = [ loc(a), loc(b), loc(it) ]   ! fill only the first 3 slots; the rest is padding
      deallocate ( a, b )
    !$OMP END PARALLEL

Does your CPU (or others you are using) have multiple NUMA nodes? Perhaps it’s the first-touch placement preventing you from exploiting the full bandwidth available? (On Linux you can find out the number of “nodes” with lscpu; on Windows perhaps systeminfo would show it? When the Intel HPC toolkit is available, cpuinfo will show you the number of “packages”.)

Edit: the AMD Ryzen is a single package:

Model name:                      AMD Ryzen 9 5900X 12-Core Processor
..
Thread(s) per core:              2
Core(s) per socket:              12
Socket(s):                       1
...
L1d cache:                       384 KiB (12 instances)
L1i cache:                       384 KiB (12 instances)
L2 cache:                        6 MiB (12 instances)
L3 cache:                        64 MiB (2 instances)
NUMA node(s):                    1
NUMA node0 CPU(s):               0-23

Yes, this is a workable patch. My practical approach for ensuring that allocated arrays are on different pages is to round each array size up to a multiple of the memory page size (4 KBytes). (Starting each array at the start of a page would be better.)
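
A sketch of that rounding, assuming 4 KiB pages and 8-byte elements:

! round the requested element count up so the array spans whole 4 KiB pages
implicit none
integer, parameter :: page_bytes = 4096, elem_bytes = 8
real(8), allocatable :: x(:)
integer :: n, n_padded

n = 1000
n_padded = ((n*elem_bytes + page_bytes - 1)/page_bytes) * (page_bytes/elem_bytes)
allocate ( x(n_padded) )   ! 1024 elements = exactly 2 pages; only x(1:n) is used
end
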
We can have these strategies, but one of the main problems I am having with OpenMP is the variability of thread performance, which makes it difficult to confirm which strategy works.

Variability of clock rate between threads is a significant issue for my computation. I have also had significant success using BARRIER to keep threads aligned, so that accesses to very large shared arrays are synchronised between threads in the L3 cache.
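
A minimal sketch of one way this can work (my interpretation, not the actual code in question): all threads sweep the same block of a large shared array, and a barrier (here the implicit one at !$omp end do) keeps every thread on the same block, which is small enough to stay resident in L3.

! sketch: block-wise sweep of a large shared array, synchronised per block
use omp_lib
implicit none
integer, parameter :: nblk = 16, blk = 1024*1024   ! 16 blocks of 8 MiB (real(8))
real(8), allocatable :: big(:)
integer :: ib, i

allocate ( big(nblk*blk) )
big = 1.0d0

!$omp parallel private(ib, i)
do ib = 1, nblk
   !$omp do
   do i = (ib-1)*blk + 1, ib*blk
      big(i) = 2.0d0*big(i)       ! all threads share the work on block ib
   end do
   !$omp end do   ! implicit barrier: no thread starts block ib+1 early
end do
!$omp end parallel

print *, big(1), big(nblk*blk)
end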

Optimising memory and cache for OpenMP is a complex and varied issue.