I never use closures like !$omp end parallel do for OpenMP codes. Just want to ask if that is OK.
It is just a style preference. I always open and close the construct as follows, and I explicitly declare most of the default options:
!$OMP PARALLEL DO DEFAULT (NONE) &
!$OMP& PRIVATE (…) &
!$OMP& SHARED (…) &
!$OMP& SCHEDULE (DYNAMIC)
!$OMP END PARALLEL DO
Compiling with gfortran -fimplicit-none -fopenmp -fstack-arrays, the compiler reports any variables/arrays I omitted from the clauses, which helps to identify silly omissions.
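As a minimal sketch of the pattern above with the elided clauses filled in: the program name, array names, loop body and file name here are all made up for illustration and are not from the original post.

```fortran
! Hypothetical example: names and loop body are illustrative only.
! Build (file name assumed):
!   gfortran -fimplicit-none -fopenmp -fstack-arrays demo.f90
program default_none_demo
  use omp_lib, only: omp_get_max_threads
  implicit none
  integer :: n, i
  real :: t
  real, allocatable :: a(:), b(:)

  n = 1000
  allocate (a(n), b(n))          ! ALLOCATE is used for the shared arrays
  a = 1.0
  b = 0.0

  ! DEFAULT (NONE) means every variable referenced in the loop
  ! must appear in a PRIVATE or SHARED clause.
  !$OMP PARALLEL DO DEFAULT (NONE) &
  !$OMP& PRIVATE (i, t) &
  !$OMP& SHARED (a, b, n) &
  !$OMP& SCHEDULE (DYNAMIC)
  do i = 1, n
     t = 2.0 * a(i)              ! t is a per-thread scratch value
     b(i) = t + real(i)
  end do
  !$OMP END PARALLEL DO

  print *, 'max threads =', omp_get_max_threads(), ' sum(b) =', sum(b)
end program default_none_demo
```

If, say, t is dropped from the PRIVATE clause, the compiler should complain that it has no data-sharing attribute in the enclosing parallel region, which is the kind of omission the explicit style is meant to catch.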
I was not aware that abbreviations are possible. As a personal style choice I opt for the full closures. These match the symmetry of Fortran constructs.
I also like to use default(none) to force me to think about which variables need to be shared.
I have a question: why do you add the -fstack-arrays flag? I know that this option allows the compiler to put allocatable arrays in stack memory, but I have no idea why I should use it for debugging or for improving efficiency.
Managing the stack with OpenMP is an issue I have spent a lot of time investigating.
Typically with OpenMP, I will:
- Try to default arrays to the stack, but specifically direct (large) arrays to the heap by using ALLOCATE.
- Private copies go on the heap if they are allocated in the same routine as the !$OMP directive; if they are passed in as routine arguments, where they end up is uncertain. All private copies of local/automatic arrays go on the stack.
- Large SHARED arrays are ALLOCATEd on the heap.
- I might also allocate “large” (heap) arrays with their size rounded up to a multiple of 4k bytes, i.e. full memory pages, although I cannot guarantee that they start on a page boundary.
- I define each stack as large (500 MBytes). gFortran by default makes all thread stacks the same size as the primary stack. As physical memory is only committed to each stack as it is used, this does not increase the physical memory used.
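A rough sketch of how the list above might look in code. The module, routine and variable names are hypothetical, and padding the leading dimension so that each column spans whole 4k pages is just one way to read the "rounded up to 4k bytes" idea. As noted above, nothing here guarantees that the array starts on a page boundary.

```fortran
! Hypothetical sketch: all names and sizes are illustrative only.
module work_mod
  use iso_fortran_env, only: real64
  implicit none
contains
  subroutine process_column(col, m)
    integer, intent(in) :: m
    real(real64), intent(inout) :: col(m)
    ! Automatic array: with -fstack-arrays gfortran places it on the
    ! stack of whichever thread calls this routine.
    real(real64) :: tmp(m)
    tmp = col
    col = 2.0_real64 * tmp
  end subroutine process_column
end module work_mod

program heap_stack_demo
  use iso_fortran_env, only: real64
  use work_mod, only: process_column
  implicit none
  integer, parameter :: page_bytes = 4096
  integer :: m, ncol, per_page, m_padded, j
  real(real64), allocatable :: big(:,:)

  m = 1000
  ncol = 64
  ! Number of real64 elements that fit in one 4k page
  per_page = page_bytes / (storage_size(1.0_real64) / 8)
  ! Round the column length up to a whole number of pages
  m_padded = ((m + per_page - 1) / per_page) * per_page

  allocate (big(m_padded, ncol))   ! large SHARED array goes on the heap
  big = 1.0_real64

  !$OMP PARALLEL DO DEFAULT (NONE) &
  !$OMP& PRIVATE (j) &
  !$OMP& SHARED (big, m, ncol) &
  !$OMP& SCHEDULE (DYNAMIC)
  do j = 1, ncol
     call process_column(big(1:m, j), m)   ! contiguous column section
  end do
  !$OMP END PARALLEL DO

  print *, 'padded rows =', m_padded, ' sum =', sum(big(1:m, :))
end program heap_stack_demo
```

The intent of the padding is only that two threads working on neighbouring columns do not touch the same memory page; whether that gives a measurable benefit will depend on the hardware and the access pattern.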
My experience is limited to large shared arrays (20 GBytes) on Intel i7 and AMD Ryzen with dual-channel memory, where the memory <-> cache bottleneck is my biggest problem.
These strategies are an attempt to reduce inefficiency due to memory consistency and cache coherence problems between threads, although they are not a complete solution.
They do not appear to produce a worse outcome, although my experience is limited.
I find efficient use of both OpenMP and AVX via Fortran to be a black art, as the OS does a lot behind the scenes.
At the moment I am using Windows 10 and have just gone from 20H2 to 21H2, which introduced problems with strange allocation of threads to “logical processors”: threads are not preferentially placed on different cores, and there are problems with hyper-threading. I thought that was a Windows 11 issue. AMD have suggested I contact Microsoft, which looks just too difficult.
OpenMP does not easily address hyper-threading, which is probably getting too specific, although having different classes of “logical processors” is becoming a significant issue over which the Fortran programmer has little control. Do others see this problem?
I am not sure this is the case. My Win/gFortran experience is that ALLOCATEd arrays always go on the heap, although this could vary for each compiler and OS implementation. I would be interested to hear of any different experience.