You could also have the loop body as a subroutine, with local dynamic variables.
Surely the explicit use of "private ( ip, im, jp, jm, phi_old, term1, term2, theta, m )" provides the clearest outcome and best documents the code.
Why be so obscure !!
("use explicit private" could be a comment on many posts in this thread)
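For concreteness, a minimal sketch of that style. The private list is the one quoted above; the array names, bounds and update formula are invented for illustration:

```fortran
! Illustrative only: phi, phi_new, n and the update are made up; the
! point is the explicit private list documenting the per-thread scratch
! variables (the list is the one quoted earlier in the thread).
!$omp parallel do default(shared) &
!$omp    private(i, ip, im, jp, jm, phi_old, term1, term2, theta, m)
do j = 1, n
   do i = 1, n
      ip = i + 1; im = i - 1
      jp = j + 1; jm = j - 1
      phi_old = phi(i, j)
      term1 = phi(ip, j) + phi(im, j)
      term2 = phi(i, jp) + phi(i, jm)
      theta = 0.25 * (term1 + term2)
      phi_new(i, j) = phi_old + 0.1 * (theta - phi_old)
   end do
end do
!$omp end parallel do
```

Writing the result into a separate phi_new also avoids a race on the neighbour reads; that is orthogonal to the private list, but worth keeping in mind in this kind of stencil loop.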
So it appears that, in the absence of a clause specifying the attribute, the default is shared.
From previous comments, this appears to be what Gfortran does with theta, m and phi_old.
Ifort, however, with its auto-parallel experience, appears to identify that these should be private and so produces a better (non-conforming?) result.
My experience using OpenMP suggests that relying on this default action is not a good approach, which is probably why I did not recall the default shared.
I was hoping that the block-scoped version would contribute a new angle to readers of this thread, of what private and shared even mean and why they are required for correct execution.
In C with its { } scopes and practice of declaring variables on the spot private() is less used.
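Fortran (2008 and later) can do the same thing with a BLOCK construct, which plays the role of C's { } scope inside the parallel region; a sketch with illustrative names:

```fortran
! Variables declared inside the BLOCK are local to the construct, so
! each thread gets its own copy without any private() clause.
!$omp parallel do
do j = 1, n
   do i = 1, n
      block
         real :: phi_old, theta   ! "declared on the spot", per thread
         phi_old = phi(i, j)
         theta = 0.25 * phi_old
         phi_new(i, j) = theta
      end block
   end do
end do
!$omp end parallel do
```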
Not only will this hurt performance (traversing arrays has a cost in terms of cache misses and first touch), but the right (and still easy) way is to declare as PRIVATE the variables that need to be private. OpenMP is well designed; just use the features it offers.
It all depends on whether the code is written with OpenMP in mind from the start, or whether an existing code is being OpenMP-ized… In the latter case, just declaring some variables as private doesn't require changing the serial code, so you don't have to retest it. In the former case, the block-scoped approach is indeed cleaner.
The inverse 1/B_C is also known as the arithmetic or computational intensity. Some authors even call it a computational force.
Changing the local loop variables into full 2-d arrays will artificially increase the code balance for no good reason, pushing the code (further) down into the bandwidth-limited regime of the roofline performance model.
In our code, we always use !$OMP PARALLEL DEFAULT(SHARED) and then go through the parallel section with a fine-tooth comb, declaring every variable that should be exclusive to each thread in a PRIVATE clause. This is required because the compiler would have a hard time figuring out our intent for each variable.
Interestingly, your variables i and j are private by default because they are loop variables. See here.
Placing the block structure inside the loops implies that the stack allocation and deallocation is done each pass inside the innermost loop. For openmp parallelization, you really only want that overhead to occur once per thread, independently of the number of loop iterations. Of course, the compiler can recognize this and can optimize the allocation steps, but that puts the programmer in a position of specifying an incorrect algorithm, and then relying on the compiler optimization to correct it. It is always better, in principle, for the programmer to specify directly and clearly his intentions (to the compiler and also to human readers of the code).
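One way to express that intent directly, if using the block-scoped style at all, is to place the BLOCK once inside the PARALLEL construct but outside the loops, so the thread-local declarations happen once per thread rather than once per iteration. A sketch with illustrative names:

```fortran
! The BLOCK sits between the PARALLEL and DO directives, so phi_old and
! theta are set up once per thread, independent of the iteration count.
!$omp parallel
block
   real :: phi_old, theta
   !$omp do
   do j = 1, n
      do i = 1, n
         phi_old = phi(i, j)
         theta = 0.25 * phi_old
         phi_new(i, j) = theta
      end do
   end do
   !$omp end do
end block
!$omp end parallel
```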
To follow up on this "puzzled", OpenMP diagnostics could be much improved.
There is a problem in using OpenMP, where the compilers I have used (Ifort and Gfortran) do not provide good diagnostics when interpreting !$OMP directives.
The worst case is when you mis-type !$OMP or don't correctly mark the continuation lines. In these cases, the only identifying symptom can be that there is no performance improvement, which is also the symptom of a memory-bandwidth bottleneck.
A helpful report could include the shared/private/firstprivate status of all variables or arrays referenced in the !$OMP region, although these can be hidden in called routines.
I combine IMPLICIT NONE and DEFAULT(NONE) so that any variable I have not explicitly listed in a SHARED or PRIVATE clause is reported.
Note that for variables or arrays that are not redefined in the OMP region, assuming shared is a safe outcome, as only those that are redefined may need the private attribute. Taking this into account could simplify the DEFAULT(NONE) response.
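As a sketch of that workflow (illustrative names): with DEFAULT(NONE), any variable used in the region but absent from the clauses is rejected at compile time, which turns the silent wrong-default problem into a hard error:

```fortran
! DEFAULT(NONE): the compiler reports every variable referenced in the
! region that is not listed below, forcing an explicit choice per variable.
!$omp parallel do default(none) &
!$omp    shared(phi, phi_new, n) private(i, j, phi_old, theta)
do j = 1, n
   do i = 1, n
      phi_old = phi(i, j)
      theta = 0.25 * phi_old
      phi_new(i, j) = theta
   end do
end do
!$omp end parallel do
```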
Is this an area that compilers should address, or am I missing this compiler feature?
Very interesting! I have one question though: the two loops over i and j seem independent, so I would add collapse(2) to the directive to improve performance. Without collapse, only the outer loop is parallelized, I think.
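For reference, a sketch of the collapsed form (names illustrative): collapse(2) fuses the two loops into one iteration space of n*n before distributing it over the threads, which helps when n alone is small relative to the thread count:

```fortran
! collapse(2) requires the loops to be perfectly nested: no statements
! may appear between the "do j" and "do i" lines.
!$omp parallel do collapse(2) private(phi_old, theta)
do j = 1, n
   do i = 1, n
      phi_old = phi(i, j)
      theta = 0.25 * phi_old
      phi_new(i, j) = theta
   end do
end do
!$omp end parallel do
```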