You could also have the loop body as a subroutine, with local dynamic variables.
Surely the explicit use of "private ( ip, im, jp, jm, phi_old, term1, term2, theta, m )" provides the clearest outcome and best documents the code.
Why be so obscure !!
("use explicit private" could be a comment on many posts in this thread)
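For concreteness, a minimal sketch of that style. The private list is the one quoted above; the array names, bounds and update formula are invented for illustration:

```fortran
! Illustrative only: phi, phi_new, n and the update are made up; the
! point is the explicit private list documenting the per-thread scratch
! variables (the list is the one quoted earlier in the thread).
!$omp parallel do default(shared) &
!$omp    private(i, ip, im, jp, jm, phi_old, term1, term2, theta, m)
do j = 1, n
   do i = 1, n
      ip = i + 1; im = i - 1
      jp = j + 1; jm = j - 1
      phi_old = phi(i, j)
      term1 = phi(ip, j) + phi(im, j)
      term2 = phi(i, jp) + phi(i, jm)
      theta = 0.25 * (term1 + term2)
      phi_new(i, j) = phi_old + 0.1 * (theta - phi_old)
   end do
end do
!$omp end parallel do
```

Writing the result into a separate phi_new also avoids a race on the neighbour reads; that is orthogonal to the private list, but worth keeping in mind in this kind of stencil loop.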
So it appears that, in the absence of a clause specifying the attribute, the default is shared.
From previous comments, this appears to be what Gfortran does with theta, m and phi_old.
Ifort, however, with its auto-parallel experience, appears to identify that these should be private and so produces a better (non-conforming?) result.
My experience using OpenMP suggests that relying on this default action is not a good approach, which is probably why I did not recall the default shared.
I was hoping that the block-scoped version would contribute a new angle to readers of this thread, of what private and shared even mean and why they are required for correct execution.
In C with its { } scopes and practice of declaring variables on the spot private() is less used.
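Fortran (2008 and later) can do the same thing with a BLOCK construct, which plays the role of C's { } scope inside the parallel region; a sketch with illustrative names:

```fortran
! Variables declared inside the BLOCK are local to the construct, so
! each thread gets its own copy without any private() clause.
!$omp parallel do
do j = 1, n
   do i = 1, n
      block
         real :: phi_old, theta   ! "declared on the spot", per thread
         phi_old = phi(i, j)
         theta = 0.25 * phi_old
         phi_new(i, j) = theta
      end block
   end do
end do
!$omp end parallel do
```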
Not only will this hurt performance (traversing arrays has a cost in terms of cache misses and first touch), but the right (and still easy) way is to declare as PRIVATE the variables that need to be private. OpenMP is well designed; just use the features it offers.
It all depends on whether the code is written with OpenMP in mind from the start, or whether an existing code is being OpenMP-ized… In the latter case, just declaring some variables as private doesn't require changing the serial code, so you don't have to retest it. In the former case, the block-scoped approach is indeed cleaner.
The inverse 1/B_C is also known as the arithmetic or computational intensity. Some authors even call it a computational force.
Changing the local loop variables into full 2-d arrays will artificially increase the code balance for no good reason, pushing the code (further) down into the bandwidth-limited regime of the roofline performance model.
In our code, we always use !$OMP PARALLEL DEFAULT(SHARED) and then go through the parallel section with a fine-tooth comb, declaring every variable that should be exclusive to each thread in a PRIVATE clause. This is required because the compiler would have a hard time figuring out our intent for each variable.
Interestingly, your variables i and j are private by default because they are loop variables. See here.
Placing the block structure inside the loops implies that the stack allocation and deallocation is done each pass inside the innermost loop. For openmp parallelization, you really only want that overhead to occur once per thread, independently of the number of loop iterations. Of course, the compiler can recognize this and can optimize the allocation steps, but that puts the programmer in a position of specifying an incorrect algorithm, and then relying on the compiler optimization to correct it. It is always better, in principle, for the programmer to specify directly and clearly his intentions (to the compiler and also to human readers of the code).
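One way to express that intent directly, if using the block-scoped style at all, is to place the BLOCK once inside the PARALLEL construct but outside the loops, so the thread-local declarations happen once per thread rather than once per iteration. A sketch with illustrative names:

```fortran
! The BLOCK sits between the PARALLEL and DO directives, so phi_old and
! theta are set up once per thread, independent of the iteration count.
!$omp parallel
block
   real :: phi_old, theta
   !$omp do
   do j = 1, n
      do i = 1, n
         phi_old = phi(i, j)
         theta = 0.25 * phi_old
         phi_new(i, j) = theta
      end do
   end do
   !$omp end do
end block
!$omp end parallel
```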
To follow up on this "puzzled", OpenMP diagnostics could be much improved.
There is a problem in using OpenMP, where the compilers I have used (Ifort and Gfortran) do not provide good diagnostics when interpreting !$OMP directives.
The worst case is when you mis-type !$OMP or don't correctly mark the continuation lines. In these cases, the only identifying symptom can be that there is no performance improvement, which is also the symptom of a memory-bandwidth bottleneck.
A helpful report could include the shared/private/firstprivate status of all variables or arrays referenced in the !$OMP region, although these can be hidden in called routines.
I combine IMPLICIT NONE and DEFAULT(NONE) so that any variable I have not explicitly listed in a SHARED or PRIVATE clause is reported.
Note that for variables or arrays that are not redefined in the OMP region, assuming shared is a safe outcome, as only those that are redefined may need the private attribute. Taking this into account could simplify the DEFAULT(NONE) response.
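As a sketch of that workflow (illustrative names): with DEFAULT(NONE), any variable used in the region but absent from the clauses is rejected at compile time, which turns the silent wrong-default problem into a hard error:

```fortran
! DEFAULT(NONE): the compiler reports every variable referenced in the
! region that is not listed below, forcing an explicit choice per variable.
!$omp parallel do default(none) &
!$omp    shared(phi, phi_new, n) private(i, j, phi_old, theta)
do j = 1, n
   do i = 1, n
      phi_old = phi(i, j)
      theta = 0.25 * phi_old
      phi_new(i, j) = theta
   end do
end do
!$omp end parallel do
```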
Is this an area that compilers should address, or am I missing this compiler feature?
Very interesting! I have one question though: the two loops over i and j seem independent, so I would add collapse(2) to the directive to improve performance. Without collapse, only the outer loop is parallelized, I think.
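For reference, a sketch of the collapsed form (names illustrative): collapse(2) fuses the two loops into one iteration space of n*n before distributing it over the threads, which helps when n alone is small relative to the thread count:

```fortran
! collapse(2) requires the loops to be perfectly nested: no statements
! may appear between the "do j" and "do i" lines.
!$omp parallel do collapse(2) private(phi_old, theta)
do j = 1, n
   do i = 1, n
      phi_old = phi(i, j)
      theta = 0.25 * phi_old
      phi_new(i, j) = theta
   end do
end do
!$omp end parallel do
```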