OpenMP and FORTRAN

Hi,

I put my 30-year-old FORTRAN code for 2D Euler-flow turbulence simulations here:

It runs fine on my 12-core, 15 GB RAM laptop with all cores busy, but I am wondering whether I am still missing some optimization trick, e.g. gfortran flags such as -O3, -Ofast, or -fopenmp.
I added OpenMP directives like:
USE OMP_LIB
!$OMP PARALLEL

Am I missing any other trick in the book …

Thanks,
Suki

1 Like

Welcome to the forum!

I don’t know your code, so it is difficult to make any specific recommendations, but you may want to experiment with coarrays. That would turn your program into a set of processes that run in parallel and exchange data at designated points. OpenMP works with threads, which all live in the same program, so it may or may not be efficient (threads can get in each other’s way more easily than separate processes can). But coarrays do take a bit of work.
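Just to make the idea concrete, here is a minimal coarray sketch (not tied to your code): each image computes a partial result and the data exchange happens at a sync all point. With gfortran it needs -fcoarray=single (one image) or an OpenCoarrays build for multiple images.

program coarray_sketch
   implicit none
   integer :: i, n
   real    :: partial, total[*]       ! "total" is a coarray: one copy per image

   n = 1000000
   partial = 0.0

   ! each image sums its own strided share of the work
   do i = this_image(), n, num_images()
      partial = partial + 1.0/real(i)
   end do

   total = partial
   sync all                           ! designated data-exchange point

   if (this_image() == 1) then
      do i = 2, num_images()
         total = total + total[i]     ! pull the partial sums from the other images
      end do
      print *, 'harmonic sum =', total
   end if
end program coarray_sketch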

On the other hand, OpenMP also allows offloading to GPUs, and you might even combine coarrays and OpenMP.
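A hedged sketch of what offloading looks like (it needs an offload-capable compiler build, e.g. gfortran with an nvptx/amdgcn backend or ifx; without one the loop simply runs on the host, and the array name here is purely illustrative):

program offload_sketch
   implicit none
   integer, parameter :: n = 1000000
   integer :: i
   real :: x(n)

   x = 1.0
   !$omp target teams distribute parallel do map(tofrom: x)
   do i = 1, n
      x(i) = 2.0*x(i) + 1.0
   end do
   !$omp end target teams distribute parallel do

   print *, x(1), x(n)
end program offload_sketch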

Just a few suggestions 🙂

1 Like

Hi,

Thanks for your suggestions.
I don’t have any GPUs in my laptop, but I can run it on some fancy GPU on a cloud provider.

I intend to port it to Mojo (a Python-superset language: https://www.modular.com/max/mojo).
To benchmark Fortran vs. Python vs. Mojo, I am trying to make sure that I first do the best I can with the Fortran code.

Suki

OpenMP is often the optimal parallelization solution on shared-memory systems, in the sense that:

  • it is simple, and once the directives are added the code can still be compiled without OpenMP if desired
  • many OpenMP parallelizations are straightforward

However, they are not all straightforward, and it’s difficult to comment further without seeing the code. Also, it can be trickier on NUMA (Non-Uniform Memory Access) architectures (many AMD CPUs, for instance, are NUMA).

You can start evaluating how close to optimal your OpenMP parallelization is by timing your serial code first, then the OpenMP version with different numbers of threads: 1, 2, 4, 8…
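As a rough illustration (the inner loop here is just a stand-in for your real kernel), something along these lines times one region for several thread counts in a single run:

program scaling_test
   use omp_lib
   implicit none
   integer, parameter :: n = 5000000
   integer :: nt, i
   integer :: threads(5) = [ 1, 2, 4, 8, 12 ]
   real(8) :: t0, t1, s
   real(8), allocatable :: a(:)

   allocate ( a(n) )
   call random_number ( a )

   do nt = 1, size(threads)
      call omp_set_num_threads ( threads(nt) )
      t0 = omp_get_wtime ()
      s  = 0.0d0
      !$omp parallel do reduction(+:s)
      do i = 1, n
         s = s + a(i)*a(i)            ! stand-in for the real kernel
      end do
      !$omp end parallel do
      t1 = omp_get_wtime ()
      print '(i3," threads:",f8.3," sec   sum =",es12.5)', threads(nt), t1-t0, s
   end do
end program scaling_test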

Hi! I’m not a Fortran expert, nor a turbulence/CFD person. But I do have some experience with MHD modelling (including turbulence analysis) and with the corresponding numerical simulation codes (MPI-AMRVAC and Lare3D in my case). Both codes are relatively recent; they are written in Fortran, they use MPI parallelization (OpenMP is also available) and so can run on clusters, and they target 3D or N-D problems (i.e. they are certainly not limited to 2D).
I know that the authors, which are actually university teams, constantly update their codes, introducing new numerical schemes and approaches and improving performance; over the last few decades a lot of different implementations (at least implementations, I’m not sure it’s sufficient to call them “methods”) of various methods and schemes for plasma numerical simulation have appeared.

Thus the answer to your question depends in particular on your motivation. Why do you want to improve your code? Apparently there is some room for improvement (at the very least you can extend the parallelization and/or add the ability to run it with MPI, say on clusters or GPUs). But do you have enough resources to implement that? Staying with Fortran, you can be sure you get the maximum speed and can relatively easily scale your code to larger grids.

I looked at the code free_vortices_2d.f90, as it appears to be stand-alone, and tried to clean up the OMP implementation by:

#moving arrays into a module, to avoid any possible stack problems (your arrays could be allocatable and smaller?)
module vor_arrays
integer, parameter :: max_m = 100000
REAL, DIMENSION(max_m,3) :: vortices !-- array holds 2D vortices
REAL, DIMENSION(max_m,3) :: vor !-- temporary array holds 2D vortices
end module vor_arrays
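(If the arrays were made allocatable as suggested above, a hypothetical variant of the module could look like this; the subroutine name is made up.)

module vor_arrays_alloc
   implicit none
   real, allocatable :: vortices(:,:)   !-- (n,3) 2D vortices
   real, allocatable :: vor(:,:)        !-- (n,3) temporary copy
contains
   subroutine alloc_vortices (n)
      integer, intent(in) :: n
      allocate ( vortices(n,3), vor(n,3) )
   end subroutine alloc_vortices
end module vor_arrays_alloc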

#including the following code to control the number of threads and test alternatives.
call omp_set_num_threads ( num_threads )

#moving the following code into subroutine rvelocity to clean up OMP usage, especially by explicitly defining the SHARED and PRIVATE attributes of the variables.
I think this may have been a problem with your tests, especially for array vor.
!$OMP PARALLEL DO SHARED ( vortices, vor, M, DT, sigma ) &
!$OMP& PRIVATE (i, x,y,w, dx,dy, j, qx,qy,qw, pqx,pqy,rpq, gexp_value )

#Including functions elapse_seconds () and delta_seconds () to provide timing.
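(A guess at what such helpers could look like, based on system_clock; the actual routines in the attached file may differ.)

module timing
   implicit none
   integer(8), private, save :: last_tick = -1
contains
   real(8) function elapse_seconds ()       ! seconds since an arbitrary origin
      integer(8) :: tick, rate
      call system_clock ( tick, rate )
      elapse_seconds = real(tick,8) / real(rate,8)
   end function elapse_seconds

   real(8) function delta_seconds ()        ! seconds since the previous call
      integer(8) :: tick, rate
      call system_clock ( tick, rate )
      if ( last_tick < 0 ) last_tick = tick
      delta_seconds = real(tick-last_tick,8) / real(rate,8)
      last_tick = tick
   end function delta_seconds
end module timing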

#I did not like your file unit numbers, so I used lu_NIT = 11 for file access.

#At the end of the run I got the following message:
“Note : The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG”
This can occur with randomly generated data, and these exceptions can extend run times.
Changing to 8-byte reals may help, due to their extended exponent range.
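(One way to make the real kind switchable, assumed here to match the use kinds that appears in a later version of the code below:)

module kinds
   use, intrinsic :: iso_fortran_env, only : real32, real64
   implicit none
   integer, parameter :: wp = real64    ! switch to real32 to compare 4-byte reals
end module kinds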

The code I generated is:
free_vortices2d.f90 (5.2 KB)

My Windows build is:

set prog=free_vortices2d

set basic=-fimplicit-none -fallow-argument-mismatch -O2 -march=native -ffast-math
set vec=-fimplicit-none -fallow-argument-mismatch -O3 -march=native -ffast-math -funroll-loops --param max-unroll-times=2
set omps=-fopenmp -fstack-arrays
set omp=-fopenmp

del %prog%.exe

gfortran %prog%.f90 %basic% %omp% -v -o %prog%.exe

%prog%

The net result is that performance is closely related to the number of threads.
Some of these points may be helpful, while others may just be arbitrary style.

Following @JohnCampbell’s comments I took a look at your code, and indeed, before worrying about the best performance you should first worry about correctness: your !$OMP DO loop doesn’t declare any private variables and you have plenty of race conditions (different threads trying to update the same variables in memory simultaneously). You can’t get correct results with the code as it is.

I get the same results with and without OMP and have checked them against Lua-Love2D simulations.
I get a 4-fold speed-up with OMP (running on 12 CPU cores).
So something in my code must be working.

Suki

BTW: I am grateful for all the comments and suggestions.

Suki

1 Like

My bad, all the local variables of the routine are de facto private to each thread, since the routine is called from within the parallel region. So it’s actually OK. I always put the worksharing constructs in the same routine as the thread-creation construct, so I always have to declare the private variables, and that’s why I was initially surprised by the absence of any private clause…
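For anyone following along, a minimal example of that pattern (illustrative names only): the !$omp do is “orphaned” inside a routine called from the parallel region, so the routine’s locals are automatically private to each thread.

program orphaned_do
   implicit none
   integer, parameter :: n = 1000
   real :: a(n)

   a = 1.0
   !$omp parallel
   call scale_add ( a, n )        ! worksharing construct lives inside the routine
   !$omp end parallel
   print *, a(1), a(n)

contains

   subroutine scale_add ( a, n )
      integer, intent(in)    :: n
      real,    intent(inout) :: a(n)
      integer :: i                ! local, hence private to each thread
      real    :: tmp              ! local, hence private to each thread
      !$omp do
      do i = 1, n
         tmp  = 2.0*a(i)
         a(i) = tmp + 1.0
      end do
      !$omp end do
   end subroutine scale_add

end program orphaned_do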

Have you timed just the parallel region, or the whole program execution? What is your exact CPU model?

Attached is an improved version which transposes the arrays, allows for different real kinds, and tests a range of thread counts.

Surprisingly:

  • the array transpose has only a small effect, but would improve cache efficiency
  • real(8) is slightly faster than real(4)
  • underflow still occurs with real(8) (but might be less frequent)

Speed vs. threads shows reasonable OMP efficiency on an AMD 5900X for 50 steps:
1 thread : 47.22 sec
2 threads : 24.38 sec (97%)
4 threads : 13.07 sec (90%)
8 threads : 7.36 sec (80%)
12 threads : 5.60 sec (70%)

This decline in % efficiency is reasonable, as the elapsed time includes the single-threaded update and the reporting of results…

This updated test code is
free_vortices2d8.f90 (5.8 KB)
log of the test is
free_vortices2d8.txt (25.6 KB)

Use of OMP is effective for this computation.

1 Like

This is probably because the original 2nd dimension has only 3 elements. This means that for the vortices or vor array there are 3 active cache lines, instead of the single one there would be if the array were transposed: this is not that bad and does not result in too many cache misses.
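For reference, the two layouts look like this (assuming the three columns hold x, y and the vortex strength; Fortran is column-major, so the transposed form keeps the components of one vortex contiguous):

module layouts
   implicit none
   integer, parameter :: max_m = 100000

   ! original layout: one component of every vortex per column, so the three
   ! components of vortex i sit in three different cache lines
   real, dimension(max_m,3) :: vortices_orig

   ! transposed layout: the three components of vortex i are contiguous,
   ! so they usually share a single cache line
   real, dimension(3,max_m) :: vortices_t
end module layouts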

Thanks a million!

Suki

My testing of this free_vortices2d.f90 using Gfortran Ver 12.3.0 is producing some unusual results due to IEEE_UNDERFLOW_FLAG.
While this may be due to the artificial initial values in array vortices, these UNDERFLOW events can play havoc with any performance testing.

Does anyone know if this IEEE_UNDERFLOW support is new to Gfortran 12.3.0?

Significantly, at STOP the program reports a “Note” but does not exit.
If I batch a sequence of programs to run, having a program fail to exit could be a significant issue.

I am not familiar with this behaviour, and certainly do not want it as the default!!

Is this another example of the dangers of moving to a newer version of the compiler?

Update
I resolved the IEEE_UNDERFLOW problems with changes to ggexp, with a reduction in run times, but the program still does not terminate successfully. A bug somewhere?

  function ggexp (rs, sigma)
    use kinds
    implicit none
    REAL(wp) :: ggexp, rs, sigma, value, result

    value = -rs/sigma

    if ( value < -45 ) then
      result = 1.0_wp                  ! exp(value) would underflow and is negligible next to 1
    else
      result = 1.0_wp - exp ( value )
    end if

    ggexp = result

  end function ggexp

I looked into defining this magic constant in the code, and this is what I came up with:

program texp
   use, intrinsic ::  iso_fortran_env, only: wp=>real64
   real(wp), parameter :: esmallx = log(0.25_wp*epsilon(1.0_wp))
   write(*,*) epsilon(esmallx), esmallx, exp(esmallx), 1.0_wp - exp(esmallx)
end program texp

$ gfortran texp.f90 && a.out
   2.2204460492503131E-016  -37.429947750237048        5.5511151231257759E-017   1.0000000000000000

I’m not entirely sure why the 0.25_wp is necessary in that parameter expression, but if it is removed, then the 1.0_wp-exp(esmallx) expression is not equal to 1.0.

In any case, this seems to give the correct value for the function result while avoiding the underflow, and it should work for any real kind.

Yes, the same here!

Suki

Yes, my
GNU Fortran (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0
also breaks off at 8 threads:
8 threads are in use
Note: The following floating-point exceptions are signalling: IEEE_UNDERFLOW_FLAG IEEE_DENORMAL

Oops!
According to this, it is not actually an ERROR at all:
https://stackoverflow.com/questions/44308577/ieee-underflow-flag-ieee-denormal-in-fortran-77

If we comment out the STOP statement, this message goes away.

Hi,

12th Gen Intel® Core™ i5-1235U × 12
GNU Fortran (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0

gfortran -Ofast -O3 -o free_vortices2d free_vortices2d8.f90 -fopenmp
./free_vortices2d

Total run in seconds:
1 thread 50.7344
2 threads 25.9453
4 threads 28.3242
8 threads 29.4375
12 threads 27.9766

No improvement after 2 threads (in fact a slight deterioration).
Should I be using ifort instead of gfortran?
When I replaced gfortran with ifort, I got strange results …

Thanks Ron,

I (quickly) tried -45 as it is about LOG(1.e-20), but have now adopted int( log( epsilon(1.0_wp) ) ) = -36, based on your more rigorous approach.
-36 is better as it eliminates many more unnecessary EXP calculations.

The key number for eliminating IEEE_UNDERFLOW is int( log( tiny(1.0_wp) ) ), but any value smaller than epsilon(x) is unnecessary anyway.
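A quick way to print both cutoffs for the real(8) kind:

program cutoffs
   use, intrinsic :: iso_fortran_env, only : wp => real64
   implicit none
   ! exp(x) underflows roughly below log(tiny(x)); its contribution is already
   ! negligible next to 1.0 below log(epsilon(x))
   write(*,*) 'int(log(tiny))    =', int( log( tiny(1.0_wp) ) )      ! about -708
   write(*,*) 'int(log(epsilon)) =', int( log( epsilon(1.0_wp) ) )   ! about -36
end program cutoffs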

This has eliminated the IEEE_UNDERFLOW interrupts and so has significantly sped up the calculation.

My performance with GFortran on my Ryzen 5900X (without hyper-threading) shows good multi-thread efficiency for up to 12 threads.
Excluding the file write (an option in the latest test), the OMP run-time components for 50 steps with 1, 2, 4, 8 and 12 threads are:
1 thread : 13.24 sec : 100%
2 threads : 6.67 sec : 99%
4 threads : 3.43 sec : 96%
8 threads : 1.78 sec : 93%
12 threads : 1.23 sec : 90%

Perhaps the poor OMP performance reported elsewhere is on slower hardware where overheating or a small cache is an issue? Eliminating the IEEE_UNDERFLOW interrupts is a significant change to performance.

this latest program version is
free_vortices2d8.f90 (9.0 KB)

the run time log is
free_vortices2d8.txt (15.5 KB)

However, I still have a problem with !$OMP with Gfortran 12.3.0 on Windows 10: the program does not exit, but hangs. I have not been able to identify the cause.

There also remains a problem with subroutine rvelocity, which has a do i / do j scan over all vortex interactions, most of which appear unnecessary. Perhaps there should be a better filtering process, if this were not just an OMP benchmark?

1 Like