OpenMP and FORTRAN

JohnCampbell · April 13, 2024, 7:35am

No, Gfortran is very good. I think the problem is your 12th Gen Intel® Core™ i5-1235U.

A Google search finds: “The i5-1235U is faster in tasks that use only 1–2 CPU cores or threads, such as photo editing. This is because the i5-1235U only has 2 high-performance CPU cores, its other 8 cores are much weaker ‘efficiency’ cores.” (23 Dec 2023)

It is surprising that there are only 2 performance cores, but 8 efficiency cores!
Then there is the cooling problem that any recent Intel laptop has.

suki · April 13, 2024, 7:59am

Yes John, you are spot on; it generates a lot of heat, which is why I keep a cooling fan underneath.

Suki

PierU · April 13, 2024, 8:09am

Indeed. Not only is the speed-up reduced once efficiency cores are used, but I also think that OpenMP runtimes do not handle well (and maybe can’t handle well) the mix of performance and efficiency cores.

Moreover, this CPU has 10 physical cores (not 12); the 2 extra logical cores come from hyper-threading on the 2 performance cores. When running with 12 threads the logical cores are used, but logical cores generally do not help speed up such computations.

Given all of this, the limited speed-up is not surprising.

The small number of performance cores and the cooling problem are linked: more performance cores would produce more heat, and ultimately lead to automatic throttling (frequency downclocking) of the performance cores to reduce the temperature.
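To see the counts the OpenMP runtime itself works with, a minimal sketch along these lines can help. The program name is made up for illustration; the two omp_lib functions are standard. Note that omp_get_num_procs() typically reports logical processors, so on this CPU it would be expected to show 12, not the 10 physical cores.

```fortran
program show_omp_counts
  use omp_lib
  implicit none

  ! Number of processors the OpenMP runtime can see; this usually counts
  ! logical processors, so hyper-threads are included.
  print '(a,i0)', 'omp_get_num_procs()   = ', omp_get_num_procs()

  ! Default team size for the next parallel region
  ! (follows OMP_NUM_THREADS if that variable is set).
  print '(a,i0)', 'omp_get_max_threads() = ', omp_get_max_threads()
end program show_omp_counts
```

Compile with gfortran -fopenmp. Neither value distinguishes performance cores from efficiency cores.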

PierU · April 13, 2024, 8:18am

On CPUs with efficiency cores, it would be worth trying the schedule(nonmonotonic:dynamic) clause in !$omp do. The dynamic schedule can help when the workload is not balanced between the threads (and maybe when the threads do not run on equally performant cores). The problem is that this schedule has more overhead than the default static one, but with the recent nonmonotonic modifier the overhead becomes very limited. I’m not sure that gfortran supports it right now, though (recent versions of ifx do).
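A minimal sketch of what that clause looks like on a reduction loop is given below. The array size and the chunk size of 1024 are arbitrary choices for illustration, not tuned values, and whether a given compiler version accepts the nonmonotonic modifier needs to be checked. If it is rejected, plain schedule(dynamic, 1024) is the fallback.

```fortran
program dynamic_schedule_sketch
  use iso_fortran_env, only: real64
  use omp_lib
  implicit none

  integer, parameter :: n = 1000000      ! arbitrary problem size
  real(real64), allocatable :: a(:)
  real(real64) :: total
  integer :: i

  allocate (a(n))
  a = 1.0_real64
  total = 0.0_real64

  ! Chunks of 1024 iterations are handed out to whichever thread is free,
  ! so a faster core can simply take more chunks than a slower one.
  !$omp parallel do schedule(nonmonotonic:dynamic, 1024) reduction(+:total)
  do i = 1, n
     total = total + a(i)**2
  end do
  !$omp end parallel do

  print *, 'sum =', total, ' max threads =', omp_get_max_threads()
end program dynamic_schedule_sketch
```

A larger chunk size lowers the scheduling overhead but makes the load balancing coarser, so it is worth trying a few values.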

suki · April 13, 2024, 8:19am

Oops!

mojo run num_cpu_cores.mojo
Number of physical cores: 10
Number of logical cores: 12
Number of performance cores: 10

Is this mojo script:

from sys.info import num_physical_cores, num_logical_cores, num_performance_cores

fn main():
	print('Number of physical cores:',num_physical_cores())
	print('Number of logical cores:',num_logical_cores())
	print('Number of performance cores:', num_performance_cores())

being pedantic or wrong?

PierU · April 13, 2024, 8:23am

Definitely wrong (at least if your CPU is really an i5 1235U)

suki · April 13, 2024, 8:28am

Comes from here:

lscpu|head

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      39 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             12
On-line CPU(s) list:                0-11
Vendor ID:                          GenuineIntel
Model name:                         12th Gen Intel(R) Core(TM) i5-1235U
CPU family:                         6
Model:                              154

JohnCampbell · April 13, 2024, 10:16am

@suki
Given that you have reported your program’s performance topping out at 2 threads, this appears consistent with the Google result of only 2 performance cores, not “Number of performance cores: 10”.

I have no experience of using “efficiency” cores with OpenMP, but assuming their instruction set excludes AVX instructions, I would think that running threads with such imbalanced performance would rule out most of the computation types I have.

You could possibly try adding !$OMP& SCHEDULE (DYNAMIC)
that is assuming Gfortran -fopenmp even tries to use the other cores.

I have asked whether Gfortran or other OpenMP implementations utilise “efficiency” cores, but never received a clear answer. !$OMP PARALLEL DO can be quite sensitive to threads of differing performance, so I don’t intend to try.

I currently use a Ryzen 5900X where I have turned off “2-way simultaneous multithreading”, as I get no benefit from the extra threads.
On my Intel 8700K, I also do not use hyper-threading, although most of my calculations are with large arrays where memory bandwidth might be the problem.

For multi-threaded computation, a desktop with adequate cooling could be a much better option. The AMD (non-X) processors running at a slower clock rate look to be a much better solution than an Intel room heater! I wonder where the glossily marketed processors are taking us.

PierU · April 13, 2024, 10:42am

The Intel approach makes sense in a general-purpose laptop (i.e. one not targeted at HPC), where the cooling capability is limited by the size of the laptop. The performance cores can handle peak computational demands for a short period of time, and the efficiency cores handle all the background processes, which are generally not very demanding.

Your Ryzen 5900X has a 105W TDP, which is much too high for most laptops.

Apple has an edge here with its Apple Silicon chips, which have much better efficiency (flops/W) than the current x86 chips. A MacBook Pro can achieve better performance than equivalent x86 laptops, and without heating up as much.

suki · April 13, 2024, 11:01am

On a pseudo-server-grade PC that we assembled, it performs quite well, roughly doubling the speed each time the thread count doubles (again, only up to the 4 performance cores):

$lscpu|head

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Address sizes:                      39 bits physical, 48 bits virtual
Byte Order:                         Little Endian
CPU(s):                             8
On-line CPU(s) list:                0-7
Vendor ID:                          GenuineIntel
Model name:                         Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
CPU family:                         6
Model:                              94

$mojo run num_cpu_cores.mojo

Number of physical cores: 4
Number of logical cores: 8
Number of performance cores: 4

$gfortran ...
Total run in seconds:
1 thread 16.3750
2 threads 8.3750
4 threads 4.4375
8 threads 3.8750
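
As a rough check on these numbers, the small sketch below turns the four wall times into speed-up (time on 1 thread divided by time on n threads) and parallel efficiency (speed-up divided by n). The program and variable names are made up for illustration; only the timings come from the run above.

```fortran
program speedup_table
  implicit none

  integer, parameter :: nruns = 4
  integer, parameter :: threads(nruns) = [1, 2, 4, 8]
  ! Wall times in seconds, copied from the i7-6700 runs above
  real, parameter :: seconds(nruns) = [16.3750, 8.3750, 4.4375, 3.8750]
  real    :: speedup
  integer :: k

  print '(a)', 'threads  speedup  efficiency'
  do k = 1, nruns
     speedup = seconds(1) / seconds(k)              ! T(1) / T(n)
     print '(i7,2f9.2)', threads(k), speedup, speedup / real(threads(k))
  end do
end program speedup_table
```

The speed-up comes out at about 1.96 on 2 threads and 3.69 on 4 threads, but only about 4.23 on 8 threads, which fits the earlier remark that the extra logical cores add little.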

suki · April 13, 2024, 4:01pm

It looks like horses for courses:
https://github.com/Sukii/vortex_simulations/tree/main/benchmark

Mojo SIMD arrays do better on the 12th Gen Intel i5 laptop CPU (12 logical cores),
while Fortran OpenMP does well on the i7-6700 PC CPU (8 logical cores).

Suki

JohnCampbell · October 6, 2024, 3:36am

I have now identified that this problem is probably due to call omp_set_num_threads ().
It occurs on my Windows 10 PC and was introduced in Gfortran Ver 11.3.0; it persists in later versions.
It does not occur in Gfortran Ver 11.1.0 and earlier versions.

I am not sure if this is in Gfortran or in the Windows thread management library of equation.com’s implementation of Gfortran.
Below is code that demonstrates the problem.
My batch test is

call set_gcc 11.1.0

set program=test4

del %program%.exe

gfortran %program%.f90 -O3 -march=native -fopenmp -o %program%.exe

%program%

The simple reproducer below exhibits the problem if “call omp_set_num_threads (4)” is included.

 program test

 !  small reproducer version of OpenMP program hanging on Win 10 / Gfortran 11.3.0
 !  gfortran test.f90 -O3 -march=native -fopenmp -o test.exe

   use iso_fortran_env      ! compiler_version, compiler_options
   use omp_lib              ! explicit interface for omp_set_num_threads
   implicit none

   integer, parameter :: num = 1000
   real    :: A(num)

   real    :: RA
   integer :: i

   write (*,*) 'Vers : ',compiler_version ()
   write (*,*) 'Opts : ',compiler_options ()

   call omp_set_num_threads (4)    ! this causes problem for Gfortran 11.3.0 +

   write (*,*) 'Test n=',num
   A  = 1
   RA = 0

   ! sum of squares of A, accumulated per thread and combined by the reduction
 !$OMP PARALLEL DO private (i) shared (A), REDUCTION (+: RA)
   do i = 1, size(A)
     RA = RA + A(i)**2
   end do
 !$OMP END PARALLEL DO

   RA = sqrt (RA)
   print*,RA,' OpenMP', sqrt(real(num))

 ! problem is demonstrated if program hangs here and does not exit

 end program test

If others can test to identify if it occurs with other OS or other Gfortran implementations, I would be interested to know.
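
One possible way to narrow this down, offered as an untested suggestion and not a known fix, is to run a variant of the reproducer above that never calls omp_set_num_threads and instead requests the team size with the num_threads clause. The program name here is made up for illustration. If this variant exits cleanly on an affected installation while the original hangs, that would point at the call itself and not at the parallel region.

```fortran
program test_num_threads_clause
  use iso_fortran_env, only: compiler_version
  implicit none

  integer, parameter :: num = 1000
  real    :: a(num), ra
  integer :: i

  write (*,*) 'Vers : ', compiler_version ()

  a  = 1
  ra = 0

  ! Team size is requested on the directive; omp_set_num_threads is never called
  !$omp parallel do num_threads(4) shared(a) reduction(+: ra)
  do i = 1, size(a)
     ra = ra + a(i)**2
  end do
  !$omp end parallel do

  print *, sqrt (ra), ' OpenMP', sqrt (real (num))
end program test_num_threads_clause
```

Setting the OMP_NUM_THREADS environment variable before running the original program, with the call commented out, would be another variation to compare.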

MattAl · December 9, 2024, 12:18pm

Not sure if it’s useful, but I had no failures on:

Centos7: gcc 4.8.5, 6.3.0, 7.1.0, 8.2.0, 10.2.0, 12.1.0
Rocky9: gcc 11.4.1, 12.2.0, 14.2.0
