Poor openmp scaling with ifort but not gfortran

I’ve been seeing some strange openmp scaling behavior that I’m not sure how to explain.
I have a simple test program that scales nicely when compiled with gfortran but poorly when compiled with ifort.

My test program is the following:

module parserMod
  use function_parser, only : fparser_array
  implicit none

  type(fparser_array), save :: parser
  !$omp threadprivate(parser)

end module parserMod
subroutine parallelMarbles(marbles, numThreads)

  use parserMod, only : parser
  use iso_fortran_env, only: wp => real64
  use iso_fortran_env, only : output_unit

  real(wp), dimension(6,200000), intent(inout) :: marbles
  integer, intent(in)                     :: numThreads

  integer :: indx
  character(len=1), dimension(3), parameter :: parserVars = ['x', 'y', 'z']

  ! All threads initialize the parser
  !$omp parallel num_threads(numThreads)
  call parser%parse(parserVars,parserVars)
  if (parser%error()) then
    call parser%print_errors(output_unit)
    stop 99
  !$omp end parallel

  !$omp parallel do default(none) &
  !$omp private(indx) &
  !$omp shared(marbles) &
  !$omp num_threads(numThreads)
  do indx = 1, size(marbles(1, :))    
      marbles(1,indx) = 1
      call doWork(marbles(:,indx))
  end do
  !$omp end parallel do

end subroutine parallelMarbles
subroutine doWork(marble)
  use omp_lib, only : omp_get_thread_num
  use parserMod, only : parser
  use iso_fortran_env, only: wp => real64
  implicit none
  real(wp), dimension(6), intent(inout) :: marble

  integer :: indx
  do indx = 1, 200
    marble(2) = mod(indx, 6 + omp_get_thread_num())*marble(1)
    marble(3) = mod(indx, 5 + omp_get_thread_num())*marble(1)
    marble(4) = mod(indx, 4 + omp_get_thread_num())*marble(1)

    call parser%evaluate(marble(1:3), marble(4:6))
    marble(1) = sum(marble(2:))
  end do

end subroutine doWork
program testOMP
  use iso_fortran_env, only: wp => real64

  ! real(wp), allocatable, dimension(:,:)    :: marbles
  real(wp), dimension(6,200000)    :: marbles
  integer                                  :: numThreads
  real                                     :: singleTime, threadTime

  integer :: startTime, endTime, countRate, countMax
  character(len=25)   :: varString

  ! allocate(marbles(6,200000))
  do numThreads = 1, 4, 3
    write(*,*) 'Calling parallel marbles with ', numThreads, ' threads.'  

    call system_clock(startTime, countRate, countMax)
    call parallelMarbles(marbles, numThreads)
    call system_clock(endTime)

    threadTime = (dble(endTime) - dble(startTime))/dble(countRate)

    write (varString, '(F25.6)') threadTime
    write (*, '(A)') ' Loop time = ' // trim(adjustl(varString)) // ' seconds.'

    if (numThreads .eq. 1) then
      singleTime = threadTime

    write (varString, '(F25.6)') singleTime / threadTime
     write (*, '(A)') ' Speedup = ' // trim(adjustl(varString)) // 'x.'

    write(*,*) '------------------------------------------------------'
  end do
end program testOMP

The test uses the fortran_function_parser module:
GitHub - jacobwilliams/fortran_function_parser: Modern Fortran Function Parser.

Any insight into what might be going wrong in ifort and how I might improve the performance with ifort would be greatly appreciated.

1 Like

@jacobwilliams I noticed you’re on here so I thought I might ask you directly.
The test program I posted here basically just calls your fortran_function_parser in an openmp do loop to populate a large preallocated array with values.
The parser is not threadsafe as far as I can tell so I declared it threadprivate so each thread gets its own parser.
Do you have any experience calling the function parser from within an openmp loop, and if so have you seen any similar poor performance with intel’s compiler?

Thanks in advance for any insights you might have.

For a few years now, I have been trying to understand OpenMP performance, especially working with large arrays.
Two frequent problems with testing OpenMP implementations are:

  1. insufficient workload for the OpenMP DO loop; a trivial calculation in the loop is not going to overcome the overhead of initiating the !$OMP region. It takes about 5 micro seconds to initiate a region. That is about 20,000 processor cycles, which looks huge to me. There is also a slight overhead for SCHEDULE(DYNAMIC) vs SCHEDULE(STATIC). Dynamic can be preferred where the thread workloads can be variable. The DYNAMIC overhead can be a minor issue but does highlight the problem of balancing workload between threads.
  2. Increased thread counts can involve increased memory demand. When the memory demand of the combined thread calculation exceeds the cache size, memory access can quickly exceed the memory bandwidth, stalling the thread gains. This appears to be a black art that I am yet to master. A simple OpenMP example is dot_product. Looks good, but to scale up to overcome the startup delay, it will always fail on memory bandwidth. There might be a sweet spot for array size, but my real problems never have that characteristic.

Minor speed differences between gFortran and iFort may come down to optimisation strategies, especially for IF usage or possibly positioning for use of L1 cache.
The more important question should be is OpenMP providing a significant improvement from the single thread case. Hopefully this is a more significant gain than between gFortran and iFort. Where OpenMP is not providing a gain, this is a more challenging problem.
As I am using gFortran for OpenMP, it is good to know both compilers are sharing the better performance for different calculations.


@JohnCampbell thanks so much for your response! (I was beginning to lose hope of getting any responses…)
I think both your points are very good ones. Though I think in this case the 2nd point about memory demand is probably more relevant.
My test case is basically just a single omp do loop that calls a doWork subroutine at each iteration.
The serial version takes about 6.7 seconds in ifort and gfortran.
What’s weird is that using 4 threads gives a ~3.6x speedup in gfortran but a 0.75x speedup in ifort (i.e. slower runtime than serial).

This seems like more than a minor speed difference between gFortran and iFort. In this case gFortran gives near perfect scaling, and ifort doesn’t scale at all!

I’ve played around in vtune a bit and it says that the average cycles per instruction are much higher for the ifort version of my test than for the gfortran version. It also says that there is a much higher L1D replacement % and L2 replacement % in the ifort version.

So it seems like the cache usage is poor in ifort and that is killing the performance.
It’s hard for me to pinpoint what exactly the issue is though and what might be done about it.
The main work routine calls the fortran_function_parser that I linked above in the question. This is a module that “compiles” a math function passed in as a character string and evaluates it. The code is written in modern OOP style with a derived type with allocatable components and type bound procedures. I’m wondering if this “modern” style is tripping up ifort. Though again I’ve been having trouble pinpointing the issue.

Thanks again for your response and I’d very much appreciate any and all additional insights or suggestions.

I do not use OOP at all, but what you are describing (much higher L1 replacement) could suggest that iFort is storeing the type structure in a different way to gFortran.

My approach with derived types is to use them as a problem definition data structure, but use local arrays for inner loop calculation.
My reasoning is that I have a better understanding of the way the calculated data is stored and referenced in the inner loop and so I might better understand the memory demands of the inner loop. My understanding assumes OOP is more expensive for memory usage.

This approach helps my understanding of performance, but others may disagree. Some OOP implementations report good performance which could imply my simplified understanding is wrong.

I do think poor memory usage is a significant failing in my OpenMP implementations.

I do have an interesting time stepping example, when using a shared 20-30 GByte array in a multi-threaded problem. By using !$OMP BARRIER to keep all threads at a similar stage of repeated calculation, this achieves a greater % L3 cache sharing between threads. This results in a halving of total elapsed time (from 5 hours to 3 hours). I look forward to testing this on a new processor with better DDR5 memory bandwidth to see if this supports my understanding of the problem.
The calculation approach demands each thread must scan the large array twice for each time step and I can’t see a way to redefine the group calculation.
This is with dual channel memory. I don’t know what more channels would provide ? And throw in different classes of cores, there is always a new twist to understand !!

Some types of calculations are more suited to OpenMP than others.

A general remark regarding performance (serial and parallel): performance is not portable.

One important question is therefore the used hardware. Maybe is not aware of the processor layout, e.g. because it is an AMD CPU. There are also openMP options to pin threads to certain cores.

Thanks for the additional insights.
Your approach to derived types seems wise. They can help with organizing the code but for the most compute intensive parts of the code simpler is probably better.

Your time stepping example is very interesting. I’ve always heard that avoiding barriers and synchronization is very important for getting good parallel performance so your result is a nice counterintuitive example.

Thanks for your response.
A non Intel CPU would definitely be a good explanation for ifort’s poor performance but in my case I am using Intel CPUs.
I have also played around with OMP_PLACES and OMP_PROC_BIND to set thread affinity but this didn’t seem to help much.

Is it possible to try using MPI instead of openMP?

For ifort, sometimes enabling heap-arrays may have impact on performance too.

I have not had any success using OMP_PROC_BIND or GOMP_CPU_AFFINITY on Windows 7 or 10 OS.
Has anyone been able to demonstrate this can improve performance ?
For my type of shared memory/data calculation, involving a large shared array, OpenMP as a shared memory model looks to have an advantage over (my understanding of) MPI.

I can confirm the findings using ifort (IFORT) 20200623 and gcc version 10.3.0 (Ubuntu 10.3.0-1ubuntu1~20.04) on an Intel(R) Xeon(R) CPU E5-2687W 0

maws01 ➜  gfortran -Ofast -march=native -fopenmp *.f90 
maws01 ➜  ./a.out
 Calling parallel marbles with            1  threads.
 Loop time = 3.174000 seconds.
 Speedup = 1.000000x.
 Calling parallel marbles with            4  threads.
 Loop time = 0.818000 seconds.
 Speedup = 3.880196x.
maws01 ➜  ifort -fast -xHost -qopenmp *.f90
ld: /opt/intel/compilers_and_libraries_2020/linux/lib/intel64/libiomp5.a(ompt-general.o): in function `ompt_pre_init':
(.text+0x2281): warning: Using 'dlopen' in statically linked applications requires at runtime the shared libraries from the glibc version used for linking
maws01 ➜  ./a.out 
 Calling parallel marbles with            1  threads.
 Loop time = 4.291700 seconds.
 Speedup = 1.000000x.
 Calling parallel marbles with            4  threads.
 Loop time = 41.886398 seconds.
 Speedup = 0.102460x.

The poor performance is directly related to

call parser%evaluate(marble(1:3), marble(4:6))

replacing this with

marble(1:3) = evaluate(marble(1:3) , marble(4:6))


elemental function evaluate(a,b) result(c)                                                          
  use iso_fortran_env                                                                               
  real(real64), intent(in) :: a,b                                                                   
  real(real64) :: c                                                                                 
  c = (a*b)**2                                                                                      
end function 

gives a near optimal speedup for gfortran and ifort.

I must admit that I did not look into parser%evaluate in detail, but it seems quite complex with many branches, testing for the status of allocated arrays etc. Such things should be avoided in a hot loop.

Thank you for taking the time to confirm my findings. I very much appreciate it.
I have reported this issue to Intel and they are looking into it.
One interesting thing I noticed is that this issue doesn’t seem to occur if I give the parser a more complicated expression to evaluate.
So for example changing:

  ! All threads initialize the parser
  !$omp parallel num_threads(numThreads)
  call parser%parse(parserVars,parserVars)
  if (parser%error()) then
    call parser%print_errors(output_unit)
    stop 99
  !$omp end parallel

to the following:

  ! All threads initialize the parser
  !$omp parallel num_threads(numThreads)
  call parser%parse('sin(x) +cos(y)+cos(z)*exp(x*y)',parserVars)
  if (parser%error()) then
    call parser%print_errors(output_unit)
    stop 99
  !$omp end parallel

gives good scaling in ifort as well.

Do you have absolute number for new example? If the more complex evaluation takes significantly more time and runs fully parallel, any non-parallelizable overhead becomes less important.