Why is my code compiled with GFortran on Windows slower than on Ubuntu?

Why use DO CONCURRENT?
What does it imply? (I am not sure.)
Potentially (depending on the compiler) it could initialise some MPI interfaces, which would be totally unnecessary. All you need is DO.
If you put it in the code, then the next person to maintain the code will have to answer these questions.
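For reference, a minimal sketch of the two forms (the loop body and names are made up for illustration):

! Plain DO: iterations execute in order; dependences between them are allowed.
do i = 1, n
  y(i) = exp ( x(i) )
end do

! DO CONCURRENT: the programmer asserts the iterations are independent,
! so the compiler MAY vectorise or parallelise them, but is not required to.
do concurrent ( i = 1:n )
  y(i) = exp ( x(i) )
end do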

Your code sample also includes “mgauss_ik”, which I presume modifies the loop count depending on the data set (of different random numbers). This could change the amount of computation, since the RANDOM_NUMBER generator differs between compilers.
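If run-to-run reproducibility with a single compiler is wanted, the seed can be fixed; a minimal sketch (the seed value is arbitrary):

! Fix the seed so repeated runs with ONE compiler are reproducible.
! The underlying generator still differs between compilers, so any loop
! count derived from RANDOM_NUMBER remains compiler dependent.
integer :: n
integer, allocatable :: seed(:)
call random_seed ( size=n )
allocate ( seed(n) )
seed = 12345
call random_seed ( put=seed )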

I looked at function pYq_i_detail (90% of the computation in my profiling), as it did not appear to respond to -ffast-math or -O. I suspect it does not utilise AVX, so I tried introducing array syntax to see if it worked any better. It did not :frowning:
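The kind of rewrite I tried was along these lines (a generic sketch only; the names and the Gaussian kernel here are made up, not the actual pYq_i_detail):

! DO-loop form
s = 0
do j = 1, m
  s = s + w(j) * exp ( -(x(j)-mu)**2 / (2*sig**2) )
end do

! equivalent array-expression form, hoping the compiler emits AVX code
s = sum ( w(1:m) * exp ( -(x(1:m)-mu)**2 / (2*sig**2) ) )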

Handling of IEEE exceptions is very compiler dependent. I have found the equation.com build of gfortran to be slow for this. These exceptions occur more often in performance tests than with real data. I wasted months on this issue until mecej4 identified the problem for me. You need sufficiently realistic data sets for testing to avoid these unwanted side issues.
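One way to check whether a test data set is triggering such exceptions is the standard ieee_exceptions module (Fortran 2003; a sketch only, and compiler support varies):

use, intrinsic :: ieee_exceptions
logical :: raised

call ieee_set_flag ( ieee_underflow, .false. )   ! clear the flag first
! ... run the section under test ...
call ieee_get_flag ( ieee_underflow, raised )
if ( raised ) print *, 'IEEE underflow occurred : denormals can be slow'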

I introduced some profiling into the code, and produced the following:

#### Delta_Sec Summary ####   12

 Id Description                      Elapsed    Calls
  1 _START                            0.0000        1
  2 # pYq_i_detail                    3.5952    10201
  3 INITIALISED Yji                   0.0006        1
  4 prep > gauss_thetas               0.0961      102
  5 prep > MC_gauss_ptheta_w_sig      0.0000      102
  6 Metroplis_gik_k_more_o_log        0.4669       50
  7 CC Metroplis_gik_all_o_log        0.0679       50
  8 CC mgauss_ik(i,k)                 0.0006       50
  9 steptest report                   0.0095       50
 10 cpu_time report                   0.0011       50
 11 ANALYSED                          0.0026        1
 12 _FINISHED                         4.2406    10659
  calls to pYq_i_detail =             10198404
 Program end normally.

Note: ‘# pYq_i_detail’ is reported at exit from subroutine MC_gauss_ptheta_w_sig, which is called from subroutine prep, where m = mgauss_ik(i,k). This count is significant for the comparison.
(Times are for an i5-2300.)


@mecej4, @JohnCampbell, @oscardssmith, if you are interested in the Julia version, below is the link:

You can copy all the files into one folder, then in the cmd window do:

julia EM_mix.jl

My Julia version is 1.6.1. I remember that on Windows it previously took 1.5 s; now it takes 3 s :rofl:, perhaps because I upgraded my Windows 10 from 1909 to 21H2.

I am not an expert in Julia, so my Julia version may not be the most performant and may look weird.
I know that if Julia code looks like Fortran it should perform like Fortran, so my Julia code perhaps looks like Fortran. LOL.
The thing about Julia is that Julia experts can do a lot of ‘fine tuning’ here and there in their code to make it fast, but I feel most of that should be done by the compiler. It bothers me a little if I have to do all that optimization manually here and there.

Since the Julia version did not perform as well as Intel Fortran on Windows, I did not install Julia on Linux to test it there. :sweat_smile:

Anyway, the point is not optimizing the code; the code is just a tiny illustration and has no real use.

I just wish gfortran’s performance on Windows could be consistent and about as good as on Linux. I also hope there will be an easy way to use gfortran and MPI on Windows. I know Cygwin64 can provide gfortran + OpenMPI on Windows. However, Cygwin64 gfortran may be slow on Windows for some complicated big codes, and adding OpenMPI just barely recovers the single-core performance seen on Linux. :rofl:

I really appreciate your endeavor @JohnCampbell ! Thank you so much!

About do concurrent, I agree.
I used it in the hope that it can really do some parallelization automatically, and perhaps make things work on a GPU. But Intel’s compiler seems to have some issues with it; here is a post about the issue, where you also replied :slight_smile:

I personally did not find much performance advantage in do concurrent, other than perhaps making the code look more concise.

Thank you for being so careful :+1: :100: Yes, “mgauss_ik” is just there to dynamically adjust the number of samples (for the given i,k) used for the Monte Carlo integral, like below,


where n_ik is actually line 136 in samplers.f90,

“mgauss_ik” is actually not very useful; one can just comment out lines 221 to 224 in samplers.f90 as below,

and just do

mgauss_ik = mgauss

so mgauss_ik will always be the constant mgauss, and for each n_ik the number of Monte Carlo samples is the same mgauss, typically 1000.
The reason for “mgauss_ik” is that, say, for k=2 (a 2-Gaussian mixture), the total number of samples for n_i1 and n_i2 is fixed at k*mgauss; if mgauss=1000 and k=2, then k*mgauss=2000. However, perhaps n_i1 needs more samples than n_i2, so I may put 1500 samples on n_i1 and 500 on n_i2, i.e. “mgauss_i1=1500”, “mgauss_i2=500”, etc. In this way the total 2000 samples are distributed more efficiently between n_i1 and n_i2, instead of each simply getting 1000.
No worries. In short, “mgauss_ik” does not really influence the results and does not depend on the seed too much. You know, if the result of a Monte Carlo simulation depends heavily on the random number seed, then something must be wrong :rofl:
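In rough pseudo-Fortran, the idea is something like this (w is a made-up weight array here; the real criterion is in samplers.f90):

! distribute the fixed budget of k*mgauss samples across the k components
! in proportion to a weight, instead of a flat mgauss samples each
total = k * mgauss
do kk = 1, k
  mgauss_ik(i,kk) = nint ( total * w(kk) / sum ( w(1:k) ) )
end do
! rounding may need a final adjustment so the counts sum exactly to total
! e.g. k=2, mgauss=1000, w=(/3.0,1.0/) gives 1500 and 500 samples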

By the way, how did you get the profile information above?


I tried gprof on Windows, but it always generates an empty profile file; perhaps I will open a new topic to ask about this.

Again, thank you so much! :+1: :100: :slight_smile:

I did the profiling “manually” by placing “call delta_sec ( description )” at the end of each section of code where timing could be informative.
I modified the original code to achieve this.
It is based on SYSTEM_CLOCK using 8-byte integers for higher precision. ( rate ~ 3 million implies about 1000 processor cycles per tick, so it cannot profile tight code, but it is much better than CPU_TIME, which reports only 64 ticks per second )
I started by inserting a few calls, then adapted them to identify the key areas.
It is an iterative process of finding the places that best reveal the relative times of the significant sections; an easy way to monitor key performance.
There is an overhead if there are too many calls to delta_sec. Using a ‘# …’ description suppresses the per-call report but still accumulates the time, which is useful for understanding relative times.
subroutine delta_sec ( description ) is a simple idea that can be modified to suit the program being profiled.

The idea is based on the Salford FTN95 compiler, which profiles all routines compiled with the /profile option.

The Delta_Sec summary gives a clear indication of relative times/importance.
Special descriptions are:
_START starts the summary process; it should be the first call.
_FINISHED reports the summary times and resets the counters, allowing a restart after the final report.
description(1:1) = ‘#’ accumulates the time but suppresses the per-call report (good if there are lots of calls).

!  first few lines of code
    open ( 6, file='EM_mix.log' )

    call delta_sec ( '_START' )
    call delta_sec ( '# pYq_i_detail' )

!  code for profiling report
    subroutine delta_sec ( description )
      character*(*) description
!
      integer*8          :: tick, rate
      integer*8, save    :: last_tick=-1        ! tick when delta_sec was last called
      real*8             :: sec
      real*8, save       :: all_sec = 0
!
      logical, save      :: do_summary = .false.
      integer*4, save    :: nt=0, i
      character*30, save :: list_of_descriptions(50)=' '
      integer*4, save    :: num_calls(50)=0
      real*8, save       :: times(50)=0

!   Get ticks since last call
      call system_clock ( tick, rate )
      if ( last_tick < 0 .or. description == '_START') then
        last_tick  = tick
        all_sec    = 0
        do_summary = description == '_START'
        nt         = 0
      end if

!   report this time interval : ignore if #....
      sec = dble(tick-last_tick) / dble(rate)
      all_sec = all_sec + sec
      if ( description(1:1) /= '#' )  &
      write (6,11) description, sec, all_sec
      last_tick = tick
!
      if ( .not. do_summary ) return
!
!   save all times for final summary report if selected
      do i = 1,nt
        if ( list_of_descriptions(i) /= description ) cycle
        times(i)     = times(i)     + sec
        num_calls(i) = num_calls(i) + 1
        exit
      end do

!   add if new description to list of descriptions
      if ( i > nt ) then
        if ( nt < size(times) ) nt = nt+1
        list_of_descriptions(nt) = description
        times(nt)                = times(nt)     + sec
        num_calls(nt)            = num_calls(nt) + 1
      end if

!   report summary times if finished
      if ( description == '_FINISHED') then
        write (6,10) nt
        times(nt) = sum(times(1:nt))
        num_calls(nt)   = sum(num_calls(1:nt))
        do i = 1,nt
          write (6,12) i, list_of_descriptions(i), times(i), num_calls(i)
          times(i)     = 0
          num_calls(i) = 0
        end do
      end if

  10  format (/'#### Delta_Sec Summary #### ',i4//  &
               ' Id Description                      Elapsed    Calls')
  11  format ('#### delta_sec #### ',a,t50,2f10.4)
  12  format (i3,' ',a, f10.4, i9)
    end subroutine delta_sec

A current Reddit thread demonstrates that the speed of an executable generated by a compiler can depend greatly on the options used.


Thanks @Beliavsky :grinning:
My flags for this small code are just

-Ofast -march=native

I mean, in my experience, for exactly the same code with the same flags, whether simple or complicated, the performance of different gfortran versions on Windows seems less consistent than on Linux.
If some particular optimization flags could make gfortran on Windows as fast as its native speed on Linux, that would be great. But ideally I wish one could use the same flags on both Windows and Linux and get equally fast speed on both.

The Reddit post that @Beliavsky pointed to demonstrates nothing, I’m sorry to point out, although I wholeheartedly agree that compiler flags can greatly affect the speed of the resulting program. There are lots of compiler flags in the commands shown on the Reddit page, but none of the files listed in the commands are source files! Only linker options have any effect in such a situation, and using different libraries (e.g., RefBlas versus MKL) could change the run-durations. That, however, is not the point being stressed.

The wish expressed by @CRquantum, “But ideally I wish one can use the same flags for both Windows and Linux, and speed on Windows and Linux are equally fast”, is not realistic for any application that spends some of its time in the compiler’s RTL and/or system services. On Windows, EXEs and DLLs produced by Gfortran need wrapper code (or a translation layer) that converts Linux C library and system calls to Windows-compatible library and system calls.

Thanks @mecej4 .
But why is Intel Fortran’s performance consistent on Windows and Linux? I mean, with the same code and the same optimization flags, I always find that Intel Fortran performs the same on Windows and Linux.
Consider just Windows:
perhaps at the compile stage both Intel Fortran and gfortran are fine, but at the link and build stage, the fact that Intel Fortran relies on Visual Studio while gfortran relies on a translation layer may make the difference. The result is that while Intel Fortran’s performance is consistent across Windows and Linux, gfortran’s performance is compromised, and the various Windows builds of gfortran differ from each other.

On Windows, it seems there are basically two branches of gfortran.

  1. gfortran in Cygwin64 and MSYS2 performs the same (perhaps the one in MinGW64 does too). This build performs well for this small code of mine, but for more complicated code it can be 6X slower than on Linux.

  2. gfortran from equation.com. It is 4X slower than the Cygwin64 gfortran for my small code, but not too bad (30-50% slower than on Linux) for more complicated code.

On Windows, it seems Intel OneAPI could be the best choice, not only from a performance point of view but also because it has MPI integrated.
On the other hand, gfortran performs well in any Linux-related setting, whether native Linux, Linux in WSL or Hyper-V, or a virtual machine like VMware.

As I said, if there were a very performant gfortran with MPI configured on Windows, it would be really great for building commercial software on. That could be good for the Fortran community, because we need more people to really use Fortran and rely on it, directly or indirectly. Imagine if Microsoft Office or many famous video games were mostly written in Fortran; then I guess there would be constant money/funding devoted to the Fortran community, helping Fortran develop better. Otherwise Fortran may remain mostly in academia and HPC, and without a sufficiently large and diverse user base its potential may be limited in some ways. Well, that is off topic. :sweat_smile:

In short, I just want gfortran to be good on Windows too! :sweat_smile:

The gfortran approach of supporting many different hardware platforms and operating systems is achieved by having different “.dll” interfaces to the OS. The particular variant used is important.

My interpretation of this thread is that the gfortran interface to Windows provided by equation.com (EQ) may be deficient for exp and IEEE error handling, in comparison to other implementations.
Generalising this to “6X slower than on Linux” is not a valid conclusion.
My testing of the mingw-w64 and EQ versions of 64-bit gfortran shows the EQ version to be roughly 5% faster for the tests that I do, although how the cache management varies is a significant mystery. I have no experience of Linux but would not expect a significant change.

-Ofast?
I have assumed this to be an aggressive option, so I prefer -O2, or -O3 for code where I judge optimisation should not cause problems, e.g. simple DO loops.
I nearly always use “-fimplicit-none -march=native -ffast-math -fopenmp -fstack-arrays” as go-to options and sometimes “-g” or “-funroll-loops --param max-unroll-times=2”. ( should not use goto :slight_smile: )
Using some of these is more a hope that they will help, unlike -fopenmp, which definitely changes the compiled code.
Following on from another thread, I need to reinvestigate the use of “-ffast-math -fopenmp”, especially where “low arithmetic intensity” is identified.
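For example, a typical build line with these options would look something like this (the file name is a placeholder, with my conservative -O2 choice):

gfortran -O2 -fimplicit-none -march=native -ffast-math -fopenmp -fstack-arrays my_prog.f90 -o my_prog.exe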


I have now tried using the gfortran profiling utility: gprof.
This is my first use of gprof, so my option selection may not be the best, but I get a useful table.
It is much easier than my approach.

The batch file I used to test with the profiling report is:

del *.o
del *.mod
del em_mix.exe

set options=-g -fimplicit-none -fallow-argument-mismatch -march=native -pg

gfortran ran.f90 samplers.f90 em_mix.f90 %options% -o em_mix.exe

dir *.exe

em_mix
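rem ( running em_mix above writes gmon.out, which gprof reads together with the exe )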

gprof -b -J -p em_mix.exe > em_mix_profile.log

notepad em_mix_profile.log

notepad em_mix.log

The resulting profile log is useful:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 56.00      1.82     1.82                             exp
 26.46      2.68     0.86 10198404     0.00     0.00  __samplers_MOD_pyq_i_detail
  3.38      2.79     0.11                             __logl_internal
  3.08      2.89     0.10    10200     0.01     0.09  __samplers_MOD_mc_gauss_ptheta_w_sig
  2.15      2.96     0.07                             __fentry__
  2.15      3.03     0.07                             _mcount_private
  1.54      3.08     0.05  1000000     0.00     0.00  __samplers_MOD_pyq_more_o_log
  1.54      3.13     0.05      100     0.50     1.20  __samplers_MOD_metroplis_gik_k_more_o_log
  0.92      3.16     0.03  6040500     0.00     0.00  __random2_MOD_ran1
  0.62      3.18     0.02                             __cosl_internal
  0.62      3.20     0.02                             log
  0.31      3.21     0.01  4000008     0.00     0.00  __random_MOD_randn
  0.31      3.22     0.01      208     0.05     0.05  __random_MOD_gaussian
  0.31      3.23     0.01                             __sinl_internal
  0.31      3.24     0.01                             cos
  0.31      3.25     0.01                             sin
  0.00      3.25     0.00    10660     0.00     0.00  delta_sec_
  0.00      3.25     0.00      102     0.00     0.10  __samplers_MOD_gauss_thetas
  0.00      3.25     0.00       90     0.00     0.00  __samplers_MOD_corrchk_internal
  0.00      3.25     0.00       51     0.00    19.02  __samplers_MOD_prep
  0.00      3.25     0.00       50     0.00     0.40  __samplers_MOD_metroplis_gik_all_o_log
  0.00      3.25     0.00       50     0.00    21.82  __samplers_MOD_steptest
  0.00      3.25     0.00        9     0.00     0.00  __samplers_MOD_get_musigma
  0.00      3.25     0.00        1     0.00     0.00  __random_MOD_savern
  0.00      3.25     0.00        1     0.00     0.00  __random_MOD_setrn
  0.00      3.25     0.00        1     0.00     0.00  __samplers_MOD_get_datetime
  0.00      3.25     0.00        1     0.00     0.00  __samplers_MOD_get_musigma_maxll
  0.00      3.25     0.00        1     0.00     0.00  __samplers_MOD_push_yji
  0.00      3.25     0.00        1     0.00     0.00  __samplers_MOD_samplers_init

This clearly identifies the exp intrinsic and function pyq_i_detail as the main users of time in “self seconds”.
The total “cumulative seconds” is 3.25 seconds, which is less than the 5.07 seconds I obtained from SYSTEM_CLOCK; the difference is hopefully explained in the documentation (part of it is presumably the -pg instrumentation overhead, visible as the _mcount_private and __fentry__ entries).
It does not provide call counts for the intrinsics exp, log, sin and cos.

gprof is a very easy way to identify where time is being spent.
It would be useful to perform this test on the range of gfortran implementations you have available.


Thank you very much indeed @JohnCampbell , and I really appreciate your help!
The gprof in the equation.com version of gfortran somehow always gives me empty profile results :rofl:
I created a new thread at

If you or someone else has met similar issues before, you may reply there.

Thank you very much indeed :slight_smile:

Thanks @JohnCampbell . I think, based on my experience, that if you have code whose speed is important and you need to use gfortran, you should really try gfortran on Linux. There is a decent chance that your code will run noticeably faster with gfortran on Linux than with gfortran on Windows. After all, as the name gfortran (GNU Fortran) indicates, Linux may be its native battlefield. Especially if your code requires MPI: in Ubuntu you only need to do

sudo apt install gfortran mpich

and gfortran and MPI will just plug and play. On Windows it seems not that easy.
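For example, a minimal check that the installation works could look like this (hello.f90 is just a made-up name):

! hello.f90 : minimal MPI check
program hello
  use mpi
  implicit none
  integer :: ierr, rank, nproc
  call MPI_Init ( ierr )
  call MPI_Comm_rank ( MPI_COMM_WORLD, rank, ierr )
  call MPI_Comm_size ( MPI_COMM_WORLD, nproc, ierr )
  print *, 'rank', rank, 'of', nproc
  call MPI_Finalize ( ierr )
end program hello

compiled and run with

mpif90 hello.f90 -o hello
mpirun -np 4 ./hello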