Why is my code compiled with GFortran on Windows slower than on Ubuntu?

@CRQuantum, I think that many of the statements and conclusions stated in the various posts in this thread are unfair to Gfortran. Not much has been said about which versions of Gfortran were used on Windows, and which emulation layer/DLL support infrastructure was used.

I am attempting to make amends.

A brief scan of your sources made me suspect that it uses 8-byte integers in many places where 4-byte integers would have sufficed. This choice, naturally, affects performance, but I did not want to spend any time to alter this aspect of the code.

I took your Gitlab sources, and commented out most of the WRITE statements, until the program produced just 16 lines of output. Here are the run times on my NUC (small box PC with low power laptop processor i7-10710U, laptop memory, on a ramdisk, balanced power setting, Win11-64).

Ifort 2021.5, /O2            : 0.55 s
-same-, but  /fast           : 0.38 s

Cygwin Gfortran 11.2, -O2    : 0.79 s
Eq.com Gfortran 12.0, -O2    : 2.76 s 

I strongly suspected that the difference in the Gfortran times is attributable to Cygwin using (naturally) Cygwin1.DLL rather than the MinGW DLLs. If so, most of the slowdown that you noticed is attributable not to the compiler but to the runtime (which is also used by GCC, G++, etc.). I ran a test to settle my suspicion. I used Cygwin Gfortran to produce .o files, and linked the .o files using the MinGW Gfortran. The resulting a.exe took 2.8 s. This proves that there can be drastic differences in run times of Fortran programs depending on which versions of the GCC RTL are used. We may note in this connection that MinGW was last updated in 2013, whereas Cygwin was updated just two months ago.

You also wrote that Gprof failed to give you profile output. Here is part of the output that I obtained from Cygwin Gprof:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  ms/call  ms/call  name
 38.24      0.26     0.26 10198404     0.00     0.00  __samplers_MOD_pyq_i_o
 35.29      0.50     0.24                             _mcount_private
  8.82      0.56     0.06  3999800     0.00     0.00  __random_MOD_randn
  5.88      0.60     0.04                             __fentry__
  2.94      0.62     0.02      208     0.10     0.10  __random_MOD_gaussian
  2.94      0.64     0.02      100     0.20     0.50  __samplers_MOD_metroplis_gik_k_more_o_log
  2.94      0.66     0.02       51     0.39     5.87  __samplers_MOD_prep
  2.94      0.68     0.02                             exp

If you skip the line for _mcount, which is the profiling routine itself, you may note that the biggest time consumed is in function PYQ_I, and that function accounts for over a third of the run time.

You may attempt to modify your code to reduce the number of calls to PYQ_I, or to make it a vector function instead of a scalar, if that is feasible. Secondly, as I mentioned earlier, see if you can be more judicious in using 8-byte integer variables.

4 Likes