Why is my code compiled with GFortran on Windows slower than on Ubuntu?

I have used two sources of gfortran on Windows: mingw-w64 and the 64-bit build from equation.com. (Each alternative requires careful setting of the PATH environment variable, which may be an issue for this thread?)

What I note is that equation.com’s version produces much larger .exe files than MinGW-W64, which I attribute to use of fewer .dll dynamic links. As a consequence, there is a slower initial startup of the .exe, but less overhead once the program starts. By using timers initiated during the run, this eq… version has slightly faster measured computation performance, as .dll loading is not included in my test run times.

This is an interesting thread, as I have not found (equation.com’s) gfortran on Windows to have poor performance. It is important to note, though, that my testing is for a different run profile: intense computation over minutes or hours, not fractions of a second where the startup time is significant. You only connect the .dlls once, so this delay is not repeated.

Most of the comparison of Julia to gfortran in this thread appears to focus on the startup and some intrinsics, while I am more focused on multi-threaded AVX computation. I can’t conceive that Julia would be faster than gfortran for the types of computation I am doing, but there are always going to be types of computing that suit a particular language.

I reviewed the code in post #9 to see if I could identify types of coding that may not suit gfortran.

  • lots of tab characters in the code, which is not portable and made it difficult to test with other compiler tools I use.
  • I am not familiar with ishft (i,j), especially where i and j are different kinds. I suspect “j” should be a standard integer?
  • auto-allocate is used, eg “qt1 = mu01 + sig01*gaussian(nsub1)“, although this is not a significant cpu time usage.
  • subroutine steptest uses “do concurrent( i=1:nsub, k=1:kmix )”, although this is not a significant cpu time usage. Not sure why this is adopted ?

However, from my win64 > equation.com:gfortran testing, most of the time is consumed in function pYq_i_detail (called 10 million times). This is provided as an external function argument to subroutine MC_gauss_ptheta_w_sig, which is called via subroutine prep.
Changing it from a supplied function argument to an explicit function call did not change the performance.
It uses the intrinsic exp and **2 and does not appear to utilise AVX.

Perhaps Julia has a better exp implementation ?


This is a bit late, but I just now noted that you have a potential major bug in your source file ran.f90 .

At the beginning, you have the declaration

integer, private, parameter :: i8=selected_int_kind(15)

Later, you have several variable declarations that use the kind number i8, such as

integer(kind=i8),  parameter :: mult1 = 44485709377909_8

This is correct if and only if i8 = 8. Whether this is true or not depends on the compiler. For Silverfrost FTN95 (Windows) and NAG (all platforms), this is not true.

Before you forget and run into trouble, change _8 to _i8 in your code, and take care to avoid repeating this mistake. Fortunately for you, the error will probably be caught at compile time by a compiler for which i8 /= 8, but you cannot count on that.
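
A minimal sketch of the portable form, for illustration only. It reuses the i8 declaration and the mult1 value quoted above; the program name and the printed messages are made up for this example and are not part of ran.f90.

```fortran
program kind_suffix_demo
  implicit none
  ! Same request as in ran.f90: an integer kind with at least 15 decimal digits.
  integer, parameter :: i8 = selected_int_kind(15)
  ! Portable: the kind of the literal follows the named constant i8,
  ! whatever kind number the compiler assigns to it.
  integer(kind=i8), parameter :: mult1 = 44485709377909_i8

  print *, 'kind number i8 =', i8
  print *, 'storage bits   =', storage_size(mult1)
  print *, 'mult1          =', mult1
  if (i8 /= 8) print *, 'note: a literal written with _8 would not have kind i8 here'
end program kind_suffix_demo
```

With gfortran the kind number printed is expected to be 8, which is why the _8 suffix happens to work there; the sketch only shows that _i8 does not rely on that coincidence.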


I tried to investigate why Windows gfortran’s implementation of function pYq_i_detail is reported to be slower than others.
I tried to introduce array syntax, below, but this did not achieve a run time improvement.
(ooops, just noted an error with mean(mi), but did not change error report below!!)

  function pYq_i_detail_array (theta,i) ! in principle this should be faster than pYq_i.

!  modified to introduce array syntax into calculation, without improvement !!

    real(kind=r8) :: pYq_i_detail_array, theta(dim_p)
    integer(kind=i4) :: i

    integer(kind=i4) :: j
    real(kind=r8) :: fact_mean, log_pYq_i, product_sigma_inv, mean(mi), sigma_inv(mi), log_pYq_j(mi)

    calls_pYq_i_detail = calls_pYq_i_detail + 1

    fact_mean         = D/theta(2)
    do j=1,mi
      mean(j)       = fact_mean / exp (theta(1)*t(j))
      sigma_inv(j)  = abs(sig*mean(j))
    end do
    log_pYq_j(:)      = (Yji(:,i)-mean(:))/sigma_inv(:)
    log_pYq_i         = -half * dot_product (log_pYq_j, log_pYq_j)
    product_sigma_inv = normpdf_factor_pYq_i / product(sigma_inv)

    pYq_i_detail_array  = product_sigma_inv * exp(log_pYq_i)
    return
  end function pYq_i_detail_array

However, the most likely indicator as to why the Windows version is slower is the following warning at the end of the run:
Note: The following floating-point exceptions are signalling: IEEE_INVALID_FLAG IEEE_DIVIDE_BY_ZERO IEEE_UNDERFLOW_FLAG

It looks like the data set based on random values is triggering IEEE warnings, which can significantly increase run time.
Perhaps Julia and other implementations are not doing the checking they should ?

I think you need to generate a more reasonable data set.
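
One way to narrow down where such flags are raised is to clear and query them around a suspect section using the intrinsic ieee_exceptions module. The sketch below is an assumption-laden illustration, not code from EM_mix.f90: the program name, the helper report_flags and the test values are hypothetical, and the two operations are chosen only because they would normally raise underflow and divide-by-zero.

```fortran
program ieee_flag_demo
  use, intrinsic :: ieee_exceptions
  implicit none
  integer, parameter :: r8 = selected_real_kind(15)
  real(kind=r8), volatile :: x   ! volatile discourages compile-time evaluation
  real(kind=r8) :: y

  call ieee_set_flag(ieee_all, .false.)   ! start with all flags clear
  x = -800.0_r8
  y = exp(x)                              ! result is far below tiny(1.0_r8)
  print *, 'y =', y
  call report_flags('after exp(-800)')

  call ieee_set_flag(ieee_all, .false.)
  x = 0.0_r8
  y = 1.0_r8 / x
  print *, 'y =', y
  call report_flags('after 1/0')

contains

  ! Hypothetical helper: print which of two flags are currently signalling.
  subroutine report_flags(label)
    character(len=*), intent(in) :: label
    logical :: underflow, div_by_zero
    call ieee_get_flag(ieee_underflow, underflow)
    call ieee_get_flag(ieee_divide_by_zero, div_by_zero)
    print *, label, ': underflow =', underflow, ' divide_by_zero =', div_by_zero
  end subroutine report_flags

end program ieee_flag_demo
```

Placing a similar check after the call to pYq_i_detail, or after the data generation, would show which section first sets a flag. With gfortran, compiling with -g -ffpe-trap=invalid,zero,underflow is another option that stops the run at the first such exception.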


Thank you @mecej4 and @JohnCampbell .

Eh, yeah, I mean, the thing is, it is exactly the same code, compiled with, let us just say, gfortran.
The underflow warning exists on both Linux and Windows.
However, gfortran on Linux performs normally and takes 0.5s.
On Windows it took 3s with the equation.com version and 0.7s with the cygwin64 version (perhaps this version is the same as your MinGW64 version).

The 0.5s vs. 3s is probably not caused by overhead; it is consistently 6X slower.
You may change the value of mgauss from 1000 to 5000, as shown in line 55 of the file EM_mix.f90. You will see that on Linux it takes 2.5s while on Windows it takes around 15s; again the Linux version is 6X faster.

I mean, again, for this small code, the equation.com gfortran version on Windows is simply 6X slower than on Linux, while the cygwin64 version of gfortran seems to perform almost the same as on Linux.

However, for more complicated code, I found that the equation.com gfortran version on Windows performs similarly to Linux (on Windows it is 30% slower than on Linux, but still acceptable), whereas the cygwin64 version can be 6X slower than on Linux. So on Windows, one cannot simply conclude which gfortran version is the best.

But the bottom line is, I think, or I hope, that for the same code and the same optimization flags, gfortran’s performance on Windows can be almost the same as on Linux. If there is a huge performance difference, the main problem is probably not in the code itself. After all, no one wants to write gfortran code specifically for Windows, right? :rofl:

PS.

On Linux it has the same warning but performs just fine.
Intel Fortran does not show the warning, and its performance is consistent on Linux and Windows.
The Julia code does not have that warning.


Thank you @mecej4 , yeah the ran.f90 is a little sloppy in this code. I can change that _8 to _i8.
But that does not solve the performance puzzle on Windows. But thank you very much indeed all the same :slight_smile:

Thank you @JohnCampbell too!

Yes, j should be just integer, I just need to change all the _8 to _i8 in ran.f90, such as

   integer(kind=i8),  parameter :: mask24 = ishft(1_i8,24)-1
   integer(kind=i8), parameter :: mask48 = ishft(1_i8,48)-1
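
As a quick check of the two declarations above, the following sketch evaluates the masks and compares them with 2**24-1 and 2**48-1. The program name and the test value x are illustrative assumptions; the shift count is written as a default integer, which is the form suggested earlier in the thread.

```fortran
program mask_demo
  implicit none
  integer, parameter :: i8 = selected_int_kind(15)
  ! Value being shifted has kind i8; the shift count is a default integer.
  integer(kind=i8), parameter :: mask24 = ishft(1_i8, 24) - 1
  integer(kind=i8), parameter :: mask48 = ishft(1_i8, 48) - 1
  integer(kind=i8) :: x

  print *, 'mask24 =', mask24
  print *, 'mask48 =', mask48
  if (mask24 /= 2_i8**24 - 1) error stop 'mask24 mismatch'
  if (mask48 /= 2_i8**48 - 1) error stop 'mask48 mismatch'

  ! Illustrative use: keep only the low 48 bits of a product.
  x = 44485709377909_i8 * 12345_i8
  print *, 'low 48 bits of x =', iand(x, mask48)
end program mask_demo
```

If the literal 1 were left as a default integer, a shift by 48 would exceed its bit size on compilers with 32-bit default integers, so the _i8 suffix on the shifted value is the part that matters.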

Eh, the code is below, may I ask what is the problem of using do concurrent?

	do concurrent( i=1:nsub, k=1:kmix ) ! change mgauss_ik to minimize delta Ni.
		mgauss_ik(i,k) = min(max(int( wknik_sig(i,k)/norm_sig(i)*mgauss_tot),mgauss_min),mgauss_max)    
	enddo

About optimization, @JohnCampbell
function pYq_i_detail is basically below,

I know that in this small code function pYq_i_detail is the most time-consuming part, so I tried my best to optimize it and did many experiments. I believe this implementation of function pYq_i_detail should be the fastest one can get. :slight_smile: Overall, what I did is basically this: instead of calculating the product exp(stuff)*exp(stuff)*exp(stuff)*exp(stuff)*exp(stuff)..., I first calculate sum=stuff+stuff+stuff+..., then finally do one exp(sum). In fact, in some cases, the only thing we really need is the sum, so using the sum can avoid the exp(sum) exploding issue.
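
The idea can be sketched in isolation as follows. This is not the actual pYq_i_detail; the array z and its values are made-up standardized residuals, used only to show that one exp of the accumulated exponent agrees with the product of individual exps.

```fortran
program log_sum_demo
  implicit none
  integer, parameter :: r8 = selected_real_kind(15)
  integer, parameter :: n = 5
  real(kind=r8) :: z(n), prod_form, sum_form
  integer :: j

  ! Illustrative values only (hypothetical standardized residuals).
  z = [0.3_r8, -1.2_r8, 0.7_r8, 2.1_r8, -0.4_r8]

  ! Form 1: one exp per term, multiplied together (n calls to exp).
  prod_form = 1.0_r8
  do j = 1, n
    prod_form = prod_form * exp(-0.5_r8 * z(j)**2)
  end do

  ! Form 2: accumulate the exponent first, then a single exp.
  sum_form = exp(-0.5_r8 * dot_product(z, z))

  print *, 'product of exps :', prod_form
  print *, 'exp of the sum  :', sum_form
  if (abs(prod_form - sum_form) > 1.0e-12_r8 * abs(sum_form)) error stop 'forms disagree'
end program log_sum_demo
```

Keeping the exponent itself (the log value) and deferring or skipping the final exp is also what avoids overflow or underflow when the sum is large in magnitude.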

IEEE warnings can significantly change run times on windows.

I have had this performance problem in the past, when trying to benchmark a linear equation solver with a large “array” of random numbers.
Can you produce a more realistic data set ?

I also note that “mgauss” can vary depending on the data, so a different random number set can change the computation ?


Thank you very much @JohnCampbell !

Can try.
The warning message is below,

Uhm, may I ask why the cygwin64 version of gfortran performs fine on Windows for this code (it took 0.7s)?
On the other hand, if the real data really has an underflow problem, how can we ‘fix’ that?

I could be absolutely wrong, but I feel underflow may not be a very uncommon issue, and it seems it should more or less be the compiler’s job to handle it neatly.
Intel Fortran on Windows, for example, does not show any such message, and its performance is consistent on Linux and Windows.

mgauss is the number of samples used in the Monte Carlo simulation. Its value is set at line 55 in EM_mix.f90 and never changes after that.
In this code, we try to find parameters that fit the data Y(mi,nsub), as at line 122. The data has some random noise in it, so it depends on the random seed a little bit.

However, if you increase nsub at line 47 from 100 to 200 or more as below,

the data set becomes bigger and the effect of the random noise in the data decreases, so the value of the random seed (irn at line 43) should not influence the result noticeably. Although the value of the likelihood may depend on the random number seed (because a different seed gives different data Y(mi,nsub) and therefore a different likelihood), the parameter estimates should not be influenced by the seed noticeably.
Such as below

The value of LogLike, the log likelihood, depends on the data and therefore on the seed; however, the parameter estimates such as w1,w2,Mu1,Mu2,MuV,Sig1,Sig2,SigV,Sigma do not depend on the seed noticeably.

You may also increase itermax at line 46 from 50 to 100 or any number, simply to increase the run time by doing more iterations.

If mgauss is set bigger than about 1000 and nsub>=200, then merely changing the random number seed should not influence the parameter estimates very much.

In terms of speed, because the total number of iterations is fixed and the time for each iteration is almost the same, a different random number seed should not influence the computing time noticeably.

Thank you very much @JohnCampbell .
Uhm, in fact, I am not very sure which part of my code triggered the IEEE warning.
But again, cygwin64 gfortran has no big performance issue with this small code.
My data, which is Y(mi,nsub) at line 122 in EM_mix.f90, looks like the sample below: 100*5 values.
The 100 rows mean nsub=100 patients, for example, and the 5 columns are mi=5 observations (like the value of some drug concentration) for each patient at time = 1.5, 2.0, 3.0, 4.0, 5.5.
The data is not very weird I think.

  2.65401349   2.40026224   1.99623561   1.41169297   0.88217734
  3.45838822   2.98550611   2.92380038   2.08054658   1.33617050
  3.82526123   3.03327603   2.34121218   1.52205476   0.92273596
  2.28668992   2.01973948   1.63948803   1.02627141   0.66650634
  3.40903589   2.82301185   2.34513287   2.01947901   1.34372061
  2.83453767   2.20061227   1.69325516   1.28140681   0.73161632
  2.75068773   2.23587612   1.71830050   1.01674311   0.64559501
  4.04768132   4.01698762   3.53190277   2.52058298   2.14937107
  4.08435874   3.40789423   2.75980292   1.39004878   1.10938533
  2.88367066   2.28895940   1.43275093   1.08778145   0.75014338
  3.12304854   2.84925010   2.36672949   1.54044628   1.02207904
  3.72852625   2.74760371   2.19501208   2.06158678   1.02571065
  3.37808119   2.68092516   2.31043504   1.77355439   0.88879246
  2.53444154   1.59567106   1.26306014   0.82380825   0.43744635
  2.85534255   1.93740693   1.85237050   1.24226383   0.77331081
  3.78665876   3.26490532   2.95064268   1.67801971   1.31745667
  3.05567463   2.65547087   1.81873971   1.23136769   0.69043155
  2.80767064   2.89798552   1.86058972   1.57507134   1.16067511
  3.30489986   2.74398369   2.21825740   1.83461938   0.99241819
  2.98543229   2.24426962   1.18329046   0.66491925   0.51613033
  3.61874634   2.09498495   1.41921642   0.97643742   0.64182735
  4.26231605   4.47587570   3.15080034   1.71895424   1.46397885
  3.50932076   2.61622640   2.56780792   2.02193280   1.51401681
  3.36647134   2.56527024   2.01409438   1.67567619   1.04806922
  4.50191529   3.54567059   2.74994640   2.20710948   1.83609478
  3.22450873   3.24666074   2.62405445   1.98347337   1.27931557
  2.98190040   2.95397037   1.62222461   1.28686358   0.93436784
  4.53695504   3.35934880   3.42288141   2.50322969   1.74250924
  2.62175799   2.46776757   1.86818667   1.46210614   0.85735527
  2.87686982   2.58862926   1.56346886   1.10365948   0.80911058
  3.78168621   3.68270753   3.00255413   1.93803745   1.71793505
  3.51136623   3.39427042   2.09630591   1.53876649   1.23730296
  3.62972062   4.14914361   2.77826374   2.03943793   1.71782968
  2.35057638   2.53769530   1.82456819   1.29633972   0.87441877
  1.89964379   1.79119682   1.24603350   0.81811392   0.42941111
  2.75157807   2.74380225   2.16641154   1.88319029   1.17524137
  2.64993979   2.17434263   1.48572844   1.05423214   0.61687510
  3.95709936   3.42263700   2.50621633   1.75887524   1.04011591
  2.80681462   2.06820029   1.47908019   1.03778862   0.56211488
  2.70812624   2.66001341   2.06492028   1.36820939   0.85063758
  3.10427718   3.65212115   1.75620884   1.68815232   1.03946008
  3.64235765   3.56826708   3.51482799   2.40909684   1.76638592
  4.06031791   2.97112810   2.49563006   1.43784176   1.14875997
  3.34082492   3.11935202   1.84614646   1.58291198   1.03015544
  3.22193708   2.54302857   1.79634980   1.11142443   0.65130767
  3.58215208   2.62073121   2.66638423   1.84202624   1.18265738
  3.35761282   2.39419717   2.23732805   1.91448176   1.03428109
  3.28333030   2.19348442   1.91633994   1.25456125   0.80776490
  3.10520090   2.09426322   1.32296942   1.03168506   0.61926324
  3.38021460   2.03462263   1.34017676   0.88276649   0.66037949
  3.40869343   3.41133803   1.93802175   1.81579843   1.06323731
  3.26968162   2.48983221   1.83106822   1.49248039   0.91877651
  3.46696457   3.40990458   2.69489217   1.65735733   1.46487218
  3.83003837   3.16419455   2.44522934   1.75292499   1.28435507
  3.86301454   2.60621378   2.81370680   2.28488143   1.99767924
  3.55586075   3.72374276   2.62138282   1.83746650   1.59064934
  4.10385990   3.54228113   2.74143151   2.12231404   1.62115274
  2.94285017   3.02266528   1.62344637   1.52739370   1.07545293
  3.30932595   2.75741807   1.77969176   1.41303613   0.88816817
  3.22341324   2.55816465   1.95459119   1.16511846   0.86159415
  2.84202064   2.43873974   1.60431253   1.36484335   0.86942491
  3.45614629   3.01040741   1.96346018   1.82010097   1.37410590
  3.00944969   2.89417444   1.58785266   1.10963363   0.64515890
  2.60424738   2.11331112   1.75463928   1.11998501   0.55229481
  2.99971059   2.47432836   1.56259031   1.14480975   0.67926735
  4.17905512   2.84980486   2.53602972   1.95078086   1.21125491
  3.93488338   3.34035560   2.61486487   1.81986854   1.22665323
  3.07076678   3.22755472   2.28705685   1.86866746   1.22210486
  4.47686033   3.72474003   3.15337092   2.16680869   1.50938806
  3.81167928   2.98266971   2.39673097   1.41967711   1.28361822
  2.65561655   2.72189958   2.30066879   1.76456974   1.26550255
  2.72273553   2.33356039   1.75159175   1.35047290   0.78941703
  2.90349003   2.08078892   1.38951964   0.93077741   0.49728869
  2.93430976   2.22440618   1.82547359   1.41442592   1.09560305
  2.91292812   2.54891278   1.45965571   1.36586297   0.63560401
  3.29525021   2.54570345   2.23982574   1.51381464   1.08854350
  3.94182345   2.97545842   3.02333913   2.54221231   2.21077273
  2.70802491   2.70036692   1.93903276   1.26357362   0.87813818
  3.10064585   2.80358061   2.30459697   1.58583013   1.05427075
  4.37507098   3.57259749   2.88659928   2.35402341   1.53044956
  1.69788520   1.18806704   0.68466912   0.38963738   0.16488946
  1.85341767   1.33030428   0.67399323   0.39109777   0.08488546
  2.13477615   1.40412914   0.76514415   0.38096703   0.19578517
  1.84198433   1.63604246   0.93591308   0.58238534   0.22486241
  1.88057122   1.40977237   0.74471384   0.52176656   0.20271034
  1.88576417   1.56837264   0.88622076   0.45210230   0.16227644
  1.98985910   1.50505439   0.80204031   0.47339445   0.17327696
  1.85425535   1.30655019   0.74395125   0.37028203   0.19013104
  2.06818492   1.30300840   0.67451415   0.43671893   0.15880358
  2.36072067   1.47919785   1.03115733   0.56690544   0.26369481
  1.87204096   1.65757104   0.79792967   0.59266594   0.25140979
  2.00575566   1.53834981   0.88645263   0.50812817   0.23976710
  1.86504634   1.15799719   0.77256567   0.48029002   0.17740237
  2.22557612   2.13963805   0.83146404   0.53317038   0.17601324
  1.79107457   1.74749490   0.99333983   0.55446976   0.24992792
  2.15386902   1.57688534   0.64048563   0.40716069   0.19150684
  2.00206904   1.66489366   1.03325675   0.62492999   0.32246391
  2.14706874   1.56950613   0.79770883   0.51951425   0.14801337
  1.63459898   1.12621802   0.57041317   0.33630108   0.14360490
  1.61198564   1.27536164   0.62245615   0.34712257   0.11828325

Why use DO CONCURRENT ?
What does it imply ? (I am not sure)
Potentially (depending on the compiler) it could initialise some MPI interfaces, which would be totally unnecessary. All you need is DO.
If you put it in the code, then the next person to maintain the code will have to answer these questions.
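
For a loop body like the one quoted earlier, where each (i,k) iteration writes only its own element, a plain nested DO gives the same result. The sketch below illustrates this under stated assumptions: the array names follow the thread, but the sizes, the weights and the clamp limits mgauss_min and mgauss_max are made-up values.

```fortran
program do_concurrent_demo
  implicit none
  integer, parameter :: nsub = 4, kmix = 2
  ! Hypothetical limits and total, chosen only for illustration.
  integer, parameter :: mgauss_tot = 2000, mgauss_min = 100, mgauss_max = 1900
  real    :: wknik_sig(nsub,kmix), norm_sig(nsub)
  integer :: a(nsub,kmix), b(nsub,kmix), i, k

  ! Illustrative weights, stored column by column.
  wknik_sig = reshape([0.9, 0.5, 0.2, 0.6, 0.1, 0.5, 0.8, 0.4], [nsub, kmix])
  norm_sig  = sum(wknik_sig, dim=2)

  ! Version 1: DO CONCURRENT, as in the quoted code.
  do concurrent (i = 1:nsub, k = 1:kmix)
    a(i,k) = min(max(int(wknik_sig(i,k)/norm_sig(i)*mgauss_tot), mgauss_min), mgauss_max)
  end do

  ! Version 2: ordinary nested DO loops with the same body.
  do k = 1, kmix
    do i = 1, nsub
      b(i,k) = min(max(int(wknik_sig(i,k)/norm_sig(i)*mgauss_tot), mgauss_min), mgauss_max)
    end do
  end do

  if (any(a /= b)) error stop 'loop forms disagree'
  print *, 'both loop forms give the same mgauss_ik'
end program do_concurrent_demo
```

The sketch says nothing about speed; it only shows that the two forms are interchangeable for this kind of independent element-wise update.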

Your code sample also identifies “mgauss_ik”, which I presume modifies the loop count depending on the data set (of different random numbers). This could change the calculation extent, based on the use of RANDOM_NUMBER which differs between compilers.

I looked at function pYq_i_detail (90% of computation in my profiling), as it did not appear to respond to -ffast-math or -O. I suspect it does not utilise AVX, so I tried to introduce array instructions to see if it worked any better. It did not :frowning:

Handling of IEEE exceptions is very compiler dependent. I have found “equation.com”:gfortran to be slow for this. These exceptions are more likely to occur in performance tests than with real data. I wasted months on this issue until mecej4 identified the problem for me. You need sufficiently realistic data sets for testing to avoid these unwanted side issues.

I introduced some profiling into the code, and produced the following:

#### Delta_Sec Summary ####   12

 Id Description                      Elapsed    Calls
  1 _START                            0.0000        1
  2 # pYq_i_detail                    3.5952    10201
  3 INITIALISED Yji                   0.0006        1
  4 prep > gauss_thetas               0.0961      102
  5 prep > MC_gauss_ptheta_w_sig      0.0000      102
  6 Metroplis_gik_k_more_o_log        0.4669       50
  7 CC Metroplis_gik_all_o_log        0.0679       50
  8 CC mgauss_ik(i,k)                 0.0006       50
  9 steptest report                   0.0095       50
 10 cpu_time report                   0.0011       50
 11 ANALYSED                          0.0026        1
 12 _FINISHED                         4.2406    10659
  calls to pYq_i_detail =             10198404
 Program end normally.

Note: '# pYq_i_detail' is reported at exit from subroutine MC_gauss_ptheta_w_sig, which is called from subroutine prep, where m = mgauss_ik(i,k). This count is significant for the comparison.
(Times are for i5-2300)


@mecej4 , @JohnCampbell , @oscardssmith if you are interested in the Julia version, below is the link,

You could copy all files in one folder, then in the cmd window do,

julia EM_mix.jl

My Julia version is 1.6.1. I remember that on Windows it previously took 1.5s; now it takes 3s :rofl:, perhaps because I upgraded my Windows 10 from 1909 to 21H2.

I am not an expert in Julia, so my Julia version may not be the most performant and may look weird.
I know that if Julia code looks like Fortran, it should perform like Fortran, so my Julia code perhaps looks like Fortran. LOL.
The thing with Julia is that Julia experts can do a lot of ‘fine tuning’ here and there in their code to make it fast, but I feel that should mostly be done by the compiler. It bothers me a little bit if I have to do all that optimization manually myself.

Since I saw that the Julia version did not perform as well as Intel Fortran on Windows, I did not install Julia and test it on Linux. :sweat_smile:

Anyway, the point is not to optimize the code; it is just a tiny illustration and has no real use.

I just wish gfortran’s performance on Windows could be consistent and about as good as on Linux. I also hope that there is an easy way to use gfortran and MPI on Windows. I know cygwin64 can use gfortran + openmpi on Windows. However, cygwin64 gfortran may be slow on Windows for some complicated big code, so adding openmpi just barely recovers the single-core performance seen on Linux. :rofl:

I really appreciate your endeavor @JohnCampbell ! Thank you so much!

About do concurrent, I agree.
I use it in the hope that it can really do some parallelization automatically, and perhaps make things work on a GPU. But Intel’s compiler seems to have some issue with it; here is a post about the issue, and you also replied there :slight_smile:

I personally did not find much performance advantage in do concurrent, other than that it can perhaps make the code look more concise.

Thank you for being so careful :+1: :100: “mgauss_ik” yeah it is just to dynamically adjust the number of samples (for the given i,k) used for Monte Carlo integral like below,


where n_ik is actually line 136 in samplers.f90,

“mgauss_ik” is actually not very useful; you can just comment out lines 221 to 224 in samplers.f90 as below,

and just do

mgauss_ik = mgauss

so mgauss_ik will always be a constant, mgauss. So for each n_ik the number of Monte Carlo samples is the same, mgauss, which is typically 1000.
The reason for “mgauss_ik” is this: say k=2, so 2 Gaussians are mixed. The total number of samples for n_i1 and n_i2 is a fixed number, k*mgauss; if mgauss=1000 and k=2, then k*mgauss=2000. However, perhaps n_i1 needs more samples than n_i2, so I may distribute 1500 samples to n_i1 and 500 to n_i2, giving “mgauss_i1=1500”, “mgauss_i2=500”, etc. In this way, the total of 2000 samples is distributed more efficiently over n_i1 and n_i2, instead of just giving 1000 samples to each.
No worry; in short, “mgauss_ik” does not really influence the code and does not depend on the seed too much. You know, if the result of a Monte Carlo simulation depends heavily on the random number seed, then something must be wrong :rofl:
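
The 1500/500 split in this example can be reproduced with the same min/max/int expression used for mgauss_ik. In this sketch the weights 0.75 and 0.25 and the clamp limits are assumptions chosen to match the numbers in the text; they are not taken from samplers.f90.

```fortran
program sample_allocation_demo
  implicit none
  integer, parameter :: r8 = selected_real_kind(15)
  integer, parameter :: kmix = 2, mgauss = 1000
  integer, parameter :: mgauss_tot = kmix*mgauss                ! 2000 samples in total
  integer, parameter :: mgauss_min = 100, mgauss_max = 1900     ! hypothetical limits
  real(kind=r8) :: w(kmix)
  integer :: m(kmix)

  ! Assumed relative weights of the two mixture components for one subject.
  w = [0.75_r8, 0.25_r8]

  ! Share out the fixed total in proportion to the weights, then clamp.
  m = min(max(int(w/sum(w)*mgauss_tot), mgauss_min), mgauss_max)

  print *, 'samples per component:', m
  if (any(m /= [1500, 500])) error stop 'unexpected allocation'
end program sample_allocation_demo
```

Because int truncates, the shares do not always add up exactly to the total for less convenient weights; the sketch does not attempt to redistribute the remainder.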

By the way, how did you get the profile information below?

#### Delta_Sec Summary ####   12

 Id Description                      Elapsed    Calls
  1 _START                            0.0000        1
  2 # pYq_i_detail                    3.5952    10201
  3 INITIALISED Yji                   0.0006        1
  4 prep > gauss_thetas               0.0961      102
  5 prep > MC_gauss_ptheta_w_sig      0.0000      102
  6 Metroplis_gik_k_more_o_log        0.4669       50
  7 CC Metroplis_gik_all_o_log        0.0679       50
  8 CC mgauss_ik(i,k)                 0.0006       50
  9 steptest report                   0.0095       50
 10 cpu_time report                   0.0011       50
 11 ANALYSED                          0.0026        1
 12 _FINISHED                         4.2406    10659
  calls to pYq_i_detail =             10198404
 Program end normally.

I tried gprof on Windows, but it always generates an empty prof file; perhaps I will open a new topic to ask about this.

Again, thank you so much! :+1: :100: :slight_smile:

I did the profiling “manually” by placing “call delta_sec ( description )” at the end of each section of code where timing could be informative.
I am modifying the original code to achieve this.
It is based on SYSTEM_CLOCK using 8-byte integers for higher precision. (A rate of ~3 million ticks per second implies about 1000 processor cycles per tick, so it cannot profile tight code, but it is much better than CPU_TIME, which reports only 64 ticks per second.)
I started by inserting a few calls, then adapted them to identify key areas.
It is an iterative process of finding the places that best show the relative times of the significant sections; an easy way to monitor key performance.
There is an overhead if there are too many calls to delta_sec. Using a ‘# …’ description suppresses the per-call report while still accumulating the relative times.
subroutine delta_sec ( description ) is a simple idea that can be modified to suit the program being profiled.
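
The tick rates mentioned above depend on the compiler and operating system, so it can be worth printing them before relying on a timer. The following sketch is a standalone illustration (its program name, loop and variable names are assumptions) that queries SYSTEM_CLOCK with 8-byte and 4-byte integers and times a small loop with both SYSTEM_CLOCK and CPU_TIME.

```fortran
program clock_resolution_demo
  use, intrinsic :: iso_fortran_env, only: int32, int64
  implicit none
  integer, parameter :: r8 = selected_real_kind(15)
  integer(int64) :: tick0, tick1, rate64
  integer(int32) :: rate32
  real(kind=r8)  :: acc
  real           :: cpu0, cpu1
  integer        :: i

  ! The reported rate may differ with the integer kind of the argument.
  call system_clock(count_rate=rate64)
  call system_clock(count_rate=rate32)
  print *, 'ticks per second, 8-byte integer :', rate64
  print *, 'ticks per second, 4-byte integer :', rate32

  call cpu_time(cpu0)
  call system_clock(tick0)
  acc = 0.0_r8
  do i = 1, 1000000
    acc = acc + sqrt(real(i, r8))      ! small amount of work to time
  end do
  call system_clock(tick1)
  call cpu_time(cpu1)

  print *, 'acc (printed so the loop is not removed) =', acc
  print *, 'system_clock elapsed (s) :', real(tick1 - tick0, r8) / real(rate64, r8)
  print *, 'cpu_time elapsed (s)     :', cpu1 - cpu0
end program clock_resolution_demo
```

If the loop finishes within a single tick of either timer, the corresponding elapsed time prints as zero, which is the tight-code limitation described above.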

It is based on the Salford FTN95 compiler, which profiles all routines compiled with the /profile option.

The Delta_Sec Summary gives a clear indication of relative times/importance.
Special descriptions are:
_START starts the summary process; it should be the first call.
_FINISHED writes the final summary report and resets the accumulated times.
description(1:1) = ‘#’ accumulates times but does not write per-call reports (good if there are lots of calls).

!  first few lines of code
    open ( 6, file='EM_mix.log' )

    call delta_sec ( '_START' )
    call delta_sec ( '# pYq_i_detail' )

!  code for profiling report
    subroutine delta_sec ( description )
      character*(*) description
!
      integer*8          :: tick, rate
      integer*8          :: last_tick=-1        ! last tick delta_sec was called
      real*8             :: sec, all_sec = 0
!
      logical            :: do_summary = .false.
      integer*4, save    :: nt=0, i
      character*30, save :: list_of_descriptions(50)=' '
      integer*4, save    :: num_calls(50)=0
      real*8, save       :: times(50)=0

!   Get ticks since last call
      call system_clock ( tick, rate )
      if ( last_tick < 0 .or. description == '_START') then
        last_tick  = tick
        all_sec    = 0
        do_summary = description == '_START'
        nt         = 0
      end if

!   report this time interval : ignore if #....
      sec = dble(tick-last_tick) / dble(rate)
      all_sec = all_sec + sec
      if ( description(1:1) /= '#' )  &
      write (6,11) description, sec, all_sec
      last_tick = tick
!
      if ( .not. do_summary ) return
!
!   save all times for final summary report if selected
      do i = 1,nt
        if ( list_of_descriptions(i) /= description ) cycle
        times(i)     = times(i)     + sec
        num_calls(i) = num_calls(i) + 1
        exit
      end do

!   add if new description to list of descriptions
      if ( i > nt ) then
        if ( nt < size(times) ) nt = nt+1
        list_of_descriptions(nt) = description
        times(nt)                = times(nt)     + sec
        num_calls(nt)            = num_calls(nt) + 1
      end if

!   report summary times if finished
      if ( description == '_FINISHED') then
        write (6,10) nt
        times(nt) = sum(times(1:nt))
        num_calls(nt)   = sum(num_calls(1:nt))
        do i = 1,nt
          write (6,12) i, list_of_descriptions(i), times(i), num_calls(i)
          times(i)     = 0
          num_calls(i) = 0
        end do
      end if

  10  format (/'#### Delta_Sec Summary #### ',i4//  &
               ' Id Description                      Elapsed    Calls')
  11  format ('#### delta_sec #### ',a,t50,2f10.4)
  12  format (i3,' ',a, f10.4, i9)
    end subroutine delta_sec

A current Reddit thread demonstrates that the speed of an executable generated by a compiler can depend greatly on the options used.


Thanks @Beliavsky :grinning:
My flag for this small code is just

-Ofast -march=native

I mean, in my experience, for simple or complicated code, with exactly the same code and the same flags, the performance of the different gfortran versions on Windows is less consistent than on Linux.
If using some particular optimization flags for gfortran on Windows could make the code as fast as its native speed on Linux, that would be great. But ideally I wish one can use the same flags for both Windows and Linux, and speed on Windows and Linux are equally fast.

The Reddit post that @Beliavsky pointed to demonstrates nothing, I’m sorry to point out, although I wholeheartedly agree that compiler flags can greatly affect the speed of the resulting program. There are lots of compiler flags in the commands shown on the Reddit page, but none of the files listed in the commands are source files! Only linker options have any effect in such a situation, and using different libraries (e.g., RefBlas versus MKL) could change the run-durations. That, however, is not the point being stressed.

The wish expressed by @CRquantum, “But ideally I wish one can use the same flags for both Windows and Linux, and speed on Windows and Linux are equally fast” is not realistic for any application that spends some time in the compiler’s RTL and/or system services. On Windows, EXEs and DLLs produced by Gfortran need wrapper code (or translation layer) that converts the Linux C library and system calls to Windows compatible library and system calls.

Thanks @mecej4 .
But why is Intel Fortran’s performance consistent on Windows and Linux? I mean, with the same code and the same optimization flags, I always find that Intel Fortran’s performance is the same on Windows and Linux.
Just say on Windows,
perhaps at the compile stage both Intel Fortran and gfortran have no problem. But at the link and build stage, perhaps the fact that Intel Fortran relies on Visual Studio while gfortran relies on the translation layer makes the difference. The result is that, while Intel Fortran’s performance is consistent on Windows and Linux, gfortran’s performance is compromised, and the different Windows versions of gfortran perform differently from each other.

On windows, it seems there are basically two branches of gfortran.

  1. gfortran in Cygwin64 and MSYS2 performs the same (perhaps the one in MinGW64 performs the same too). This version of gfortran performs well for this small code of mine; however, for more complicated code it can perform 6X slower than on Linux.

  2. gfortran from Equation.com. It is 4X slower than Cygwin64 gfortran for my small code, but not too bad (30-50% slower than on Linux) for more complicated code.

On Windows, it seems using Intel OneAPI could be the best choice, not only from performance point of view, but also considering that it has MPI integrated.
On the other hand, gfortran performs well in any linux related case, no matter native Linux, or Linux in WSL, Hyper-V, and virtual machines like Vmware.

As said, if there were a very performant gfortran with MPI configured on Windows, it would be really great for building commercial software on it. This may be good for the Fortran community, because we need more people to really use it and rely on it, directly or indirectly. Imagine if Microsoft Office or many famous video games were mostly written in Fortran; then I guess there would be constant money/funding devoted to the Fortran community, which would help Fortran develop better. Otherwise Fortran may remain mostly in academia and HPC, and without a large and diverse enough user base its potential may be limited in some way. Well, that is off topic. :sweat_smile:

In short, I just want gfortran to be good on Windows too! :sweat_smile:

The gfortran approach of supporting many different hardware platforms and operating systems is achieved by having different “.dll” interfaces to the OS. The particular variant used is important.

My interpretation of this thread is that the gfortran interface to Windows provided by equation.com (EQ) may be deficient for exp and IEEE error handling, in comparison to other implementations.
Generalising this to “6X slower than on Linux” is not a valid conclusion.
My testing of the MinGW-w64 and EQ versions of 64-bit gfortran shows the EQ version to be roughly 5% faster for the tests that I do, although how cache management varies between them remains a mystery to me. I have no experience of Linux but would not expect a significant change.

-Ofast ?
I have assumed this to be an aggressive option, so I prefer -O2, or -O3 for code where I judge that optimisation should not cause problems, e.g. simple DO loops.
I nearly always use “-fimplicit-none -march=native -ffast-math -fopenmp -fstack-arrays” as go-to options and sometimes “-g” or “-funroll-loops --param max-unroll-times=2”. ( should not use goto :slight_smile: )
Using some of these is more a hope that they will help, unlike -fopenmp, which definitely does change the compile outcome.
Following on from another thread, I need to reinvestigate the use of “-ffast-math -fopenmp”, especially where “low arithmetic intensity” is identified.
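To illustrate why -ffast-math (which -Ofast turns on in GCC) is usually described as aggressive, here is a minimal sketch of my own, not taken from the code in this thread. It compares a naive single-precision sum with a Kahan compensated sum. The names sum_demo, naive_sum and kahan_sum are hypothetical. Under -ffast-math the compiler is allowed to treat the compensation term as algebraically zero, so the Kahan result may degrade towards the naive one; whether it actually does depends on the compiler version and flags.

```fortran
! Sketch only: shows a calculation whose result can depend on -ffast-math.
module sum_demo
   use, intrinsic :: iso_fortran_env, only: real32
   implicit none
   private
   public :: naive_sum, kahan_sum
contains
   ! Plain accumulation: rounding error grows with the number of terms.
   pure function naive_sum(n, term) result(s)
      integer, intent(in) :: n
      real(real32), intent(in) :: term
      real(real32) :: s
      integer :: i
      s = 0.0_real32
      do i = 1, n
         s = s + term
      end do
   end function naive_sum

   ! Kahan compensated summation: c carries the rounding error forward.
   pure function kahan_sum(n, term) result(s)
      integer, intent(in) :: n
      real(real32), intent(in) :: term
      real(real32) :: s, c, y, t
      integer :: i
      s = 0.0_real32
      c = 0.0_real32
      do i = 1, n
         y = term - c
         t = s + y
         c = (t - s) - y   ! zero in exact arithmetic; fast-math may drop it
         s = t
      end do
   end function kahan_sum
end module sum_demo

program fast_math_demo
   use, intrinsic :: iso_fortran_env, only: real32
   use sum_demo, only: naive_sum, kahan_sum
   implicit none
   integer, parameter :: n = 1000000   ! arbitrary number of terms
   print '(a,f14.4)', 'naive sum: ', naive_sum(n, 0.1_real32)
   print '(a,f14.4)', 'Kahan sum: ', kahan_sum(n, 0.1_real32)
end program fast_math_demo
```

Building it once with -O2 and once with -O2 -ffast-math and comparing the second line of output is a quick way to see whether the option changes results on a given gfortran build.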


I have now tried using the gfortran profiling utility : gprof
This is my first use of gprof, so my option selection may not be the best, but I get a useful table.
It is much easier than my approach.

The batch file I used to test with the profiling report is:

del *.o
del *.mod
del em_mix.exe

set options=-g -fimplicit-none -fallow-argument-mismatch -march=native -pg

gfortran ran.f90 samplers.f90 em_mix.f90 %options% -o em_mix.exe

dir *.exe

em_mix

gprof -b -J -p em_mix.exe > em_mix_profile.log

notepad em_mix_profile.log

notepad em_mix.log

The resulting profile log is useful:

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 56.00      1.82     1.82                             exp
 26.46      2.68     0.86 10198404     0.00     0.00  __samplers_MOD_pyq_i_detail
  3.38      2.79     0.11                             __logl_internal
  3.08      2.89     0.10    10200     0.01     0.09  __samplers_MOD_mc_gauss_ptheta_w_sig
  2.15      2.96     0.07                             __fentry__
  2.15      3.03     0.07                             _mcount_private
  1.54      3.08     0.05  1000000     0.00     0.00  __samplers_MOD_pyq_more_o_log
  1.54      3.13     0.05      100     0.50     1.20  __samplers_MOD_metroplis_gik_k_more_o_log
  0.92      3.16     0.03  6040500     0.00     0.00  __random2_MOD_ran1
  0.62      3.18     0.02                             __cosl_internal
  0.62      3.20     0.02                             log
  0.31      3.21     0.01  4000008     0.00     0.00  __random_MOD_randn
  0.31      3.22     0.01      208     0.05     0.05  __random_MOD_gaussian
  0.31      3.23     0.01                             __sinl_internal
  0.31      3.24     0.01                             cos
  0.31      3.25     0.01                             sin
  0.00      3.25     0.00    10660     0.00     0.00  delta_sec_
  0.00      3.25     0.00      102     0.00     0.10  __samplers_MOD_gauss_thetas
  0.00      3.25     0.00       90     0.00     0.00  __samplers_MOD_corrchk_internal
  0.00      3.25     0.00       51     0.00    19.02  __samplers_MOD_prep
  0.00      3.25     0.00       50     0.00     0.40  __samplers_MOD_metroplis_gik_all_o_log
  0.00      3.25     0.00       50     0.00    21.82  __samplers_MOD_steptest
  0.00      3.25     0.00        9     0.00     0.00  __samplers_MOD_get_musigma
  0.00      3.25     0.00        1     0.00     0.00  __random_MOD_savern
  0.00      3.25     0.00        1     0.00     0.00  __random_MOD_setrn
  0.00      3.25     0.00        1     0.00     0.00  __samplers_MOD_get_datetime
  0.00      3.25     0.00        1     0.00     0.00  __samplers_MOD_get_musigma_maxll
  0.00      3.25     0.00        1     0.00     0.00  __samplers_MOD_push_yji
  0.00      3.25     0.00        1     0.00     0.00  __samplers_MOD_samplers_init

This clearly identifies the exp intrinsic and the function pyq_i_detail as the main time usage in “self seconds”.
The “cumulative seconds” total is 3.25 seconds, which is less than the 5.07 seconds I obtained from SYSTEM_CLOCK; the difference is hopefully explained in the documentation.
It does not provide call counts for the intrinsics exp, log, sin and cos.

gprof is a very easy way to identify where time is being spent.
It would be useful to perform this test on the range of gfortran implementations you have available.
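Since the profile points at exp, a small stand-alone timing test may be an easier thing to run across several gfortran builds than the full program. The following is only a sketch under my own assumptions: the names exp_timing, time_exp and bench_exp, the call count and the argument range are hypothetical and are not taken from em_mix.f90. It times a loop of exp calls with SYSTEM_CLOCK and returns a checksum so the loop is not optimised away.

```fortran
! Sketch only: time repeated calls to the exp intrinsic with SYSTEM_CLOCK.
module exp_timing
   use, intrinsic :: iso_fortran_env, only: real64, int64
   implicit none
   private
   public :: time_exp
contains
   ! Call exp() n times on varying arguments. Returns elapsed wall-clock
   ! seconds; checksum is the sum of the results, returned so that the
   ! compiler cannot discard the loop.
   function time_exp(n, checksum) result(seconds)
      integer(int64), intent(in)  :: n
      real(real64),   intent(out) :: checksum
      real(real64) :: seconds
      integer(int64) :: i, t0, t1, rate
      real(real64) :: x

      checksum = 0.0_real64
      call system_clock(t0, rate)
      do i = 1, n
         ! Arguments cycle through (-1, 0] in steps of 0.001.
         x = -real(mod(i, 1000_int64), real64) * 1.0e-3_real64
         checksum = checksum + exp(x)
      end do
      call system_clock(t1)
      seconds = real(t1 - t0, real64) / real(rate, real64)
   end function time_exp
end module exp_timing

program bench_exp
   use, intrinsic :: iso_fortran_env, only: real64, int64
   use exp_timing, only: time_exp
   implicit none
   integer(int64), parameter :: n = 50000000_int64   ! arbitrary call count
   real(real64) :: checksum, seconds

   seconds = time_exp(n, checksum)
   print '(a,f8.3,a,es12.5)', 'exp loop: ', seconds, ' s, checksum = ', checksum
end program bench_exp
```

Compiling this with the same options on each installation (for example the MinGW-w64, Cygwin64 and equation.com builds, and gfortran on Linux) should show whether the exp implementation alone accounts for the differences reported in this thread.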


Thank you very much indeed @JohnCampbell, I really appreciate your help!
The gprof in the equation.com version of gfortran somehow always gives me empty profile results :rofl:
I created a new thread at

If you or anyone else has met similar issues before, please reply there.

Thank you very much indeed :slight_smile:

Thanks @JohnCampbell, I think (based on my experience) that if you have code where speed is important and you need to use gfortran, you should really try gfortran on Linux. There is a decent chance that your code will run noticeably faster with gfortran on Linux than with gfortran on Windows. After all, as the name gfortran (GNU Fortran) indicates, Linux may be its native battlefield. Especially if your code requires MPI, in Ubuntu you only need to do

sudo apt install gfortran mpich

and gfortran and MPI will just plug and play. On Windows it does not seem to be that easy.