Comparing Fortran and Julia's Bessel function performance

I really like this thread, as it shows how easily this community can be derailed by hinting at possible non-performance of Fortran … totally understandable, given that the main selling point (i.e. when advising your PhD student what language to learn) is high performance. If the performance goes out of the window … what is left??

Some more marketing headaches to come:

Comparing the C++ program below

#include <cmath>
#include <vector>
#include <ctime>
#include <numeric>
#include <random>
#include <iostream>
#define ff double
int main( void ){
  std::vector<ff> x(1000000),y(1000000);
  std::mt19937 g(12345);
  std::uniform_real_distribution<ff> dist(0,1);
  for(int i=0;i<x.size();++i) x[i]=dist(g)*(ff)200;
  x[0]=(ff)1;
  int n=10;
  double time=(double)0;
  for(int i=0;i<n;++i){
    std::clock_t c_start = std::clock();
    for(int i=0;i<x.size();++i) y[i]=j1(x[i]);
    std::clock_t c_end = std::clock();
    time+=(double)(c_end-c_start) / CLOCKS_PER_SEC;
  }
  ff time_elapsed_s = time/(double)n;   // mean time per pass, in seconds
  std::cout << "CPU time used: " << time_elapsed_s << " s\n";
}

with the code provided by @FortranFan, I got the following:

gfortran -O3 -ffast-math tmp.f90 
rcs > ./a.out 
 BESSEL_J1 Evaluation: Mean compute time for      1000000  evaluations:
   7.1370466001098976E-002  seconds.
 Standard deviation in calculated results:    0.0000000000000000
ifort -O3 tmp.f90 
rcs > ./a.out 
 BESSEL_J1 Evaluation: Mean compute time for      1000000  evaluations:
  5.189895629882812E-002  seconds.
 Standard deviation in calculated results:   0.000000000000000E+000
g++ -O3 -ffast-math tmp.cpp 
rcs > ./a.out 
CPU time used: 0.0444952 s

implying that ifort needed about 17% more time, and gfortran about 60% more time, than the C++ program. But most likely I am not comparing apples with apples.
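For reference, these ratios follow directly from the timings reported above:

0.0519 s / 0.0445 s ≈ 1.17  (ifort vs. g++: about 17% more time)
0.0714 s / 0.0445 s ≈ 1.60  (gfortran vs. g++: about 60% more time)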

You have just proven that falling in love with something is the best way to blur your objectivity … and given that many people here are scientists (the incarnation of objectivity) … I am really amazed by this.

However, adding

for(int i=0;i<y.size();++i) std::cout<<y[i]<<std::endl;

to the end of the code yields the same time for the function evaluation … and prints the results on the screen.

But I guess you’ll keep telling me that I am not comparing apples with apples, and in the end resort to questioning the existence of apples at all. There is a famous term for this: “alternative facts”.

Please, everyone, keep this discussion civil and on topic. If there is something to discuss that doesn’t fit the current thread’s topic, please open a new thread. In all posts, avoid personal accusations and follow our code of conduct; see Welcome to Discourse.

Feel free to contact the @moderators group or me via PM if you have any questions.


This is what I’m getting on my M1 Mac:

C++ performance

federico@Federicos-MacBook-Pro Downloads % g++-12 -O3 -ffast-math ./tcpp.cpp  
federico@Federicos-MacBook-Pro Downloads % ./a.out                          
CPU time used: 37.1698 ms, sum(z)=0

Fortran performance

federico@Federicos-MacBook-Pro Downloads % gfortran -O3 -ffast-math ./tf90.f90
federico@Federicos-MacBook-Pro Downloads % ./a.out                            
CPU time used:   37.2193 ms, sum(z)=250248.58927183808

That’s essentially identical performance.
Here are the two programs I used; I believe they’re identical twins:

program tf90
    use iso_fortran_env
    implicit none

    real(real64), allocatable :: x(:),y(:),z(:)
    integer :: n,i,j
    real(real64) :: time,c_start,c_end
    allocate(x(1000000),y(1000000),z(1000000))
    call random_number(x); x=200*x
    x(1) = 1.0_real64
    n = 50
    time = 0.0_real64
    do i=1,n
        call cpu_time(c_start)
        Y = bessel_j1(x)
        call cpu_time(c_end)
        z(i) = sum(y)
        time = time+c_end-c_start
    end do
    print "('CPU time used: ',f9.4,' ms, sum(z)=',g0)",1000*time/n,sum(z)

end program

#include <cmath>
#include <vector>
#include <ctime>
#include <numeric>
#include <random>
#include <iostream>
#define ff double
int main( void ){
  std::vector<ff> x(1000000),y(1000000),z(1000000);
  std::mt19937 g(12345);
  std::uniform_real_distribution<ff> dist(0,1);
  for(int i=0;i<x.size();++i) x[i]=dist(g)*(ff)200;
  x[0]=(ff)1;
  int n=50;
  double time=(double)0;
  for(int i=0;i<n;++i){
    std::clock_t c_start = std::clock();
    for(int i=0;i<x.size();++i) y[i]=j1(x[i]);
    std::clock_t c_end = std::clock();
    z[i] = std::accumulate(y.begin(), y.end(), 0);
    time+=(double)(c_end-c_start) / CLOCKS_PER_SEC;
  }
  ff time_elapsed_ms = 1000.0*time/(double)n;
  std::cout << "CPU time used: " << time_elapsed_ms << " ms, sum(z)="
            << std::accumulate(z.begin(), z.end(), 0) << "\n";
}
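One small difference between the two programs is worth flagging: std::accumulate(y.begin(), y.end(), 0) uses an int initial value, so each partial sum is truncated to an integer; since |J1(x)| < 0.6 the running sum never reaches 1, which is presumably why the C++ run above prints sum(z)=0 while the Fortran run prints ≈250248.6. A minimal, self-contained demonstration of this pitfall (not part of the benchmark itself):

#include <numeric>
#include <vector>
#include <iostream>

int main(){
  std::vector<double> v{0.3, 0.3, 0.3};
  // int initial value: every partial sum is converted back to int
  std::cout << std::accumulate(v.begin(), v.end(), 0)   << "\n";  // prints 0
  // double initial value keeps the reduction in floating point
  std::cout << std::accumulate(v.begin(), v.end(), 0.0) << "\n";  // prints 0.9
}

Passing 0.0 as the initial value in the benchmark would make the two sum(z) outputs comparable.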

So you checked the assembly of one?? Can you please let us know which one! And of course it would be good to understand how a supposedly optimized-away loop can still report 0.043 s of loop time. Shouldn’t it have been zero then?

BTW I also checked an equally structured Fortran code, compiled with ifort and gfortran; neither optimized the loop away. Again, it would be good to learn where your instant inference that my code is questionable comes from.
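As an aside, a minimal way to rule out such dead-code elimination, without printing a million values, is to give the results an observable use, e.g. a checksum stored to a volatile. A rough sketch (assuming the same j1 loop as in the codes above):

#include <cmath>
#include <cstddef>
#include <vector>
#include <iostream>

// A volatile sink gives the timed loop an observable side effect,
// so the compiler cannot legally remove the j1 evaluations.
volatile double sink;

int main(){
  std::vector<double> x(1000000), y(1000000);
  for (std::size_t i = 0; i < x.size(); ++i)
    x[i] = 200.0 * i / x.size();            // non-constant inputs
  for (std::size_t i = 0; i < x.size(); ++i)
    y[i] = j1(x[i]);                        // the loop being timed
  double s = 0.0;
  for (double v : y) s += v;
  sink = s;                                 // keeps y, and hence the loop, live
  std::cout << "checksum: " << sink << "\n";
}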

Thanks a lot.

Hi.

thank you for putting some work into your comment :slight_smile: … much appreciated.

After some more experimenting I got the same numbers, except for ifort, which is slower. For all three C++ compilers I got nearly identical performance.

So at least we know that Fortran can still be as performant as C++ (ten years ago I would have bet that it must be the other way around!? :wink: ).

Nice,

if you share the numbers with us too, I think that would be a useful addition to the discussion.


@rcs,

My opinion is that the core messages and issues in this particular thread, and in others like it earlier at this forum or at other sites (esp. the Julia forum), are not about the current state of Fortran per se; they have to do with the following:

  • there is some sort of obsession by some to show Julia as Nx faster than the “fastest”, where N can range from something as reasonable as 1.3 to difficult- or impossible-to-comprehend values such as 10x or 1000x,
  • and then extending that thinking even to microbenchmarks, without being even remotely rigorous about the performance testing and benchmarking.

Since FORTRAN occasionally comes into the mix in the “fastest” category, the above comparisons come into play vis-à-vis Fortran also, and the “fun” starts. However, the “fun” part is only about getting into the details around the N factor above. Once some rigor is brought to bear and it is shown that the 10x to 1000x claims don’t make sense, the thread only continues because some of those folks keep trying this and that with Julia and going on about how their “feelings” were “hurt”.

As to your findings, as I have mentioned elsewhere, modern C++ (and I mean the standard revisions from C++14 onward) has really taken off. The language standard and its entire ecosystem are remarkably advanced, and the performance achieved in well-written codes is simply outstanding. In practically everything I have tried across a couple of compilers (Intel, Microsoft, and gcc) and platforms (Windows OS and *UX), Fortran at best matches the performance of the modern C++ equivalent of the codes; in many cases modern C++ gives me 10 to 20% better performance overall across the above compilers and platforms. Thus I understand and appreciate the advances made by modern C++.

So, for me, the matter is not about Fortran being the fastest or any false pride around it; rather, it has to do with getting better at performance benchmarking generally and being more thorough and rigorous about it.

There are some seriously questionable aspects of this in almost every benchmark I have come across involving Julia codes; that is the issue.


Separately, setting the first element of the x array to unity in the Bessel J1 Fortran test code I had posted earlier was only meant to include a simple-minded check on the computed result, one that would also ensure the compiler would not optimize away the calculation.

You copied the same approach with the first-element definition in your C++ code, but you didn’t do anything with the calculated results the way I showed in the Fortran code above; that was strange.
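For illustration, such a check might look like this on the C++ side (a sketch, assuming x[0] = 1 as above and comparing against the reference value J1(1) ≈ 0.4400505857):

#include <cmath>
#include <cstdlib>
#include <iostream>

// Simple-minded result check: with an input of 1, the computed value
// must match J1(1) ≈ 0.4400505857; using the result this way also
// prevents the compiler from optimizing the evaluation away.
int main(){
  double y0 = j1(1.0);
  std::cout << "J1(1) = " << y0 << " (expected ~0.4400505857)\n";
  return std::abs(y0 - 0.4400505857449335) < 1e-12 ? EXIT_SUCCESS : EXIT_FAILURE;
}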

Hi.

Knowing that the program executes the loop and getting the timing was sufficient for me. Making examples costs lifetime, so I keep them to the necessary minimum.
