Why is this Fortran code so much faster than its C++ counterpart?

program main
  integer(8) :: n
  real(8), allocatable :: x(:)
  n = 3d9
  allocate(x(n), source = 0d0)
  x(1000) = acos(-1d0)
  print *, maxval(x)
end program
#include <vector>
#include <cmath>
#include <algorithm>
#include <iostream>
using namespace std;

int main()
{
  long n = 3e9;
  vector<double> x(n);
  x[999] = acos(-1.0);
  cout << *max_element(x.begin(), x.end());
  return 0;
}

I use GCC 12.1.1 without any optimization flags.

Here’s a side-by-side comparison in Compiler Explorer (both the Fortran and C++ versions are in the editor).

A few observations:

  • The section

          .long   1413754136
          .long   1074340347
    

    corresponds to the value of acos(-1). I’ve checked with this Fortran program (a C++ equivalent of the check follows the list):

    use, intrinsic :: iso_fortran_env, only: int32, int64, real64
    integer(int32) :: p(2)
    p(1) = transfer(1413754136,1_int32)
    p(2) = transfer(1074340347,1_int32)
    print *, transfer(p,1_real64)
    end
    
  • Both programs use the exact same loop with a running comparison to find the maximum value, moving (I presume) one element (8 bytes) at a time:

    .L4:
          movsd   xmm1, QWORD PTR [rdx]
          add     rdx, 8
          maxsd   xmm1, xmm0
          movapd  xmm0, xmm1
          cmp     rax, rdx
          jne     .L4
    
  • I’m assuming printing is equally expensive in both run-time libraries.
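
As an aside, the transfer() check in the first bullet can be reproduced in C++. This is only a sketch, assuming a little-endian machine (like the x86-64 target above), with memcpy playing the role of transfer():

#include <cstdint>
#include <cstring>
#include <iostream>

int main()
{
  // The two .long values from the assembly: low word first, then high word.
  std::uint32_t p[2] = {1413754136u, 1074340347u};
  double d;
  static_assert(sizeof p == sizeof d, "two 32-bit words form one double");
  std::memcpy(&d, p, sizeof d); // bit-level reinterpretation, like transfer()
  std::cout << d << '\n';       // prints 3.14159... = acos(-1.0)
  return 0;
}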

My hypothesis: allocation of the arrays and their initialization to zero is the main source of the difference. The Fortran sourced allocation is translated into a call to calloc, whereas the C++ initialization calls operator new followed by memset.

To test this hypothesis, I removed the maximum-element search and just printed the first value. The assembly remains mostly the same; only the search loop has disappeared: Compiler Explorer

The time difference, however, remains:

$ g++-11 main.cpp -O2
$ time ./a.out
0

real	0m12.384s
user	0m5.867s
sys	    0m6.456s
$ gfortran main.f90 -O2
$ time ./a.out
   0.0000000000000000     

real	0m0.225s
user	0m0.002s
sys	    0m0.008s

For a further explanation see: Why malloc+memset is slower than calloc?
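
A more direct check on the C++ side (my own sketch, not part of the original comparison) is to bypass operator new and take the zeroed buffer straight from calloc; if the hypothesis is right, this variant should run at roughly the Fortran program’s speed:

#include <cstdlib>
#include <cmath>
#include <algorithm>
#include <iostream>

int main()
{
  long n = 3000000000L;
  // calloc can hand back freshly mapped, copy-on-write zero pages,
  // so no explicit memset pass over the 24 GB is required.
  double *x = static_cast<double *>(std::calloc(n, sizeof(double)));
  if (!x) return 1;
  x[999] = std::acos(-1.0);
  std::cout << *std::max_element(x, x + n) << '\n';
  std::free(x);
  return 0;
}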


n = 3 * 10**9 floating-point array elements (8 bytes each) take ~24 GB, so swapping probably occurs if the physical memory is insufficient?
(On my old Mac Mini, the speeds of the two codes became very similar when I compiled them with -O3 and decreased the size to n = 3 * 10**8, so that the array fits in memory (a few GB).)

I am not sure what allocate(x(n), source = 0d0) does for memory allocation (probably it reserves 24 GB), but if you try the following code it is much faster: only the single memory page containing x(1000) is committed to the array, while the rest stays virtual, waiting to be used.
I am not sure what maxval(x) does for untouched memory pages, though.

The reason why C++ is much slower could be how memory is being allocated to the program.

program main
  real*8, external :: delta_seconds
  real*8 dt
  integer(8) :: n
  real(8), allocatable :: x(:)
  dt = delta_seconds()
  n = 3d9
  allocate(x(n))   ! , source = 0d0)
  x(1000) = acos(-1d0)
  dt = delta_seconds()
  print *, maxval(x), dt, size (x,kind=8)
  dt = delta_seconds()
  print *, dt
end program

real*8 function delta_seconds()
  integer*8 :: start = 0, rate = -1, finish
  ! start and rate are initialized once and retain their values between calls
  if (rate < 0) call system_clock(start, rate)
  call system_clock(finish)
  delta_seconds = dble(finish - start) / dble(rate)
  start = finish
end function delta_seconds

From running this program as is, it appears that gfortran initialises x to zero (the compiler default) as memory pages are committed to the array (i.e. during maxval(x)).

I’ve just tried a bit more on machines with larger memory (so that no swapping occurs, unlike on the Mac Mini), and the difference between C++ and Fortran remains. Specifically, on a Linux machine with 64 GB of memory, the timing is:

[ g++-10 -O3 ]
real 0m15.643s
user 0m5.180s
sys  0m10.428s

[ gfortran-10 -O3 ]
real 0m6.943s
user 0m4.625s
sys  0m2.317s

Looking at the htop output, the RES entry shows ~2 GB for gfortran, while it shows ~22.4 GB for g++. (In both cases, VIRT becomes ~22.4 GB.) So I guess Ivan’s hypothesis above holds here…?

For comparison, I’ve also tried the same codes on an M1 Mac (32 GB memory), which gives:

[ g++-10 -O3 ]
real  0m7.309s
user  0m4.272s
sys   0m2.949s

[ gfortran-10 -O3 ]
real  0m6.057s
user  0m3.280s
sys   0m2.594s

In this case the difference is smaller than on the Linux box above. This may be because the M1’s memory (DDR5) is better than the Linux one (DDR3, a bit old, if I remember correctly…).

RE John’s code above:

  allocate(x(n))   ! , source = 0d0)
  x(1000) = acos(-1d0)
  print *, maxval(x), dt, size (x,kind=8)

isn’t it invalid (i.e. not standard-conforming) to access array elements that have never been assigned a value? (assuming maxval() accesses all elements of x(1:n))


My Windows gfortran (version 11.1) automatically initialises allocated arrays.

The purpose of my code was to test whether allocate(x(n), source = 0d0) takes more time than allocate(x(n)), i.e. to identify when physical memory is allocated to the array.
I would suggest that these three tests (the two Fortran allocate options plus the C++ test) serve to identify how much time the allocation of physical memory takes, and when it occurs.

It would be nice to think that Fortran does a better job of arranging memory allocation.

Also, it is preferable to report wall-clock times, not the CPU times that user/sys report.


It would be interesting to try linking against TCMalloc instead of the system C/C++ libraries. I’m curious whether TCMalloc can also be used, indirectly, from a Fortran executable.
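
One way to try this without rebuilding anything (a sketch, assuming a Linux system with gperftools installed; the library name and path vary by distribution) is to preload it, which swaps the malloc/calloc family underneath any executable, Fortran ones included:

$ gfortran main.f90 -O2
$ LD_PRELOAD=/usr/lib/libtcmalloc.so ./a.out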


I think we use(d) tcmalloc in Fluidity, although I am not sure if it links against the Fortran side of things or the C side.

The main disadvantage of using ALLOCATE with OpenMP is that allocate draws from a single heap pool; perhaps TCMalloc tries to address this problem? An alternative would be the capability to allocate onto a new memory page whenever the previous allocation came from a different thread (not yet an option, unfortunately).

This performance problem of sharing memory pages between threads is only identifiable for small arrays, which could instead be made local/automatic and so placed on the different thread stacks. It is then preferable to use ALLOCATE for large arrays, where the sharing of memory pages between threads becomes less of a problem. Another way of mitigating the problem is to allocate arrays whose size is a multiple of the memory page size (4 KB), i.e. increasing the array size; this appears to help.

In practice, with large allocated arrays, it is difficult to separate the inefficiency of a shared heap from all the other problems, especially the variability of core performance. My latest multi-thread AMD performance issues appear to relate to variable core clock speeds, not to memory sharing between threads (where I do adjust allocated array sizes).

In Fortran, using alternatives to ALLOCATE can introduce added complexity to the array allocation, especially for OpenMP. In practice, PRIVATE arrays can easily be moved between heap and stack by using a subroutine call inside the !$OMP region to control how these arrays are defined, i.e. automatic/local vs allocated; see the sketch below.
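
A minimal sketch of that last technique (a hypothetical example, not code from an actual project): calling a subroutine inside the parallel region makes work an automatic array on each thread’s private stack, instead of an ALLOCATE on the shared heap. Compile with -fopenmp:

program private_array_demo
  implicit none
  integer :: i
  !$omp parallel do
  do i = 1, 16
     call do_work(4096)
  end do
  !$omp end parallel do
contains
  subroutine do_work(n)
    integer, intent(in) :: n
    real(8) :: work(n)   ! automatic array: lives on the calling thread's stack
    work = 0d0
    ! ... compute with work: no heap traffic, no pages shared between threads
  end subroutine do_work
end program private_array_demo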

I thought Fortran code was compiled directly to assembly. So you are saying Fortran code is first translated to C?

malloc and calloc are standard C library routines; any POSIX-compliant operating system provides them.

As for how compilers work, many of them do compile to some low-level intermediate language, or to a sequence of languages, rather than directly to assembly code. Some compilers, including NAG’s I think, compile to C. gfortran compiles several high-level languages, including Fortran, C, and C++, to its own intermediate language before finally generating the assembly instructions. The LLVM compilers do something similar. This is what allows compilers to be portable across various combinations of hardware and operating systems. Optimizations can be performed at every step of this process; you will sometimes see the terms “front-end” and “back-end” optimization. It might well be in the back end, within the intermediate language, where the decision is made whether to emit a malloc or a calloc call.
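
Whether the calloc call is produced by the Fortran front end or introduced later can be checked directly by dumping gfortran’s intermediate representation (a sketch, assuming GCC; the dump file name, something like main.f90.005t.original, varies between versions, and -fdump-tree-all shows every later pass in case calloc only appears there):

$ gfortran -c -fdump-tree-original main.f90
$ grep -n "calloc\|malloc" main.f90.*.original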

No. If you look at the example in Compiler Explorer, you will see:

MAIN__:
        push    rbp
        mov     esi, 1
        movabs  rbp, 24000000000
        push    rbx
        mov     rdi, rbp
        sub     rsp, 552
        call    calloc                    #call to system calloc
        test    rax, rax
        ...

In this particular case, the system calloc implementation did better than C++'s operator new + memset. But as others have shown, you cannot generalize this to all platforms (OS’s, architectures).