I’m assuming printing is equally expensive in both run-time libraries.
My hypothesis: allocating the arrays and initializing them with zero is the main source of the difference. The Fortran sourced allocation gets translated to a single call to calloc; the C++ initialization, on the other hand, calls operator new and then memset.
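For illustration, here is a minimal Fortran sketch of the two zero-initialization paths this hypothesis contrasts (the program and its names are mine; with n = 3 * 10**9 it needs ~24 GB of address space):

program zero_paths
   implicit none
   integer(8), parameter :: n = 3000000000_8
   real(8), allocatable :: x(:)

   ! Variant 1: sourced allocation. gfortran can lower this to a single
   ! calloc, and the OS may satisfy calloc with lazily zeroed pages, so
   ! little physical memory is committed until the array is first touched.
   allocate(x(n), source = 0d0)
   print *, x(1)
   deallocate(x)

   ! Variant 2: plain allocation plus an explicit zeroing assignment. The
   ! assignment stores to every element, faulting in all ~24 GB up front,
   ! analogous to C++'s operator new followed by memset.
   allocate(x(n))
   x = 0d0
   print *, x(1)
end program zero_paths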
To test this hypothesis, I removed the maximum-element search and just printed the first value. The assembly remains mostly the same; only the search loop has disappeared: Compiler Explorer
The time difference however remains:
$ g++-11 main.cpp -O2
$ time ./a.out
0
real 0m12.384s
user 0m5.867s
sys 0m6.456s
$ gfortran main.f90 -O2
$ time ./a.out
0.0000000000000000
real 0m0.225s
user 0m0.002s
sys 0m0.008s
n = 3 * 10**9 floating-point array elements (8 bytes each) take ~24 GB, so swapping probably occurs if the memory is insufficient?
(On my old Mac Mini, the speeds of the two codes became very similar when I compiled them with -O3 and decreased the size to n = 3 * 10**8, so that the array fits in memory (a few GB).)
I am not sure what "allocate(x(n), source = 0d0)" will do for memory allocation (probably reserve 24 GB of address space), but if you try the following code, it is much faster: only the single memory page containing x(1000) is physically allocated, while the rest of the array is still virtual, waiting to be used.
Not sure what maxval(x) does for memory pages that have never been touched?
The reason why C++ is much slower could be how physical memory is committed to the program.
program main
   real*8, external :: delta_seconds
   real*8 dt
   integer(8) :: n
   real(8), allocatable :: x(:)

   dt = delta_seconds()
   n = 3d9
   allocate(x(n))   ! , source = 0d0)
   x(1000) = acos(-1d0)   ! touch just one page of the array
   dt = delta_seconds()
   print *, maxval(x), dt, size(x, kind=8)
   dt = delta_seconds()
   print *, dt
end program

! returns the wall-clock seconds elapsed since the previous call
real*8 function delta_seconds()
   integer*8 :: start = 0, rate = -1, finish   ! initialised => implicitly SAVEd
   if (rate < 0) call system_clock(start, rate)
   call system_clock(finish)
   delta_seconds = dble(finish - start) / dble(rate)
   start = finish
end function delta_seconds
From running this program as is, it appears that x reads back as zero as memory pages are allocated (during maxval(x)); most likely the OS supplies zero-filled pages on first touch, rather than gfortran initialising the array itself.
I’ve just tried a bit more on machines with larger memory (so that no swapping occurs as on the Mac Mini), and the difference between C++ and Fortran remains… Specifically, on a Linux machine with 64 GB of memory, the timing is:
[ g++-10 -O3 ]
real 0m15.643s
user 0m5.180s
sys 0m10.428s
[ gfortran-10 -O3 ]
real 0m6.943s
user 0m4.625s
sys 0m2.317s
Looking at the htop output, the RES entry shows ~2 GB for gfortran, while it shows ~22.4 GB for g++. (In both cases, VIRT becomes ~22.4 GB.) So I guess Ivan’s hypothesis above holds here…?
For comparison, I’ve also tried the same codes with M1 Mac (32 GB mem), which gives
[ g++-10 -O3 ]
real 0m7.309s
user 0m4.272s
sys 0m2.949s
[ gfortran-10 -O3 ]
real 0m6.057s
user 0m3.280s
sys 0m2.594s
In this case the difference is smaller than on the Linux box above. This may be because the M1’s memory (LPDDR4X) is faster than the Linux one’s (DDR3, a bit old, if I remember correctly…).
Isn’t it invalid (= not standard-conforming) to access array elements that have never been assigned a value? (Assuming maxval() accesses all elements of x(1:n).)
My Windows gfortran (version 11.1) appears to auto-initialise allocated arrays.
The purpose of my code was to test whether "allocate(x(n), source = 0d0)" took more time than "allocate(x(n))", i.e. to try to identify when physical memory is actually allocated to the array.
I would suggest that these three tests (the two Fortran allocate options plus the C++ test) together identify how much time the allocation of physical memory takes, and when it happens.
It would be nice to think that Fortran does a better job of arranging memory allocation.
Also, it is preferable to report wall-clock times, not the CPU-time classes that user/sys suggest.
It would be interesting to try linking against TCMalloc instead of the system C/C++ libraries. I’m curious whether TCMalloc can also be used indirectly from a Fortran executable.
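Since gfortran’s ALLOCATE ultimately calls the C library’s malloc/calloc, one way to try this without changing the source (a sketch; the library name and path depend on your gperftools installation) would be to preload it:
$ gfortran main.f90 -O2
$ LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libtcmalloc.so.4 ./a.out
or to link it in directly:
$ gfortran main.f90 -O2 -ltcmalloc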
The main disadvantage of using ALLOCATE with OpenMP is that allocate draws from a single heap pool shared by all threads.
Perhaps TCMalloc tries to address this problem?
An alternative to address this problem could be the ability to allocate onto a new memory page whenever the previous allocate was for a different thread (not yet an option).
This performance problem of sharing memory pages between threads is only identifiable for small arrays, which could instead be made local/automatic and so placed on the different threads’ stacks.
It is then preferable to ALLOCATE large arrays, where the sharing of memory pages between threads becomes less of a problem.
Another way of mitigating the problem is to allocate arrays whose size is a multiple of the memory page size (4 KiB), i.e. increasing the array size; this appears to help (see the sketch below).
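For illustration, a minimal sketch of the padding idea (assuming 4 KiB pages, and that the allocator returns page-aligned blocks for such sizes; the names are mine):

program pad_to_page
   implicit none
   integer(8), parameter :: per_page = 4096_8 / 8_8      ! real(8) elements per 4 KiB page
   integer(8) :: n, n_pad
   real(8), allocatable :: x(:)

   n = 1000_8                                            ! requested length
   n_pad = ((n + per_page - 1) / per_page) * per_page    ! round up to whole pages
   allocate(x(n_pad))                                    ! 1024 elements = exactly two pages
   print *, n, n_pad
end program pad_to_page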
In practice, with large allocated arrays, it is difficult to separate the inefficiency of a shared heap from all the other problems, especially the variability of core performance.
My latest multi-threaded AMD performance issues appear to relate to variable core clock speeds, not to memory sharing between threads (where I do adjust allocated array sizes).
In Fortran, using alternatives to ALLOCATE can add complexity to array allocation, especially with OpenMP.
In practice, PRIVATE arrays can easily be moved between heap and stack by using a subroutine call inside the !$OMP region to control how these arrays are declared, i.e. automatic/local vs allocatable; a sketch follows.
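A minimal sketch of that approach (compile with -fopenmp; the names are mine). Unsaved local variables of a procedure called inside a parallel region are private to each calling thread, so the subroutine chooses whether each thread’s scratch array lives on the stack (automatic) or on the heap (allocatable):

program omp_scratch
   use omp_lib
   implicit none
   !$omp parallel
   call do_work(1000)
   !$omp end parallel
contains
   subroutine do_work(m)
      integer, intent(in) :: m
      real(8) :: a_stack(m)               ! automatic array: on this thread's stack
      real(8), allocatable :: a_heap(:)   ! allocatable: a per-thread heap block
      allocate(a_heap(m))
      a_stack = real(omp_get_thread_num(), 8)
      a_heap  = a_stack
      ! a_heap is deallocated automatically on return
   end subroutine do_work
end program omp_scratch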
malloc and calloc are C library routines, not operating-system calls as such. Any POSIX-compliant operating system will provide them.
As for how compilers work, many of them compile to some low-level intermediate language, or to a sequence of languages, rather than directly to assembly code. Some compilers, including NAG I think, compile to C. GCC compiles several high-level languages, including Fortran, C, and C++, to its own intermediate language before finally generating the assembly instructions. The LLVM compilers do something similar. This is what allows compilers to be portable across various combinations of hardware and operating systems. Optimizations can be performed at all steps of this process. You will sometimes see the terms “front-end” and “back-end” optimization. It might well be in the back end, within the intermediate language, where the decision is made whether to emit a malloc or a calloc call.
No. If you look at the example in Compiler Explorer, you will see:
MAIN__:
        push    rbp
        mov     esi, 1
        movabs  rbp, 24000000000
        push    rbx
        mov     rdi, rbp
        sub     rsp, 552
        call    calloc        # call to the system calloc
        test    rax, rax
        ...
In this particular case, the system calloc implementation did better than C++’s operator new + memset. But as others have shown, you cannot generalize this to all platforms (OSes, architectures).