This is an interesting one, actually there is nothing wrong with the Fortran code, but maybe the way it is used. Check the repository and you will notice it is used from Python. Now the question which interested me more than why Julia is faster, is how bad can it be to call Fortran from Python.
I took the original code from here made a trial run on my machine (nothing fancy, Intel Sky Lake i7-6500U CPU @ 2.50GHz):
❯ python test_f2py.py
1000 0.02498769760131836
2000 0.10134744644165039
3000 0.2162013053894043
4000 0.3785884380340576
5000 0.5802724361419678
6000 0.828559398651123
7000 1.1193299293518066
8000 1.4555211067199707
9000 1.8339998722076416
10000 2.9962995052337646
Maybe the implementation was not optimal, taking Ondřej’s suggestions to check if we can improve with this one here:
! test-f2py.f90
module mymodule
use omp_lib
implicit none
contains
function eikR(a,k,n) result(M)
integer, intent(in) :: n
real(kind=8), intent(in) :: a(n)
complex(kind=8), intent(in) :: k
complex(kind=8) :: M(n,n)
real(8) :: Y
integer :: i, j
!$omp parallel do simd private(J, Y) collapse(2)
do j=1,n
do i=1,n
Y = k*sqrt(a(i)*a(i) + a(j)*a(j))
M(i,j) = cmplx(cos(Y), sin(Y))
end do
end do
end function eikR
end module
Indeed with a hand optimized implementation we get faster, but unfortunately not by much
❯ python test_f2py.py
1000 0.026028871536254883
2000 0.10151910781860352
3000 0.21696782112121582
4000 0.3690149784088135
5000 0.5648448467254639
6000 0.8063969612121582
7000 1.1044635772705078
8000 1.7926850318908691
9000 2.2729897499084473
10000 2.574355125427246
The actual question is how bad can the f2py wrapping and the usage in Python actually be, here is an equivalent Fortran main
! main.f90
use mymodule
use omp_lib
implicit none
integer, parameter :: dp = selected_real_kind(15)
real(dp), parameter :: pi = 4*atan(1.0_dp)
integer :: i, j, n
real(dp), allocatable :: a(:)
complex(dp) :: k
complex(dp), allocatable :: m(:, :)
real(dp) :: t1, t2
do i = 1, 10
n = 1000*i
a = [(2*j*pi, j = 0, n-1)]
k = cmplx(100.0_dp, 1.0_dp)
t1 = omp_get_wtime()
m = eikr(a, k, n)
t2 = omp_get_wtime()
print *, n, t2 - t1
end do
end
Using the same compile flags as for the f2py module we get a surprising outcome:
❯ gfortran -O3 -march=native -funroll-loops -fopenmp -ffast-math -floop-nest-optimize test-f2py.f90 main.f90
❯ ./a.out
1000 2.1003986003051978E-002
2000 6.6137752000940964E-002
3000 0.13244638699688949
4000 0.24651855599950068
5000 0.35877340099978028
6000 0.51193849599803798
7000 0.69200517200079048
8000 0.89750151500629727
9000 1.1269864560017595
10000 1.3638167019962566
It’s interesting to see that the f2py wrapper costs almost as much as the actual calculation we are performing.