Interacting with Python -- f2py or ctypes?

I just discovered this forum recently, and I'm very glad there is a community to discuss Fortran.

Most of the time I use Python for visualization or some simple calculations. Of course I sometimes run into performance problems, and I wonder whether I can call Fortran subroutines or functions from Python to solve them.

I have some experience using f2py and ctypes to call Fortran subroutines/functions from Python, but I'm not sure which one is faster.
I wrote some code to compare them and found that the performance is similar, with f2py slightly ahead. But this is only one case, so I'm not sure whether f2py is always better than ctypes.

Does anyone have experience with this? Or perhaps know of another tool that is better than f2py and ctypes?
I chose f2py and ctypes simply because I didn't need to install anything else.


p.s.
If needed, here is the code I used to compare f2py and ctypes. It calculates the vertical component of vorticity.
BTW, the pure NumPy solution is the fastest in my test… I am not sure whether the cost of calling an external library slows down the Fortran subroutine calls?

p.s.2
I'm not sure whether this topic is suitable here. If not, I'll delete this post as soon as possible.

! f2py -c solve_f2py.f90 -m solvef90f2py

subroutine vorticity_f90loop(nx, ny, u, v, dx, dy, vort)
    implicit none
    integer, intent(in) :: nx, ny
    real*8, intent(in) :: u(ny,nx), v(ny,nx)
    real*8, intent(in) :: dx, dy
    real*8, intent(out) :: vort(ny,nx)
    
    real*8 :: dx2, dy2
    integer :: i, j
    ! ---------------------------------------------

    dx2 = 2. * dx
    dy2 = 2. * dy
    vort(:,:) = 0.

    do i = 2, nx-1
        do j = 2, ny-1
            vort(j,i) = (v(j,i+1)-v(j,i-1)) / dx2 - (u(j+1,i)-u(j-1,i)) / dy2
        end do
    end do

    return
end subroutine vorticity_f90loop
! gfortran -shared -fPIC solve_ctypes.f90 -o solvef90ctypes.so

subroutine vorticity_f90loop(nx, ny, u, v, dx, dy, vort) bind(c, name='vorticity_f90loop')
    use iso_c_binding, only : c_double, c_int
    implicit none

    integer(c_int), intent(in) :: nx, ny
    real(c_double), intent(in) :: u(ny,nx), v(ny,nx)
    real(c_double), intent(in) :: dx, dy
    real(c_double), intent(out) :: vort(ny,nx)   ! filled by this routine; allocated by the caller
    
    real(c_double) :: dx2, dy2
    integer(c_int) :: i, j
    ! ---------------------------------------------

    dx2 = 2. * dx
    dy2 = 2. * dy
    vort(:,:) = 0.

    do i = 2, nx-1
        do j = 2, ny-1
            vort(j,i) = (v(j,i+1)-v(j,i-1)) / dx2 - (u(j+1,i)-u(j-1,i)) / dy2
        end do
    end do

    return
end subroutine vorticity_f90loop
# solve.py

import time
import ctypes as ct
import numpy as np
import solvef90f2py as f90f2py

def vorticity_py(u, v, dx, dy):
    ny = len(u)
    nx = len(u[0])
    vort =[[0 for _ in range(nx)] for _ in range(ny)]
    
    dx2 = dx * 2
    dy2 = dy * 2
    
    for j in range(1, ny-1):
        for i in range(1, nx-1):
            vort[j][i] = (v[j][i+1] - v[j][i-1]) / dx2 - (u[j+1][i] - u[j-1][i]) / dy2
            
    return vort

def vorticity_np(u, v, dx, dy):   
    ny, nx = u.shape
    vort = np.zeros((ny, nx))
    vort[1:-1,1:-1] = (v[1:-1,2:] - v[1:-1,:-2]) / (2*dx) - (u[2:,1:-1] - u[:-2,1:-1]) / (2*dy)
    return vort

def solve_f90ctypes_loop(nx, ny, u, v, dx, dy):   
    f90lib = ct.CDLL('solvef90ctypes.so')
    f90lib.vorticity_f90loop.argtypes = [
        ct.POINTER(ct.c_int),
        ct.POINTER(ct.c_int),
        np.ctypeslib.ndpointer(dtype=np.float64, ndim=2, shape=(ny, nx)),
        np.ctypeslib.ndpointer(dtype=np.float64, ndim=2, shape=(ny, nx)),
        ct.POINTER(ct.c_double),
        ct.POINTER(ct.c_double),
        np.ctypeslib.ndpointer(dtype=np.float64, ndim=2, shape=(ny, nx))
    ]
    f90lib.vorticity_f90loop.restype = None

    u = np.asfortranarray(u)
    v = np.asfortranarray(v)
    vort = np.empty_like(u, dtype=np.float64, order='F')
    
    f90lib.vorticity_f90loop(
        ct.byref(ct.c_int(nx)), 
        ct.byref(ct.c_int(ny)), 
        u, 
        v, 
        ct.byref(ct.c_double(dx)), 
        ct.byref(ct.c_double(dy)), 
        vort
    )
    return vort
    
if __name__ == '__main__':
    # prepare data
    dx = 0.5
    dy = 0.5
    ny = 1000
    nx = 1000
    
    u = np.random.rand(ny, nx)
    v = np.random.rand(ny, nx)
    u_lst = u.tolist()
    v_lst = v.tolist()
    
    # pure python
    t1 = time.time()
    vort_py = vorticity_py(u_lst, v_lst, dx, dy)
    t2 = time.time()
    print(f'(python)          time = {t2-t1:.8f}')
    
    # numpy
    t1 = time.time()
    vort_np = vorticity_np(u, v, dx, dy)
    t2 = time.time()
    print(f'(numpy)           time = {t2-t1:.8f}')
    
    # f2py - loop   
    t1 = time.time()
    vort_f90loop = f90f2py.vorticity_f90loop(u, v, dx, dy)
    t2 = time.time()
    print(f'(f90+f2py, loop)  time = {t2-t1:.8f}')
     
    # ctypes - loop
    t1 = time.time()
    vort_ctloop = solve_f90ctypes_loop(nx, ny, u, v, dx, dy)
    t2 = time.time()
    print(f'(f90+ct, loop)    time = {t2-t1:.8f}')
    
    # check result
    print()
    print(vort_py[2][:5])
    print(vort_np[2,:5])
    print(vort_f90loop[2,:5])
    print(vort_ctloop[2,:5])

Execution result:

$ python solve.py
(python)          time = 0.32275057
(numpy)           time = 0.01257682
(f90+f2py, loop)  time = 0.01558447
(f90+ct, loop)    time = 0.01573086

[0, -0.5746908151001381, -0.4502448148151549, 0.4887173001423759, 0.17991335715166268]
[ 0.         -0.57469082 -0.45024481  0.4887173   0.17991336]
[ 0.         -0.57469082 -0.45024481  0.4887173   0.17991336]
[ 0.         -0.57469082 -0.45024481  0.4887173   0.17991336]
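
Regarding the p.s. above: to check whether the per-call overhead is what slows down the Fortran versions, one thing I could try (a rough sketch, not run yet, appended to the end of solve.py so it reuses the arrays and functions defined there) is averaging many repeated calls with timeit instead of timing a single call:

# (sketch, not run) average over many calls so that any one-time call
# overhead is separated from the per-call cost
import timeit

n_repeat = 100
cases = [
    ('numpy',      lambda: vorticity_np(u, v, dx, dy)),
    ('f90+f2py',   lambda: f90f2py.vorticity_f90loop(u, v, dx, dy)),
    ('f90+ctypes', lambda: solve_f90ctypes_loop(nx, ny, u, v, dx, dy)),
]
for name, func in cases:
    t = timeit.timeit(func, number=n_repeat) / n_repeat
    print(f'({name}) mean time over {n_repeat} calls = {t:.8f}')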
3 Likes

Thanks for the question @CYehLu and welcome to the forum! Yes, your question is on topic here; I am glad you asked. I think the f2py approach should be faster. Another fast approach is using Cython or the C API directly; see here for how to do it:

As long as you do not copy the arrays and instead access their data directly, the only performance overhead is in converting the array descriptor and the interface. Creating a shared library (the f2py, Cython, or C API approaches) is faster, as you just import the module in Python. The ctypes approach has to execute Python code first to create the Python interface (but it has the advantage that you don't have to compile anything).
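
To make the "no copy" point concrete, here is a small sketch (assuming the solvef90f2py module built above): f2py will quietly make a Fortran-ordered copy of an intent(in) array that is not already Fortran-contiguous with the right dtype, so preparing the inputs with np.asfortranarray lets the wrapper access the data directly.

import numpy as np
import solvef90f2py as f90f2py

ny, nx = 1000, 1000
dx = dy = 0.5
u = np.random.rand(ny, nx)          # C-ordered by default
v = np.random.rand(ny, nx)

# C-ordered input: the f2py wrapper has to create Fortran-ordered copies first.
vort_copy = f90f2py.vorticity_f90loop(u, v, dx, dy)

# Fortran-ordered input with the right dtype: the data is passed through directly.
uf = np.asfortranarray(u)
vf = np.asfortranarray(v)
vort_direct = f90f2py.vorticity_f90loop(uf, vf, dx, dy)

print(np.allclose(vort_copy, vort_direct))   # same values either way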

3 Likes

Thanks for the reply! Very helpful.
I have no experience using the C API to call C functions from Python. I found this example; is that the way to interact with Fortran subroutines/functions through the C API?

Also, I don't fully understand the explanation you gave. You mentioned that "The ctypes approach has to execute the Python code first to create the Python interface (but it has the advantage that you don't have to compile anything)."
But I think that compiling the Fortran code into a shared library (a *.so file) is necessary for both f2py and ctypes. That is, I still need

$ gfortran -shared -fPIC test.f90 -o test.so

to compile my Fortran code before using ctypes in Python. Is there anything I have misunderstood?

Very grateful for your reply!

I think so. I personally prefer to use Cython, which essentially generates such a C API file for you.

Yes, you still have to call gfortran to compile your Fortran code no matter what. But with ctypes that is all you have to compile, while with Cython you also have to compile the generated C file with gcc to create an .so module that can be imported. Also, if you compile the C module, then typically you would compile the Fortran test.f90 to an object file and just link it together with the C wrapper into one shared library that you then import from Python.
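
Roughly, that route might look like this (just a sketch with hypothetical file names; wrapper.pyx would be a Cython file declaring the bind(c) Fortran routine as an external C function, and the exact link flags can vary by platform):

$ gfortran -c -fPIC test.f90 -o test.o                                        # Fortran -> object file
$ cython wrapper.pyx                                                          # wrapper.pyx -> wrapper.c
$ gcc -shared -fPIC wrapper.c test.o $(python3-config --includes) -lgfortran -o wrapper.so
$ python -c "import wrapper"                                                  # one importable module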

2 Likes

This thread should answer some of your questions; we started writing up some of the details for the different approaches, but there is no final guide yet.

1 Like

Thanks to @certik and @awvwgk!
Very helpful. I think I'll continue to use f2py as my Fortran subroutine/function wrapper.
Other tools like pyccel seem very interesting, and I think I'll give them a try.
Thanks again; I'm very glad I came here.

1 Like

@CYehLu
Did you try Numba? I have slightly modified your code (I assume that using NumPy arrays instead of lists is allowed in the "pure" Python version).

import numpy as np
from numba import njit

def vorticity_py(u, v, dx, dy):
   ny, nx = u.shape
   vort = np.zeros((ny, nx))
   dx2 = dx * 2
   dy2 = dy * 2 
   for j in range(1, ny-1):
       for i in range(1, nx-1):
           vort[j,i] = (v[j,i+1] - v[j,i-1]) / dx2 - (u[j+1,i] - u[j-1,i]) / dy2       
   return vort

def vorticity_np(u, v, dx, dy):   
   ny, nx = u.shape
   vort = np.zeros((ny, nx))
   dx2 = dx * 2
   dy2 = dy * 2 
   vort[1:-1,1:-1] = (v[1:-1,2:] - v[1:-1,:-2]) / dx2 - (u[2:,1:-1] - u[:-2,1:-1]) / dy2
   return vort

@njit(parallel = True)
def vorticity_jit(u, v, dx, dy):
   ny, nx = u.shape
   vort = np.zeros((ny, nx))
   dx2 = dx * 2
   dy2 = dy * 2
   for j in range(1, ny-1):
       for i in range(1, nx-1):
           vort[j,i] = (v[j,i+1] - v[j,i-1]) / dx2 - (u[j+1,i] - u[j-1,i]) / dy2          
   return vort

# prepare data
dx = 0.5
dy = 0.5
ny = 1000
nx = 1000
u = np.random.rand(ny, nx)
v = np.random.rand(ny, nx)

res = vorticity_py(u,v,dx,dy)
print(res[1,1])
res = vorticity_np(u,v,dx,dy)
print(res[1,1])
res = vorticity_jit(u,v,dx,dy)
print(res[1,1])

Results:
%timeit vorticity_py(u,v,dx,dy)
1.24 s ± 10.2 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit vorticity_np(u,v,dx,dy)
16.5 ms ± 306 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%timeit vorticity_jit(u,v,dx,dy)
9.47 ms ± 175 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

1 Like

Hi @Dobrodzieju
Yes, I have tried Numba, and I sometimes use it to speed up my Python code.
But sometimes a function will be executed only once (though it takes a lot of time), and then Numba is not very suitable because it has to compile the function the first time I execute it.
I just learned that Numba also offers AOT (ahead-of-time) compilation, but I have not tested it; a rough sketch of what that might look like is below.
Also, sometimes I need more speedup than Numba can offer, so I have to look for a way to call Fortran/C functions from Python.
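
For reference, the AOT route would look roughly like this (a sketch I have not run, based on the numba.pycc interface, with a hypothetical module name):

import numpy as np
from numba.pycc import CC

# Compile the function ahead of time into an importable extension module,
# so there is no JIT warm-up the first time it is called.
cc = CC('vorticity_aot')    # hypothetical module name

@cc.export('vorticity', 'f8[:,:](f8[:,:], f8[:,:], f8, f8)')
def vorticity(u, v, dx, dy):
    ny, nx = u.shape
    vort = np.zeros((ny, nx))
    dx2 = dx * 2
    dy2 = dy * 2
    for j in range(1, ny-1):
        for i in range(1, nx-1):
            vort[j, i] = (v[j, i+1] - v[j, i-1]) / dx2 - (u[j+1, i] - u[j-1, i]) / dy2
    return vort

if __name__ == '__main__':
    cc.compile()            # run once to build the vorticity_aot extension module

# Afterwards:
#   import vorticity_aot
#   vort = vorticity_aot.vorticity(u, v, 0.5, 0.5)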

I still appreciate your Numba example! By the way, your Numba version is much faster, but I am curious about the performance when parallel is turned off.
(I don't have a computer to test it right now…)