FFT in Fortran vs. C++

Hi there!

I’m looking for some comparison between speed and reproducibility of FFT packages in Fortran and C++. I would like to know use cases where one would prefer to use FFTW or fftpack.

Context:

I’m trying to implement a Runge-Kutta type method to numerically solve the nonlinear schrodinger equation (for laser pulse propagation). This particular method simply a combination of the traditional RK4 method and the split-step Fourier method. At every step it requires switching back and forth between time and fequency domain, which involves FFT for exponential of matrix (i.e. something like fft(exp(M)))). It involves exactly 8 FFT operations per one timestep. Given, that the timestep has to be small enough compared to the total time of propagation, the FFT steps can quickly become a major computational bottle neck. In this context, I’m trying to figure out it would better to implement this solver using fftpack (in Fortran) or FFTW (in C++).

I’d like to know if anyone has any idea about the same?

Reference: A Fourth-Order Runge–Kutta in the Interaction Picture Method for Simulating Supercontinuum Generation in Optical Fibers

1 Like

FFTW is fully compatible with Fortran and provides an interface. You can also use the Intel MKL Discrete Fourier Transform which has the same layout as FFTW. FFTPACK is itself written in Fortran.

The FFT is usually restricted to specific prime factors. In the case of FFTPACK these are 2, 3 and 5. Both FFTW and MKL can use larger factors (we go up to 7) but with diminishing returns in efficiency.

As far as speed is concerned, despite FFTPACK being very well written, both FFTW and MKL are faster because they use CPU-specific instructions. In our experience FFTW has a slight edge over MKL, but this can depend on the transformation length and dimension.

Here is our Fortran interface to the FFTW complex-to-complex N-dimensional transform (you can find more in the code here):

subroutine zfftifc(nd,n,sgn,z)
implicit none
! arguments
integer, intent(in) :: nd,n(nd),sgn
complex(8), intent(inout) :: z(*)
! local variables
integer, parameter :: FFTW_ESTIMATE=64
integer p
integer(8) plan
real(8) t1
! interface to FFTW version 3
!$OMP CRITICAL(zfftifc_)
call dfftw_plan_dft(plan,nd,n,z,z,sgn,FFTW_ESTIMATE)
!$OMP END CRITICAL(zfftifc_)
call dfftw_execute(plan)
!$OMP CRITICAL(zfftifc_)
call dfftw_destroy_plan(plan)
!$OMP END CRITICAL(zfftifc_)
if (sgn == -1) then
  p=product(n(:))
  t1=1.d0/dble(p)
  call zdscal(p,t1,z,1)
end if
end subroutine
3 Likes

FFTW is not a C++ library. Its C code is autogenerated from OCaml.

Are there any examples of FFT using MKL for the complex N-dimensional transform?

Here’s our interface to the MKL FFT, it accepts the same arguments as the FFTW interface above:

subroutine zfftifc(nd,n,sgn,z)
use mkl_dfti
implicit none
! arguments
integer, intent(in) :: nd,n(nd),sgn
complex(8), intent(inout) :: z(*)
! local variables
integer status,p
real(8) t1
type(DFTI_DESCRIPTOR), pointer :: handle
! interface to the Intel MKL advanced Discreet Fourier Transform (DFT) routines
! (with thanks to Torbjorn Bjorkman)
p=product(n(:))
t1=1.d0/dble(p)
status=DftiCreateDescriptor(handle,DFTI_DOUBLE,DFTI_COMPLEX,nd,n)
status=DftiSetValue(handle,DFTI_FORWARD_SCALE,t1)
status=DftiCommitDescriptor(handle)
if (sgn == -1) then
  status=DftiComputeForward(handle,z)
else
  status=DftiComputeBackward(handle,z)
end if
status=DftiFreeDescriptor(handle)
end subroutine

And here’s the routine for determining the next largest n for a given set of prime factors:

subroutine nfftifc(np,n)
implicit none
! arguments
integer, intent(in) :: np
integer, intent(inout) :: n
! local variables
integer i,j
integer, parameter :: p(10)=[2,3,5,7,11,13,17,19,23,29]
if ((np < 1).or.(np > 10)) then
  write(*,*)
  write(*,'("Error(nfftifc): np out of range : ",I8)') np
  write(*,*)
  stop
end if
if (n <= 0) then
  write(*,*)
  write(*,'("Error(nfftifc): n <= 0 : ",I8)') n
  write(*,*)
  stop
end if
10 continue
i=n
do j=1,np
  do while(mod(i,p(j)) == 0)
    i=i/p(j)
  end do
end do
if (i /= 1) then
  n=n+1
  goto 10
end if
end subroutine
2 Likes

One point to keep in mind is that fftw has a GPL license.
So, it may not be always appropriate.

I had good luck with FFTE: http://www.ffte.jp/, scales great in parallel.

1 Like

looks like there are few GitHub mirrors of FFTE: Repository search results · GitHub

Including the one from @certik, who has constructed a git history with all the previous releases. Great! This looks like a good candidate to modernize (covert to free-form source, add an FPM manifest, etc.)

3 Likes

Yes, I carefully unpacked every release tarball that I was able to find, since the website only lists the latest one. Just today I realized it was already 9 years ago. It feels like yesterday…

I think it’s still developed upstream, so if we do any changes, they will be overridden if upstream releases a new version. @jacobwilliams what would you recommend?

2 Likes

@certik FFTE looks good, do you know any benchmarks for parallelization on intel hardware?

I found this: An Analysis of FFTW and FFTE Performance | SpringerLink

If there’s none, I can try to run some tests to see.

1 Like

Maybe the original author would be interested in learning about GitHub, FPM, and modern Fortran?

Otherwise:

  • Just keep the F77 code as is and make it into an FPM package. That way it would be easy to copy/paste any new release code in there.
  • Fork it and refactor/modernize. Any later upstream changes would need to be manually incorporated if possible.
2 Likes

Yes, I benchmarked it in 2018, I am attaching the results of strong scaling:

Here are the parameters of the cluster:

  • LANL Grizzly:
    • 1.8 PF, 1490 compute nodes, Intel Broadwell
    • 1.2 TF per node
    • 2 x Intel Xeon E5-2695 CPUs (36 cores) per node
    • 128 GB memory per node
    • 1 x Intel Omni-Path IB (100Gb/s)
    • 15.2 PB storage

It had 36 cores per node, I ran 32 MPI ranks on each node and kept increasing the number of nodes.

1 Like

I tried the test

Downloads\ffte-master\ffte-master\tests>make
gfortran -O3 -fomit-frame-pointer -fopenmp -I.. -c test1d.f -o test1d.o
gfortran -O3 -fomit-frame-pointer -fopenmp -I.. -c ../zfft1d.f -o zfft1d.o
gfortran -O3 -fomit-frame-pointer -fopenmp -I.. -c ../fft235.f -o fft235.o
gfortran -O3 -fomit-frame-pointer -fopenmp -I.. -c ../kernel.f -o kernel.o
gfortran -O3 -fomit-frame-pointer -fopenmp -I.. -c ../factor.f -o factor.o
gfortran -O3 -fomit-frame-pointer -fopenmp -I.. test1d.o zfft1d.o fft235.o kernel.o factor.o -o test1d
gfortran -O3 -fomit-frame-pointer -fopenmp -I.. -c test2d.f -o test2d.o
gfortran -O3 -fomit-frame-pointer -fopenmp -I.. -c ../zfft2d.f -o zfft2d.o
gfortran -O3 -fomit-frame-pointer -fopenmp -I.. test2d.o zfft2d.o fft235.o kernel.o factor.o -o test2d
gfortran -O3 -fomit-frame-pointer -fopenmp -I.. -c test3d.f -o test3d.o
gfortran -O3 -fomit-frame-pointer -fopenmp -I.. -c ../zfft3d.f -o zfft3d.o
gfortran -O3 -fomit-frame-pointer -fopenmp -I.. test3d.o zfft3d.o fft235.o kernel.o factor.o -o test3d
gfortran -O3 -fomit-frame-pointer -fopenmp -I.. -c rtest2d.f -o rtest2d.o
rtest2d.f:31:72:

   31 |       CALL DUMP(A,(NX/2+1)*NY)
      |                                                                        1
Error: Type mismatch in argument 'a' at (1); passed REAL(8) to COMPLEX(8)
make: *** [makefile:33: rtest2d.o] Error 1

See here. Try adding -fallow-argument-mismatch to the GFortran options.

1 Like