Performance, C vs. Fortran

I recently found some old code I had used to solve the Poisson equation with a somewhat naive double loop relaxation method on a 2d grid. The C code is as follows:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <math.h>
#include <time.h>

void swap(double a[101][101], double b[101][101], int n)
{
  double swp[101][101];
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      swp[i][j] = a[i][j];
      a[i][j] = b[i][j];
      b[i][j] = swp[i][j];
    }
  }
}
double mmax(double phi[101][101], double phip[101][101], int n)
{
  double max = 0.0;
  double diff = 0.0;
  for (int i = 0; i < n; i++) {
    for (int j = 0; j < n; j++) {
      diff = abs(phi[i][j]-phip[i][j]);
      if (diff > max)
        max = diff;
    }
  }
  return max;
}
double rho(double x, double y)
{
  double s1 = 0.6;
  double e1 = 0.8;
	double s2 = 0.2;
	double e2 = 0.4;

	if (x > s1 && x < e1 && y > s1 && y < e1) {
		return 1.0;
	} else if (x > s2 && x < e2 && y > s2 && y < e2 ) {
		return -1.0;
	} else {
		return 0.0;
	}
}
void run(double toler, double a)
{
  double epsilon0 = 8.85e-12;
  double phi[101][101];
  double phip[101][101];
  memset(phip, 0, sizeof(phip[0][0])*101*101);
  memset(phi, 0, sizeof(phi[0][0])*101*101);

  double delta = 1.0;

  while (delta > toler) {
    for (int i=1; i < 100; i++) {
      for (int j=1; j < 100; j++) {
        phip[i][j] = (phi[i+1][j] + phi[i-1][j] +
                      phi[i][j+1] + phi[i][j-1])/4 +
                      pow(a,2.0)/(4.0 * epsilon0)*rho(i * a, j * a);
      }
    }
    delta = mmax(phi, phip, 101);
    swap(phi, phip, 101);

  }


}

int main(int argc, char *argv[])
{
  clock_t start = clock();
  run(1e-6, 0.01);
  clock_t end = clock();
  double total = ((double)(end - start)) / CLOCKS_PER_SEC;
  printf("Execution time: %f\n",total);
}

I decided to write it in Fortran to see how it does performance wise. And the Fortran code is as follows:

module rhofunc
implicit none
public
integer, parameter :: dp=kind(0.d0)
contains
    pure real(dp) function rho(x,y)
        real(dp), intent(in) :: x,y
        if (x > 0.6_dp .and. x < 0.8_dp .and. y > 0.6_dp .and. y<0.8) then
            rho = 1.0_dp
        else if (x> 0.2_dp .and. x<0.4_dp .and. y>0.2_dp .and. y<0.4_dp) then
            rho = -1.0_dp
        else
            rho = 0.0_dp
        end if
    end function

end module

program poisson
use rhofunc, only: rho
implicit none
integer, parameter :: dp=kind(0.d0), M=100
integer            :: i,j
real(dp),parameter :: epsilon0=8.85e-12_dp, target=1e-6, a=0.01
real(dp)           :: delta, start, end, phiprime(M,M), phi(M,M), temp(M,M), a2


delta = 1.0_dp

call cpu_time(start)

do while (delta > target )
    a2 = a**2.0_dp
    do i=2, M-1
        do j=2, M-1
            phiprime(i,j) = (phi(i+1,j) + phi(i-1,j) + phi(i,j+1) + phi(i,j-1))/4.0_dp &
            + a2/4.0_dp/epsilon0*rho(i*a,j*a)
        end do
    end do
    delta = maxval(abs(phiprime - phi))
    temp = phi
    phi = phiprime 
    phiprime = temp


end do
call cpu_time(end)
print *, end-start
end program

On my machine, the C code runs in 0.12 seconds, but the Fortran code takes 0.37 seconds. Is there a particular reason why the C code is faster here? (I’m able to somewhat improve C performance by doing some unholy pointer tricks when swapping to 0.09 seconds)

Both are single-file, so I simply used gcc/gfortran with the -Ofast optimization flag. Edit: I also ran both with WSL2 on Windows.

2 Likes

I got similar results. Maybe Fortran programs have a longer start-up time – could you increase the dimension of your problem to 1000 or 10000?

Something odd is going on, since if I increase the grid size to 250x250, the C version takes 0.6 seconds to run (about expected, since the increase in iterations is about 6x), whereas the fortran version now takes 30 seconds, a nearly 100x increase in time. This seems extremely odd to me. To test 1000x1000, one would have to allocate on heap for the C version.

Aren’t Fortran arrays column major order, but C arrays are row-major order? If so, this should be an inefficient iteration scheme, compared to C, right? Maybe the compiler takes care of this for you though

4 Likes

It seems the compiler took care of that anyway. Actually, swapping the loop variables around seems to make the code even slower, somehow. I’m pretty sure there must be something trivial I’m missing, because that also makes little sense.

Also, I noted that the Fortran code goes through something like 15x more iterations than the C code, and I don’t see why that happens either. Even accounting for that, though, C is considerably faster per iteration.

I wonder what I’m missing here.

Isn’t phi(:,:) in the Fortran code not initialized to zero before entering the main loop?
Also, it may be useful to check the number of iterations used in the C and Fortran codes (which should be the same or similar if the two calculations are essentially “equivalent”…)

Yeah, I fixed the initialization, but it didn’t change anything. I also ninja edited just as you were about to reply that the Fortran code does many more loops than the C code, but I can’t figure out why. I imagine that it will be something rather stupid in the end. Also, even taking in to account the number of iterations, the C code seems to be faster per iteration.

A difference I see is that abs() in C is returning an int. Would it be possible to rewrite the code using array slicing? That is once you verify it is getting correct results.

1 Like

i can’t check the results, but some code like this.

real(dp) :: rho_mat(M,M)
do i=2, M-1
    do j=2, M-1
        rho_mat(i,j) = rho(i*a,j*a)
    enddo
enddo
do while (delta > target )
    a2 = a**2.0_dp
    phiprime(2:M-1,2:M-1) = (phi(3:M,2:M-1) + phi(1:M-2,2:M-1) + phi(2:M-1,3:M) + phi(2:M-1,1:M-2))/4.0_dp &
            + a2/4.0_dp/epsilon0*rho_mat(2:M-1,2:M-1)
    delta = maxval(abs(phiprime - phi))
    print *, delta
    temp = phi
    phi = phiprime 
    phiprime = temp
enddo

See lower post for an example that is easier to read.

A difference I see is that abs() in C is returning an int.

Oh… :slight_smile: After fixing this (plus some more modifications like declaring arrays as phi(0:M) with M=100), I got the same number of iterations. Interestingly, gfortran seems to give much slower speed with -Ofast than -O3 for this code, for some reason…

(Also, epsilon0 = 8.85e-12 and a = 0.01 may possibly give some numerical issues, because of a**2 / epsilon0 and rho() = a step-like function)? If I change epsilon0 to, say, 8.85e-8, -O3 and -Ofast gave the same number of iterations.)

1 Like

C short-circuits evaluation of the logical expressions in rho but Fortran does not (and there a lot of terms to evaluate). I wonder if that might be a source of some of the performance difference.

Here’s what I eventually ended up with. Maybe you can see what you think. It has close to the same number of iterations by using the new suggested epsilon and putting fabs in the C code.

module rhofunc
implicit none
public
integer, parameter :: dp = kind(0.d0)
contains
pure real(dp) function rho(x, y)
   real(dp), intent(in) :: x, y
   if(x > 0.6_dp .and. x < 0.8_dp .and. y > 0.6_dp .and. y < 0.8) then
      rho = 1.0_dp
   else if(x > 0.2_dp .and. x < 0.4_dp .and. y > 0.2_dp .and. y < 0.4_dp) then
      rho = -1.0_dp
   else
      rho = 0.0_dp
   endif
end function

end module

program poisson
use rhofunc, only: rho
implicit none
integer, parameter :: dp = kind(0.d0), M = 100
integer :: i, j
real(dp), parameter :: epsilon0 = 3.85e-8_dp, target = 1e-6, a = 0.01
real(dp) :: delta, start, end, phiprime(0:M+1, 0:M+1), phi(0:M+1, 0:M+1), a2
integer :: iters
real(dp) :: rho_mat(M, M)

iters = 0
phi = 0.0_dp

delta = 1.0_dp

call cpu_time(start)

do i = 1, M
   do j = 1, M
      rho_mat(i, j) = rho(i*a, j*a)
   enddo
enddo

a2 = a**2.0_dp

do while(.true.)
   phiprime(1:M, 1:M) = (phi(2:M+1, 1:M) + phi(0:M-1, 1:M) + phi(1:M, 2:M+1) + phi(1:M, 0:M-1))/4.0_dp &
                        + a2/4.0_dp/epsilon0*rho_mat
   delta = maxval(abs(phiprime - phi))
   iters = iters + 1
   if(delta > target) then
      phi = phiprime
   else
      exit
   endif
enddo

call cpu_time(end)
print *, iters
print *, end - start
end program

A minor fix to your answer :), in the line:

phiprime(1:M, 1:M) = (phi(2:M+1, 1:M) + phi(0:M-1, 1:M) + phi(1:M, 2:M+1) + phi(1:M, 0:M-1))/4.0_dp &
                        + a2/4.0_dp/epsilon0*rho_mat

you will have exculde the “center part” as well since the original post says

(phi(i+1,j) + phi(i-1,j) + phi(i,j+1) + phi(i,j-1))/4.0_dp + ...

that means phi(i, j) is not included.

1 Like

I was thinking that would be

phi(1:M, 1:M)

Is that not right?

You’re right, I thought the array starts at 1, but it’s phi(0:M + 1)

How does this version compare against C?

Assuming it’s correct, though it doesn’t have the exact same number of iterations, the Fortran version is faster than the C version on my laptop with very simple settings.

gcc -O3 poisson.c -lm && ./a.out
Iterations:, 15515
Execution Time: 0.358088

gfortran -O3 poisson.f90 && ./a.out
Iterations: 15889
Execution Time: 0.22557400000000000

1 Like

Interesting that it’s faster than C on your machine, it’s not on mine. I rewrote the C algorithm to be equivalent to the Fortran one, where the rho array is calculated outside of the loop, and with M=250 it runs in about 16 seconds, whereas @implicitall 's Fortran version takes over 30.

Also, I’m a bit embarassed for using abs instead of fabs in C :smiley:

I only used the original C code with fabs and different epsilon. Feel free to post the program changes and number of iterations.

1 Like

Another candidate for our GitHub - fortran-lang/benchmarks: Fortran benchmarks repository. If anyone is interested in taking the lead on it, please let me know.

1 Like