Complex arithmetic significantly slower? - not really

I have recently been playing with fractals (nothing serious, just some nice examples for my upcoming Fortran Graphics Library). With fractals involving complex numbers I came across a weird fact: at least in GFortran, complex arithmetic seems to be significantly slower than doing the equivalent calculations with real numbers. For example, consider the following image:

This is a tiny part of the Mandelbrot set (often called “the seahorse” - in fact there is an infinite number of those; lots can be seen in the picture, just smaller ones.) The Fortran program that creates this image is simple enough, and becomes even simpler if I use complex numbers. To clarify the setup, I am using either (a) complex numbers with real64 components, or (b) real64 vectors of dimension(2) to imitate the real and imaginary parts of a complex number. The color of each pixel is computed in a loop with up to 6500 iterations (a darker color means more iterations were needed.)
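To make the comparison concrete, here is a minimal sketch of the per-pixel escape-time loop in the complex form; the variable names match the snippets later in this thread, while the sample point and program structure are illustrative, not the actual FGL code:

program escape_time_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: max_iter = 6500
  complex(real64) :: c, z_old, z_new
  integer :: n

  c = cmplx(-0.743_real64, 0.126_real64, kind=real64)  ! an illustrative point near the "seahorse valley"
  z_old = (0.0_real64, 0.0_real64)
  do n = 1, max_iter
    z_new = z_old*z_old + c
    if (abs(z_new) > 2) exit  ! escape test; this turns out to be the costly part, as discussed below
    z_old = z_new
  end do
  print *, "iterations needed:", n
end program escape_time_sketch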

Now, I would expect using complex numbers (native in Fortran) to be noticeably faster, since with real numbers I have to do more calculations in each iteration to get the same result. However, the opposite of what I would expect turns out to be true. In all my experiments, using pairs of real64 numbers is in fact about 2.2 times faster than using complex numbers. I tried many other cases; the “magic number” 2.2 (give or take a small amount) seems to be persistent everywhere.

The example problem I used here shouts “parallelize me” from afar - which I did, and that does of course reduce the computation time significantly. However, complex arithmetic remains more than two times slower - in fact it gets slightly worse, 2.2-2.6 times slower in every case, in the OpenMP-parallelized version of the code (a sketch of the parallel loop is below).
This example is simple enough to just ignore complex numbers and do the same thing with pairs of real numbers. But imagine a more complicated case, with lots of more sophisticated calculations. Years ago, I was working on a project where everything had to be done in the complex plane, and it was far more complicated than the example I used here. In such a case, using anything but complex numbers was a headache to even think of.
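For reference, the OpenMP version is nothing exotic; here is a minimal, self-contained sketch of the parallel pixel loop (grid size, region, and variable names are illustrative, not the actual FGL code):

program omp_sketch
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  integer, parameter :: nx = 800, ny = 600, max_iter = 6500
  integer :: ix, iy, n, counts(nx, ny)
  complex(real64) :: c, z

  !$omp parallel do private(ix, n, c, z) schedule(dynamic)
  do iy = 1, ny
    do ix = 1, nx
      ! map the pixel to a point of the complex plane
      c = cmplx(-2.0_real64 + 3.0_real64*(ix - 1)/nx, &
                -1.5_real64 + 3.0_real64*(iy - 1)/ny, kind=real64)
      z = (0.0_real64, 0.0_real64)
      do n = 1, max_iter
        z = z*z + c
        if (abs(z) > 2) exit
      end do
      counts(ix, iy) = n   ! more iterations -> darker color
    end do
  end do
  !$omp end parallel do

  print *, "max iterations used:", maxval(counts)
end program omp_sketch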

So I am wondering: did anyone notice such a performance difference? Perhaps with other compilers and/or more complicated calculations than those needed for fractals? My best bet is that complex numbers are poorly implemented - in GFortran, at least. But I might be wrong here.

1 Like

Under the hood, complex computations do essentially the same things you are doing with pairs of reals, as there is no such thing as complex registers and complex operations in the CPUs. So complex operations have no reason to be faster than well-organized computations on pairs of reals. But I wouldn’t expect them to be slower in the general case, either. Still, there can be some cases where computations on pairs can be faster: for instance, if you know that the final result of a computation is real, then you can drop the computations for the imaginary part.
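As a trivial example of that last point: |z|² = z·conjg(z) is real by construction, so the hand-written pairs-of-reals version never even touches the (identically zero) imaginary part. A minimal illustration:

program drop_imag
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  complex(real64) :: z
  real(real64) :: zsq

  z = (3.0_real64, 4.0_real64)

  ! Complex form: unless the optimizer notices, both parts of
  ! z*conjg(z) are evaluated, even though the imaginary part is
  ! identically zero:
  print *, real(z*conjg(z))   ! 25.0

  ! Pairs-of-reals form: the zero imaginary part is simply never computed:
  zsq = z%re*z%re + z%im*z%im
  print *, zsq                ! 25.0
end program drop_imag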

2 Likes

I think if you look back to the 1960s and 1970s, many Fortran codes involving complex arithmetic were implemented with separate real and imaginary parts for exactly this reason: complex arithmetic was poorly implemented and optimized by the compilers of the time. But I think by the 1980s this was largely corrected. Of course, this was also related to the fact that standard Fortran of that era only supported single-precision complex. But all of that is corrected in the language now (quirks of CMPLX() aside). Can you post some simple code here to show the 2.2x difference you are talking about?

1 Like

True, but I would expect a compiler to do better under the hood computing this:

complex(kind=real64) :: z_old, z_new, c
...
z_new = z_old*z_old + c

than what a user would do with vectors of real numbers to compute the real and imaginary parts of the result:

real(kind=real64), dimension(2) :: z_old, z_new, c
...
z_new(1) = z_old(1)*z_old(1) - z_old(2)*z_old(2) + c(1)
z_new(2) = 2*z_old(1)*z_old(2) + c(2)

At the very least, I would expect the former to perform more or less the same as the latter, if not faster… But it is, in fact, more than two times slower. It doesn’t make sense to me.

I would expect the same as you, but it doesn’t seem to be the case. I could post the whole subroutine that creates the image above (in fact, the whole FGL project will be released soon and will include that code in an example). However, the “bottleneck” of the code is just two lines:

z_new = z_old*z_old + c
...
if (abs(z_new) > 2) exit

where z_old, z_new, and c are complex numbers (see above). Those two lines are the only part of the code where complex arithmetic is used - and trust me, it was the last thing I checked when trying to figure out why the program was slower than I expected. The code does call graphing procedures, it does much more complicated stuff to draw each pixel, but no, none of that is the reason for the poor performance. The bottleneck is just those two lines. If I replace them with real-number arithmetic, BOOM, 2.2 times faster.

Well, I am guessing you are much wiser than my dumb self and know so much more, so could you please… enlighten me? Actually, no, forget it, I am not interested.

1 Like

To me at least, the gfortran complex arithmetic looks better than your homebrew:

1 Like

Note that you could avoid both abs() and sqrt() by writing:

if ((z_new%re**2 + z_new%im**2) > 4.0_wp) exit

It will still be faster, as computing a square root is useless here (and time-consuming).

2 Likes

@kargl I have been using GFortran since it became a decent compiler - in fact I was anticipating that long before it happened, and I didn’t like the fact that g95 and gfortran went separate ways back then. I have been dealing with complex arithmetic for decades. Do you really think the compiler is protecting me from myself here? Perhaps, just perhaps, you should reconsider your manners.
Normally I would ignore this and not answer because of the unacceptable manners, but I will do it for the sake of completeness and of others who might wish to help. I might be dumb, but I am not dumb enough to compute the absolute value in this case, since it’s not really needed. This is the whole snippet that has to do with complex numbers:

complex(kind=real64) :: z_old, z_new, c
...
z_new=z_old*z_old+c
if (abs(z_new) > 2) exit

(there are a few more lines involving complex numbers, but those are outside the offending loop with 6500 iterations and are not part of the bottleneck.) The equivalent of the above in the real-number-pairs version is

real(kind=real64), dimension(2) :: z_old, z_new, c
...
z_new(1)=z_old(1)*z_old(1)-z_old(2)*z_old(2)+c(1)
z_new(2)=2*z_old(1)*z_old(2)+c(2)
if (z_new(1)*z_new(1) + z_new(2)*z_new(2) > 4) exit

I really don’t need to post the whole code; the snippet above is where the performance bottleneck is. And again, it was the last thing I checked, expecting the problem to be elsewhere - but no, it is exactly here. Even if I did post the whole code, you wouldn’t be able to run it anyway, because it uses my FGL extensively.

1 Like

About hypot(X,Y) the Fortran standard says:

The result has a value equal to a processor-dependent approximation to the Euclidean distance, \sqrt{X^2 + Y^2}, without undue overflow or underflow.

So it does not simply compute sqrt(X*X+Y*Y) but takes care of overflow/underflow problems that could occur.
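A quick way to see the difference (hypot() has been an intrinsic since Fortran 2008; the sample values are arbitrary, chosen so that the squares overflow while the result is representable):

program hypot_demo
  use, intrinsic :: iso_fortran_env, only: real64
  implicit none
  real(real64) :: x, y

  x = 1.0e200_real64
  y = 1.0e200_real64

  print *, sqrt(x*x + y*y)  ! x*x overflows to Infinity, so this prints Infinity
  print *, hypot(x, y)      ! internal scaling avoids the overflow: ~1.414e200
end program hypot_demo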

1 Like

For example:
\text{if } |x| \geq |y| \text{ then } |z| = |x|\sqrt{1 + \left(\frac{y}{x}\right)^2}

\text{if } |x| < |y| \text{ then } |z| = |y|\sqrt{1 + \left(\frac{x}{y}\right)^2}
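A minimal Fortran sketch of those formulas (a real library implementation also has to handle zeros, infinities, and NaNs carefully):

pure function scaled_abs(z) result(r)
  use, intrinsic :: iso_fortran_env, only: real64
  complex(real64), intent(in) :: z
  real(real64) :: r
  real(real64) :: ax, ay, q

  ax = abs(z%re)
  ay = abs(z%im)
  if (ax >= ay) then
    if (ax == 0.0_real64) then
      r = 0.0_real64                   ! both parts are zero
    else
      q = ay / ax                      ! q <= 1, so q*q cannot overflow
      r = ax * sqrt(1.0_real64 + q*q)
    end if
  else
    q = ax / ay
    r = ay * sqrt(1.0_real64 + q*q)
  end if
end function scaled_abs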

1 Like

The point is exactly this: in the real-arithmetic version of the test, you are writing an optimised version, without sqrt() and with only 2 multiplications and 1 addition. In contrast, when calling abs(z_new) the compiler has no choice other than generating a sqrt(), and possibly many more operations, as pointed out by @vmagnin.

1 Like

For Mandelbrot, I also tried other metrics, like the Manhattan distance and the Chebyshev distance.

With the Chebyshev distance, the test would be if (max(abs(z%re), abs(z%im)) > 2.0_wp).
With -O0 the computation of the Mandelbrot set was faster than with the square of the Euclidean distance, but with -O3 it was the same… The corresponding topological ball is a square with sides parallel to the axes; it is bigger than the circle, but the mathematical operations are simpler and faster.

In the end, it was not globally faster in that case. But sometimes it can be interesting to consider metrics other than the familiar Euclidean metric.

Note also that with Chebyshev there is no arithmetic computation, and thus no additional rounding error. And the contour lines are different.
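For reference, here is the Chebyshev test packaged as a function (the function name is mine, and real64 stands in for the thread’s wp):

elemental function escaped_chebyshev(z) result(gone)
  use, intrinsic :: iso_fortran_env, only: real64
  complex(real64), intent(in) :: z
  logical :: gone

  ! Max-norm test: the axis-aligned square of half-side 2 contains the
  ! escape circle |z| <= 2, so points of the set never falsely exit;
  ! only the contour lines outside the set change.
  gone = max(abs(z%re), abs(z%im)) > 2.0_real64
end function escaped_chebyshev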

When I saw the abs(z_new) in the expression, that is what I thought too. The library routine that computes the complex abs is complicated, with tests to protect against unnecessary overflow and underflow. Also, it is a 2-norm that requires a square root, which adds additional overhead to the test. If you know for certain that your numbers are < sqrt(huge(1.0_wp)), then you don’t need to worry about overflow, and you can just use the simple expression if ((z_new%re**2 + z_new%im**2) > 4.0_wp) instead, as suggested above.

Even better, if your algorithm doesn’t really need the 2-norm of the complex number, then maybe you can use a simpler test with the 1-norm (abs(z_new%re) + abs(z_new%im)), which does not even require a multiplication. This is a common trick in complex linear algebra; the BLAS name for this function is SCABS1() or DCABS1(). However, there is no intrinsic Fortran function, so you must write it out as above or roll your own.
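Rolling your own is a one-liner; a sketch of a DCABS1-like helper (the function name is illustrative):

elemental function dcabs1_like(z) result(r)
  use, intrinsic :: iso_fortran_env, only: real64
  complex(real64), intent(in) :: z
  real(real64) :: r

  ! 1-norm of a complex number: two abs and one addition,
  ! no multiplication and no square root
  r = abs(z%re) + abs(z%im)
end function dcabs1_like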

1 Like

Well, this is surprising. It turns out the complex abs is very, very costly - in GFortran, at least. Replacing the line

if (abs(z_new) > 2) exit

with

if ((z_new%re*z_new%re + z_new%im*z_new%im) > 4) exit

as @vmagnin suggested has a huge impact on performance. Actually, I was about to try this as a last resort, since my real-numbers implementation already does exactly that, but @vmagnin was faster. I would have sent you a bottle of wine, but since you have Bourgogne Aligoté there (what a wine!) how could I dare. :laughing:

Some results in numbers (calculated on an old laptop, since I’m not at home). Numbers are in seconds; parenthesized numbers are computation times for the parallel (OpenMP) code. Fractal 3 is the one in the picture above, but I added two more here for comparison:

              Fractal 1     Fractal 2     Fractal 3
with abs      1.19 (0.56)   3.65 (1.63)   6.00 (2.63)
without abs   0.35 (0.12)   1.03 (0.33)   1.77 (0.54)

As you can see, avoiding abs makes the program about 3.5 times faster in the non-parallelized version, and 4.5-5 times faster when using OpenMP. Now let’s forget about abs and see how the complex-plane version compares to the real-numbers implementation (which never used sqrt to compute abs, so it is directly comparable to the complex-plane no-abs version):

               Fractal 1     Fractal 2     Fractal 3
Complex plane  0.35 (0.12)   1.03 (0.33)   1.77 (0.54)
Real numbers   0.54 (0.25)   1.61 (0.60)   2.80 (1.10)

As you can see, the complex-plane version is faster; in fact its parallel version is about 2 times faster than the real-numbers parallel version. If you combine the two tables above you will also see the “magic number” 2.2 I mentioned earlier, but it doesn’t matter anymore.

Of course, these are just elementary experiments, but I think it’s obvious GFortran does a good job with complex numbers. However, avoid abs like the plague. I would guess other functions also have a huge impact, but this is worth investigating further.

For the record, the C version of the above has about the same performance as the Fortran code. It is faster than the “real numbers” version above, but slightly slower than the complex-plane Fortran version.

2 Likes

I’ve prepared (adapted) a demo code:

Demo code
! mandelbrot.F90 --
!
!     Code originally from:
!        https://github.com/fortran-lang/webpage/issues/108#issuecomment-1220647597
!
!     Use the preprocessor directive -DWITH_ABS to control the exit method 
!     from the main iteration loop.
!
program main

  implicit none

  integer, parameter :: wp = kind(1.0d0)
  integer, parameter :: NCOLS = 2400, NROWS = 1200, MAX_ITER = 1000
  real(wp),    parameter :: THRESHOLD = 2.7_wp
  integer :: buffer(NROWS,NCOLS)
  integer :: x, y, it, ifile
  real(wp) :: re, im, t1, t2

  call cpu_time(t1)
  do concurrent (y = 1:NROWS, x = 1:NCOLS)
    im = -1.5 + (y - 1) * 3.0 / NROWS
    re = -2.0 + (x - 1) * 3.0 / NCOLS
    it = mandelbrot(cmplx(re, im, wp), MAX_ITER, THRESHOLD)
    buffer(y, x) = it
  end do
  call cpu_time(t2)
  print '("Time: ", f8.6, " sec")', t2 - t1

  open(newunit=ifile,file="out.txt")
  call write_gnuplot(buffer,ifile)
  close(ifile)

  call execute_command_line("gnuplot -persist mandelbrot.plt")

contains

  pure integer function mandelbrot(c, max_iter, threshold)
    complex(wp), intent(in) :: c
    integer, intent(in) :: max_iter
    real(wp),    intent(in) :: threshold
    complex(wp)             :: z

    z = (0.0_wp, 0.0_wp)
    do mandelbrot = 0, max_iter
        z = z**2 + c
#if WITH_ABS
        if (abs(z) > threshold) exit
#else
        if (z%re**2 + z%im**2 > threshold**2) exit
#endif
    end do

  end function mandelbrot

  subroutine write_gnuplot(buffer,iunit)
    use, intrinsic :: iso_fortran_env, only: output_unit
    integer, intent(in) :: buffer(:,:)
    integer, optional, intent(in) :: iunit

    integer :: iunit_, i, j

    iunit_ = output_unit
    if (present(iunit)) iunit_ = iunit
    do i = 1, size(buffer,1)
      write(iunit_,'(*(I6,2X))') buffer(i,:)
    end do
  end subroutine

end program main
# mandelbrot.plt
#
set size ratio -1
unset border
unset tics
unset colorbox
plot "out.txt" matrix w image

The factor-of-2 (or more) difference between abs and the squared norm is observable with other compilers too:

$ ifx --version
ifx (IFORT) 2022.0.0 20211123
Copyright (C) 1985-2021 Intel Corporation. All rights reserved.

$ ifx -O3 -xHOST mandelbrot.F90 
$ ./a.out
Time: 1.388586 sec
$ ifx -O3 -xHOST mandelbrot.F90 -DWITH_ABS
$ ./a.out
Time: 3.993850 sec
$ nvfortran --version

nvfortran 22.7-0 64-bit target on x86-64 Linux -tp haswell 
NVIDIA Compilers and Tools
Copyright (c) 2022, NVIDIA CORPORATION & AFFILIATES.  All rights reserved
$ nvfortran -O3 mandelbrot.F90
$ ./a.out
Time: 1.325308 sec
$ nvfortran -O3 mandelbrot.F90 -DWITH_ABS
$ ./a.out
Time: 2.754659 sec

As emphasized already, there is nothing poor about gfortran’s implementation of complex numbers.

2 Likes

:wine_glass:
That’s what I discovered a long time ago. It’s costly, but at that time I did not completely understand why.
Like @RonShepard said, avoid the complex abs() when you are sure there will be no overflow during its computation. And for the Mandelbrot set it’s OK, as we just need to test whether we have left that small circle.

With any of these 3 lines - that is, the squared L2 norm with multiplications, the squared L2 norm with powers, or the L1 norm - I obtain exactly the same runtimes:

	! if (znew(i)%re **2 + znew(i)%im **2 > 4.0) exit
	! if (znew(i)%re * znew(i)%re + znew(i)%im * znew(i)%im > 4.0) exit
	! if (abs(znew(i)%re) + abs(znew(i)%im) > 2.0) exit

Multiplications are almost free (at least they are as soon as the result is added to something), and I expect any decent compiler to replace the power 2 (**2) with a simple multiplication. Writing a multiplication instead of **2 is a C habit :wink:

Conclusion: there is no need to replace the squared L2 norm by the L1 norm for optimisation purposes (for the L2 norm itself - not squared - it’s of course a different story, because of the sqrt()).

1 Like

I learned this in a backwards kind of way. I was working with a LINPACK code on a VAX. There are two kinds of 64-bit floating point on a VAX, one with a decimal exponent range of about ±38 and the other with a range of about ±308. I was using the format with the smaller range because it had more mantissa bits. Somewhere along the way, I started seeing overflow errors for the expression sqrt(x**2+y**2). I knew that expression was mathematically equivalent to CABS(DCMPLX(x,y)), so I stuck that in just to see what would happen. All the overflows went away.

I traced it back and found that those numbers could be large because they were the result of a division where the denominator was near the machine epsilon. I then realized that 1/epsilon was very close to what we would now call huge(). So the problem had to do with computing that expression in a safe way that avoids the overflow, and apparently the VAX CABS() function did that correctly. I also determined that the problem did not occur with the other 64-bit floating-point format, the one with the larger exponent range. Anyway, in my case I learned that CABS() does some extra work beyond the straightforward expression, and that this work is sometimes very valuable.
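The same effect is easy to reproduce today by switching to real32, whose exponent range (huge() ≈ 3.4e38) is about the same as that narrow-range VAX format; a small sketch, assuming default (non-trapping) IEEE behavior:

program overflow_demo
  use, intrinsic :: iso_fortran_env, only: real32
  implicit none
  real(real32) :: x, y

  x = 1.0e20_real32
  y = 1.0e20_real32

  print *, sqrt(x**2 + y**2)         ! x**2 ~ 1e40 overflows: prints Infinity
  print *, abs(cmplx(x, y, real32))  ! the library complex abs scales: ~1.414e20
end program overflow_demo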

1 Like

While the question has been answered, I wanted to add that this issue resonates strongly with some of the lessons given by Matt Godbolt (the author of Compiler Explorer, the page I used above to look at the quality of the complex arithmetic instructions) in this YouTube video from a MUC++ meeting (Munich C++ User Group):

  • Compilers are cleverer than we are … if they have the right information.
  • Don’t compromise readability … Unless you can absolutely and unconditionally prove it beyond any reasonable doubt, don’t compromise readability. … Trust your compiler to take readable, clear code that you can reason about. Turns out if you can reason about it, the compiler can reason about it and it can do the right thing.

Writing your own complex arithmetic routines is something that falls into the second category.

4 Likes

I tried that in May, and discovered that 2.0 must be replaced by ~2.3: the topological ball is a square with sides oriented at a 45° angle, and if you want the Mandelbrot set to be fully inside it, 2.0 is not enough.

Note also that the contour lines differ from the Euclidean contour lines outside the set, so the aesthetics are different.

Is the test for the size of the circle, or for the rectangle that is plotted? If so, and if it is a square plot, then it seems like two separate tests

if ( abs(znew%re) > 1.0_wp) exit
if ( abs(znew%im) > 1.0_wp) exit

would be appropriate. There are no multiplications, additions, divisions, or square roots in those tests.