How to contribute to LFortran's runtime library

Btw, regarding this:

I think even the most competent compiler cannot optimize some things if it can’t give up determinism. For example if you are not allowed to reorder a loop summation, it is hard to vectorize (or parallelize) it and get exactly the same answer.

So I think as long as the user can control what the compiler does via command line options, the user can decide how much determinism to give up for speed.

1 Like

I haven’t used lfortran for my test but gfortran (I did mention this). Actually I did not know lfortran could already run any code. I thought it is at the first stage of doing the translation (front-end?)

In any case, I am not sure why shouldn’t we implement fast sqrt algorithm but instead leave it to the optimizer to recognize that x**0.5 is actually sqrt(x). What if the user gives ‘-O0’ option?

There are two separate issues. One is simply the numerical accuracy of the various ways to compute an expression in fortran. The expressions sqrt(x) and x**(0.5) might differ for all of these reasons. The value the programmer wants might well depend on the context. For example, the programmer might want the value for x**y that is smooth as a function of y near y=0.5 rather than the one computed by sqrt(x). The other issue in general involves multiple valued functions and branch cuts, certainly when complex arithmetic is involved, but also just with real arithmetic. The programmer might want the negative square root value rather than the positive one, and if the compiler is making guesses behind the curtain when substituting various expressions for optimization purposes, they aren’t always going to match what the programmer wants.

@msz59 LFortran can compile a decent subset of Fortran. It’s a full compiler. All the tests here marked with “llvm” can compile to a binary and run correctly: lfortran/CMakeLists.txt at 56fb378cdb2748b48a561c0b59dfc6eb64b085b0 · lfortran/lfortran · GitHub.

I am sorry I misunderstood what you were benchmarking. Can you post the code that you used for your gfortran benchmark?

Here it is. To change to double precision, it is enough to change real32 to real64 in the use declaration. I have checked SP and DP with no options, -O3 and “-O3 -ffast-math”. The closest result was in DP with fast-math but still twice as long for ssqrt than sqrt

program s
  use, intrinsic :: iso_fortran_env, only: sp => real32
  implicit none
  integer,parameter :: max=100000000
  integer :: i
  real(sp) :: sum, tab(max)
  real :: t1, t2
  call random_number(tab)
  call cpu_time(t1)
  sum = 0.0_sp
  do i=1, max
    sum = sum+sqrt(tab(i))
  end do
  call cpu_time(t2)
  print *, 'sqrt ', sum, t2-t1,' secs'
  call cpu_time(t1)
  sum = 0.0_sp
  do i=1, max
    sum = sum+ssqrt(tab(i))
  end do
  call cpu_time(t2)
  print *, 'ssqrt', sum, t2-t1,' secs'

contains
  elemental real(sp) function ssqrt(x) result(r)
  implicit none
  real(sp), intent(in) :: x
  if (x >= 0) then
    r = x**(1._sp/2)
  else
    error stop "sqrt(x) for x < 0 is not allowed"
  end if
end function ssqrt
end program s

Thanks. For performance you can’t use that if statement in there (we have to figure out another solution). Try this:

program s
  use, intrinsic :: iso_fortran_env, only: sp => real32
  implicit none
  integer,parameter :: max=100000000
  integer :: i
  real(sp) :: sum, tab(max)
  real :: t1, t2
  call random_number(tab)
  call cpu_time(t1)
  sum = 0.0_sp
  do i=1, max
    sum = sum+sqrt(tab(i))
  end do
  call cpu_time(t2)
  print *, 'sqrt ', sum, t2-t1,' secs'
  call cpu_time(t1)
  sum = 0.0_sp
  do i=1, max
    sum = sum+ssqrt(tab(i))
  end do
  call cpu_time(t2)
  print *, 'ssqrt', sum, t2-t1,' secs'

contains
  elemental real(sp) function ssqrt(x) result(r)
  implicit none
  real(sp), intent(in) :: x
  r = x**(1._sp/2)
end function ssqrt
end program s

On Apple M1 Max with GFortran 11.3.0 I am getting:

$ gfortran -O3 -ffast-math -funroll-loops b.f90
$ ./a.out 
 sqrt    67108864.0       2.97800004E-02  secs
 ssqrt   67108864.0       2.36230046E-02  secs

So the pure Fortran version is faster than the intrinsic.

1 Like

To benchmark LFortran against GFortran, you can do so with the following code (I took your example and workarouned some current LFortran limitations):

program s
  use, intrinsic :: iso_fortran_env, only: sp => real32
  implicit none
  integer,parameter :: max=100000000
  integer :: i
  real(sp) :: sum
  real(sp), allocatable :: tab(:)
  real(8) :: t1, t2
  allocate(tab(max))
  do i=1, max
    tab(i) = real(i) / max
  end do
  call cpu_time(t1)
  sum = 0.0_sp
  do i=1, max
    sum = sum+sqrt(tab(i))
  end do
  call cpu_time(t2)
  print *, 'sqrt ', sum, t2-t1,' secs'
  call cpu_time(t1)
  sum = 0.0_sp
  do i=1, max
    sum = sum+ssqrt(tab(i))
  end do
  call cpu_time(t2)
  print *, 'ssqrt', sum, t2-t1,' secs'

contains
  elemental real(sp) function ssqrt(x) result(r)
  implicit none
  real(sp), intent(in) :: x
  r = x**(1._sp/2)
end function ssqrt
end program s

And you get:

$ gfortran -O3 -ffast-math -funroll-loops b.f90 && ./a.out 
 sqrt    67108864.0       2.8060000000000002E-002  secs
 ssqrt   67108864.0       2.3711999999999997E-002  secs
$ lfortran --fast b.f90                                   
sqrt  1.67772160e+07 1.30483999999999989e-01  secs
ssqrt 1.67772160e+07 9.59920000000000218e-02  secs

For smaller max the numbers agree better, but I used a big enough max to get some decent timings. Go ahead and see what the timings are on your computer. As you can see, GFortran is currently about 4x faster on this benchmark. We’ll work on improving optimizations later.

2 Likes

Thanks @kargl you nailed it. You highlighted the advantage of having the sqrt intrinsic function: it uses the (faster) sqrtsd instruction every time, while the power x**(1._sp/2) only uses it if you supply the -ffast-math flag.

I think that also resolves the dilemma above: the expression x**(1._sp/2) without -ffast-math behaves as x**y , but with -ffast-math it behaves as sqrt(x).

Consequently, it seems you can indeed use the general x**(1._sp/2) to implement the intrinsic sqrt(x), however by default it would only use the fast sqrtsd instruction if -ffast-math is provided (or equivalent). In the case of sqrt we want to always use -ffast-math if it is implemented using x**(1._sp/2), as it turns out that is what the user wants.

Regarding the cleanest implementation, one option is to have a dedicated ASR node for this operation like this:

--- a/src/libasr/ASR.asdl
+++ b/src/libasr/ASR.asdl
@@ -238,6 +238,7 @@ expr
     | RealUnaryMinus(expr arg, ttype type, expr? value)
     | RealCompare(expr left, cmpop op, expr right, ttype type, expr? value)
     | RealBinOp(expr left, binop op, expr right, ttype type, expr? value)
+    | RealSqrt(expr arg, ttype type, expr? value)
     | ComplexConstant(float re, float im, ttype type)
     | ComplexUnaryMinus(expr arg, ttype type, expr? value)
     | ComplexCompare(expr left, cmpop op, expr right, ttype type, expr? value)

The frontend would simply use this node for the intrinsic sqrt(x), as well as it can transform RealBinOp(x op=Pow, 1/2, ...) into RealSqrt(x, ...) if the -ffast-math flag is provided in the ASR->ASR optimization phase.

We are trying to keep the ASR design minimal and only add nodes if needed. I thought the sqrt function could be implemented using the general RealBinOp operator, but I can see now that the backend should only generate the sqrtsd if -ffast-math is provided, but some codes/users cannot use this flag, so it seems a dedicated RealSqrt node might be the way to go, to signal to the backend to use the sqrtsd instruction even without -ffast-math.

An alternative implementation is to recognize the instrinsic sqrt function in the backend and directly generate the sqrtsd instruction. The pro is that ASR is simpler, the con is that the backend has to have a special logic for intrinsic sqrt, so having an explicit ASR node might be cleaner. We struggle these design choices. I’ll think about this.

Regarding inlining, LFortran inlines intrinsic functions (it has access to the source code at compile time). It currently doesn’t use -ffast-math because I have not figured out how to make LLVM do it from C++ yet, but that’s a different issue.

@msz59 going forward, as you can see, you might want to focus on another intrinsic function, the sqrt is quite special, as this thread highlighted.

3 Likes

Thank you @kargl. It means a lot coming from you.

2 Likes