Nice. On an Apple M1 I am getting about 1.8s with both GFortran 11.0.1:

```
+ gfortran -O3 -march=native -ffast-math -funroll-loops -c kind.f90 -o kind.o
+ gfortran -O3 -march=native -ffast-math -funroll-loops -c black_scholes.f90 -o black_scholes.o
+ gfortran -O3 -march=native -ffast-math -funroll-loops -c xxblack_scholes.f90 -o xxblack_scholes.o
+ gfortran -O3 -march=native -ffast-math -funroll-loops -o xxblack_scholes xxblack_scholes.o black_scholes.o kind.o
+ ./xxblack_scholes
c = 4.7594223259015269
real 0m1.853s
user 0m1.846s
sys 0m0.004s
```

and the latest LFortran master:

```
+ lfortran --fast -c kind.f90 -o kind.o
+ lfortran --fast -c black_scholes.f90 -o black_scholes.o
+ lfortran --fast -c xxblack_scholes.f90 -o xxblack_scholes.o
+ lfortran --fast -o xxblack_scholes xxblack_scholes.o black_scholes.o kind.o
+ ./xxblack_scholes
c = 4.75942232590152159
real 0m1.808s
user 0m1.801s
sys 0m0.004s
```

I had to apply a tiny patch, both because we still have to implement `cpu_time` and as a workaround for the LFortran issue "LLVM: Calling f(g(x)) in a loop can run out of stack" (#573), which I really need to fix soon:

```
diff --git a/xxblack_scholes.f90 b/xxblack_scholes.f90
index 27dc187..45866a3 100644
--- a/xxblack_scholes.f90
+++ b/xxblack_scholes.f90
@@ -8,13 +8,15 @@ implicit none
integer, parameter :: niter = 10**8
integer :: iter
real(kind=dp) :: c,t1,t2
-call cpu_time(t1)
+real(dp) s,k,r,t,vol
+!call cpu_time(t1)
! Example 15.6 p360 of Options, Futures, and other Derivatives (2015), 9th edition,
! by John C. Hull
+s=42;k=40;r=0.1_dp;t=0.5_dp;vol=0.2_dp
do iter=1,niter
- c = call_price(s=42.0_dp,k=40.0_dp,r=0.1_dp,t=0.5_dp,vol=0.2_dp) ! Hull gets 4.76
+ c = call_price(s, k, r, t, vol) ! Hull gets 4.76
end do
print*,"c =",c
-call cpu_time(t2)
-print*,"time elapsed = ",t2-t1
+!call cpu_time(t2)
+!print*,"time elapsed = ",t2-t1
end program xxblack_scholes
```

The patch is really minimal, and I am very happy about that! It gives the same answer as GFortran to about 15 significant digits (the printed values differ only in the trailing digits). I am also happy about the performance. Credit goes to LLVM: we don’t do any optimizations ourselves yet, but we will in the coming months. LLVM compiles code slowly compared to what is possible (even so, LFortran compiles this in 0.224s compared to 0.342s for GFortran, both with optimizations on), but the performance of the final executable is excellent. They have done an A+ job. So did GFortran, because they have implemented all this from scratch, and on this benchmark the two are essentially equivalent.
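For readers who don't have the benchmark sources at hand, here is a minimal sketch of what a `call_price` routine of this kind might look like. It is not the actual `black_scholes.f90`, which isn't shown in this post. The module name `black_scholes_sketch`, the helper `normal_cdf`, and the local definition of `dp` are assumptions made for illustration; the formula is the standard Black-Scholes price of a European call, with the normal CDF written in terms of the `erfc` intrinsic.

```fortran
! Hypothetical stand-in for the benchmark's call_price; the real
! black_scholes.f90 and kind.f90 are not shown in this post.
module black_scholes_sketch
implicit none
private
public :: dp, call_price
integer, parameter :: dp = kind(1.0d0)  ! assumed double precision kind
contains

elemental function normal_cdf(x) result(p)
! Standard normal cumulative distribution via the erfc intrinsic (Fortran 2008)
real(kind=dp), intent(in) :: x
real(kind=dp) :: p
p = 0.5_dp*erfc(-x/sqrt(2.0_dp))
end function normal_cdf

elemental function call_price(s, k, r, t, vol) result(price)
! Black-Scholes price of a European call on a non-dividend-paying stock
real(kind=dp), intent(in) :: s    ! spot price
real(kind=dp), intent(in) :: k    ! strike
real(kind=dp), intent(in) :: r    ! continuously compounded risk-free rate
real(kind=dp), intent(in) :: t    ! time to expiry in years
real(kind=dp), intent(in) :: vol  ! annualized volatility
real(kind=dp) :: price
real(kind=dp) :: d1, d2
d1 = (log(s/k) + (r + 0.5_dp*vol**2)*t)/(vol*sqrt(t))
d2 = d1 - vol*sqrt(t)
price = s*normal_cdf(d1) - k*exp(-r*t)*normal_cdf(d2)
end function call_price

end module black_scholes_sketch

program check_call_price
use black_scholes_sketch, only: dp, call_price
implicit none
! Same inputs as the benchmark (Hull, Example 15.6); Hull gets 4.76
print *, "c =", call_price(s=42.0_dp, k=40.0_dp, r=0.1_dp, t=0.5_dp, vol=0.2_dp)
end program check_call_price
```

With these inputs the sketch should print a value that rounds to Hull's 4.76. The trailing digits may differ from the numbers above, depending on the compiler and on how the actual benchmark evaluates the normal CDF.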

Update: I implemented `cpu_time()` in !1491 and !1492, so now the diff is just:

```
diff --git a/xxblack_scholes.f90 b/xxblack_scholes.f90
index 27dc187..e5df36c 100644
--- a/xxblack_scholes.f90
+++ b/xxblack_scholes.f90
@@ -8,11 +8,13 @@ implicit none
integer, parameter :: niter = 10**8
integer :: iter
real(kind=dp) :: c,t1,t2
+real(dp) s,k,r,t,vol
call cpu_time(t1)
! Example 15.6 p360 of Options, Futures, and other Derivatives (2015), 9th edition,
! by John C. Hull
+s=42;k=40;r=0.1_dp;t=0.5_dp;vol=0.2_dp
do iter=1,niter
- c = call_price(s=42.0_dp,k=40.0_dp,r=0.1_dp,t=0.5_dp,vol=0.2_dp) ! Hull gets 4.76
+ c = call_price(s, k, r, t, vol) ! Hull gets 4.76
end do
print*,"c =",c
call cpu_time(t2)
```