Dear all,

I just wanted to test a piece of code to see whether there is a speed difference between the M1 and a PC.

I have a code; I have to say this may well not be the best code for testing memory bandwidth, but I wanted to share it with you. You may test it and report the result, especially if you have an M1, M1 Pro, M1 Max, or even M1 Ultra.

The code basically solves a simple 2-equation stochastic differential equation (SDE). Because it is stochastic, you need to solve it many, many times in order to obtain the correct statistics. If 'many, many' = 10^5, solving the SDE basically costs 10^5 times as much as solving the corresponding ODE once.
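The benchmark itself is Fortran, but the cost scaling above can be illustrated with a minimal Python sketch: an Euler-Maruyama Monte Carlo for a scalar linear SDE (not the 2-equation system in the actual code), where the total work grows linearly with the number of sample paths.

```python
import numpy as np

# Illustrative only (the actual benchmark is Fortran): a minimal
# Euler-Maruyama Monte Carlo for the scalar SDE
#   dX = -a*X dt + b dW,   X(0) = x0,
# solved over npaths independent realizations. The cost grows
# linearly with npaths, which is why solving an SDE over 1e5
# sample paths costs roughly 1e5 ODE solves.
def euler_maruyama(x0, a, b, t_end, nstep, npaths, rng):
    h = t_end / nstep
    x = np.full(npaths, x0, dtype=np.float64)
    for _ in range(nstep):
        dw = rng.normal(0.0, np.sqrt(h), size=npaths)  # Brownian increments
        x += -a * x * h + b * dw
    return x

rng = np.random.default_rng(0)
x = euler_maruyama(1.0, 2.0, 0.5, 1.0, 1000, 100_000, rng)
# The sample mean should approach the exact mean exp(-a*t_end) = exp(-2),
# and the sample variance b^2/(2a)*(1 - exp(-2*a*t_end)).
print(x.mean(), x.var())
```

The statistics only converge like 1/sqrt(npaths), which is exactly why "many many" realizations are needed.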

OK. The code is simplified from a sequential Monte Carlo code, so some modules are left blank for simplicity. There are 8 files and a Makefile.

All you need to do is copy them into one folder, then type

```
make
```

then type

```
./stoRK
```

to run the code. That is all.

The 8 files are listed below:

constants.f90 (755 Bytes)

fg.f90 (818 Bytes)

main.f90 (814 Bytes)

pf.f90 (31 Bytes)

ran.f90 (9.2 KB)

stats.f90 (3.6 KB)

stochastic_rk.f90 (4.7 KB)

tests.f90 (2.2 KB)

The Makefile is below (there is no MPI in this code); just type `make` and that is all.

```
# This Makefile was generated by Rong Chen for gfortran on ubuntu
# sudo apt install gfortran mpich https://www.youtube.com/watch?v=aRhYoAC-Ymc
MPI=false
ifeq ($(MPI),true)
FC = mpif90
MPIFILE=mympi
EXEC = stork_mpi
LINKER =
IDIR =
FFLAGS = -O3 -march=native -frecursive # -flto
#-fcheck=all -Og -g -fbacktrace -Wall -Wextra -Wno-tabs -Wno-unused-dummy-argument -Wno-unused-variable -Wno-unused-function -Wno-compare-reals -Wno-maybe-uninitialized -Wno-conversion -ffpe-trap=invalid,zero,overflow -finit-real=nan -ffree-line-length-0# -Ofast -march=native -flto # -pg #-no-pie
F77FLAGS = $(FFLAGS) -std=legacy -fdefault-real-8 -fdefault-double-8 # gfortran only.
FFLAGS_heapstack = -frecursive # -fmax-stack-var-size=655360
LDFLAGS=
LIBS = -static-libgfortran
LINKER =
else
FC = gfortran
EXEC = stork
LINKER =
IDIR =
FFLAGS = -O3 -march=native -frecursive -flto
#-fcheck=all -Og -g -fbacktrace -Wall -Wextra -Wno-tabs -Wno-unused-dummy-argument -Wno-unused-variable -Wno-unused-function -Wno-compare-reals -Wno-maybe-uninitialized -Wno-conversion -ffpe-trap=invalid,zero,overflow -finit-real=nan -ffree-line-length-0
#-g -Wall -Wtabs -Wextra -Warray-temporaries -Wconversion -fbacktrace -ffree-line-length-0 -fcheck=all -ffpe-trap=invalid,zero,overflow -finit-real=nan
# -pg #-no-pie # -fmax-stack-var-size=655360
F77FLAGS = $(FFLAGS) -std=legacy -fdefault-real-8 -fdefault-double-8 # gfortran only.
FFLAGS_heapstack = -frecursive # -fmax-stack-var-size=655360
LDFLAGS =
LIBS= -static-libgfortran
endif
.SUFFIXES:
.SUFFIXES: .o .f .f90
.f90.o:
	$(FC) $(FFLAGS) -c $<
%.o: %.mod
OBJECTS=\
main.o\
constants.o\
fg.o\
pf.o\
ran.o\
stats.o\
stochastic_rk.o\
tests.o
main: $(OBJECTS)
	$(FC) $(LDFLAGS) -o ./$(EXEC) $(OBJECTS) $(LIBS) 2>compiling_record.log
clean:
	rm -f $(EXEC) *\.mod *\.mod0 *\.smod *\.smod0 *\.log *\.o *~
# @del /q /f $(EXEC) *.mod *.smod *.o $(EXEC) *~ > nul 2> nul
# Note that on Windows "rm -f" does not work, so use "del" instead.
# "> nul 2> nul" just suppresses some redundant messages.
main.o: constants.o ran.o fg.o pf.o stochastic_rk.o tests.o stats.o main.f90
	$(FC) $(FFLAGS) -c main.f90
constants.o: constants.f90
	$(FC) $(FFLAGS) -c constants.f90
fg.o: constants.o fg.f90
	$(FC) $(FFLAGS) -c fg.f90
pf.o: constants.o stats.o ran.o stochastic_rk.o fg.o pf.f90
	$(FC) $(FFLAGS) -c pf.f90
ran.o: ran.f90
	$(FC) $(FFLAGS) -c ran.f90
stats.o: constants.o stats.f90
	$(FC) $(FFLAGS) -c stats.f90
stochastic_rk.o: constants.o ran.o fg.o stochastic_rk.f90
	$(FC) $(FFLAGS) -c stochastic_rk.f90
tests.o: constants.o ran.o fg.o stochastic_rk.o stats.o tests.f90
	$(FC) $(FFLAGS) -c tests.f90
```

My hardware is:

- ThinkPad P72 with a Xeon-2186M + 64 GB ECC DDR4-2666 (with 2 TB + 2 TB + 8 TB storage and a Quadro P5200 GPU, just to show off lol). Windows 10 Pro for Workstations. Intel oneAPI 2022.0.3 + Visual Studio 2019.

The ifort compile flags on Windows 10 are:

```
/nologo /debug:full /MP /O3 /QxHost /Qparallel /heap-arrays0 /Qopt-matmul /module:"x64\Release\\" /object:"x64\Release\\" /Fd"x64\Release\vc160.pdb" /traceback /libs:static /threads /Qmkl:cluster /c
```

The link flags are:

```
/OUT:"x64\Release\stochastic_RK.exe" /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:"x64\Release\stochastic_RK.exe.intermediate.manifest" /MANIFESTUAC:"level='asInvoker' uiAccess='false'" /DEBUG /PDB:"D:\Works CHLA\stochastic_RK\stochastic_RK\stochastic_RK\x64\Release\stochastic_RK.pdb" /SUBSYSTEM:CONSOLE /LARGEADDRESSAWARE /IMPLIB:"D:\Works CHLA\stochastic_RK\stochastic_RK\stochastic_RK\x64\Release\stochastic_RK.lib"
```

- M1 MacBook Air with 16 GB RAM and 2 TB storage. Latest OS, and the latest gfortran from brew, version 11.2.0_3.

The code has two parts.

PART 1. It generates an 800,000,000-element Gaussian random number array, which will be used in PART 2. This part costs about 10 GB of memory or so, so you may need 16 GB of memory.
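A quick sanity check on that memory estimate: 800,000,000 double-precision (8-byte) reals come to about 6 GiB for the array alone; transient copies during generation can push the peak toward the ~10 GB observed.

```python
# Memory estimate for PART 1: 800,000,000 double-precision (8-byte) reals.
# Transient copies during generation can roughly double the peak usage,
# consistent with the ~10 GB observed.
n = 800_000_000
bytes_per_real8 = 8
gib = n * bytes_per_real8 / 2**30
print(f"{gib:.2f} GiB for the array alone")  # prints: 5.96 GiB for the array alone
```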

In this part, the M1 chip has no advantage over my Xeon-2186M; it is almost 3X slower: 24 s vs 9 s.

However, my ifort build used the /Qparallel flag and MKL, so that may also account for part of the difference. Anyway.

PART 2. The second part repeats solving the same SDE 5 times.

The first time, the M1 costs 3 s; then (strangely) each of the next 4 loops costs 1.6 s.

Now, on my Xeon-2186M, each of the 5 runs costs about 3.5-4 s.

So overall, in this part, the M1 seems about 2X faster than my Xeon.

Finally, the code prints the total time it took.

You are welcome to test it, and if you have any suggestions for speeding up the code, please feel free to let me know as well.

Thank you very much indeed in advance!

PS.

The M1 result is

```
start generating big Gaussian random number array, it make take 5 - 20 seconds ...
size of random array = 800000000
random number generating took n seconds, n = 24.14900
i is n out of 5, n = 1
STOCHASTIC_RK_TEST 03 ------------------------------------
Time cost = 3.005000 sec
step size : 1/ 1000
Np: 100000
Simulated Mean at t=0.2:1.0:0.2 is: 16.374 13.407 10.976 8.9851 7.3570
Theoretical Mean at t=0.2:1.0:0.2 is: 16.375 13.406 10.976 8.9866 7.3576
Simulated Variance at t=0.2:1.0:0.2 is: 0.16495 0.27571 0.34949 0.39851 0.43052
Theoretical Variance at t=0.2:1.0:0.2 is: 0.16484 0.27534 0.34940 0.39905 0.43233
------------------------------------
i is n out of 5, n = 2
STOCHASTIC_RK_TEST 03 ------------------------------------
Time cost = 1.689000 sec
step size : 1/ 1000
Np: 100000
Simulated Mean at t=0.2:1.0:0.2 is: 16.374 13.407 10.976 8.9851 7.3570
Theoretical Mean at t=0.2:1.0:0.2 is: 16.375 13.406 10.976 8.9866 7.3576
Simulated Variance at t=0.2:1.0:0.2 is: 0.16495 0.27571 0.34949 0.39851 0.43052
Theoretical Variance at t=0.2:1.0:0.2 is: 0.16484 0.27534 0.34940 0.39905 0.43233
------------------------------------
i is n out of 5, n = 3
STOCHASTIC_RK_TEST 03 ------------------------------------
Time cost = 1.599000 sec
step size : 1/ 1000
Np: 100000
Simulated Mean at t=0.2:1.0:0.2 is: 16.374 13.407 10.976 8.9851 7.3570
Theoretical Mean at t=0.2:1.0:0.2 is: 16.375 13.406 10.976 8.9866 7.3576
Simulated Variance at t=0.2:1.0:0.2 is: 0.16495 0.27571 0.34949 0.39851 0.43052
Theoretical Variance at t=0.2:1.0:0.2 is: 0.16484 0.27534 0.34940 0.39905 0.43233
------------------------------------
i is n out of 5, n = 4
STOCHASTIC_RK_TEST 03 ------------------------------------
Time cost = 1.610000 sec
step size : 1/ 1000
Np: 100000
Simulated Mean at t=0.2:1.0:0.2 is: 16.374 13.407 10.976 8.9851 7.3570
Theoretical Mean at t=0.2:1.0:0.2 is: 16.375 13.406 10.976 8.9866 7.3576
Simulated Variance at t=0.2:1.0:0.2 is: 0.16495 0.27571 0.34949 0.39851 0.43052
Theoretical Variance at t=0.2:1.0:0.2 is: 0.16484 0.27534 0.34940 0.39905 0.43233
------------------------------------
i is n out of 5, n = 5
STOCHASTIC_RK_TEST 03 ------------------------------------
Time cost = 1.606000 sec
step size : 1/ 1000
Np: 100000
Simulated Mean at t=0.2:1.0:0.2 is: 16.374 13.407 10.976 8.9851 7.3570
Theoretical Mean at t=0.2:1.0:0.2 is: 16.375 13.406 10.976 8.9866 7.3576
Simulated Variance at t=0.2:1.0:0.2 is: 0.16495 0.27571 0.34949 0.39851 0.43052
Theoretical Variance at t=0.2:1.0:0.2 is: 0.16484 0.27534 0.34940 0.39905 0.43233
------------------------------------
total time cost = 33.76200 seconds
STOP Program end normally.
```

The Xeon-2186M result is:

```
start generating big Gaussian random number array, it make take 5 - 20 seconds
...
size of random array = 800000000
random number generating took n seconds, n = 9.563000
i is n out of 5, n = 1
STOCHASTIC_RK_TEST 03 ------------------------------------
Time cost = 4.104000 sec
step size : 1/ 1000
Np: 100000
Simulated Mean at t=0.2:1.0:0.2 is: 16.374 13.407 10.976 8.9851 7.3570
Theoretical Mean at t=0.2:1.0:0.2 is: 16.375 13.406 10.976 8.9866 7.3576
Simulated Variance at t=0.2:1.0:0.2 is: 0.16495 0.27571 0.34949 0.39851 0.43052
Theoretical Variance at t=0.2:1.0:0.2 is: 0.16484 0.27534 0.34940 0.39905 0.43233
------------------------------------
i is n out of 5, n = 2
STOCHASTIC_RK_TEST 03 ------------------------------------
Time cost = 3.786000 sec
step size : 1/ 1000
Np: 100000
Simulated Mean at t=0.2:1.0:0.2 is: 16.374 13.407 10.976 8.9851 7.3570
Theoretical Mean at t=0.2:1.0:0.2 is: 16.375 13.406 10.976 8.9866 7.3576
Simulated Variance at t=0.2:1.0:0.2 is: 0.16495 0.27571 0.34949 0.39851 0.43052
Theoretical Variance at t=0.2:1.0:0.2 is: 0.16484 0.27534 0.34940 0.39905 0.43233
------------------------------------
i is n out of 5, n = 3
STOCHASTIC_RK_TEST 03 ------------------------------------
Time cost = 4.156000 sec
step size : 1/ 1000
Np: 100000
Simulated Mean at t=0.2:1.0:0.2 is: 16.374 13.407 10.976 8.9851 7.3570
Theoretical Mean at t=0.2:1.0:0.2 is: 16.375 13.406 10.976 8.9866 7.3576
Simulated Variance at t=0.2:1.0:0.2 is: 0.16495 0.27571 0.34949 0.39851 0.43052
Theoretical Variance at t=0.2:1.0:0.2 is: 0.16484 0.27534 0.34940 0.39905 0.43233
------------------------------------
i is n out of 5, n = 4
STOCHASTIC_RK_TEST 03 ------------------------------------
Time cost = 3.776000 sec
step size : 1/ 1000
Np: 100000
Simulated Mean at t=0.2:1.0:0.2 is: 16.374 13.407 10.976 8.9851 7.3570
Theoretical Mean at t=0.2:1.0:0.2 is: 16.375 13.406 10.976 8.9866 7.3576
Simulated Variance at t=0.2:1.0:0.2 is: 0.16495 0.27571 0.34949 0.39851 0.43052
Theoretical Variance at t=0.2:1.0:0.2 is: 0.16484 0.27534 0.34940 0.39905 0.43233
------------------------------------
i is n out of 5, n = 5
STOCHASTIC_RK_TEST 03 ------------------------------------
Time cost = 3.529000 sec
step size : 1/ 1000
Np: 100000
Simulated Mean at t=0.2:1.0:0.2 is: 16.374 13.407 10.976 8.9851 7.3570
Theoretical Mean at t=0.2:1.0:0.2 is: 16.375 13.406 10.976 8.9866 7.3576
Simulated Variance at t=0.2:1.0:0.2 is: 0.16495 0.27571 0.34949 0.39851 0.43052
Theoretical Variance at t=0.2:1.0:0.2 is: 0.16484 0.27534 0.34940 0.39905 0.43233
------------------------------------
total time cost = 29.53600 seconds
Program end normally.
```

The most time-consuming part is the subroutine below, which is basically a vectorized version of John Burkardt's stochastic RK code with some optimizations: https://people.math.sc.edu/Burkardt/f_src/stochastic_rk/stochastic_rk.f90

```
subroutine rk4_ti_fullvec_test ( x0, np, nstep, q, h, fi_gi_in, nd, x )
  ! https://stackoverflow.com/questions/69147944/is-there-room-to-further-optimize-the-stochastic-rk-fortran-90-code
  ! https://stackoverflow.com/questions/32809769/how-to-pass-subroutine-names-as-arguments-in-fortran
  use random
  use fg
  implicit none
  integer(kind = i8), intent(in) :: np
  integer(kind = i8), intent(in) :: nstep
  integer(kind = i8), intent(in) :: nd
  procedure(fi_gi_fullvec_03) :: fi_gi_in
  real(kind = r8), intent(in) :: q, h
  real(kind = r8), intent(in) :: x0(np,nd)
  real(kind = r8), intent(out) :: x(np,nd,0:nstep)
  real(kind = r8) :: ks(np,nd,4)
  real(kind = r8) :: xs(np,nd,4)
  integer(kind = i8) :: j, k, l
  real(kind = r8) :: xstar(np,nd)
  real(kind = r8) :: f(np,nd), g(np,nd)
  real(kind = r8) :: sigma(4)

  ! qs, as, alphas are the stochastic RK coefficient tables
  ! (module data, cf. Burkardt's stochastic_rk).
  sigma = sqrt(qs*q/h)
  x(:,:,0) = x0
  do k = 1, nstep
    xstar = x(:,:,k-1)
    do j = 1, 4
      ! Stage states. For j = 1 the matmul contracts over a
      ! zero-sized slice, which correctly evaluates to zero.
      do concurrent (l = 1:nd)
        xs(:,l,j) = x(:,l,k-1) + matmul(ks(:,l,:j-1), as(:j-1,j))
      enddo
      call fi_gi_in( np, nd, xs(:,:,j), x0(1,:), f, g )
      ! Stage increment: drift plus the precomputed Gaussian noise
      ! array from PART 1, scaled per stage.
      ks(:,:,j) = h * ( f + g*normal(:,:,j,k)*sigma(j) )
      xstar = xstar + alphas(j)*ks(:,:,j)
    enddo
    x(:,:,k) = xstar
  enddo
end subroutine rk4_ti_fullvec_test
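The stage structure of that update can be sketched in Python for clarity. Note the coefficient arrays `a_s`, `alphas`, `q_s` below are placeholders chosen only to exercise the data flow; the real values come from Burkardt's stochastic_rk coefficient tables and are not reproduced here.

```python
import numpy as np

# Structural sketch (illustrative, NOT the real coefficients) of one step
# of the 4-stage stochastic RK update above, for npaths paths and nd
# equations. f and g are callables returning the drift and diffusion
# evaluated at a full (npaths, nd) state array.
def srk4_step(x, f, g, h, q, w, a_s, alphas, q_s):
    """x: (npaths, nd) state; w: (npaths, nd, 4) standard normals for this step."""
    npaths, nd = x.shape
    ks = np.zeros((npaths, nd, 4))
    xstar = x.copy()
    sigma = np.sqrt(q_s * q / h)  # per-stage noise scaling, as in the Fortran
    for j in range(4):
        # Stage state: x plus a combination of earlier stage increments.
        # For j = 0 the contraction is over a zero-sized axis and gives 0,
        # mirroring the zero-sized matmul trick in the Fortran code.
        xs = x + np.einsum('pli,i->pl', ks[:, :, :j], a_s[:j, j])
        ks[:, :, j] = h * (f(xs) + g(xs) * w[:, :, j] * sigma[j])
        xstar = xstar + alphas[j] * ks[:, :, j]
    return xstar
```

With zero noise, zero `a_s`, and `alphas = [1, 0, 0, 0]` this degenerates to a plain Euler step, which is a convenient way to check the plumbing.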
```