Why solving two differential equations is 8 times slower than solving one?

Dear all, I have a question about solving SDE.
The code is very easy, the paper and code information can be found in the below thread,

It is based on John Burkardt’s code which is based on paper of Kasdin’s Runge-Kutta solver for SDE.

John Burkardt’s code is just a scaler version, which means just solve one SDE.
https://people.math.sc.edu/Burkardt/f_src/stochastic_rk/stochastic_rk.html

It is easy to extend to vector version, for example, to solve 2 SDEs. Here is my code, which is as close to Burkardt’s as possible,

subroutine rk4_ti_step_vec_mod ( x, t, h, q, fi, gi, n, xstar )
    use random
    implicit none
    interface				   !! Declare function        
        function fi ( x ) result (f)
        use constants
        real ( kind = r8 ) :: f(2)
        real ( kind = r8 ) :: x(2)
        end function
        function gi ( x ) result (g)
        use constants
        real ( kind = r8 ) :: g(2)
        real ( kind = r8 ) :: x(2)
        end function      
    end interface
    integer ( kind = i8 ) n
    real ( kind = r8 ) h
    real ( kind = r8 ) k1(n)
    real ( kind = r8 ) k2(n)
    real ( kind = r8 ) k3(n)
    real ( kind = r8 ) k4(n)
    real ( kind = r8 ) q
    real ( kind = r8 ) r8_normal_01

    real ( kind = r8 ) t
    real ( kind = r8 ) t1
    real ( kind = r8 ) t2
    real ( kind = r8 ) t3
    real ( kind = r8 ) t4
    real ( kind = r8 ) w1
    real ( kind = r8 ) w2
    real ( kind = r8 ) w3
    real ( kind = r8 ) w4
    real ( kind = r8 ) x(n)
    real ( kind = r8 ) x1(n)
    real ( kind = r8 ) x2(n)
    real ( kind = r8 ) x3(n)
    real ( kind = r8 ) x4(n)
    real ( kind = r8 ) xstar(n)

    real ( kind = r8 ), parameter :: a21 = 2.71644396264860_r8 &
    ,a31 = - 6.95653259006152_r8 &
    ,a32 =   0.78313689457981_r8 &
    ,a41 =   0.0_r8 &
    ,a42 =   0.48257353309214_r8 &
    ,a43 =   0.26171080165848_r8 &
    ,a51 =   0.47012396888046_r8 &
    ,a52 =   0.36597075368373_r8 &
    ,a53 =   0.08906615686702_r8 &
    ,a54 =   0.07483912056879_r8

    real ( kind = r8 ), parameter, dimension(4) :: qs = [ 2.12709852335625_r8 &
        ,2.73245878238737_r8 &
        ,11.22760917474960_r8 &
        ,13.36199560336697_r8 ]
    real ( kind = r8 ), parameter, dimension(2) :: aaa = [one,one]

    real ( kind = r8 ) :: qoh
    real ( kind = r8 ) :: warray(n,4)
    integer (kind = i8) :: i

    qoh = q / h

    do i =1,4
        warray(:,i) = gaussian(n)*sqrt(qs(i)*qoh)
    enddo

    !t1 = t
    x1 = x
    k1 = h * ( fi ( x1 ) + gi ( x1 ) * warray(:,1) ) 

    !t2 = t1 + a21 * h
    x2 = x1 + a21 * k1
    k2 = h * ( fi ( x2 ) + gi ( x2 ) * warray(:,2) )

    !t3 = t1 + ( a31  + a32 )* h
    x3 = x1 + a31 * k1 + a32 * k2
    k3 = h * ( fi ( x3 ) + gi ( x3 ) * warray(:,3) )

    !t4 = t1 + ( a41  + a42 + a43 ) * h
    x4 = x1 + a41 * k1 + a42 * k2 + a43 * k3 ! the a43 * k3 seems need to be added.
    k4 = h * ( fi ( x4 ) + gi ( x4 ) * warray(:,4) )

    xstar = x1 + a51 * k1 + a52 * k2 + a53 * k3 + a54 * k4

return
end subroutine rk4_ti_step_vec_mod

My function gaussian(n) generate n element gaussian number.

Is there any performance issue for the above subroutine rk4_ti_step_vec_mod? In PS part, I have a more concise and perhaps more vectorized version. The speed is about 15% faster but still not enough.

The two functions fi and gi are extremely easy, they can be defined as,

function fi_vec2 ( x ) result (f)
use constants
implicit none
real ( kind = r8 ) :: f(2)
real ( kind = r8 ) :: x(2)
  f(1) = -x(2)*x(1) 
  f(2) = -x(2) + one
return
end function fi_vec2

function gi_vec2 ( x ) result(g)
use constants
implicit none
real ( kind = r8 ), parameter :: gp(2) = [one,zero] 
real ( kind = r8 ) :: x(2),g(2)
g=gp
return
end function gi_vec2

In John’s scalar code in his website, I define the scaler fi and gi as,

function fi ( x )
use constants
  implicit none

  real ( kind = r8 ) fi
  real ( kind = r8 ) x

  fi = -x  !1.0_r8

  return
end function fi
function gi ( x )
use constants
  implicit none

  real ( kind = r8 ) gi
  real ( kind = r8 ) x

  gi = 1.0

  return
end function gi

Now I set time range as [0,1], and set step size as 1/1000, so call the solver 1000 times. I repeat the process 10000 times.

For the scalar version, just one SDE, it took 1.2 second.
However, for the vector version using my subroutine rk4_ti_step_vec_mod, that is, I solve 2 SDE using the fi_vec2 and gi_vec2 I mentioned above. It took 9 seconds.

However, I have tried many different ways, and I counted the operations, it seems solving 2 SDEs using my model should slow down by a factor of 2-4 perhaps. But it should not be 8 times slower.

So I just wanted to ask, for anyone who have experience in solving SDE or ODE using Runge-Kutta method, in your experience, solving 2 SDE/ODEs, and solving 1 SDE/ODE, tyically how much is the time difference? Assuming the SDE/ODEs are all very simple.

Thank you very much indeed!

PS,
I also have a more concise version based on the suggestion by veryreverie, it is slightly faster, took 8.xxx seconds or so.
But still not that much faster. So I was wondering what is the problem.

subroutine rk4_ti_step_vec_mod1 ( x, t, h, q, fi_vec, gi_vec, n, xstar )
    use random
    implicit none
    interface				   !! Declare function        
        function fi_vec ( x ) result (f)
        use constants
        real ( kind = r8 ) :: f(2)
        real ( kind = r8 ) :: x(2)
        end function
        function gi_vec ( x ) result (g)
        use constants
        real ( kind = r8 ) :: g(2)
        real ( kind = r8 ) :: x(2)
        end function      
    end interface       
    integer(kind = i8), intent(in) :: n
    real(kind=r8), intent(in) :: x(n)
    real(kind = r8), intent(in) :: t  
    real(kind = r8), intent(in) :: h
    real(kind = r8), intent(in) :: q
    !real(kind = r8), external :: fi_vec
    !real(kind = r8), external :: gi_vec
    real(kind = r8), intent(out) :: xstar(n)
    real ( kind = r8 ), parameter :: alphas(4) = [ 0.47012396888046_r8 &
    ,0.36597075368373_r8 &
    ,0.08906615686702_r8 &
    ,0.07483912056879_r8 ]
    real(kind = r8), parameter :: as(4,4) = reshape([ &
        &  0.0_r8,              0.0_r8,              0.0_r8,              0.0_r8, &
        &  2.71644396264860_r8, 0.0_r8,              0.0_r8,              0.0_r8, &
        & -6.95653259006152_r8, 0.78313689457981_r8, 0.0_r8,              0.0_r8, &
        &  0.0_r8,              0.48257353309214_r8, 0.26171080165848_r8, 0.0_r8 ], [4,4])
    real(kind = r8), parameter :: qs(4) = [  2.12709852335625_r8, &
        &  2.73245878238737_r8, &
        & 11.22760917474960_r8, &
        & 13.36199560336697_r8 ]
    real(kind = r8) :: ks(n,4)
    real(kind = r8) :: ts(n,4)
    real(kind = r8) :: ws(n,4)
    real(kind = r8) :: xs(n,4)
    real(kind = r8) :: warray(n,4)
    real(kind = r8) :: ak(n)
    integer(kind = i8) :: j,i
    real(kind = r8) :: qoh
    qoh = q / h
    xstar = x    
    do j=1,4
        warray(:,j) = gaussian(n)*sqrt(qs(j)*qoh)
        !ts(j) = t + sum(as(:j-1,j)) * h      
        do concurrent (i=1:n)
            ak(i)=dot_product(as(:j-1,j), ks(i,:j-1))  
        enddo        
        !ak=zero
        !do i=1,j-1
        !    ak = ak + as(i,j)*ks(:,i)
        !enddo
        xs(:,j) = x + ak  
        ks(:,j) = h * (  fi_vec(xs(:,j)) + gi_vec(xs(:,j))*warray(:,j)  )
        xstar = xstar + alphas(j)*ks(:,j)     
    enddo
return
end subroutine rk4_ti_step_vec_mod1

Below is the main code,

program main
use random
use random2
use constants
use stochastic_rk
use stats

implicit none

integer(kind=i8) :: i,np,seed,k
integer(kind=i8) :: day,hour,minute
real (kind=r8) :: t0,t1,t2
real (kind=r8), allocatable :: x(:,:)
real (kind=r8) :: xvar(5),xave(5)

call timestamp ( )
write ( *, '(a)' ) ' '
write ( *, '(a)' ) 'STOCHASTIC_RK_TEST'
write ( *, '(a)' ) '  FORTRAN90 version'
write ( *, '(a)' ) '  Test the STOCHASTIC_RK library.'


seed = 123456789
call setrn(int(seed,8))

np = 10000
allocate(x(5,np))

call CPU_TIME(t1) 
do i=1,np
call test02 ( x(:,i) )   
!write (6,*) i, x(i) 
enddo
call CPU_TIME(t2)  
write(6,'(''Time cost = '',t20,g12.5,'' sec'')') t2-t1  
! statistics
call stat2(x,xave,xvar)
write(6,'(''step size : '',t20,g12.5)') '1/1000'
write(6,'(''Np: '',t20,g12.5)') np
write(6,'('' '')')
write(6,'(''Simulated Mean at t=0.2:1.0:0.2 is: '',t50,5(g12.5,1x))') (xave(k),k=1,5)
write(6,'(''Theoretical Mean at t=0.2:1.0:0.2 is: '',t50,5(g12.5,1x))') ( 20.0*exp(-k*0.2),k=1,5 )
write(6,'('' '')')
write(6,'(''Simulated Variance at t=0.2:1.0:0.2 is: '',t50,5(g12.5,1x))') (xvar(k),k=1,5)
write(6,'(''Theoretical Variance at t=0.2:1.0:0.2 is: '',t50,5(g12.5,1x))') ( 0.5*(1.0-exp(-2*k*0.2)),k=1,5 )

!
!  Terminate.
!
call timestamp ( )
stop ('Program end normally.')

end program main 


subroutine test02 ( xout )
use constants
use stochastic_rk
implicit none
interface				   !! Declare function        
function fi_vec2 ( x ) result (f)
use constants
real ( kind = r8 ) :: f(2)
real ( kind = r8 ) :: x(2)
end function
function gi_vec2 ( x ) result (g)
use constants
real ( kind = r8 ) :: g(2)
real ( kind = r8 ) :: x(2)
end function      
end interface 
integer ( kind = i8 ), parameter :: n = 1000
real ( kind = r8 ) h
integer ( kind = i8 ) :: i, itot, istart, istep
real ( kind = r8 ) q
integer ( kind = i8 ) seed
real ( kind = r8 ) t
real ( kind = r8 ), parameter :: t0 = 0.0
real ( kind = r8 ), parameter :: tn = 1.0
real ( kind = r8 ) x(2,0:n)
real (kind = r8) :: xout(5)
integer ( kind = i8 ) :: nd ! dimension, =2

h = ( tn - t0 ) / real(n)
q = 1.0
seed = 123456789

i = 0
t = t0
x(1,i) = 20.0_r8
x(2,i) = one
nd = 2

!write ( *, '(a)' ) ' '
!write ( *, '(a)' ) '         I           T             X'
!write ( *, '(a)' ) ' '
!write ( *, '(2x,i8,2x,f14.6,2x,g14.6)' ) i, t, x(i)

do i = 1, n
t = t0 + (real(i)*(tn-t0))/real(n)
call rk4_ti_step_vec_mod1 ( x(:,i-1), t, h, q, fi_vec2, gi_vec2, nd, x(:,i) )
!write ( *, '(2x,i8,2x,f14.6,2x,g14.6)' ) i, t, x(i)
end do

itot = i-1
istart = itot/5
istep = istart
xout(1:5:1) = x(1,istart:itot:istep)
!write ( *, '(2x,i8,2x,f14.6,2x,g14.6)' ) i-1, t, xout
return
end subroutine test02   

function fi_vec2 ( x ) result (f)
use constants
implicit none
real ( kind = r8 ) :: f(2)
real ( kind = r8 ) :: x(2)
  f(1) = -x(2)*x(1) 
  f(2) = -x(2) + one
return
end function fi_vec2
function gi_vec2 ( x ) result(g)
use constants
implicit none
real ( kind = r8 ), parameter :: gp(2) = [one,zero] 
real ( kind = r8 ) :: x(2),g(2)
g=gp
return
end function gi_vec2    

subroutine test01 ( xout )
use constants
use stochastic_rk
implicit none
integer ( kind = i8 ), parameter :: n = 1000
real ( kind = r8 ), external :: fi
real ( kind = r8 ), external :: gi
real ( kind = r8 ) h
integer ( kind = i8 ) :: i, itot, istart, istep
real ( kind = r8 ) q
integer ( kind = i8 ) seed
real ( kind = r8 ) t
real ( kind = r8 ), parameter :: t0 = 0.0
real ( kind = r8 ), parameter :: tn = 1.0
real ( kind = r8 ) x(0:n)
real (kind = r8) :: xout(5)

h = ( tn - t0 ) / real(n)
q = 1.0
seed = 123456789

i = 0
t = t0
x(i) = 20.0

!write ( *, '(a)' ) ' '
!write ( *, '(a)' ) '         I           T             X'
!write ( *, '(a)' ) ' '
!write ( *, '(2x,i8,2x,f14.6,2x,g14.6)' ) i, t, x(i)

do i = 1, n
t = t0 + (real(i)*(tn-t0))/real(n)
call rk4_ti_step_mod ( x(i-1), t, h, q, fi, gi, seed, x(i) )
!write ( *, '(2x,i8,2x,f14.6,2x,g14.6)' ) i, t, x(i)
end do

itot = i-1
istart = itot/5
istep = istart
xout(1:5:1) = x(istart:itot:istep)
!write ( *, '(2x,i8,2x,f14.6,2x,g14.6)' ) i-1, t, xout
return
end subroutine test01


function fi ( x )
use constants
  implicit none

  real ( kind = r8 ) fi
  real ( kind = r8 ) x

  fi = -x  !1.0_r8

  return
end function fi
function gi ( x )
use constants
  implicit none

  real ( kind = r8 ) gi
  real ( kind = r8 ) x

  gi = 1.0

  return
end function gi

Then there is a module called constants.f90

module constants
    implicit none
    integer, public, parameter :: i4=selected_int_kind(9)
    integer, public, parameter :: i8=selected_int_kind(15)
    integer, public, parameter :: r8=selected_real_kind(15,9)
    real(kind=r8), public, parameter :: zero=0.0_r8,one=1.0_r8,two=2.0_r8,three=3.0_r8,four=4.0_r8 &
        ,five=5.0_r8,six=6.0_r8,seven=7.0_r8,eight=8.0_r8,nine=9.0_r8 &
        ,ten=10.0_r8,tenth=.1_r8,half=.5_r8,third=1.0_r8/3.0_r8,sixth=1.0_r8/6.0_r8 &
        ,pi=4.0_r8*atan(1.0_r8) &
        ,normpdf_factor=1.0_r8/sqrt(8.0_r8*atan(1.0_r8)) ! 1/sqrt(2 pi)
    complex(kind=r8), public, parameter :: czero=(0.0_r8,0.0_r8),ci=(0.0_r8,1.0_r8),cone=(1.0_r8,0.0_r8)  
end module constants

Then module for ran.f90

module random2
implicit none
integer, private, parameter :: i4=selected_int_kind(9)
integer, private, parameter :: i8=selected_int_kind(15)
integer, private, parameter :: r8=selected_real_kind(15,9)

contains
subroutine ran1(rn,irn)
! multiplicative congruential with additive constant
integer(kind=i8),  parameter :: mask24 = ishft(1_8,24)-1
integer(kind=i8),  parameter :: mask48 = ishft(1_8,48_8)-1_8
real(kind=r8),  parameter :: twom48=2.0d0**(-48)
integer(kind=i8),  parameter :: mult1 = 44485709377909_8
integer(kind=i8),  parameter :: m11 = iand(mult1,mask24)
integer(kind=i8),  parameter :: m12 = iand(ishft(mult1,-24),mask24)
integer(kind=i8),  parameter :: iadd1 = 96309754297_8
integer(kind=i8) :: irn
real(kind=r8) :: rn
integer(kind=i8) :: is1,is2
is2 = iand(ishft(irn,-24),mask24)
is1 = iand(irn,mask24)
irn = iand(ishft(iand(is1*m12+is2*m11,mask24),24)+is1*m11+iadd1,mask48)
rn = ior(irn,1_8)*twom48
return
end subroutine ran1

subroutine ran2(rn,irn)
! multiplicative congruential with additive constant
integer(kind=i8),  parameter :: mask24 = ishft(1_8,24)-1
integer(kind=i8), parameter :: mask48 = ishft(1_8,48)-1
real(kind=r8), parameter :: twom48=2.0d0**(-48)
integer(kind=i8), parameter :: mult2 = 34522712143931_8
integer(kind=i8), parameter :: m21 = iand(mult2,mask24)
integer(kind=i8), parameter :: m22 = iand(ishft(mult2,-24),mask24)
integer(kind=i8), parameter :: iadd2 = 55789347517_8
integer(kind=i8) :: irn
real(kind=r8) :: rn
integer(kind=i8) :: is1,is2
is2 = iand(ishft(irn,-24),mask24)
is1 = iand(irn,mask24)
irn = iand(ishft(iand(is1*m22+is2*m21,mask24),24)+is1*m21+iadd2,mask48)
rn = ior(irn,1_8)*twom48
return
end subroutine ran2

end module random2

module random
implicit none
integer, private, parameter :: i4=selected_int_kind(9)
integer, private, parameter :: i8=selected_int_kind(15)
integer, private, parameter :: r8=selected_real_kind(15,9)
integer(kind=i8), private, parameter :: mask48 = ishft(1_8,48)-1
integer(kind=i8), private, save :: irn = 1_8,irnsave

contains   
function randn(n)
! return an array of random variates (0,1)
use random2
integer(kind=i8) :: n
real(kind=r8), dimension(n) :: randn
integer(kind=i8) :: i
do i=1,n
    call ran1(randn(i),irn)
enddo
return
end function randn

function gaussian(n)
! return an array of gaussian random variates with variance 1
integer(kind=i8) :: n
real(kind=r8), dimension(n) :: gaussian
real(kind=r8), dimension(2*((n+1)/2)) :: rn
real(kind=r8) :: rn1,rn2,arg,x1,x2,x3
real(kind=r8), parameter :: pi=4.0d0*atan(1.0_r8)
integer(kind=i8) :: i
rn=randn(size(rn,kind=i8))
do i=1,n/2
    rn1=rn(2*i-1)
    rn2=rn(2*i)
    arg=2.0_r8*pi*rn1
    x1=sin(arg)
    x2=cos(arg)
    x3=sqrt(-2.0_r8*log(rn2))
    gaussian(2*i-1)=x1*x3
    gaussian(2*i)=x2*x3
enddo
if (mod(n,2).ne.0) then
    rn1=rn(n)
    rn2=rn(n+1)
    arg=2.0_r8*pi*rn1
    x2=cos(arg)
    x3=sqrt(-2.0_r8*log(rn2))
    gaussian(n)=x2*x3
    endif
return
end function gaussian

subroutine setrn(irnin)
! set the seed
integer(kind=i8) :: irnin
integer(kind=i8) :: n
integer(kind=i8), allocatable :: iseed(:)
irn=iand(irnin,mask48)
call savern(irn)
call RANDOM_SEED(SIZE=n)
allocate(iseed(n))
iseed=int(irnin)
call RANDOM_SEED(PUT=iseed) ! set the internal ran function's seed.
return
end subroutine setrn

subroutine savern(irnin)
! save the seed
integer(kind=i8) :: irnin
irnsave=irnin
return
end subroutine savern

subroutine showirn(irnout)
! show the seed
integer(kind=i8) :: irnout
irnout=irnsave
return
end subroutine showirn   

function randomn(n) ! the default ramdom_number subroutine.
! return an array of random variates (0,1)
integer(kind=i8) :: n
real(kind=r8), dimension(n) :: randomn
call RANDOM_NUMBER(randomn)
return
end function randomn

subroutine random1(x) ! generate random number x using the default ramdom_number subroutine.
real(kind=r8) :: x
call RANDOM_NUMBER(x)
return
end subroutine random1

end module random

Then a module for stats.f90

module stats
use constants 
implicit none
contains
subroutine stat2(x,xave,xvar)
use constants
real (kind=r8) :: x(:,:),val(5),val2(5),xvar(5),xave(5)
integer (kind =i8) :: i,j,xsize,np
val=zero
val2=zero
np = size(x,2)
do i=1,np
    do j=1,5
        val(j) = val(j) + x(j,i)
        val2(j) = val2(j) + x(j,i)**2
    enddo
enddo
xave=val/np
xvar=abs( val2/np - xave**2 ) 
return
end subroutine stat2 

subroutine stat(x,xave,xvar)
use constants
real (kind=r8) :: x(:),val,val2,xvar,xave
integer (kind =i8) :: i,xsize,np
val=0.0
val2=0.0
np = size(x)
do i=1,np
    val = val + x(i)
    val2= val2 + x(i)**2
    !write (6,*) i, x(i)
enddo
xave=val/np
xvar=abs(  val2/np - xave**2  )   
end subroutine stat 

end module stats

Then there is a module for stochastic_rk.f90

module stochastic_rk
use constants  
implicit none
contains    
subroutine rk4_ti_step_mod ( x, t, h, q, fi, gi, seed, xstar )
use random

!*****************************************************************************80
!
!! RK4_TI_STEP takes one step of a stochastic Runge Kutta scheme.
!
!  Discussion:
!
!    The Runge-Kutta scheme is fourth-order, and suitable for time-invariant
!    systems in which F and G do not depend explicitly on time.
!
!    d/dx X(t,xsi) = F ( X(t,xsi) ) + G ( X(t,xsi) ) * w(t,xsi)
!
!  Licensing:
!
!    This code is distributed under the GNU LGPL license.
!
!  Modified:
!
!    20 June 2010
!
!  Author:
!
!    John Burkardt
!
!  Reference:
!
!    Jeremy Kasdin,
!    Runge-Kutta algorithm for the numerical integration of
!    stochastic differential equations,
!    Journal of Guidance, Control, and Dynamics,
!    Volume 18, Number 1, January-February 1995, pages 114-120.
!
!    Jeremy Kasdin,
!    Discrete Simulation of Colored Noise and Stochastic Processes
!    and 1/f^a Power Law Noise Generation,
!    Proceedings of the IEEE,
!    Volume 83, Number 5, 1995, pages 802-827.
!
!  Parameters:
!
!    Input, real ( kind = 8 ) X, the value at the current time.
!
!    Input, real ( kind = 8 ) T, the current time.
!
!    Input, real ( kind = 8 ) H, the time step.
!
!    Input, real ( kind = 8 ) Q, the spectral density of the input white noise.
!
!    Input, external real ( kind = 8 ) FI, the name of the deterministic
!    right hand side function.
!
!    Input, external real ( kind = 8 ) GI, the name of the stochastic
!    right hand side function.
!
!    Input/output, integer ( kind = 4 ) SEED, a seed for the random 
!    number generator.
!
!    Output, real ( kind = 8 ) XSTAR, the value at time T+H.
!
implicit none

real ( kind = r8 ), external :: fi
real ( kind = r8 ), external :: gi
real ( kind = r8 ) h
real ( kind = r8 ) k1
real ( kind = r8 ) k2
real ( kind = r8 ) k3
real ( kind = r8 ) k4
real ( kind = r8 ) q
real ( kind = r8 ) r8_normal_01
integer ( kind = i8 ) seed
real ( kind = r8 ) t
real ( kind = r8 ) t1
real ( kind = r8 ) t2
real ( kind = r8 ) t3
real ( kind = r8 ) t4
real ( kind = r8 ) w1
real ( kind = r8 ) w2
real ( kind = r8 ) w3
real ( kind = r8 ) w4
real ( kind = r8 ) x
real ( kind = r8 ) x1
real ( kind = r8 ) x2
real ( kind = r8 ) x3
real ( kind = r8 ) x4
real ( kind = r8 ) xstar

real ( kind = r8 ) :: qoh
real ( kind = r8 ) :: normal(4)

real ( kind = r8 ), parameter :: a21 = 2.71644396264860_r8 &
,a31 = - 6.95653259006152_r8 &
,a32 =   0.78313689457981_r8 &
,a41 =   0.0_r8 &
,a42 =   0.48257353309214_r8 &
,a43 =   0.26171080165848_r8 &
,a51 =   0.47012396888046_r8 &
,a52 =   0.36597075368373_r8 &
,a53 =   0.08906615686702_r8 &
,a54 =   0.07483912056879_r8

real ( kind = r8 ), parameter, dimension(4) :: qarray = [ 2.12709852335625_r8 &
    ,2.73245878238737_r8 &
    ,11.22760917474960_r8 &
    ,13.36199560336697_r8 ]

real ( kind = r8 ) :: warray(4)
integer (kind = 4) :: i

qoh = q / h
normal = gaussian(4)

do i =1,4
    warray(i) = normal(i)*sqrt(qarray(i)*qoh)
enddo

!t1 = t
x1 = x
k1 = h * ( fi ( x1 ) + gi ( x1 ) * warray(1) ) 

!t2 = t1 + a21 * h
x2 = x1 + a21 * k1
k2 = h * ( fi ( x2 ) + gi ( x2 ) * warray(2) )

!t3 = t1 + ( a31  + a32 )* h
x3 = x1 + a31 * k1 + a32 * k2
k3 = h * ( fi ( x3 ) + gi ( x3 ) * warray(3) )

!t4 = t1 + ( a41  + a42 + a43 ) * h
x4 = x1 + a41 * k1 + a42 * k2 + a43 * k3 ! the a43 * k3 seems need to be added.
k4 = h * ( fi ( x4 ) + gi ( x4 ) * warray(4) )

xstar = x1 + a51 * k1 + a52 * k2 + a53 * k3 + a54 * k4

return
end subroutine rk4_ti_step_mod

subroutine rk4_ti_step_vec_mod ( x, t, h, q, fi, gi, n, xstar )
use random
implicit none
interface				   !! Declare function        
    function fi ( x ) result (f)
    use constants
    real ( kind = r8 ) :: f(2)
    real ( kind = r8 ) :: x(2)
    end function
    function gi ( x ) result (g)
    use constants
    real ( kind = r8 ) :: g(2)
    real ( kind = r8 ) :: x(2)
    end function      
end interface
integer ( kind = i8 ) n
real ( kind = r8 ) h
real ( kind = r8 ) k1(n)
real ( kind = r8 ) k2(n)
real ( kind = r8 ) k3(n)
real ( kind = r8 ) k4(n)
real ( kind = r8 ) q
real ( kind = r8 ) r8_normal_01

real ( kind = r8 ) t
real ( kind = r8 ) t1
real ( kind = r8 ) t2
real ( kind = r8 ) t3
real ( kind = r8 ) t4
real ( kind = r8 ) w1
real ( kind = r8 ) w2
real ( kind = r8 ) w3
real ( kind = r8 ) w4
real ( kind = r8 ) x(n)
real ( kind = r8 ) x1(n)
real ( kind = r8 ) x2(n)
real ( kind = r8 ) x3(n)
real ( kind = r8 ) x4(n)
real ( kind = r8 ) xstar(n)

real ( kind = r8 ), parameter :: a21 = 2.71644396264860_r8 &
,a31 = - 6.95653259006152_r8 &
,a32 =   0.78313689457981_r8 &
,a41 =   0.0_r8 &
,a42 =   0.48257353309214_r8 &
,a43 =   0.26171080165848_r8 &
,a51 =   0.47012396888046_r8 &
,a52 =   0.36597075368373_r8 &
,a53 =   0.08906615686702_r8 &
,a54 =   0.07483912056879_r8

real ( kind = r8 ), parameter, dimension(4) :: qs = [ 2.12709852335625_r8 &
    ,2.73245878238737_r8 &
    ,11.22760917474960_r8 &
    ,13.36199560336697_r8 ]
real ( kind = r8 ), parameter, dimension(2) :: aaa = [one,one]

real ( kind = r8 ) :: qoh
real ( kind = r8 ) :: warray(n,4)
integer (kind = i8) :: i

qoh = q / h

do i =1,4
    warray(:,i) = gaussian(n)*sqrt(qs(i)*qoh)
enddo

!t1 = t
x1 = x
k1 = h * ( fi ( x1 ) + gi ( x1 ) * warray(:,1) ) 

!t2 = t1 + a21 * h
x2 = x1 + a21 * k1
k2 = h * ( fi ( x2 ) + gi ( x2 ) * warray(:,2) )

!t3 = t1 + ( a31  + a32 )* h
x3 = x1 + a31 * k1 + a32 * k2
k3 = h * ( fi ( x3 ) + gi ( x3 ) * warray(:,3) )

!t4 = t1 + ( a41  + a42 + a43 ) * h
x4 = x1 + a41 * k1 + a42 * k2 + a43 * k3 ! the a43 * k3 seems need to be added.
k4 = h * ( fi ( x4 ) + gi ( x4 ) * warray(:,4) )

xstar = x1 + a51 * k1 + a52 * k2 + a53 * k3 + a54 * k4

return
end subroutine rk4_ti_step_vec_mod


subroutine rk4_ti_step_vec_mod1 ( x, t, h, q, fi_vec, gi_vec, n, xstar )
    use random
    implicit none
    interface				   !! Declare function        
        function fi_vec ( x ) result (f)
        use constants
        real ( kind = r8 ) :: f(2)
        real ( kind = r8 ) :: x(2)
        end function
        function gi_vec ( x ) result (g)
        use constants
        real ( kind = r8 ) :: g(2)
        real ( kind = r8 ) :: x(2)
        end function      
    end interface       
    integer(kind = i8), intent(in) :: n
    real(kind=r8), intent(in) :: x(n)
    real(kind = r8), intent(in) :: t  
    real(kind = r8), intent(in) :: h
    real(kind = r8), intent(in) :: q
    !real(kind = r8), external :: fi_vec
    !real(kind = r8), external :: gi_vec
    real(kind = r8), intent(out) :: xstar(n)
    real ( kind = r8 ), parameter :: alphas(4) = [ 0.47012396888046_r8 &
    ,0.36597075368373_r8 &
    ,0.08906615686702_r8 &
    ,0.07483912056879_r8 ]
    real(kind = r8), parameter :: as(4,4) = reshape([ &
        &  0.0_r8,              0.0_r8,              0.0_r8,              0.0_r8, &
        &  2.71644396264860_r8, 0.0_r8,              0.0_r8,              0.0_r8, &
        & -6.95653259006152_r8, 0.78313689457981_r8, 0.0_r8,              0.0_r8, &
        &  0.0_r8,              0.48257353309214_r8, 0.26171080165848_r8, 0.0_r8 ], [4,4])
    real(kind = r8), parameter :: qs(4) = [  2.12709852335625_r8, &
        &  2.73245878238737_r8, &
        & 11.22760917474960_r8, &
        & 13.36199560336697_r8 ]
    real(kind = r8) :: ks(n,4)
    real(kind = r8) :: ts(n,4)
    real(kind = r8) :: ws(n,4)
    real(kind = r8) :: xs(n,4)
    real(kind = r8) :: warray(n,4)
    real(kind = r8) :: ak(n)
    integer(kind = i8) :: j,i
    real(kind = r8) :: qoh
    qoh = q / h
    xstar = x    
    do j=1,4
        warray(:,j) = gaussian(n)*sqrt(qs(j)*qoh)
        !ts(j) = t + sum(as(:j-1,j)) * h      
        do concurrent (i=1:n)
            ak(i)=dot_product(as(:j-1,j), ks(i,:j-1))  
        enddo        
        !ak=zero
        !do i=1,j-1
        !    ak = ak + as(i,j)*ks(:,i)
        !enddo
        xs(:,j) = x + ak  
        ks(:,j) = h * (  fi_vec(xs(:,j)) + gi_vec(xs(:,j))*warray(:,j)  )
        xstar = xstar + alphas(j)*ks(:,j)     
    enddo
return
end subroutine rk4_ti_step_vec_mod1

subroutine rk4_ti_step_mod1 ( x, t, h, q, fi, gi, seed, xstar ) 
! mostly suggested by veryreverie from stackoverflow, 
! https://stackoverflow.com/questions/69147944/is-there-room-to-further-optimize-the-stochastic-rk-fortran-90-code?noredirect=1#comment122226757_69147944
use random  
implicit none
real(kind=r8), intent(in) :: x
real(kind = r8), intent(in) :: t  
real(kind = r8), intent(in) :: h
real(kind = r8), intent(in) :: q
real(kind = r8), external :: fi
real(kind = r8), external :: gi
integer(kind = i8), intent(in) :: seed
real(kind = r8), intent(out) :: xstar
real ( kind = r8 ), parameter :: alphas(4) = [ 0.47012396888046_r8 &
,0.36597075368373_r8 &
,0.08906615686702_r8 &
,0.07483912056879_r8 ]
real(kind = r8), parameter :: as(4,4) = reshape([ &
    &  0.0_r8,              0.0_r8,              0.0_r8,              0.0_r8, &
    &  2.71644396264860_r8, 0.0_r8,              0.0_r8,              0.0_r8, &
    & -6.95653259006152_r8, 0.78313689457981_r8, 0.0_r8,              0.0_r8, &
    &  0.0_r8,              0.48257353309214_r8, 0.26171080165848_r8, 0.0_r8 ], [4,4])
real(kind = r8), parameter :: qs(4) = [ &
    &  2.12709852335625_r8, &
    &  2.73245878238737_r8, &
    & 11.22760917474960_r8, &
    & 13.36199560336697_r8 ]
real(kind = r8) :: ks(4)
real(kind = r8) :: r8_normal_01
real(kind = r8) :: ts(4)
real(kind = r8) :: ws(4)
real(kind = r8) :: xs(4)
real(kind = r8) :: normal(4)
real(kind = r8) :: warray(4)
integer(kind = i8) :: i
real(kind = r8) :: qoh
qoh = q/h
normal = gaussian(4)  
xstar = x
do i=1,4
!ts(i) = t + sum(as(:i-1,i)) * h
xs(i) = x + dot_product(as(:i-1,i), ks(:i-1))
warray(i) = normal(i)*sqrt(qs(i)*qoh)
ks(i) = h * (fi(xs(i)) + gi(xs(i))*warray(i))
xstar = xstar + alphas(i)*ks(i)
enddo
return
end subroutine rk4_ti_step_mod1

subroutine timestamp ( )


character ( len = 8 ) ampm
integer ( kind = 4 ) d
integer ( kind = 4 ) h
integer ( kind = 4 ) m
integer ( kind = 4 ) mm
character ( len = 9 ), parameter, dimension(12) :: month = (/ &
'January  ', 'February ', 'March    ', 'April    ', &
'May      ', 'June     ', 'July     ', 'August   ', &
'September', 'October  ', 'November ', 'December ' /)
integer ( kind = 4 ) n
integer ( kind = 4 ) s
integer ( kind = 4 ) values(8)
integer ( kind = 4 ) y

call date_and_time ( values = values )

y = values(1)
m = values(2)
d = values(3)
h = values(5)
n = values(6)
s = values(7)
mm = values(8)

if ( h < 12 ) then
ampm = 'AM'
else if ( h == 12 ) then
if ( n == 0 .and. s == 0 ) then
    ampm = 'Noon'
else
    ampm = 'PM'
end if
else
h = h - 12
if ( h < 12 ) then
    ampm = 'PM'
else if ( h == 12 ) then
    if ( n == 0 .and. s == 0 ) then
    ampm = 'Midnight'
    else
    ampm = 'AM'
    end if
end if
end if

write ( *, '(i2.2,1x,a,1x,i4,2x,i2,a1,i2.2,a1,i2.2,a1,i3.3,1x,a)' ) &
d, trim ( month(m) ), y, h, ':', n, ':', s, '.', mm, trim ( ampm )

return
end
end module stochastic_rk

Compile those module files + main file, the code should run. I use intel OneAPI + Visual studio,
compile with O3 + QxHost + heaparray, … etc

Thank you, I included everything in the PS part now.

The latest intel OneAPI. compile with O3 + QxHost + heaparray, … etc

The random part does not matter. I solve one SDE and two SDE using the same random subroutine.
It is just that when solving two SDE I need to generate twice random number than one SDE’s. So just like other parts, it should be just like 2 times slower.
But it is not possible to be 8 times slower, because I did not seem to call it 8 times more. Besides, the random number generation is very cheap.

The equations are extremely easy for f and g functions, therefore 2 SDEs vs 1 SDE is mostly like 2d vector calculations vs 1d scalar calculations. I have counted the number of operations, 2 to 3 factor slow down might be possible. But 8 times slow down is unlikely.

To make it more clear, I make the rk4_ti_step_vec_mod more easy to understand as to why it should be almost twice slower than the scaler version instead of 8 times, I made the rk4_ti_step_vec_mod2 code below,

subroutine rk4_ti_step_vec_mod2 ( x, t, h, q, fi1, fi2, gi1, gi2, n, xstar )
use random
implicit none
real ( kind = r8 ), external :: fi1, fi2, gi1, gi2
integer ( kind = i8 ) n
real ( kind = r8 ) h
real ( kind = r8 ) k1(n)
real ( kind = r8 ) k2(n)
real ( kind = r8 ) k3(n)
real ( kind = r8 ) k4(n)
real ( kind = r8 ) q

real ( kind = r8 ) t
real ( kind = r8 ) t1
real ( kind = r8 ) t2
real ( kind = r8 ) t3
real ( kind = r8 ) t4
real ( kind = r8 ) w1
real ( kind = r8 ) w2
real ( kind = r8 ) w3
real ( kind = r8 ) w4
real ( kind = r8 ) x(n)
real ( kind = r8 ) x1(n)
real ( kind = r8 ) x2(n)
real ( kind = r8 ) x3(n)
real ( kind = r8 ) x4(n)
real ( kind = r8 ) xstar(n)

real ( kind = r8 ), parameter :: a21 = 2.71644396264860_r8 &
,a31 = - 6.95653259006152_r8 &
,a32 =   0.78313689457981_r8 &
,a41 =   0.0_r8 &
,a42 =   0.48257353309214_r8 &
,a43 =   0.26171080165848_r8 &
,a51 =   0.47012396888046_r8 &
,a52 =   0.36597075368373_r8 &
,a53 =   0.08906615686702_r8 &
,a54 =   0.07483912056879_r8

real ( kind = r8 ), parameter, dimension(4) :: qs = [ 2.12709852335625_r8 &
    ,2.73245878238737_r8 &
    ,11.22760917474960_r8 &
    ,13.36199560336697_r8 ]

real ( kind = r8 ) :: qoh
real ( kind = r8 ) :: warray(n,4)
integer (kind = i8) :: i

qoh = q / h

do i =1,4
    warray(:,i) = gaussian(n)*sqrt(qs(i)*qoh)
enddo

!t1 = t
x1 = x
k1(1) = h * ( fi1 ( x1 ) + gi1 ( x1 ) * warray(1,1) ) 
k1(2) = h * ( fi2 ( x1 ) + gi2 ( x1 ) * warray(2,1) ) 

!t2 = t1 + a21 * h
x2 = x1 + a21 * k1
k2(1) = h * ( fi1 ( x2 ) + gi1 ( x2 ) * warray(1,2) )
k2(2) = h * ( fi2 ( x2 ) + gi2 ( x2 ) * warray(2,2) )

!t3 = t1 + ( a31  + a32 )* h
x3 = x1 + a31 * k1 + a32 * k2
k3(1) = h * ( fi1 ( x3 ) + gi1 ( x3 ) * warray(1,3) )
k3(2) = h * ( fi2 ( x3 ) + gi2 ( x3 ) * warray(2,3) )

!t4 = t1 + ( a41  + a42 + a43 ) * h
x4 = x1 + a41 * k1 + a42 * k2 + a43 * k3 ! the a43 * k3 seems need to be added.
k4(1) = h * ( fi1 ( x4 ) + gi1 ( x4 ) * warray(1,4) )
k4(2) = h * ( fi2 ( x4 ) + gi2 ( x4 ) * warray(2,4) )

xstar = x1 + a51 * k1 + a52 * k2 + a53 * k3 + a54 * k4

return
end subroutine rk4_ti_step_vec_mod2    

The external functions fi1, fi2, gi1, gi2 are extremely simple,

function fi_vec21 ( x ) 
use constants
implicit none
real ( kind = r8 ) :: fi_vec21 
real ( kind = r8 ) :: x(2)
  fi_vec21  = -x(2)*x(1) 
return
end function fi_vec21
function fi_vec22 ( x ) 
use constants
implicit none
real ( kind = r8 ) :: fi_vec22 
real ( kind = r8 ) :: x(2)
  fi_vec22  = -x(2) + one
return
end function fi_vec22

function gi_vec21 ( x )
use constants
implicit none
real ( kind = r8 ), parameter :: gp(2) = [one,zero] 
real ( kind = r8 ) :: x(2),gi_vec21
gi_vec21 = gp(1)
return
end function gi_vec21   
function gi_vec22 ( x )
use constants
implicit none
real ( kind = r8 ), parameter :: gp(2) = [one,zero] 
real ( kind = r8 ) :: x(2),gi_vec22
gi_vec22 = gp(2)
return
end function gi_vec22

It should be just about 2 times slower than the scaler version below,

subroutine rk4_ti_step_mod ( x, t, h, q, fi, gi, seed, xstar )
use random

!*****************************************************************************80
!
!! RK4_TI_STEP takes one step of a stochastic Runge Kutta scheme.
!
!  Discussion:
!
!    The Runge-Kutta scheme is fourth-order, and suitable for time-invariant
!    systems in which F and G do not depend explicitly on time.
!
!    d/dt X(t,xsi) = F ( X(t,xsi) ) + G ( X(t,xsi) ) * w(t,xsi)
!
!  Licensing:
!
!    This code is distributed under the GNU LGPL license.
!
!  Modified:
!
!    20 June 2010
!
!  Author:
!
!    John Burkardt
!
!  Reference:
!
!    Jeremy Kasdin,
!    Runge-Kutta algorithm for the numerical integration of
!    stochastic differential equations,
!    Journal of Guidance, Control, and Dynamics,
!    Volume 18, Number 1, January-February 1995, pages 114-120.
!
!    Jeremy Kasdin,
!    Discrete Simulation of Colored Noise and Stochastic Processes
!    and 1/f^a Power Law Noise Generation,
!    Proceedings of the IEEE,
!    Volume 83, Number 5, 1995, pages 802-827.
!
!  Parameters:
!
!    Input, real ( kind = 8 ) X, the value at the current time.
!
!    Input, real ( kind = 8 ) T, the current time.
!
!    Input, real ( kind = 8 ) H, the time step.
!
!    Input, real ( kind = 8 ) Q, the spectral density of the input white noise.
!
!    Input, external real ( kind = 8 ) FI, the name of the deterministic
!    right hand side function.
!
!    Input, external real ( kind = 8 ) GI, the name of the stochastic
!    right hand side function.
!
!    Input/output, integer ( kind = 4 ) SEED, a seed for the random 
!    number generator.
!
!    Output, real ( kind = 8 ) XSTAR, the value at time T+H.
!
implicit none

real ( kind = r8 ), external :: fi
real ( kind = r8 ), external :: gi
real ( kind = r8 ) h
real ( kind = r8 ) k1
real ( kind = r8 ) k2
real ( kind = r8 ) k3
real ( kind = r8 ) k4
real ( kind = r8 ) q
real ( kind = r8 ) r8_normal_01
integer ( kind = i8 ) seed
real ( kind = r8 ) t
real ( kind = r8 ) t1
real ( kind = r8 ) t2
real ( kind = r8 ) t3
real ( kind = r8 ) t4
real ( kind = r8 ) w1
real ( kind = r8 ) w2
real ( kind = r8 ) w3
real ( kind = r8 ) w4
real ( kind = r8 ) x
real ( kind = r8 ) x1
real ( kind = r8 ) x2
real ( kind = r8 ) x3
real ( kind = r8 ) x4
real ( kind = r8 ) xstar

real ( kind = r8 ) :: qoh
real ( kind = r8 ) :: normal(4)

real ( kind = r8 ), parameter :: a21 = 2.71644396264860_r8 &
,a31 = - 6.95653259006152_r8 &
,a32 =   0.78313689457981_r8 &
,a41 =   0.0_r8 &
,a42 =   0.48257353309214_r8 &
,a43 =   0.26171080165848_r8 &
,a51 =   0.47012396888046_r8 &
,a52 =   0.36597075368373_r8 &
,a53 =   0.08906615686702_r8 &
,a54 =   0.07483912056879_r8

real ( kind = r8 ), parameter, dimension(4) :: qarray = [ 2.12709852335625_r8 &
    ,2.73245878238737_r8 &
    ,11.22760917474960_r8 &
    ,13.36199560336697_r8 ]

real ( kind = r8 ) :: warray(4)
integer (kind = 4) :: i

qoh = q / h
normal = gaussian(4)

do i =1,4
    warray(i) = normal(i)*sqrt(qarray(i)*qoh)
enddo

!t1 = t
x1 = x
k1 = h * ( fi ( x1 ) + gi ( x1 ) * warray(1) ) 

!t2 = t1 + a21 * h
x2 = x1 + a21 * k1
k2 = h * ( fi ( x2 ) + gi ( x2 ) * warray(2) )

!t3 = t1 + ( a31  + a32 )* h
x3 = x1 + a31 * k1 + a32 * k2
k3 = h * ( fi ( x3 ) + gi ( x3 ) * warray(3) )

!t4 = t1 + ( a41  + a42 + a43 ) * h
x4 = x1 + a41 * k1 + a42 * k2 + a43 * k3 ! the a43 * k3 seems need to be added.
k4 = h * ( fi ( x4 ) + gi ( x4 ) * warray(4) )

xstar = x1 + a51 * k1 + a52 * k2 + a53 * k3 + a54 * k4

return
end subroutine rk4_ti_step_mod

Because it is just everywhere
xi = xxx
ki = xxx

just becomes
xi(1) = xxx
xi(2) = xxx
ki(1) = xxx
ki(2) = xxx

Besides, for one SDE, the fi and gi are very simple too, there are actually even no operations,

function fi ( x )
use constants

!*****************************************************************************80
!
!! FI is a time invariant deterministic right hand side.
!
!  Licensing:
!
!    This code is distributed under the GNU LGPL license.
!
!  Modified:
!
!    20 June 2010
!
!  Author:
!
!    John Burkardt
!
!  Parameters:
!
!    Input, real ( kind = 8 ) X, the argument.
!
!    Output, real ( kind = 8 ) FI, the value.
!
  implicit none

  real ( kind = r8 ) fi
  real ( kind = r8 ) x

  fi = -x  !1.0_r8

  return
end function fi
function gi ( x )
use constants

!*****************************************************************************80
!
!! GI is a time invariant stochastic right hand side.
!
!  Licensing:
!
!    This code is distributed under the GNU LGPL license.
!
!  Modified:
!
!    20 June 2010
!
!  Author:
!
!    John Burkardt
!
!  Parameters:
!
!    Input, real ( kind = 8 ) X, the argument.
!
!    Output, real ( kind = 8 ) GI, the value.
!
  implicit none

  real ( kind = r8 ) gi
  real ( kind = r8 ) x

  gi = 1.0

  return
end function gi

Consider the vector version of fi and gi are just a tiny little bit expensive than scaler version, now solving the 2 SDE with vector functions fi and gi, there should be some like 2-3 or 4 factor slow down I expect. However I just get like 7-8 times slower than the scaler version, which seems a little weird.

compiler flag:

/nologo /debug:full /O3 /QxHost /assume:buffered_io /heap-arrays0 /Qipo /debug-parameters:all /module:“x64\Release\” /object:“x64\Release\” /Fd"x64\Release\vc150.pdb" /traceback /libs:static /threads /Qmkl:cluster /c

linker flag:

/OUT:“x64\Release\stochastic_RK.exe” /INCREMENTAL:NO /NOLOGO /MANIFEST /MANIFESTFILE:“x64\Release\stochastic_RK.exe.intermediate.manifest” /MANIFESTUAC:“level=‘asInvoker’ uiAccess=‘false’” /DEBUG /PDB:“D:\Works CHLA\stochastic_RK\stochastic_RK\stochastic_RK\x64\Release\stochastic_RK.pdb” /SUBSYSTEM:CONSOLE /LARGEADDRESSAWARE /IMPLIB:“D:\Works CHLA\stochastic_RK\stochastic_RK\stochastic_RK\x64\Release\stochastic_RK.lib”

Basically it is just -O3 -QxHost -heaparray
Other flags are not important.

I solve two equations, and solve one equations, I am using the same flags.

I simplified my cases to the following,

I solve one equation dy/dt = 1

I solve 100 the same equations, i from 1 to 100, dy_i/dt =1

Then I should expect it took 100 times longer than solving just one equation, right?

In my code I did things like that and I expect solving 2 the same equations should cost just 2 times than solving one such equation. But I got a factor of 7 - 8 or so.

So I think perhaps there are some unnecessary memory allocation or deallocation, or some temp arrays are created. But I have not identify where.

Do you think the following code may generate temp arrays for x(:,i-1) and x(:,i)?

do i = 1, n
t = t0 + (real(i)*(tn-t0))/real(n)
call rk4_ti_step_vec_mod1 ( x(:,i-1), t, h, q, fi_vec2, gi_vec2, nd, x(:,i) )
end do 

it is in the subroutine,

subroutine test02 ( xout )
use constants
use stochastic_rk
implicit none
interface				   !! Declare function        
function fi_vec2 ( x ) result (f)
use constants
real ( kind = r8 ) :: f(2)
real ( kind = r8 ) :: x(2)
end function
function gi_vec2 ( x ) result (g)
use constants
real ( kind = r8 ) :: g(2)
real ( kind = r8 ) :: x(2)
end function      
end interface 
real ( kind = r8 ), external :: fi_vec21, fi_vec22, gi_vec21, gi_vec22
integer ( kind = i8 ), parameter :: n = 1000
real ( kind = r8 ) h
integer ( kind = i8 ) :: i, itot, istart, istep
real ( kind = r8 ) q
integer ( kind = i8 ) :: seed
real ( kind = r8 ) t
real ( kind = r8 ), parameter :: t0 = 0.0
real ( kind = r8 ), parameter :: tn = 1.0
real ( kind = r8 ) :: x(2,0:n)
real (kind = r8) :: xout(5)
integer ( kind = i8 ) :: nd ! dimension, =2

h = ( tn - t0 ) / real(n)
q = 1.0
seed = 123456789

i = 0
t = t0
x(1,i) = 20.0_r8
x(2,i) = one
nd = 2

!write ( *, '(a)' ) ' '
!write ( *, '(a)' ) '         I           T             X'
!write ( *, '(a)' ) ' '
!write ( *, '(2x,i8,2x,f14.6,2x,g14.6)' ) i, t, x(i)

do i = 1, n
t = t0 + (real(i)*(tn-t0))/real(n)
call rk4_ti_step_vec_mod1 ( x(:,i-1), t, h, q, fi_vec2, gi_vec2, nd, x(:,i) )
end do

itot = i-1
istart = itot/5
istep = istart
xout(1:5:1) = x(1,istart:itot:istep)
!write ( *, '(2x,i8,2x,f14.6,2x,g14.6)' ) i-1, t, xout
return
end subroutine test02   

with functions,

function fi_vec2 ( x ) result (f)
use constants
implicit none
real ( kind = r8 ) :: f(2)
real ( kind = r8 ), intent(in) :: x(2)
  f(1) = -x(2)*x(1) 
  f(2) = -x(2) + one
return
end function fi_vec2
function gi_vec2 ( x ) result(g)
use constants
implicit none
real ( kind = r8 ), parameter :: gp(2) = [one,zero] 
real ( kind = r8 ), intent(in) :: x(2)
real ( kind = r8 ) :: g(2)
g=gp
return
end function gi_vec2

I did solve the problem by putting the RNG outside the loop, also do not heap arrays.
If I do not heap arrays, then solving 1 ODE cost 0.4s, while solving two cost 0.9s. So it is as expected. And the problem is solved.

However, if I heap-arrays, solving 1 ODE still cost 0.35s, but solving two cost 3s.
But I just wanted to know more why heap-arrays can cause a big difference, and how to prevent such performance problem.
A related heap-arrays topic is here,