OpenMP stochastic simulation

Looking at your initialization of arrays, it looks like you might be hitting a “first touch” policy problem which can possibly hinder performance. You might try to initialize your arrays such that each thread initializes a subsection… No guarantees thought but it might be worth taking a look at that.

1 Like

As a proof-of-concept you could use the simple random-number generator used here with OpenMP and if the results are encouraging look at some more complex PRNGs?

Using the supplied test cases as-is and then with the intrinsics would help confirm the root cause.
https://people.sc.fsu.edu/~jburkardt/f_src/random_openmp/random_openmp.f90

Are there MSWIndows tools to show the mutex calls? That might be useful. Even something like a MSWindows version of strace? Not a MSWindows user, so not sure what it has.

1 Like

Sorry, looking at the code within the forum is not that easy. I missed that the allocatable variable row_sum(:) is not used. But it is actually declared within the parallel region. So the deallocate is automatically invoked. However, I think for a non-allocated variable it should not have much impact (currently I cannot check in windows, the Linux perf tool does not show much interesting stuff, except that random_number eats up all the resources if used.)

1 Like

BTW, not having 64bit unsigned integer numbers with well defined overflow behaviour makes it difficult to implement well-established random number generators in fortran. There are a couple generators which support split and merge operations, which helps to avoid any syncronisation within a parallel region. gfortran implements the xoshiro256** generator and its “jump” function for splitting (see libgfortran/intrinsics/random.c of in its current source tree). But I do not see how that is actually used by the compiler in conjunction with openmp code.

1 Like

The only “first touch” that I can see out of the parallel region is sim_ind_a(1,:) = init_a, which is a very limited one. I doubt moving it to the parallel region make a significant difference.

However, this makes me think that most of the code is maybe memory bound, as it has to write a very large amound of data (12GB estimated with the parameters as posted). If so, most of code being memory bound and the call to RANDOM_NUMBER() being serialized by a mutex, there’s no chance to get a significant speed-up with multithreading.

1 Like

Thanks! I actually thought of replacing the intrinsic random_number with a simpler generator from the book “Modern Fortran explained 2018” (solution to exercise 6 chapter 9) and test this against the intrinsic.

On my PC (Ryzen5700X(8-core, max-freq set to 3.4GHz) + Ubuntu22), the timing with gfortran-12 -O3 -march=native -fopenmp is like this:

1-thread : stdout -> 11.6 sec  (time command -> 8.50user 4.55system)
2-thread : stdout ->  6.8 sec  (time command -> 8.57user 4.58system)
4-thread : stdout ->  4.2 sec  (time command -> 8.63user 4.54system)
8-thread : stdout ->  3.0 sec  (time command -> 8.72user 4.64system)

If I replace random_number() with u = 0.5d0, the result is:

1-thread : stdout -> 6.9 sec  (time command -> 3.74user 4.57system)
2-thread : stdout -> 4.4 sec  (time command -> 3.83user 4.58system)
4-thread : stdout -> 3.1 sec  (time command -> 3.81user 4.64system)
8-thread : stdout -> 2.4 sec  (time command -> 3.85user 4.84system)

For serial runs, my PC is ~ 2 times slower than your laptop (= Core i7 12800H, max 4.8 GHz?, the memory (DDR4 vs DDR5?) may also be different):
https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-12800H&id=4778

But with threading the program becomes faster to some extent, although the scaling is not very good (probably because memory bound? (as mentioned above))

These pages may also be related:

Though this is not about ifort + Windows, similar things might be happening…? (But it also seems (to me) that the speed-down is too large for the above laptop.)

1 Like

Note that the consumption part is not parallelised and takes a significant amount of runtime. But at least it should not get slower the more threads are used.

1 Like

Thanks for pointing this out. Yes, I should move the consumption part inside the parallel region

Unless I missed something in his post, it seems to me that @septc used only gfortran for tests. It seems that in contrast to ifort, RANDOM_NUMBER() has no mutex in gfortran.

1 Like

You’re right, I had missed it

I installed gfortran and I ran a speed comparison between ifort and gfortran, serial vs parallel code. The results confirm what others have pointed out. Here is a table with running time in seconds

Num cores Time gfortran Time ifort
OS Windows 11 Windows 11
version GNU Fortran (GCC) 12.3.0 Version 2021.11.0
flags -O3 -march=native -fopenmp
-O3 -Qipo -QxHost -Qopemp
serial 0 7.2 4.88
parallel 16 1.016 50.82

I’m quite happy about the speed gain with gfortran (~7 times). Instead, ifort is faster in serial but the performance on multiple cores is abysmal. I think for simulations I’ll go with gfortran (given that I’d like to use the intrinsic random number subroutine). As I side note, this small example is yet another instance that shows how gfortran (an open-source compiler) manages to stay competitive

P.S. If someone can suggest some compiler flag I could use with gfortran to further improve speed (esp. in serial)… thanks :slight_smile:

I report below a slightly updated version of the MWE

module mymod
    implicit none
    
    private
    public :: linspace, markov_simul, myerror
    
    contains
    
    subroutine myerror(string)

    implicit none
    character(len=*), intent(in) :: string
    write(*,*) "ERROR: ", string
    write(*,*) "Program will terminate.."
    stop

    end subroutine myerror

    function linspace(a, b, num) result(array)
        ! Input parameters
        real(8), intent(in) :: a   ! The starting value
        real(8), intent(in) :: b    ! The stopping value
        integer, intent(in) :: num     ! The number of points

        ! Output array
        real(8) :: array(num)

        ! Local variables
        integer :: i
        real(8) :: step

        ! Error handling for num <= 1
        if (num == 1) then
            array(1) = a
            return
        endif

        ! Calculate the step size
        step = (b - a) / real(num - 1, 8)

        ! Populate the array
        do i = 1, num
            array(i) = a + (i - 1) * step
        enddo
    end function linspace
    
    function simul_iid(z_prob,dbg) result(ind_sim)
    implicit none
    
    !##### DESCRIPTION #############
    !Simulate a random draw from a discretized distribution 
    ! Note: this is fast if the size of z_prob is small
    ! When z_prob is large (say, more than 100 elements), sample_pmf is faster
    
    !##### INPUTS/OUTPUTS #############
    real(8), intent(in) :: z_prob(:) !discretized prob vector, must be >=0 and sum to 1
    integer, intent(in) :: dbg       !0-1 flag
    integer :: ind_sim               !index for the draw
    
    !##### LOCALS #############
    integer :: n, i
    real(8) :: u ! random number (a scalar)
    real(8) :: check, prob_sum
    
    !##### ROUTINE CODE #############
    n = size(z_prob)
    
    if (dbg==1) then
        if ( any(z_prob<0d0) .or. any(z_prob>1d0) ) then
            error stop "simul_iid: 0<=prob<=1 violated"
        endif
        check = sum(z_prob)
        if (abs(check-1d0)>1d-6) then
            error stop "simul_iid: prob does not sum to one"
        endif
    endif
    
    call RANDOM_NUMBER(u)
    !u = 0.5d0

    do i = 1,n-1
        !Cumulative sum p1+p2+..+pi
        prob_sum = sum(z_prob(1:i))
        if (u <= prob_sum ) then
            ind_sim = i
            return
        endif !END IF
    enddo !end for

    ! Else, choose the last value
    ind_sim = n
    
    end function simul_iid

    subroutine markov_simul(ysim,Ptran,Tsim,y1,dbg_in)
    implicit none
    
    !##### DESCRIPTION #############
    ! Routine for simulating a Markov chain with time-independent
    ! transition matrix (standard case). To increase speed, set
    ! the debug flag to zero.
    ! This subroutine calls the function simulate_iid.
    
     !##### INPUTS/OUTPUTS #############
    real(8), intent(in)  :: Ptran(:,:) ! Transition matrix, dim: (ns,ns)
    integer, intent(in)  :: Tsim       ! Lenght of the time series to simulate
    integer, intent(in)  :: y1         ! Initial condition 
    integer, intent(out) :: ysim(:)    ! Simulated time series s.t. y_sim(1)=y1, dim: (Tsim)
    integer, intent(in),optional :: dbg_in ! 0-1 debug flag
    
    !##### LOCALS #############
    integer :: n, i, ns, t, dbg
    real(8) :: check, prob_sum
    
    !##### ROUTINE CODE #############
    if (present(dbg_in)) then
        dbg = dbg_in
    else
        dbg = 0
    endif
    
    ns = size(Ptran,dim=1)
    
    ! Check if inputs are correct (only if debug flag = 1)
    if (dbg==1) then
        if (size(Ptran,1)/=size(Ptran,2)) then
            call myerror("markov_simul: Ptran is not square")
        endif
        if ( any(Ptran<0d0) .or. any(Ptran>1d0) ) then
            call myerror("markov_simul: 0<=prob<=1 violated")
        endif
        if (Tsim < 2) then
            call myerror("markov_simul: Tsim must be at least 2")
        endif
        if ( y1<1 .or. y1>ns) then
            call myerror("markov_simul: initial condition is out of bounds")
        endif
        if ( size(ysim)/=Tsim ) then
            call myerror("markov_simul: ysim has wrong size")
        endif
    endif
    
    ! Simulate series of length Tsim
    ysim(1) = y1
    
    do t = 1,Tsim-1
        ysim(t+1) = simul_iid(Ptran(ysim(t),:),dbg)    
    enddo
    
    
end subroutine markov_simul

end module mymod
!================================================================================================
program main
    ! THIS CODE USES OPENMP
    use mymod, only: linspace, markov_simul 
    use omp_lib 

    implicit none
    
    ! ------------------ Declarations ------------------
    integer :: par_fortran
    integer, parameter :: nthreads = 16
    integer, parameter :: N_sim = 14000 
    integer, parameter :: T_sim = 30000
    integer, parameter :: n_a=500, n_z=2
    real(8) :: t1,t2
    real(8) :: Pi(2,2), z_grid(2), b, grid_max, r, w
    integer, allocatable :: pol_ap_ind(:,:), sim_ind_a(:,:), sim_ind_z(:,:)
    real(8), allocatable :: a_grid(:), sim_a(:,:), sim_c(:,:)
    ! Variables for the simulation
    integer :: hh_c, t, init_a, init_z
    ! ------------------ End declarations ------------------
    
    write(*,*) "Starting program..."
    
    write(*,*) "max # of threads = ",omp_get_max_threads()
    CALL OMP_SET_NUM_THREADS(nthreads)
    !$omp parallel if (par_fortran==1)
    write(*,*) "hello from thread ",omp_get_thread_num() !," of ",nthreads
    !$omp end parallel 
    
    write(*,*) "Type 1 to run in parallel, 0 to run in serial"
    read(*,*) par_fortran

    ! ---------------------- Set economic parameter values ----------------------!
    Pi(1,:)  = [0.6d0, 0.4d0]
    Pi(2,:)  = [.05d0, 0.95d0]
    z_grid   = [0.5d0, 1.0d0]
    r        = 0.04d0
    w        = 1.0d0
    b        = 0.0d0
    grid_max = 4.0d0
    
    ! Set grid for assets
    allocate(a_grid(n_a))
    a_grid = linspace(b,grid_max,n_a)
    
    ! Set policy function for assets (index) 
    allocate(pol_ap_ind(n_a,n_z))
    pol_ap_ind(:,1) = n_a/5 ! Low shock
    pol_ap_ind(:,2) = n_a/2 ! High shock
    
    write(*,*) "============================================================="
    write(*,*) "Starting SIMULATION..."
    write(*,*) "============================================================="
    
    t1 = omp_get_wtime()
    
    allocate(sim_a(T_sim+1,N_sim),sim_c(T_sim,N_sim),sim_ind_a(T_sim+1,N_sim), sim_ind_z(T_sim,N_sim))
    
    ! Initial conditions for state variables "a" and "z"
    init_a = 1
    init_z = n_z/2
    ! Recall: first index is time, second index is individual
    sim_ind_a(1,:) = init_a
    
    ! Loop over time periods to simulate assets
    !t1 = omp_get_wtime()
    
    ! Outer loop over individuals
    !$omp parallel if (par_fortran==1) default(shared) private(hh_c,t)
    !$omp do schedule(dynamic)
    do hh_c=1,N_sim
        !write(*,*) "Doing hh_c = ", hh_c
        ! Simulate a sequence of shocks (length Tsim) for individual hh_c
        call markov_simul(sim_ind_z(:,hh_c),Pi,T_sim,init_z)
        ! Inner loop over time periods
        do t=2,T_sim+1       
            ! Update assets
            sim_ind_a(t,hh_c) = pol_ap_ind(sim_ind_a(t-1,hh_c),sim_ind_z(t-1,hh_c))
            sim_a(t,hh_c)     = a_grid(sim_ind_a(t,hh_c))
        enddo  
        ! Update consumption
        do t=1,T_sim
            sim_c(t,hh_c) = (1d0+r)*sim_a(t,hh_c)+w*z_grid(sim_ind_z(t,hh_c))-sim_a(t+1,hh_c)
        enddo  
    enddo
    !$omp enddo 
    !$omp end parallel

    t2 = omp_get_wtime()
    
    write(*,*) "============================================================="
    write(*,*) 'SIMULATION runs for ',t2-t1,' seconds.'
    write(*,*) "============================================================="
    
    write(*,*) "============================================================="
    write(*,*) 'MODEL MOMENTS'
    write(*,'(X,A,F10.4)') "Average assets, simulation: ", sum(sim_a)/size(sim_a)
    write(*,'(X,A,F10.4)') "Average consumption, simulation: ", sum(sim_c)/size(sim_c)
    write(*,*) 'EXAMPLE SIMULATED SERIES FOR 10 INDIVIDUALS at t=1000'
    write(*,'(X,A)') "hh_c, sim_a, sim_c"
    do hh_c=1,10
        write(*,'(I5,10F10.4)') hh_c, sim_a(1000,hh_c), sim_c(1000,hh_c)
    enddo
    
    deallocate(a_grid,pol_ap_ind,sim_a,sim_c,sim_ind_z)
    
    write(*,*) "Program terminated OK!" 
    
end program main
2 Likes