OpenMP stochastic simulation

I’m trying to parallelize a panel simulation using OpenMP, but the serial code is much faster than the parallel code. Not sure what I’m doing wrong.

The key part of the code is the following:

! Outer loop over individuals
!$omp parallel if (par_fortran==1) default(shared) private(hh_c,t)
!$omp do 
do hh_c=1,N_sim
    write(*,*) "Doing hh_c = ", hh_c
    ! Simulate a sequence of shocks (length Tsim) for individual hh_c
    call markov_simul(sim_ind_z(:,hh_c),Pi,T_sim,init_z)
    ! Inner loop over time periods
    do t=2,T_sim+1       
        ! Update assets
        sim_ind_a(t,hh_c) = pol_ap_ind(sim_ind_a(t-1,hh_c),sim_ind_z(t-1,hh_c))
        sim_a(t,hh_c)     = a_grid(sim_ind_a(t,hh_c))
    enddo    
enddo
!$omp enddo 
!$omp end parallel

There are two nested loops: the outer loop is over individuals and can be parallelized; the inner loop is over time periods. For each individual, I simulate a long history of assets and shocks. The simulation of the shocks is done by the subroutine markov_simul contained in module mymod.

It turns out that the parallel code (which is activated when I set par_fortran=1) is much slower than the serial code.
Output of serial code:

 Starting program...
 max # of threads =           20
 hello from thread            0
 =============================================================
 Starting SIMULATION...
 =============================================================
 Doing hh_c =            1
 Doing hh_c =            2
 Doing hh_c =            3
 Doing hh_c =            4
 Doing hh_c =            5
 Doing hh_c =            6
 Doing hh_c =            7
 Doing hh_c =            8
 Doing hh_c =            9
 Doing hh_c =           10
 Doing hh_c =           11
 Doing hh_c =           12
 Doing hh_c =           13
 Doing hh_c =           14
 Simulation assets done, now consumption...
 =============================================================
 SIMULATION runs for    5.00110049999967       seconds.
 =============================================================
 =============================================================
 MODEL MOMENTS
 Average assets, simulation:     1.8624
 Average consumption, simulation:     1.0189
 EXAMPLE SIMULATED SERIES FOR 10 INDIVIDUALS at t=1000
 hh_c, sim_a, sim_c
    1    1.9960    1.0798
    2    1.9960    1.7822
    3    1.9960    1.0798
    4    1.9960    1.0798
    5    1.9960    1.0798
    6    0.7936   -0.1707
    7    1.9960    1.0798
    8    1.9960    1.0798
    9    1.9960    1.0798
   10    0.7936    0.5317
 Program terminated OK!

Output of parallel code:

Starting program...
 max # of threads =           20
 hello from thread            0
 hello from thread            1
 hello from thread            4
 hello from thread            6
 hello from thread            7
 hello from thread            8
 hello from thread            3
 hello from thread            5
 hello from thread            9
 hello from thread           10
 hello from thread           11
 hello from thread           12
 hello from thread           13
 hello from thread            2
 =============================================================
 Starting SIMULATION...
 =============================================================
 Doing hh_c =            1
 Doing hh_c =            9
 Doing hh_c =            4
 Doing hh_c =            3
 Doing hh_c =            8
 Doing hh_c =           13
 Doing hh_c =           11
 Doing hh_c =           12
 Doing hh_c =           14
 Doing hh_c =            5
 Doing hh_c =            7
 Doing hh_c =           10
 Doing hh_c =            6
 Doing hh_c =            2
 Simulation assets done, now consumption...
 =============================================================
 SIMULATION runs for    134.736892299999       seconds.
 =============================================================
 =============================================================
 MODEL MOMENTS
 Average assets, simulation:     1.8623
 Average consumption, simulation:     1.0189
 EXAMPLE SIMULATED SERIES FOR 10 INDIVIDUALS at t=1000
 hh_c, sim_a, sim_c
    1    1.9960    1.0798
    2    1.9960    1.0798
    3    1.9960    1.0798
    4    0.7936   -0.1707
    5    0.7936   -0.1707
    6    1.9960    1.0798
    7    1.9960    1.0798
    8    1.9960    1.0798
    9    0.7936   -0.1707
   10    0.7936    0.5317
 Program terminated OK!

For those interested, I post below the entire code, compiled with ifort /O3 /Qipo /Qopenmp

module mymod
    implicit none
    
    private
    public :: linspace, markov_simul, myerror
    
    contains
    
    subroutine myerror(string)

    implicit none
    character(len=*), intent(in) :: string
    write(*,*) "ERROR: ", string
    write(*,*) "Program will terminate.."
    pause
    stop

    end subroutine myerror

    function linspace(a, b, num) result(array)
        ! Input parameters
        real(8), intent(in) :: a   ! The starting value
        real(8), intent(in) :: b    ! The stopping value
        integer, intent(in) :: num     ! The number of points

        ! Output array
        real(8) :: array(num)

        ! Local variables
        integer :: i
        real(8) :: step

        ! Special case: num == 1 returns a single point
        if (num == 1) then
            array(1) = a
            return
        endif

        ! Calculate the step size
        step = (b - a) / real(num - 1, 8)

        ! Populate the array
        do i = 1, num
            array(i) = a + (i - 1) * step
        enddo
    end function linspace
    
    function simul_iid(z_prob,dbg) result(ind_sim)
    implicit none
    
    !##### DESCRIPTION #############
    !Simulate a random draw from a discretized distribution 
    ! Note: this is fast if the size of z_prob is small
    ! When z_prob is large (say, more than 100 elements), sample_pmf is faster
    
    !##### INPUTS/OUTPUTS #############
    real(8), intent(in) :: z_prob(:) !discretized prob vector, must be >=0 and sum to 1
    integer, intent(in) :: dbg       !0-1 flag
    integer :: ind_sim               !index for the draw
    
    !##### LOCALS #############
    integer :: n, i
    real(8) :: u ! random number (a scalar)
    real(8) :: check, prob_sum
    
    !##### ROUTINE CODE #############
    n = size(z_prob)
    
    if (dbg==1) then
        if ( any(z_prob<0d0) .or. any(z_prob>1d0) ) then
            error stop "simul_iid: 0<=prob<=1 violated"
        endif
        check = sum(z_prob)
        if (abs(check-1d0)>1d-6) then
            error stop "simul_iid: prob does not sum to one"
        endif
    endif
    
    call RANDOM_NUMBER(u)

    do i = 1,n-1
        !Cumulative sum p1+p2+..+pi
        prob_sum = sum(z_prob(1:i))
        if (u <= prob_sum ) then
            ind_sim = i
            return
        endif !END IF
    enddo !end for

    ! Else, choose the last value
    ind_sim = n
    
    end function simul_iid

    subroutine markov_simul(ysim,Ptran,Tsim,y1,dbg_in)
    implicit none
    
    !##### DESCRIPTION #############
    ! Routine for simulating a Markov chain with time-independent
    ! transition matrix (standard case). To increase speed, set
    ! the debug flag to zero.
    ! This subroutine calls the function simul_iid.
    
     !##### INPUTS/OUTPUTS #############
    real(8), intent(in)  :: Ptran(:,:) ! Transition matrix, dim: (ns,ns)
    integer, intent(in)  :: Tsim       ! Length of the time series to simulate
    integer, intent(in)  :: y1         ! Initial condition 
    integer, intent(out) :: ysim(:)    ! Simulated time series s.t. y_sim(1)=y1, dim: (Tsim)
    integer, intent(in),optional :: dbg_in ! 0-1 debug flag
    
    !##### LOCALS #############
    integer :: ns, t, dbg
    real(8), allocatable :: row_sum(:)
    
    !##### ROUTINE CODE #############
    if (present(dbg_in)) then
        dbg = dbg_in
    else
        dbg = 0
    endif
    
    ns = size(Ptran,dim=1)
    
    ! Check if inputs are correct (only if debug flag = 1)
    if (dbg==1) then
        if (size(Ptran,1)/=size(Ptran,2)) then
            call myerror("markov_simul: Ptran is not square")
        endif
        if ( any(Ptran<0d0) .or. any(Ptran>1d0) ) then
            call myerror("markov_simul: 0<=prob<=1 violated")
        endif
        row_sum = sum(Ptran,dim=2)
        if (maxval(abs(row_sum-1d0)) > 1d-6) then
            call myerror("markov_simul: Rows of Ptran do not sum to one")
        endif
        if (Tsim < 2) then
            call myerror("markov_simul: Tsim must be at least 2")
        endif
        if ( y1<1 .or. y1>ns) then
            call myerror("markov_simul: initial condition is out of bounds")
        endif
        if ( size(ysim)/=Tsim ) then
            call myerror("markov_simul: ysim has wrong size")
        endif
    endif
    
    ! Simulate series of length Tsim
    ysim(1) = y1
    
    do t = 1,Tsim-1
        ysim(t+1) = simul_iid(Ptran(ysim(t),:),dbg)    
    enddo
    
    
    end subroutine markov_simul

end module mymod
!================================================================================================
program main
    ! THIS CODE USES OPENMP
    use mymod, only: linspace, markov_simul 
    use omp_lib 

    implicit none
    
    ! ------------------ Declarations ------------------
    integer, parameter :: par_fortran = 1
    integer, parameter :: nthreads = 14
    integer, parameter :: N_sim = 14 
    integer, parameter :: T_sim = 30000000
    integer, parameter :: n_a=500, n_z=2
    real(8) :: t1,t2
    real(8) :: Pi(2,2), z_grid(2), b, grid_max, r, w
    integer, allocatable :: pol_ap_ind(:,:), sim_ind_a(:,:), sim_ind_z(:,:)
    real(8), allocatable :: a_grid(:), sim_a(:,:), sim_c(:,:)
    ! Variables for the simulation
    integer :: hh_c, t, init_a, init_z
    ! ------------------ End declarations ------------------
    
    write(*,*) "Starting program..."
    
    write(*,*) "max # of threads = ",omp_get_max_threads()
    CALL OMP_SET_NUM_THREADS(nthreads)
    !$omp parallel if (par_fortran==1)
    write(*,*) "hello from thread ",omp_get_thread_num() !," of ",nthreads
    !$omp end parallel 
    
    ! ---------------------- Set economic parameter values ----------------------!
    Pi(1,:)  = [0.6d0, 0.4d0]
    Pi(2,:)  = [.05d0, 0.95d0]
    z_grid   = [0.5d0, 1.0d0]
    r        = 0.04d0
    w        = 1.0d0
    b        = 0.0d0
    grid_max = 4.0d0
    
    ! Set grid for assets
    allocate(a_grid(n_a))
    a_grid = linspace(b,grid_max,n_a)
    
    ! Set policy function for assets (index) 
    allocate(pol_ap_ind(n_a,n_z))
    pol_ap_ind(:,1) = n_a/5 ! Low shock
    pol_ap_ind(:,2) = n_a/2 ! High shock
    
    write(*,*) "============================================================="
    write(*,*) "Starting SIMULATION..."
    write(*,*) "============================================================="
    
    t1 = omp_get_wtime()
    
    allocate(sim_a(T_sim+1,N_sim),sim_c(T_sim,N_sim),sim_ind_a(T_sim+1,N_sim), sim_ind_z(T_sim,N_sim))
    
    ! Initial conditions for state variables "a" and "z"
    init_a = 1
    init_z = n_z/2
    ! Recall: first index is time, second index is individual
    sim_ind_a(1,:) = init_a
    
    ! Loop over time periods to simulate assets
    !t1 = omp_get_wtime()
    
    ! Outer loop over individuals
    !$omp parallel if (par_fortran==1) default(shared) private(hh_c,t)
    !$omp do 
    do hh_c=1,N_sim
        write(*,*) "Doing hh_c = ", hh_c
        ! Simulate a sequence of shocks (length Tsim) for individual hh_c
        call markov_simul(sim_ind_z(:,hh_c),Pi,T_sim,init_z)
        ! Inner loop over time periods
        do t=2,T_sim+1       
            ! Update assets
            sim_ind_a(t,hh_c) = pol_ap_ind(sim_ind_a(t-1,hh_c),sim_ind_z(t-1,hh_c))
            sim_a(t,hh_c)     = a_grid(sim_ind_a(t,hh_c))
        enddo    
    enddo
    !$omp enddo 
    !$omp end parallel
    !t2 = omp_get_wtime()
    
    !write(*,*) 'Time for simulating asset: ', t2-t1
    write(*,*) 'Simulation assets done, now consumption...'
    !pause
    
    ! Update consumption
    do hh_c=1,N_sim
        do t=1,T_sim
            sim_c(t,hh_c) = (1d0+r)*sim_a(t,hh_c)+w*z_grid(sim_ind_z(t,hh_c))-sim_a(t+1,hh_c)
        enddo
    enddo
    
    t2 = omp_get_wtime()
    
    write(*,*) "============================================================="
    write(*,*) 'SIMULATION runs for ',t2-t1,' seconds.'
    write(*,*) "============================================================="
    
    write(*,*) "============================================================="
    write(*,*) 'MODEL MOMENTS'
    write(*,'(X,A,F10.4)') "Average assets, simulation: ", sum(sim_a)/size(sim_a)
    write(*,'(X,A,F10.4)') "Average consumption, simulation: ", sum(sim_c)/size(sim_c)
    write(*,*) 'EXAMPLE SIMULATED SERIES FOR 10 INDIVIDUALS at t=1000'
    write(*,'(X,A)') "hh_c, sim_a, sim_c"
    do hh_c=1,10
        write(*,'(I5,10F10.4)') hh_c, sim_a(1000,hh_c), sim_c(1000,hh_c)
    enddo
    
    deallocate(a_grid,pol_ap_ind,sim_a,sim_c,sim_ind_a,sim_ind_z)
    
    write(*,*) "Program terminated OK!" 
    
end program main

Any help is greatly appreciated!

Have you tried adding schedule(dynamic) to the omp do directive?

Edit: at some point I wrote a parallel version of the “Tour de Wino” game, a Monte Carlo based PDE-solving algorithm. A description can be found in the book by Stanley J. Farlow, Partial Differential Equations for Scientists and Engineers. Using schedule(dynamic) it scaled reasonably well. Full code in the drop-down box below:

Tour de Wino
module tour_de_wino

   implicit none

   integer, parameter :: wp = kind(1.0e0)

   !> Boundary cache object, used to store the results a single Wino
   !> walking across the city grid
   type :: boundary_cache
      
      !> Grid size
      integer :: nx, ny
      
      !> Rewards given at the boundaries
      real(wp) :: north, south
      real(wp) :: east, west 

      !> Boundary arrays (used to store how many times Wino hits a boundary)
      integer, allocatable :: n_(:), &
                              s_(:), &
                              e_(:), &
                              w_(:)
   end type

contains

   !> Initialize a boundary cache object
   subroutine init_bc(nx,ny,bc)
      integer, intent(in) :: nx, ny
      type(boundary_cache), intent(out) :: bc

      bc%nx = nx
      bc%ny = ny

      ! note the clever use of custom lower bounds
      allocate(bc%n_(2:nx-1),bc%s_(2:nx-1))
      allocate(bc%e_(2:ny-1),bc%w_(2:ny-1))

      bc%n_ = 0
      bc%s_ = 0
      bc%e_ = 0
      bc%w_ = 0

   end subroutine

   !> Set the values for the rewards Wino gets at each edge
   subroutine set_bc_values(bc,north,south,east,west)
      type(boundary_cache), intent(inout) :: bc
      real(wp), intent(in) :: north, south, east, west

      bc%north = north
      bc%south = south
      bc%east = east
      bc%west = west

   end subroutine

   !> Reset the result values; the reward values remain untouched
   subroutine reset_bc(bc)
      type(boundary_cache), intent(inout) :: bc

      bc%n_ = 0
      bc%s_ = 0
      bc%e_ = 0
      bc%w_ = 0
   end subroutine

   !> Perform a single random walk with Wino 
   subroutine perform_single_tour(i,j,bc)

      integer, value :: i, j
      type(boundary_cache), intent(inout) :: bc

      real(wp) :: r
      integer :: ir

      walk: do 

         call random_number(r)
         ir = int(r*4)

         select case(ir)
         case(0)
            i = i + 1
         case(1)
            j = j + 1
         case(2)
            i = i - 1
         case(3)
            j = j - 1
         end select

         if (i == 1) then
            ! WEST edge
            call increment(bc%w_(j))
            exit walk 
         else if (i == bc%nx) then
            ! EAST edge 
            call increment(bc%e_(j))
            exit walk 
         end if

         if (j == 1) then
            ! SOUTH edge 
            call increment(bc%s_(i))
            exit walk 
         else if (j == bc%ny) then
            ! NORTH edge
            call increment(bc%n_(i))
            exit walk
         end if

      end do walk

   contains

      !> Keep track of Wino arriving
      subroutine increment(i)
         integer, intent(inout) :: i
         !$omp atomic
         i = i + 1
         !$omp end atomic
      end subroutine

   end subroutine

   !> Perform multiple Wino tours (pub crawl)
   !>   could be done in parallel, but we should keep a separate tab / cache
   !>   for each Wino
   subroutine perform_n_tours(n,sx,sy,bc,result)
      integer, intent(in) :: n
      integer, intent(in) :: sx, sy
      type(boundary_cache), intent(inout) :: bc
      real(wp), intent(out) :: result

      integer :: k

      !$omp parallel do default(private) shared(n,sx,sy,bc) schedule(dynamic)
      do k = 1, n
         call perform_single_tour(sx,sy,bc)
      end do
      !$omp end parallel do

      result = 0.0_wp

      ! reduce rewards at boundaries 
      ! in current form only works for constant boundaries
      result = result + bc%north * sum(bc%n_)/real(n,wp)
      result = result + bc%south * sum(bc%s_)/real(n,wp)
      result = result + bc%east  * sum(bc%e_)/real(n,wp)
      result = result + bc%west  * sum(bc%w_)/real(n,wp)

      call reset_bc(bc)

   end subroutine

end module


program wino

   use tour_de_wino
   implicit none

   integer, parameter :: nx = 41
   integer, parameter :: ny = 41

   integer, parameter :: niter = 100

   type(boundary_cache) :: cache
   real(wp) :: result

   real(wp), allocatable :: a(:,:)
   integer :: i, j, funit

   ! Field storing how drunk the Winos are in each
   ! neighborhood of the city
   allocate(a(nx,ny))
   a = 0.0_wp

   ! Initialize the tab to keep track of the Winos
   call init_bc(nx,ny,cache)
   call set_bc_values(cache, &
      north=1.0_wp, &
      south=0.0_wp, &
      east=0.0_wp, &
      west=0.0_wp)

   ! For each neighborhood (i,j) ...
   do i = 2, nx - 1
      do j = 2, ny - 1

         call perform_n_tours(niter,i,j,cache,result)
         a(i,j) = result

      end do 
   end do

   ! Set the north boundary for printing
   do i = 2, nx-1
      a(i,ny) = 1.0_wp
   end do

   ! Output results
   open(newunit=funit,file="results.txt")
   do i = 1, nx
      do j = 1, ny
         write(funit,*) i-1, j-1, a(i,j)
      end do
      write(funit,*)
   end do
   close(funit)

end program

The resulting heat-map from the random walk processes:



Thanks! I tried it now, but the run time is even higher:

!$omp do schedule(dynamic)

 Starting program...
 max # of threads =           20
 hello from thread            0
 hello from thread            2
 hello from thread            1
 hello from thread            4
 hello from thread            5
 hello from thread            6
 hello from thread            7
 hello from thread            8
 hello from thread            9
 hello from thread           10
 hello from thread           11
 hello from thread           12
 hello from thread           13
 hello from thread            3
 =============================================================
 Starting SIMULATION...
 =============================================================
 Doing hh_c =           13
 Doing hh_c =           14
 Doing hh_c =            3
 Doing hh_c =           10
 Doing hh_c =            4
 Doing hh_c =            2
 Doing hh_c =           11
 Doing hh_c =            8
 Doing hh_c =            5
 Doing hh_c =            6
 Doing hh_c =           12
 Doing hh_c =            9
 Doing hh_c =            7
 Doing hh_c =            1
 Simulation assets done, now consumption...
 =============================================================
 SIMULATION runs for    183.720846200000       seconds.
 =============================================================
 =============================================================
 MODEL MOMENTS
 Average assets, simulation:     1.8624
 Average consumption, simulation:     1.0189
 EXAMPLE SIMULATED SERIES FOR 10 INDIVIDUALS at t=1000
 hh_c, sim_a, sim_c
    1    1.9960    1.0798
    2    1.9960    1.0798
    3    0.7936    0.5317
    4    1.9960    1.0798
    5    1.9960    1.0798
    6    1.9960    1.0798
    7    1.9960    1.0798
    8    1.9960    1.0798
    9    1.9960    1.0798
   10    1.9960    1.0798
 Program terminated OK!

Which compiler/version are you using?


I’m using ifort with Intel oneapi 2022 in Visual studio. The command line (from Visual studio) is

/nologo /O3 /Qipo /Qopenmp /module:"x64\Release\" /object:"x64\Release\" /Fd"x64\Release\vc170.pdb" /libs:dll /threads /c

But I’ve tried with ifx as well and I didn’t see any significant difference. It would be great if someone could test my code with another compiler (maybe gfortran), just to compare.

Your sim_xxx(:,:) arrays have ~500×10^6 elements each, which makes roughly 12 GB: do you have enough RAM on your PC?


I have 32 GB of RAM on my laptop


EDIT
I checked on the task manager and during the parallel simulation my laptop is using 55% of memory and 95-100% of CPU.
I will try to make the arrays smaller to see if the relative performance parallel/serial improves

It should do, unless you have a lot of other applications running…

Maybe the call to RANDOM_NUMBER in the parallel section is a problem. Depending on the processor, I think the calls to random number generators are serialized.
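
A quick way to test that hypothesis is to time a tight loop of RANDOM_NUMBER calls in serial and in parallel. A minimal, untested sketch (ndraws and all names are arbitrary; the running sum just keeps the loops from being optimized away):

program rng_timing
    use omp_lib
    implicit none
    integer, parameter :: ndraws = 100000000
    integer :: i
    real(8) :: u, s, t1, t2

    ! Serial baseline
    s = 0d0
    t1 = omp_get_wtime()
    do i = 1,ndraws
        call random_number(u)
        s = s + u
    enddo
    t2 = omp_get_wtime()
    write(*,*) "serial:   ", t2-t1, " seconds"

    ! Same total work split across threads. If RANDOM_NUMBER is
    ! internally protected by a mutex, this will be no faster
    ! (possibly much slower) than the serial loop
    s = 0d0
    t1 = omp_get_wtime()
    !$omp parallel do private(u) reduction(+:s)
    do i = 1,ndraws
        call random_number(u)
        s = s + u
    enddo
    !$omp end parallel do
    t2 = omp_get_wtime()
    write(*,*) "parallel: ", t2-t1, " seconds"
end program rng_timing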


Would it make sense to use another random number function? I used the intrinsic RANDOM_NUMBER because it’s handy, but I could use something from another library.

OK, I think I got something useful.
I replaced the line

call RANDOM_NUMBER(u)

with

u = 0.5d0

and now the parallel version is about 3 times faster than the serial version. So it seems that OpenMP does not like call RANDOM_NUMBER(u) (inside a subroutine which is inside a parallel region). While this trick was useful to isolate the problem, I do need to draw random numbers, so I’m not sure how to proceed.

EDIT

I summarize my experiments in the following table

Parameters           Serial      Parallel    Serial      Parallel
N_sim                14000       14000       14000       14000
T_sim                30000       30000       30000       30000
par_fortran          0           1           0           1
omp dynamic          YES         YES         YES         YES
RANDOM_NUMBER call   NO          NO          YES         YES
Time (seconds)       3.3463044   1.0816885   5.57        122.11384

I don’t know how RANDOM_NUMBER() actually works behind the scenes, but if someone has some insight into that I’d be interested, since I do stochastic simulations as well.

However, if having it inside the parallel region is a problem, what about “precomputing” the draws before entering the parallel region and storing them in an array?
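
For what it’s worth, a sketch of that idea against the code above. It assumes that simul_iid consumes exactly one uniform draw per period (true in the posted code), and markov_simul_pre is a hypothetical variant of markov_simul that takes the pre-drawn numbers as an argument instead of calling RANDOM_NUMBER internally:

    ! Draw all the shocks once, serially, before the parallel region
    real(8), allocatable :: u_all(:,:)
    allocate(u_all(T_sim,N_sim))
    call random_number(u_all)

    !$omp parallel if (par_fortran==1) default(shared) private(hh_c)
    !$omp do
    do hh_c=1,N_sim
        ! hypothetical variant that consumes u_all(:,hh_c)
        call markov_simul_pre(sim_ind_z(:,hh_c),Pi,T_sim,init_z,u_all(:,hh_c))
    enddo
    !$omp enddo
    !$omp end parallel

The price is the extra memory for u_all (N_sim*T_sim doubles), which may or may not be affordable.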


Sure, but in the above case one needs N_sim*T_sim random numbers, which is quite a large array in itself. If one can afford it, it’s OK. The other obvious solution would be to enclose the calls to RANDOM_NUMBER() in an !$OMP CRITICAL region, but then the performance is terrible.

A mid-term solution is to store some random numbers in a relatively small array that is passed to the routines, and renew the content of the array when the values are consumed. This dramatically reduces the number of individual calls to RANDOM_NUMBER() and the associated overhead of the critical region. I got some speed-up with this approach in the code below, but a limited one. The problem IMO is that the calls to RANDOM_NUMBER() are costly overall in this algorithm and that they have to be serialized anyway.

The real solution would be to find a true parallel and thread-safe random number generator library.


    
    function simul_iid(z_prob,rn,rni,dbg) result(ind_sim)
    ...
    real(8), intent(inout) :: rn(:)
    integer, intent(inout) :: rni
    ...
    !call RANDOM_NUMBER(u)
    if (rni == size(rn)) then
       !$OMP CRITICAL
       call random_number( rn(:) )
       !$OMP END CRITICAL
       rni = 0
    end if
    rni = rni + 1
    u = rn(rni)
    ...
    end function simul_iid



    subroutine markov_simul(ysim,Ptran,Tsim,y1,rn,rni,dbg_in)
    ...
    real(8), intent(inout) :: rn(:)
    integer, intent(inout) :: rni
    ...    
    do t = 1,Tsim-1
        ysim(t+1) = simul_iid(Ptran(ysim(t),:),rn(:),rni,dbg)    
    enddo
   
   end subroutine markov_simul



program main
    ...
    ! for random numbers
    real(8) :: rn(10000,nthreads)
    integer :: rni(nthreads) = 10000
    integer :: it
    ...
    !$omp parallel if (par_fortran==1) default(shared) private(hh_c,t,it)
    !$omp do 
    do hh_c=1,N_sim
        it = omp_get_thread_num() + 1
        ...
        call markov_simul(sim_ind_z(:,hh_c),Pi,T_sim,init_z,rn(:,it),rni(it))
        ...    
    enddo
    !$omp enddo 
    !$omp end parallel
    ...    
end program main
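
To make the last point concrete, the simplest possible thread-private generator is something like the sketch below. Caveats: these are Knuth’s 64-bit LCG constants, but this is not a statistically vetted generator, and it relies on the usual two’s-complement wraparound on integer overflow, which mainstream compilers provide but the Fortran standard does not guarantee.

module thread_rng
    use iso_fortran_env, only: int64, real64
    implicit none
contains
    ! One LCG step per draw. The state is owned by the caller, so each
    ! thread keeps its own copy and no locking is needed.
    function lcg_uniform(state) result(u)
        integer(int64), intent(inout) :: state
        real(real64) :: u
        state = state*6364136223846793005_int64 + 1442695040888963407_int64
        ! map the top 53 bits of the state to [0,1)
        u = real(ishft(state,-11), real64) * 2.0_real64**(-53)
    end function lcg_uniform
end module thread_rng

Each thread would then hold a private state, seeded for example as state = 12345_int64 + 67890_int64*omp_get_thread_num() (an arbitrary scheme), and call u = lcg_uniform(state) in place of RANDOM_NUMBER: no shared state, no mutex.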

Here is a link that describes how to use the gfortran random_seed intrinsic within OpenMP threads: https://gcc.gnu.org/onlinedocs/gfortran/RANDOM_005fSEED.html
Presumably, other OpenMP-aware Fortran compilers have a similar capability.
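
Following that page, the per-thread seeding pattern would look roughly like this (a sketch: the seed values are an arbitrary choice, and the fact that RANDOM_SEED(PUT=...) inside the parallel region only affects the calling thread is gfortran behavior; other compilers would need checking):

program seed_per_thread
    use omp_lib
    implicit none
    integer, allocatable :: seed(:)
    integer :: n
    real(8) :: u

    !$omp parallel private(seed,n,u)
    call random_seed(size=n)
    allocate(seed(n))
    seed = 42 + omp_get_thread_num()   ! distinct (if simplistic) seed per thread
    call random_seed(put=seed)
    call random_number(u)              ! drawn from this thread's own stream
    write(*,*) "thread", omp_get_thread_num(), "drew", u
    !$omp end parallel
end program seed_per_thread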


The Intel Fortran manual states the routine is thread-safe: https://www.intel.com/content/www/us/en/docs/fortran-compiler/developer-guide-reference/2024-2/random-number.html


How accurate does RANDOM_NUMBER have to be in this stochastic simulation?

You could have a large list of random values (say n = 10^6) and start each thread from a different place in the list, say ithread*n/nthread, then cycle each thread sequentially through this random list, keeping a private list pointer for each thread.
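
A sketch of that scheme (all names illustrative); the trade-off is that a thread that consumes more than nlist values starts reusing the same sequence:

program shared_list
    use omp_lib
    implicit none
    integer, parameter :: nlist = 1000000
    real(8) :: rlist(nlist), u
    integer :: ptr

    call random_number(rlist)   ! fill the shared list once, serially

    !$omp parallel private(ptr,u)
    ! each thread starts at its own offset into the shared list
    ptr = omp_get_thread_num() * (nlist/omp_get_num_threads())
    ! ... then, wherever a draw is needed:
    ptr = mod(ptr, nlist) + 1   ! advance the private cursor, wrapping around
    u   = rlist(ptr)
    !$omp end parallel
end program shared_list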

An accuracy issue is how to handle extreme events with a low-resolution random number generator. I have successfully done many simulation studies with 16-bit and 32-bit random number generators.
Extreme events can be addressed with a changed approach: rather than repeatedly testing whether the event happens in the current period, it can be more accurate to sample the step interval until the next event occurs and count down.
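
For a per-period event probability p, that interval can be drawn directly by inverting the geometric distribution. A minimal sketch (p is illustrative):

program geometric_gap
    implicit none
    real(8) :: u, p
    integer :: gap
    p = 0.01d0                  ! illustrative event probability
    call random_number(u)       ! u in [0,1)
    ! inversion: P(gap = k) = (1-p)**(k-1) * p for k = 1, 2, ...
    gap = max(1, ceiling(log(1d0-u)/log(1d0-p)))
    write(*,*) "periods until the next event:", gap
end program geometric_gap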


It can be thread-safe, but with the entire call protected by a mutex. And judging from the timings reported by @aledinola, this is probably what happens (in which case using a critical region in addition is useless).

Indeed, it seems that RANDOM_NUMBER() in gfortran is not only thread-safe but can also be executed independently by different threads (without any mutex).


That said, with ifort 21 on Linux I can’t reproduce the catastrophic multithreaded performance reported by @aledinola (though I get no speed-up either). So there may be a Windows-specific issue.


I see an allocatable within the parallel region. I once had a problem with ifort on Windows: the scalable allocator from the TBB library was not linked/called. I could not resolve this and had to switch to an external scalable allocator. The linked thread contains a small example code that demonstrates the problem.

There are some other threads in this forum on OpenMP performance differences between Linux and Windows.

Anyway, Intel VTune was really helpful in identifying that issue. On Linux, I prefer perf.


I had a look at your post, interesting! But in my code I do not have any allocate/deallocate statements within the parallel region (unless I missed something).

I do use several allocatable arrays inside the parallel region, but they are allocated and deallocated outside it, before or after. Also, there is no automatic allocation.

Thanks for checking!