Looking at your initialization of arrays, it looks like you might be hitting a “first touch” policy problem which can possibly hinder performance. You might try to initialize your arrays such that each thread initializes a subsection… No guarantees thought but it might be worth taking a look at that.
As a proof-of-concept you could use the simple random-number generator used here with OpenMP and if the results are encouraging look at some more complex PRNGs?
Using the supplied test cases as-is and then with the intrinsics would help confirm the root cause.
https://people.sc.fsu.edu/~jburkardt/f_src/random_openmp/random_openmp.f90
Are there MSWIndows tools to show the mutex calls? That might be useful. Even something like a MSWindows version of strace? Not a MSWindows user, so not sure what it has.
Sorry, looking at the code within the forum is not that easy. I missed that the allocatable variable row_sum(:)
is not used. But it is actually declared within the parallel region. So the deallocate is automatically invoked. However, I think for a non-allocated variable it should not have much impact (currently I cannot check in windows, the Linux perf tool does not show much interesting stuff, except that random_number eats up all the resources if used.)
BTW, not having 64bit unsigned integer numbers with well defined overflow behaviour makes it difficult to implement well-established random number generators in fortran. There are a couple generators which support split and merge operations, which helps to avoid any syncronisation within a parallel region. gfortran implements the xoshiro256** generator and its “jump” function for splitting (see libgfortran/intrinsics/random.c of in its current source tree). But I do not see how that is actually used by the compiler in conjunction with openmp code.
The only “first touch” that I can see out of the parallel region is sim_ind_a(1,:) = init_a
, which is a very limited one. I doubt moving it to the parallel region make a significant difference.
However, this makes me think that most of the code is maybe memory bound, as it has to write a very large amound of data (12GB estimated with the parameters as posted). If so, most of code being memory bound and the call to RANDOM_NUMBER()
being serialized by a mutex, there’s no chance to get a significant speed-up with multithreading.
Thanks! I actually thought of replacing the intrinsic random_number with a simpler generator from the book “Modern Fortran explained 2018” (solution to exercise 6 chapter 9) and test this against the intrinsic.
On my PC (Ryzen5700X(8-core, max-freq set to 3.4GHz) + Ubuntu22), the timing with gfortran-12 -O3 -march=native -fopenmp is like this:
1-thread : stdout -> 11.6 sec (time command -> 8.50user 4.55system)
2-thread : stdout -> 6.8 sec (time command -> 8.57user 4.58system)
4-thread : stdout -> 4.2 sec (time command -> 8.63user 4.54system)
8-thread : stdout -> 3.0 sec (time command -> 8.72user 4.64system)
If I replace random_number()
with u = 0.5d0
, the result is:
1-thread : stdout -> 6.9 sec (time command -> 3.74user 4.57system)
2-thread : stdout -> 4.4 sec (time command -> 3.83user 4.58system)
4-thread : stdout -> 3.1 sec (time command -> 3.81user 4.64system)
8-thread : stdout -> 2.4 sec (time command -> 3.85user 4.84system)
For serial runs, my PC is ~ 2 times slower than your laptop (= Core i7 12800H, max 4.8 GHz?, the memory (DDR4 vs DDR5?) may also be different):
https://www.cpubenchmark.net/cpu.php?cpu=Intel+Core+i7-12800H&id=4778
But with threading the program becomes faster to some extent, although the scaling is not very good (probably because memory bound? (as mentioned above))
These pages may also be related:
Though this is not about ifort + Windows, similar things might be happening…? (But it also seems (to me) that the speed-down is too large for the above laptop.)
Note that the consumption part is not parallelised and takes a significant amount of runtime. But at least it should not get slower the more threads are used.
Thanks for pointing this out. Yes, I should move the consumption part inside the parallel region
Unless I missed something in his post, it seems to me that @septc used only gfortran for tests. It seems that in contrast to ifort, RANDOM_NUMBER()
has no mutex in gfortran.
You’re right, I had missed it
I installed gfortran and I ran a speed comparison between ifort
and gfortran
, serial vs parallel code. The results confirm what others have pointed out. Here is a table with running time in seconds
Num cores | Time gfortran | Time ifort | |
---|---|---|---|
OS | Windows 11 | Windows 11 | |
version | GNU Fortran (GCC) 12.3.0 | Version 2021.11.0 | |
flags | -O3 -march=native -fopenmp |
-O3 -Qipo -QxHost -Qopemp | |
serial | 0 | 7.2 | 4.88 |
parallel | 16 | 1.016 | 50.82 |
I’m quite happy about the speed gain with gfortran (~7 times). Instead, ifort is faster in serial but the performance on multiple cores is abysmal. I think for simulations I’ll go with gfortran (given that I’d like to use the intrinsic random number subroutine). As I side note, this small example is yet another instance that shows how gfortran (an open-source compiler) manages to stay competitive
P.S. If someone can suggest some compiler flag I could use with gfortran to further improve speed (esp. in serial)… thanks
I report below a slightly updated version of the MWE
module mymod
implicit none
private
public :: linspace, markov_simul, myerror
contains
subroutine myerror(string)
implicit none
character(len=*), intent(in) :: string
write(*,*) "ERROR: ", string
write(*,*) "Program will terminate.."
stop
end subroutine myerror
function linspace(a, b, num) result(array)
! Input parameters
real(8), intent(in) :: a ! The starting value
real(8), intent(in) :: b ! The stopping value
integer, intent(in) :: num ! The number of points
! Output array
real(8) :: array(num)
! Local variables
integer :: i
real(8) :: step
! Error handling for num <= 1
if (num == 1) then
array(1) = a
return
endif
! Calculate the step size
step = (b - a) / real(num - 1, 8)
! Populate the array
do i = 1, num
array(i) = a + (i - 1) * step
enddo
end function linspace
function simul_iid(z_prob,dbg) result(ind_sim)
implicit none
!##### DESCRIPTION #############
!Simulate a random draw from a discretized distribution
! Note: this is fast if the size of z_prob is small
! When z_prob is large (say, more than 100 elements), sample_pmf is faster
!##### INPUTS/OUTPUTS #############
real(8), intent(in) :: z_prob(:) !discretized prob vector, must be >=0 and sum to 1
integer, intent(in) :: dbg !0-1 flag
integer :: ind_sim !index for the draw
!##### LOCALS #############
integer :: n, i
real(8) :: u ! random number (a scalar)
real(8) :: check, prob_sum
!##### ROUTINE CODE #############
n = size(z_prob)
if (dbg==1) then
if ( any(z_prob<0d0) .or. any(z_prob>1d0) ) then
error stop "simul_iid: 0<=prob<=1 violated"
endif
check = sum(z_prob)
if (abs(check-1d0)>1d-6) then
error stop "simul_iid: prob does not sum to one"
endif
endif
call RANDOM_NUMBER(u)
!u = 0.5d0
do i = 1,n-1
!Cumulative sum p1+p2+..+pi
prob_sum = sum(z_prob(1:i))
if (u <= prob_sum ) then
ind_sim = i
return
endif !END IF
enddo !end for
! Else, choose the last value
ind_sim = n
end function simul_iid
subroutine markov_simul(ysim,Ptran,Tsim,y1,dbg_in)
implicit none
!##### DESCRIPTION #############
! Routine for simulating a Markov chain with time-independent
! transition matrix (standard case). To increase speed, set
! the debug flag to zero.
! This subroutine calls the function simulate_iid.
!##### INPUTS/OUTPUTS #############
real(8), intent(in) :: Ptran(:,:) ! Transition matrix, dim: (ns,ns)
integer, intent(in) :: Tsim ! Lenght of the time series to simulate
integer, intent(in) :: y1 ! Initial condition
integer, intent(out) :: ysim(:) ! Simulated time series s.t. y_sim(1)=y1, dim: (Tsim)
integer, intent(in),optional :: dbg_in ! 0-1 debug flag
!##### LOCALS #############
integer :: n, i, ns, t, dbg
real(8) :: check, prob_sum
!##### ROUTINE CODE #############
if (present(dbg_in)) then
dbg = dbg_in
else
dbg = 0
endif
ns = size(Ptran,dim=1)
! Check if inputs are correct (only if debug flag = 1)
if (dbg==1) then
if (size(Ptran,1)/=size(Ptran,2)) then
call myerror("markov_simul: Ptran is not square")
endif
if ( any(Ptran<0d0) .or. any(Ptran>1d0) ) then
call myerror("markov_simul: 0<=prob<=1 violated")
endif
if (Tsim < 2) then
call myerror("markov_simul: Tsim must be at least 2")
endif
if ( y1<1 .or. y1>ns) then
call myerror("markov_simul: initial condition is out of bounds")
endif
if ( size(ysim)/=Tsim ) then
call myerror("markov_simul: ysim has wrong size")
endif
endif
! Simulate series of length Tsim
ysim(1) = y1
do t = 1,Tsim-1
ysim(t+1) = simul_iid(Ptran(ysim(t),:),dbg)
enddo
end subroutine markov_simul
end module mymod
!================================================================================================
program main
! THIS CODE USES OPENMP
use mymod, only: linspace, markov_simul
use omp_lib
implicit none
! ------------------ Declarations ------------------
integer :: par_fortran
integer, parameter :: nthreads = 16
integer, parameter :: N_sim = 14000
integer, parameter :: T_sim = 30000
integer, parameter :: n_a=500, n_z=2
real(8) :: t1,t2
real(8) :: Pi(2,2), z_grid(2), b, grid_max, r, w
integer, allocatable :: pol_ap_ind(:,:), sim_ind_a(:,:), sim_ind_z(:,:)
real(8), allocatable :: a_grid(:), sim_a(:,:), sim_c(:,:)
! Variables for the simulation
integer :: hh_c, t, init_a, init_z
! ------------------ End declarations ------------------
write(*,*) "Starting program..."
write(*,*) "max # of threads = ",omp_get_max_threads()
CALL OMP_SET_NUM_THREADS(nthreads)
!$omp parallel if (par_fortran==1)
write(*,*) "hello from thread ",omp_get_thread_num() !," of ",nthreads
!$omp end parallel
write(*,*) "Type 1 to run in parallel, 0 to run in serial"
read(*,*) par_fortran
! ---------------------- Set economic parameter values ----------------------!
Pi(1,:) = [0.6d0, 0.4d0]
Pi(2,:) = [.05d0, 0.95d0]
z_grid = [0.5d0, 1.0d0]
r = 0.04d0
w = 1.0d0
b = 0.0d0
grid_max = 4.0d0
! Set grid for assets
allocate(a_grid(n_a))
a_grid = linspace(b,grid_max,n_a)
! Set policy function for assets (index)
allocate(pol_ap_ind(n_a,n_z))
pol_ap_ind(:,1) = n_a/5 ! Low shock
pol_ap_ind(:,2) = n_a/2 ! High shock
write(*,*) "============================================================="
write(*,*) "Starting SIMULATION..."
write(*,*) "============================================================="
t1 = omp_get_wtime()
allocate(sim_a(T_sim+1,N_sim),sim_c(T_sim,N_sim),sim_ind_a(T_sim+1,N_sim), sim_ind_z(T_sim,N_sim))
! Initial conditions for state variables "a" and "z"
init_a = 1
init_z = n_z/2
! Recall: first index is time, second index is individual
sim_ind_a(1,:) = init_a
! Loop over time periods to simulate assets
!t1 = omp_get_wtime()
! Outer loop over individuals
!$omp parallel if (par_fortran==1) default(shared) private(hh_c,t)
!$omp do schedule(dynamic)
do hh_c=1,N_sim
!write(*,*) "Doing hh_c = ", hh_c
! Simulate a sequence of shocks (length Tsim) for individual hh_c
call markov_simul(sim_ind_z(:,hh_c),Pi,T_sim,init_z)
! Inner loop over time periods
do t=2,T_sim+1
! Update assets
sim_ind_a(t,hh_c) = pol_ap_ind(sim_ind_a(t-1,hh_c),sim_ind_z(t-1,hh_c))
sim_a(t,hh_c) = a_grid(sim_ind_a(t,hh_c))
enddo
! Update consumption
do t=1,T_sim
sim_c(t,hh_c) = (1d0+r)*sim_a(t,hh_c)+w*z_grid(sim_ind_z(t,hh_c))-sim_a(t+1,hh_c)
enddo
enddo
!$omp enddo
!$omp end parallel
t2 = omp_get_wtime()
write(*,*) "============================================================="
write(*,*) 'SIMULATION runs for ',t2-t1,' seconds.'
write(*,*) "============================================================="
write(*,*) "============================================================="
write(*,*) 'MODEL MOMENTS'
write(*,'(X,A,F10.4)') "Average assets, simulation: ", sum(sim_a)/size(sim_a)
write(*,'(X,A,F10.4)') "Average consumption, simulation: ", sum(sim_c)/size(sim_c)
write(*,*) 'EXAMPLE SIMULATED SERIES FOR 10 INDIVIDUALS at t=1000'
write(*,'(X,A)') "hh_c, sim_a, sim_c"
do hh_c=1,10
write(*,'(I5,10F10.4)') hh_c, sim_a(1000,hh_c), sim_c(1000,hh_c)
enddo
deallocate(a_grid,pol_ap_ind,sim_a,sim_c,sim_ind_z)
write(*,*) "Program terminated OK!"
end program main