Random numbers in parallel

I would like to generate a large array of random numbers for a simulation and I am trying to speed things up with OpenMP. Since the performance of the intrinsic random_number is terrible in parallel (I am using ifort on Windows; results are very similar with ifx), I am using as an alternative a routine from the famous Numerical Recipes textbook (Fortran 90 version, of course). However, the performance is still not very good, even though it is better than that of the intrinsic random_number.
So I have two questions:

  1. How can I make the OpenMP code faster? Especially the version that uses the Numerical Recipes routine, since random_number on Windows is apparently a lost cause.
  2. Are the OpenMP directives correct? In particular, I am not sure how to declare the variable idum: shared, private or firstprivate. Note that idum has the INTENT(INOUT) attribute in the function ran from Numerical Recipes. Furthermore, the function contains some local variables declared with the SAVE attribute; I am not sure whether this creates problems when parallelizing.

I provide below an MWE to make it easier for users interested in replicating this issue. Thanks in advance to anyone who can help me out with this!

[Small note: I shared a post on a related topic some time ago (see OpenMP stochastic simulation). However, here the focus is quite different.]

module mod_random
    implicit none
    ! Reference: Numerical Recipes in Fortran 90 
    private
    public :: ran
    
    contains
    
    ! TODO: Add routine for vectors
    
    FUNCTION ran(idum)
        IMPLICIT NONE
        INTEGER, PARAMETER :: K4B=selected_int_kind(9)
        INTEGER(K4B), INTENT(INOUT) :: idum
        REAL(8) :: ran
        INTEGER(K4B), PARAMETER :: IA=16807,IM=2147483647,IQ=127773,IR=2836
        REAL(8), SAVE :: am
        INTEGER(K4B), SAVE :: ix=-1,iy=-1,k
        if (idum <= 0 .or. iy < 0) then
            am=nearest(1.0,-1.0)/IM
            iy=ior(ieor(888889999,abs(idum)),1)
            ix=ieor(777755555,abs(idum))
            idum=abs(idum)+1
        endif
        ix=ieor(ix,ishft(ix,13))
        ix=ieor(ix,ishft(ix,-17))
        ix=ieor(ix,ishft(ix,5))
        k=iy/IQ
        iy=IA*(iy-k*IQ)-IR*k
        if (iy < 0) iy=iy+IM
        ran=am*ior(iand(IM,ieor(ix,iy)),1)
    END FUNCTION ran

end module mod_random
    
program main
    use mod_random, only: ran
    use omp_lib ! OpenMP library
    implicit none
    
    ! Declare variables
    integer, parameter :: par_fortran = 1
    integer, parameter :: T_sim = 1000   ! In my real code, this is larger
    integer, parameter :: N_sim = 50000  ! In my real code, this is larger
    logical, parameter :: do_print = .false.
    integer :: i, hh_c, iostat, idum
    real(8) :: start, finish    ! Timing variables
    real(8), allocatable :: x(:,:)
    
    write(*,*) "Hello, world!"
    
    ! Call random number generator
    allocate(x(T_sim,N_sim),stat=iostat)
    if (iostat /= 0) then
        print *, 'Error allocating memory'
        stop
    endif
    
    write(*,*) "Numerical recipes <ran>"
    
    ! Start timing
    start = omp_get_wtime()
    
    idum = -1
    ! QUESTION: Should idum be declared as shared, private or firstprivate?
    !$omp parallel if (par_fortran==1) default(shared) private(hh_c,i) firstprivate(idum)
    !$omp do
    do hh_c=1,N_sim
        do i = 1,T_sim
            x(i,hh_c) = ran(idum)
            !print *, idum
        enddo
    enddo
    !$omp enddo 
    !$omp end parallel
    
    ! Stop timing
    finish = omp_get_wtime()
    
    ! Print execution time in seconds
    print *, 'Execution Time (s):', finish - start
    
    if (do_print) call print_matrix(x)
    
    write(*,*) "idum = ", idum
    write(*,*) "Mean(x) = ", sum(x)/real(size(x),8)
    
    !----------------------------------------------------------------
    write(*,*) "----------------------------------------------------------------"
    write(*,*) "Intrinsic random_number"
    
    if (allocated(x)) deallocate(x)
    allocate(x(T_sim,N_sim))
    
    ! Start timing
    start = omp_get_wtime()
    
    !$omp parallel if (par_fortran==1) default(shared) private(hh_c,i)
    !$omp do
    do hh_c=1,N_sim
        do i = 1,T_sim
            call RANDOM_NUMBER(x(i,hh_c))
        enddo
    enddo
    !$omp enddo 
    !$omp end parallel
    
    ! Stop timing
    finish = omp_get_wtime()
    
    ! Print execution time in seconds
    print *, 'Execution Time (s):', finish - start
    
    if (do_print) call print_matrix(x)
    
    write(*,*) "Mean(x) = ", sum(x)/real(size(x),8)
    
    if (allocated(x)) deallocate(x)
    
    contains
    
    subroutine print_matrix(x)
    
        real(8), intent(in) :: x(:,:)
        integer :: i, j
        
        !Display matrix x(row,col) on screen
        do i = 1,size(x,1)
            do j = 1,size(x,2)
                write(*,'(f12.6," ")',advance='no') x(i,j)
                write(*,'(" ")', advance = 'no')
            enddo
            write(*,"()")  
        enddo
        write(*,*) " "
    
    end subroutine print_matrix
    
    
end program main

Correction: I initially wrote that idum has the SAVE attribute in the routine, but that is of course not true; idum is intent(inout). I have corrected this mistake in my original post.

Your question was discussed at Pseudorandom number generator · Issue #135 · fortran-lang/stdlib · GitHub, where Leon Foks mentioned his module Prng_Class:

Class providing a pseudo-random number generator with an xorshift128+ or xorshift1024* generator that are both suitable for parallel applications. This class is thread safe, and can be used in parallel without Prngs on different threads affecting each other.

The Scalable Parallel Random Number Generators Library (SPRNG) is in C and C++ but has a Fortran interface. A previous thread was RNG for MC simulations, where some vendor libraries were mentioned.


idum being firstprivate is correct. However:

  • the different threads will generate exactly the same sequence of numbers, since the seed (idum) is the same for all of them.
  • this routine is not thread-safe, because of the variables that have the save attribute. This also has consequences for performance, as all the threads write to the same memory addresses.

You should convert all the save’d variables into intent(inout) arguments.
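
For concreteness, here is a sketch of how ran could look after that change (the same algorithm as in your post, only with the former save’d variables passed in as arguments; the caller then owns the state):

FUNCTION ran(idum,am,ix,iy,k)
    IMPLICIT NONE
    INTEGER, PARAMETER :: K4B=selected_int_kind(9)
    INTEGER(K4B), INTENT(INOUT) :: idum
    REAL(8), INTENT(INOUT) :: am
    INTEGER(K4B), INTENT(INOUT) :: ix,iy,k
    REAL(8) :: ran
    INTEGER(K4B), PARAMETER :: IA=16807,IM=2147483647,IQ=127773,IR=2836
    ! The state (am, ix, iy, k) now lives in the caller, so every thread
    ! can hold its own private copy.
    if (idum <= 0 .or. iy < 0) then
        am=nearest(1.0,-1.0)/IM
        iy=ior(ieor(888889999,abs(idum)),1)
        ix=ieor(777755555,abs(idum))
        idum=abs(idum)+1
    endif
    ix=ieor(ix,ishft(ix,13))
    ix=ieor(ix,ishft(ix,-17))
    ix=ieor(ix,ishft(ix,5))
    k=iy/IQ
    iy=IA*(iy-k*IQ)-IR*k
    if (iy < 0) iy=iy+IM
    ran=am*ior(iand(IM,ieor(ix,iy)),1)
END FUNCTION ran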


Thanks for your suggestions!

However, if I convert the following variables

REAL(8), SAVE :: am
INTEGER(K4B), SAVE :: ix=-1,iy=-1,k

to

REAL(8), INTENT(INOUT) :: am
INTEGER(K4B), INTENT(INOUT) :: ix,iy,k

should I initialize them before the subroutine is called? For example, idum is INOUT and I initialize it to a negative value

Yes, you have to initialize them outside of the routine. Moreover, if you want the sequences to be different between the threads, you should initialize idum… randomly!

    !$omp parallel if (par_fortran==1) default(shared) private(hh_c,i,idum,am,ix,iy,k) 
    !$omp critical
    call random_number(foo)
    idum = -int(foo*huge(idum))
    !$omp end critical
    ix = -1; iy = -1
    !$omp do
    do hh_c=1,N_sim
        do i = 1,T_sim
            x(i,hh_c) = ran(idum,am,ix,iy,k)
            !print *, idum
        enddo
    enddo
    !$omp enddo 
    !$omp end parallel
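
(In this snippet, foo is a real(8) scalar, and am, ix, iy and k are declared in the main program with the same kinds they have inside ran, so that each thread can carry its own private copy of the generator state.)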

Thanks! I notice however that you haven’t initialized the real variable am, which is also intent(inout) in the subroutine. I think it doesn’t matter, but I wanted to be sure

am is internally initialized at the first call, so no need to initialize it beforehand.

Looking more closely at the code, I don’t get why k has the save attribute. It doesn’t need it (and therefore you don’t have to convert it to an argument).


Here is an example I adapted from one of the MKL Random Number Generation examples:

! mkl_ex.f90 -- Example of generating uniform random numbers with oneMKL
!
! To compile:
!  $ ifx -O2 -qmkl -o mkl_ex mkl_ex.f90
! Run with
!  $ mkl_ex <n>
!
include 'mkl_vsl.f90'

program MKL_VSL_GAUSSIAN

USE MKL_VSL_TYPE
USE MKL_VSL

implicit none

real(kind=8), allocatable :: r(:)  ! buffer for random numbers
real(kind=8) s        ! average
real(kind=8) a, b, sigma ! parameters of normal distribution

TYPE (VSL_STREAM_STATE) :: stream

integer(kind=4) errcode
integer(kind=4) i,j
integer brng,method,seed,n, istat
character(len=32) arg

if (1 /= command_argument_count()) then
  write(*,'(A)') "Usage: program_name <integer>"
  stop 1
end if

call get_command_argument(1, arg)
read(arg, *, iostat=istat) n

if (istat /= 0) then
  write(*,'(A)') "Error: Invalid integer argument."
  stop 2
end if

allocate(r(n))

s = 0.0
a = 0.0
b = 1.0
brng=VSL_BRNG_MT19937
method=VSL_RNG_METHOD_UNIFORM_STD
seed=777

! ***** Initializing *****
errcode=vslnewstream( stream, brng,  seed )

! ***** Generating *****
do i = 1,10
    errcode=vdrnguniform( method, stream, n, r, a, b )
    !errcode=vdrnggaussian( method, stream, n, r, a, b )
    s = s + sum(r)
end do

s = s / (10.0d0*n)

! ***** Deinitialize *****
errcode=vsldeletestream( stream )
deallocate(r)

! ***** Printing results *****
print *,"Sample mean of uniform distribution = ", s

end program

Output:

$ time ./mkl_ex 10000000
 Sample mean of uniform distribution =   0.500016546433347     

real	0m0.625s
user	0m0.217s
sys	    0m0.038s

You could initialize a big array in parallel by using block splitting as recommended in the following answer on the Intel forums:

The block splitting is explained under the vslSkipAheadStream function. The PRNG (or rather BRNG, as it is called in the MKL documentation) has to support skip-ahead; this is documented in the Notes for Intel® oneAPI Math Kernel Library Vector Statistics.
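
As a rough sketch of what that could look like (my own adaptation rather than one of the shipped MKL examples; I use VSL_BRNG_MRG32K3A here since it is one of the BRNGs that supports skip-ahead), each thread creates its own stream from a common seed and skips ahead by a disjoint block before filling its slice of the array:

! Sketch: block splitting with vslskipaheadstream, one stream per thread
include 'mkl_vsl.f90'

program mkl_block_splitting

use MKL_VSL_TYPE
use MKL_VSL
use omp_lib

implicit none

integer, parameter :: n = 10000000
integer, parameter :: seed = 777
real(kind=8), allocatable :: r(:)
type(VSL_STREAM_STATE) :: stream
integer :: errcode, nt, tid, chunk, lo, hi

allocate(r(n))

!$omp parallel default(shared) private(stream, errcode, nt, tid, chunk, lo, hi)
nt = omp_get_num_threads()
tid = omp_get_thread_num()
chunk = (n + nt - 1)/nt
lo = tid*chunk + 1
hi = min(n, (tid + 1)*chunk)
if (lo <= hi) then
    errcode = vslnewstream( stream, VSL_BRNG_MRG32K3A, seed )
    ! Skip ahead so this thread's block does not overlap any other thread's block
    errcode = vslskipaheadstream( stream, int(tid, kind=8)*int(chunk, kind=8) )
    errcode = vdrnguniform( VSL_RNG_METHOD_UNIFORM_STD, stream, hi-lo+1, r(lo:hi), 0.0d0, 1.0d0 )
    errcode = vsldeletestream( stream )
end if
!$omp end parallel

print *, "Sample mean of uniform distribution = ", sum(r)/n

end program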


The pm_distUnif module of the ParaMonte library (source here) also has a generic arbitrary-precision real random number generator, xoshiro256ssw_type, to which you can pass the process ID to generate a random sequence unique to the specific parallel image/process/thread. Here is more information on the parallel xoshiro256** RNG.

With unique parallel rngs in hand, you can then pass them to the generic interface function getUnifRand or subroutine setUnifRand to generate scalars or arrays of arbitrary type, kind, or rank. Here is a relevant benchmark on the speed of xoshiro256ssw_type compared to the intrinsic random_number() (~4 times faster than intrinsic procedures for real64 random sequences, while maintaining procedure purity). Here is another benchmark showing the apparent relative superiority of the performance of xoshiro256ss_type over random_number() (the benchmark compiler was either gfortran or ifort with -O3 flag).

To download the module source or make a shared library of it, try, in Unix Bash,

git clone https://github.com/cdslaborg/paramonte.git && cd paramonte
./install.sh --mod distUnif

or, in a Windows Batch environment,

git clone https://github.com/cdslaborg/paramonte.git && cd paramonte
install.bat --mod distUnif

If you run into trouble or have comments for enhancements, please share here.

The installation requires a recent CMake installation, available in the shell path, and either gfortran or ifort (ifx may also work, but unfortunately it is too buggy at this point and cannot fully compile this library and run its examples).

A copy of the processed source files will be in the folder ./_bin/libparamonte_fortran_*/fpp/ and the examples source files in ./_bin/libparamonte_fortran_*/example/pm_distUnif/ (to run the supplied examples, the flag --mod distUnif must be removed from the installation command, because the examples have dependencies on other modules besides pm_distUnif). Here are more installation directions.


The default gfortran generator under the hood of random_number() uses this algorithm.

I have been using xoroshiro128+ in my rng_par_zig.f90 for quite a few years already. It is a modernization of the preceding code by others mentioned in the header and includes the jump functions and functions for random reals in other distributions.

It is a completely standalone source file: no fpm, CMake or anything else is needed, just one compiler command. There is just one cpp macro used. For most use cases it can be deleted, but if strict compliance with respect to signed integer overflow is necessary, there is a C function in a separate source file that can be called instead, since Fortran sadly lacks unsigned integers.

In my opinion, for many applications the 1024 version is overkill for little reason; 128 is more than enough and is faster.


You can try the new gfortran extension: Unsigned integers (The GNU Fortran Compiler)

I avoid compiler extensions as a rule. Their use in PRNGs is really just to get proper wrap-around on overflow, so most often signed integers just work, silently ignoring the limitation. If one needs to satisfy a checker like the UB sanitizers, one can call the C function; the compiler will inline it without any performance loss.


Don’t forget -fwrapv in case of gfortran 14 and above.

Thanks for pointing out the package by Leon Foks. I tried to use it, but I was not successful.

I copied Leon’s test program, available here: Prng_Class – Fortran Program, and very slightly adapted it to my needs. I show my test program below:

program PrngTest

use omp_lib
use Prng_Class, only: Prng, getRandomSeed

implicit none

type(Prng), allocatable :: rng(:) ! Array of Prng classes, one for each thread
integer(i64) :: seed(16) ! Use xorshift1024*, so the seed is size 16
integer(i32) :: i, id
integer(i32) :: nThreads, iThread

real(r64) :: a

! Get a randomly generated seed
call getRandomSeed(seed, .true.)

! Get the number of threads available
!$omp parallel 
  nThreads = omp_get_num_threads()
!$omp end parallel

! Allocate an array of Prngs, one for each thread
allocate(rng(nThreads))

! In parallel, initialize each Prng with the same seed, and jump each prng by the thread ID it is associated with.
! This allows all Prngs to draw from the same stream, but at different points along the stream.
! This is better than giving each Prng its own randomly generated seed.

!$omp parallel shared(rng, seed) private(iThread, a)
  iThread = omp_get_thread_num()
  rng(iThread + 1) = Prng(seed, big = .true.)
  call rng(iThread + 1)%jump(iThread) ! Jump the current thread's Prng by its thread number.
  call rng(iThread + 1)%rngNormal(a) ! Draw from normal distribution on each thread
!$omp end parallel

stop
end program

First of all, I found a typo in Leon’s file, in the line iThreads = omp_get_thread_num(): the variable iThreads is not defined anywhere. I assume he meant iThread. Fixing this solves the first compilation error.

However, I get another compilation error: the types i64, i32 are not defined. So I added the following line to the main program:

use variableKind, only: r64, i32, i64

This solved that issue, but then I got the following:

ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGINTEGER_128PLUS
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGINTEGER_1024STAR
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_JUMPSTATE_128PLUS
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_JUMPSTATE_1024STAR
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGEXPONENTIAL_D1_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGEXPONENTIAL_D1D_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGEXPONENTIAL_D2D_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGEXPONENTIAL_D3D_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGINTEGER_I1_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGINTEGER_I1D_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGINTEGER_I2D_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGINTEGER_I3D_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGNORMAL_D1_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGNORMAL_D1D_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGNORMAL_D2D_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGNORMAL_D3D_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGUNIFORM_D1_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGUNIFORM_D1D_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGUNIFORM_D2D_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGUNIFORM_D3D_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGWEIBULL_D1_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGWEIBULL_D1D_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGWEIBULL_D2D_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: warning #11021: unresolved PRNG_CLASS_mp_RNGWEIBULL_D3D_PRNG
        Referenced in C:\Users\aledi\AppData\Local\Temp\ipo_3083642.obj
ipo: error #11023: Not all components required for linking are present on command line

Try the code I linked; it should be very easy to compile. You just need the -cpp flag, or you can delete the macro block. Otherwise the code is a conservative modernization of an older one, so it avoids any object orientation and other fancy features.
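
For instance, with gfortran and OpenMP enabled, something like this should be all that is needed (file names other than rng_par_zig.f90 are just placeholders):

gfortran -cpp -fopenmp -O2 rng_par_zig.f90 main.f90 -o main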


Hi @VladimirF, thanks! I managed to compile your modules on Windows with ifort (the library of Leon Foks was indeed too fancy; I gave up). Do you have a test program to learn how to use the random number generator? Preferably in parallel, but a serial example would also help.
Thanks again for providing this module!

Yes, this is a quick example program:

    use iso_fortran_env
    use rng_par_zig
    !$ use omp_lib

    implicit none
    
    ! Just some initial entropy for a reproducible random sequence
    ! It can be filled with some random data instead
    integer(int64), parameter :: base(2) = [int(Z'1DADBEEFBAADD0D0', int64), &
                                            int(Z'5BADD0D0DEADBEEF', int64)]
                                                                                
    integer :: i, j, k, n
    integer(int64) :: seed(2)
    integer :: tid, nt
    
    real(real64), allocatable :: A(:,:,:)
    
    write(*,*) "Enter n:"
    read(*,*) n
    
    allocate(A(n,n,n))
    
    nt = 1
    !$omp parallel
    !$omp single
    !$ nt = omp_get_num_threads()
    !$omp end single
    !$omp end parallel

    seed = base
    
    !! For distributed parallelism:
    !! Jump nthread-times for each image so that each image
    !! has a disjunct set of random sequences for its threads.
    ! call rng_jump(seed, nt*process_number)


    call rng_init(nt, seed)
    
    
    ! fill an array with normally distributed random numbers
    !$omp parallel private(i,j,k,tid)
    
    tid = 0
    !$ tid = omp_get_thread_num()
    !$omp do collapse(3) 
    do k = 1, n
     do j = 1, n
       do i = 1, n
         call rng_norm(A(i,j,k), tid)
      end do
     end do
    end do
    !$omp end do
    
    !$omp end parallel
    
    print *, A
    
    
    ! fill an array with uniformly distributed random numbers
    !$omp parallel private(i,j,k,tid)
    
    tid = 0
    !$ tid = omp_get_thread_num()
    !$omp do collapse(3) 
    do k = 1, n
     do j = 1, n
       do i = 1, n
         call rng_uni(A(i,j,k), tid)
      end do
     end do
    end do
    !$omp end do
    
    !$omp end parallel
    
    print *, A
    
end

This is really helpful, thanks. Just a quick question:

The subroutine rng_uni always returns a scalar, correct? I was wondering if it can return an array, as the intrinsic random_number does. In my code I’m doing something like this:

integer, parameter :: N = 10000000
integer, parameter :: n_rand = 4
real(8) :: rand_vec(n_rand)
real(8), allocatable :: A(:)
!--------------------------------------------------!

allocate(A(N))

do i=1,N ! Parallelize this loop
   ! Draw a 4-element vector of uniform random numbers
   call random_number(rand_vec)
   ! Do a lot of time-consuming operations using rand_vec 
   ! and obtain real scalar some_value
   A(i) = some_value
enddo

I was thinking something like this might be useful

interface rng_uni
    module procedure rng_uni_scalar
    module procedure rng_uni_vector
end interface
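
For illustration, the vector version could simply loop over the existing scalar routine (a minimal sketch; rng_uni_scalar stands here for whatever the scalar implementation ends up being called inside the module, and tid is the thread id as in the earlier example):

subroutine rng_uni_vector(x, tid)
    use iso_fortran_env, only: real64
    real(real64), intent(out) :: x(:)
    integer, intent(in) :: tid
    integer :: i
    ! Fill the whole vector from this thread's stream, one element at a time
    do i = 1, size(x)
        call rng_uni_scalar(x(i), tid)
    end do
end subroutine rng_uni_vector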