MPI run time and array rank


rank=1

program main
    implicit none
    include 'mpif.h'
    integer           :: nprocs,ierr,myid
    real(8)           :: time1
    real(8)           :: time2
    integer,parameter :: n=1000
    complex(kind=8)    :: a(n*n)
    integer           :: i,j
    call mpi_init(ierr)
    call mpi_comm_size(mpi_comm_world,nprocs,ierr)
    call mpi_comm_rank(mpi_comm_world,myid,ierr)
    a=0
    call mpi_barrier(mpi_comm_world,ierr)
    time1=mpi_wtime()
    do i=1,1000
        a=a+cmplx(2.d0,-1.d0,8)
    end do
    time2=mpi_wtime()
    call mpi_barrier(mpi_comm_world,ierr)
    write(*,*)(time2-time1),a(1)
    call mpi_finalize(ierr)
end program main

rank=2

program main
    implicit none
    include 'mpif.h'
    integer           :: nprocs,ierr,myid
    real(8)           :: time1
    real(8)           :: time2
    integer,parameter :: n=1000
    complex(kind=8)   :: a(n,n)
    integer           :: i,j
    call mpi_init(ierr)
    call mpi_comm_size(mpi_comm_world,nprocs,ierr)
    call mpi_comm_rank(mpi_comm_world,myid,ierr)
    a=0
    call mpi_barrier(mpi_comm_world,ierr)
    time1=mpi_wtime()
    do i=1,1000
        a=a+cmplx(2.d0,-1.d0,8)
    end do
    time2=mpi_wtime()
    call mpi_barrier(mpi_comm_world,ierr)
    write(*,*)(time2-time1),a(1,1)
    call mpi_finalize(ierr)
end program main
compiler            cores   rank   time (s)
gfortran (mpif90)     1      1     1.574819803237915
gfortran (mpif90)     8      1     1.591102361679077
gfortran (mpif90)     1      2     0.679260969161987
gfortran (mpif90)     8      2     8.840477943420410
ifort (mpiifort)      1      1     0.391342599992640
ifort (mpiifort)      8      1     0.393174800032284
ifort (mpiifort)      1      2     0.557108099979814
ifort (mpiifort)      8      2     9.059296299994460

I am confused about these results. I expected the times to be of the same order, as in the rank=1 case, but the rank=2 array shows more than a 15-fold difference. Is there an error somewhere, or some trick I am missing? Thanks in advance.

Compile option: -O3
Computer: Intel i9-10900K (10 cores / 20 threads), 16 GB RAM
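
For reference, a minimal sketch (not part of the original test) of how a single time per run could be reported when several ranks run: reduce each rank's elapsed time to its maximum on rank 0 and print it once.

program time_max_sketch
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 1000
    complex(kind=8)    :: a(n,n)
    real(8)            :: time1, time2, dt, dtmax
    integer            :: nprocs, myid, ierr, i
    call mpi_init(ierr)
    call mpi_comm_size(mpi_comm_world, nprocs, ierr)
    call mpi_comm_rank(mpi_comm_world, myid, ierr)
    a = 0
    call mpi_barrier(mpi_comm_world, ierr)
    time1 = mpi_wtime()
    do i = 1, 1000
        a = a + cmplx(2.d0, -1.d0, 8)
    end do
    time2 = mpi_wtime()
    dt = time2 - time1
    ! report the slowest rank's time once, from rank 0
    call mpi_reduce(dt, dtmax, 1, mpi_double_precision, mpi_max, 0, mpi_comm_world, ierr)
    if (myid == 0) write(*,*) dtmax, a(1,1)
    call mpi_finalize(ierr)
end program time_max_sketch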

In the rank-2 case your matrix is 1000x1000, whereas in the rank-1 case it is only 1000 elements. That may make quite a difference.

It is actually a rank-1 array of size 1000*1000. Hence the question, I think.

If you change this line

a=a+cmplx(2.d0,-1.d0,8)

to

a = a + mod(i,10)

I am wondering how the timing changes on your machine. On my machine, even the rank-1 code becomes slower as the number of MPI tasks increases (possibly due to limited memory bandwidth?).

Yes, my result is the same as yours.

[chart: time vs. number of MPI tasks]


Oops, my bad - you’re right.

I am not familiar with MPI syntax, but doesn’t this imply that all threads do the full calculation?
I don’t see any load-sharing definition.
Would it be scheduling the “do i” loop (which fails because “a=” is shared between threads), or partitioning “a=”?
To gain anything, wouldn’t you need to partition “a=a+cmplx(2.d0,-1.d0,8)” between threads?

The computation charts certainly indicate this problem.

I converted the code to OpenMP using gfortran and provided partitioned load sharing, with the following results.
I used allocate to avoid stack problems.
The results show no issue with rank, but little gain from multi-threading.
I hope I have interpreted the problem correctly.

program omp_2
   use omp_lib
    implicit none
!    include 'mpif.h'
    integer           :: nprocs,ierr,myid
    integer,parameter :: n=1000
!   complex(kind=8)   :: a(n,n)
    complex(kind=8), allocatable :: a(:,:)
    integer           :: i,j,k, threads

    real(8), external :: elapse_time
    real(8)           :: time

    allocate ( a(n,n) )

    do threads = 1, omp_get_max_threads ()

      a=0

      time = elapse_time()
    !$OMP PARALLEL DO private(i,j,k) shared(a) schedule(STATIC)
      do j = 1, n
       do k = 1,n
        do i=1,n
          a(k,j) = a(k,j) + cmplx (2.d0,-1.d0,8)
        end do
       end do
      end do
    !$OMP END PARALLEL DO

      time = elapse_time() - time
      write (*,*) threads, time, a(1,1)

    end do

end program  omp_2

  real(8) function elapse_time ()
    integer(8) :: tick, rate
     call SYSTEM_CLOCK ( tick, rate )
     elapse_time = dble(tick) / dble(rate)
  end function elapse_time
Rank 1
gcc.ver=11.1.0
gcc_dir=C:\Program Files (x86)\gcc_eq\gcc_11.1.0
options=-fimplicit-none -g -O2 -march=native -ffast-math -fopenmp
 Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz
           1  0.28576523373249074                   (2000.0000000000000,-1000.0000000000000)
           2  0.27628413746288061                   (2000.0000000000000,-1000.0000000000000)
           3  0.26089710296400881                   (2000.0000000000000,-1000.0000000000000)
           4  0.31049151642173456                   (2000.0000000000000,-1000.0000000000000)

Rank 2
gcc.ver=11.1.0
gcc_dir=C:\Program Files (x86)\gcc_eq\gcc_11.1.0
options=-fimplicit-none -g -O2 -march=native -ffast-math -fopenmp
 Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz
           1  0.26826990427252895                   (2000.0000000000000,-1000.0000000000000)
           2  0.27871094916872607                   (2000.0000000000000,-1000.0000000000000)
           3  0.26204691542716319                   (2000.0000000000000,-1000.0000000000000)
           4  0.27194549221849229                   (2000.0000000000000,-1000.0000000000000)

Thanks for the reply. I find this question quite strange. Here are some results:

  • For the rank-2 array, the multi-core calculation is always slower than a single core.
  • For the rank-1 array, with some expressions the multi-core calculation is nearly as fast as a single core, but sometimes it is slower, like the rank-2 case.
  • If I change complex to real, the results are different.

:dizzy_face:

My interpretation does not show a rank-2 problem.
I did control the !$OMP DO to remove any race condition; however, my interpretation of the MPI calculation may not be correct.

I see, it may be my fault; maybe my computer's memory is not sufficient.
If I set n=500, the result is quite different.

a is a local array of 16 MB. It is very unlikely to exceed the computer's memory, but it could challenge some stack size allocations or be influenced by cache sizes. (The implied loop order for the rank-2 array assignment could be an issue, but that should be unlikely.)

Achieving the graphed result, where time is proportional to the number of threads, suggests one of the following:

  • processor time rather than elapsed time is being measured, or
  • the number of threads selected exceeds the number of cores/threads available to the program, or
  • the computation is not being distributed between threads (as I have assumed in the OMP example).

Yes, and contrary to the OpenMP case, each MPI task does the exact same calculation on its own local array a(n,n) or a(n*n). Note that the total memory footprint of the variable a in the MPI case is therefore nprocs*n*n*16 bytes, while it is still n*n*16 bytes for OpenMP.
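
For illustration, here is a minimal sketch of what actually distributing the update across MPI ranks could look like, with each rank owning a contiguous block of columns. The block split and the names jstart/jend are assumptions for the sketch, not taken from the original code.

program mpi_partition
    implicit none
    include 'mpif.h'
    integer, parameter :: n = 1000
    complex(kind=8), allocatable :: a(:,:)
    real(8)            :: time1, time2
    integer            :: nprocs, myid, ierr, i, ncols, jstart, jend
    call mpi_init(ierr)
    call mpi_comm_size(mpi_comm_world, nprocs, ierr)
    call mpi_comm_rank(mpi_comm_world, myid, ierr)
    ! each rank owns a contiguous block of columns
    ncols  = n/nprocs
    jstart = myid*ncols + 1
    jend   = jstart + ncols - 1
    if (myid == nprocs-1) jend = n   ! last rank takes any remainder
    allocate(a(n, jstart:jend))
    a = 0
    call mpi_barrier(mpi_comm_world, ierr)
    time1 = mpi_wtime()
    do i = 1, 1000
        a = a + cmplx(2.d0, -1.d0, 8)
    end do
    time2 = mpi_wtime()
    if (myid == 0) write(*,*) time2-time1, a(1, jstart)
    call mpi_finalize(ierr)
end program mpi_partition

With this split, the total footprint of a stays close to n*n*16 bytes regardless of nprocs.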

This looks like an optimisation problem to me. Have you tried looping over a manually?

With

do j=1,1000
do k=1,1000
do i=1,1000
    a(k,j)=a(k,j)+cmplx(2.d0,-1.d0,8)
end do
end do
end do

I get faster execution with one process, and with four processes the execution time is almost the same.
If you change the order of the loops so that the i-loop is the outermost, you'll end up with the same behaviour you described.

The reason is obvious:
Your array has a size of 2*8*1000*1000 bytes = 16 MB. The size of your L3 cache is 20 MB. So if you are running one process, the array fits into the cache. One more process (2 × 16 MB = 32 MB) and you should end up with loads of cache misses.
Now if you write explicit loops, you don't sweep over the whole array 1000 times; instead you sweep over it only once, and for each element you do the 1000 additions of a cmplx. This way you get far fewer cache misses.
Edit: I messed up the order of the loops and fixed it now.


Thanks a lot, it is useful. :grinning:

So that means the caching behaviour differs between one- and two-dimensional arrays?

At least it seems to be the case. I guess the compiler cannot optimise as much when there is a 2D array, and therefore two loops instead of one, compared to the 1D case.
And this is roughly the factor I would expect when comparing RAM and L3 in terms of bandwidth/speed.
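
To make that concrete, here is a sketch of the kind of loop interchange the optimiser can apply to the rank-1 case (my reading of the situation, not compiler output): each element is kept in a register while the 1000 repeated additions are applied, so the 16 MB array is swept only once.

program interchange_sketch
    implicit none
    integer, parameter :: n = 1000
    complex(kind=8)    :: a(n*n), tmp
    integer            :: i, k
    a = 0
    do k = 1, n*n
        tmp = a(k)              ! load the element once
        do i = 1, 1000
            tmp = tmp + cmplx(2.d0, -1.d0, 8)
        end do
        a(k) = tmp              ! store it once
    end do
    write(*,*) a(1)
end program interchange_sketch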

I did another test:
First I removed as much as possible from the code:

Minimal Fortran code
program main
    implicit none
    integer,parameter :: n=1000
    complex(kind=8)   :: a(n,n) ! or a(n*n)
    integer           :: i
    a=0
    do i=1,1000
        a=a+cmplx(2.d0,-1.d0,8)
    end do
end program main

Then I compiled it to assembly for both cases (1D and 2D) using this command: gfortran a.f90 -O3 -save-temps -fverbose-asm

After this I compared both outputs.
The 2D version ended up looking like this (only the interesting part)

2D assembly
.L4:
	leaq	a.0(%rip), %rax	#, ivtmp.30
.L3:
	xorl	%edx, %edx	# ivtmp.17
.L2:
# a.f90:8:         a=a+cmplx(2.d0,-1.d0,8)
	movapd	(%rax,%rdx), %xmm3	# MEM <vector(2) real(kind=8)> [(real(kind=8) *)vectp_a.8_24 + ivtmp.17_31 * 1], vect__26.10
	movapd	(%rax,%rdx), %xmm0	# MEM <vector(2) real(kind=8)> [(real(kind=8) *)vectp_a.8_24 + ivtmp.17_31 * 1], vect__25.11
	addpd	%xmm2, %xmm3	# tmp108, vect__26.10
	addpd	%xmm1, %xmm0	# tmp105, vect__25.11
	movsd	%xmm3, %xmm0	# vect__26.10, vect__25.11
	movaps	%xmm0, (%rax,%rdx)	# vect__25.11, MEM <vector(2) real(kind=8)> [(real(kind=8) *)vectp_a.8_24 + ivtmp.17_31 * 1]
	addq	$16, %rdx	#, ivtmp.17
	cmpq	$16000, %rdx	#, ivtmp.17
	jne	.L2	#,
	addq	$16000, %rax	#, ivtmp.30
	cmpq	%rax, %rsi	# ivtmp.30, _28
	jne	.L3	#,
# a.f90:7:     do i=1,1000
	subl	$1, %ecx	#, ivtmp_29
	jne	.L4	#,

while the 1D version was a lot more complex:

1D assembly
.L3:
	movsd	(%rsi), %xmm0	# MEM[(real(kind=8) *)_99], a_I_RE_lsm0.14
	movl	$499, %edx	#, ivtmp_101
.L2:
# a.f90:8:         a=a+cmplx(2.d0,-1.d0,8)
	addsd	%xmm1, %xmm0	# tmp122, tmp109
	addsd	%xmm1, %xmm0	# tmp122, a_I_RE_lsm0.14
	subl	$1, %edx	#, ivtmp_101
	jne	.L2	#,
	movsd	%xmm0, (%rsi)	# a_I_RE_lsm0.14, MEM[(real(kind=8) *)_99]
	addq	$16, %rsi	#, ivtmp.92
	cmpq	%rsi, %rcx	# ivtmp.92, _28
	jne	.L3	#,
	leaq	8+a.0(%rip), %rsi	#, ivtmp.77
	movsd	.LC1(%rip), %xmm1	#, tmp123
	leaq	16000000(%rsi), %rdi	#, _47
.L7:
	movsd	(%rsi), %xmm0	# MEM[(real(kind=8) *)_82], a_I_IM_lsm0.15
# a.f90:6:     a=0
	movl	$499, %edx	#, ivtmp_55
.L6:
# a.f90:8:         a=a+cmplx(2.d0,-1.d0,8)
	subsd	%xmm1, %xmm0	# tmp123, tmp112
	subsd	%xmm1, %xmm0	# tmp123, a_I_IM_lsm0.15
	subl	$1, %edx	#, ivtmp_55
	jne	.L6	#,
	movsd	%xmm0, (%rsi)	# a_I_IM_lsm0.15, MEM[(real(kind=8) *)_82]
	addq	$16, %rsi	#, ivtmp.77
	cmpq	%rsi, %rdi	# ivtmp.77, _47
	jne	.L7	#,
	movapd	.LC2(%rip), %xmm3	#, tmp120
	movapd	.LC3(%rip), %xmm2	#, tmp121
	jmp	.L8	#
	.p2align 4,,10
	.p2align 3
.L19:
	movaps	%xmm0, (%rax)	# vect__38.19, MEM <vector(2) real(kind=8)> [(real(kind=8) *)_106]
# a.f90:7:     do i=1,1000
	addq	$16, %rax	#, ivtmp.38
	cmpq	%rax, %rcx	# ivtmp.38, _28
	je	.L18	#,
.L8:
	movapd	(%rax), %xmm0	# MEM <vector(2) real(kind=8)> [(real(kind=8) *)_106], vect__38.19
# a.f90:6:     a=0
	movl	$2, %edx	#, ivtmp_14
.L10:
# a.f90:8:         a=a+cmplx(2.d0,-1.d0,8)
	movapd	%xmm0, %xmm1	# vect__38.19, vect__6.21
	addpd	%xmm2, %xmm0	# tmp121, vect__31.22
	addpd	%xmm3, %xmm1	# tmp120, vect__6.21
	movsd	%xmm1, %xmm0	# vect__6.21, vect__38.19
	cmpl	$1, %edx	#, ivtmp_14
	je	.L19	#,
	movl	$1, %edx	#, ivtmp_14
	jmp	.L10	#

Unfortunately I cannot read assembly (it would be interesting, though; maybe I should put some effort into it). But obviously there is a lot more going on in the 1D case. This supports my theory about compiler optimisation.


In my haste, I omitted setting the number of threads in my OpenMP examples.
do threads = 1, omp_get_max_threads ()
call omp_set_num_threads (threads)

The results are now as expected.

gcc_dir=C:\Program Files (x86)\gcc_eq\gcc_11.1.0
options=-fimplicit-none -g -O2 -march=native -ffast-math -fopenmp
PROCESSOR_DESCRIPTION=Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz 4 Cores 8.0 GB mem 6 MB cache
Rank 1
           1   1.0287353541218636                   (2000.0000000000000,-1000.0000000000000)
           2  0.50807772917505645                   (2000.0000000000000,-1000.0000000000000)
           3  0.34752818848176048                   (2000.0000000000000,-1000.0000000000000)
           4  0.26390584119735649                   (2000.0000000000000,-1000.0000000000000)

Rank 2
           1   1.0023055139536154                   (2000.0000000000000,-1000.0000000000000)
           2  0.50897061184457471                   (2000.0000000000000,-1000.0000000000000)
           3  0.34319462862890759                   (2000.0000000000000,-1000.0000000000000)
           4  0.26272559399058082                   (2000.0000000000000,-1000.0000000000000)

The use of explicit DO loops, rather than array syntax, does allow better control of how the calculation is shared between the threads, if that was the original intention.
As Carltoffel notes, changing the loop order significantly affects performance, and would create a race condition for the OpenMP approach.

@Euler-37 : Did the original code intend any load sharing between threads?

No. In my code each task does its own calculation, and I gather (reduce) and process the data at the end.
I apologise for that; maybe the test code is so oversimplified that it cannot express my original question.

I have added some details and rewritten it below.

program main
   implicit none
   integer,parameter::n=5000,step=100 ! actually step=100000; I varied it for the benchmark
   integer::i,j,k
   complex(kind=8),allocatable::a(:,:),vec(:)
   allocate(a(n,n),vec(n))
   a=0
   vec=0
   do i=1,step
       ! do something to calculate `vec`
       do j=1,n
           do k=1,n
               a(k,j)=a(k,j)+conjg(vec(k))*vec(j)
           end do
       end do
   end do
   write(*,*)a(1,1) ! keep the result live
end program

I ran this calculation on a supercomputer with 64 cores.

The rank-1 calculation took 216.5460 s and the rank-2 calculation took 666.6437 s, nearly 3 times slower.
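
For reference, the rank-1 variant being compared here presumably looks something like the sketch below; the flat index a((j-1)*n+k) is an assumption about how the 1-D version addresses the same element.

program main_rank1
   implicit none
   integer, parameter :: n = 5000, step = 100
   complex(kind=8), allocatable :: a(:), vec(:)
   integer :: i, j, k
   allocate(a(n*n), vec(n))
   a = 0
   vec = 0
   do i = 1, step
       ! do something to calculate `vec`
       do j = 1, n
           do k = 1, n
               ! the same update, addressed through a flat 1-D index
               a((j-1)*n + k) = a((j-1)*n + k) + conjg(vec(k))*vec(j)
           end do
       end do
   end do
   write(*,*) a(1) ! keep the result live
end program main_rank1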