MPI run time and arrays rank

I modified your program by:
using an F77 wrapper for the inner loop (a minor improvement),
moving array a to static storage (which solves the stack problems), and
applying !$OMP to the J loop (using J for OMP removes any race condition).
The total run time is 4.3 seconds on my very old i5-2300 using -fopenmp.

program main
   implicit none
   integer,parameter :: n=5000, step=100 ! originally step=100000; reduced to 100 for this benchmark
   integer::i,j,k
   complex(kind=8)::a(n,n),vec(n),temp
   common /aaa/ a   ! common block puts the 400 MB array in static storage, not on the stack

    real(8), external :: elapse_time
    real(8)           :: time, all

   a = cmplx (2.d0,-1.d0,8)

   all = elapse_time ()
   do i=1,step
       time = elapse_time ()
       ! do something to calculate `vec`
       vec = cmplx (dble(mod(i,10)), -1.d0,8)
     !$omp parallel do private (j) shared (a,vec) schedule (STATIC)
       do j=1,n
         call rank_1_calc ( n, a(1,j), vec, vec(j) )
!           do k=1,n
!               a(k,j)=a(k,j)+conjg(vec(k))*vec(j)
!           end do
       end do
     !$omp end parallel do
       time = elapse_time () - time
       write (*,*) i, time
   end do
   all = elapse_time () - all
   write (*,*) 'total', all
end program

  subroutine rank_1_calc ( n, a_colj, vec, vec_j )
     integer :: n, k
     complex(kind=8) :: a_colj(n), vec(n), vec_j

       do k=1,n
           a_colj(k) = a_colj(k) + conjg(vec(k))*vec_j
       end do
  end subroutine rank_1_calc

  real(8) function elapse_time ()   ! wall-clock time in seconds from SYSTEM_CLOCK
    integer(8) :: tick, rate
     call SYSTEM_CLOCK ( tick, rate )
     elapse_time = dble(tick) / dble(rate)
  end function elapse_time
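
For reference, the whole thing builds with the same options shown in the logs below (the source-file name here is just a placeholder):

   gfortran -fimplicit-none -g -O2 -march=native -ffast-math -fopenmp rank1_test.f90 -o rank1_test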

I have tested with and without OpenMP, and also with your original rank-2 do loop.

  It is now Saturday, 11 December 2021 at 15:17:04.874
gcc_dir=C:\Program Files (x86)\gcc_eq\gcc_11.1.0
options=-fimplicit-none -g -O2 -march=native -ffast-math -fopenmp
 Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz

rank1 test + omp
           1   4.4292258531640982E-002
           2   4.3710565267247148E-002
...
          99   4.3035405218688538E-002
         100   4.3139868092112010E-002
 total   4.3651893783644482     
================================================================ 
  It is now Saturday, 11 December 2021 at 15:22:38.841
gcc_dir=C:\Program Files (x86)\gcc_eq\gcc_11.1.0
options=-fimplicit-none -g -O2 -march=native -ffast-math
 Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz

rank1 test no omp
           1   5.5288717059738701E-002
           2   5.5015280900988728E-002
           3   5.6292660257895477E-002
...
          99   5.5346996345178923E-002
         100   5.5426168204576243E-002
 total   5.5186695315169345     
================================================================ 
  It is now Saturday, 11 December 2021 at 15:24:06.301
gcc_dir=C:\Program Files (x86)\gcc_eq\gcc_11.1.0
options=-fimplicit-none -g -O2 -march=native -ffast-math
 Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz

rank2 test no omp
           1   5.4993288718833355E-002
           2   5.5199282174726250E-002
...
          99   5.6634638716786867E-002
         100   5.6367800218140474E-002
 total   5.6620670013871859

Without "do something to calculate vec" these times are a lot less than you present.
I hope I havn’t omitted something important ?
!$OMP has only a moderate gain.

I have also tested on a Ryzen 5900X (with step=1000 for the runs below), which shows some improvement from the faster processor, but again no significant OMP gain.

================================================================ 
  It is now Saturday, 11 December 2021 at 15:40:18.833
gcc_dir=C:\Program Files (x86)\gcc_eq\gcc_11.1.0
options=-fimplicit-none -g -O2 -march=native -ffast-math -fopenmp
 AMD Ryzen 9 5900X 12-Core Processor
           1   2.5735300034284592E-002
           2   2.1049600094556808E-002
...
         998   2.0049700047820807E-002
         999   1.9942799583077431E-002
        1000   1.9521200098097324E-002
 total   19.683702200185508     

================================================================ 
  It is now Saturday, 11 December 2021 at 15:41:16.829
gcc_dir=C:\Program Files (x86)\gcc_eq\gcc_11.1.0
options=-fimplicit-none -g -O2 -march=native -ffast-math
 AMD Ryzen 9 5900X 12-Core Processor
           1   2.3990999907255173E-002
           2   2.4467099923640490E-002
           3   2.4172300007194281E-002
...
         998   2.4538800120353699E-002
         999   2.4536699987947941E-002
        1000   2.4490099865943193E-002
 total   24.578610499855131

For this example, more cores are not the solution; the limit is probably memory bandwidth.
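
To make that concrete, here is a rough back-of-envelope estimate (a sketch only; the step times are taken from the logs above, and the peak-bandwidth figures quoted afterwards are my assumptions for typical dual-channel DDR3-1333 and DDR4-3200 memory):

   program bandwidth_estimate
      implicit none
      real(8), parameter :: bytes_per_elem = 16.d0    ! storage of one complex(kind=8) element
      real(8), parameter :: n = 5000.d0               ! matrix order, as in the benchmark above
      real(8) :: traffic

      ! every step streams the whole of a through memory once in and once out;
      ! vec (80 KB) is negligible by comparison
      traffic = 2.d0 * n * n * bytes_per_elem         ! ~0.8 GB of traffic per step

      write (*,*) 'i5-2300  ~', traffic / 0.043d0 / 1.d9, ' GB/s'   ! step time from the i5-2300 log above
      write (*,*) '5900X    ~', traffic / 0.020d0 / 1.d9, ' GB/s'   ! step time from the 5900X log above
   end program bandwidth_estimate

Those effective rates (~18.6 GB/s and ~40 GB/s) are already a large fraction of the dual-channel peaks (roughly 21 GB/s for DDR3-1333 and 51 GB/s for DDR4-3200, if that is the memory fitted), which is consistent with the loop being memory bound rather than core bound.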

I would like to expand on my last comment!
Over the last few years I have been trying to improve my use of multi-threading, especially its performance efficiency.
The processor class I have been using is relatively cheap, dual-memory-channel Intel i5/i7 and Ryzen Zen 3 hardware. These have seen a steady increase in core count and some increase in memory speed, but most of the computing problems I am solving (structural finite element analysis) involve at least one large array (array size >> L3 cache).
My preferred approach is to package the computation into blocks of L3 cache size (and, if possible, of L1 size for the inner loops). This has had some success, but some classes of computation, such as long sequences of time steps, just don't allow it; a sketch of the idea is below.
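
As an illustration (a sketch only, with the block size nblock and the vecs array being my own additions), the benchmark above happens to be one of the cases where blocking is possible, because vec depends only on the step number and not on a. The step loop can then be moved inside a loop over L3-sized column blocks of a, so each block is loaded from memory once instead of once per step:

   ! sketch: replaces the step loop in the program above; these declarations
   ! go with the others at the top of main
   integer, parameter :: nblock = 64               ! 5000*64*16 bytes ~ 5 MB, fits a 6 MB L3
   integer :: j0
   complex(kind=8) :: vecs(n,step)                 ! every step's vec, precomputed (~8 MB)

   do i = 1, step
      vecs(:,i) = cmplx (dble(mod(i,10)), -1.d0, 8)   ! vec depends only on i, so precompute it
   end do

   do j0 = 1, n, nblock                            ! one L3-sized block of columns of a
      do i = 1, step                               ! run every step on this block while it is resident
         do j = j0, min (j0+nblock-1, n)
            call rank_1_calc ( n, a(1,j), vecs(1,i), vecs(j,i) )
         end do
      end do
   end do

With step=100 this cuts the memory traffic on a from roughly 80 GB to a couple of GB (including the repeated reads of vecs), but it only works because the steps here are independent of a; in a genuine time-stepping scheme, step i+1 usually needs results from the whole of step i, and then this reordering is not available.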

What I do think is that Intel and AMD have raced to market by increasing the core count, while ignoring the problem of memory bandwidth, whose capacity has not been increased to match.
OpenMP is a multi-threaded, shared-memory model, so providing more cores to run more threads also increases the volume of memory transfers (although buffered by the cache), with little improvement in the hardware that has to carry them.
MPI run on the same hardware can have the same problem. I interpret Euler-37's chart in post #5 as identifying this issue.
I also have problems with hyper-threading stalling, but that might be explained by the limited vector-processing capacity per core.
The increase in cores is more about marketing than about providing a better-balanced solution; a quick way to check the thread scaling on your own machine is sketched below.
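
Fix the OpenMP thread count and rerun the benchmark: if the per-step time stops improving beyond two or three threads, the loop is bandwidth bound rather than core bound. (OMP_NUM_THREADS is the standard OpenMP control; the executable name is the placeholder used earlier.)

   set OMP_NUM_THREADS=1
   rank1_test
   set OMP_NUM_THREADS=2
   rank1_test
   set OMP_NUM_THREADS=4
   rank1_test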

Has anyone had success in addressing this memory-bandwidth bottleneck?
Or am I mistaken in my assessment of the problem?
Does anyone have a different explanation or strategy to address it?
