I modified your program by:
using an F77-style wrapper (rank_1_calc) for the inner loop ( minor improvement )
moving array a to static storage via a COMMON block ( solves stack problems )
applying !$OMP to the J loop ( each thread then updates only its own columns of a, so there is no race condition )
Total run time is 4.3 seconds for 100 steps on my very old i5-2300 using -fopenmp
program main
   implicit none
   integer,parameter :: n=5000, step=100 ! originally step=100000; reduced to 100 for this benchmark
   integer::i,j,k
   complex(kind=8)::a(n,n),vec(n),temp
   common /aaa/ a                        ! static storage for a, avoids stack problems
   real(8), external :: elapse_time
   real(8) :: time, all

   a = cmplx (2.d0,-1.d0,8)
   all = elapse_time ()
   do i=1,step
      time = elapse_time ()
      ! do something to calculate `vec`
      vec = cmplx (dble(mod(i,10)), -1.d0,8)
      ! each iteration of j updates a different column of a, so threads never write the same element
      !$omp parallel do private (j) shared (a,vec) schedule (STATIC)
      do j=1,n
         call rank_1_calc ( n, a(1,j), vec, vec(j) )
         ! do k=1,n
         !    a(k,j)=a(k,j)+conjg(vec(k))*vec(j)
         ! end do
      end do
      !$omp end parallel do
      time = elapse_time () - time
      write (*,*) i, time
   end do
   all = elapse_time () - all
   write (*,*) 'total', all
end program

! rank-1 update of one column: a(:,j) = a(:,j) + conjg(vec(:))*vec(j)
subroutine rank_1_calc ( n, a_colj, vec, vec_j )
   integer :: n, k
   complex(kind=8) :: a_colj(n), vec(n), vec_j
   do k=1,n
      a_colj(k) = a_colj(k) + conjg(vec(k))*vec_j
   end do
end subroutine rank_1_calc

! wall-clock time in seconds, from SYSTEM_CLOCK
real(8) function elapse_time ()
   integer(8) :: tick, rate
   call SYSTEM_CLOCK ( tick, rate )
   elapse_time = dble(tick) / dble(rate)
end function elapse_time
I have tested with and without OpenMP, and also with your rank-2 DO loop:
It is now Saturday, 11 December 2021 at 15:17:04.874
gcc_dir=C:\Program Files (x86)\gcc_eq\gcc_11.1.0
options=-fimplicit-none -g -O2 -march=native -ffast-math -fopenmp
Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz
rank1 test + omp
1 4.4292258531640982E-002
2 4.3710565267247148E-002
...
99 4.3035405218688538E-002
100 4.3139868092112010E-002
total 4.3651893783644482
================================================================
It is now Saturday, 11 December 2021 at 15:22:38.841
gcc_dir=C:\Program Files (x86)\gcc_eq\gcc_11.1.0
options=-fimplicit-none -g -O2 -march=native -ffast-math
Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz
rank1 test no omp
1 5.5288717059738701E-002
2 5.5015280900988728E-002
3 5.6292660257895477E-002
...
99 5.5346996345178923E-002
100 5.5426168204576243E-002
total 5.5186695315169345
================================================================
It is now Saturday, 11 December 2021 at 15:24:06.301
gcc_dir=C:\Program Files (x86)\gcc_eq\gcc_11.1.0
options=-fimplicit-none -g -O2 -march=native -ffast-math
Intel(R) Core(TM) i5-2300 CPU @ 2.80GHz
rank2 test no omp
1 5.4993288718833355E-002
2 5.5199282174726250E-002
...
99 5.6634638716786867E-002
100 5.6367800218140474E-002
total 5.6620670013871859
Without the "do something to calculate vec" step, these times are a lot less than the ones you presented.
I hope I haven't omitted something important?
!$OMP gives only a moderate gain.
I have also tested on a Ryzen 5900X (1000 steps). It shows some improvement from the faster processor, but again no significant OMP gain.
================================================================
It is now Saturday, 11 December 2021 at 15:40:18.833
gcc_dir=C:\Program Files (x86)\gcc_eq\gcc_11.1.0
options=-fimplicit-none -g -O2 -march=native -ffast-math -fopenmp
AMD Ryzen 9 5900X 12-Core Processor
1 2.5735300034284592E-002
2 2.1049600094556808E-002
...
998 2.0049700047820807E-002
999 1.9942799583077431E-002
1000 1.9521200098097324E-002
total 19.683702200185508
================================================================
It is now Saturday, 11 December 2021 at 15:41:16.829
gcc_dir=C:\Program Files (x86)\gcc_eq\gcc_11.1.0
options=-fimplicit-none -g -O2 -march=native -ffast-math
AMD Ryzen 9 5900X 12-Core Processor
1 2.3990999907255173E-002
2 2.4467099923640490E-002
3 2.4172300007194281E-002
...
998 2.4538800120353699E-002
999 2.4536699987947941E-002
1000 2.4490099865943193E-002
total 24.578610499855131
More cores are not the solution for this example; the limit is probably memory bandwidth.
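To put rough numbers on that, here is a small sketch that turns the totals from the logs above into an OMP speedup and an effective memory traffic rate. The module and function names are hypothetical, and the traffic figure rests on an assumption I have not measured: each element of a is read once and written once per step (2 x 16 bytes), and vec stays in cache.

```fortran
! Sketch only: names are hypothetical, not part of the benchmark program.
module rank1_estimate
   implicit none
   integer, parameter :: dp = kind(1.d0)
contains
   ! Bytes moved per rank-1 update, ASSUMING one read and one write of each
   ! complex(kind=8) element of a (16 bytes each) and vec held in cache.
   pure function bytes_per_step (n) result (bytes)
      integer, intent(in) :: n
      real(dp) :: bytes
      bytes = 2.0_dp * 16.0_dp * real(n,dp) * real(n,dp)
   end function bytes_per_step

   ! Effective traffic in GB/s (1 GB = 1.0e9 bytes) for a given time per step.
   pure function gb_per_sec (n, seconds_per_step) result (gbs)
      integer, intent(in) :: n
      real(dp), intent(in) :: seconds_per_step
      real(dp) :: gbs
      gbs = bytes_per_step (n) / seconds_per_step / 1.0e9_dp
   end function gb_per_sec

   ! Ratio of serial time to OMP time.
   pure function speedup (t_serial, t_omp) result (s)
      real(dp), intent(in) :: t_serial, t_omp
      real(dp) :: s
      s = t_serial / t_omp
   end function speedup
end module rank1_estimate

program estimate
   use rank1_estimate
   implicit none
   integer, parameter :: n = 5000
   ! seconds per step = total from the logs (rounded) / number of steps
   call report ('i5-2300', 5.519_dp/100,   4.365_dp/100)
   call report ('5900X  ', 24.579_dp/1000, 19.684_dp/1000)
contains
   subroutine report (label, t_serial, t_omp)
      character(*), intent(in) :: label
      real(dp), intent(in) :: t_serial, t_omp
      write (*,'(a,a,f5.2,a,f6.1,a,f6.1,a)') label, ' speedup ', speedup (t_serial,t_omp), &
         '  serial ', gb_per_sec (n,t_serial), ' GB/s  omp ', gb_per_sec (n,t_omp), ' GB/s'
   end subroutine report
end program estimate
```

Under that assumption the speedup is only about 1.25 on both machines, and the traffic works out to roughly 14 to 18 GB/s on the i5 and 33 to 41 GB/s on the Ryzen. If those figures are close to what your memory can sustain, adding threads cannot help much.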