… But in the present form your example is not convincing. You try to beat a single BLAS3 call by double loops (which do not even unroll well) of BLAS1 calls. I cannot test your implementation because of the known problems with ddot
, but I strongly suspect that zgemm
will win at least in some domain of the n-l
parameter space.
We’ve tried a variety of alternatives but the current implementation is the fastest. However, these things have a tendency to change over the years as compilers and hardware evolve.
Here is the full routine:
subroutine zmctmu(l,n,a,b,ld,c)
use modomp
implicit none
! arguments
integer, intent(in) :: l,n
complex(8), intent(in) :: a(l,n),b(l,n)
integer, intent(in) :: ld
complex(8), intent(inout) :: c(ld,*)
! local variables
integer i,j,nthd
! external functions
real(8), external :: ddot
complex(8), external :: zdotc
call holdthd(n,nthd)
!$OMP PARALLEL DO DEFAULT(SHARED) &
!$OMP PRIVATE(i) &
!$OMP NUM_THREADS(nthd) SCHEDULE(DYNAMIC)
do j=1,n
do i=1,j-1
c(i,j)=c(i,j)+zdotc(l,a(1,i),1,b(1,j),1)
end do
c(j,j)=c(j,j)+ddot(2*l,a(1,j),1,b(1,j),1)
end do
!$OMP END PARALLEL DO
call freethd(nthd)
end subroutine
… and you can see there is some parallelism involved. The variable l is typically 100-200 and n can be 100s to ~10000, and the entire procedure is replicated hundreds of times. So a lot of CPU power is expended here.
The matrix is then fed into zhegvx which solves the generalised eigenvalue problem Ax = λBx, which turns out to be slightly faster than the packed version zhpgvx, at least with MKL the last time I tested it.
There is also a version of this where the final matrix is real because of inversion symmetry. In this case, the following routine is used instead:
subroutine rzmctmu(l,n,a,b,ld,c)
use modomp
implicit none
! arguments
integer, intent(in) :: l,n
! pass in complex arrays as real
real(8), intent(in) :: a(2*l,n),b(2*l,n)
integer, intent(in) :: ld
complex(8), intent(inout) :: c(ld,*)
! local variables
integer i,j,nthd
call holdthd(n,nthd)
!$OMP PARALLEL DO DEFAULT(SHARED) &
!$OMP PRIVATE(i) &
!$OMP NUM_THREADS(nthd) SCHEDULE(DYNAMIC)
do j=1,n
do i=1,j
c(i,j)=c(i,j)+dot_product(a(:,i),b(:,j))
end do
end do
!$OMP END PARALLEL DO
call freethd(nthd)
end subroutine
Here it is faster to use dot_product instead of ddot. Once again, a complex matrix and vectors are passed in as real and considerable efficiency can be gained.