Wrappers around DGEMM and DGEMV

aledinola · January 7, 2024, 3:17pm

Dear all,
I have created two wrappers around DGEMM and DGEMV. The syntax of DGEMM and DGEMV is cumbersome and I want to be able to call them in an easy way in my projects.

So here are my two wrappers saved in a module

module mod_numerical
implicit none 
private
! BLAS function wrappers
public :: mtimes_matvec, mtimes_matmat
!===============================================================================!
! NOTE: The functions mtimes_matvec and mtimes_matmat are wrappers around 
! dgemv and dgemm (from BLAS library). To compile this module, please add flag mkl
! or -llapack -lblas. Below is an example from intel website:
! Windows* OS: ifort /Qmkl src\dgemm_example.f
! Linux* OS, macOS*: ifort -mkl src/dgemm_example.f
function mtimes_matvec(A,B) result (C)
	real(8), intent(in) :: A(:,:),B(:)
	real(8) :: C(size(A,1))

	if ( size (A ,2)/= size (B)) then 
		error stop 'mtimes_matvec : size doesn''t conform '
	endif
		
	!C := alpha *op(A)* op(B) + beta *C,
	call dgemv('N',size (A ,1) , size (A ,2) ,1d0 ,A, size (A ,1) , B ,1 ,0d0 ,C ,1)

end function mtimes_matvec

function mtimes_matmat(A,B) result (C)
    ! Function inputs:
	real(8), intent(in) :: A(:,:),B(:,:)
    ! Function result (it is an allocatable dummy array)
	real(8), allocatable :: C(:,:) !C(size(A,1),size(B,2))
    ! Local variables:
    integer :: istat 

	if ( size (A ,2)/= size (B,1)) then 
		error stop 'mtimes_matmat : size doesn''t conform '
	endif

	allocate( C(size(A,1),size(B,2)), stat=istat)	
    if (istat/=0) then
        error stop "mtimes_matmat: Allocation failed!"
    endif

	!C := alpha *op(A)* op(B) + beta *C,
	CALL DGEMM('N','N',size (A ,1) ,size(B,2),size(A,2),1.0d0,A,size (A ,1),B,size(B,1),0.0d0,C,size(A,1))

end function mtimes_matmat
!===============================================================================!
end module mod_numerical

and here is a simple test program

program test
    use mod_numerical, only: mtimes_matvec, mtimes_matmat
    implicit none
    
    call test_DGEMM_DGEMV()

    write(*,*) "Program will close..."
    pause 

contains

subroutine test_DGEMM_DGEMV()
! y = A*x and C=A*B 
integer, parameter :: n1=100, n2=100, n3=100
real(8) :: err_y
real(8), allocatable :: A(:,:), B(:,:),x(:)
real(8), allocatable :: y(:), y1(:), C(:,:), C1(:,:)

allocate(A(n1,n2),x(n2),B(n2,n3))
call RANDOM_NUMBER(A)
call RANDOM_NUMBER(x)

! Matrix-vector multiplication
y  = matmul(A,x)
y1 = mtimes_matvec(A,x)

! Matrix-matrix multiplication
C  = matmul(A,B)
C1 = mtimes_matmat(A,B)

! Check that intrinsic matmul and mtimes_matvec,mtimes_matvec
! give same results
write(*,*) "Is y=y1?", all(y==y1)
write(*,*) "Is C=C1?", all(C==C1)

err_y = maxval(abs(y-y1))
write(*,*) "maxval(abs(y-y1)) = ", err_y

end subroutine test_DGEMM_DGEMV
end program test

To my surprise, the test for mtimes_matvec failed, in that y and y1 are not equal. The difference seems very small though (this is the same in ifort and ifx). Is this to be expected?

Is y=y1? F
Is C=C1? T
maxval(abs(y-y1)) = 7.105427357601002E-015
Program will close…
Fortran Pause - Enter command or to continue.

Thanks, any feedback (also on how I wrote the wrappers) is appreciated!

aledinola · January 7, 2024, 3:25pm

I have also another partially unrelated question: when I compile on the command line (in Windows with ifort) I write the following

ifort /Qmkl mod_numerical.f90 test_mod_numerical.f90 -o test\run_test.exe

and the code compiles successfully. If I want to do the same in VS2022, I have to choose from the library menu /Qmkl:sequential or /Qmkl:parallel (please see screenshot). Naive question: which of the two options is better? I assume that neither dominates the other, otherwise VS would not give the choice

PierU · January 7, 2024, 3:38pm

Yes. Floating-point arithmetic is inherently inexact, and the end result depends on in which order the operations and round-offs are performed. Nothing can garantees that matmult and dgemv perfom the operations in the same order.

Note that the magnitude of your “error” is of the order of the machine precision of real(8) (that is, 15 digits on your machine).

The BLAS/LAPACK interfaces are cumbersome, but very flexible. A drawback of your wrappers is if they are called with non-contiguous arrays : there will be copy-in/copy-out at some point, because this is what the BLAS/LAPACK routines expect. By calling them directly you can manage the discontiguous cases by using the LD* and INC* arguments, without needing any copy.

None. The latter is the parallel (multithreaded) version, where the computations are distributed on several cores. Depending on the cases, it can be faster… or not.

zaikunzhang · January 7, 2024, 3:40pm

First of all, do not write real(8). The standard does not ensure that real(8) provides 64-bit floating-point numbers (or real*8 if you want). real(8) is not always valid (e.g., try nagfor).

Secondly, floating-point numbers are not real numbers. They follow different arithmetic. For example,
floating-point addition is not commutative — the result changes if you reorder a sum. Similar for multiplication. Consequently, under floating-point arithmetic, it is unsurprising that two ways of calculating a matrix product lead to slightly different results.

Thirdly, partially due to the last point, testing the equality of floating-point numbers is discouraged.

aledinola · January 7, 2024, 7:38pm

Thanks for your answer! Regarding the non-contiguous arrays: would it help to add the contiguous attribute in the declaration?
Moreover, to understand what contiguous means: suppose A is a 3-dim array. Then

A(:,:,i )

is contiguous but

A(i,:,: )

or

A(:,i,: )

are not, correct? This has to do with the fact that arrays are stored in column-major order

RonShepard · January 7, 2024, 10:13pm

In a few cases, this might be good, but in general it just pushes the copy-in/copy-out up one level. You still pay for the overhead, just in a different place in your calling sequence. In addition to noncontiguous arrays, there can also be problems when you want to use a transposed matrix. The BLAS interface allows this through the ‘N’ and ‘T’ arguments. In fortran you do this with the transpose() intrinsic, which then can invoke its own allocation, copy-in/copy-out, and deallocation overhead.

I don’t know what is the best overall solution to this problem. One approach would be for compilers to compile matmul() calls to optimal code. Gfortran does a pretty good job when the option -fexternal-blas is used. Otherwise, compiler matmul() codes generally do not perform very well. But there is also an inherent problem with the matmul() function interface, which typically involves some kind of memory allocation to hold the intermediate results. In contrast, the BLAS subroutine interface allows updating the output matrix directly with no allocation overhead.

A(:,:,i) might be contiguous, but not necessarily. If A(:,:,:) is a dummy array associated with X(i,:,:,:), then it would not be contiguous. Also, there are all kinds of possibilities involving array slices that are not contiguous.

PierU · January 8, 2024, 6:17am

Not really, as described by @RonShepard . Now, if you know for sure that in your code you will always call the wrappers with contiguous arrays, then your wrappers are OK (but adding the contiguous attribute is a good idea)

Provided that A is itself contiguous, yes.

Note that A(:,::2,:), A(1:n1/2,:,:) (with n1=size(A,1)), etc, are not contiguous either.

Topic		Replies	Views
Writing wrappers for LAPACK and BLAS routines	54	2010	December 14, 2023
Achieving OpenBLAS DGEMM performance with Fortran vs C intrinsics: why is Fortran slower? Help	20	371	August 25, 2025
Matrix index-pointer confusion Help	21	360	August 11, 2024
DGBMV blas issue with 2d array Help	7	437	May 23, 2023
Mapping matrix & vector arithmetic to BLAS calls	8	1360	July 20, 2022

Wrappers around DGEMM and DGEMV

Related topics