Cannot deallocate after call to mkl_sparse_d_mv

Dear users,

I have a problem with deallocating an array which is an output of the routine mkl_sparse_d_mv (part of the sparse BLAS libraries)

My code is more or less this:

use mkl_spblas
implicit none
type(sparse_matrix_t) :: sflow
real(8), allocatable :: g_current(: )
real(8), allocatable :: inflow_couple(: ) 
 
istat = mkl_sparse_d_create_csr(sflow,SPARSE_INDEX_BASE_ONE, n_state,n_state,row_ind(1:n_state),row_ind(2:n_state+1),col_ind,values)
     descr%type = SPARSE_MATRIX_TYPE_GENERAL
 
 allocate(inflow_couple(n_state))
 allocate(g_current(n_state))
 
 istat = mkl_sparse_d_mv(SPARSE_OPERATION_TRANSPOSE,1.0d0,sflow,descr,g_current,0.0d0,inflow_couple)
 
 ! When the code reaches this line, there is an access violation
 if (allocated(inflow_couple)) deallocate(inflow_couple,stat=istat)

The error message is the following:

forrtl: severe (157): Program Exception - access violation

The strange thing is that if I use SPARSE_OPERATION_NON_TRANSPOSE, the array inflow_couple can be deallocated without problem.

This is likely a symptom of another program error somewhere, maybe not even related to this function reference. The array metadata for that argument is being overwritten and corrupted. It might be a dangling pointer, or an array access overrun, or something similar.

1 Like

This led me to the Intel forum, which led me to the article: Don't Touch Me There - What error 157 (Access Violation) is trying to tell you - Doctor Fortran

I’m guessing the error is in the sparse matrix indexes resulting in an out-of-bounds read of the array?

If values in col_ind are wrong, there could be an out-of-bounds read in g_current. If the row_ind are wrong, there could be an out-of-bounds read in col_ind or values.

2 Likes

Thanks for your answers. I learned that if I compile and run the code with debug options, the code completes successfully.

/O0 /Qopenmp /Qmkl /warn:all /check:all

But if I compile it with the following flags, I get the access violation error

/O3 /Qmkl /Qopenmp /Qipo /Qprec-div- /QxHost

Here is how a check of the indexes for the CSR storage scheme (with one-based indexing) might look like:

program check
implicit none
integer, parameter :: n = 4
integer, parameter :: nnz = 8
integer :: ia(n+1), ja(nnz)

! row index pointers
ia = [1,4,6,8,9]

! column indexes
ja = [ 1, 2,    4, &
          2, 3,    &
       1,       4, &
          2        ]

call csr_check(n,nnz,ia,ja)

contains

    ! Assumes square matrix of size n-by-n, and
    ! one based indexing of the sparsity pattern
    subroutine csr_check(n,nnz,ia,ja)
        integer, intent(in) :: n, nnz
        integer, intent(in) :: ia(n+1), ja(nnz)

        logical :: consistent
        integer :: i, nz, iaa, iab

        nz = 0
        ! Check consistency of row indexes
        do i = 1, n
            consistent = ia(i) < ia(i+1)
            if (.not. consistent) then
                write(*,*) "csr_check: error in row pointer ", i
                error stop 1
            end if
            nz = nz + (ia(i+1) - ia(i))
        end do

        ! Check number of non-zeros is consistent with the row pointers
        if (nnz /= nz) error stop 2
        if (nnz /= ia(n+1)-1) error stop 3

        ! Check column indexes are within the matrix
        do i = 1, n
            iaa = ia(i)
            iab = ia(i+1)-1
            if (any(ja(iaa:iab) > n)) then
                write(*,*) "csr_check: error in column index ", i
                error stop 4
            end if
        end do

end program

Edit: I made some changes to the csr_check procedure

1 Like

Thanks for providing the check! However, the check returns with no errors, but I still got the access violation when running without the debugging flags.

Since you mentioned the transpose makes the access violation disappear - have you checked the istat arguments?

if (istat /= SPARSE_STATUS_SUCCESS) error stop istat

Thanks for the suggestion! To be clear, this gives me the access violation error

istat = mkl_sparse_d_mv(SPARSE_OPERATION_TRANSPOSE,1.0d0,sflow,descr,g_current,0.0d0,inflow_couple)

and this does not

istat = mkl_sparse_d_mv(SPARSE_OPERATION_NON_TRANSPOSE,1.0d0,sflow,descr,g_current,0.0d0,inflow_couple)

In both cases, the command mkl_sparse_d_mv returns no (immediate) error (I checked istat), and the rest of the code completes correctly (I compared results with a version without using the sparse BLAS library and they are identical).
The error is with deallocating inflow_couple upon exiting the routine. I’m really puzzled by this error.

That is odd :eyes: , also because your matrix is of size n_state by n_state, so the transpose shouldn’t matter.

Sticking to the hypothesis that the problem comes from MKL,

  • can you try running the program with the environment variable MKL_NUM_THREADS=1?
  • replace the MKL inspector-executor routine with the older (and deprecated) routine mkl_?csrgemv:
external :: mkl_dcsrgemv  ! (assumes one-based indexing)

! Transposed
call mkl_dcsrgemv('T',n_state,values,row_ind,col_ind,g_current,inflow_couple)

! Non-transposed
call mkl_dcsrgemv('N',n_state,values,row_ind,col_ind,g_current,inflow_couple)
1 Like

Did you try running with lower optimization levels like O1 or O2. The fact that it runs with O0 and not O3 indicates to me an optimization problem. Unfortunately, with ifort (I presume thats what you are using) I will occasionally run into a problem with using O3 (even with codes that compiled correctly with O3 in an older version of the compiler). The weirdest one I’ve encounterd is a code that didn’t produce the correct results with O0 but did with O3, exactly opposite of what you would expect.

These kinds of program errors are often “Heisenbugs”. You can make trivial changes to the compiler options, or add write statements for debugging, etc. and the error will magically disappear and reappear. They can be difficult to locate and fix.

2 Likes

Particularly when its the compiler causing the problem and not the programmer. I’ve spent a couple of days rewritting code trying to kill a bug assuming it was something I did when it was a compiler bug or an optimization problem (which I consider a compiler bug).

2 Likes

Thanks for your answers! I think in this case it was the programmer (i.e. me) creating the problem :slight_smile:

The size of the transition matrix is very big, and the problem was the declaration of max_nz and counter. Before I declared them as integer, now I declare them as integer(8) and the problem disappeared.

When I was declaring max_nz as integer, I was losing some non zero elements

Does that mean you are using the ILP64 version of Intel MKL? Otherwise I don’t see how that would work.