Function inline in F90 code

Ashutosh_Londhe · June 13, 2024, 12:22am

I am trying to vectorize a loop which contain call to subroutine.
On top of this subroutine i have mentioned !DEC$ ATTRIBUTES FORCEINLINE :: subroutine_name due to which when i compile this with Intel, it says the required loop is vectorized.
Altough when i am trying this with GNU compiler, i am getting “missed: couldn’t vectorize, multiple nested loop”. I have provided the code snippet as well as command line which mentions all Flags i am using here.
Could anyone help me here how can i vectorize the following peice of code. Please keep in mind that the code is auto-generated by OPS-DSL library so i cannot replace the call statement with the body manually.

nbehrnd · June 24, 2024, 8:10pm

I speculate there is a high barrier of activation to overpass to help here because the relevant source code/output by the CLI is provided as picture. For future reference, I suggest either

for a short snippet: copy-paste it as fenced code block enclosed in a leading/trailing line of three back ticks/accent graves. With GitHub flavored markdown supported here, the leading line can be ```fortran (or use f90 instead) to then activate syntax highlighting (cf. the example around Ruby on learnxinyminutes).
for a somewhat long report (e.g., requires to scroll down): attach the report for instance, as .txt, .csv, or .f90 file.

ivanpribec · June 24, 2024, 9:00pm

Judging by the output on the left, your inner loop probably was vectorized:

[…] F90:153:16: optimized: loop vectorized using 32 byte vectors

It’s generally the inner-most loop which gets vectorized.

A few more things to note

The !DEC$ directive syntax doesn’t work with gfortran; as of GCC v14, gfortran does not have an equivalent directive to force inlining
Making the generated user function an internal procedure can help achieve inlining as the compiler is free to assume the procedure is not called elsewhere. With gfortran -O2 will already inline the internal procedure as shown below.
I’m curious why can’t the code generator - OP-DSL - just use multi-dimensional indexing? The offset of datX_base can be applied at the call site, instead of being passed as arguments.
The middle loop could be collapsed into the outer OMP directive.
The loop indexes are private by default within OMP loop-related directives

#define xdim1 40
#define ydim1 40
#define xdim2 40
#define ydim2 40
#define xdim3 40
#define ydim3 40
#define xdim4 40
#define ydim4 40
#define xdim5 40
#define ydim5 40
#define xdim6 40
#define ydim6 40
#define xdim7 40
#define ydim7 40
#define xdim8 40
#define ydim8 40

subroutine maths_kernel_eqAI_wrap( &
    opsDat1Local, &
    opsDat2Local, &
    opsDat3Local, &
    opsDat4Local, &
    opsDat5Local, &
    opsDat6Local, &
    opsDat7Local, &
    opsDat8Local, &
    start_indx, &
    end_indx )

real(kind=8), dimension(0:xdim1-1,0:ydim1-1,0:*), intent(inout) :: opsDat1Local
real(kind=8), dimension(0:xdim2-1,0:ydim2-1,0:*), intent(in) :: opsDat2Local
real(kind=8), dimension(0:xdim3-1,0:ydim3-1,0:*), intent(in) :: opsDat3Local
real(kind=8), dimension(0:xdim4-1,0:ydim4-1,0:*), intent(in) :: opsDat4Local
real(kind=8), dimension(0:xdim5-1,0:ydim5-1,0:*), intent(in) :: opsDat5Local
real(kind=8), dimension(0:xdim6-1,0:ydim6-1,0:*), intent(in) :: opsDat6Local
real(kind=8), dimension(0:xdim7-1,0:ydim7-1,0:*), intent(in) :: opsDat7Local
real(kind=8), dimension(0:xdim8-1,0:ydim8-1,0:*), intent(in) :: opsDat8Local

integer(kind=4), dimension(3), intent(in) :: start_indx, end_indx
integer(kind=4) :: n_x, n_y, n_z

!$omp parallel do collapse(2)
do n_z = 1, end_indx(3) - start_indx(3)+1
    do n_y = 1, end_indx(2) - start_indx(2)+1
        !$omp simd
        do n_x = 1, end_indx(1) - start_indx(1)+1
            call maths_kernel_eqAI( &
                opsDat1Local(n_x,n_y,n_z), &
                opsDat2Local(n_x,n_y,n_z), &
                opsDat3Local(n_x,n_y,n_z), &
                opsDat4Local(n_x,n_y,n_z), &
                opsDat5Local(n_x,n_y,n_z), &
                opsDat6Local(n_x,n_y,n_z), &
                opsDat7Local(n_x,n_y,n_z), &
                opsDat8Local(n_x,n_y,n_z) &
                )
        end do
    end do
end do

contains

    !  =============
    !  User function
    !  =============
    !DEC$ ATTRIBUTES FORCEINLINE :: maths_kernel_eqAI
    subroutine maths_kernel_eqAI(out_arr,in_arr1,in_arr2,in_arr3,in_arr4,in_arr5,in_arr6,in_arr7)
        real(kind=8), intent(inout) :: out_arr
        real(kind=8), intent(in) :: in_arr1, in_arr2,in_arr3,in_arr4,in_arr5,in_arr6,in_arr7
        
        out_arr = out_arr + (in_arr1 + in_arr2) &
                          + (in_arr3 + in_arr4) &
                          + (in_arr5 + in_arr6) &
                          + in_arr7
    end subroutine

end subroutine

You can inspect the results in Compiler Explorer: Compiler Explorer

For the flags, -c -cpp -O2 -ftree-vectorize -mprefer-vector-width=512 -march=icelake-server -fopt-info, the inner vector loop contains:

.L7:
        vmovupd zmm0, ZMMWORD PTR [r11+rax]
        vmovupd zmm1, ZMMWORD PTR [r9+rax]
        vaddpd  zmm0, zmm0, ZMMWORD PTR [rbx+rax]
        vaddpd  zmm1, zmm1, ZMMWORD PTR [r10+rax]
        vaddpd  zmm0, zmm0, ZMMWORD PTR [rdx+rax]
        vaddpd  zmm0, zmm0, zmm1
        vmovupd zmm1, ZMMWORD PTR [rdi+rax]
        vaddpd  zmm1, zmm1, ZMMWORD PTR [r8+rax]
        vaddpd  zmm0, zmm0, zmm1
        vaddpd  zmm0, zmm0, ZMMWORD PTR [rsi+rax]
        vmovupd ZMMWORD PTR [rdx+rax], zmm0
        add     rax, 64
        cmp     r12, rax
        jne     .L7

As expected it performs 7 additions using the ZMM registers.

Edit: Alternatively, the ???_indx arrays can also be used to set the dimensions (assuming these are contiguous chunks):

integer(kind=4), intent(in) :: start_indx(3), end_indx(3)
real(kind=8), intent(inout) :: opsDat1Local(start_indx(1):end_indx(1), start_indx(2):end_indx(2),start_indx(3):end_indx(3))
! ...

(Perhaps the some of the newly introduced F2023 syntax can make this even easier.)

Ashutosh_Londhe · June 24, 2024, 9:44pm

Hi,
Sorry for not putting the code in text format earlier

here is the code generated by OPS-translator

! Auto-generated at 2024-06-24 14:28:12.629804 by ops-translator

MODULE MATHS_KERNEL_EQW_MODULE

    USE OPS_FORTRAN_DECLARATIONS
    USE OPS_FORTRAN_RT_SUPPORT

    USE OPS_CONSTANTS
    USE, INTRINSIC :: ISO_C_BINDING

    IMPLICIT NONE

    INTEGER(KIND=4) :: xdim1
    INTEGER(KIND=4) :: ydim1
    INTEGER(KIND=4) :: zdim1
#define OPS_ACC1(x,y,z) ((x) + (xdim1*(y)) + (xdim1*ydim1*(z)) + 1)

    INTEGER(KIND=4) :: xdim2
    INTEGER(KIND=4) :: ydim2
    INTEGER(KIND=4) :: zdim2
#define OPS_ACC2(x,y,z) ((x) + (xdim2*(y)) + (xdim2*ydim2*(z)) + 1)

    INTEGER(KIND=4) :: xdim3
    INTEGER(KIND=4) :: ydim3
    INTEGER(KIND=4) :: zdim3
#define OPS_ACC3(x,y,z) ((x) + (xdim3*(y)) + (xdim3*ydim3*(z)) + 1)

    CONTAINS

!   =============
!   User function
!   =============
#ifdef __INTEL__COMPILER
!DEC$ ATTRIBUTES FORCEINLINE :: maths_kernel_eqW
#endif
SUBROUTINE maths_kernel_eqW(out_arr, in_arr1, in_arr2)
    real(kind=8), dimension(1) :: out_arr
    real(kind=8), dimension(1), intent(in) :: in_arr1, in_arr2

    out_arr(OPS_ACC1(0,0,0)) = in_arr1(OPS_ACC2(0,0,0)) * in_arr2(OPS_ACC3(0,0,0))

END SUBROUTINE

#undef OPS_ACC1
#undef OPS_ACC2
#undef OPS_ACC3

SUBROUTINE maths_kernel_eqW_wrap( &
    opsDat1Local, &
    opsDat2Local, &
    opsDat3Local, &
    dat1_base, &
    dat2_base, &
    dat3_base, &
    start_indx, &
    end_indx )

    REAL(KIND=8), DIMENSION(*), INTENT(OUT) :: opsDat1Local
    INTEGER(KIND=4), INTENT(IN) :: dat1_base

    REAL(KIND=8), DIMENSION(*), INTENT(IN) :: opsDat2Local
    INTEGER(KIND=4), INTENT(IN) :: dat2_base

    REAL(KIND=8), DIMENSION(*), INTENT(IN) :: opsDat3Local
    INTEGER(KIND=4), INTENT(IN) :: dat3_base

    INTEGER(KIND=4), DIMENSION(3), INTENT(IN) :: start_indx, end_indx

    INTEGER(KIND=4) :: n_x, n_y, n_z

    !$OMP PARALLEL DO PRIVATE(n_x,n_y,n_z)
    DO n_z = 1, end_indx(3)-start_indx(3)+1
        DO n_y = 1, end_indx(2)-start_indx(2)+1
            !$OMP SIMD
            DO n_x = 1, end_indx(1)-start_indx(1)+1

                CALL maths_kernel_eqW( &

                opsDat1Local(dat1_base + ((n_x-1)*1) + ((n_y-1)*xdim1*1) + ((n_z-1)*ydim1*xdim1*1)), &

                opsDat2Local(dat2_base + ((n_x-1)*1) + ((n_y-1)*xdim2*1) + ((n_z-1)*ydim2*xdim2*1)), &

                opsDat3Local(dat3_base + ((n_x-1)*1) + ((n_y-1)*xdim3*1) + ((n_z-1)*ydim3*xdim3*1)) &
               )

            END DO
        END DO
    END DO

END SUBROUTINE

!   ===============
!   Host subroutine
!   ===============
#ifndef OPS_LAZY
SUBROUTINE maths_kernel_eqW_host( userSubroutine, block, dim, range, &
    opsArg1, &
    opsArg2, &
    opsArg3 &
    )

    CHARACTER(KIND=C_CHAR,LEN=*), INTENT(IN) :: userSubroutine
    TYPE(ops_block), INTENT(IN) :: block
    INTEGER(KIND=4), INTENT(IN) :: dim
    INTEGER(KIND=4), DIMENSION(2*dim), INTENT(IN) :: range

    TYPE(ops_arg), INTENT(IN) :: opsArg1
    TYPE(ops_arg), INTENT(IN) :: opsArg2
    TYPE(ops_arg), INTENT(IN) :: opsArg3

    TYPE(ops_arg), DIMENSION(3) :: opsArgArray

#else
SUBROUTINE maths_kernel_eqW_host_execute( descPtr )

    TYPE(ops_kernel_descriptor), INTENT(IN) :: descPtr
    TYPE(ops_block) :: block
    INTEGER(KIND=C_INT) :: dim
    INTEGER(KIND=C_INT), POINTER, DIMENSION(:) :: range
    CHARACTER(KIND=C_CHAR), POINTER, DIMENSION(:) :: userSubroutine
    TYPE(ops_arg), POINTER, DIMENSION(:) :: opsArgArray

    TYPE(ops_arg) :: opsArg1
    TYPE(ops_arg) :: opsArg2
    TYPE(ops_arg) :: opsArg3

#endif

    REAL(KIND=8), POINTER, DIMENSION(:) :: opsDat1Local
    INTEGER(KIND=4) :: opsDat1Cardinality
    INTEGER(KIND=4), POINTER, DIMENSION(:)  :: dat1_size
    INTEGER(KIND=4) :: dat1_base

    REAL(KIND=8), POINTER, DIMENSION(:) :: opsDat2Local
    INTEGER(KIND=4) :: opsDat2Cardinality
    INTEGER(KIND=4), POINTER, DIMENSION(:)  :: dat2_size
    INTEGER(KIND=4) :: dat2_base

    REAL(KIND=8), POINTER, DIMENSION(:) :: opsDat3Local
    INTEGER(KIND=4) :: opsDat3Cardinality
    INTEGER(KIND=4), POINTER, DIMENSION(:)  :: dat3_size
    INTEGER(KIND=4) :: dat3_base

    REAL(KIND=8) :: t1__, t2__, t3__
    REAL(KIND=4) :: transfer_total, transfer

    INTEGER(KIND=4), DIMENSION(3) :: start_indx, end_indx
    INTEGER(KIND=4) :: n_indx
    CHARACTER(LEN=40) :: kernelName

    kernelName = "maths_kernel_eqW"

#ifdef OPS_LAZY
!   ==========================
!   Set from kernel descriptor
!   ==========================
    dim = descPtr%dim
    CALL c_f_pointer(descPtr%range, range, [2*dim])
    CALL c_f_pointer(descPtr%name, userSubroutine, [descPtr%name_len])
    block%blockCptr = descPtr%block
    CALL c_f_pointer(block%blockCptr, block%blockPtr)
    CALL c_f_pointer(descPtr%args, opsArgArray, [descPtr%nargs])

    opsArg1 = opsArgArray(1)
    opsArg2 = opsArgArray(2)
    opsArg3 = opsArgArray(3)
#else
    opsArgArray(1) = opsArg1
    opsArgArray(2) = opsArg2
    opsArgArray(3) = opsArg3
#endif

    CALL setKernelTime(204, kernelName//c_null_char, 0.0_8, 0.0_8, 0.0_4, 1)
    CALL ops_timers_core(t1__)

#if defined(OPS_MPI) && !defined(OPS_LAZY)
    IF ( getRange(block, start_indx, end_indx, range) < 0 ) THEN
        RETURN
    END IF
#elif !defined(OPS_MPI)  && !defined(OPS_LAZY)
    DO n_indx = 1, 3
        start_indx(n_indx) = range(2*n_indx-1)
        end_indx  (n_indx) = range(2*n_indx)
    END DO
#else
    DO n_indx = 1, 3
        start_indx(n_indx) = range(2*n_indx-1) + 1
        end_indx  (n_indx) = range(2*n_indx)
    END DO
#endif

    CALL c_f_pointer(getDatSizeFromOpsArg(opsArg1), dat1_size, [dim])
    xdim1 = dat1_size(1)
    ydim1 = dat1_size(2)
    zdim1 = dat1_size(3)
    opsDat1Cardinality = opsArg1%dim * xdim1 * ydim1 * zdim1
    dat1_base = getDatBaseFromOpsArg3D(opsArg1, start_indx, 1)
    CALL c_f_pointer(opsArg1%data, opsDat1Local, [opsDat1Cardinality])

    CALL c_f_pointer(getDatSizeFromOpsArg(opsArg2), dat2_size, [dim])
    xdim2 = dat2_size(1)
    ydim2 = dat2_size(2)
    zdim2 = dat2_size(3)
    opsDat2Cardinality = opsArg2%dim * xdim2 * ydim2 * zdim2
    dat2_base = getDatBaseFromOpsArg3D(opsArg2, start_indx, 1)
    CALL c_f_pointer(opsArg2%data, opsDat2Local, [opsDat2Cardinality])

    CALL c_f_pointer(getDatSizeFromOpsArg(opsArg3), dat3_size, [dim])
    xdim3 = dat3_size(1)
    ydim3 = dat3_size(2)
    zdim3 = dat3_size(3)
    opsDat3Cardinality = opsArg3%dim * xdim3 * ydim3 * zdim3
    dat3_base = getDatBaseFromOpsArg3D(opsArg3, start_indx, 1)
    CALL c_f_pointer(opsArg3%data, opsDat3Local, [opsDat3Cardinality])

!   ==============
!   Halo exchanges
!   ==============
#ifndef OPS_LAZY
    CALL ops_H_D_exchanges_host(opsArgArray, 3)
    CALL ops_halo_exchanges(opsArgArray, 3, range)
    CALL ops_H_D_exchanges_host(opsArgArray, 3)
#endif

    CALL ops_timers_core(t2__)

!   ==============================
!   Call kernel wrapper subroutine
!   ==============================
    CALL maths_kernel_eqW_wrap( &
                        opsDat1Local, &
                        opsDat2Local, &
                        opsDat3Local, &
                        dat1_base, &
                        dat2_base, &
                        dat3_base, &
                        start_indx, &
                        end_indx )

    CALL ops_timers_core(t3__)

#ifndef OPS_LAZY
    CALL ops_set_dirtybit_host(opsArgArray, 3)
    CALL ops_set_halo_dirtybit3(opsArg1, range)
#endif

!   ========================
!   Timing and data movement
!   ========================
    transfer_total = 0.0_4
    CALL ops_compute_transfer(3, start_indx, end_indx, opsArg1, transfer)
    transfer_total = transfer_total + transfer
    CALL ops_compute_transfer(3, start_indx, end_indx, opsArg2, transfer)
    transfer_total = transfer_total + transfer
    CALL ops_compute_transfer(3, start_indx, end_indx, opsArg3, transfer)
    transfer_total = transfer_total + transfer

    CALL setKernelTime(204, kernelName//c_null_char, t3__-t2__, t2__-t1__, transfer_total, 0)

END SUBROUTINE

#ifdef OPS_LAZY
SUBROUTINE maths_kernel_eqW_host( userSubroutine, block, dim, range, &
    opsArg1, &
    opsArg2, &
    opsArg3 &
    )

    CHARACTER(KIND=C_CHAR,LEN=*), INTENT(IN), TARGET :: userSubroutine
    TYPE(ops_block), INTENT(IN) :: block
    INTEGER(KIND=4), INTENT(IN) :: dim
    INTEGER(KIND=4), DIMENSION(2*dim), INTENT(INOUT), TARGET :: range
    INTEGER(KIND=4), DIMENSION(2*dim), TARGET :: range_tmp

    TYPE(ops_arg), INTENT(IN) :: opsArg1
    TYPE(ops_arg), INTENT(IN) :: opsArg2
    TYPE(ops_arg), INTENT(IN) :: opsArg3

    TYPE(ops_arg), DIMENSION(3), TARGET :: opsArgArray
    INTEGER(KIND=4) :: n_indx
    CHARACTER(LEN=40) :: namelit

    namelit = "maths_kernel_eqW"

    opsArgArray(1) = opsArg1
    opsArgArray(2) = opsArg2
    opsArgArray(3) = opsArg3

    DO n_indx = 1, 3
        range_tmp(2*n_indx-1) = range(2*n_indx-1)-1
        range_tmp(2*n_indx)   = range(2*n_indx)
    END DO

    CALL create_kerneldesc_and_enque(namelit//c_null_char, c_loc(opsArgArray), &
                                    3, 204, dim, 0, c_loc(range_tmp), &
                                    block%blockCptr, c_funloc(maths_kernel_eqW_host_execute))

END SUBROUTINE
#endif

END MODULE MATHS_KERNEL_EQW_MODULE

with the use of !DEC$ ATTRIBUTES FORCEINLINE :: maths_kernel_eqW, it is able to inline the function call present inside SIMD loop and able to vectorize with INTEL compiler
Although with GNU or CRAY it says “Loop was not vectorized because it contains call to subroutine”

With Cray compiler, i tried putting !DIR$ INLINEALWAYS maths_kerne_eqW as well as !DIR$ INLINE over the CALL statement, but it still not able to vectorize

ivanpribec · June 24, 2024, 9:59pm

If you can’t make this an internal procedure, could you make it private to the module at least?

What flags did you use?

Ashutosh_Londhe · June 24, 2024, 10:00pm

so after dropping a -h flex_mp=strict flag from compilation for CRAY compiler and putting !DIR$ INLINE over CALL statement, it is now vectorizing the loop

Ashutosh_Londhe · June 24, 2024, 10:03pm

I was using earlier

-O3 -g -fopenmp -fPIC -M 878 -f free -h flex_mp=strict -hlist=m -h nofma -hfp1=noapprox -K trap=divz,fp,inv,unf -N 1023

with CRAY compiler. After dropping -h flex_mp=strict it is now vectorized the loop

67.
   68.    M---------------<     !$OMP PARALLEL DO PRIVATE(n_x,n_y,n_z)
   69.  + M mF                  DO n_z = 1, end_indx(3)-start_indx(3)+1
   70.  + M mF F                    DO n_y = 1, end_indx(2)-start_indx(2)+1
   71.    M mF F                        !$OMP SIMD
   72.  + M mF F VFr4                   DO n_x = 1, end_indx(1)-start_indx(1)+1
   73.    M mF F VFr4       !DIR$ INLINE
   74.    M mF F VFr4 I                     CALL maths_kernel_eqW( &
   75.    M mF F VFr4
   76.    M mF F VFr4                       opsDat1Local(dat1_base + ((n_x-1)*1) + ((n_y-1)*xdim1*1) + ((n_z-1)*ydim1*xdim1*1)), &
   77.    M mF F VFr4
   78.    M mF F VFr4                       opsDat2Local(dat2_base + ((n_x-1)*1) + ((n_y-1)*xdim2*1) + ((n_z-1)*ydim2*xdim2*1)), &
   79.    M mF F VFr4
   80.    M mF F VFr4                       opsDat3Local(dat3_base + ((n_x-1)*1) + ((n_y-1)*xdim3*1) + ((n_z-1)*ydim3*xdim3*1)) &
   81.    M mF F VFr4                      )
   82.    M mF F VFr4
   83.    M mF F VFr4----->             END DO
   84.    M mF F---------->         END DO
   85.    M mF----------->>     END DO


ftn-6823 ftn: THREAD MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 68
  A region starting at line 68 and ending at line 85 was multi-threaded.

ftn-3182 ftn: IPA MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 69
  Loop has been flattened.

ftn-6294 ftn: VECTOR MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 69
  A loop starting at line 69 was not vectorized because a better candidate was found at line 72.

ftn-6817 ftn: THREAD MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 69
  A loop starting at line 69 was partitioned.

ftn-3182 ftn: IPA MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 70
  Loop has been flattened.

ftn-6294 ftn: VECTOR MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 70
  A loop starting at line 70 was not vectorized because a better candidate was found at line 72.

ftn-3182 ftn: IPA MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 72
  Loop has been flattened.

ftn-6005 ftn: SCALAR MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 72
  A loop starting at line 72 was unrolled 4 times.

ftn-6204 ftn: VECTOR MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 72
  A loop starting at line 72 was vectorized.

ftn-3001 ftn: IPA MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 74
  Leaf "maths_kernel_eqw"(/mnt/lustre/a2fs-work1/work/e01/e01/ashutoshl/cray_test/senga_ops_3dh2/SENGA2/src/mpi_openmp/maths_kernel_eqW_seq_kernel.F90:33) was inlined.

now the question is how to achive this for GNU as well, considering [ivanpribec] mentioned that there is no similar feature available with GCC

ivanpribec · June 24, 2024, 10:04pm

The gfortran directives are listed here: GNU Fortran Compiler Directives (The GNU Fortran Compiler)

Ashutosh_Londhe · June 24, 2024, 10:07pm

what do you mean by internal procedure??

the one xample you have provided that one??

!$omp parallel do collapse(2)
do n_z = 1, end_indx(3) - start_indx(3)+1
    do n_y = 1, end_indx(2) - start_indx(2)+1
        !$omp simd
        do n_x = 1, end_indx(1) - start_indx(1)+1
            call maths_kernel_eqAI( &
                opsDat1Local(n_x,n_y,n_z), &
                opsDat2Local(n_x,n_y,n_z), &
                opsDat3Local(n_x,n_y,n_z), &
                opsDat4Local(n_x,n_y,n_z), &
                opsDat5Local(n_x,n_y,n_z), &
                opsDat6Local(n_x,n_y,n_z), &
                opsDat7Local(n_x,n_y,n_z), &
                opsDat8Local(n_x,n_y,n_z) &
                )
        end do
    end do
end do

contains

    !  =============
    !  User function
    !  =============
    !DEC$ ATTRIBUTES FORCEINLINE :: maths_kernel_eqAI
    subroutine maths_kernel_eqAI(out_arr,in_arr1,in_arr2,in_arr3,in_arr4,in_arr5,in_arr6,in_arr7)
        real(kind=8), intent(inout) :: out_arr
        real(kind=8), intent(in) :: in_arr1, in_arr2,in_arr3,in_arr4,in_arr5,in_arr6,in_arr7
        
        out_arr = out_arr + (in_arr1 + in_arr2) &
                          + (in_arr3 + in_arr4) &
                          + (in_arr5 + in_arr6) &
                          + in_arr7
    end subroutine

end subroutine

this one??
sorry i dont have much idea about Fortran

Ashutosh_Londhe · June 24, 2024, 10:09pm

GCC has NOINLINE attributes but doesnt have INLINE one

ivanpribec · June 24, 2024, 10:13pm

An internal procedure is a procedure within another procedure.

Instead of having both procedures in the module scope, you would move the kernel procedure inside:

MODULE MATHS_KERNEL_EQW_MODULE
! ...
CONTAINS

SUBROUTINE maths_kernel_eqW_wrap( ... )
! ...
CONTAINS
   SUBROUTINE maths_kernel_eqW( ... )  ! Helpful for inlining 
   ! ...
   END SUBROUTINE
END SUBROUTINE

END MODULE

Ashutosh_Londhe · June 24, 2024, 10:20pm

Sure. thanks for the inputs. I will try with GNU if it helps.

Thanks once again for all the responses.

Topic		Replies	Views
Fortran is dead – Long live Fortran! Tutorials	9	815	June 27, 2025
Performance of vectorized code in ifort and ifx	28	1260	April 9, 2024
What really is vectorization and how to implement it?	9	4792	October 9, 2021
Vectorizing a conditional product	10	971	August 2, 2022
Will using Vectorization speed up the program? Help	22	1869	November 26, 2023

Function inline in F90 code

Related topics