I am trying to vectorize a loop which contain call to subroutine.
On top of this subroutine i have mentioned !DEC$ ATTRIBUTES FORCEINLINE :: subroutine_name
due to which when i compile this with Intel, it says the required loop is vectorized.
Altough when i am trying this with GNU compiler, i am getting “missed: couldn’t vectorize, multiple nested loop”. I have provided the code snippet as well as command line which mentions all Flags i am using here.
Could anyone help me here how can i vectorize the following peice of code. Please keep in mind that the code is auto-generated by OPS-DSL library so i cannot replace the call statement with the body manually.
I speculate there is a high barrier of activation to overpass to help here because the relevant source code/output by the CLI is provided as picture. For future reference, I suggest either
- for a short snippet: copy-paste it as fenced code block enclosed in a leading/trailing line of three back ticks/accent graves. With GitHub flavored markdown supported here, the leading line can be ```fortran (or use
f90
instead) to then activate syntax highlighting (cf. the example around Ruby on learnxinyminutes). - for a somewhat long report (e.g., requires to scroll down): attach the report for instance, as .txt, .csv, or .f90 file.
Judging by the output on the left, your inner loop probably was vectorized:
[…] F90:153:16: optimized: loop vectorized using 32 byte vectors
It’s generally the inner-most loop which gets vectorized.
A few more things to note
- The
!DEC$
directive syntax doesn’t work with gfortran; as of GCC v14, gfortran does not have an equivalent directive to force inlining - Making the generated user function an internal procedure can help achieve inlining as the compiler is free to assume the procedure is not called elsewhere. With gfortran
-O2
will already inline the internal procedure as shown below. - I’m curious why can’t the code generator - OP-DSL - just use multi-dimensional indexing? The offset of
datX_base
can be applied at the call site, instead of being passed as arguments. - The middle loop could be collapsed into the outer OMP directive.
- The loop indexes are private by default within OMP loop-related directives
#define xdim1 40
#define ydim1 40
#define xdim2 40
#define ydim2 40
#define xdim3 40
#define ydim3 40
#define xdim4 40
#define ydim4 40
#define xdim5 40
#define ydim5 40
#define xdim6 40
#define ydim6 40
#define xdim7 40
#define ydim7 40
#define xdim8 40
#define ydim8 40
subroutine maths_kernel_eqAI_wrap( &
opsDat1Local, &
opsDat2Local, &
opsDat3Local, &
opsDat4Local, &
opsDat5Local, &
opsDat6Local, &
opsDat7Local, &
opsDat8Local, &
start_indx, &
end_indx )
real(kind=8), dimension(0:xdim1-1,0:ydim1-1,0:*), intent(inout) :: opsDat1Local
real(kind=8), dimension(0:xdim2-1,0:ydim2-1,0:*), intent(in) :: opsDat2Local
real(kind=8), dimension(0:xdim3-1,0:ydim3-1,0:*), intent(in) :: opsDat3Local
real(kind=8), dimension(0:xdim4-1,0:ydim4-1,0:*), intent(in) :: opsDat4Local
real(kind=8), dimension(0:xdim5-1,0:ydim5-1,0:*), intent(in) :: opsDat5Local
real(kind=8), dimension(0:xdim6-1,0:ydim6-1,0:*), intent(in) :: opsDat6Local
real(kind=8), dimension(0:xdim7-1,0:ydim7-1,0:*), intent(in) :: opsDat7Local
real(kind=8), dimension(0:xdim8-1,0:ydim8-1,0:*), intent(in) :: opsDat8Local
integer(kind=4), dimension(3), intent(in) :: start_indx, end_indx
integer(kind=4) :: n_x, n_y, n_z
!$omp parallel do collapse(2)
do n_z = 1, end_indx(3) - start_indx(3)+1
do n_y = 1, end_indx(2) - start_indx(2)+1
!$omp simd
do n_x = 1, end_indx(1) - start_indx(1)+1
call maths_kernel_eqAI( &
opsDat1Local(n_x,n_y,n_z), &
opsDat2Local(n_x,n_y,n_z), &
opsDat3Local(n_x,n_y,n_z), &
opsDat4Local(n_x,n_y,n_z), &
opsDat5Local(n_x,n_y,n_z), &
opsDat6Local(n_x,n_y,n_z), &
opsDat7Local(n_x,n_y,n_z), &
opsDat8Local(n_x,n_y,n_z) &
)
end do
end do
end do
contains
! =============
! User function
! =============
!DEC$ ATTRIBUTES FORCEINLINE :: maths_kernel_eqAI
subroutine maths_kernel_eqAI(out_arr,in_arr1,in_arr2,in_arr3,in_arr4,in_arr5,in_arr6,in_arr7)
real(kind=8), intent(inout) :: out_arr
real(kind=8), intent(in) :: in_arr1, in_arr2,in_arr3,in_arr4,in_arr5,in_arr6,in_arr7
out_arr = out_arr + (in_arr1 + in_arr2) &
+ (in_arr3 + in_arr4) &
+ (in_arr5 + in_arr6) &
+ in_arr7
end subroutine
end subroutine
You can inspect the results in Compiler Explorer: Compiler Explorer
For the flags, -c -cpp -O2 -ftree-vectorize -mprefer-vector-width=512 -march=icelake-server -fopt-info
, the inner vector loop contains:
.L7:
vmovupd zmm0, ZMMWORD PTR [r11+rax]
vmovupd zmm1, ZMMWORD PTR [r9+rax]
vaddpd zmm0, zmm0, ZMMWORD PTR [rbx+rax]
vaddpd zmm1, zmm1, ZMMWORD PTR [r10+rax]
vaddpd zmm0, zmm0, ZMMWORD PTR [rdx+rax]
vaddpd zmm0, zmm0, zmm1
vmovupd zmm1, ZMMWORD PTR [rdi+rax]
vaddpd zmm1, zmm1, ZMMWORD PTR [r8+rax]
vaddpd zmm0, zmm0, zmm1
vaddpd zmm0, zmm0, ZMMWORD PTR [rsi+rax]
vmovupd ZMMWORD PTR [rdx+rax], zmm0
add rax, 64
cmp r12, rax
jne .L7
As expected it performs 7 additions using the ZMM registers.
Edit: Alternatively, the ???_indx
arrays can also be used to set the dimensions (assuming these are contiguous chunks):
integer(kind=4), intent(in) :: start_indx(3), end_indx(3)
real(kind=8), intent(inout) :: opsDat1Local(start_indx(1):end_indx(1), start_indx(2):end_indx(2),start_indx(3):end_indx(3))
! ...
(Perhaps the some of the newly introduced F2023 syntax can make this even easier.)
Hi,
Sorry for not putting the code in text format earlier
here is the code generated by OPS-translator
! Auto-generated at 2024-06-24 14:28:12.629804 by ops-translator
MODULE MATHS_KERNEL_EQW_MODULE
USE OPS_FORTRAN_DECLARATIONS
USE OPS_FORTRAN_RT_SUPPORT
USE OPS_CONSTANTS
USE, INTRINSIC :: ISO_C_BINDING
IMPLICIT NONE
INTEGER(KIND=4) :: xdim1
INTEGER(KIND=4) :: ydim1
INTEGER(KIND=4) :: zdim1
#define OPS_ACC1(x,y,z) ((x) + (xdim1*(y)) + (xdim1*ydim1*(z)) + 1)
INTEGER(KIND=4) :: xdim2
INTEGER(KIND=4) :: ydim2
INTEGER(KIND=4) :: zdim2
#define OPS_ACC2(x,y,z) ((x) + (xdim2*(y)) + (xdim2*ydim2*(z)) + 1)
INTEGER(KIND=4) :: xdim3
INTEGER(KIND=4) :: ydim3
INTEGER(KIND=4) :: zdim3
#define OPS_ACC3(x,y,z) ((x) + (xdim3*(y)) + (xdim3*ydim3*(z)) + 1)
CONTAINS
! =============
! User function
! =============
#ifdef __INTEL__COMPILER
!DEC$ ATTRIBUTES FORCEINLINE :: maths_kernel_eqW
#endif
SUBROUTINE maths_kernel_eqW(out_arr, in_arr1, in_arr2)
real(kind=8), dimension(1) :: out_arr
real(kind=8), dimension(1), intent(in) :: in_arr1, in_arr2
out_arr(OPS_ACC1(0,0,0)) = in_arr1(OPS_ACC2(0,0,0)) * in_arr2(OPS_ACC3(0,0,0))
END SUBROUTINE
#undef OPS_ACC1
#undef OPS_ACC2
#undef OPS_ACC3
SUBROUTINE maths_kernel_eqW_wrap( &
opsDat1Local, &
opsDat2Local, &
opsDat3Local, &
dat1_base, &
dat2_base, &
dat3_base, &
start_indx, &
end_indx )
REAL(KIND=8), DIMENSION(*), INTENT(OUT) :: opsDat1Local
INTEGER(KIND=4), INTENT(IN) :: dat1_base
REAL(KIND=8), DIMENSION(*), INTENT(IN) :: opsDat2Local
INTEGER(KIND=4), INTENT(IN) :: dat2_base
REAL(KIND=8), DIMENSION(*), INTENT(IN) :: opsDat3Local
INTEGER(KIND=4), INTENT(IN) :: dat3_base
INTEGER(KIND=4), DIMENSION(3), INTENT(IN) :: start_indx, end_indx
INTEGER(KIND=4) :: n_x, n_y, n_z
!$OMP PARALLEL DO PRIVATE(n_x,n_y,n_z)
DO n_z = 1, end_indx(3)-start_indx(3)+1
DO n_y = 1, end_indx(2)-start_indx(2)+1
!$OMP SIMD
DO n_x = 1, end_indx(1)-start_indx(1)+1
CALL maths_kernel_eqW( &
opsDat1Local(dat1_base + ((n_x-1)*1) + ((n_y-1)*xdim1*1) + ((n_z-1)*ydim1*xdim1*1)), &
opsDat2Local(dat2_base + ((n_x-1)*1) + ((n_y-1)*xdim2*1) + ((n_z-1)*ydim2*xdim2*1)), &
opsDat3Local(dat3_base + ((n_x-1)*1) + ((n_y-1)*xdim3*1) + ((n_z-1)*ydim3*xdim3*1)) &
)
END DO
END DO
END DO
END SUBROUTINE
! ===============
! Host subroutine
! ===============
#ifndef OPS_LAZY
SUBROUTINE maths_kernel_eqW_host( userSubroutine, block, dim, range, &
opsArg1, &
opsArg2, &
opsArg3 &
)
CHARACTER(KIND=C_CHAR,LEN=*), INTENT(IN) :: userSubroutine
TYPE(ops_block), INTENT(IN) :: block
INTEGER(KIND=4), INTENT(IN) :: dim
INTEGER(KIND=4), DIMENSION(2*dim), INTENT(IN) :: range
TYPE(ops_arg), INTENT(IN) :: opsArg1
TYPE(ops_arg), INTENT(IN) :: opsArg2
TYPE(ops_arg), INTENT(IN) :: opsArg3
TYPE(ops_arg), DIMENSION(3) :: opsArgArray
#else
SUBROUTINE maths_kernel_eqW_host_execute( descPtr )
TYPE(ops_kernel_descriptor), INTENT(IN) :: descPtr
TYPE(ops_block) :: block
INTEGER(KIND=C_INT) :: dim
INTEGER(KIND=C_INT), POINTER, DIMENSION(:) :: range
CHARACTER(KIND=C_CHAR), POINTER, DIMENSION(:) :: userSubroutine
TYPE(ops_arg), POINTER, DIMENSION(:) :: opsArgArray
TYPE(ops_arg) :: opsArg1
TYPE(ops_arg) :: opsArg2
TYPE(ops_arg) :: opsArg3
#endif
REAL(KIND=8), POINTER, DIMENSION(:) :: opsDat1Local
INTEGER(KIND=4) :: opsDat1Cardinality
INTEGER(KIND=4), POINTER, DIMENSION(:) :: dat1_size
INTEGER(KIND=4) :: dat1_base
REAL(KIND=8), POINTER, DIMENSION(:) :: opsDat2Local
INTEGER(KIND=4) :: opsDat2Cardinality
INTEGER(KIND=4), POINTER, DIMENSION(:) :: dat2_size
INTEGER(KIND=4) :: dat2_base
REAL(KIND=8), POINTER, DIMENSION(:) :: opsDat3Local
INTEGER(KIND=4) :: opsDat3Cardinality
INTEGER(KIND=4), POINTER, DIMENSION(:) :: dat3_size
INTEGER(KIND=4) :: dat3_base
REAL(KIND=8) :: t1__, t2__, t3__
REAL(KIND=4) :: transfer_total, transfer
INTEGER(KIND=4), DIMENSION(3) :: start_indx, end_indx
INTEGER(KIND=4) :: n_indx
CHARACTER(LEN=40) :: kernelName
kernelName = "maths_kernel_eqW"
#ifdef OPS_LAZY
! ==========================
! Set from kernel descriptor
! ==========================
dim = descPtr%dim
CALL c_f_pointer(descPtr%range, range, [2*dim])
CALL c_f_pointer(descPtr%name, userSubroutine, [descPtr%name_len])
block%blockCptr = descPtr%block
CALL c_f_pointer(block%blockCptr, block%blockPtr)
CALL c_f_pointer(descPtr%args, opsArgArray, [descPtr%nargs])
opsArg1 = opsArgArray(1)
opsArg2 = opsArgArray(2)
opsArg3 = opsArgArray(3)
#else
opsArgArray(1) = opsArg1
opsArgArray(2) = opsArg2
opsArgArray(3) = opsArg3
#endif
CALL setKernelTime(204, kernelName//c_null_char, 0.0_8, 0.0_8, 0.0_4, 1)
CALL ops_timers_core(t1__)
#if defined(OPS_MPI) && !defined(OPS_LAZY)
IF ( getRange(block, start_indx, end_indx, range) < 0 ) THEN
RETURN
END IF
#elif !defined(OPS_MPI) && !defined(OPS_LAZY)
DO n_indx = 1, 3
start_indx(n_indx) = range(2*n_indx-1)
end_indx (n_indx) = range(2*n_indx)
END DO
#else
DO n_indx = 1, 3
start_indx(n_indx) = range(2*n_indx-1) + 1
end_indx (n_indx) = range(2*n_indx)
END DO
#endif
CALL c_f_pointer(getDatSizeFromOpsArg(opsArg1), dat1_size, [dim])
xdim1 = dat1_size(1)
ydim1 = dat1_size(2)
zdim1 = dat1_size(3)
opsDat1Cardinality = opsArg1%dim * xdim1 * ydim1 * zdim1
dat1_base = getDatBaseFromOpsArg3D(opsArg1, start_indx, 1)
CALL c_f_pointer(opsArg1%data, opsDat1Local, [opsDat1Cardinality])
CALL c_f_pointer(getDatSizeFromOpsArg(opsArg2), dat2_size, [dim])
xdim2 = dat2_size(1)
ydim2 = dat2_size(2)
zdim2 = dat2_size(3)
opsDat2Cardinality = opsArg2%dim * xdim2 * ydim2 * zdim2
dat2_base = getDatBaseFromOpsArg3D(opsArg2, start_indx, 1)
CALL c_f_pointer(opsArg2%data, opsDat2Local, [opsDat2Cardinality])
CALL c_f_pointer(getDatSizeFromOpsArg(opsArg3), dat3_size, [dim])
xdim3 = dat3_size(1)
ydim3 = dat3_size(2)
zdim3 = dat3_size(3)
opsDat3Cardinality = opsArg3%dim * xdim3 * ydim3 * zdim3
dat3_base = getDatBaseFromOpsArg3D(opsArg3, start_indx, 1)
CALL c_f_pointer(opsArg3%data, opsDat3Local, [opsDat3Cardinality])
! ==============
! Halo exchanges
! ==============
#ifndef OPS_LAZY
CALL ops_H_D_exchanges_host(opsArgArray, 3)
CALL ops_halo_exchanges(opsArgArray, 3, range)
CALL ops_H_D_exchanges_host(opsArgArray, 3)
#endif
CALL ops_timers_core(t2__)
! ==============================
! Call kernel wrapper subroutine
! ==============================
CALL maths_kernel_eqW_wrap( &
opsDat1Local, &
opsDat2Local, &
opsDat3Local, &
dat1_base, &
dat2_base, &
dat3_base, &
start_indx, &
end_indx )
CALL ops_timers_core(t3__)
#ifndef OPS_LAZY
CALL ops_set_dirtybit_host(opsArgArray, 3)
CALL ops_set_halo_dirtybit3(opsArg1, range)
#endif
! ========================
! Timing and data movement
! ========================
transfer_total = 0.0_4
CALL ops_compute_transfer(3, start_indx, end_indx, opsArg1, transfer)
transfer_total = transfer_total + transfer
CALL ops_compute_transfer(3, start_indx, end_indx, opsArg2, transfer)
transfer_total = transfer_total + transfer
CALL ops_compute_transfer(3, start_indx, end_indx, opsArg3, transfer)
transfer_total = transfer_total + transfer
CALL setKernelTime(204, kernelName//c_null_char, t3__-t2__, t2__-t1__, transfer_total, 0)
END SUBROUTINE
#ifdef OPS_LAZY
SUBROUTINE maths_kernel_eqW_host( userSubroutine, block, dim, range, &
opsArg1, &
opsArg2, &
opsArg3 &
)
CHARACTER(KIND=C_CHAR,LEN=*), INTENT(IN), TARGET :: userSubroutine
TYPE(ops_block), INTENT(IN) :: block
INTEGER(KIND=4), INTENT(IN) :: dim
INTEGER(KIND=4), DIMENSION(2*dim), INTENT(INOUT), TARGET :: range
INTEGER(KIND=4), DIMENSION(2*dim), TARGET :: range_tmp
TYPE(ops_arg), INTENT(IN) :: opsArg1
TYPE(ops_arg), INTENT(IN) :: opsArg2
TYPE(ops_arg), INTENT(IN) :: opsArg3
TYPE(ops_arg), DIMENSION(3), TARGET :: opsArgArray
INTEGER(KIND=4) :: n_indx
CHARACTER(LEN=40) :: namelit
namelit = "maths_kernel_eqW"
opsArgArray(1) = opsArg1
opsArgArray(2) = opsArg2
opsArgArray(3) = opsArg3
DO n_indx = 1, 3
range_tmp(2*n_indx-1) = range(2*n_indx-1)-1
range_tmp(2*n_indx) = range(2*n_indx)
END DO
CALL create_kerneldesc_and_enque(namelit//c_null_char, c_loc(opsArgArray), &
3, 204, dim, 0, c_loc(range_tmp), &
block%blockCptr, c_funloc(maths_kernel_eqW_host_execute))
END SUBROUTINE
#endif
END MODULE MATHS_KERNEL_EQW_MODULE
with the use of !DEC$ ATTRIBUTES FORCEINLINE :: maths_kernel_eqW
, it is able to inline the function call present inside SIMD loop and able to vectorize with INTEL compiler
Although with GNU or CRAY it says “Loop was not vectorized because it contains call to subroutine”
With Cray compiler, i tried putting !DIR$ INLINEALWAYS maths_kerne_eqW
as well as !DIR$ INLINE
over the CALL statement, but it still not able to vectorize
If you can’t make this an internal procedure, could you make it private
to the module at least?
What flags did you use?
so after dropping a -h flex_mp=strict
flag from compilation for CRAY compiler and putting !DIR$ INLINE
over CALL statement, it is now vectorizing the loop
I was using earlier
-O3 -g -fopenmp -fPIC -M 878 -f free -h flex_mp=strict -hlist=m -h nofma -hfp1=noapprox -K trap=divz,fp,inv,unf -N 1023
with CRAY compiler. After dropping -h flex_mp=strict
it is now vectorized the loop
67.
68. M---------------< !$OMP PARALLEL DO PRIVATE(n_x,n_y,n_z)
69. + M mF DO n_z = 1, end_indx(3)-start_indx(3)+1
70. + M mF F DO n_y = 1, end_indx(2)-start_indx(2)+1
71. M mF F !$OMP SIMD
72. + M mF F VFr4 DO n_x = 1, end_indx(1)-start_indx(1)+1
73. M mF F VFr4 !DIR$ INLINE
74. M mF F VFr4 I CALL maths_kernel_eqW( &
75. M mF F VFr4
76. M mF F VFr4 opsDat1Local(dat1_base + ((n_x-1)*1) + ((n_y-1)*xdim1*1) + ((n_z-1)*ydim1*xdim1*1)), &
77. M mF F VFr4
78. M mF F VFr4 opsDat2Local(dat2_base + ((n_x-1)*1) + ((n_y-1)*xdim2*1) + ((n_z-1)*ydim2*xdim2*1)), &
79. M mF F VFr4
80. M mF F VFr4 opsDat3Local(dat3_base + ((n_x-1)*1) + ((n_y-1)*xdim3*1) + ((n_z-1)*ydim3*xdim3*1)) &
81. M mF F VFr4 )
82. M mF F VFr4
83. M mF F VFr4-----> END DO
84. M mF F----------> END DO
85. M mF----------->> END DO
ftn-6823 ftn: THREAD MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 68
A region starting at line 68 and ending at line 85 was multi-threaded.
ftn-3182 ftn: IPA MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 69
Loop has been flattened.
ftn-6294 ftn: VECTOR MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 69
A loop starting at line 69 was not vectorized because a better candidate was found at line 72.
ftn-6817 ftn: THREAD MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 69
A loop starting at line 69 was partitioned.
ftn-3182 ftn: IPA MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 70
Loop has been flattened.
ftn-6294 ftn: VECTOR MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 70
A loop starting at line 70 was not vectorized because a better candidate was found at line 72.
ftn-3182 ftn: IPA MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 72
Loop has been flattened.
ftn-6005 ftn: SCALAR MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 72
A loop starting at line 72 was unrolled 4 times.
ftn-6204 ftn: VECTOR MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 72
A loop starting at line 72 was vectorized.
ftn-3001 ftn: IPA MATHS_KERNEL_EQW_WRAP, File = maths_kernel_eqW_seq_kernel.F90, Line = 74
Leaf "maths_kernel_eqw"(/mnt/lustre/a2fs-work1/work/e01/e01/ashutoshl/cray_test/senga_ops_3dh2/SENGA2/src/mpi_openmp/maths_kernel_eqW_seq_kernel.F90:33) was inlined.
now the question is how to achive this for GNU as well, considering [ivanpribec] mentioned that there is no similar feature available with GCC
The gfortran directives are listed here: GNU Fortran Compiler Directives (The GNU Fortran Compiler)
what do you mean by internal procedure??
the one xample you have provided that one??
!$omp parallel do collapse(2)
do n_z = 1, end_indx(3) - start_indx(3)+1
do n_y = 1, end_indx(2) - start_indx(2)+1
!$omp simd
do n_x = 1, end_indx(1) - start_indx(1)+1
call maths_kernel_eqAI( &
opsDat1Local(n_x,n_y,n_z), &
opsDat2Local(n_x,n_y,n_z), &
opsDat3Local(n_x,n_y,n_z), &
opsDat4Local(n_x,n_y,n_z), &
opsDat5Local(n_x,n_y,n_z), &
opsDat6Local(n_x,n_y,n_z), &
opsDat7Local(n_x,n_y,n_z), &
opsDat8Local(n_x,n_y,n_z) &
)
end do
end do
end do
contains
! =============
! User function
! =============
!DEC$ ATTRIBUTES FORCEINLINE :: maths_kernel_eqAI
subroutine maths_kernel_eqAI(out_arr,in_arr1,in_arr2,in_arr3,in_arr4,in_arr5,in_arr6,in_arr7)
real(kind=8), intent(inout) :: out_arr
real(kind=8), intent(in) :: in_arr1, in_arr2,in_arr3,in_arr4,in_arr5,in_arr6,in_arr7
out_arr = out_arr + (in_arr1 + in_arr2) &
+ (in_arr3 + in_arr4) &
+ (in_arr5 + in_arr6) &
+ in_arr7
end subroutine
end subroutine
this one??
sorry i dont have much idea about Fortran
GCC has NOINLINE attributes but doesnt have INLINE one
An internal procedure is a procedure within another procedure.
Instead of having both procedures in the module scope, you would move the kernel procedure inside:
MODULE MATHS_KERNEL_EQW_MODULE
! ...
CONTAINS
SUBROUTINE maths_kernel_eqW_wrap( ... )
! ...
CONTAINS
SUBROUTINE maths_kernel_eqW( ... ) ! Helpful for inlining
! ...
END SUBROUTINE
END SUBROUTINE
END MODULE
Sure. thanks for the inputs. I will try with GNU if it helps.
Thanks once again for all the responses.