@greenrongreen, the "debate of stack vs heap" can be clarified if the size of the arrays is also considered.
My impression is that for "large" arrays there is little difference in compute efficiency between stack and heap, but "small" arrays can be more efficient on the stack, where they have better access to the L1 cache. Small arrays are also probably allocated more frequently.
For OpenMP, especially on the heap, it would be an improvement if large arrays could be positioned on a new memory page to mitigate memory consistency problems between threads. (I have not seen this as a compile option?)
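I am not aware of such a gfortran option either, but on POSIX systems page alignment can be requested manually through C interop. A minimal sketch, assuming a 4096-byte page and real(8) data (the module and routine names are illustrative, not from the attached file):

! Sketch: page-aligned allocation via C interop (POSIX only).
! The 4096-byte page size is an assumption; query the system page size for portability.
module page_alloc
   use iso_c_binding
   implicit none
   interface
      function posix_memalign(memptr, alignment, nbytes) bind(c, name='posix_memalign') result(ierr)
         import :: c_ptr, c_size_t, c_int
         type(c_ptr), intent(out) :: memptr
         integer(c_size_t), value :: alignment, nbytes
         integer(c_int) :: ierr
      end function posix_memalign
   end interface
contains
   subroutine alloc_page_aligned(a, n)
      real(8), pointer, intent(out) :: a(:,:)   ! release later with the C free()
      integer, intent(in) :: n
      type(c_ptr) :: p
      if (posix_memalign(p, 4096_c_size_t, 8_c_size_t*n*n) /= 0) stop 'allocation failed'
      call c_f_pointer(p, a, [n, n])            ! view the C buffer as a(n,n)
   end subroutine alloc_page_aligned
end module page_alloc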
Modification of program test_matmul
I have reviewed the loop order in program test_matmul and enhanced the program to provide a comparison with other approaches. These approaches consider:
- revised DO loop order
- partitioning the calculation into sub matrices to reduce memory <> cache transfers
- applying OpenMP to improve performance
- applying partitioning to OpenMP to reduce memory <> cache transfers
The test covers a range of matrix sizes for each of the 7 coded approaches.
! Program to test different approaches to emulate MATMUL
!
! 1 MATMUL use intrinsic routine
! 2 maprod_v0 original proposed DO order : j i k
! 3 maprod_v1 revised DO order j k i to improve sequential memory access
! 4 maprod_v2 replace inner loop with array syntax
! 5 maprod_part attempt to partition the multiplication to improve L1 Cache usage
! 6 maprod_omp OpenMP approach from V2 using temporary column vector on thread stack
! 7 maprod_omp_part OpenMP approach on J that includes partitioning ( different sizes )
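To illustrate approach 3, here is a minimal sketch of the revised j k i ordering (the attached file may differ in detail). The inner loop then walks down a column of C and A, so all accesses are stride-1 in Fortran's column-major storage:

! Sketch of approach 3: j k i DO order for sequential memory access.
subroutine maprod_v1(a, b, c, n)
   implicit none
   integer, intent(in) :: n
   real(8), intent(in)  :: a(n,n), b(n,n)
   real(8), intent(out) :: c(n,n)
   integer :: i, j, k
   c = 0
   do j = 1, n
      do k = 1, n
         do i = 1, n
            c(i,j) = c(i,j) + a(i,k) * b(k,j)   ! stride-1 in i
         end do
      end do
   end do
end subroutine maprod_v1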
I report the results as both elapsed time and computation rate in GFLOP/sec vs matrix size.
Plotting GFLOP/sec against matrix size shows how the memory <> cache transfer bottleneck responds to each strategy, which helps to demonstrate their relative benefits.
The range of sizes should be chosen carefully, as the elapsed time scales as O(n^3), so larger sizes take a lot longer!
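For reference, the GFLOP/sec figure presumably follows from the standard 2*n^3 operation count (n multiplies and n adds for each of the n*n elements of C). A minimal timing sketch, assuming system_clock; the size n = 512 is illustrative:

! Sketch: GFLOP/s from the 2*n**3 operation count.
program time_matmul
   implicit none
   integer, parameter :: n = 512           ! illustrative size
   real(8), allocatable :: a(:,:), b(:,:), c(:,:)
   real(8) :: seconds, gflops
   integer(8) :: t0, t1, rate
   allocate (a(n,n), b(n,n), c(n,n))
   call random_number(a); call random_number(b)
   call system_clock(t0, rate)
   c = matmul(a, b)                         ! or any maprod_* routine under test
   call system_clock(t1)
   seconds = real(t1-t0, 8) / real(rate, 8)
   gflops  = 2.0d0 * real(n,8)**3 / seconds / 1.0d9
   print '(a,f8.2)', ' GFLOP/s : ', gflops
end program time_matmul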
The partitioning sizes (di, dj, dk) are related to the processor cache sizes and also to the number of threads available. I use different sizes for a single thread (related to L1 cache size) vs OpenMP (related to L1 and the shared L3 cache size). The optimal sizes will vary with processor.
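The blocking itself (approach 5) might look like the following sketch; the block sizes di = dj = dk = 64 are purely illustrative, and the attached file may partition differently:

! Sketch of approach 5: blocked multiplication. Each block of A and B
! is reused while still in cache; block sizes are illustrative only.
subroutine maprod_part(a, b, c, n)
   implicit none
   integer, intent(in) :: n
   real(8), intent(in)  :: a(n,n), b(n,n)
   real(8), intent(out) :: c(n,n)
   integer, parameter :: di = 64, dj = 64, dk = 64   ! tune to L1/L3
   integer :: i, j, k, i0, j0, k0
   c = 0
   do j0 = 1, n, dj
      do k0 = 1, n, dk
         do i0 = 1, n, di
            do j = j0, min(j0+dj-1, n)
               do k = k0, min(k0+dk-1, n)
                  do i = i0, min(i0+di-1, n)
                     c(i,j) = c(i,j) + a(i,k) * b(k,j)
                  end do
               end do
            end do
         end do
      end do
   end do
end subroutine maprod_part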
Hopefully these results show how each computation approach addresses the memory <> cache bottleneck. The approaches provided do show a significant improvement on the initial posting.
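For approach 6, the OpenMP idea is roughly as sketched below (again, the attached file may differ). Columns of C are independent, so the j loop is parallelised and each thread accumulates into a private column vector that, with -fstack-arrays, lives on its own thread stack:

! Sketch of approach 6: parallelise over columns of C; the private
! automatic array col is each thread's own accumulator.
subroutine maprod_omp(a, b, c, n)
   implicit none
   integer, intent(in) :: n
   real(8), intent(in)  :: a(n,n), b(n,n)
   real(8), intent(out) :: c(n,n)
   real(8) :: col(n)
   integer :: j, k
!$omp parallel do private(j, k, col) shared(a, b, c) schedule(static)
   do j = 1, n
      col = 0
      do k = 1, n
         col = col + a(:,k) * b(k,j)   ! accumulate column j of C
      end do
      c(:,j) = col
   end do
!$omp end parallel do
end subroutine maprod_omp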
I tested the program using gfortran, built with the following batch file:
del %1.exe
del %1.o
set vec=-march=native
set options=-fimplicit-none -O3 %vec% -fstack-arrays -ffast-math -fopenmp
gfortran %1.f90 %options% -o %1.exe
dir %1.* /od
I am sure there are better approaches available, and I would welcome suggestions so we can all learn from them.
test_matmulp.f90 (11.6 KB)