There is no such thing as a fast language. Given a computation spec and a machine, there is a stream of instructions that will achieve the goal in the shortest time. If the spec is expressible in terms the language provides (if it is Turing complete), then all compilers are in principle able to match the spec to that stream. How could it be otherwise?
Languages differ in the abstractions they provide. Fortran’s innovations include the Expression (in algebraic notation), the Array, the Assignment, the Do-Loop and the Association.
In a somewhat similar thread already mentioned by @ivanpribec in post #6 above, there was a discussion about including BLAS etc. into the C++ standard library, with some people prophesying the end of Fortran (a.k.a. EOF). But they will struggle with the very same problem. There is no single super-fast C/C++ function using assembler inlines that a general-purpose libstdc++ (or any other library) can ship, because they all have to support a huge variety of CPU hardware. So IMHO the only way is to provide several functions for different ‘-march’ or equivalent options.
That is not much different from what Fortran implementation would have to provide for intrinsics like matmul to make them really fast.
Why do you think that? On vector computers in the ’80s and ’90s it was also easy to write inefficient code, and to write code approaching the nominal peak performance you had to have a fair understanding of the characteristics of the hardware.
Using a high-level language like Fortran (or C, or C++) is always a compromise between performance and other criteria such as simplicity, readability, maintainability, portability, etc.
C/C++ or Julia are not “more performant” than Fortran. Again, you are showing in this thread that going down to the assembly (low) level is more performant, not that C/C++ is more performant.
And nobody objects that assembly is more performant.
Is that really true? Do people really pick the optimal language for every task based on speed? I think if that were true, then we would all be using assembler for everything.
The reason I look at such benchmarks is to understand the various algorithms, some of which are easier to implement in one language than another, or easier to envision in one language than another, and then see if I can implement that algorithm in whatever language I’m using for the task (usually fortran, but not always).
For example, some algorithms involving linked lists and tree structures used to be easier to implement in C than in fortran. That was because fortran 77 and earlier did not have allocatable arrays and pointers. You could do those things in f77, but you had to emulate that functionality with arrays and array indexes. Now modern fortran has those features, and one can argue that fortran is now better than C in some ways for those types of algorithms.
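To make that concrete, here is a minimal sketch of a singly linked list in modern Fortran, using a derived type with a pointer component (the module and procedure names are just illustrative):

```fortran
! Minimal singly linked list in modern Fortran (illustrative names).
module list_mod
  implicit none
  type :: node
     integer :: val
     type(node), pointer :: next => null()
  end type node
contains
  ! Push a new value onto the front of the list.
  subroutine push(head, val)
     type(node), pointer, intent(inout) :: head
     integer, intent(in) :: val
     type(node), pointer :: p
     allocate(p)
     p%val  = val
     p%next => head
     head   => p
  end subroutine push
end module list_mod
```

In f77 the `next` link would instead have been an integer index into a preallocated work array, which is exactly the kind of emulation described above.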
I don’t think that is what “Turing complete” means. I think it just means that the algorithm can be implemented, not that it can be implemented in the minimal number of machine instructions.
My postdoc advisor worked on some of the early electronic computers after WWII, such as the EDSAC at Cambridge. Everyone programmed in machine code then, not high level languages, and sometimes not even assembler. He said a mantra then was that “every program can be reduced by one instruction.” There are several implications of that statement. I knew him decades later, and he was still a good programmer even then.
In discussions like this, I’m always puzzled that people seem to ignore just how fast modern processors are compared to even 10 years ago. The response time of most algorithms people are going to run on a workstation is probably down in the milliseconds or less; these algorithms might have taken a minute or more on a 1980s Cray. HPC is the one area where speed still matters as much as accuracy, code maintenance, usability, etc. For most low- to mid-level computing, current processors are more than fast enough.

In my opinion you need to concentrate more on accuracy (are you getting the correct numbers) than on speed (are you getting a less accurate number faster). In other words, get the algorithm producing the most accurate answers you can, and then worry about making it faster (but only if you plan to run HPC exascale-size problems).
I am curious about the performance of this code in the original post after these heap allocations were removed. Did that improve performance, leave it the same, or what?
These arrays are allocated once and for all before the loops, and are significantly smaller than the sizes that are tested (this code seems to target large matrices), so I don’t expect any significant difference from allocating on the stack instead.
Yes, but those heap allocations are done within the timing step. It should be easy to change the declarations to
real(dp) :: A_block(MC*KC), B_block(KC*n)
to see if it makes any difference. It looks like the SAVE attribute could be added to A_block, which would make it statically allocated. One would need to look further at how B_block is used to see if it could be preallocated. So there are a couple of simple things to try in the fortran code, maybe there is some low-hanging fruit there.
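As a declaration-only sketch of those two options (MC and KC are assumed here to be named constants; a SAVEd array needs constant bounds, which is why B_block, whose size depends on n, would need a different treatment):

```fortran
integer, parameter :: MC = 256, KC = 256   ! assumed compile-time constants
real(dp), save :: A_block(MC*KC)  ! SAVE: static storage, allocated once
real(dp)       :: B_block(KC*n)   ! automatic array: allocated on entry, typically on the stack
```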
I think there are two major differences. MATMUL can take arguments with arbitrary strides in both columns and rows, whereas the DGEMM array arguments must be contiguous within the columns. With nonunit-stride arguments, the compiler must add an extra copy-in/copy-out for argument association with DGEMM, and that allocation and copy would also affect any timings in these cases. The other difference, mentioned previously, is that MATMUL is a function, while DGEMM is a subroutine. The function interface is a disadvantage for performance because of the memory allocation inherently required for the function result. One hopes that compiler optimization will eliminate that when possible, but it is a potential disadvantage nonetheless. An intrinsic subroutine interface that combines the best features of MATMUL and DGEMM would be a welcome addition for programmers interested in performance.
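The two interfaces side by side, in a minimal sketch (assumes linking against some BLAS implementation for dgemm):

```fortran
program matmul_vs_dgemm
  implicit none
  integer, parameter :: dp = kind(1.0d0), n = 100
  real(dp) :: A(n,n), B(n,n), C(n,n)
  external :: dgemm
  call random_number(A); call random_number(B)
  ! Intrinsic function: the compiler may need a temporary for the result
  ! before assigning it to C.
  C = matmul(A, B)
  ! BLAS subroutine: writes into C in place, no function-result temporary;
  ! but the array arguments must be contiguous within each column, so a
  ! strided section such as A(1::2,:) would force copy-in/copy-out here.
  call dgemm('N', 'N', n, n, n, 1.0_dp, A, n, B, n, 0.0_dp, C, n)
end program matmul_vs_dgemm
```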
I’m not sure if anyone has mentioned this, but you’ve written some very nice Fortran code @Said-H minimizing cache misses.
I think I’ve traced the problem of why Fortran is slower to the following lines:
! accumulate into cols 0..3
cr(1:4,1,r) = cr(1:4,1,r) + arow*b0(1:4)
! accumulate into cols 4..7
cr(1:4,2,r) = cr(1:4,2,r) + arow*b1(1:4)
For some reason, gfortran refuses to vectorize these operations. Maybe it thinks the iteration count is too low to make it worthwhile.
On my system with
gfortran -Ofast -march=native -mtune=native -fopenmp -ffpe-summary=none compare_dgemm.f90 -o compare_dgemm dgemm_kernel.o -lmkl_gf_lp64 -lmkl_gnu_thread -lmkl_core -lpthread
I get
==================================================
Matrix size: 1000
Fortran(s): 0.057000
C-Intrinsics(s): 0.034000
OpenBLAS(s): 0.032000
Speedup (OpenBLAS): 1.781250
Speedup (C-Intrinsics): 1.676471
Error (Fortran): 0.000000
Error (C): 0.000000
==================================================
Matrix size: 1500
Fortran(s): 0.190000
C-Intrinsics(s): 0.116000
OpenBLAS(s): 0.109000
Speedup (OpenBLAS): 1.743119
Speedup (C-Intrinsics): 1.637931
Error (Fortran): 0.000000
Error (C): 0.000000
==================================================
Matrix size: 2000
Fortran(s): 0.446000
C-Intrinsics(s): 0.277000
OpenBLAS(s): 0.259000
Speedup (OpenBLAS): 1.722008
Speedup (C-Intrinsics): 1.610108
Error (Fortran): 0.000000
Error (C): 0.000000
which is consistent with your results.
However, replacing the lines with
! accumulate into cols 0..3
!$OMP SIMD SIMDLEN(4)
do q=1,4
cr(q,1,r)=cr(q,1,r)+arow*b0(q)
end do
! accumulate into cols 4..7
!$OMP SIMD SIMDLEN(4)
do q=1,4
cr(q,2,r)=cr(q,2,r)+arow*b1(q)
end do
yields
==================================================
Matrix size: 1000
Fortran(s): 0.037000
C-Intrinsics(s): 0.034000
OpenBLAS(s): 0.032000
Speedup (OpenBLAS): 1.156250
Speedup (C-Intrinsics): 1.088235
Error (Fortran): 0.000000
Error (C): 0.000000
==================================================
Matrix size: 1500
Fortran(s): 0.120000
C-Intrinsics(s): 0.116000
OpenBLAS(s): 0.109000
Speedup (OpenBLAS): 1.100917
Speedup (C-Intrinsics): 1.034483
Error (Fortran): 0.000000
Error (C): 0.000000
==================================================
Matrix size: 2000
Fortran(s): 0.283000
C-Intrinsics(s): 0.276000
OpenBLAS(s): 0.259000
Speedup (OpenBLAS): 1.092664
Speedup (C-Intrinsics): 1.025362
Error (Fortran): 0.000000
Error (C): 0.000000
bringing the three codes to almost the same performance.
I tried several gfortran vectorization options to see if the same could be obtained without SIMD instructions, but no luck so far (perhaps someone more knowledgeable could figure this out).
In short (and as others have mentioned), I think the issue is not with Fortran the language, but rather with the choices the compiler makes in vectorizing the code. Fortran does have several native language features that make optimization easier, such as the no-aliasing rules for arguments, pure subroutines, do concurrent, array operations, and the vector/matrix intrinsics.
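Several of those hints can be seen together in a short sketch (a hypothetical routine, not from the code under discussion):

```fortran
! PURE promises no side effects; the Fortran argument-association rules
! already forbid x and y from aliasing; DO CONCURRENT asserts that the
! iterations are independent, so the compiler is free to vectorize.
pure subroutine axpy(alpha, x, y)
  implicit none
  real, intent(in)    :: alpha, x(:)
  real, intent(inout) :: y(:)
  integer :: i
  do concurrent (i = 1:size(x))
     y(i) = y(i) + alpha*x(i)
  end do
end subroutine axpy
```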
Isn’t it also related to alignment?
!GCC$ vector also helps with vectorization.
==================================================
Matrix size: 1000
Fortran(s): 0.029000
C-Intrinsics(s): 0.026000
Error (Fortran): 0.000000
Error (C): 0.000000
==================================================
Matrix size: 1500
Fortran(s): 0.095000
C-Intrinsics(s): 0.087000
Error (Fortran): 0.000000
Error (C): 0.000000
==================================================
Matrix size: 2000
Fortran(s): 0.221000
C-Intrinsics(s): 0.212000
Error (Fortran): 0.000000
Error (C): 0.000000
Strangely, !$OMP SIMD SIMDLEN(4) doesn’t help at all with ifort (21). ifort (at least this version) is failing badly on this code.
Perhaps try
! accumulate into cols 0..3
cr(:,1,r) = cr(:,1,r) + arow*b0(:)
! accumulate into cols 4..7
cr(:,2,r) = cr(:,2,r) + arow*b1(:)
This may encourage AVX?
Thanks for the solution. It solved the problem on my machine. The only issue is figuring out the right optimization flags.
Using gfortran, did the Fortran implementation perform as well as the MKL-BLAS implementation on the Linux machine with a Xeon Gold 6348?
No, it’s still 2x behind (but it’s probably AVX-512 vs. 256-bit AVX, as you mentioned). But the Fortran version is now slightly faster than the C version (say 10%). This is with gcc 10.
Nice! I have tried to make the Fortran code more general to take advantage of AVX512 if available, by having a loop like this
!$OMP SIMD SIMDLEN(4)
do q=1,8
cr(q,r) = cr(q,r) + arow * bb(q)
end do
But the performance is 4 times slower than the C version. Achieving the same results on different architectures will likely be challenging; we need different implementations for different architectures.
What about SIMDLEN(8) (or no SIMDLEN at all) ?
The same problem
@PierU Strangely,
!$OMP SIMD SIMDLEN(4) doesn’t help at all with ifort (21). ifort (at least this version) is failing badly on this code.
I found that using !$OMP SIMD SIMDLEN(2) or !DIR$ VECTOR VECTORLENGTH(2) doubles the speed. But you’re right: ifort is struggling with this one. ifx performs better but still not as well as gfortran.
That statement, right there, is why fortran has needed a standard preprocessor (i.e. one that works the same everywhere) since the early 1980s. In addition to conditional compilation, it must also provide macros that can be set at compile time for these kinds of tasks (e.g. specifying the SIMDLEN value).
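For instance, with the cpp-style preprocessing that most compilers already offer (non-standard, but widely supported), a hypothetical fragment like this could be built with `gfortran -cpp -DSIMD_WIDTH=8 ...`:

```fortran
! Hypothetical: pick the SIMD width at compile time, e.g.
!   gfortran -cpp -fopenmp -DSIMD_WIDTH=8 ...
! Defaults to 4 when no -D flag is given.
#ifndef SIMD_WIDTH
#define SIMD_WIDTH 4
#endif
subroutine accum(cr, arow, bb)
  implicit none
  real(8), intent(inout) :: cr(SIMD_WIDTH)
  real(8), intent(in)    :: arow, bb(SIMD_WIDTH)
  integer :: q
  !$OMP SIMD SIMDLEN(SIMD_WIDTH)
  do q = 1, SIMD_WIDTH
     cr(q) = cr(q) + arow*bb(q)
  end do
end subroutine accum
```

A standard preprocessor would make this portable instead of depending on each compiler's `-cpp`/`-fpp` behavior.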