Why such convoluted (pun intended) code to perform something as simple as a convolution? And what about the OOP runtime overheads?
Here is my code for 1D convolution:
```fortran
!*******************************************************************************
subroutine sconv1D &
           (d,dfirst,dlast, &
            e,efirst,elast, &
            x,xfirst,xlast)
!*******************************************************************************
! x = x + d * e
!*******************************************************************************
implicit none

integer, intent(in)    :: dfirst, dlast
integer, intent(in)    :: efirst, elast
integer, intent(in)    :: xfirst, xlast
real,    intent(in)    :: d(dfirst:dlast), e(efirst:elast)
real,    intent(inout) :: x(xfirst:xlast)

integer :: id, ixmin, ixmax

do id = dfirst, dlast
   ixmin = max( xfirst, efirst+id )
   ixmax = min( xlast , elast +id )
   x(ixmin:ixmax) = x(ixmin:ixmax) + d(id)*e(ixmin-id:ixmax-id)
end do

end subroutine sconv1D
```
KISS… no OOP, and explicit shape arguments because they have less overhead than assumed shape… The 2D and 3D versions are essentially the same, just with more nested loops (see the sketch below).
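For illustration, here is a sketch of how the 2D version might look following the same pattern (my guess at the structure, not code from the original post): the innermost statement is still a vectorizable array operation over the first dimension, and the clipped index ranges handle the boundaries without any explicit padding.

```fortran
!*******************************************************************************
subroutine sconv2D &
           (d, dfirst1,dlast1, dfirst2,dlast2, &
            e, efirst1,elast1, efirst2,elast2, &
            x, xfirst1,xlast1, xfirst2,xlast2)
!*******************************************************************************
! x = x + d * e  (2D version)
!*******************************************************************************
implicit none

integer, intent(in)    :: dfirst1, dlast1, dfirst2, dlast2
integer, intent(in)    :: efirst1, elast1, efirst2, elast2
integer, intent(in)    :: xfirst1, xlast1, xfirst2, xlast2
real,    intent(in)    :: d(dfirst1:dlast1,dfirst2:dlast2)
real,    intent(in)    :: e(efirst1:elast1,efirst2:elast2)
real,    intent(inout) :: x(xfirst1:xlast1,xfirst2:xlast2)

integer :: id1, id2, ix2, ixmin1, ixmax1, ixmin2, ixmax2

do id2 = dfirst2, dlast2
   ixmin2 = max( xfirst2, efirst2+id2 )
   ixmax2 = min( xlast2 , elast2 +id2 )
   do id1 = dfirst1, dlast1
      ixmin1 = max( xfirst1, efirst1+id1 )
      ixmax1 = min( xlast1 , elast1 +id1 )
      ! accumulate along the contiguous first dimension
      do ix2 = ixmin2, ixmax2
         x(ixmin1:ixmax1,ix2) = x(ixmin1:ixmax1,ix2) &
                              + d(id1,id2)*e(ixmin1-id1:ixmax1-id1,ix2-id2)
      end do
   end do
end do

end subroutine sconv2D
```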
Not to mention this (auto-)vectorizes nicely. Say with `gfortran -O3 -march=skylake-avx512 -mprefer-vector-width=512`, the bulk of the work gets done in the hot loop:
```asm
.L5:
        vmovups     zmm0, ZMMWORD PTR [r15+rax]
        vfmadd213ps zmm0, zmm1, ZMMWORD PTR [rdx+rax]
        vmovups     ZMMWORD PTR [rdx+rax], zmm0
        add         rax, 64
        cmp         rdi, rax
        jne         .L5
```
I’m just becoming aware that there is an overhead when using assumed shape instead of explicit shape; I always thought that it did not matter. Does that overhead become more noticeable as the arrays get larger?
All my codes use assumed shape, with pretty small arrays (30x30 is a relatively huge dimension for my cases). But the routines are called millions of times, so there might be room for improvement by using explicit shape? I might do some tests later.
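An untested sketch of what such a test could look like (all names here are hypothetical). Note that timings like this depend heavily on whether the compiler inlines the callee, so putting the two routines in a separately compiled file, or disabling inlining, gives a fairer picture:

```fortran
program shape_overhead
   implicit none
   integer, parameter :: n = 30
   integer, parameter :: reps = 1000000
   real :: a(n,n), s1, s2
   integer(8) :: t0, t1, rate
   integer :: k

   call random_number(a)

   s1 = 0.0
   call system_clock(t0, rate)
   do k = 1, reps
      call add_explicit(a, n, n, s1)   ! explicit shape: address + sizes
   end do
   call system_clock(t1)
   print *, 'explicit shape:', real(t1-t0)/real(rate), 's'

   s2 = 0.0
   call system_clock(t0)
   do k = 1, reps
      call add_assumed(a, s2)          ! assumed shape: descriptor
   end do
   call system_clock(t1)
   print *, 'assumed shape: ', real(t1-t0)/real(rate), 's'

   print *, s1, s2   ! keep the results live so the loops are not removed

contains

   subroutine add_explicit(b, n1, n2, s)
      integer, intent(in) :: n1, n2
      real,    intent(in) :: b(n1,n2)
      real, intent(inout) :: s
      s = s + b(1,1) + b(n1,n2)
   end subroutine add_explicit

   subroutine add_assumed(b, s)
      real,    intent(in) :: b(:,:)
      real, intent(inout) :: s
      s = s + b(1,1) + b(size(b,1),size(b,2))
   end subroutine add_assumed

end program shape_overhead
```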
Yes, the main advantage of Fortran is that it’s easy for a domain expert (say a physicist) to write fast code.
That being said, Fortran should absolutely give the best performance. I looked at @gronki’s examples, but they are layers and layers of OOP, so it would be a long investigation to find why the Fortran version is slower, which I don’t have time for right now (you should compare gfortran against the same version of gcc, flang against the same version of clang, etc.). I recommend writing Fortran code like @PierU did above (or how @jkd2022 recommends). @PierU’s convolution code should look simpler than the corresponding C++ code, and it should be as fast or faster; otherwise the Fortran compilers must improve.
In my version, `kernel` is `allocatable`, therefore the array is guaranteed to be contiguous. It can also be guaranteed by the `contiguous` attribute, which I use in the procedural part of the code. Anyway, most compilers nowadays generate contiguous/non-contiguous branches. So I disagree that OOP introduces any overhead here, while it provides a much cleaner interface. (From my tests there was not much difference caused by that, but I should add such tests to the repo for comparison.) A subroutine with 10 arguments feels like 70s coding, but it must be a matter of preference, since I know many people who hate OOP interfaces! I think the art is to use them in the non-critical parts of the code (configuring the computation) and stick to procedural style in the critical parts (where we perform the computation).
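For reference, a minimal sketch of that pattern (hypothetical names, not the code from @gronki’s repo): with the `contiguous` attribute the compiler may assume unit stride on assumed-shape dummies, just as with explicit shape.

```fortran
! Sketch only: assumes size(signal) == size(out) + size(kernel) - 1
! ("valid"-style accumulation), purely to illustrate the attribute.
subroutine conv1d_contig(kernel, signal, out)
   implicit none
   real, intent(in),    contiguous :: kernel(:), signal(:)
   real, intent(inout), contiguous :: out(:)
   integer :: i, j
   do j = 1, size(kernel)
      do i = 1, size(out)
         out(i) = out(i) + kernel(j)*signal(i+j-1)
      end do
   end do
end subroutine conv1d_contig
```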
To be fair, the Fortran version performs faster with the gcc/gfortran combo compared to C/C++. The Intel optimizations seem to be much superior overall, but perhaps a bit better for the C/C++ compiler. I want to analyze the code with a profiler today and find the culprit. Looking at the assembly, both versions seem to be vectorized correctly, but the Fortran version is for some reason just a little bit slower.
Anyway, a new thread about “OOP overhead in Fortran” would be a better place for further discussion; I do not want to derail the topic of FAR++.
Contiguity and OOP are two orthogonal issues. Explicit shape dummy arrays are also guaranteed to be contiguous. And the compiler just has to pass an address in all cases, whereas with assumed shape it possibly has to create a full descriptor, depending on what the actual argument is.
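To make the cases concrete, here is a small self-contained sketch (hypothetical routine names) of what the compiler must arrange at each call site, to the best of my understanding:

```fortran
program call_sites
   implicit none
   real :: a(8), b(4,8)
   a = 1.0
   b = 2.0
   call expl(a, 8)        ! contiguous actual: just an address is passed
   call assd(a)           ! whole array: a descriptor, trivially built
   call expl(b(1,:), 8)   ! strided actual: the compiler copies in/out
   call assd(b(1,:))      ! strided actual: descriptor with a stride, no copy
   print *, a(1), b(1,1)
contains
   subroutine expl(v, n)
      integer, intent(in) :: n
      real, intent(inout) :: v(n)
      v = v + 1.0
   end subroutine expl
   subroutine assd(v)
      real, intent(inout) :: v(:)
      v = v + 1.0
   end subroutine assd
end program call_sites
```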
Yes, explicit shapes require more arguments, but beyond preference it’s really a choice to ensure the least possible overhead in the calls. I also use assumed shapes whenever performance is not an issue. And although I do not often use OOP, I don’t hate it and use it sometimes. But, frankly, the 70s coding style requires just 20 lines of code here, and is easy to understand, to maintain (assuming there is something to maintain), and to use. What else?
I find that it is the use of derived types that is mostly responsible for reducing the number of subroutine arguments. Closely related variables (scalars and arrays) are grouped together into just a few derived types, and then those derived types are passed as single arguments. Those derived types themselves can, of course, be scalars, assumed shape arrays, or explicit shape arrays. Fortran is very flexible in this regard. Explicit shape vs. assumed shape is a separate issue; the only question really is whether the bounds and dimensions are passed as separate arguments or carried by the assumed shape declaration. Unless the call is in a tight loop with varying size arrays, the overhead associated with constructing the array descriptor for the argument association is trivial.
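A sketch of that idea applied to the convolution above (the `kernel_t` type and its fields are my invention): the kernel array and its bounds travel together, so three arguments collapse into one.

```fortran
module conv_mod
   implicit none
   type :: kernel_t
      integer :: first, last
      real, allocatable :: d(:)   ! allocated as d(first:last)
   end type kernel_t
contains
   subroutine sconv1D_t(k, e, efirst, elast, x, xfirst, xlast)
      type(kernel_t), intent(in) :: k
      integer, intent(in)  :: efirst, elast, xfirst, xlast
      real, intent(in)     :: e(efirst:elast)
      real, intent(inout)  :: x(xfirst:xlast)
      integer :: id, ixmin, ixmax
      do id = k%first, k%last
         ixmin = max( xfirst, efirst+id )
         ixmax = min( xlast , elast +id )
         x(ixmin:ixmax) = x(ixmin:ixmax) + k%d(id)*e(ixmin-id:ixmax-id)
      end do
   end subroutine sconv1D_t
end module conv_mod
```

With `allocate(k%d(k%first:k%last))` done once at setup time, the call site becomes simply `call sconv1D_t(k, e,1,n, x,1,n)`.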
It depends. For instance, if such a 1D convolution routine, but with assumed shapes, is called on each column of a 2D array, the compiler may have to generate an array descriptor on each call:
```fortran
do j = 1, size(e,2)
   call sconv1d( f, e(:,j), x(:,j) )
end do
```
Moreover, passing the lower and upper bounds also makes the routine much more versatile.
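For example (my illustration, assuming it is compiled together with the `sconv1D` above), a symmetric kernel can be indexed around zero, so the output stays aligned with the input without any index shifting at the call site:

```fortran
program centered
   implicit none
   real :: d(-2:2), e(1000), x(1000)
   external :: sconv1D
   d = [0.1, 0.2, 0.4, 0.2, 0.1]   ! symmetric smoothing kernel
   call random_number(e)
   x = 0.0
   ! x(i) accumulates sum over id of d(id)*e(i-id): no shift needed
   call sconv1D(d,-2,2, e,1,1000, x,1,1000)
   print *, x(1), x(500), x(1000)
end program centered
```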
Subroutines with many arguments can still be easy to use if only a few of the arguments are required and those appear before the `optional` arguments. The alternative may be an inflexible subroutine with various parameters hard-coded internally.
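A minimal sketch of that pattern (the routine and argument names are made up): the required arguments come first, and the tuning knobs default sensibly when omitted.

```fortran
subroutine scale_shift(x, n, factor, offset)
   implicit none
   integer, intent(in)        :: n
   real,    intent(inout)     :: x(n)
   real, intent(in), optional :: factor, offset
   real :: f, o
   f = 1.0                          ! defaults when not supplied
   o = 0.0
   if (present(factor)) f = factor
   if (present(offset)) o = offset
   x = f*x + o
end subroutine scale_shift
```

`call scale_shift(x, n)` uses the defaults, while `call scale_shift(x, n, offset=0.5)` overrides just one knob via a keyword argument.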
I created a topic to continue this side discussion: Discussion about performance of OOP in Fortran