Discussion about performance of OOP in Fortran

Continuation of the discussion: Fortran Arrays in C++ - #17 by gronki

@PierU @Beliavsky @certik I continue here our discussion regarding pros/cons of OOP in Fortran.

I do not think @PierU implementation is easy to understand, quite the contrary. I can sacrifice fraction of a second on a heavy operation if that makes the interface simpler and helps to avoid an error. But again, it is really a matter of preference, and while I see that approach as worse, many people seem to prefer that straightforward style.

Certainly it will not be any faster, just less practical.

Regarding using optional arguments for subroutines with many parameters as @Beliavsky sugested, I agree. But I still think that the approach where we may use the derived type as an input to computation routine as its configuration is a good pattern. Additionally, it provides a great way to handle default values of parameters, and can be elegantly used with namelists to input/write the configuration. (Convolution example is perhaps a bit too simple to illustrate that.) This of course only makes sense for small scalars since the values need to be copied into the derived type. And OOP is basically just wraps function pointers to be invisible to a user, and allows implementing a strategy pattern.

I do not think OOP is the reason for the difference between Fortran and C++ here, since my C++ code is a direct translation of Fortran code and is also using virtual methods and inheritance to implement different convolution strategies. To be honest, the difference is only 15% and in the case of one compiler, yet I still think something can be learned.

1 Like

For heavy OOP codes I usually prefer C++ to Fortran. The syntax is simpler / shorter, and C++ usually delivers decent performance for OOP.

I agree that derived types / structs are sometimes useful to lump arguments together.

Regarding general OOP vs procedure/direct style, it’s a matter of preference. Both are tools that you can use. Usually a progression of a programmer is to start with procedural style; then learn OOP and use it everywhere; finally the last step is to realize that the procedural style is actually simply better in many cases, all things considered. This is true for both Fortran and C++. This is, btw, why some very experienced programmers often just use C.

7 Likes

I really like OOP in Fortran, (I use mostly the strategy design pattern in cases like: yaeos/src/models/residual_helmholtz/cubic/generic_cubic.f90 at main · ipqa-research/yaeos · GitHub), I think it helps a lot in cases when there is a need for heavy flexibility (for example, working with different mathematical models that have a similar interface but different behaviours). But, as time passed on I’m slowly migrating to using more and more the basic types of Fortran and just using OOP to wrap a bit those basic subroutines. This helps a lot to keep the strictly mathematical part of the code as just math, so it is a bit simpler to understand with a quick glace and probably easier for the compiler to optimize (I assume). I think a good middle-ground is keeping the number crunching part with basic intrinsic types (even if the number of arguments gets a bit ugly), and wrapping them with OOP to provide a nice high level interface.

In the case of convolution (area outside of my knowledge) if I compare PierU implementation and the one on the repo, I see the OOP implementation extremely convoluted and really hard to follow, feels a bit like Hello World Enterprise Edition · GitHub, but maybe I’m missing something since is not my area of knowledge. I think that it is important to always keep in mind when each kind of methodology (OOP or procedural) is the best one.

I have heard this from more than one person! Certainly part of it is that when you learn doing X, you do X all the time until you realize it is not always neccesary. I think I am still in my OOP phase, particularly that I am trying to accomplish a scriptable image processing pipeline in Fortran, and very heavy OOP design with a lot of inheritance is the only feasible way to achieve this.

I had a good laugh with this one! Honestly I wrote it a year ago when I was learning design patterns, and wanted to see how it can be applied in Fortran. I do not think it was that convoluted, but if a third person is telling that it was an overkill perhaps there is some truth here. Thanks for the link anyway, I see there are some improvements I can introduce. :smiley:

Good point noted! I was thinking about it yesterday. Another thing to note is usage of type vs class: the latter assumes polymorphism so inlining may not be possible. I wonder if using

class(obj_t), intent(in) :: obj
select type(obj)
type is (specific_type_t) ! note TYPE not CLASS
   call obj % hopefully_inlined_sub()
end select

My understanding was that as soon as a procedure has dummy arguments class and definitely with select type, the compiler is forced to determine everything at runtime. I may be wrong though, having not spent a great deal of time writing and analyzing assembly for Fortran OOP stuff.

1 Like

I made a small test case based on the my basic 1d convolution routine with explicit shapes. I made a version with assumed shapes, and then I built an OOP version (similar to @gronki) on top of that. So I’m testing 3 versions:

  • explicit shape interface
  • assumed shape interface
  • OOP interface (with explicit shapes, so one can separate the overheads due to the assumed shapes from the ones due to OOP).

In the assumed version, I declared the array arguments as contiguous, as not all compilers do test the contiguity at runtime to take advantage of it (it seems that gfortran doesn’t, since I have much better performances when contiguous is specified).

The test consists in repeatedly performing a 1d convolution of a vector of size N with a kernel of size L. The vectors are the columns of a 2D array. The number of columns m is determined such that the total amount of floating point operations (O(N\space L\space m)) is constant.

With gfortran (14) -Ofast -march=native -fopenmp (I am linking OpenMP just to get the omp_get_wtime() function):

  • there is a slight overhead of the assumed shape interface, which is observable only with very short vectors and kernels.
  • the overhead is even pushed down by using the link time optimization -flto
  • the is a more important overhead of the OOP layer, even if it also gets non observable with longer vectors/kernels; the link-time optimization also helps, but not that much

With ifort (21) -Ofast -march=native -fopenmp:

  • there is no overhead of the assumed shape interface, even without using the the inter procedural optimization -ipo
  • the overhead of the OOP layer is similar to the one with gfortran

So… On the one hand, if one deals only with long enough vectors/kernels, none of the overheads really matters. On the other hand, the old style explicit shapes routine always ensures optimum performances whatever the vector/kernel lengths and whatever the compiler (OK, I’ve tested only 2 of them…)

The test outputs (the ifort test is running on a more powerful machine than the gfortran test, hence the overall better performances):

$ gfortran -Ofast -march=native -fopenmp conv.f90 testconv.f90 && ./a.out
N=  50 L=  3 m=66666   explicit shape:   5.04247 sec.   429.331238
N=  50 L=  3 m=66666   assumed  shape:   5.67779 sec.   429.331238
N=  50 L=  3 m=66666   oop (explicit):  11.94931 sec.   429.331238
N= 100 L= 11 m= 9090   explicit shape:   3.50213 sec.   3415.54297
N= 100 L= 11 m= 9090   assumed  shape:   3.26650 sec.   3415.54297
N= 100 L= 11 m= 9090   oop (explicit):   4.10882 sec.   3415.54297
N=1000 L= 11 m=  909   explicit shape:   2.50628 sec.   2853.36157
N=1000 L= 11 m=  909   assumed  shape:   2.50305 sec.   2853.36157
N=1000 L= 11 m=  909   oop (explicit):   2.58195 sec.   2853.36157
N=1000 L=101 m=   99   explicit shape:   2.72735 sec.   28535.1992
N=1000 L=101 m=   99   assumed  shape:   2.80414 sec.   28535.1992
N=1000 L=101 m=   99   oop (explicit):   2.72536 sec.   28535.1992
$ 
$ gfortran -Ofast -march=native -flto -fopenmp conv.f90 testconv.f90 && ./a.out
N=  50 L=  3 m=66666   explicit shape:   4.39111 sec.   1358.89661
N=  50 L=  3 m=66666   assumed  shape:   4.77631 sec.   1358.89661
N=  50 L=  3 m=66666   oop (explicit):  10.60820 sec.   1358.89661
N= 100 L= 11 m= 9090   explicit shape:   2.96375 sec.   3414.63672
N= 100 L= 11 m= 9090   assumed  shape:   2.98961 sec.   3414.63672
N= 100 L= 11 m= 9090   oop (explicit):   3.89261 sec.   3414.63672
N=1000 L= 11 m=  909   explicit shape:   2.49971 sec.   1322.84998
N=1000 L= 11 m=  909   assumed  shape:   2.41724 sec.   1322.84998
N=1000 L= 11 m=  909   oop (explicit):   2.97342 sec.   1322.84998
N=1000 L=101 m=   99   explicit shape:   2.89645 sec.   22934.1426
N=1000 L=101 m=   99   assumed  shape:   2.73086 sec.   22934.1426
N=1000 L=101 m=   99   oop (explicit):   2.80207 sec.   22934.1426


$ ifort -Ofast -march=native -fopenmp conv.f90 testconv.f90 && ./a.out
N=  50 L=  3 m=66666   explicit shape:   3.22394 sec.   543.7036
N=  50 L=  3 m=66666   assumed  shape:   3.20795 sec.   543.7036
N=  50 L=  3 m=66666   oop (explicit):   7.10191 sec.   543.7036
N= 100 L= 11 m= 9090   explicit shape:   2.05390 sec.   1746.136
N= 100 L= 11 m= 9090   assumed  shape:   2.32972 sec.   1746.136
N= 100 L= 11 m= 9090   oop (explicit):   2.76197 sec.   1746.136
N=1000 L= 11 m=  909   explicit shape:   2.11248 sec.   2002.490
N=1000 L= 11 m=  909   assumed  shape:   2.15468 sec.   2002.490
N=1000 L= 11 m=  909   oop (explicit):   2.18171 sec.   2002.490
N=1000 L=101 m=   99   explicit shape:   1.32205 sec.   31236.11
N=1000 L=101 m=   99   assumed  shape:   1.58577 sec.   31236.11
N=1000 L=101 m=   99   oop (explicit):   1.33595 sec.   31236.11
$ 
$ ifort -Ofast -march=native -ipo -fopenmp conv.f90 testconv.f90 && ./a.out
N=  50 L=  3 m=66666   explicit shape:   3.22681 sec.   543.7036
N=  50 L=  3 m=66666   assumed  shape:   3.20301 sec.   543.7036
N=  50 L=  3 m=66666   oop (explicit):   6.16719 sec.   543.7036
N= 100 L= 11 m= 9090   explicit shape:   2.33493 sec.   1746.136
N= 100 L= 11 m= 9090   assumed  shape:   2.11606 sec.   1746.136
N= 100 L= 11 m= 9090   oop (explicit):   2.71851 sec.   1746.136
N=1000 L= 11 m=  909   explicit shape:   2.16270 sec.   2002.490
N=1000 L= 11 m=  909   assumed  shape:   2.15879 sec.   2002.490
N=1000 L= 11 m=  909   oop (explicit):   2.18736 sec.   2002.490
N=1000 L=101 m=   99   explicit shape:   1.57657 sec.   31236.11
N=1000 L=101 m=   99   assumed  shape:   1.59752 sec.   31236.11
N=1000 L=101 m=   99   oop (explicit):   1.60242 sec.   31236.11

conv.f90:

module phconv
implicit none

   type conv1D_t
      real, allocatable :: k(:)
   contains
      procedure :: set_kernel
      procedure :: apply
      procedure :: clear_kernel
   end type


contains

   subroutine set_kernel(this,k,kfirst,klast)
   class(conv1D_t), intent(inout) :: this
   integer, intent(in) :: kfirst, klast
   real, intent(in) :: k(kfirst:klast)
   allocate( this%k, source=k )
   end subroutine

   subroutine clear_kernel(this)
   class(conv1D_t), intent(inout) :: this
   deallocate( this%k )
   end subroutine
   
   subroutine apply(this,x,xfirst,xlast,y,yfirst,ylast)
   class(conv1D_t), intent(inout) :: this
   integer, intent(in) :: xfirst, xlast, yfirst, ylast
   real, intent(in) :: x(xfirst:xlast)
   real, intent(inout) :: y(yfirst:ylast)
   call sconv1D_es( this%k, lbound(this%k,1), ubound(this%k,1) &
                  , x, xfirst, xlast                           &
                  , y, yfirst, ylast                           )
   end subroutine

   
   !*******************************************************************************
   subroutine sconv1D_es      &
             (d,dfirst,dlast, &
              e,efirst,elast, &
              x,xfirst,xlast)
   !*******************************************************************************
   ! x = x + d * e
   !*******************************************************************************
   implicit none

   integer, intent(in) :: dfirst, dlast
   integer, intent(in) :: efirst, elast
   integer, intent(in) :: xfirst, xlast
   real,    intent(in) :: d(dfirst:dlast), e(efirst:elast)
   real, intent(inout) :: x(xfirst:xlast)

   integer :: id, ixmin, ixmax

   do id = dfirst, dlast
      ixmin = max( xfirst, efirst+id)
      ixmax = min( xlast,  elast +id)
      x(ixmin:ixmax) = x(ixmin:ixmax) + d(id)*e(ixmin-id:ixmax-id)
   end do

   end subroutine sconv1D_es

   
   !*******************************************************************************
   subroutine sconv1D_as(d,dfirst,e,efirst,x,xfirst)
   !*******************************************************************************
   ! x = x + d * e
   !*******************************************************************************
   implicit none

   integer, intent(in) :: dfirst
   integer, intent(in) :: efirst
   integer, intent(in) :: xfirst
   real,    intent(in), contiguous :: d(dfirst:), e(efirst:)
   real, intent(inout), contiguous :: x(xfirst:)

   integer :: id, ixmin, ixmax

   do id = dfirst, ubound(d,1)
      ixmin = max( xfirst     , efirst      +id)
      ixmax = min( ubound(x,1), ubound(e,1) +id)
      x(ixmin:ixmax) = x(ixmin:ixmax) + d(id)*e(ixmin-id:ixmax-id)
   end do

   end subroutine sconv1D_as

end module

testconv.f90:

program convtest
use omp_lib
use phconv
implicit none

real, allocatable :: f(:,:), x(:,:), y(:,:)
integer, parameter :: N(*) = [50, 100, 1000, 1000]
integer, parameter :: L(*) = [ 3,  11,   11,  101]

integer :: i, j, k, m, L2
double precision :: tic, toc

type(conv1D_t) :: ooconv

100 format (A,I4,X,A,I3,X,A,I5,3X,A,F10.5,X,A,3X,G0)

do i = 1, size(N)
   ASSOCIATE( N => N(i), L=>L(i) )
   
   L2 = L/2
   
   m = 10**7 / (N*L)
   
   allocate( x(N,m), y(N,m), f(0:L-1,m) )
   
   y(:,:) = 0.0
   call random_number(x)
   call random_number(f)
   
   tic = omp_get_wtime()
   
   do k = 1, 1000
   do j = 1, m
      
      call sconv1D_es( f(:,j)      , -L2,   L2 &
                     , x(:,j)      ,   1,    N &
                     , y(L2:N-L2,j),  L2, N-L2 )
      
   end do
   end do

   toc = omp_get_wtime()
   print 100, "N=", N, "L=", L, "m=", m, "explicit shape:", toc-tic, "sec.", y(N/2,1)

   y(:,:) = 0.0
   
   tic = omp_get_wtime()
   
   do k = 1, 1000
   do j = 1, m
      
      call sconv1D_as( f(:,j)      , -L2 &
                     , x(:,j)      ,   1 &
                     , y(L2:N-L2,j),  L2 )
      
   end do
   end do

   toc = omp_get_wtime()
   print 100, "N=", N, "L=", L, "m=", m, "assumed  shape:", toc-tic, "sec.", y(N/2,1)
   
   
   y(:,:) = 0.0

   tic = omp_get_wtime()
      
   do k = 1, 1000
   do j = 1, m
      
      call ooconv%set_kernel( f(:,j), -L2, L2 )
      call ooconv%apply( x(:,j), 1, N, y(L2:N-L2,j), L2, N-L2 )
      call ooconv%clear_kernel()
      
   end do
   end do

   toc = omp_get_wtime()
   print 100, "N=", N, "L=", L, "m=", m, "oop (explicit):", toc-tic, "sec.", y(N/2,1)

   deallocate( x, y, f )
   
   END ASSOCIATE
end do

end
3 Likes

The non_overridable attribute is helpful in this regard. It can have the effect of turning off dynamic dispatching, which I’m guessing is one of the things being referenced by “determine everything at runtime.”

1 Like

I suspect the majority of the problems people encounter with OOP are actually problems with Object-Oriented Design (OOD). OOP is just the mechanics of creating classes. OOD is about which classes one chooses to create and it makes all the difference between code clarity and code obfuscation. Choosing classes that match the abstractions of a given domain while preserving high performance is as much art as science. There are likely just as many examples of OOD aiding performance as there are of it hurting performance. Moving from synchronization (e.g., sync all) to semaphores via event_type is in my view one example of OO likely enhancing performance. Because directly manipulating raw atomics can be challenging, I think the idea of encapsulating a private atomic integer in an object and giving the user only a limited number of ways to interact with the private component, makes asynchronous programming much easier for many common cases and is an example of careful OOD aiding performance. There are many others.

Over time, I’ve come to value OOP mostly for packaging, organizing, and abstracting.

  • Packaging: use foo_m, only : foo_t imports the type and all of the procedures packaged with the type via the procedure bindings,
  • Organizing: most of the procedures that work on a specific type can be found in the bindings on that type,
  • Abstracting: defining derived types that represent the abstractions that are already used in everyday language in a domain gives users a way to think and talk about the code using terminology that they use in the domain more broadly.

For me at least, OOD is really challenging and I frequently have to make a second or third pass that involves a major overhaul before settling on a design that feels right. For example, two years into developing the library on which I currently spend most of my time, I engaged in an overhaul that touched nearly every type and nearly every procedure, sometimes in very significant ways. In this context, what I valued most about OOP/OOD is data privacy. Quite often I could rip the guts out of a derived type and completely restructure the components without impacting external code as long as I kept the procedure bindings the same. Over time, some of the bindings changed too, but the whole process could be much more incremental in a way that would have been much more difficult without private components.

The latter point especially matters to me when I’m new to a project. If I can’t make isolated changes without impacting a lot of code that I had no desire to change, I’ll likely give up before finishing the change.

3 Likes

Thank you for this amazing analysis! So i guess it confirms that the overhead of the assumed shape is small or none, and that OOP has some small overhead which should be considered for processing few elements but for large data pieces it does not matter.

I have redone my benchmark, and it turned out that ~10000 items is where the OOP performance seems to hit. Above that threshold, Fortran outperformed C++ for every compiler.

The mystery remains why C++ version did perfom better for short arrays (it also used virtual functions that could no be inlined).

As for my design of the library (with set_kernel and convolution procedures being separate), I wanted to achieve:

  • flexibility when it comes to used algorithm (FFT and CUDA convolution were in plans)
  • scenario when one kernel is used to process a big batch of data (and kernel preprocessing, such as inversion, padding or computing FFT also possibly adds some overhead and you don’t want to repeat it for every item)

This thread is very insightful and I will think how to make things simpler. :slight_smile:

How much is the impact of constantly allocating and de-allocating the kernel in the OOP version? Could that be what impacts the performance?

I think 15 % is too small of a difference to make meaningful conclusions without also looking at absolute performance. Sometimes the compilers just don’t generate the right instructions. Recently I was looking at a procedure involving sums like:

real(dp), parameter :: rhol = 0.001_dp, rhoh = 1.0_dp 
real(dp), allocatable :: c(:,:), rho(:,:), h(:,:,:)

! ... initialize

! within an outer time loop
   c = sum(h,dim=3)
   rho = rhol + c * (rhoh - rhol)

I got a ~50 % difference between gfortran and ifx because of how (un)successful they were in vectorizing (and presumably fusing) the array expressions.

When faced with complex APIs with several parameters I find it helpful to write helper functions (as internal procedures) with sensible defaults when available. For instance a convolution procedure with many arguments can be wrapped to use assumed-shape arrays:

    ! Convolution helper function, y = k * x
    subroutine conv(k,x,y)
        use convolution_lib, only: sconv1d
        real, intent(in), contiguous :: k(:), x(:)
        real, intent(inout), contiguous :: y(:)
        y = 0
        call sconv1d( &
            k,1,size(k), &
            x,1,size(x), &
            y,1,size(y))
    end subroutine

On the topic of performance of OOP, it would be interesting to have a Fortran variant of the Stepanov Benchmark: SingleSource/Benchmarks/Misc-C++/stepanov_v1p2.cpp - llvm-test-suite - Git at Google.

I found some related Fortran results at the page of Boleslaw K. Szymanski (work in collaboration with V. Decyk and C. Norton) from the late 90’s:

This is before F2003 introduced polymorphism.

If that is indeed the case, my guess is that C++ comes from the non-HPC world where doing a lot of work with smaller collections matters more. Consequently, compiler implementers spent more time thinking how to speed up even small things, and occasionally succeeded. Fortran implementers tend to care more about the largest single problems that will fit on a machine.

It should be noticeable. I did not include that part in the measurement in my benchmark, I think Pier did.

Good point… I actually tested that at the beginning, by allocating the kernel only once per inner loop, and didn’t see obvious timing differences, so I dropped the test. But later on I found that I was not correctly timing everything: I corrected that, but forgot testing again without repeated allocations and correct timings. Would be worth doing it !

I think that might have been true in the past, say in the 1980s with f77 and K&R C compilers. Back then, I do think that C compilers were better at optimizing function calls than fortran compilers, and probably for the reason you mention. Then f90 introduced array syntax, and there was nothing comparable within C or C++, so fortran compilers probably focused on optimizing array expressions rather than the equivalent operations done as function references. Certainly there is a lot of engineering and science done with size 3 vectors, but in modern fortran those operations would likely be done with the native array syntax rather than as functions. So maybe this performance difference in favor of C or C++ for functions with small length vectors is simply an artifact of that history?