Polymorphism and generic operators

Hi, I am having a hard time understanding how to combine polymorphism and generic operators.

See below my attempt at implementing an abstract class Vect and its addition, then providing an implementation Vect2D.
The code compiles without warnings using gfortran 11.5, 13.2, and 15.1, ifx 2025, and flang 21.1.0. It aborts at run time with gfortran but runs fine with flang and ifx:

bbserv:VectClass $ make test1 FC=gfortran
bbserv:VectClass $ ./test1
v2D1       :    0.00000000       1.00000000    
v2D2       :    3.00000000       4.00000000    
v2D3       :    3.00000000       4.00000000    
v2D1 + v2D2:  not a Vect2D
In file 'test1.F90', around line 19: Error allocating 140721480245456 bytes: Cannot allocate memory

Error termination. Backtrace:
#0  0x7f5a044238a0 in ???
#

vs.

bbserv:VectClass $ make test1 FC=ifx
bbserv:VectClass $ ./test1
v2D1       :   0.0000000E+00   1.000000    
v2D2       :    3.000000       4.000000    
v2D3       :    3.000000       4.000000    
v2D1 + v2D2:    3.000000       5.000000    
v1         :    3.000000       5.000000    

Is gfortran correct here? Am I missing something?

test1.F90 (487 Bytes)

m_VectClassAdd.F90 (1.6 KB)


Replying to myself:

I managed to make it work, but don’t quite understand why…
In the definition of the extended class Vect2D, if I replace:

    function Vect2DAdd(self, other)
        class(Vect2D), intent(in) :: self
        class(Vect), intent(in) :: other
        class(Vect), allocatable :: Vect2DAdd
        select type(o => other)
        type is (Vect2D)
            Vect2DAdd = Vect2D(self%X + o%X, self%Y + o%Y)
        end select
    end function Vect2DAdd

with

    function Vect2DAdd2(self, other) result(s)
        class(Vect2D), intent(in) :: self
        class(Vect), intent(in) :: other
        class(Vect), allocatable :: s
        select type(o => other)
        type is (Vect2D)
            s = Vect2D(self%X + o%X, self%Y + o%Y)
        end select
    end function Vect2DAdd2

my code runs fine with all compilers.

I was under the impression that both forms were strictly equivalent, are they not?

This looks like a gfortran compiler bug (of which there are quite a few on modern Fortran code).

I personally use ifx and especially flang-20 for object-oriented code. In my experience, LLVM/Flang is presently the compiler with the most reliable implementation when it comes to modern Fortran features.


flang ≥ 20 seems like the way to go indeed. I have not managed to compile mpich under macOS with it yet. Maybe that’ll motivate me more…

@certik and the LFortran team over at GitHub are presently also making great progress, with a redesign of LFortran’s OOP implementation.

This means that, in the future, we could not only have two open source compilers with reliable support for standard Fortran OOP, but also a compiler (LFortran) that supports some very needed extensions.

These are great times for object-oriented programming in Fortran.


I second that. flang 20 has surprised me by being the most reliable compiler for my heavily object-oriented code. (At least until I decided to move all the OOP part to C++ and only keep compute kernels in Fortran.)

[Updated: my timings for testDispatch were obviously wrong (it was calling AXPY 100 times instead of 100M times)]

Following up: I became interested in trying to evaluate the performance cost of polymorphism vs. overloading. I wrote a simple code implementing an abstract class Vect and a subclass Vect2D containing 2 reals. I then implemented left and right scalar multiplication and addition in three different ways: classical overloading without OO (testOverloading), manually dispatching to the proper implementation (testDispatch), and using full polymorphism (testPolymorphism). The code is available on GitHub at bourdin/fortranPolymorphismBenchmark.

See timing results for running 100M AXPY operations this way:

macOS:

  • gfortran GNU Fortran (Homebrew GCC 15.1.0) 15.1.0

  • running benchmark with N=100000000. Results will be saved in timing-gfortran.txt
    timing testPolymorphism
    Doing 100000000 AXPY
    Result:  2.18182E+00 6.18182E+00
    
    real	0m2.712s
    user	0m2.379s
    sys	0m0.179s
    
    timing testOverloading
    Doing 100000000 AXPY
    Result:  2.18182E+00 6.18182E+00
    
    real	0m1.264s
    user	0m1.101s
    sys	0m0.023s
    
    timing testDispatch
    Doing 100000000 AXPY
    Result:  2.18182E+00 6.18182E+00
    
    real	0m2.792s
    user	0m2.465s
    sys	0m0.178s
    
    
  • flang Homebrew flang version 21.1.0

  • running benchmark with N=100000000. Results will be saved in timing-flang.txt
    timing testPolymorphism
    Doing 100000000 AXPY
    Result:  2.18182E+00 6.18182E+00
    
    real	0m37.877s
    user	0m37.395s
    sys	0m0.342s
    
    timing testOverloading
    Doing 100000000 AXPY
    Result:  2.18182E+00 6.18182E+00
    
    real	0m7.447s
    user	0m7.185s
    sys	0m0.121s
    
    timing testDispatch
    Doing 100000000 AXPY
    Result:  2.18182E+00 6.18182E+00
    
    real	0m32.577s
    user	0m32.205s
    sys	0m0.237s
    
    
    

See the performance difference for testPolymorphism? In all fairness, this is probably a skewed test. I suspect that the gfortran-generated code has a memory leak at m_Vect2D.F90:25, m_Vect2D.F90:39, m_Vect2D.F90:51, and m_Vect2D.F90:62, while the flang-generated code does not, and that a lot of the overhead has to do with reallocation. I don’t know how to properly trace and inspect memory usage under macOS, which lacks valgrind.

linux rockylinux 9.5 (using older versions of gfortran and flang installed with spack):

  • flang AMD clang version 17.0.6 (CLANG: AOCC_5.0.0-Build#1377 2024_09_24)

  • running benchmark with N=100000000. Results will be saved in timing-flang.txt
    timing testPolymorphism
    Doing 100000000 AXPY
    Result:  2.18182E+00 6.18182E+00
    
    real	0m18.212s
    user	0m18.203s
    sys	0m0.003s
    
    timing testOverloading
    Doing 100000000 AXPY
    Result:  2.18182E+00 6.18182E+00
    
    real	0m0.823s
    user	0m0.819s
    sys	0m0.002s
    
    timing testDispatch
    Doing 100000000 AXPY
    
    real	0m4.689s
    user	0m4.685s
    sys	0m0.002s
    
  • gfortran GNU Fortran (Spack GCC) 13.2.0

  • running benchmark with N=100000000. Results will be saved in timing-gfortran.txt
    timing testPolymorphism
    Doing 100000000 AXPY
    Result:  2.18182E+00 6.18182E+00
    
    real	0m6.400s
    user	0m3.891s
    sys	0m2.505s
    
    timing testOverloading
    Doing 100000000 AXPY
    Result:  2.18182E+00 6.18182E+00
    
    real	0m0.670s
    user	0m0.667s
    sys	0m0.001s
    
    timing testDispatch
    Doing 100 AXPY
    
    Program received signal SIGSEGV: Segmentation fault - invalid memory reference.
    
    Backtrace for this error:
    #0  0x7f316cc3e72f in ???
    #1  0x0 in ???
    
    real	0m0.189s
    user	0m0.024s
    sys	0m0.016s
    
  • intel oneAPI ifx (IFX) 2025.0.0 20241008

  • running benchmark with N=100000000. Results will be saved in timing-ifx.txt
    timing testPolymorphism
    Doing 100000000 AXPY
    Result:  2.18182E+00 6.18182E+00
    
    real	0m34.264s
    user	0m34.229s
    sys	0m0.023s
    
    timing testOverloading
    Doing 100000000 AXPY
    Result:  2.18182E+00 6.18182E+00
    
    real	0m1.153s
    user	0m1.141s
    sys	0m0.001s
    
    timing testDispatch
    Doing 100000000 AXPY
    Result:  2.18182E+00 6.18182E+00
    
    real	0m35.863s
    user	0m35.845s
    sys	0m0.005s
    
    
    

Note that gfortran crashes on the testDispatch test, and that valgrind confirms a memory leak at the locations highlighted above while both flang and ifx–generated codes are valgrind clean.

Conclusions:

  1. gfortran outperforms all other compilers, but at the expense of a memory leak, which is problematic, to say the least.
  2. amongst compilers that generate correct code, AOCC flang outperforms ifx by quite a lot. I would be interested in seeing the same benchmark repeated with flang 21, which I am still having a hard time building on my cluster.
  3. I was surprised by the overhead of using polymorphism vs. overloading. It seems to me that the main issue is that it induces frequent memory reallocation. It is entirely possible that my implementation of addition and scalar multiplication as polymorphic functions is clumsy and that there is a way to avoid this reallocation.
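On point 3, one common way to sidestep the per-call allocation is to provide an in-place, subroutine-style API next to the operator. This is only a sketch under the assumption that both operands have the same dynamic type; the name `add_into` is hypothetical and not from the benchmark repository:

```fortran
! Sketch: in-place addition avoids allocating a new polymorphic result
! on every call, which is where much of the operator overhead goes.
module m_vect_inplace
   implicit none
   type :: Vect2D
      real :: X = 0.0, Y = 0.0
   contains
      procedure :: add_into => vect2d_add_into   ! hypothetical name
   end type Vect2D
contains
   subroutine vect2d_add_into(self, other)
      class(Vect2D), intent(inout) :: self
      class(Vect2D), intent(in)    :: other
      ! updates self in place: no temporary, no allocation
      self%X = self%X + other%X
      self%Y = self%Y + other%Y
   end subroutine vect2d_add_into
end module m_vect_inplace

program demo
   use m_vect_inplace
   implicit none
   type(Vect2D) :: a, b
   a = Vect2D(0.0, 1.0)
   b = Vect2D(3.0, 4.0)
   call a%add_into(b)
   print *, a%X, a%Y   ! 3.0  5.0
end program demo
```

The trade-off is that `call a%add_into(b)` is less readable than `a = a + b`, but it keeps the hot path allocation-free.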

I would definitely like to hear what the experts have to say here.

Regards,

Blaise

I just glanced over your examples, but from a very brief look, you are using polymorphism wrongly.

Your code is operating on just two reals and is full of select type statements, i.e. run-time type inspections, which under these circumstances incur significant overhead. Hence, any conclusions from this code regarding the performance of run-time polymorphism vs. overloading, or the performance of different compilers, are pretty much meaningless.

Operate on entire arrays instead (the larger the arrays, the better) and get rid of those pesky select type statements.

Dynamic dispatch has a certain latency, and your polymorphic methods need to do sufficient work for this latency to become negligible.

The same principles apply here as for optimization on vector computers. Operate (with your polymorphic methods) on Fortran arrays, not scalars!

Are you running these tests on an AMD CPU? If so, there is a good chance a compiler tuned for AMD Zen-family CPUs will give better performance than ifx. Although ifx (Intel) will usually give acceptable performance on AMD processors, Intel is under no obligation to perform the level of optimizations for AMD processors that they do for their native processors. I have had instances where ifort refused to support some optimizations on an AMD processor.

Actually, this was by design.

I tried to keep the computational load as light as possible (I could have removed all flops) to highlight the cost of ‘dynamic’ vs. ‘static’ dispatch. You could also argue that since I implemented only one class, the whole OO setup is pointless here.

Within the current infrastructure, i.e. a base class without any components and multiple implementations that may have completely different components, how would I implement scalar multiplication (the method scalMultL of Vect2D, implemented in Vect2DScalMultL) without the use of select type?

About acting on arrays vs. scalars: note that even in the context of a linear algebra class, this is not necessarily possible. Think of implementing Hooke’s laws, i.e. fourth-order tensors with various levels of symmetry. Addition of Hooke’s laws does not necessarily translate into addition of their defining parameters. For instance, if A1 corresponds to a Hooke’s law with Young modulus E1 and Poisson ratio nu1, and A2 to E2 and nu2, then A1+A2 does not have elastic moduli E1+E2 and nu1+nu2, so there would be no advantage in representing a Hooke’s law defined by E and nu as an array of length 2.
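To make that point concrete, here is a standalone sketch (not part of the benchmark code, and the values are illustrative): for isotropic elasticity, addition of two stiffness tensors is componentwise on the Lamé parameters (lambda, mu), and converting the sum back to (E, nu) does not give the componentwise sums of E and nu.

```fortran
program hooke_sum
   implicit none
   real :: E1, nu1, E2, nu2, lam, mu, Es, nus
   E1 = 1.0; nu1 = 0.3
   E2 = 2.0; nu2 = 0.2
   ! convert each (E, nu) to Lame parameters, where addition IS componentwise
   lam = lame_lambda(E1, nu1) + lame_lambda(E2, nu2)
   mu  = lame_mu(E1, nu1)     + lame_mu(E2, nu2)
   ! convert the summed stiffness tensor back to (E, nu)
   Es  = mu * (3.0*lam + 2.0*mu) / (lam + mu)
   nus = lam / (2.0 * (lam + mu))
   print *, 'E of sum :', Es,  ' vs E1+E2  :', E1 + E2
   print *, 'nu of sum:', nus, ' vs nu1+nu2:', nu1 + nu2
contains
   real function lame_lambda(E, nu)   ! first Lame parameter
      real, intent(in) :: E, nu
      lame_lambda = E * nu / ((1.0 + nu) * (1.0 - 2.0*nu))
   end function lame_lambda
   real function lame_mu(E, nu)       ! shear modulus
      real, intent(in) :: E, nu
      lame_mu = E / (2.0 * (1.0 + nu))
   end function lame_mu
end program hooke_sum
```

Running this shows `E of sum` close to, but not equal to, `E1+E2`, and `nu of sum` clearly different from `nu1+nu2`.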

This is exactly what I was trying to estimate, i.e. in a large code, what is the deepest level at which OO can lead to simplified, more readable code without too much performance loss.

Fair point. I will try to find an Intel machine. I could also have manually selected the processor type. If I use FFLAGS="-O2 -march=core-avx2" for ifx, the outcome does not change significantly:

running benchmark with N=100000000. Results will be saved in timing-ifx.txt
timing testPolymorphism
Doing 100000000 AXPY
Result:  2.18182E+00 6.18182E+00

real	0m35.477s
user	0m35.464s
sys	0m0.001s

timing testOverloading
Doing 100000000 AXPY
Result:  2.18182E+00 6.18182E+00

real	0m1.167s
user	0m1.154s
sys	0m0.003s

timing testDispatch
Doing 100000000 AXPY
Result:  2.18182E+00 6.18182E+00

real	0m35.789s
user	0m35.764s
sys	0m0.009s

For what I was trying to highlight, a better metric would have been the ratio between the execution time of the polymorphic version and that of the overloading version anyway.

As far as I’ve seen, the need for select type statements in your code stems from employing user-defined operators. To get rid of these statements, you’d need to get rid of these operators.

It is always possible – even in your present code. You have that do i = 1, N loop in your main program that runs over the axpy operations. Push that loop into your polymorphic method(s) so that they operate over arrays of size N, instead of having some scalar polymorphic method(s) called N times in that loop.
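As a sketch of that restructuring (the names `axpy_all` and the flat component layout are hypothetical, not from the benchmark repository): the type-bound method receives the whole batch and loops internally, so dynamic dispatch happens once instead of N times.

```fortran
module m_vec_batch
   implicit none
   type, abstract :: Vect
   contains
      ! one virtual call for the whole batch, not one per element
      procedure(axpy_all_iface), deferred :: axpy_all
   end type Vect

   abstract interface
      subroutine axpy_all_iface(self, a, x, y, n)
         import :: Vect
         class(Vect), intent(in)    :: self
         real,        intent(in)    :: a
         real,        intent(in)    :: x(:,:)   ! n columns of component data
         real,        intent(inout) :: y(:,:)
         integer,     intent(in)    :: n
      end subroutine axpy_all_iface
   end interface

   type, extends(Vect) :: Vect2D
   contains
      procedure :: axpy_all => vect2d_axpy_all
   end type Vect2D

contains

   subroutine vect2d_axpy_all(self, a, x, y, n)
      class(Vect2D), intent(in)    :: self
      real,          intent(in)    :: a
      real,          intent(in)    :: x(:,:)
      real,          intent(inout) :: y(:,:)
      integer,       intent(in)    :: n
      integer :: i
      ! the hot loop lives inside the method: no dispatch per element
      do i = 1, n
         y(:, i) = y(:, i) + a * x(:, i)
      end do
   end subroutine vect2d_axpy_all

end module m_vec_batch

program demo
   use m_vec_batch
   implicit none
   class(Vect), allocatable :: v
   real :: x(2,3), y(2,3)
   allocate(Vect2D :: v)
   x = 1.0; y = 0.0
   call v%axpy_all(2.0, x, y, 3)
   print *, y(:,1)   ! 2.0  2.0
end program demo
```

The cost is that the subclass must expose (or internally manage) a contiguous layout for its component data, which is exactly the design tension being discussed in this thread.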

This is purely a code structuring question. Hooke’s law is irrelevant in this context.

Indeed, employing user-defined operators is precisely what I would like to do. I want to be able to reuse the same code for various data types. For instance, I would like to be able to solve the scalar- and vector-valued heat equation in 2D or 3D using the exact same code.

That said, not all codes can be rewritten this way.

What if each Vect is just part of a long process? What if they are just a small kernel and not all are executed the same number of times? Think of the embarrassingly parallel problem of solving N ODEs using an explicit solver, where each ODE may not converge at the same rate. Or adaptive integration of a constitutive law at each integration point.

I see your point. I misunderstood your comment as suggesting to use arrays inside each Vect vs. doing operations on arrays of Vect.

All of them can. People have been vectorizing codes since the 1970s.

That’s what gather/scatter operations are good for.
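A minimal sketch of that gather/scatter idea (the data and the "solver step" are purely illustrative): pack the still-active systems into contiguous arrays, step them as a batch, and scatter the results back.

```fortran
program gather_scatter
   implicit none
   integer, parameter :: n = 6
   real    :: y(n), y_act(n)
   logical :: active(n)
   integer :: idx(n), m, i
   y = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
   ! pretend systems 2 and 5 have already converged
   active = [ .true., .false., .true., .true., .false., .true. ]
   ! gather: collect indices of the unconverged systems
   m = 0
   do i = 1, n
      if (active(i)) then
         m = m + 1
         idx(m) = i
      end if
   end do
   y_act(1:m) = y(idx(1:m))
   ! one batched "solver step" on packed, contiguous data
   y_act(1:m) = y_act(1:m) * 2.0
   ! scatter: write the updated values back in place
   y(idx(1:m)) = y_act(1:m)
   print *, y    ! 2.0  2.0  6.0  8.0  5.0  12.0
end program gather_scatter
```

As the active set shrinks, only the packed portion is processed, so the batch stays dense even when the systems converge at different rates.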

Look, I didn’t say it is easy. Sometimes it may require a different algorithm. But it can always be done, if one is trying to solve a sufficiently large problem – and all HPC problems are by definition “sufficiently large”.

EDIT: The second post by OP already mentioned this! I just started writing after reading the first post, so sorry for a redundant post… :sweat_smile: (I will keep this post anyway for a bit more info.)

Hi @Blaise, I guess the segmentation fault is caused by the use of the function name as the result variable (which seems to be a compiler bug that occurs for “complicated” result variables). If the result variable is declared explicitly as below, the program seems to run correctly. Also, polymorphic allocation upon assignment might trigger a compiler bug in old gfortran, so in that case “sourced allocation” might be a workaround if really necessary.

!! m_VectClassAdd.F90

    function Vect2DAdd(self, other) result(res)  !! <---
        class(Vect2D), intent(in) :: self
        class(Vect), intent(in) :: other
        !! class(Vect), allocatable :: Vect2DAdd  !! (buggy)
        class(Vect), allocatable :: res   !! <---
        select type(o => other)
        type is (Vect2D)
            !! Vect2DAdd = Vect2D(self%X + o%X, self%Y + o%Y)
            res = Vect2D(self%X + o%X, self%Y + o%Y)  !! <---
            !! allocate( res, source = Vect2D(self%X + o%X, self%Y + o%Y) ) !! <--- (alt)
        end select
    end function Vect2DAdd

(the following replacement seems not necessary but anyway…)

!! test1.F90

    !!! This crashes gfortran but neither flang nor ifx
    v1 = v2D1 + v2D2
    !! allocate( v1, source = v2D1 + v2D2 )  !! <--- (alt)

Result (gfortran-15.1 installed via Homebrew on Mac M1):

v2D1       :    0.00000000       1.00000000    
v2D2       :    3.00000000       4.00000000    
v2D3       :    3.00000000       4.00000000    
v2D1 + v2D2:    3.00000000       5.00000000    
v1         :    3.00000000       5.00000000 

BTW, if I compile the original codes with flang-21.1 (on the same Mac), it gives this error:

error: Semantic errors in test1.F90
././m_vect2d.mod:9:20: error: A PRIVATE procedure may not override
an accessible procedure
  procedure,private::add=>vect2dadd
                     ^^^
././m_vectclass.mod:5:54: Declaration of 'add'
  procedure(addinterface),pass(self),deferred,private::add
                                                       ^^^

If private is commented out, then I get this result (same as above):

v2D1       :  0. 1.
v2D2       :  3. 4.
v2D3       :  3. 4.
v2D1 + v2D2:  3. 5.
v1         :  3. 5.