Comments in ResearchGate about Fortran

Hi everyone,
I came across this conversation in ResearchGate about Fortran and c++.
https://www.researchgate.net/post/Is_FORTRAN_a_good_choice_for_scientific_computing
Please read the comments by [Simon Schröder]. He says that he heard from a person belonging to Fortran standardization committee that when we use ‘OO’ features of Fortran - it slows by 100 times !! Is it true ? Any comments…

3 Likes

I don’t know about OO, but ResearchGate is an important place where we should present the community’s work: Fortran-lang.org, stdlib, fpm, LFortran, etc. They can be tools for research.
When you are on ResearchGate, you can create a “Project” where you present the objectives and you can post news about the advancing of your project.
You can also add your FortranCon2020 presentations.

Are there any documents that consolidate what the current compilers definitely support in Modern Fortran ? Because there are so many “can do’s” in Modern Fortran… We should not leave it for people to figure them out after going through a long journey into their project or at least conceptualization of the project. It would be nice if they know what the compilers support before conceptualization of the project. Even for “OO” we have to have some benchmarks…

1 Like

I have not checked the conversation yet, but I would be surprised if it was that bad. I know that some uses of OO-features can cause a performance drop in combination with a particular compiler, but I never heard of such an impact. Performance measurements are always tricky and a bad/clumsy/careless implementation can cause bad results. For instance: iterating over an array is fast, but not OO, so an OO approach might use instead a linked-list and iterate over the elements of the list. That will definitely cause a slow-down.

Nevertheless, this is an intriguing point and I think we should have some benchmarks, like you suggest. They should be carefully constructed,

Hi Ashok,

I have seen such performance tests before in the course “Programming with Fortran” offered at the Leibniz-Rechenzentrum. Specifically you can check the sections “Further performance aspects and use of Parameterized derived types” in the course slides.

The problem typically boils down to the issue of using an array of structures (AoS) or a structure of arrays (SoA). Imagine you are doing a particle simulation. Each particle can be represented as an instance of the derived type:

type :: body
  character(len =4) :: units
  real :: mass
  real :: pos(3), vel(3)
end type body

In your main code you will then allocate an array of bodies:

type(body), allocatable :: traj(:)
allocate(traj(ntraj))

Alternatively, you can fold the array properties into the derived type:

type :: body_p( k, ntraj)
  integer, kind :: k = kind(1.0)
  integer, len ::   ntraj = 1
  character(len=4) :: units
  real(kind=k) :: mass(ntraj)
  real(kind=k) :: pos(ntraj,3),   vel (ntraj,3)
end type body_p

In the main code you would use this structure of arrays as follows:

type( body_p(ntraj=:) ), allocatable ::  dyn_traj
allocate(body_p(ntraj=20) :: dyn_trag)

What is not immediately obvious is the two objects differ in their memory alignment as illustrated by the image below:

Depending what you are doing with the particles, the size of your array, compiler, etc. - the contiguous memory layout in the SoA format can potentially lead to improved vectorization, resulting in faster executables.

In some cases the poor performance of derived types, could be simply due to an immature compiler implementation, which doesn’t manage to exploit all the vectorization opportunities.

7 Likes

Schröder later writes

One huge disadvantage of Fortran is that it does not have static type checking. This is really bad for subroutine calls. A while ago we turned on interface checking in Fortran. It took us quite long to get our software to compile again with interface checking turned on. Some subroutine calls were totally wrong (somehow the software still worked correctly). You should have as much static checking as possible. Fortran as a language does not have enough, in my opinion.

It certainly does from Fortran 90 on, if you put your procedures in MODULEs, as is widely advised. I don’t think the author is too familiar with modern Fortran.

There was a 2013 thread in the Intel Fortran forum about Speed loss using object oriented features finding that OO features could increase the execution time of a program by a factor of 1.6. That is a more plausible number than the 100 number from hearsay. The code there (removing a PAUSE statement) is

module module_A_class
    implicit none
    type A
    contains
        procedure , public :: timetest => timetest_A
    end type A        
    contains
    subroutine timetest_A (this,ap,bp,cp,dp)
        implicit none
        class (A) , intent (in) :: this
        double precision , intent (in) :: ap,bp,cp,dp
        double precision :: res
        res = ap * bp * cp * dp
    end subroutine 
end module module_A_class

module module_AA_class
    use module_A_class
    implicit none
    type, extends (A) :: AA
    contains
        procedure , public :: timetest => timetest_AA
    end type AA          
contains
    subroutine timetest_AA (this,ap,bp,cp,dp)
        implicit none
        class (AA) , intent (in) :: this
        double precision , intent (in) :: ap,bp,cp,dp
        double precision :: res
        res = ap * bp * cp * dp
    end subroutine 
end module module_AA_class

module module_B_class
    use module_AA_class
    implicit none
    type B
        class (AA) , pointer , public :: AA_type
    contains    
        procedure , public :: timetest => timetest_B
    end type B        
    contains
    subroutine timetest_B (this,ap,bp,cp,dp)
        implicit none
        class (B) , intent (in) :: this
        double precision , intent (in) :: ap,bp,cp,dp
        double precision :: res
        res = ap * bp * cp * dp
    end subroutine 
end module module_B_class    

program test
    use module_A_class
    use module_AA_class
    use module_B_class

    double precision :: t1 , t2
    type (A) :: class_A
    type (AA) :: class_AA
    type (B) :: class_B
    type (AA) , target :: class_AA_target
    
    class_B%AA_type => class_AA_target
    
    call cpu_time ( t1 )
    do i = 1, 100000000
        call timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0)
    end do
    call cpu_time ( t2 )
    write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In Main'

    
    call cpu_time ( t1 )
    do i = 1, 100000000
        call class_A%timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0)
    end do
    call cpu_time ( t2 )
    write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In A'

    
    call cpu_time ( t1 )
    do i = 1, 100000000
        call class_AA%timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0)
    end do
    call cpu_time ( t2 )
    write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In AA'
    
    call cpu_time ( t1 )
    do i = 1, 100000000
        call class_B%AA_type%timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0)
    end do
    call cpu_time ( t2 )
    write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In B%AA_type'
    
    
    call cpu_time ( t1 )
    do i = 1, 100000000
        call class_B%timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0)
    end do
    call cpu_time ( t2 )
    write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In B'
    
contains

    subroutine timetest (ap,bp,cp,dp)
        implicit none
        double precision , intent (in) :: ap,bp,cp,dp
        double precision :: res
        res = ap * bp * cp * dp
    end subroutine 
end program

Intel Fortran now says the execution time is zero for all versions, because the calculations can be skipped since the results are not used. Gfortran on Windows gives

 Elapsed CPU time =   0.25000000000000000      In Main
 Elapsed CPU time =   0.29687500000000000      In A
 Elapsed CPU time =   0.28125000000000000      In AA
 Elapsed CPU time =   0.35937500000000000      In B%AA_type
 Elapsed CPU time =   0.29687500000000000      In B

which is not a huge speed penalty. Even when I modify the program to use res

module module_A_class
    implicit none
    type A
    contains
        procedure , public :: timetest => timetest_A
    end type A        
    contains
    subroutine timetest_A (this,ap,bp,cp,dp,res)
        implicit none
        class (A) , intent (in) :: this
        double precision , intent (in) :: ap,bp,cp,dp
        double precision, intent(out) :: res
        res = ap * bp * cp * dp
    end subroutine 
end module module_A_class

module module_AA_class
    use module_A_class
    implicit none
    type, extends (A) :: AA
    contains
        procedure , public :: timetest => timetest_AA
    end type AA          
contains
    subroutine timetest_AA (this,ap,bp,cp,dp,res)
        implicit none
        class (AA) , intent (in) :: this
        double precision , intent (in) :: ap,bp,cp,dp
        double precision, intent(out) :: res
        res = ap * bp * cp * dp
    end subroutine 
end module module_AA_class

module module_B_class
    use module_AA_class
    implicit none
    type B
        class (AA) , pointer , public :: AA_type
    contains    
        procedure , public :: timetest => timetest_B
    end type B        
    contains
    subroutine timetest_B (this,ap,bp,cp,dp,res)
        implicit none
        class (B) , intent (in) :: this
        double precision , intent (in) :: ap,bp,cp,dp
        double precision, intent(out) :: res
        res = ap * bp * cp * dp
    end subroutine 
end module module_B_class    

program test
    use module_A_class
    use module_AA_class
    use module_B_class

    double precision :: t1 , t2, res
    type (A) :: class_A
    type (AA) :: class_AA
    type (B) :: class_B
    type (AA) , target :: class_AA_target
    
    class_B%AA_type => class_AA_target
    
    call cpu_time ( t1 )
    do i = 1, 100000000
        call timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0, res)
    end do
    if (res < 0) print*,"res=",res
    call cpu_time ( t2 )
    write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In Main'

    
    call cpu_time ( t1 )
    do i = 1, 100000000
        call class_A%timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0, res)
    end do
    if (res < 0) print*,"res=",res
    call cpu_time ( t2 )
    write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In A'

    
    call cpu_time ( t1 )
    do i = 1, 100000000
        call class_AA%timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0, res)
    end do
    if (res < 0) print*,"res=",res
    call cpu_time ( t2 )
    write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In AA'
    
    call cpu_time ( t1 )
    do i = 1, 100000000
        call class_B%AA_type%timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0, res)
    end do
    if (res < 0) print*,"res=",res
    call cpu_time ( t2 )
    write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In B%AA_type'
    
    
    call cpu_time ( t1 )
    do i = 1, 100000000
        call class_B%timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0, res)
    end do
    if (res < 0) print*,"res=",res
    call cpu_time ( t2 )
    write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In B'
    
contains

    subroutine timetest (ap,bp,cp,dp,res)
        implicit none
        double precision , intent (in) :: ap,bp,cp,dp
        double precision, intent(out) :: res
        res = ap * bp * cp * dp
    end subroutine 
end program

Intel Fortran detects that the calculations do not need to be looped over and returns zero execution times.

3 Likes

Building on my previous post, the LRZ course slides show the following performance comparison:

Indeed one can see a two- to four-fold difference (and not 100!) as a result of memory layout issues linked to usage of derived types.

As @Beliavsky notes, the author of the ResearchGate post made other erroneous comments. Without concrete examples, it is impossible to say if and what went wrong in his case, or if it was simply an exaggeration.

4 Likes

Here is a version of the program where Intel Fortran is actually forced to do some computation.

module module_A_class
    implicit none
    type A
    contains
        procedure , public :: timetest => timetest_A
    end type A        
    contains
    subroutine timetest_A (this,ap,bp,cp,dp,res)
        implicit none
        class (A) , intent (in) :: this
        double precision , intent (in) :: ap,bp,cp,dp
        double precision, intent(out) :: res
        res = ap * bp * cp * dp
    end subroutine 
end module module_A_class

module module_AA_class
    use module_A_class
    implicit none
    type, extends (A) :: AA
    contains
        procedure , public :: timetest => timetest_AA
    end type AA          
contains
    subroutine timetest_AA (this,ap,bp,cp,dp,res)
        implicit none
        class (AA) , intent (in) :: this
        double precision , intent (in) :: ap,bp,cp,dp
        double precision, intent(out) :: res
        res = ap * bp * cp * dp
    end subroutine 
end module module_AA_class

module module_B_class
    use module_AA_class
    implicit none
    type B
        class (AA) , pointer , public :: AA_type
    contains    
        procedure , public :: timetest => timetest_B
    end type B        
    contains
    subroutine timetest_B (this,ap,bp,cp,dp,res)
        implicit none
        class (B) , intent (in) :: this
        double precision , intent (in) :: ap,bp,cp,dp
        double precision, intent(out) :: res
        res = ap * bp * cp * dp
    end subroutine 
end module module_B_class    

program test
    use module_A_class
    use module_AA_class
    use module_B_class
    integer, parameter :: n = 10000000, niter = 10
    integer :: iter
    double precision :: t1 , t2, res, xx(n,4)
    type (A) :: class_A
    type (AA) :: class_AA
    type (B) :: class_B
    type (AA) , target :: class_AA_target
    call random_number(xx)
    class_B%AA_type => class_AA_target
    
    call cpu_time ( t1 )
    do iter=1,niter
       do i = 1, n
          call timetest (xx(i,1), xx(i,2) , xx(i,3) , xx(i,4), res)
       end do
    end do
    if (res < 0) print*,"res=",res
    call cpu_time ( t2 )
    write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In Main'

    call cpu_time ( t1 )
    do iter=1,niter
       do i = 1, n
          call class_A%timetest (xx(i,1), xx(i,2) , xx(i,3) , xx(i,4), res)
       end do
    end do
    if (res < 0) print*,"res=",res
    call cpu_time ( t2 )
    write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In A'

    
    call cpu_time ( t1 )
    do iter=1,niter
       do i = 1, n
          call class_AA%timetest (xx(i,1), xx(i,2) , xx(i,3) , xx(i,4), res)
       end do
    end do
    if (res < 0) print*,"res=",res
    call cpu_time ( t2 )
    write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In AA'
    
    call cpu_time ( t1 )
    do iter=1,niter
       do i = 1, n
          call class_B%AA_type%timetest (xx(i,1), xx(i,2) , xx(i,3) , xx(i,4), res)
       end do
    end do
    if (res < 0) print*,"res=",res
    call cpu_time ( t2 )
    write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In B%AA_type'
    
    
    call cpu_time ( t1 )
    do iter=1,niter
       do i = 1, n
          call class_B%timetest (xx(i,1), xx(i,2) , xx(i,3) , xx(i,4), res)
       end do
    end do
    if (res < 0) print*,"res=",res
    call cpu_time ( t2 )
    write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In B'
    
contains

    subroutine timetest (ap,bp,cp,dp,res)
        implicit none
        double precision , intent (in) :: ap,bp,cp,dp
        double precision, intent(out) :: res
        res = ap * bp * cp * dp
    end subroutine 
end program

Intel Fortran on Windows gives

 Elapsed CPU time =   0.125000000000000      In Main
 Elapsed CPU time =   0.140625000000000      In A
 Elapsed CPU time =   0.125000000000000      In AA
 Elapsed CPU time =   0.140625000000000      In B%AA_type
 Elapsed CPU time =   0.125000000000000      In B

and gfortran -O2 says

 Elapsed CPU time =   0.15625000000000000      In Main
 Elapsed CPU time =   0.15625000000000000      In A
 Elapsed CPU time =   0.15625000000000000      In AA
 Elapsed CPU time =   0.26562500000000000      In B%AA_type
 Elapsed CPU time =   0.17187500000000000      In B

So here there are 1.1 and 1.7 penalty factors for using CLASS with Intel and gfortran.

It is true that if you don’t use MODULEs or explicit INTERFACEs, a Fortran compiler may not detect mismatches when the procedure call and procedure are in separate source files. I wonder if compilers have options to catch this or if there are other tools do so.

Well, the Intel Fortran compiler may generate interface files for this purpose. Of course it does depend on the order in which the files are compiled.
It seems gfortran has -Wimplicit-interface and -Wimplicit-procedure to warn about a lack of an explicit interface definition. I have not tried these options yet.

Just tried these on an almost trivial little program:
program chkinterface
implicit none

    call print_something
end program chkinterface 
  • Both options warn about the routine not having an explicit interface, but with slightly different messages
  • The option -Wall does NOT warn about it. So you need to use either of the above

For the source files

function twice(x) result(y)
real, intent(in) :: x
real             :: y
y = 2*x
end function twice

program xtwice_dp
double precision :: x
x = 3.14d0
print*,twice(x)
end program xtwice_dp

ifort -nologo --warn:interface says

xtwice_dp.f90(4): error #6633: The type of the actual argument differs from the type of the dummy argument.   [X]
print*,twice(x)
-------------^
compilation aborted for xtwice_dp.f90 (code 1)

Putting the two source files in a zip file and submitting them to Lahey/Fujitsu Fortran 95 Source Check, you get

Compiling program unit twice at line 1:
Encountered 0 errors, 0 warnings, 0 informations in file TWICE_REAL.F90.
Compiling program unit xtwice_dp at line 1:
  2603-S: "XTWICE_DP.F90", line 4: Argument number '1' type of procedure 'twice' shall be the same between definition and reference. The previous appearance is in '"TWICE_REAL.F90", line 1'.
Encountered 1 error, 0 warnings, 0 informations in file XTWICE_DP.F90.
Compiling file TWICE_REAL.F90.
Compiling file XTWICE_DP.F90.

Since Intel Fortran is now free, as is Lahey Source Check, type checking even without MODULEs or explicit INTERFACEs is available to everyone, although one should still use MODULEs.

Your comment:
Since Intel Fortran is now free, as is Lahey Source Check, type checking even without MODULEs or explicit INTERFACEs is available to everyone, although one should still use MODULEs.

Yes, I agree: always use modules!

1 Like

I believe OO performance in any language is a function of two things. 1. How much dynamic polymorphism (ie late or run time binding you use and 2. how much operator overloading you do. I remember many years ago reading a discussion about why C++ was so slow. Most people thought it was overuse of dynamic polymorphism. However, when people dug deeper into what was really happening it was all the allocation and deallocation of temporary arrays etc that came with operator overloading (and the compilers not optimizing those away) that was a major cause of the slowdown. The introduction of static polymorphism via templates (particulary expression templates) is what allowed C++ to catch up with Fortran performance wise for many (now most) tasks. I also remember when F90 compilers appeared there was a lot of discussion and a few papers about the performance differences between using arrays and derived types (AKA structures). As others have stated those were usually a function of the compilers and/or the options used to compile the code. Most of the newer C++ based CFD and Finite Element codes I’ve seen make heavy use of templates (static polymorphism) for a lot of the basic data structures etc but write the rest of the code in Ctran (C written like Fortran).

Another issue is naive implementations of encapsulation where everything is private and you use getter and setter routines to access the data. Many times the getters and setters would be making copys of arrays and passing those back and forth instead of just passing a pointer. I recommend people read Stefano Toninel’s PhD dissertation (https://sourceryinstitute.github.io/MORFEUS-Source/media/toninel_phd.pdf) where he discusses the performance differences in the context of developing an OO based computational mechanics solver. He also makes a good case for how using an Object based as opposed to a pure Object oriented approach can give you most of the benefits of pure OO without many of the performance losses that are inherit in OO programming. Note also, The Sourcery Institute folks have updated and extended Toninel’s original NEMO code and now release it as the open-source MORFEUS code (see https://sourceryinstitute.github.io/MORFEUS-Source).

3 Likes

Btw, this graph agrees with my experience about Intel Fortran being able to deliver significantly better performance over GFortran for smaller vector sizes (over 4000 MFlops vs under 3000 MFlops).

@Ashok,

Too bad that thread is several years old and the chances of any meaningful response to a query with that person (Simon Schröder) for any details are remote.

Regardless, I presume you follow the scientific method wherever your interests lie including with engineering and computations. That is, to try to reproduce results, to verify and to analyze observations and to discuss them while remaining open-minded about possibilities and opportunities. I feel the same applies when it comes to any compute-performance aspects of programs and their algorithms and coding approaches.

You will note both C++ and Fortran are multiparadigm languages. Moreover you will know the development of a computational solution and/or framework for anything including in science and engineering increasingly involves multiparadigm architecture. And the object-oriented (OO) paradigm is a critical component of modern application development in any domain, be it scientific / technical computing or otherwise.

As you will know, one can read all kinds of stuff on the internet about OO, often on the extremes about its ills and also benefits. I personally think one has to be judicious about its design and use just as one has to be with any tool in engineering.

When adopted carefully with all the right tools toward calculations that have some heft to them and especially with actual applications (and not “toy” programs), the OO approach can give good results in Fortran, particularly with Intel Fortran - now available free - and which provides better optimization for Intel CPUs.

You may want to review this thread at the Intel Fortran forum: community.intel.com/t5/Intel-Fortran-Compiler/declared-type-and-the-issue-of-contiguous-memory/m-p/1135054#M135463. With the case presented by OP on that thread, the results with OO are essentially the same as the long-established procedural approach involving subroutines with array arguments: In this particular case, the performance aspect requires focus on SoA approach, as pointed out by @ivanpribec above with LRZ course slides. This comes about here with parametrized derived types (PDTs), a feature introduced way back in Fortran 2003 but for which Intel Fortran is among the few compilers providing reasonable support.

C:\temp>p.exe
Block 1: PDT
aux = -6.999837968078546E+016
Calc time = 0.499 seconds

Block 2: Derived Type
aux = -6.999837968078546E+016
Calc time = 0.852 seconds

Block 3: Arrays
aux = -6.999837968078546E+016
Calc time = 0.499 seconds

2 Likes

Thank you everyone… Now I have some solid ground to refute such allegations on Fortran…