Hi everyone,
I came across this conversation in ResearchGate about Fortran and c++.
https://www.researchgate.net/post/Is_FORTRAN_a_good_choice_for_scientific_computing
Please read the comments by [Simon Schröder]. He says that he heard from a person belonging to Fortran standardization committee that when we use ‘OO’ features of Fortran - it slows by 100 times !! Is it true ? Any comments…
I don’t know about OO, but ResearchGate is an important place where we should present the community’s work: Fortran-lang.org, stdlib, fpm, LFortran, etc. They can be tools for research.
When you are on ResearchGate, you can create a “Project” where you present the objectives and you can post news about the advancing of your project.
You can also add your FortranCon2020 presentations.
Are there any documents that consolidate what the current compilers definitely support in Modern Fortran ? Because there are so many “can do’s” in Modern Fortran… We should not leave it for people to figure them out after going through a long journey into their project or at least conceptualization of the project. It would be nice if they know what the compilers support before conceptualization of the project. Even for “OO” we have to have some benchmarks…
I have not checked the conversation yet, but I would be surprised if it was that bad. I know that some uses of OO-features can cause a performance drop in combination with a particular compiler, but I never heard of such an impact. Performance measurements are always tricky and a bad/clumsy/careless implementation can cause bad results. For instance: iterating over an array is fast, but not OO, so an OO approach might use instead a linked-list and iterate over the elements of the list. That will definitely cause a slow-down.
Nevertheless, this is an intriguing point and I think we should have some benchmarks, like you suggest. They should be carefully constructed,
Hi Ashok,
I have seen such performance tests before in the course “Programming with Fortran” offered at the Leibniz-Rechenzentrum. Specifically you can check the sections “Further performance aspects and use of Parameterized derived types” in the course slides.
The problem typically boils down to the issue of using an array of structures (AoS) or a structure of arrays (SoA). Imagine you are doing a particle simulation. Each particle can be represented as an instance of the derived type:
type :: body
character(len =4) :: units
real :: mass
real :: pos(3), vel(3)
end type body
In your main code you will then allocate an array of bodies:
type(body), allocatable :: traj(:)
allocate(traj(ntraj))
Alternatively, you can fold the array properties into the derived type:
type :: body_p( k, ntraj)
integer, kind :: k = kind(1.0)
integer, len :: ntraj = 1
character(len=4) :: units
real(kind=k) :: mass(ntraj)
real(kind=k) :: pos(ntraj,3), vel (ntraj,3)
end type body_p
In the main code you would use this structure of arrays as follows:
type( body_p(ntraj=:) ), allocatable :: dyn_traj
allocate(body_p(ntraj=20) :: dyn_trag)
What is not immediately obvious is the two objects differ in their memory alignment as illustrated by the image below:
Depending what you are doing with the particles, the size of your array, compiler, etc. - the contiguous memory layout in the SoA format can potentially lead to improved vectorization, resulting in faster executables.
In some cases the poor performance of derived types, could be simply due to an immature compiler implementation, which doesn’t manage to exploit all the vectorization opportunities.
Schröder later writes
One huge disadvantage of Fortran is that it does not have static type checking. This is really bad for subroutine calls. A while ago we turned on interface checking in Fortran. It took us quite long to get our software to compile again with interface checking turned on. Some subroutine calls were totally wrong (somehow the software still worked correctly). You should have as much static checking as possible. Fortran as a language does not have enough, in my opinion.
It certainly does from Fortran 90 on, if you put your procedures in MODULEs, as is widely advised. I don’t think the author is too familiar with modern Fortran.
There was a 2013 thread in the Intel Fortran forum about Speed loss using object oriented features finding that OO features could increase the execution time of a program by a factor of 1.6. That is a more plausible number than the 100 number from hearsay. The code there (removing a PAUSE statement) is
module module_A_class
implicit none
type A
contains
procedure , public :: timetest => timetest_A
end type A
contains
subroutine timetest_A (this,ap,bp,cp,dp)
implicit none
class (A) , intent (in) :: this
double precision , intent (in) :: ap,bp,cp,dp
double precision :: res
res = ap * bp * cp * dp
end subroutine
end module module_A_class
module module_AA_class
use module_A_class
implicit none
type, extends (A) :: AA
contains
procedure , public :: timetest => timetest_AA
end type AA
contains
subroutine timetest_AA (this,ap,bp,cp,dp)
implicit none
class (AA) , intent (in) :: this
double precision , intent (in) :: ap,bp,cp,dp
double precision :: res
res = ap * bp * cp * dp
end subroutine
end module module_AA_class
module module_B_class
use module_AA_class
implicit none
type B
class (AA) , pointer , public :: AA_type
contains
procedure , public :: timetest => timetest_B
end type B
contains
subroutine timetest_B (this,ap,bp,cp,dp)
implicit none
class (B) , intent (in) :: this
double precision , intent (in) :: ap,bp,cp,dp
double precision :: res
res = ap * bp * cp * dp
end subroutine
end module module_B_class
program test
use module_A_class
use module_AA_class
use module_B_class
double precision :: t1 , t2
type (A) :: class_A
type (AA) :: class_AA
type (B) :: class_B
type (AA) , target :: class_AA_target
class_B%AA_type => class_AA_target
call cpu_time ( t1 )
do i = 1, 100000000
call timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0)
end do
call cpu_time ( t2 )
write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In Main'
call cpu_time ( t1 )
do i = 1, 100000000
call class_A%timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0)
end do
call cpu_time ( t2 )
write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In A'
call cpu_time ( t1 )
do i = 1, 100000000
call class_AA%timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0)
end do
call cpu_time ( t2 )
write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In AA'
call cpu_time ( t1 )
do i = 1, 100000000
call class_B%AA_type%timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0)
end do
call cpu_time ( t2 )
write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In B%AA_type'
call cpu_time ( t1 )
do i = 1, 100000000
call class_B%timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0)
end do
call cpu_time ( t2 )
write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In B'
contains
subroutine timetest (ap,bp,cp,dp)
implicit none
double precision , intent (in) :: ap,bp,cp,dp
double precision :: res
res = ap * bp * cp * dp
end subroutine
end program
Intel Fortran now says the execution time is zero for all versions, because the calculations can be skipped since the results are not used. Gfortran on Windows gives
Elapsed CPU time = 0.25000000000000000 In Main
Elapsed CPU time = 0.29687500000000000 In A
Elapsed CPU time = 0.28125000000000000 In AA
Elapsed CPU time = 0.35937500000000000 In B%AA_type
Elapsed CPU time = 0.29687500000000000 In B
which is not a huge speed penalty. Even when I modify the program to use res
module module_A_class
implicit none
type A
contains
procedure , public :: timetest => timetest_A
end type A
contains
subroutine timetest_A (this,ap,bp,cp,dp,res)
implicit none
class (A) , intent (in) :: this
double precision , intent (in) :: ap,bp,cp,dp
double precision, intent(out) :: res
res = ap * bp * cp * dp
end subroutine
end module module_A_class
module module_AA_class
use module_A_class
implicit none
type, extends (A) :: AA
contains
procedure , public :: timetest => timetest_AA
end type AA
contains
subroutine timetest_AA (this,ap,bp,cp,dp,res)
implicit none
class (AA) , intent (in) :: this
double precision , intent (in) :: ap,bp,cp,dp
double precision, intent(out) :: res
res = ap * bp * cp * dp
end subroutine
end module module_AA_class
module module_B_class
use module_AA_class
implicit none
type B
class (AA) , pointer , public :: AA_type
contains
procedure , public :: timetest => timetest_B
end type B
contains
subroutine timetest_B (this,ap,bp,cp,dp,res)
implicit none
class (B) , intent (in) :: this
double precision , intent (in) :: ap,bp,cp,dp
double precision, intent(out) :: res
res = ap * bp * cp * dp
end subroutine
end module module_B_class
program test
use module_A_class
use module_AA_class
use module_B_class
double precision :: t1 , t2, res
type (A) :: class_A
type (AA) :: class_AA
type (B) :: class_B
type (AA) , target :: class_AA_target
class_B%AA_type => class_AA_target
call cpu_time ( t1 )
do i = 1, 100000000
call timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0, res)
end do
if (res < 0) print*,"res=",res
call cpu_time ( t2 )
write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In Main'
call cpu_time ( t1 )
do i = 1, 100000000
call class_A%timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0, res)
end do
if (res < 0) print*,"res=",res
call cpu_time ( t2 )
write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In A'
call cpu_time ( t1 )
do i = 1, 100000000
call class_AA%timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0, res)
end do
if (res < 0) print*,"res=",res
call cpu_time ( t2 )
write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In AA'
call cpu_time ( t1 )
do i = 1, 100000000
call class_B%AA_type%timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0, res)
end do
if (res < 0) print*,"res=",res
call cpu_time ( t2 )
write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In B%AA_type'
call cpu_time ( t1 )
do i = 1, 100000000
call class_B%timetest (5.0d0 , 5.0d0 , 5.0d0 , 5.0d0, res)
end do
if (res < 0) print*,"res=",res
call cpu_time ( t2 )
write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In B'
contains
subroutine timetest (ap,bp,cp,dp,res)
implicit none
double precision , intent (in) :: ap,bp,cp,dp
double precision, intent(out) :: res
res = ap * bp * cp * dp
end subroutine
end program
Intel Fortran detects that the calculations do not need to be looped over and returns zero execution times.
Building on my previous post, the LRZ course slides show the following performance comparison:
Indeed one can see a two- to four-fold difference (and not 100!) as a result of memory layout issues linked to usage of derived types.
As @Beliavsky notes, the author of the ResearchGate post made other erroneous comments. Without concrete examples, it is impossible to say if and what went wrong in his case, or if it was simply an exaggeration.
Here is a version of the program where Intel Fortran is actually forced to do some computation.
module module_A_class
implicit none
type A
contains
procedure , public :: timetest => timetest_A
end type A
contains
subroutine timetest_A (this,ap,bp,cp,dp,res)
implicit none
class (A) , intent (in) :: this
double precision , intent (in) :: ap,bp,cp,dp
double precision, intent(out) :: res
res = ap * bp * cp * dp
end subroutine
end module module_A_class
module module_AA_class
use module_A_class
implicit none
type, extends (A) :: AA
contains
procedure , public :: timetest => timetest_AA
end type AA
contains
subroutine timetest_AA (this,ap,bp,cp,dp,res)
implicit none
class (AA) , intent (in) :: this
double precision , intent (in) :: ap,bp,cp,dp
double precision, intent(out) :: res
res = ap * bp * cp * dp
end subroutine
end module module_AA_class
module module_B_class
use module_AA_class
implicit none
type B
class (AA) , pointer , public :: AA_type
contains
procedure , public :: timetest => timetest_B
end type B
contains
subroutine timetest_B (this,ap,bp,cp,dp,res)
implicit none
class (B) , intent (in) :: this
double precision , intent (in) :: ap,bp,cp,dp
double precision, intent(out) :: res
res = ap * bp * cp * dp
end subroutine
end module module_B_class
program test
use module_A_class
use module_AA_class
use module_B_class
integer, parameter :: n = 10000000, niter = 10
integer :: iter
double precision :: t1 , t2, res, xx(n,4)
type (A) :: class_A
type (AA) :: class_AA
type (B) :: class_B
type (AA) , target :: class_AA_target
call random_number(xx)
class_B%AA_type => class_AA_target
call cpu_time ( t1 )
do iter=1,niter
do i = 1, n
call timetest (xx(i,1), xx(i,2) , xx(i,3) , xx(i,4), res)
end do
end do
if (res < 0) print*,"res=",res
call cpu_time ( t2 )
write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In Main'
call cpu_time ( t1 )
do iter=1,niter
do i = 1, n
call class_A%timetest (xx(i,1), xx(i,2) , xx(i,3) , xx(i,4), res)
end do
end do
if (res < 0) print*,"res=",res
call cpu_time ( t2 )
write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In A'
call cpu_time ( t1 )
do iter=1,niter
do i = 1, n
call class_AA%timetest (xx(i,1), xx(i,2) , xx(i,3) , xx(i,4), res)
end do
end do
if (res < 0) print*,"res=",res
call cpu_time ( t2 )
write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In AA'
call cpu_time ( t1 )
do iter=1,niter
do i = 1, n
call class_B%AA_type%timetest (xx(i,1), xx(i,2) , xx(i,3) , xx(i,4), res)
end do
end do
if (res < 0) print*,"res=",res
call cpu_time ( t2 )
write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In B%AA_type'
call cpu_time ( t1 )
do iter=1,niter
do i = 1, n
call class_B%timetest (xx(i,1), xx(i,2) , xx(i,3) , xx(i,4), res)
end do
end do
if (res < 0) print*,"res=",res
call cpu_time ( t2 )
write ( *, * ) 'Elapsed CPU time = ', t2 - t1 , 'In B'
contains
subroutine timetest (ap,bp,cp,dp,res)
implicit none
double precision , intent (in) :: ap,bp,cp,dp
double precision, intent(out) :: res
res = ap * bp * cp * dp
end subroutine
end program
Intel Fortran on Windows gives
Elapsed CPU time = 0.125000000000000 In Main
Elapsed CPU time = 0.140625000000000 In A
Elapsed CPU time = 0.125000000000000 In AA
Elapsed CPU time = 0.140625000000000 In B%AA_type
Elapsed CPU time = 0.125000000000000 In B
and gfortran -O2 says
Elapsed CPU time = 0.15625000000000000 In Main
Elapsed CPU time = 0.15625000000000000 In A
Elapsed CPU time = 0.15625000000000000 In AA
Elapsed CPU time = 0.26562500000000000 In B%AA_type
Elapsed CPU time = 0.17187500000000000 In B
So here there are 1.1 and 1.7 penalty factors for using CLASS with Intel and gfortran.
It is true that if you don’t use MODULEs or explicit INTERFACEs, a Fortran compiler may not detect mismatches when the procedure call and procedure are in separate source files. I wonder if compilers have options to catch this or if there are other tools do so.
Well, the Intel Fortran compiler may generate interface files for this purpose. Of course it does depend on the order in which the files are compiled.
It seems gfortran has -Wimplicit-interface and -Wimplicit-procedure to warn about a lack of an explicit interface definition. I have not tried these options yet.
Just tried these on an almost trivial little program:
program chkinterface
implicit none
call print_something
end program chkinterface
- Both options warn about the routine not having an explicit interface, but with slightly different messages
- The option -Wall does NOT warn about it. So you need to use either of the above
For the source files
function twice(x) result(y)
real, intent(in) :: x
real :: y
y = 2*x
end function twice
…
program xtwice_dp
double precision :: x
x = 3.14d0
print*,twice(x)
end program xtwice_dp
ifort -nologo --warn:interface
says
xtwice_dp.f90(4): error #6633: The type of the actual argument differs from the type of the dummy argument. [X]
print*,twice(x)
-------------^
compilation aborted for xtwice_dp.f90 (code 1)
Putting the two source files in a zip file and submitting them to Lahey/Fujitsu Fortran 95 Source Check, you get
Compiling program unit twice at line 1:
Encountered 0 errors, 0 warnings, 0 informations in file TWICE_REAL.F90.
Compiling program unit xtwice_dp at line 1:
2603-S: "XTWICE_DP.F90", line 4: Argument number '1' type of procedure 'twice' shall be the same between definition and reference. The previous appearance is in '"TWICE_REAL.F90", line 1'.
Encountered 1 error, 0 warnings, 0 informations in file XTWICE_DP.F90.
Compiling file TWICE_REAL.F90.
Compiling file XTWICE_DP.F90.
Since Intel Fortran is now free, as is Lahey Source Check, type checking even without MODULEs or explicit INTERFACEs is available to everyone, although one should still use MODULEs.
Your comment:
Since Intel Fortran is now free, as is Lahey Source Check, type checking even without MODULEs or explicit INTERFACEs is available to everyone, although one should still use MODULEs.
Yes, I agree: always use modules!
I believe OO performance in any language is a function of two things. 1. How much dynamic polymorphism (ie late or run time binding you use and 2. how much operator overloading you do. I remember many years ago reading a discussion about why C++ was so slow. Most people thought it was overuse of dynamic polymorphism. However, when people dug deeper into what was really happening it was all the allocation and deallocation of temporary arrays etc that came with operator overloading (and the compilers not optimizing those away) that was a major cause of the slowdown. The introduction of static polymorphism via templates (particulary expression templates) is what allowed C++ to catch up with Fortran performance wise for many (now most) tasks. I also remember when F90 compilers appeared there was a lot of discussion and a few papers about the performance differences between using arrays and derived types (AKA structures). As others have stated those were usually a function of the compilers and/or the options used to compile the code. Most of the newer C++ based CFD and Finite Element codes I’ve seen make heavy use of templates (static polymorphism) for a lot of the basic data structures etc but write the rest of the code in Ctran (C written like Fortran).
Another issue is naive implementations of encapsulation where everything is private and you use getter and setter routines to access the data. Many times the getters and setters would be making copys of arrays and passing those back and forth instead of just passing a pointer. I recommend people read Stefano Toninel’s PhD dissertation (https://sourceryinstitute.github.io/MORFEUS-Source/media/toninel_phd.pdf) where he discusses the performance differences in the context of developing an OO based computational mechanics solver. He also makes a good case for how using an Object based as opposed to a pure Object oriented approach can give you most of the benefits of pure OO without many of the performance losses that are inherit in OO programming. Note also, The Sourcery Institute folks have updated and extended Toninel’s original NEMO code and now release it as the open-source MORFEUS code (see https://sourceryinstitute.github.io/MORFEUS-Source).
Btw, this graph agrees with my experience about Intel Fortran being able to deliver significantly better performance over GFortran for smaller vector sizes (over 4000 MFlops vs under 3000 MFlops).
Too bad that thread is several years old and the chances of any meaningful response to a query with that person (Simon Schröder) for any details are remote.
Regardless, I presume you follow the scientific method
wherever your interests lie including with engineering and computations. That is, to try to reproduce results, to verify and to analyze observations and to discuss them while remaining open-minded about possibilities and opportunities. I feel the same applies when it comes to any compute-performance aspects of programs and their algorithms and coding approaches.
You will note both C++ and Fortran are multiparadigm languages. Moreover you will know the development of a computational solution and/or framework for anything including in science and engineering increasingly involves multiparadigm architecture. And the object-oriented (OO) paradigm is a critical component of modern application development in any domain, be it scientific / technical computing or otherwise.
As you will know, one can read all kinds of stuff on the internet about OO, often on the extremes about its ills and also benefits. I personally think one has to be judicious about its design and use just as one has to be with any tool in engineering.
When adopted carefully with all the right tools toward calculations that have some heft to them and especially with actual applications (and not “toy” programs), the OO approach can give good results in Fortran, particularly with Intel Fortran - now available free - and which provides better optimization for Intel CPUs.
You may want to review this thread at the Intel Fortran forum: community.intel.com/t5/Intel-Fortran-Compiler/declared-type-and-the-issue-of-contiguous-memory/m-p/1135054#M135463. With the case presented by OP on that thread, the results with OO are essentially the same as the long-established procedural approach involving subroutines with array arguments: In this particular case, the performance aspect requires focus on SoA approach, as pointed out by @ivanpribec above with LRZ course slides. This comes about here with parametrized derived types (PDTs), a feature introduced way back in Fortran 2003 but for which Intel Fortran is among the few compilers providing reasonable support.
C:\temp>p.exe
Block 1: PDT
aux = -6.999837968078546E+016
Calc time = 0.499 secondsBlock 2: Derived Type
aux = -6.999837968078546E+016
Calc time = 0.852 secondsBlock 3: Arrays
aux = -6.999837968078546E+016
Calc time = 0.499 seconds
Thank you everyone… Now I have some solid ground to refute such allegations on Fortran…