The LANL report is making an impact, be it positive or negative.
Also,
We are prepared for this. Fortran is simple, even Coarray Fortran. We can achieve things with a simple syntax and only a few lines of code. That should help to attract those who have never programmed before, and it should help to attract future generations to Fortran for heterogeneous programming.
The code examples are from my GitHub page. Of course, there is still room for improvement:
code example 4-1: parallel loop to execute asynchronous coroutines:
parallel_loop: do ! parallel loop does execute on all coarray images simultaneously
! calling multiple asynchronous coroutines simultaneously herein
! timer for exiting the loop:
call system_clock(count = i_time2); r_timeShift = real(i_time2-i_time1) / r_countRate
if (r_timeShift > r_timeLimitInSec) exit parallel_loop ! time limit exceeded
end do parallel_loop
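The loop above relies on variables that are set up elsewhere (`i_time1`, `r_countRate`, `r_timeLimitInSec`). A minimal self-contained sketch of the same time-limited loop, assuming a 0.05 s limit and a hypothetical iteration counter standing in for the real work, could look like this:

```fortran
program timed_loop_sketch
  use, intrinsic :: iso_fortran_env, only: int64
  implicit none
  integer(int64) :: i_time1, i_time2, i_countRate  ! i_countRate is a hypothetical name
  real :: r_timeShift
  real, parameter :: r_timeLimitInSec = 0.05       ! assumed limit, for illustration only
  integer :: i_iterations                          ! hypothetical stand-in for real work

  ! read the start count and the clock rate (counts per second) once
  call system_clock(count = i_time1, count_rate = i_countRate)
  if (i_countRate <= 0) error stop 'no usable system clock'

  i_iterations = 0
  parallel_loop: do
    i_iterations = i_iterations + 1   ! the asynchronous coroutines would be called here
    ! timer for exiting the loop:
    call system_clock(count = i_time2)
    r_timeShift = real(i_time2 - i_time1) / real(i_countRate)
    if (r_timeShift > r_timeLimitInSec) exit parallel_loop   ! time limit exceeded
  end do parallel_loop

  print *, 'left the loop after', r_timeShift, 's and', i_iterations, 'iterations'
end program timed_loop_sketch
```

Each image runs its own copy of the loop and reads its own clock, so the images leave the loop independently rather than at a common barrier.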
code example 5-1: spatial kernel using a user-defined channel to send data to a different remote kernel on one or several remote coarray images:
kernel_1: block
integer :: i_value
real :: r_value
i_value = 22
r_value = 22.222
call channel % fill (i_val = i_value, r_val = r_value)
call channel % send (i_chstat = i_channelStatus)
! this image is ready to execute the next kernel:
i_channelStatus = enum_channelStatus % kernel_2
end block kernel_1
code example 6-1: implementing a non-blocking synchronization (send):
!**** atomic send for synchronization and to control the execution flow: ****
sync memory ! to complete the non-atomic send and to achieve segment ordering
each_image: do i_loopImages = 1, fo % m_i_numberOfRemoteImages
i_currentRemoteImageNumber = fo % m_ia1_remoteImageNumbers (i_loopImages)
if (fo % m_l_useRemoteImageNumbersAsArrayIndex) i_arrayIndex = i_currentRemoteImageNumber
! set an array element in remote PGAS memory atomically:
call atomic_define (fo % m_atomic_ia1_channelStatus_cc (i_arrayIndex) [i_currentRemoteImageNumber], i_chStat)
end do each_image
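For completeness, here is a minimal sketch of what the matching receive side might look like. This is not the code from my repository: it assumes a single scalar flag (`status_flag`, a hypothetical name) instead of the channel-status array, and sends around a ring of images so that it also runs on one image.

```fortran
program atomic_receive_sketch
  use, intrinsic :: iso_fortran_env, only: atomic_int_kind
  implicit none
  integer(atomic_int_kind) :: status_flag[*]   ! hypothetical channel-status flag
  integer(atomic_int_kind) :: i_seen
  integer :: i_nextImage

  status_flag = 0
  sync all   ! every flag is zeroed before any image sends

  ! send: set the flag on the next image of the ring atomically
  i_nextImage = this_image() + 1
  if (i_nextImage > num_images()) i_nextImage = 1
  sync memory   ! complete earlier non-atomic sends, as in example 6-1
  call atomic_define (status_flag[i_nextImage], int(this_image(), atomic_int_kind))

  ! receive: poll the local flag without blocking any other image
  do
    call atomic_ref (i_seen, status_flag)
    if (i_seen /= 0) exit
  end do
  sync memory   ! order the following segment after the atomic receive

  print *, 'image', this_image(), 'received status from image', i_seen
end program atomic_receive_sketch
```

The polling loop is where a real kernel would do other useful work between checks, which is the point of a non-blocking synchronization.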
@Federchen , what is the URL for this code? I could not find it.
I guess I have a contrarian view of the lack of Fortran talent. To me this is not a technical or educational problem, it's a management problem. If organizations (companies, Gov labs etc.) either required prior knowledge of Fortran or stated that a willingness to learn Fortran is a condition of employment for new hires, and were willing to provide internal company training (or better yet, pay old folks like me to teach their training classes), the problem would lessen. The same is true for current employees. Tell them they have to learn Fortran and make it worth their time by way of a bonus for completing training. Relying on CS departments in major universities to teach Fortran is a waste of everyone’s time.
Several years ago, one of my alma maters decided to stop requiring a programming class as part of the basic engineering curricula. However, companies were telling potential hires they needed some programming experience. One of the engineering faculty members (an old friend who was in my group when we worked for the same aerospace company) decided to teach a programming class for her students (it included both Python and Fortran). Of course, when the CS department heard about this, all hell broke loose. How dare engineers teach programming. I would be willing to bet a similar scenario occurs at other engineering schools, both here in the States and abroad.
I think it’s a culture problem. The mainstream perception of Fortran must change so that the broad segment of programmers becomes willing to learn it, use it, and talk to their friends and colleagues about it.
I recently consulted for a large company (household name) in the energy sector that has a large legacy FORTRAN codebase used in production. They’ve been struggling to find candidates, because as soon as they mention Fortran in the job ad the number of applicants plummets (even when knowledge of Fortran is not a requirement, i.e. you can learn it on the job as you go, and any programming experience will do).
What I believe will fix the culture problem (not over a year, but over a decade) is to keep improving the tooling and the library ecosystem, consistent and high-quality marketing (blogs, YouTube, social media), and a welcoming community for new users to grow with.
Even the LANL report, for example. I can’t help but think, between the lines, that the technical arguments made in the report are merely a facade for the managers to make the argument convincing. Underneath, I believe the true argument is “We like C++, we don’t like Fortran”. And that’s the best possible argument one can make, one that is impossible to argue against.
French proverb:
“Qui veut noyer son chien l’accuse de la rage”
A literal translation in English would be: “He who wants to drown his dog accuses him of rabies”.
I agree that there is a cultural component that needs to be addressed. I’m still amazed at the number of people (who you would think would know better) who when you say Fortran immediately think Fortran 77 or even earlier (Fortran IV/66). For some folks the last 30 plus years haven’t happened.
Regarding the LANL report:
There is some of that, yes. They do like C++ and they don’t like Fortran. That part is true. But that is not the whole story. You have to ask why they don’t like Fortran. They presented some arguments in the article, and it’s true that maybe some of the arguments are not well made. But the underlying reason why they don’t like Fortran is not just perception and not just irrational dislike. I think the main reason is the much less robust toolchain and much less support for GPU. We have to fix that.
I must warn against dismissing these reports as “they just don’t like Fortran and there is nothing we can do”. I know you are not doing it, but I see an inclination to do this on this Discourse forum.
Instead, what I strongly recommend doing is to just ask people like that why they don’t like Fortran, and dig deep and truly listen. I have done this many many times. What I have found is that in every single case there are true real reasons underneath and when they see that you understand the reasons and see your plan to fix them, they do change their mind on this.
Yes, I think that’s all true, and most technical arguments there are probably valid (I don’t have sufficient experience to claim either way).
A counter-argument is: Pick a team that likes Fortran over C++ despite the warts (we know many people who love Fortran for Fortran itself and prefer it for their work, technical pros and cons aside). Such a team would overcome most if not all challenges presented in the report.
Yes, the reported technical challenges are real and must be addressed so that the likeability and usability of Fortran keep increasing in the near future.
I think most of the technical challenges can be overcome except one: GPU support. Here is an example where I asked why they migrated from Fortran to C++:
convert CTU hydro subroutines to C++ · Issue #525 · AMReX-Astro/Castro · GitHub
The answer was: “Initially we built our own method to offload Fortran routines to GPUs, but it is too much to maintain, so we chose to move to C++ so we can take advantage of the features of AMReX and lessen our own development burden.”
Can it be overcome? Yes, but at some point the development and maintenance burden just becomes too high. Currently using Fortran for GPU in practice often creates more pain than using C++. Our job is to turn it around: using Fortran should be less pain than C++ for GPUs.
Reading a bit fast through the thread, it seems they raise the issue of function inlining for obtaining maximum GPU performance. This is indeed a valid point since it is not evident at first. If one has the module functions and the caller implementation in the same file, ifort and gfortran manage to inline. Yet if they reside in different files the story is a bit more complicated.
Nonetheless, ifort at least provides some flags and directives to help the compiler achieve inlining: INLINE, FORCEINLINE, and NOINLINE.
It would be interesting to come up with a minimal benchmark showing, for the same task and comparable performance, how many lines of code (and tweaks beyond standard features) are needed with C++ and with Fortran.
I wonder how well nvfortran manages function inlining. Given that it is the most used compiler for offloading to GPUs at the moment, I guess it should be a good benchmark.
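As a minimal sketch of how such a directive is used, assuming the Intel-style `!DIR$ FORCEINLINE` placed before the loop that contains the call (other compilers just treat the line as a comment), and with a hypothetical module `m_axpy_point` invented for illustration:

```fortran
module m_axpy_point
  implicit none
contains
  ! small helper that we would like the compiler to inline at the call site
  pure function axpy_point(a, x, y) result(r)
    real, intent(in) :: a, x, y
    real :: r
    r = a*x + y
  end function axpy_point
end module m_axpy_point

program inline_hint_sketch
  use m_axpy_point
  implicit none
  integer, parameter :: n = 1000
  real :: x(n), y(n)
  integer :: i

  x = 1.0
  y = 2.0

  ! Intel-specific hint: request inlining of the calls inside the next loop.
  ! Compilers that do not know the directive ignore it as a plain comment.
  !DIR$ FORCEINLINE
  do i = 1, n
    y(i) = axpy_point(2.0, x(i), y(i))
  end do

  print *, y(1), y(n)
end program inline_hint_sketch
```

Here the module and the caller sit in one file, which is the easy case. When they live in separate files, my understanding is that one typically also needs whole-program optimization (for example `-ipo` with ifort or `-flto` with gfortran) before the compiler can see the function body at the call site.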
@Federchen , what is the URL for this code? I could not find it.
It is still experimental: there is no complete code example yet and, of course, it only works on CPU so far.
It will take some time, and some bug fixes in the Intel compilers.
Besides, I have recently started trying out something similar with collective subroutines instead of coarrays. This could (hopefully) become another branch for targeting GPUs. It is also very experimental and CPU-only so far.
Aha, thank you! I will try and have a look within the coming few days.
Can you inline a Fortran function with C++ functions in the same GPU kernel?
I don’t know if that is what they needed, we should ask them. I don’t have time to pursue this right now, but once we get to implement GPU backends in LFortran, then we’ll definitely have to pursue it.
Looking at Section 4 of the HPC Compilers User's Guide Version 23.11 for ARM, OpenPower, x86, it seems like the team from NVIDIA put a big effort in that direction.
The NVIDIA HPC compilers provide two categories of inlining:
- Automatic function inlining – In C++ and C, you can inline static functions with the inline keyword by using the -Mautoinline option, which is included with -fast.
- Function inlining – You can inline functions which were extracted to the inline libraries in Fortran, C++ and C. There are two ways of enabling function inlining: with and without the lib suboption. For the latter, you create inline libraries, for example using the nvfortran compiler driver and the -o and -Mextract options.
Out of curiosity, I tried playing with their saxpy example that compares do and do concurrent, and added a call to a C++ implementation:
extern "C" {
void saxpy_cpp(float x[], float y[], int & n, float & a)
{
for (int i = 0; i < n; ++i)
y[i] = a*x[i] + y[i];
}
}
module m_saxpy
use iso_c_binding
interface
subroutine saxpy_cpp(x,y,n,a) bind(c)
import c_int, c_float
! assumed-size dummies match the plain float pointers that saxpy_cpp expects
real(kind=c_float) :: x(*), y(*)
real(kind=c_float) :: a
integer(kind=c_int) :: n
end subroutine
end interface
contains
subroutine saxpy_concurrent(x,y,n,a)
real,dimension(:) :: x, y
real :: a
integer :: n, i
do concurrent (i = 1: n)
y(i) = a*x(i)+y(i)
enddo
end subroutine
subroutine saxpy_do(x,y,n,a)
real,dimension(:) :: x, y
real :: a
integer :: n, i
do i = 1, n
y(i) = a*x(i)+y(i)
enddo
end subroutine
end module
program main
use m_saxpy
real,allocatable :: x(:), x2(:), x3(:), y(:)
real :: a = 2.0
integer :: n, err
integer :: c0, c1, c2, c3, c4, c5, cpar, cseq, ccpp
n = 5e7
allocate(x(n) , source=1.0)
allocate(x2(n), source=1.0)
allocate(x3(n), source=1.0)
allocate(y(n) , source=[(real(i),i=1,n)])
!$acc enter data copyin(x,x2,x3, y, n, a)
call system_clock( count=c0 )
call saxpy_do(x, y, n, a)
call system_clock( count=c1 )
call system_clock( count=c2 )
call saxpy_cpp(x2, y, n, a)
call system_clock( count=c3 )
call system_clock( count=c4 )
call saxpy_concurrent(x3, y, n, a)
call system_clock( count=c5 )
!$acc exit data delete(x,x2,x3, y, n, a)
cseq = c1 - c0
ccpp = c3 - c2
cpar = c5 - c4
err = 0
if( any(x.ne.x2) .or. any(x.ne.x3) ) err = 1
print *, cseq, ' microseconds do'
print *, ccpp, ' microseconds do cpp'
print *, cpar, ' microseconds do concurrent'
if(err .eq. 0) then
print *, "SAXPY: Test PASSED"
else
print *, "SAXPY: Test FAILED"
endif
end program
With that in an fpm project, I used the following flags:
export LD_LIBRARY_PATH=/usr/lib/wsl/lib:$LD_LIBRARY_PATH
export FPM_CXX=/opt/nvidia/hpc_sdk/Linux_x86_64/2023/compilers/bin/nvcc
export FPM_FC=/opt/nvidia/hpc_sdk/Linux_x86_64/2023/compilers/bin/nvfortran
fpm run --flag "-fast -stdpar=gpu -acc=gpu -gpu=cc80,cuda12.0,nomanaged -Minline -Minfo=accel"
And got the following results on an RTX A6000:
29633 microseconds do
122100 microseconds do cpp
1745 microseconds do concurrent
SAXPY: Test PASSED
I am not sure whether the poor performance of the C++ implementation is due to a lack of inlining or something else …
One can read this in their manual:
4.2. Invoking Procedure Inlining
To invoke the procedure inliner, use the -Minline option. If you do not specify an inline library, the compiler performs a special prepass on all source files named on the compiler command line before it compiles any of them. This pass extracts procedures that meet the requirements for inlining and puts them in a temporary inline library for use by the compilation pass.
And regarding the restrictions on inlining:
The following types of C and C++ functions cannot be inlined:
- Functions which accept a variable number of arguments
Certain C/C++ functions can only be inlined into the file that contains their definition:
- Static functions
- Functions which call a static function
- Functions which reference a static variable
Reading through it, I don’t see any explicit limitation, but I might be missing something… I would have to look at the generated assembly to get a better idea.
@certik FYI, I took the question to the NVIDIA forum: [Fortran][C++] Question on cross-language function inlinement - #2 by MatColgrove - nvc, nvc++ and nvfortran - NVIDIA Developer Forums
@hkvzjal it looks like cross-language inlining is not supported. I don’t know if that is what they needed. It goes back to my comment here: An evaluation of risks associated with relying on Fortran for mission critical codes for the next 15 years - #4 by certik, where I suggest that mixing languages generally makes everything worse, because you get all these interface problems. It’s better to choose just one language and do things in it. This is a tough problem, since they are using AMReX, which is in C++, and it might be hard to create performant Fortran wrappers so that one can use it with Fortran projects and run well on GPUs. I think long term there should be a way to do that, but I can easily see how this becomes a maintenance burden, and just doing it in C++ is a lot simpler.