Please, No More Loops (Than Necessary): New Patterns in Fortran 2023

Here are the compiler switches.

gfortran -O2 -fopenmp ch3305.f90 -o ch3305_gfortran.exe

ifort /nologo /Qdiag-disable:10448 -O2 -heap-arrays -openmp ch3305.f90 -o ch3305_ifort.exe

ifx /nologo -O2 -heap-arrays -openmp ch3305.f90 -o ch3305_ifx.exe

nagfor -O4 -fopenmp ch3305.f90 -o ch3305_nag.exe

nvfortran -O2 -fopenmp ch3305.f90 -o ch3305_nvidia.out

I’ve left out the Cray ones.

Here is the source file

include ‘integer_kind_module.f90’
include ‘precision_module.f90’

include ‘timing_module.f90’

program ch3305

use timing_module
use precision_module
use omp_lib

use iso_fortran_env

implicit none

! Original/Book 10,000,000
!
! Latest version has a variable memory footprint
!

!
! This version checks the memory allocation
! for each loop
!

integer :: n
integer :: allocation_status

integer, parameter :: start_size = 12810241024
integer, parameter :: loop_count = 10
integer, parameter :: n_types = 4
integer :: i
integer :: j
integer :: k
integer :: l
integer :: nthreads

real (dp), allocatable, dimension (:slight_smile: :: x
real (dp), allocatable, dimension (:slight_smile: :: y
real (dp), allocatable, dimension (:slight_smile: :: z

real, dimension (n_types, loop_count) :: timing_details = 0.0
real, dimension (n_types) :: t_sum = 0.0
real, dimension (n_types) :: t_average = 0.0
real :: reset = 0.0

character (15), dimension (n_types) :: heading_1 = &
[ ’ Whole array ', &
’ Do loop ', &
’ Do concurrent ', &
’ openmp ’ ]

print * , ’ ’
print *,compiler_version()
print *,compiler_options()
print *, ’ ’

nthreads = omp_get_max_threads()
open (unit=20, file=‘ch3305.dat’)
print 100, nthreads
100 format (’ Nthreads = ', i3)

!
! Dynamic allocation
!
! The loop count l depends on the amount
! of memory the system has.
!
! I use native Windows and Linux installs
! and for these systems the memory is the
! actual physical memory.
!
! I also use
!
! Linux under wsl
! Linux under hyper-v
!
! Both of these have less than the physical memory available.
!
! Adjust l accordingly
!
! I use the following values with the systems I have
!
! 128 GB system - l=8
! 32 GB system - l=6
! 16 GB system - l=3

l=8

do k=1,l

print *,‘’
call start_timing()
print *,‘’

n = k * start_size 

print *,''
print *,' Problem size = ' , k*128 , ' * 1024 * 1024'
print *,''

allocate (x(n),stat=allocation_status)

if (allocation_status > 0) then
print *,’ Allocation error’
print *,’ Program terminates’
stop 10
end if

allocate (y(n),stat=allocation_status)

if (allocation_status > 0) then
print *,’ Allocation error’
print *,’ Program terminates’
stop 20
end if

allocate (z(n),stat=allocation_status)

if (allocation_status > 0) then
print *,’ Allocation error’
print *,’ Program terminates’
stop 30
end if

!
! Initialisation
!

call random_number(x)
call random_number(y)
z = 0.0_dp

print 110, time_difference()
110 format (’ Initialise time = ', f12.6)
write (20, 120) x(1), y(1), z(1)
120 format (3(2x,f6.3))
print *, ’ ’

do j = 1, loop_count

print 130, j
130 format (' Iteration = ', i3)

!
! Whole array syntax
!

z = x + y
timing_details(1, j) = time_difference()
write (20, 120) x(1), y(1), z(1)
z = 0.0_dp
reset = time_difference()

!
! Simple traditional do loop
!

do i = 1, n
  z(i) = x(i) + y(i)
end do
timing_details(2, j) = time_difference()
z = 0.0_dp
reset = time_difference()

!
! do concurrent loop
!

do concurrent (i=1:n)
  z(i) = x(i) + y(i)
end do
timing_details(3, j) = time_difference()
write (20, 120) x(1), y(1), z(1)
z = 0.0_dp
reset = time_difference()

!
! OpenMP parallel loop
!

!$omp parallel do
do i = 1, n
z(i) = x(i) + y(i)
end do
!$omp end parallel do

timing_details(4, j) = time_difference()
write (20, 120) x(1), y(1), z(1)
z = 0.0_dp
reset = time_difference()

end do

print 140
140 format (15x, ’ Sum Average’)

do i = 1, n_types
t_sum(i) = sum(timing_details(i,1:loop_count))
t_average(i) = t_sum(i)/loop_count
print 150, heading_1(i), t_sum(i), t_average(i)
150 format (a, 2(3x,f12.6))
end do

deallocate (x)
deallocate (y)
deallocate (z)

print *, ’ ’
call end_timing()
print *,’ ’

end do

close (20)

end program

1 Like

I will play a bit of devil’s advocate of the devil here. Let’s go. :slight_smile:

Pure is a true nightmare, in my opinion. Need one print for a quick debug? Be ready to go through layers of code to prune the “pure” virus and never put it back again. For some reason printing to the screen has been determined to be the root of all evil. Pure rules are too tight to be a tool for “humans”, but so loose that they do not help compilers much. I still wait for any case when adding “pure” to the function increased speed of the code.

Array operations are a bit of anti-pattern. Not only they are slower, you also risk stack overflows which you have to tightly guard with compiler options (try to pack a huge array in ifx without -heap-arrays enabled – guaranteed stack overflow). Where is Fortran memory’ safety now?

Elemental functions bring a promise of a faster execuction. But in practice, they are quite useless. You need to enable buggy link-time optimizations for them to introduce any benefits if they are stored in a separate module, where they should be. In fact, to make any use of them, you need to always do (pardon pseudocode)

interface f
module procedure f_rank0, f_rank1, f_rank2
end interface

elemental subroutine f_rank0(x)
...
end subroutine

subroutine f_rank1(x(:))
call f_rank0(x(:))
end

subroutine f_rank2(x(:,:))
call f_rank0(x(:,:))
end

...

…for your code to actually use vectorized instructions reliably outside of the module. Maybe if compilers automatically generated invocations for all possible ranks of inputs it would be more practical. But they don’t.

LTO? Never trust it. C++ does it much better, whatever you need inlined can be in the headers.

Tl;dr: do not trust array operations in your code, unless you really enjoy random segfaults. I really got good and CI/CD trying to make sure my code even runs on all available compilers. Unthinkable in other languages.

Braces himself for the backlash.

Take with a grain if salt, have a good day. :slight_smile:

Dominik

EDIT: I did not mean to depreciate work by Damian Rouson by this post. All his points are completely valid. I just indended to present opposite view of the same language feature. :slight_smile:

6 Likes

FWIW, if printing within a pure procedure is an issue, you can fool the compiler by declaring a pure interface that redirects to an impure subroutine. This has been used in Fortran-debug-utils for instance.

It’s a bit hacky and may be considered as bad practice so you may want to include such piece of code within preprocessor macro like #ifdef _DEBUG … #endif

2 Likes

I used to do

#ifdef NDEBUG
pure &
#endif
subroutine xxx

But that made indenting with findent very annoying and generally confused linters :slight_smile:

1 Like

You can always, you know, use a proper debugger instead of resorting to printing variables.

2 Likes

Printing variables is a completely valid debugging method. :slight_smile: All depends on the circumstance.

3 Likes

Not trying to pick up a fight or anything, but we already established that printing is not always a valid debugging method in Fortran. Arguing about Fortran not being more like C++ is not constructive either.

1 Like

@gronki I doubt you’ll get much backlash – certainly not from me because my primary motivations are mostly orthogonal to performance so each of your points is valid – they just come from a different motivation. FWIW, I’ve had a lot of projects centered around refactoring, modernizing, and/or parallellizing code that other people wrote. In that context, I’m confident that the features I’m discussing have reduced the time investment by an order of magnitude when I encounter these features or can introduce them. One way to think about it is that I look at this from a user perspective even more than a developer prospective – because I’m a user of most code that I write these days. From that perspective, think about how little trouble it is to use intrinsic functions and then recall that intrinsic functions are all pure. I suspect that life with impure intrinsic functions would be a lot more painful. Imagine if intrinsic functions were allowed to do I/O or image control or modify variables not in the argument list. My goal is for most of the procedures I write to feel to the user (and the compiler) as if they were intrinsic functions.

I wrote “mostly” above because there are ways in which pure very much aligns with performance. One is that it’s much easier to parallelize across calls to pure procedures, which is presumably why the standard stipulates that any procedure called inside do concurrent must be pure. In my experience with multithreading do concurrent on CPUs, we’ve gotten the same performance as with explicit OpenMP directives but with less code and potentially more portable code (no additional code for switching to a GPU or switching to a different parallel programming model). See our CADL paper.

Regarding printing for debugging purposes, I agree and think this should be addressed in a future standard. At a minimum, I think Fortran needs assertions and assertions should be allowed to print inside pure procedures when compiled in a debug mode. For the time being, we’ve developed two libraries that offer rich output in pure procedures at the cost of error termination, which is allowed in pure procedures. See Assert and Julienne and our USRSE paper.

7 Likes

Also, as mentioned, I know very little about the internals of compilers, but gfortran developers have told me the the compiler tries to detect whether a procedure could have been declared pure even if it wasn’t declared so. If so, then the procedure gets marked implicitly pure. I suspect that’s reasonably strong evidence that it’s useful to compilers. I think the primary reason they do this is optimization. From this perspective, I suspect a user gets most of the optimization benefit of pure by simply writing procedures that could be declared pure without necessarily adding the pure keyword.

I’ll make an analogy to the author Michael Pollan, who says, “Strive to be like the person who takes dietary supplements.” The point is that empirical evidence suggests that most dietary supplements don’t actually make us healthier and yet most people who take dietary supplements are healthier – because they tend to be the same people who adopt various other habits (diet, exercise, etc.) that make them healthy.

So I recommend,”Strive to write procedures that could be declared pure whether you use the keyword or not.” Then you can insert print statements to debug whenever you like, but life will be easier for both the compiler and the users (if the users know from documentation that most procedures aren’t doing anything that would prevent pure except maybe conditionally when compiled in a “debug” mode), where “user” might include your future self.

5 Likes

This assumes that a “proper debugger” actually exists. In my experience it doesn’t. Any debugger that defaults to printing output in hexadecimal (see gdb) instead of human readable numbers is not my definition of a “proper debugger”. The last debugger I used that was as useful to me as just putting in print statements was Intel’s native debugger they use to ship with ifort. Don’t know why they decided to replace it with their own version of gdb.

3 Likes

That was the best debugger I’ve used, as well. Nothing has matched it since.

2 Likes

I completely agree with gdb being awful, provided it doesn’t crash upon start up —although using it indirectly is somewhat “less horrible”.

Your mention of the Intel debugger reminded me that, back when I still used Windows (about 20 years ago), there was this fantastic thing called Intel Array Visualizer… that, for reasons unknown (to me at least) got discontinued.

1 Like

The most powerful debugger I’ve used is the Totalview debugger but thats targeted at large scale HPC applications using MPI and I think OpenMP (also supports GPUs). There is a free student license (must be full time student) good for 1 year. Also a developer edition but I can’t find what the license cost for that. I know that at one time the version installed on your typical Cray/HPE system was $$$$$.

1 Like

Exactly. However a compiler can be fixed to allow you to print temporarily, perhaps with a compiler option.

Yes, but as a user you want the compiler to check your function that you didn’t make a mistake. So the keyword is good, but compilers just need to allow you to temporarily break it (not just for printing, but to call any other function that is not pure, for debugging). Also: pure is not enough, since the function can read global variables, which is not good (the function is not deterministic, so hard to parallelize/reorder, etc.). Even better is simple, which disallows reading or writing into global variables.

I don’t think array operations need to be slower. They don’t seem to be in LFortran, and if they are, we’ll fix it.

Regarding safety, compilers should have a Debug mode where your code is 100% safe from memory issues. We are far from perfect in LFortran, but our checker already found bugs that other compilers didn’t, e.g.: fix: check before passing unallocated struct member in string concat by jinangshah21 · Pull Request #1229 · fortran-lang/fpm · GitHub. Maybe fpm was only compiled with GFortran, but still. Eventually I would like it to be perfect, impossible to segfault in Debug mode.

Not in LFortran: we store ALL module code in the .mod file and load it, so the compiler has access to the elemental function and with optimizations it eventually gets fully inlined. Other compilers could also implement this mode. (And yes, we support separate compilation too, but then it has the issues you mentioned.)


So yes, I agree with @gronki. But I think it’s all fixable and we are working hard to fix this.

12 Likes

This is fantastic @certik ! If only all compiler developers and vendors (and users!) took this viewpoint, which is obviously far from the reality according to the discussions here.

4 Likes

Hooray!!! Thank you, @certik! It sounds like LFortran is helping to break us out of the infinite loop in which compilers optimize what most programmers write so most programmers write what compilers optimize. At some point, I hope new programming patterns and features get enough attention that it becomes safe to use them without fear of hurting performance.

I recall attending a tutorial by someone from Intel around a decade ago in which the presenter recommended writing array statements. I think that was what emboldened me to do it more.

5 Likes

Totally agree @rouson ! I put this as “let the professionals do the professional thing”. In terms of optimizing array operations, the professionals should be the compiler developers, not the Fortran programmers.

Let the professionals do the professional thing. This is why we have so many subdomains in science/technology/engineering, and this is how scientists/technologists/engineers can focus on advancing their own subdomains while benefiting from the advances in other subdomains that they do not need/want to know anything about.

Excuse me for self-quoting again:

3 Likes

It looks like you both agree that printing is not always good for debugging :grinning_face:. Could you @gronki or @jwmwalrus please perhaps elaborate a little bit, like, in what scenarios, printing may not be a valid debugging method in Fortran? Thank you!
I actually always print the results when debugging the code :rofl:

PS.
You may reply in this thread How to fix "Heisenbug"? so we do not hijack this thread on this topic. Thanks!

I’ll frame it differently: printing should be the last resource debugging method in any programming language.

Example: the only situation in which printing is the only option is when you build your application in Debug mode (-O0) and everything works well, but then on release mode (-O2 or -O3) it crashes or segfaults. It’s tedious, but it happens.

A proper development stack should enable you to press F5 on your keyboard and it automatically attaches your process to the editor and you should be able to set break points and see all your variables. This is standard practice, on all languages. Or you launch your application and set a getchar somewhere, attach your process to the editor and then go on debugging.

2 Likes

I agree with no more unnecessary loops. As users, we do can implement this “no more unnecessary loops” idea. But at the end of day, it requires the compilers can actually optimize the “no more unnecessary loops” code.

I guess the question is, in terms of optimizing array operations (so that no more unnecessary loops), are the compiler developers really putting efforts on this (after all, someone need to do the actual job)? If not, people may just still use loops for speed, if loops in some cases are really fast.

I feel like someone need to actually gather the compiler developers together, and convince them to put efforts on this “no more unnecessary loops” optimization area :grinning_face: