Ideas from High Performance Fortran

Beliavsky · February 2, 2024, 8:31pm

In response to my tweet about the Fortran: Array Language video, wws73 replied

F[ortran] 95 took some things from [High Performance Fortran], but IMHO not enough. They should’ve also incorporated the parallel prefix/suffix and sort/grade functions.

The article High Performance Fortran Programming - Detailed Notes (1996) by A C Marshall says this:

Prefix and Suffix Functions

Both classes of functions are related, Prefix functions go from left to right along an array. The difference between PRODUCT_PREFIX and PRODUCT is that the latter returns a single value whereas the former returns an array result. This array contains the value obtained by multiplying the current element by all the elements to the left of it. In the case of a 2D (or more) example, the array is traversed in array element order.

Suffix functions are exactly the same except that the array is traversed from right to left (or for multi-dimensional arrays in reverse array element order).

Prefix functions scan left to right along an array. Each element of the result depends upon the preceding elements (in array element order).

and gives examples. Does anyone remember why they were not incorporated into Fortran? I don’t have a strong opinion that they should be, but maybe stdlib could add such functions for all the basic types?

Regarding sort and grade, Beginner’s Guide to HPF – High Performance Fortran (1998) by Erin M. Miller says

HPF provides four functions for performing multidimensional array sorts which include GRADE_DOWN,
GRADE_UP, SORT_DOWN and SORT_UP.
GRADE_DOWN and GRADE_UP both return permutations of indices that arrange the array
elements in descending or ascending order, respectively. They do not return a sorted array. For
example, if A = [ 13 20 17 11 ] then GRADE UP(A) = [ 4 1 3 2 ].
The general form of these two functions is:
GRADE_UP(array[, dim]), and
GRADE_DOWN(array[, dim])
where array is an array of any type, and dim is an optional integer scalar indicating the dimension
along which the sorting is to be done.
The SORT_DOWN and SORT_UP functions, on the other hand, do return the actual sorted array.
Thus, for A above, the call SORT_UP(A) returns [ 11 13 17 20 ].
The general form of these two functions is:
SORT_UP(array[, dim]), and
SORT_DOWN(array[, dim])
where array is, again, an array of any type, and dim is an optional integer scalar indicating the
dimension along which the sorting is to be done.

These also seem like candidates for stdlib, if not the language proper.

everythingfunctional · February 5, 2024, 2:14am

I’ve got a proposal in to add the generic “scan” version of these.

https://j3-fortran.org/doc/year/23/23-235r2.txt

wspector · February 18, 2024, 5:04pm

LOL - I (aka wws73) just noticed this thread last night. It deserves more than I could fit in a tweet. I was exposed to APL in college, and my then-developing programming mind picked up on some of it’s unique array processing features. Though not in Iverson’s original book, both scan and grade operators were built into APL interpreters by the late 1960s.

The scan operations, especially prefix sums, are seen fairly often in certain types of codes. Loops with the form:

do, i=2, n
a(i) = a(i-1) + …
end do

On a computer with vector/SIMD parallelism, a compiler might not vectorize the loop because of the dependence between loop iterations. However there are well-known algorithms for computing these reductions in parallel. Cray used to have routines (e.g., FOLR and friends) in the scientific library to incorporate some variations of them.

Adding the segmentation argument is an interesting twist because it allows a problem which might traditionally be represented in a number of small arrays (and not necessarily all the same size) into one very large array - with parallel speedups to match. Blelloch et al used this technique a lot on the Connection Machines where one might have literally thousands of processors, with the data arrays divided amongst them, controlled in SIMD fashion.

For a couple places to start, check out:

Prefix sum - Wikipedia

and:

Segmented scan - Wikipedia

Sorting operations have been available in many programming languages, both by operator and by library calls, for decades. Strangely, Fortran is one of the few that does not. I’ve occasionally bugged a few people on the Fortran standards committee about this for years, but nothing seems to get done. The procedures that were placed in stdlib are a good start. But I really would like versions that are functions, rather than subroutines. Also to do the grade operation, the stdlib routine that returns a permutation array also sorts the input array - which means that if I were to write a functional wrapper around it, I’d need to copy the input array to a temporary.

Years ago when I was working on ESMF, I had the need for a sort procedure. There were several scattered around the code. I refactored them into a single routine called ESMF_UtilSort (see: 5 Infrastructure: Utilities). The guts of it are basically the SLASRT routine from LAPACK. There are also a few relevant procedures in the SLATEC library that could have been used.

Why didn’t the Fortran committee put the scan and sorting procedures into F95? My guess is that due to the way they are defined, a simple preprocessor approach for each type/kind/rank, and DIM argument variation could generate literally millions of variations. So some sort of template approach is needed at compile time to generate in-line code for the variant needed. It can be, and has been, solved. For example, see:

I wrote a 1D and selected 2D subset of the routines for personal use. I used fypp to generate the various routines. Even these numbered in the hundreds.

Not mentioned in my tweet are the HPF combining scatter functions. The scalar version of the loops these represent are essentially:

do, i=1, size (index)
a(index(i)) = a(index(i)) + …
end do

There are two scenarios. First, when the index array array elements are unique from one another, the loop is fairly easily vectorizable/parallizable. The idiom is often used in sparse matrix operations, where you are only working on a subset of the array. On the old Cray machines, a special compiler directive would tell the compiler it was safe to vectorize the loop. However if the programmer used the HPF procedure SUM_SCATTER, the intent is also clear.

The second scenario is where elements of the index array may, and very likely do, have replicated values. For example if one is building a histogram. I’m not sure the HPF definition of SUM_SCATTER guarantees consistent results in this case, but could be wrong on that point. Interestingly, the Cray compilers eventually figured a clever way to automatically vectorize the loop anyway, and this technique could be encapsulated in a SUM_SCATTER-like routine.

certik · February 19, 2024, 12:06am

I think the grade_up functions are what NumPy calls argsort, here is my Fortran implementation of argsort: fortran-utils/src/sorting.f90 at b43bd24cd421509a5bc6d3b9c3eeae8ce856ed88 · certik/fortran-utils · GitHub. My implementation is O(n^2), so it’s slow, obviously it should be sped up. But argsort is a handy function, if you need to sort both eigenvalues and eigenvectors for example. However, I think it’s cheap to also return the sorted array, so one might as well do that.

wspector · February 19, 2024, 2:15am

Thank you for mentioning the numpy argsort function. Yes, it looks like it would do the same job as GRADE_UP. I will take a look at it to see how it is implemented. It is interesting that they give the choice of three different algorithms, plus stability. HPF requires that GRADE_UP and GRADE_DOWN be stable.

egio · February 19, 2024, 12:34pm

I think that the stdlib sorting module provides stable sorting algorithms:

The sorting algorithm is similar to the Tim sort (used in Python) that is at worst O(n log(n)).

wspector · February 19, 2024, 4:23pm

When William Clodius wrote the stdlib sorting module, I offered a few suggestions. He accepted a couple of them, but not all. In particular I’d like to see function versions.

Consider the case where one wants to sort an array of some derived type, based on some key field. It would be natural to simply write:

type(mytype_t), allocatable :: mydata(:), myresult( : )
:
myresult = mydata(GRADE_UP (mydata%key))

However with the current stdlib sort_index routine, one would need to write:

type(mytype_t), allocatable :: mydata(:), myresult( : )
character(:), allocatable :: mykeys (: )
integer, allocatable :: indices(: )
:
mykeys = mydata%keys
call sort_index (mykeys, index=indices, reverse=.false.)
result = mydata(indices)

By comparison, the C qsort eliminates some of the above steps because the caller provides the sizeof(mytype_t) and an appropriate comparison function that knows which field to use as a key. A disadvantage of the C qsort is that since the comparison function is user-provided, it gets called on each and every comparison with the extra overhead of the call. The C++ STL sort also requires a user-provided comparison function, but is able to in-line the code. So it runs faster. But in neither case can they be used in a functional sense in array expressions - like a Fortran version could support.

Getting back to the HPF specs, my opinion is that the full HPF functionality with multi-dimensional array possibilities and the DIM argument is not needed. Simple support of 1D arrays would suffice.

wspector · March 1, 2024, 7:20pm

Update:

Looking through a couple of my old APL books, the classic Sandra Pakin APL\360 book (1972 edition) has the grade up/down operators, but not the scan operator.

Applied APL Programming by Wilbur LePage (1978) does have scan - with 1D, 2D, and 3D examples. The 2D and 3D examples show prefix summing equivalent to using DIM=1, DIM=2, and in the 3D case DIM=3. He devotes almost three pages to it. Of course grade up/down are also present.

As I mentioned a few posts ago, Iversons original book on APL doesn’t have either the grade or scan operators. He does devote quite a bit of prose to sorting though. And he uses the term “scan” when talking about search algorithms rather than prefix algorithms.

It is fun revisiting some of these old texts. (Yes, my wife would rather I throw them all away…)

But back to my original comment on twitter. I think that sort/grade functions would be used by lots of Fortran programmers in lots of applications. In my experience, there are sort routines (and not necessarily efficient ones) embedded in many codes. Having a way to do sorting available in standard form would allow refactoring out code in favor of using the intrinsic capability.

As far as scan operators go, the prefix/suffix functions would help complete the Data Parallel (Data parallelism - Wikipedia) programming model that Fortran 90 started, and Fortran-95 enhanced. Data Parallel programming can require a bit different mind-think than many are used to. So maybe these routines wouldn’t be that popular at first. But ya gotta start somewhere.

Given the implications in implementation, I am more ambivalent about supporting multi-dimensional arrays and the DIM argument. The combining scatter functions also very much fall in this camp. But these problems were solved 30 years ago, and could be solved again. Sucks that several of the major compiler vendors supported all of this back then, but don’t now. (E.g., PGI had pghpf, NAG had a -hpf option, IBM and DEC had compilers back in the day whose descendants still exist, etc.)

Topic		Replies	Views
Need a fast and simple sorting function/subroutine Help	5	1833	May 17, 2022
Reduction Functions Language enhancement	3	345	March 1, 2024
Array comprehensions in Fortran Help	23	2862	January 12, 2022
Modern Fortran sample code Help	19	3251	January 27, 2023
Improving Fortran standardization process (lessons from C++23 getting multidimensional arrays)	42	3998	September 15, 2022

Ideas from High Performance Fortran

Related topics