Ideas from High Performance Fortran

LOL - I (aka wws73) just noticed this thread last night. It deserves more than I could fit in a tweet. I was exposed to APL in college, and my then-developing programming mind picked up on some of its unique array processing features. Though not in Iverson’s original book, both scan and grade operators were built into APL interpreters by the late 1960s.

The scan operations, especially prefix sums, show up fairly often in certain types of codes, in loops of the form:

do, i=2, n
   a(i) = a(i-1) + …
end do

On a computer with vector/SIMD parallelism, a compiler might not vectorize this loop because of the dependence between loop iterations. However, there are well-known algorithms for computing these reductions in parallel. Cray used to have routines (e.g., FOLR and friends) in the scientific library that implemented some variations of them.
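To give a flavor of those algorithms, here is a minimal sketch of a doubling ("Hillis-Steele") style inclusive prefix sum written with whole-array operations, so each pass is itself a data-parallel step and only about log2(n) dependent passes remain. The routine name and interface are my own, not from any library or from HPF.

! Minimal sketch: doubling inclusive prefix sum.  Each pass adds a
! copy of the partial sums shifted by a power-of-two offset; the RHS
! of the array assignment is evaluated before the store, so every
! pass is a fully parallel array operation.
function prefix_sum(x) result(s)
   implicit none
   real, intent(in) :: x(:)
   real             :: s(size(x))
   integer          :: n, shift
   n = size(x)
   s = x
   shift = 1
   do while (shift < n)
      ! elements beyond the offset pick up the partial sums
      ! computed 'shift' positions to their left
      s(shift+1:n) = s(shift+1:n) + s(1:n-shift)
      shift = 2*shift
   end do
end function prefix_sum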

Adding the segmentation argument is an interesting twist because it allows a problem which might traditionally be represented as a number of small arrays (not necessarily all the same size) to be packed into one very large array - with parallel speedups to match. Blelloch et al. used this technique a lot on the Connection Machines, where one might have literally thousands of processors, with the data arrays divided amongst them, controlled in SIMD fashion.
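To make the segmentation idea concrete, here is a minimal scalar sketch of the segmented prefix-sum semantics: the small arrays are packed end-to-end into one array, and a logical flag marks where each segment starts, resetting the running sum there. Names and the interface are mine; this shows the definition, not a parallel implementation (Blelloch's point is that the same doubling scheme above still works if the flags are carried along).

! Minimal sketch: segmented prefix sum, scalar definition.
! seg_start(i) is .true. where a new segment begins.
function seg_prefix_sum(x, seg_start) result(s)
   implicit none
   real,    intent(in) :: x(:)
   logical, intent(in) :: seg_start(:)
   real                :: s(size(x))
   integer             :: i
   s(1) = x(1)
   do i = 2, size(x)
      if (seg_start(i)) then
         s(i) = x(i)              ! reset at a segment boundary
      else
         s(i) = s(i-1) + x(i)     ! ordinary prefix-sum step
      end if
   end do
end function seg_prefix_sum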

For a couple of places to start, check out:

Prefix sum - Wikipedia

and:

Segmented scan - Wikipedia

Sorting operations have been available in many programming languages, both as operators and as library calls, for decades. Strangely, Fortran is one of the few that has neither. I’ve occasionally bugged a few people on the Fortran standards committee about this for years, but nothing seems to get done. The procedures that were placed in stdlib are a good start. But I really would like versions that are functions, rather than subroutines. Also, to do the grade operation, the stdlib routine that returns a permutation array also sorts the input array - which means that if I were to write a functional wrapper around it, I’d need to copy the input array to a temporary.
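Such a functional wrapper might look roughly like the sketch below. It assumes the stdlib_sorting module's sort_index interface (which sorts its array argument in place while filling the permutation) and its int_index kind; the wrapper name grade_up is mine. The copy to a temporary is exactly the overhead I'm complaining about.

! Minimal sketch of a functional "grade up", assuming stdlib_sorting's
! sort_index(array, index) interface; grade_up is a made-up name.
function grade_up(x) result(perm)
   use stdlib_sorting, only: sort_index, int_index
   implicit none
   real, intent(in)   :: x(:)
   integer(int_index) :: perm(size(x))
   real               :: tmp(size(x))
   tmp = x                      ! protect the caller's array
   call sort_index(tmp, perm)   ! sorts tmp and fills perm
end function grade_up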

Years ago, when I was working on ESMF, I needed a sort procedure. There were several scattered around the code, so I refactored them into a single routine called ESMF_UtilSort (see: 5 Infrastructure: Utilities). The guts of it are basically the SLASRT routine from LAPACK. There are also a few relevant procedures in the SLATEC library that could have been used.
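For reference, SLASRT itself can be called directly; a minimal sketch is below (single precision, with DLASRT as the double-precision twin). 'I' requests increasing order; the demo program and data are just illustrative.

! Minimal sketch: sorting a small array with LAPACK's SLASRT.
program demo_slasrt
   implicit none
   real     :: d(5) = [3.0, 1.0, 4.0, 1.0, 5.0]
   integer  :: info
   external :: slasrt
   call slasrt('I', size(d), d, info)   ! 'I' = increasing order
   if (info /= 0) print *, 'SLASRT error, info =', info
   print *, d                           ! 1.0 1.0 3.0 4.0 5.0
end program demo_slasrt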

Why didn’t the Fortran committee put the scan and sorting procedures into F95? My guess is that, due to the way they are defined, a simple preprocessor approach covering every type/kind/rank and DIM argument variation could generate literally millions of specific routines. So some sort of template approach is needed at compile time to generate inline code for the variant needed. It can be, and has been, solved. For example, see:

I wrote a 1D and selected 2D subset of the routines for personal use, using fypp to generate the various specific routines. Even these numbered in the hundreds.
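The flavor of that generation looks roughly like the fypp sketch below: one template, expanded once per kind. The kind list, routine name, and body here are placeholders, not my actual code.

#! Minimal fypp sketch: expand one sort template per real kind.
#:set RKINDS = ['real32', 'real64']
#:for rk in RKINDS
subroutine my_sort_${rk}$(a)
   use iso_fortran_env, only: ${rk}$
   real(${rk}$), intent(inout) :: a(:)
   ! ... sorting body goes here ...
end subroutine my_sort_${rk}$
#:endfor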

Not mentioned in my tweet are the HPF combining scatter functions. The scalar version of the loops these represent is essentially:

do, i=1, size (index)
   a(index(i)) = a(index(i)) + …
end do

There are two scenarios. First, when the index array elements are all distinct, the loop is fairly easily vectorizable/parallelizable. The idiom is often used in sparse matrix operations, where you are only working on a subset of the array. On the old Cray machines, a special compiler directive would tell the compiler it was safe to vectorize the loop. However, if the programmer used the HPF procedure SUM_SCATTER, the intent is just as clear.
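A minimal sketch of that unique-index case follows; the directive spelling varies by compiler (Cray used CDIR$ IVDEP in fixed form, and !DIR$ IVDEP is a common free-form equivalent), and b here is just a made-up stand-in for whatever is being accumulated.

! Minimal sketch: the programmer asserts no two values of index(i)
! collide, so the gather/update/scatter can be vectorized.
!DIR$ IVDEP
do i = 1, size(index)
   a(index(i)) = a(index(i)) + b(i)
end do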

The second scenario is where elements of the index array may, and very likely do, have replicated values - for example, when building a histogram. I’m not sure the HPF definition of SUM_SCATTER guarantees consistent results in this case, but I could be wrong on that point. Interestingly, the Cray compilers eventually figured out a clever way to automatically vectorize the loop anyway, and this technique could be encapsulated in a SUM_SCATTER-like routine.
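For completeness, here is a minimal scalar sketch of what a 1-D SUM_SCATTER-like routine does in the repeated-index case. The name and interface are mine, only loosely modelled on the HPF library procedure; a real implementation would hide the vectorization trick inside.

! Minimal sketch: combine values into base at positions given by indx;
! duplicate indices simply accumulate (the histogram case).
function sum_scatter1(array, base, indx) result(r)
   implicit none
   real,    intent(in) :: array(:)   ! values to be combined
   real,    intent(in) :: base(:)    ! starting contents of the result
   integer, intent(in) :: indx(:)    ! target position for each value
   real                :: r(size(base))
   integer             :: i
   r = base
   do i = 1, size(array)
      r(indx(i)) = r(indx(i)) + array(i)
   end do
end function sum_scatter1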
