Fortran is dead – Long live Fortran!

Torsten Hoefler’s random access spontaneous talk given at the 42nd anniversary Salishan Conference on High-Speed Computing in 2023. Discusses how to lift Fortran code to a data-centric representation to optimize it for accelerator devices. Work led by Alexandru Calotoiu in SPCL.

7 Likes

For the loop nest at 8:05, Torsten says that not a single compiler can parallelize that loop automatically, but I remain confused whether he is talking about vectorizing the inner loop or doing the reduction on the outer loop in parallel.

The loop appears to be taken from the dwarf-p-cloudsc miniapp, specifically from lines 2655 - 2658 of the file cloudsc.F90 (which contains a single ~2800-line subroutine :eyes: ):

  ! Backsubstitution 
  !  step 1 
  DO JN=2,NCLV
    DO JM = 1,JN-1
      ZQXN(KIDIA:KFDIA,JN)=ZQXN(KIDIA:KFDIA,JN)-ZQLHS(KIDIA:KFDIA,JN,JM) &
       &  *ZQXN(KIDIA:KFDIA,JM)
    ENDDO
  ENDDO

Rewriting the array expression as a loop,

    DO JM = 1,JN-1
       DO JL = KIDIA, KFDIA
           ZQXN(JL,JN) = ZQXN(JL,JN) - ZQLHS(JL,JN,JM)*ZQXN(JL,JM)
       ENDDO
    ENDDO

gives a loop nest resembling what is shown in the slide.

1 Like

I just saw this talk today and I wondered the same thing…

2 Likes

I think the problem for auto-vectorization is that the compiler might not be able to deduce that there is no aliasing: the JM loop terminates at JN - 1, while the results are accumulated into column JN. One way to fix this would be an IVDEP directive or an !$omp simd directive. Edit: do concurrent (DC) would also work (assuming the Fortran compiler treats DC as an IVDEP annotation and not as some other parallelization mode).
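For example, taking the expanded loop nest from above, the assertion can go directly on the innermost loop. A sketch using !$omp simd (an IVDEP directive in the compiler's own spelling would serve the same purpose):

  DO JN = 2, NCLV
    DO JM = 1, JN-1
      !$omp simd
      DO JL = KIDIA, KFDIA
        ZQXN(JL,JN) = ZQXN(JL,JN) - ZQLHS(JL,JN,JM)*ZQXN(JL,JM)
      ENDDO
    ENDDO
  ENDDO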

1 Like

A very similar problem occurs during the trailing panel update in LU factorization. Levesque & Williamson write about this in “A Guidebook to Fortran on Supercomputers”:

In the next few examples we will examine cases where the compiler may not know if a loop is recursive. Loop nest 43020 is a traditionally coded Gaussian elimination scheme. Most compilers will try to vectorize the inner loop where row subtraction is taking place. Note that A(J,K) is a function of A(J,I). Although both I and J are fixed values within that inner loop, a compiler must attempt to determine whether K ever assumes the value of I (on other than the last iteration), since this would be recursive and must not be vectorized.

There are two approaches for the compiler in optimizing this case. The first approach is to be smart enough to recognize that there is no possibility for recursion, since the initial (and lowest) value of K is I+1. The other approach is to generate one code sequence for executing the loop in scalar mode and another sequence to execute it in vector mode, then test at runtime whether K will assume the value of I in the inner loop. This second approach is called conditional compilation and is used quite extensively by CFT 1.15 [Cray Fortran].

To assure that the compiler fully vectorizes the restructured loop 43021, we have inserted directives (for Alliant, Cray, and NEC in this case) that inform the compiler to ignore potential recursion within the loop. Figure 4.10 illustrates that the performance improves by 50-100% for longer vector lengths. The performance gain results from the elimination of the execution-time test by the compiler.

C     THE ORIGINAL
C   GAUSS ELIMINATION
      DO 43020 I = 1, MATDIM
         A(I,I) = 1. / A(I,I)
         DO 43020 J = I+1, MATDIM
            A(J,I) = A(J,I) * A(I,I)
            DO 43020 K = I+1, MATDIM
               A(J,K) = A(J,K) - A(J,I) * A(I,K)
43020 CONTINUE

C     THE RESTRUCTURED
C   GAUSS ELIMINATION
      DO 43021 I = 1, MATDIM
         A(I,I) = 1. / A(I,I)
         DO 43021 J = I+1, MATDIM
            A(J,I) = A(J,I) * A(I,I)
CVD$ NODEPCHK
CDIR$ IVDEP
*VDIR NODEP
            DO 43021 K = I+1, MATDIM
               A(J,K) = A(J,K) - A(J,I) * A(I,K)
43021 CONTINUE
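The runtime-test strategy Levesque describes can be pictured in source form: the compiler emits both a vector and a scalar body and branches between them at execution time. A sketch, using hypothetical loop bounds KLO and KHI that (unlike I+1 above) are unknown at compile time:

C     LOOP VERSIONING (SKETCH): TEST AT RUNTIME WHETHER K CAN EQUAL I
      IF (I .LT. KLO .OR. I .GT. KHI) THEN
C        K NEVER EQUALS I: SAFE TO RUN THE VECTOR VERSION
CDIR$ IVDEP
         DO 10 K = KLO, KHI
            A(J,K) = A(J,K) - A(J,I) * A(I,K)
   10    CONTINUE
      ELSE
C        POSSIBLE RECURSION: FALL BACK TO THE SCALAR VERSION
         DO 20 K = KLO, KHI
            A(J,K) = A(J,K) - A(J,I) * A(I,K)
   20    CONTINUE
      END IF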

Levesque's loop nest corresponds to the BLAS routine DGER (a rank-1 update), which is used by the LAPACK routine DGETF2.
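In fact, the two inner loops map onto DSCAL (the pivot-column scaling) and DGER (the rank-1 update). A sketch, assuming A is declared DOUBLE PRECISION A(MATDIM,MATDIM) and the program is linked against a BLAS library:

C     THE SAME UPDATE VIA BLAS (SKETCH)
      DO 43022 I = 1, MATDIM
         A(I,I) = 1.D0 / A(I,I)
         IF (I .LT. MATDIM) THEN
C           SCALE THE PIVOT COLUMN, THEN APPLY THE RANK-1 UPDATE
            CALL DSCAL(MATDIM-I, A(I,I), A(I+1,I), 1)
            CALL DGER(MATDIM-I, MATDIM-I, -1.D0, A(I+1,I), 1,
     &                A(I,I+1), MATDIM, A(I+1,I+1), MATDIM)
         END IF
43022 CONTINUE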

1 Like

I wrote Fortran code for all of those machines in the 1980s and early 1990s, and I used those same compiler directives many times.

I think in this case, a slight rewrite of the code allows the compiler to vectorize the innermost loop.

      DO 43020 I = 1, MATDIM
         A(I,I) = 1. / A(I,I)
         DO 43020 J = I+1, MATDIM
            AJI = A(J,I) * A(I,I)
            A(J,I) = AJI
            DO 43020 K = I+1, MATDIM
               A(J,K) = A(J,K) - AJI * A(I,K)
43020 CONTINUE

The compiler will likely not even store the AJI variable in memory; it will just live in a register for its lifetime. Of course, one would like a Fortran compiler to recognize all cases of vectorization, or the lack of recursion in this case, but especially in the 1980s that did not always occur. However, if this is still the situation in the language almost 40 years later, perhaps there should be some standard way to tell the compiler what to do.

do concurrent? Array expressions?

!$omp simd also belongs to a standard. If I remember correctly, the Intel Fortran compiler has -qopenmp-simd turned ON by default.

The compiler would need to prove that the range I+1 .. MATDIM and the index I are disjoint, hence that the loop can be safely vectorized. I guess there is a name for this type of reasoning (data dependence analysis?). Maybe @certik knows?


Here is a slight variation of Levesque’s loop and the DGER routine:

subroutine trailing_update(m,n,A,lda,k)
    integer, parameter :: dp = kind(1.0d0)
    integer, intent(in) :: m, n, lda, k
    real(dp), intent(inout) :: A(lda,n)
    integer :: i, j
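
    ! Rank-1 update of the trailing submatrix:
    !   A(k+1:m, k+1:n) = A(k+1:m, k+1:n) - A(k+1:m, k) * A(k, k+1:n)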

    do j = k+1,n
        do i = k+1,m
            A(i,j) = A(i,j) - A(i,k)*A(k,j)
        end do
    end do

end subroutine

When compiling with gfortran -O2 -ftree-vectorize -march=skylake, it creates both a scalar and a vector path (just like Cray CFT did).

If you add !GCC$ IVDEP or !$omp simd, the scalar path disappears.
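For reference, the directive goes immediately before the inner loop (either annotation works):

    do j = k+1,n
        !GCC$ IVDEP
        do i = k+1,m
            A(i,j) = A(i,j) - A(i,k)*A(k,j)
        end do
    end do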

Even the Intel compiler (ifx -O3 -xSKYLAKE) struggles to create a vector version. Instead it unrolls 4 iterations and uses scalar instructions:

.LBB0_5:
        vmovsd  xmm0, qword ptr [r13 + 8*rsi - 8]
        vmovsd  xmm1, qword ptr [r11 + rbp - 24]
        vfnmadd213sd    xmm1, xmm0, qword ptr [r14 + rbp - 24]
        vmovsd  qword ptr [r14 + rbp - 24], xmm1
        vmovsd  xmm0, qword ptr [r13 + 8*rsi - 8]
        vmovsd  xmm1, qword ptr [r11 + rbp - 16]
        vfnmadd213sd    xmm1, xmm0, qword ptr [r14 + rbp - 16]
        vmovsd  qword ptr [r14 + rbp - 16], xmm1
        vmovsd  xmm0, qword ptr [r13 + 8*rsi - 8]
        vmovsd  xmm1, qword ptr [r11 + rbp - 8]
        vfnmadd213sd    xmm1, xmm0, qword ptr [r14 + rbp - 8]
        vmovsd  qword ptr [r14 + rbp - 8], xmm1
        vmovsd  xmm0, qword ptr [r13 + 8*rsi - 8]
        vmovsd  xmm1, qword ptr [r11 + rbp]
        vfnmadd213sd    xmm1, xmm0, qword ptr [r14 + rbp]
        vmovsd  qword ptr [r14 + rbp], xmm1
        add     rbp, 32
        cmp     r15, rbp
        jne     .LBB0_5

When you add !dir$ ivdep or !$omp simd, then ifx also does the right thing:

        vbroadcastsd    ymm1, xmm0
        xor     ebx, ebx
.LBB0_7:
        vmovupd ymm2, ymmword ptr [r9 + 8*rbx]
        vfnmadd213pd    ymm2, ymm1, ymmword ptr [r10 + 8*rbx]
        vmovupd ymmword ptr [r10 + 8*rbx], ymm2
        add     rbx, 4
        cmp     rbx, r8
        jb      .LBB0_7

One more difference worth noting: :white_check_mark: the Intel compiler broadcasts the A(k,j) value before the hot loop. :cross_mark: gfortran does the scalar-to-vector broadcast on each iteration (despite the fact that rax is constant in that loop).

What is interesting is that IFORT vectorizes this loop nest even without directives:

..B1.14:                        # Preds ..B1.12 ..B1.10
        vbroadcastsd ymm3, xmm1                                 #10.38
        vbroadcastsd ymm2, xmm0                                 #10.38
        movsxd    rcx, esi                                      #9.9
        mov       r8, QWORD PTR [-48+rsp]                       #9.9[spill]
..B1.15:                        # Preds ..B1.15 ..B1.14
        vmovupd   ymm4, YMMWORD PTR [r8+rdx*8]                  #10.31
        vfnmadd213pd ymm4, ymm3, YMMWORD PTR [rbx+rdx*8]        #10.13
        vmovupd   YMMWORD PTR [rbx+rdx*8], ymm4                 #10.13
        vmovupd   ymm5, YMMWORD PTR [r8+rdx*8]                  #10.31
        vfnmadd213pd ymm5, ymm2, YMMWORD PTR [rdi+rdx*8]        #10.13
        vmovupd   YMMWORD PTR [rdi+rdx*8], ymm5                 #10.13
        vmovupd   ymm6, YMMWORD PTR [32+r8+rdx*8]               #10.31
        vfnmadd213pd ymm6, ymm3, YMMWORD PTR [32+rbx+rdx*8]     #10.13
        vmovupd   YMMWORD PTR [32+rbx+rdx*8], ymm6              #10.13
        vmovupd   ymm7, YMMWORD PTR [32+r8+rdx*8]               #10.31
        vfnmadd213pd ymm7, ymm2, YMMWORD PTR [32+rdi+rdx*8]     #10.13
        vmovupd   YMMWORD PTR [32+rdi+rdx*8], ymm7              #10.13
        vmovupd   ymm8, YMMWORD PTR [64+r8+rdx*8]               #10.31
        vfnmadd213pd ymm8, ymm3, YMMWORD PTR [64+rbx+rdx*8]     #10.13
        vmovupd   YMMWORD PTR [64+rbx+rdx*8], ymm8              #10.13
        vmovupd   ymm9, YMMWORD PTR [64+r8+rdx*8]               #10.31
        vfnmadd213pd ymm9, ymm2, YMMWORD PTR [64+rdi+rdx*8]     #10.13
        vmovupd   YMMWORD PTR [64+rdi+rdx*8], ymm9              #10.13
        vmovupd   ymm10, YMMWORD PTR [96+r8+rdx*8]              #10.31
        vfnmadd213pd ymm10, ymm3, YMMWORD PTR [96+rbx+rdx*8]    #10.13
        vmovupd   YMMWORD PTR [96+rbx+rdx*8], ymm10             #10.13
        vmovupd   ymm11, YMMWORD PTR [96+r8+rdx*8]              #10.31
        vfnmadd213pd ymm11, ymm2, YMMWORD PTR [96+rdi+rdx*8]    #10.13
        vmovupd   YMMWORD PTR [96+rdi+rdx*8], ymm11             #10.13
        add       rdx, 16                                       #9.9
        cmp       rdx, rcx                                      #9.9
        jb        ..B1.15       # Prob 82%    

Roughly speaking, it interleaves the updates for two iterations in the second dimension (notice the two vbroadcastsd instructions) and unrolls 16 iterations along the first array dimension. It also uses different registers, presumably to avoid (false) instruction-scheduling dependencies. I wonder what heuristics it uses to determine these unrolling factors. Obviously register availability is one.
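In source terms, IFORT's strategy corresponds roughly to an unroll-and-jam of the j loop by two (a sketch, ignoring the remainder iteration when the trip count is odd), with the i loop then vectorized and unrolled by 16:

    do j = k+1, n-1, 2
        do i = k+1, m
            A(i,j)   = A(i,j)   - A(i,k)*A(k,j)
            A(i,j+1) = A(i,j+1) - A(i,k)*A(k,j+1)
        end do
    end do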

ifx does something similar to IFORT when you use

    do concurrent (j = k+1:n, i=k+1:m)
        A(i,j) = A(i,j) - A(i,k)*A(k,j)
    end do

But instead of chunks of 16 x 2, it iterates across chunks of size 4 x 8. Also note the repeated use of ymm0:

        vmovapd xmmword ptr [rsp + 48], xmm8
        vbroadcastsd    ymm8, xmm8
        vbroadcastsd    ymm9, xmm1
        vbroadcastsd    ymm10, xmm2
        vbroadcastsd    ymm11, xmm3
        vbroadcastsd    ymm12, xmm4
        vbroadcastsd    ymm13, xmm5
        vbroadcastsd    ymm14, xmm6
        vbroadcastsd    ymm15, xmm7
        xor     eax, eax
.LBB0_9:
        vmovupd ymm0, ymmword ptr [rbp + 8*rax]
        lea     rdx, [r11 + 8*rax]
        vfnmadd213pd    ymm0, ymm8, ymmword ptr [r11 + 8*rax]
        vmovupd ymmword ptr [r11 + 8*rax], ymm0
        vmovupd ymm0, ymmword ptr [rbp + 8*rax]
        vfnmadd213pd    ymm0, ymm9, ymmword ptr [rcx + rdx]
        vmovupd ymmword ptr [rcx + rdx], ymm0
        add     rdx, rcx
        vmovupd ymm0, ymmword ptr [rbp + 8*rax]
        vfnmadd213pd    ymm0, ymm10, ymmword ptr [rcx + rdx]
        vmovupd ymmword ptr [rcx + rdx], ymm0
        lea     rdx, [rdx + rcx]
        vmovupd ymm0, ymmword ptr [rbp + 8*rax]
        vfnmadd213pd    ymm0, ymm11, ymmword ptr [rcx + rdx]
        vmovupd ymmword ptr [rcx + rdx], ymm0
        lea     rdx, [rdx + rcx]
        vmovupd ymm0, ymmword ptr [rbp + 8*rax]
        vfnmadd213pd    ymm0, ymm12, ymmword ptr [rcx + rdx]
        vmovupd ymmword ptr [rcx + rdx], ymm0
        lea     rdx, [rdx + rcx]
        vmovupd ymm0, ymmword ptr [rbp + 8*rax]
        vfnmadd213pd    ymm0, ymm13, ymmword ptr [rcx + rdx]
        vmovupd ymmword ptr [rcx + rdx], ymm0
        add     rdx, rcx
        vmovupd ymm0, ymmword ptr [rbp + 8*rax]
        vfnmadd213pd    ymm0, ymm14, ymmword ptr [rcx + rdx]
        vmovupd ymmword ptr [rcx + rdx], ymm0
        add     rdx, rcx
        vmovupd ymm0, ymmword ptr [rbp + 8*rax]
        vfnmadd213pd    ymm0, ymm15, ymmword ptr [rcx + rdx]
        vmovupd ymmword ptr [rcx + rdx], ymm0
        add     rax, 4
        cmp     rax, r13
        jbe     .LBB0_9

So three compilers, three different strategies. Two of them (gfortran and ifx) only vectorize unconditionally when the programmer asserts it is safe.

Addendum: gfortran moves the broadcast out of the loop if the code is written the way @RonShepard suggested:

    do j = k+1,n
        tmp = A(k,j)
        do i = k+1,m
            A(i,j) = A(i,j) - A(i,k)*tmp
        end do
    end do

The generated inner loop is now:

        vbroadcastsd    ymm1, xmm2
        xor     edi, edi
.L6:
        vmovupd ymm0, YMMWORD PTR [rcx+rdi]
        vfnmadd213pd    ymm0, ymm1, YMMWORD PTR [rax+rdi]
        vmovupd YMMWORD PTR [rax+rdi], ymm0
        add     rdi, 32
        cmp     rdx, rdi
        jne     .L6

The iteration space is traversed in chunks of 4 x 1, the same as ifx with the directives.

3 Likes

This is obvious to a human, but I have no idea how optimizers deduce such things, or whether this is part of the language-specific front end or part of the general back end of a modern compiler.

I did think about do concurrent, but my feeling is that construct is more about concurrent strip-mining and subblocking than it is about recursion. However, the lack of recursion is indeed part of the semantics of that construct, so maybe that is sufficient? I also considered OpenMP, and I think !$omp simd also works; it asserts vectorizability without requiring multithreaded execution, though it is still more than just telling the compiler that no recursion occurs within the loop. And, of course, OpenMP is not part of the language standard; it is a separate standard. Array syntax actually does the opposite of what the programmer needs; it tells the compiler that if it can't figure things out, it should allocate temporary arrays for the operands and for the result and do everything with those temporaries. In this case, with just two floating-point operations per loop cycle, that workspace allocation and data movement would be a performance disaster.
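For example, the trailing update written with array syntax would be:

    do j = k+1, n
        A(k+1:m,j) = A(k+1:m,j) - A(k+1:m,k)*A(k,j)
    end do

If the compiler cannot prove that the right-hand-side sections are independent of the left-hand side, it is allowed to evaluate the right-hand side into a compiler-generated temporary and copy the result back.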

Thanks much for your efforts with Compiler Explorer. I don't understand the details of the machine instructions, but I think I can follow your discussion and understand the consequences of the various compiler choices.

1 Like

I don’t know either yet. After beta we’ll focus on optimizations more with LFortran. I have many ideas though. :slight_smile:

IIRC, CFT 1.14 could also do it. At the time, Cray was also developing the next generation compiler - CFT77. There was a period where CFT could do tricks that CFT77 couldn’t - and vice versa. CFT77 was a globally optimizing compiler, whereas CFT mostly optimized within basic blocks, and towards the end, extended basic blocks. So CFT77 generated better scalar code from the start. But eventually CFT77 learned all of CFT’s vector tricks.

What was really fun was when one of the compiler developers grafted the C compiler front end onto the CFT77 optimization/vectorization and code generation. It was one of those little “after midnight” projects which worked a little too well, and ended up causing a bit of a stir…