Optimisation of invariant `if` conditions in a loop

Hello,

I am considering revising some of the CPP macros in a code. However, in some cases, I have loops with this kind of logic:

   do k = 1, nz
     do j = 1, ny
       do i = 1, nx
         diff1 = field(i, j, k) + 1.
         diff2 = field(i, j, k) - 1.
 #if defined(_OPTION_1)
 #if defined(_OPTION_1_PLUS)
         res = diff1 + diff2
 #else
         res = diff1 * diff2
 #endif
 #else
         res = diff1 - diff2
 #endif
         field(i, j, k) = res
       end do
     end do
   end do

I would think that these CPP macros can be replaced with plain if conditions: since the options are known before the loop starts (they are loop-invariant), the compiler could simply generate an optimized version of the loop for each option.

In other words, take these two examples (I added OpenACC directives for optional testing on a GPU, e.g., with nvfortran).

Example 1:

Can be compiled and run with:
gfortran -cpp -D_OPTION_1 -D_OPTION_1_PLUS test_cpp.f90 -o test_cpp; ./test_cpp

program p
  implicit none
  integer, parameter :: nx = 5, ny = 5, nz = 5
  integer :: i, j, k
  real :: field(nx, ny, nz)
  real :: diff1, diff2, res
  integer :: iunit
  !
  ! initialize and move data to GPU
  !
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        field(i, j, k) = i + j*10. + k*100.
      end do
    end do
  end do
  !$acc enter data copyin(field)
  !
  ! run kernel
  !
  !$acc parallel loop collapse(3) default(present) private(diff1,diff2,res)
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        diff1 = field(i, j, k) + 1.
        diff2 = field(i, j, k) - 1.
#if defined(_OPTION_1)
#if defined(_OPTION_1_PLUS)
        res = diff1 + diff2
#else
        res = diff1 * diff2
#endif
#else
        res = diff1 - diff2
#endif
        field(i, j, k) = res
      end do
    end do
  end do
  !
  ! move data to host and save
  !
  !$acc update self(field)
  open(newunit=iunit, file="output.bin", form="unformatted", access="stream", status="replace")
  write(iunit) field
  close(iunit)
  print*, 'field(3,3,3) = ', field(3,3,3)
end program p

Example 2

Can be compiled and run with:
gfortran test_nocpp.f90 -o test_nocpp; ./test_nocpp 1 1

program p
  implicit none
  integer, parameter :: nx = 5, ny = 5, nz = 5
  integer :: i, j, k
  real :: field(nx, ny, nz)
  real :: diff1, diff2, res
  integer :: iunit
  integer :: option1, option1_plus
  character(len=32) :: arg
  !
  ! get options from the command line
  !
  if (command_argument_count() >= 1) then
    call get_command_argument(1, arg)
    read(arg, *) option1
  else
    option1 = 0
  endif
  if (command_argument_count() >= 2) then
    call get_command_argument(2, arg)
    read(arg, *) option1_plus
  else
    option1_plus = 0
  endif
  !
  ! initialize and move data to GPU
  !
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        field(i, j, k) = i + j*10. + k*100.
      end do
    end do
  end do
  !$acc enter data copyin(field, option1, option1_plus)
  !
  ! run kernel
  !
  !$acc parallel loop collapse(3) default(present) private(diff1,diff2,res)
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        diff1 = field(i, j, k) + 1.
        diff2 = field(i, j, k) - 1.
        if (option1 == 1) then
          if (option1_plus == 1) then
            res = diff1 + diff2
          else
            res = diff1 * diff2
          endif
        else
          res = diff1 - diff2
        endif
        field(i, j, k) = res
      end do
    end do
  end do
  !
  ! move data to host and save
  !
  !$acc update self(field)
  open(newunit=iunit, file="output.bin", form="unformatted", access="stream", status="replace")
  write(iunit) field
  close(iunit)
  print*, 'field(3,3,3) = ', field(3,3,3)
end program p

Questions:

  • Would any of the current compilers fail to optimize the two programs to make them equivalent in performance?
  • How would you check this? Would one need to read and understand the intermediate representation (IR) or assembly?

Thanks!

EDIT: Related: Loop-invariant code motion - Wikipedia

3 Likes

Testing your Fortran-only example (without the preprocessor):

~/fortran/loop_unswitch$ gfortran -O3 -fopt-info p.f90
p.f90:42:12: optimized: loop with 5 iterations completely unrolled (header execution count 178992760)
p.f90:30:12: optimized: loop with 5 iterations completely unrolled (header execution count 178992760)
p.f90:29:10: optimized: loop with 5 iterations completely unrolled (header execution count 35807143)
p.f90:41:10: optimized: unswitching outer loop 4 on 'if' with condition: option1.8_161 == 1
p.f90:41:10: optimized: unswitching outer loop 4 on 'if' with condition: option1_plus.9_163 == 1
...

Seems to be among the -O3 optimizations; it can be enabled explicitly at >= -O1 with -funswitch-loops:

$ gfortran -O2 -funswitch-loops -fopt-info p.f90 
p.f90:42:12: optimized: unswitching outer loop 4 on 'if' with condition: option1.8_28 == 1
p.f90:42:12: optimized: unswitching outer loop 4 on 'if' with condition: option1_plus.9_29 == 1
...

Edit: apologies for the confusing renaming; my file p.f90 is the test_nocpp.f90 source file.

5 Likes

Thanks Ivan!

I see ifx and nvfortran also performing it at the -O2 level (ifx’s default):

$ nvfortran -O2 -acc -Minfo=accel,opt test_nocpp.f90
p:
     29, Outer loop unrolled 5 times (completely unrolled)
     35, Generating enter data copyin(option1,option1_plus,field(:,:,:))
     39, Generating NVIDIA GPU code
         40, !$acc loop gang, vector(128) collapse(3) ! blockidx%x threadidx%x
         41,   ! blockidx%x threadidx%x collapsed
         42,   ! blockidx%x threadidx%x collapsed
     39, Generating default present(field(:,:,:))
     42, Generating implicit firstprivate(option1_plus,option1)
         Invariant if transformation
         Loop unrolled 5 times (completely unrolled)
     61, Generating update self(field(:,:,:))
$ ifx -O2 -qopt-report=1 test_nocpp.f90 && cat test_nocpp.optrpt

Begin optimization report for: MAIN__

LOOP BEGIN at test_nocpp.f90 (28, 3)
    remark #25436: Loop completely unrolled by 5

    LOOP BEGIN at test_nocpp.f90 (29, 5)
        remark #25436: Loop completely unrolled by 5

        LOOP BEGIN at test_nocpp.f90 (30, 7)
            remark #25436: Loop completely unrolled by 5
        LOOP END
    LOOP END
LOOP END

LOOP BEGIN at test_nocpp.f90 (40, 3)
<Predicate Optimized v1>
    remark #25423: Invariant If condition at line 45 hoisted out of this loop
    remark #25436: Loop completely unrolled by 5

    LOOP BEGIN at test_nocpp.f90 (41, 5)
        remark #25436: Loop completely unrolled by 5

        LOOP BEGIN at test_nocpp.f90 (42, 7)
            remark #25436: Loop completely unrolled by 5
        LOOP END
    LOOP END
LOOP END

LOOP BEGIN at test_nocpp.f90 (40, 3)
<Predicate Optimized v2>
    remark #25436: Loop completely unrolled by 5

    LOOP BEGIN at test_nocpp.f90 (41, 5)
        remark #25436: Loop completely unrolled by 5

        LOOP BEGIN at test_nocpp.f90 (42, 7)
            remark #25436: Loop completely unrolled by 5
        LOOP END
    LOOP END
LOOP END
=================================================================

---- Begin Inlining Report ----
COMPILE FUNC: MAIN__

---- End Inlining Report ------
3 Likes

That’s good to know about loop unswitching. I think in this case it is clearly worth it.

As a matter of personal taste I think I’d prefer separate loop nests at the expense of loop duplication.
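
For the kernel in the original post, that would look something like the sketch below (untested; the branch is hoisted out by hand and each nest keeps its own OpenACC directive):

if (option1 == 1) then
  if (option1_plus == 1) then
    !$acc parallel loop collapse(3) default(present) private(diff1,diff2)
    do k = 1, nz
      do j = 1, ny
        do i = 1, nx
          diff1 = field(i, j, k) + 1.
          diff2 = field(i, j, k) - 1.
          field(i, j, k) = diff1 + diff2
        end do
      end do
    end do
  else
    !$acc parallel loop collapse(3) default(present) private(diff1,diff2)
    do k = 1, nz
      do j = 1, ny
        do i = 1, nx
          diff1 = field(i, j, k) + 1.
          diff2 = field(i, j, k) - 1.
          field(i, j, k) = diff1 * diff2
        end do
      end do
    end do
  end if
else
  !$acc parallel loop collapse(3) default(present) private(diff1,diff2)
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        diff1 = field(i, j, k) + 1.
        diff2 = field(i, j, k) - 1.
        field(i, j, k) = diff1 - diff2
      end do
    end do
  end do
end if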

3 Likes

The problem with speed-up techniques like loop unswitching is that, applied by hand, they make the code less maintainable.

As far as I understand, you have to move the conditional outside of the loops and then duplicate the loops. If you need to change the code in the future, you have to remember to make the change in every copy.

So it’s best if the compiler does the optimization.

Indeed. In my example you’d need to repeat the loop three times.

1 Like

I think more and more often programmers will not be modifying code by hand but telling an LLM to change it in some way, so this may not be a big problem.

2 Likes

There are other ways of cutting back on code duplication, but how applicable they are depends on the context. For instance, array expressions:

! F2023 conditional syntax
op = option1 == 1 ? (option1_plus == 1 ? '+' : '*') : '-'

select case(op)
case('+')
  field = (field + 1) + (field - 1)
case('-')
  field = (field + 1) - (field - 1)
case('*')
  field = (field + 1) * (field - 1)
end select

I imagine that @pcosta’s true code is more complicated and that GPU parallelization is desirable, so this solution is probably not applicable.

Another alternative, which is a bit more involved, would be to use a preprocessor like Fypp (I haven’t tested whether the Fypp syntax below is actually correct):

abstract interface
   subroutine field_op(field, nx, ny, nz)
      integer, intent(in) :: nx, ny, nz
      real, intent(inout) :: field(nx, ny, nz)
   end subroutine
end interface

procedure(field_op), pointer :: op

! ...

op => option1 == 1 ? (option1_plus == 1 ? PADD : PMULT) : PSUB

call op(field,nx,ny,nz)

! ...

contains

#:def pkernel(NAME, OPERATION)
subroutine ${NAME}$(field, nx, ny, nz)
  integer, intent(in) :: nx, ny, nz
  real, intent(inout) :: field(nx, ny, nz)
  integer :: i, j, k
  real :: diff1, diff2, res
  !$acc parallel loop collapse(3) default(present) private(diff1,diff2,res)
  do k = 1, nz
    do j = 1, ny
      do i = 1, nx
        diff1 = field(i, j, k) + 1.
        diff2 = field(i, j, k) - 1.
        res = ${OPERATION}$
        field(i, j, k) = res
      end do
    end do
  end do
end subroutine ${NAME}$
#:enddef

#! Instantiate the template for different operations
@:pkernel(PADD,  diff1 + diff2)
@:pkernel(PMULT, diff1 * diff2)
@:pkernel(PSUB,  diff1 - diff2)

In any case, I agree with the original KISS approach; I just wanted to throw out some ideas.

Nvfortran and gfortran probably don’t support the F2023 conditional syntax yet.

1 Like

I think enclosing parentheses are required for Fortran’s conditional expressions, just like in format strings.
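
For the earlier example that would read something like this (untested, since compiler support is still limited):

op = ( option1 == 1 ? ( option1_plus == 1 ? '+' : '*' ) : '-' )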

As for avoiding code duplication while keeping things simple and clear, contained procedures should do the trick (see the sketch below).

(The optimizer should take care of any perceived overhead.)
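
For example, a rough (untested) sketch with made-up procedure names; I’ve left out the OpenACC directives, which would also need an acc routine directive on the operation:

  ! select the operation once, outside the loops
  if (option1 == 1) then
    if (option1_plus == 1) then
      call apply(op_add)
    else
      call apply(op_mul)
    end if
  else
    call apply(op_sub)
  end if

contains

  ! the loop nest is written only once; field, nx, ny, nz are host-associated
  subroutine apply(op)
    interface
      pure real function op(a, b)
        real, intent(in) :: a, b
      end function
    end interface
    integer :: i, j, k
    real :: diff1, diff2
    do k = 1, nz
      do j = 1, ny
        do i = 1, nx
          diff1 = field(i, j, k) + 1.
          diff2 = field(i, j, k) - 1.
          field(i, j, k) = op(diff1, diff2)
        end do
      end do
    end do
  end subroutine apply

  pure real function op_add(a, b)
    real, intent(in) :: a, b
    op_add = a + b
  end function op_add

  pure real function op_mul(a, b)
    real, intent(in) :: a, b
    op_mul = a * b
  end function op_mul

  pure real function op_sub(a, b)
    real, intent(in) :: a, b
    op_sub = a - b
  end function op_sub

The conditional runs once, the loop nest lives in one place, and a compiler that inlines apply together with the passed operation can effectively recover the unswitched versions.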

2 Likes

Thanks Ivan,

In the end, I kept it simple and asked an LLM to produce the unswitched loop versions (funny that @Beliavsky predicted this). It goes a bit against my wish to avoid repeating code, but since this is very stable code, I’d prefer that over a metaprogramming solution at this point.

I found a strange behavior in ifx -fast for this code: Re: Performance of `ifx -fast` when invariant if conditions in loops. - Intel Community

2 Likes

I did not test these, but here are two more possible ways to rewrite the code. The first uses a single do concurrent block.

   do concurrent( k=1:nz, j=1:ny, i=1:nx )
      diff1 = field(i, j, k) + 1.
      diff2 = field(i, j, k) - 1.
      if (option1 == 1) then
         if (option1_plus == 1) then
            res = diff1 + diff2
         else
            res = diff1 * diff2
         endif
      else
         res = diff1 - diff2
      endif
      field(i, j, k) = res
   enddo

This version uses three separate do concurrent blocks.

   if (option1 == 1) then
      if (option1_plus == 1) then
         do concurrent( k=1:nz, j=1:ny, i=1:nx )
            diff1 = field(i, j, k) + 1.
            diff2 = field(i, j, k) - 1.
            field(i, j, k) = diff1 + diff2
         enddo
      else
         do concurrent( k=1:nz, j=1:ny, i=1:nx )
            diff1 = field(i, j, k) + 1.
            diff2 = field(i, j, k) - 1.
            field(i, j, k) = diff1 * diff2
         enddo
      endif
   else
      do concurrent( k=1:nz, j=1:ny, i=1:nx )
         diff1 = field(i, j, k) + 1.
         diff2 = field(i, j, k) - 1.
         field(i, j, k) = diff1 - diff2
      enddo
   endif

This second version is only a little longer than the first. It is possible that the compiler will be more aggressive with optimizations for do concurrent than for a regular loop. Of course, the body of that last loop simplifies to field(i,j,k) = 2.0, since (field+1.) - (field-1.) is a constant, but I’m ignoring that here. I wonder if a clever compiler would catch it?

All of these could also be written with array syntax.

field = 2.0*field
...
field = field * field - 1.0
...
field = 2.0

However, the danger with array syntax is always that the compiler might allocate a temporary, evaluate the right-hand side into it, and only then assign the result. That would be relatively more expensive than the do concurrent or loop-based codes, which are done with no memory allocations.
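
At least with gfortran, one can get a compile-time hint about such temporaries with the -Warray-temporaries flag, e.g. (the file name is just a placeholder):

gfortran -O2 -Warray-temporaries test_array.f90 -o test_array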

1 Like

Do concurrent has the benefit that it can be offloaded to a GPU with the NVIDIA Fortran compiler. In gfortran it also serves as a vectorization hint, similar to the IVDEP directive:

For DO CONCURRENT constructs this annotation [(IVDEP)] is implicit to all loop control variables.
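
For reference, assuming the do concurrent version is saved as test_dc.f90 (file name made up), offloading with nvfortran and checking gfortran's vectorization report would look something like:

nvfortran -stdpar=gpu -Minfo test_dc.f90 -o test_dc
gfortran -O3 -fopt-info-vec test_dc.f90 -o test_dc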

2 Likes